David Corvoysier
5eec3a8bb6
Avoid running neuron integration tests twice (#3054)
* test(neuron): refactor to prepare batch export
* test(neuron): add helper to batch export models
Also rename fixture file for clarity.
* ci(neuron): do not run tests twice
* ci(neuron): rename precompilation job
* test(neuron): remove redundant subdirectory
* test(neuron): remove erroneous line
* doc(neuron): update links to installation page
* feat(neuron): cleanup Dockerfile
CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse is not required anymore.
* test(neuron): try to reduce download errors
2025-02-26 12:15:01 +01:00
Adrien Gallouët
cfd4fbb479
[Backend] Add Llamacpp backend (#2975)
* Add llamacpp backend
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Get rid of llama_batch_get_one()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use max_batch_total_tokens
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle max_batch_size
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add some input validation checks
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle ctx args & fix sampling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add GPU args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --defrag-threshold
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add a stupid batch mechanism
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --numa
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable flash attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --offload-kqv
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batch_pos
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA Dockerfile_llamacpp for now
* Only export the latest logits
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Output real logprobs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batching
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix seq iterations
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Auto-detect n_threads when not provided
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Clear request cache after completion
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove warmup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA architectures build argument for Dockerfile
* Add specific args for batch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --type-v & --type-k
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llamacpp to b4623
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Disable graceful shutdown in debug mode
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Dockerfile_llamacpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Simplify batching logic
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Rename bindings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove n_ctx
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make max_batch_total_tokens optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Ensure all samplers are freed on error
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Initialize penalty_last_n with llamacpp default value
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve default settings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks clippy
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks cargo fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Do not use HOSTNAME env
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp & cuda
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix requirements.txt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable KQV offload by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove Ngrok tunneling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove .cargo/config.toml
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add missing cuda prefix
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle custom llama.cpp dir
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add README.md
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add HF transfer
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix bool args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2025-02-14 13:40:57 +01:00
drbh
a72f339c79
fix: lint backend and doc files (#2850)
2024-12-16 16:12:34 -05:00
Funtowicz Morgan
ea7f4082c4
TensorRT-LLM backend bump to latest version + misc fixes (#2791)
* misc(cmake) update dependencies
* feat(hardware) enable new hardware.hpp and unittests
* test(ctest) enable address sanitizer
* feat(backend): initial rewrite of the backend for simplicity
* feat(backend): remove all the logs from hardware.hpp
* feat(backend): added some logging
* feat(backend): enable compiler warning when RVO is not applied
* feat(backend): missing return statement
* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder
* feat(backend): delete previous backend impl
* feat(backend): more impl
* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility
* feat(backend): allow overriding which Python to use
* feat(backend): fix backend_exception_t -> backend_error_t naming
* feat(backend): impl missing generation_step_t as return value of pull_tokens
* feat(backend): make backend_workspace_t::engines_folder constexpr
* feat(backend): fix main.rs retrieving the tokenizer
* feat(backend): add guard to multiple header definitions
* test(backend): add more unittest
* feat(backend): remove constexpr from par
* feat(backend): remove constexpig
* test(backend): more test coverage
* chore(trtllm): update dependency towards 0.15.0
* effectively cancel the request on the executor
* feat(backend) fix moving backend when pulling
* feat(backend): make sure we can easily cancel request on the executor
* feat(backend): fix missing "0" field access
* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut
* chore: Add doc and CI for TRTLLM (#2799)
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* doc: Formatting
* misc(backend): indent
---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
2024-12-13 15:50:59 +01:00