Adrien Gallouët
cfd4fbb479
[Backend] Add Llamacpp backend (#2975)
* Add llamacpp backend
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Get rid of llama_batch_get_one()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use max_batch_total_tokens
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle max_batch_size
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add some input validation checks
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle ctx args & fix sampling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add GPU args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --defrag-threshold
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add a stupid batch mechanism
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --numa
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable flash attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --offload-kqv
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batch_pos
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA Dockerfile_llamacpp for now
* Only export the latest logits
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Output real logprobs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix batching
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix seq iterations
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Auto-detect n_threads when not provided
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Clear request cache after completion
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove warmup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* backend(llama): add CUDA architectures build argument for Dockerfile
* Add specific args for batch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --type-v & --type-k
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llamacpp to b4623
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Disable graceful shutdown in debug mode
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Dockerfile_llamacpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Simplify batching logic
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Rename bindings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove n_ctx
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make max_batch_total_tokens optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Ensure all samplers are freed on error
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Initialize penalty_last_n with llamacpp default value
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve default settings
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks clippy
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Thanks cargo fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update docs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Do not use HOSTNAME env
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp & cuda
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix requirements.txt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix fmt
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable KQV offload by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove Ngrok tunneling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove .cargo/config.toml
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix Dockerfile
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add missing cuda prefix
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Handle custom llama.cpp dir
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add README.md
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add HF transfer
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix bool args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2025-02-14 13:40:57 +01:00