# Llamacpp Backend

The llamacpp backend facilitates the deployment of large language models
(LLMs) by integrating [llama.cpp][llama.cpp], an advanced inference engine
optimized for both CPU and GPU computation. This backend is a component
of Hugging Face’s **Text Generation Inference (TGI)** suite,
specifically designed to streamline the deployment of LLMs in production
environments.
## Key Capabilities

- Full compatibility with the GGUF format and all quantization formats
  (GGUF-related constraints may be mitigated dynamically by on-the-fly
  generation in future updates)
- Optimized inference on CPU and GPU architectures
- Containerized deployment, eliminating dependency complexity
- Seamless interoperability with the Hugging Face ecosystem
## Model Compatibility

This backend leverages models formatted in **GGUF**, providing an
optimized balance between computational efficiency and model accuracy.
You will find the best models on [Hugging Face][GGUF].
## Build Docker image

For optimal performance, the Docker image is compiled with native CPU
instructions, so it is highly recommended to run the container on the
same host that was used to build it. Efforts are ongoing to enhance
portability while maintaining high computational efficiency.

```bash
docker build \
    -t tgi-llamacpp \
    https://github.com/huggingface/text-generation-inference.git \
    -f Dockerfile_llamacpp
```
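To target GPUs, the same build can be parameterized with the CUDA-related
build arguments described below. The following sketch is only an example:
the llama.cpp version tag and the compute architecture are placeholders
that you should replace with the release you want to track and the value
matching your GPU.

```bash
docker build \
    -t tgi-llamacpp-cuda \
    --build-arg llamacpp_version=b4623 \
    --build-arg llamacpp_cuda=ON \
    --build-arg cuda_arch=86-real \
    https://github.com/huggingface/text-generation-inference.git \
    -f Dockerfile_llamacpp
```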
### Build parameters

| Parameter                            | Description                      |
| ------------------------------------ | -------------------------------- |
| `--build-arg llamacpp_version=bXXXX` | Specific version of llama.cpp    |
| `--build-arg llamacpp_cuda=ON`       | Enables CUDA acceleration        |
| `--build-arg cuda_arch=ARCH`         | Defines target CUDA architecture |

## Model preparation

Retrieve a GGUF model and store it in a specific directory, for example:

```bash
mkdir -p ~/models
cd ~/models
curl -LOJ "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_0.gguf?download=true"
```
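Alternatively, if the `huggingface_hub` CLI happens to be installed on the
host, the same file can be pulled with it instead of `curl`; this is purely
optional and assumes the `huggingface-cli` command is available:

```bash
# Download a single GGUF file from the Hub into ~/models
huggingface-cli download \
    Qwen/Qwen2.5-3B-Instruct-GGUF \
    qwen2.5-3b-instruct-q4_0.gguf \
    --local-dir ~/models
```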
## Run Docker image

### CPU-based inference

```bash
docker run \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```
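Once the container is up, you can sanity-check it from the host with a quick
request. The sketch below assumes the server is reachable on `localhost:3000`
and uses TGI's OpenAI-compatible chat route; adjust the prompt and token
budget as you see fit:

```bash
curl http://localhost:3000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "tgi",
          "messages": [{"role": "user", "content": "What is GGUF?"}],
          "max_tokens": 128
        }'
```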
### GPU-accelerated inference

```bash
docker run \
    --gpus all \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --n-gpu-layers 99 \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```
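On multi-GPU hosts you may want to expose only one device to the container
rather than all of them. This uses Docker's own device-selection syntax for
`--gpus` and is not a TGI-specific option; the device index here is just an
example:

```bash
docker run \
    --gpus '"device=0"' \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --n-gpu-layers 99 \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```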
## Advanced parameters

A full listing of configurable parameters is available via `--help`:

```bash
docker run tgi-llamacpp --help
```
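These options are appended to the `docker run` command line after the image
name, just like `--model-id` and `--model-gguf` above. As an illustration,
the run below caps request sizes and pins the generation thread count; all
of the values shown are arbitrary and should be tuned to your hardware and
workload:

```bash
docker run \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --n-threads 16 \
    --max-input-tokens 4096 \
    --max-total-tokens 8192 \
    --max-batch-size 8 \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```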
The table below summarizes key options:

| Parameter                            | Description                                                             |
| ------------------------------------ | ----------------------------------------------------------------------- |
| `--n-threads`                        | Number of threads to use for generation                                 |
| `--n-threads-batch`                  | Number of threads to use for batch processing                           |
| `--n-gpu-layers`                     | Number of layers to store in VRAM                                       |
| `--split-mode`                       | Split the model across multiple GPUs                                    |
| `--defrag-threshold`                 | Defragment the KV cache if holes/size > threshold                       |
| `--numa`                             | Enable NUMA optimizations                                               |
| `--use-mmap`                         | Use memory mapping for the model                                        |
| `--use-mlock`                        | Use memory locking to prevent swapping                                  |
| `--offload-kqv`                      | Enable offloading of KQV operations to the GPU                          |
| `--flash-attention`                  | Enable flash attention for faster inference                             |
| `--type-k`                           | Data type used for K cache                                              |
| `--type-v`                           | Data type used for V cache                                              |
| `--validation-workers`               | Number of tokenizer workers used for payload validation and truncation  |
| `--max-concurrent-requests`          | Maximum number of concurrent requests                                   |
| `--max-input-tokens`                 | Maximum number of input tokens per request                              |
| `--max-total-tokens`                 | Maximum number of total tokens (input + output) per request             |
| `--max-batch-total-tokens`           | Maximum number of tokens in a batch                                     |
| `--max-physical-batch-total-tokens`  | Maximum number of tokens in a physical batch                            |
| `--max-batch-size`                   | Maximum number of requests per batch                                    |

---

[llama.cpp]: https://github.com/ggerganov/llama.cpp
[GGUF]: https://huggingface.co/models?library=gguf&sort=trending