# Llamacpp Backend

The llamacpp backend facilitates the deployment of large language models
(LLMs) by integrating [llama.cpp][llama.cpp], an advanced inference engine
optimized for both CPU and GPU computation. This backend is a component
of Hugging Face’s **Text Generation Inference (TGI)** suite,
specifically designed to streamline the deployment of LLMs in production
environments.

## Key Capabilities

- Full compatibility with the GGUF format and all of its quantization
  formats (the need for a pre-converted GGUF file may be lifted in future
  updates through on-the-fly generation)
- Optimized inference on CPU and GPU architectures
- Containerized deployment, eliminating dependency complexity
- Seamless interoperability with the Hugging Face ecosystem

## Model Compatibility

This backend loads models in the **GGUF** format, whose quantization
options balance computational efficiency against model accuracy.
GGUF models are available on [Hugging Face][GGUF].

## Build Docker image

For optimal performance, the Docker image is compiled with native CPU
instructions, so it is highly recommended to run the container on the
same host that built the image. An image built on one machine may fail
on a CPU that lacks the instructions available on the build host. Work
is ongoing to improve portability while maintaining high computational
efficiency.

```bash
docker build \
    -t tgi-llamacpp \
    https://github.com/huggingface/text-generation-inference.git \
    -f Dockerfile_llamacpp
```

### Build parameters

| Parameter                            | Description                       |
| ------------------------------------ | --------------------------------- |
| `--build-arg llamacpp_version=bXXXX` | Specific version of llama.cpp     |
| `--build-arg llamacpp_cuda=ON`       | Enables CUDA acceleration         |
| `--build-arg cuda_arch=ARCH`         | Defines target CUDA architecture  |
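
As an illustration, the build parameters can be combined into a single CUDA-enabled build. The sketch below only assembles and prints the command so it can be reviewed before running. The `cuda_build_command` helper, the `CUDA_ARCH` variable, and the `tgi-llamacpp-cuda` tag are names chosen for this example, and `ARCH` remains a placeholder to replace with the value matching your GPU.

```bash
#!/usr/bin/env bash
# Sketch: assemble a CUDA-enabled build command from the build parameters above.
# The helper name, CUDA_ARCH, and the image tag are illustrative, not part of TGI.

cuda_build_command() {
    local cuda_arch="$1"   # target CUDA architecture for your GPU
    local cmd=(docker build -t tgi-llamacpp-cuda
               --build-arg llamacpp_cuda=ON
               --build-arg "cuda_arch=${cuda_arch}"
               https://github.com/huggingface/text-generation-inference.git
               -f Dockerfile_llamacpp)
    # Print each argument shell-quoted so the line can be pasted safely
    printf '%q ' "${cmd[@]}"
    printf '\n'
}

# Review the printed command first; pipe it to `bash` to run the build
cuda_build_command "${CUDA_ARCH:-ARCH}"
```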

## Model preparation

Download a GGUF model and store it in a dedicated directory, for example:

```bash
mkdir -p ~/models
cd ~/models
curl -LOJ "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_0.gguf?download=true"
```
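
A truncated download, or an HTML error page saved under a `.gguf` name, would only surface as a failure when the model is loaded. GGUF files begin with the four ASCII bytes `GGUF`, so a quick header check can catch this early. The `is_gguf` helper below is a name introduced for this sketch and is not part of TGI or llama.cpp.

```bash
#!/usr/bin/env bash
# Sketch: check that a downloaded file starts with the GGUF magic bytes.
# `is_gguf` is a helper defined here for illustration only.

is_gguf() {
    local file="$1"
    # Read the first 4 bytes and compare them with the ASCII magic "GGUF"
    [ -f "$file" ] && [ "$(head -c 4 -- "$file")" = "GGUF" ]
}

model="${1:-$HOME/models/qwen2.5-3b-instruct-q4_0.gguf}"
if is_gguf "$model"; then
    echo "OK: $model looks like a GGUF file"
else
    echo "WARNING: $model is missing or does not start with the GGUF magic" >&2
fi
```

This only checks the header; it does not validate the rest of the file.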

## Run Docker image

### CPU-based inference

```bash
docker run \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```

### GPU-accelerated inference

```bash
docker run \
    --gpus all \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --n-gpu-layers 99 \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```
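
With either container running, the server can be queried over HTTP on the published port. The sketch below assumes the llamacpp backend serves TGI's standard `/generate` route on `localhost:3000`, as mapped by `-p 3000:3000`. The `tgi_generate` function and the `TGI_URL` variable are names introduced for this example.

```bash
#!/usr/bin/env bash
# Sketch: send a prompt to a running TGI container.
# Assumes the standard TGI /generate route is reachable on localhost:3000.
# `tgi_generate` and TGI_URL are illustrative names, not part of TGI.

tgi_generate() {
    local prompt="$1"
    local max_new_tokens="${2:-32}"
    # NOTE: the prompt is interpolated as-is, so it must not contain double
    # quotes or backslashes; use a JSON tool to encode arbitrary input.
    curl -sS "${TGI_URL:-http://localhost:3000}/generate" \
        -X POST \
        -H "Content-Type: application/json" \
        -d "{\"inputs\": \"${prompt}\", \"parameters\": {\"max_new_tokens\": ${max_new_tokens}}}"
}

# Example call (requires a running server):
#   tgi_generate "What is deep learning?" 20
```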

## Advanced parameters

A full listing of configurable parameters is available via `--help`:

```bash
docker run tgi-llamacpp --help
```

The table below summarizes key options:

| Parameter                           | Description                                                            |
|-------------------------------------|------------------------------------------------------------------------|
| `--n-threads`                       | Number of threads to use for generation                                |
| `--n-threads-batch`                 | Number of threads to use for batch processing                          |
| `--n-gpu-layers`                    | Number of layers to store in VRAM                                      |
| `--split-mode`                      | Split the model across multiple GPUs                                   |
| `--defrag-threshold`                | Defragment the KV cache if holes/size > threshold                      |
| `--numa`                            | Enable NUMA optimizations                                              |
| `--use-mmap`                        | Use memory mapping for the model                                       |
| `--use-mlock`                       | Use memory locking to prevent swapping                                 |
| `--offload-kqv`                     | Enable offloading of KQV operations to the GPU                         |
| `--flash-attention`                 | Enable flash attention for faster inference                            |
| `--type-k`                          | Data type used for K cache                                             |
| `--type-v`                          | Data type used for V cache                                             |
| `--validation-workers`              | Number of tokenizer workers used for payload validation and truncation |
| `--max-concurrent-requests`         | Maximum number of concurrent requests                                  |
| `--max-input-tokens`                | Maximum number of input tokens per request                             |
| `--max-total-tokens`                | Maximum number of total tokens (input + output) per request            |
| `--max-batch-total-tokens`          | Maximum number of tokens in a batch                                    |
| `--max-physical-batch-total-tokens` | Maximum number of tokens in a physical batch                           |
| `--max-batch-size`                  | Maximum number of requests per batch                                   |

---
[llama.cpp]: https://github.com/ggerganov/llama.cpp
[GGUF]: https://huggingface.co/models?library=gguf&sort=trending