# Llamacpp Backend
The llamacpp backend facilitates the deployment of large language models
(LLMs) by integrating [llama.cpp][llama.cpp], an advanced inference engine
optimized for both CPU and GPU computation. This backend is a component
of Hugging Face's **Text Generation Inference (TGI)** suite,
specifically designed to streamline the deployment of LLMs in production
environments.
## Key Capabilities
- Full compatibility with GGUF format and all quantization formats
(GGUF-related constraints may be mitigated dynamically by on-the-fly
generation in future updates)
- Optimized inference on CPU and GPU architectures
- Containerized deployment, eliminating dependency complexity
- Seamless interoperability with the Hugging Face ecosystem
## Model Compatibility
This backend leverages models formatted in **GGUF**, providing an
optimized balance between computational efficiency and model accuracy.
A wide selection of GGUF models is available on [Hugging Face][GGUF].
## Build Docker image
For optimal performance, the Docker image is compiled with native CPU
instructions, so it is strongly recommended to run the container on the
same host that was used to build it. Efforts are ongoing to improve
portability while maintaining high computational efficiency.
```bash
docker build \
    -t tgi-llamacpp \
    https://github.com/huggingface/text-generation-inference.git \
    -f Dockerfile_llamacpp
```
### Build parameters
| Parameter | Description |
| ------------------------------------ | --------------------------------- |
| `--build-arg llamacpp_version=bXXXX` | Specific version of llama.cpp |
| `--build-arg llamacpp_cuda=ON` | Enables CUDA acceleration |
| `--build-arg cuda_arch=ARCH` | Defines target CUDA architecture |
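For example, a CUDA-enabled image might be built as follows; the release tag and architecture value are placeholders and should be adjusted to the llama.cpp version you want to pin and to your GPU:
```bash
# Illustrative values only: pin a llama.cpp release (bXXXX) and target
# a specific CUDA architecture (e.g. 86 for compute capability 8.6).
docker build \
    --build-arg llamacpp_version=b4600 \
    --build-arg llamacpp_cuda=ON \
    --build-arg cuda_arch=86 \
    -t tgi-llamacpp \
    https://github.com/huggingface/text-generation-inference.git \
    -f Dockerfile_llamacpp
```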
## Model preparation
Retrieve a GGUF model and store it in a specific directory, for example:
```bash
mkdir -p ~/models
cd ~/models
curl -L -o qwen2.5-3b-instruct-q4_0.gguf "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_0.gguf?download=true"
```
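Alternatively, if the `huggingface_hub` package is installed, the same file can be downloaded with its CLI (a sketch; the CLI handles redirects and, for gated repositories, authentication):
```bash
# Requires: pip install huggingface_hub
huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF \
    qwen2.5-3b-instruct-q4_0.gguf \
    --local-dir ~/models
```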
## Run Docker image
### CPU-based inference
```bash
docker run \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```
### GPU-Accelerated inference
```bash
docker run \
    --gpus all \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --n-gpu-layers 99 \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```
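Once the container is running, the server listens on port 3000 and exposes the standard TGI HTTP API, including an OpenAI-compatible chat endpoint. A minimal smoke test might look like this (the prompt and parameters are arbitrary):
```bash
curl http://localhost:3000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is GGUF?"}],
        "max_tokens": 128
    }'
```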
## Advanced parameters
A full listing of configurable parameters is available via `--help`:
```bash
docker run tgi-llamacpp --help
```
The table below summarizes key options:

| Parameter | Description |
|-------------------------------------|------------------------------------------------------------------------|
| `--n-threads` | Number of threads to use for generation |
| `--n-threads-batch` | Number of threads to use for batch processing |
| `--n-gpu-layers` | Number of layers to store in VRAM |
| `--split-mode` | Split the model across multiple GPUs |
| `--defrag-threshold` | Defragment the KV cache if holes/size > threshold |
| `--numa` | Enable NUMA optimizations |
| `--use-mlock` | Use memory locking to prevent swapping |
| `--offload-kqv` | Enable offloading of KQV operations to the GPU |
| `--type-k` | Data type used for K cache |
| `--type-v` | Data type used for V cache |
| `--validation-workers` | Number of tokenizer workers used for payload validation and truncation |
| `--max-concurrent-requests` | Maximum number of concurrent requests |
| `--max-input-tokens` | Maximum number of input tokens per request |
| `--max-total-tokens` | Maximum number of total tokens (input + output) per request |
| `--max-batch-total-tokens` | Maximum number of tokens in a batch |
| `--max-physical-batch-total-tokens` | Maximum number of tokens in a physical batch |
| `--max-batch-size` | Maximum number of requests per batch |
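As an illustration, several of these options can be combined in a single launch; the values below are arbitrary and should be tuned to your hardware and workload:
```bash
docker run \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/models" \
    tgi-llamacpp \
    --n-threads 8 \
    --max-concurrent-requests 4 \
    --max-input-tokens 2048 \
    --max-total-tokens 4096 \
    --model-id "Qwen/Qwen2.5-3B-Instruct" \
    --model-gguf "/models/qwen2.5-3b-instruct-q4_0.gguf"
```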
---
[llama.cpp]: https://github.com/ggerganov/llama.cpp
[GGUF]: https://huggingface.co/models?library=gguf&sort=trending