Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
parent 30cd3cf510
commit 2242d1a67c
```
cd ~/models
curl -LOJ "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_0.gguf?download=true"
```
GGUF files are optional: if one is not already present in the `models` directory, it is generated automatically at startup. This means you do not need to download a GGUF file manually unless you prefer to do so.
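If you want to confirm what is in the directory, whether downloaded by hand or generated at startup, listing the mounted folder is enough; the path matches the examples on this page:

```
# GGUF files visible to the container under /app/models
ls -lh ~/models/*.gguf
```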
## Run Docker image

### CPU-based inference

```
docker run \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/app/models" \
    tgi-llamacpp \
    --model-id "Qwen/Qwen2.5-3B-Instruct"
```
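Once the container reports it is ready, a quick smoke test can be sent to the published port. This sketch assumes TGI's OpenAI-compatible chat endpoint; the exact route and payload shape may differ across TGI versions:

```
# Minimal chat request against the local server
curl http://localhost:3000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
    }'
```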
### GPU-Accelerated inference

```
docker run \
    --gpus all \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/app/models" \
    tgi-llamacpp \
    --n-gpu-layers 99 \
    --model-id "Qwen/Qwen2.5-3B-Instruct"
```
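To check that the layers really were offloaded, watch GPU memory while the model loads, or search the container logs for the loader's offload report. The grep pattern below is only a starting point, since the exact log wording varies across llama.cpp versions:

```
# VRAM usage should rise sharply once the model is loaded
nvidia-smi

# <container-id> is a placeholder; get it from `docker ps`
docker logs <container-id> 2>&1 | grep -i offload
```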
## Advanced parameters

The table below summarizes key options:

| Parameter | Description |
| --- | --- |
| `--split-mode` | Split the model across multiple GPUs |
| `--defrag-threshold` | Defragment the KV cache if holes/size > threshold |
| `--numa` | Enable NUMA optimizations |
| `--disable-mmap` | Disable memory mapping for the model |
| `--use-mlock` | Use memory locking to prevent swapping |
| `--disable-offload-kqv` | Disable offloading of KQV operations to the GPU |
| `--disable-flash-attention` | Disable flash attention |
| `--type-k` | Data type used for K cache |
| `--type-v` | Data type used for V cache |
| `--validation-workers` | Number of tokenizer workers used for payload validation and truncation |
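These options are passed to `tgi-llamacpp` alongside `--model-id`, just like in the examples above. Below is a sketch combining a few of them; the `q8_0` cache types are an assumption based on common llama.cpp KV-cache formats, not values confirmed by this page:

```
# Hypothetical combination of advanced flags (values are illustrative)
docker run \
    -p 3000:3000 \
    -e "HF_TOKEN=$HF_TOKEN" \
    -v "$HOME/models:/app/models" \
    tgi-llamacpp \
    --use-mlock \
    --type-k "q8_0" \
    --type-v "q8_0" \
    --validation-workers 2 \
    --model-id "Qwen/Qwen2.5-3B-Instruct"
```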
|