mirror of https://github.com/huggingface/text-generation-inference.git
synced 2025-09-11 12:24:53 +00:00
Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
parent b77d05d3af
commit d96a77705d
@@ -101,8 +101,10 @@ The table below summarizes key options:
 | `--split-mode` | Split the model across multiple GPUs |
 | `--defrag-threshold` | Defragment the KV cache if holes/size > threshold |
 | `--numa` | Enable NUMA optimizations |
+| `--use-mmap` | Use memory mapping for the model |
 | `--use-mlock` | Use memory locking to prevent swapping |
 | `--offload-kqv` | Enable offloading of KQV operations to the GPU |
+| `--flash-attention` | Enable flash attention for faster inference |
 | `--type-k` | Data type used for K cache |
 | `--type-v` | Data type used for V cache |
 | `--validation-workers` | Number of tokenizer workers used for payload validation and truncation |