Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Adrien Gallouët 2025-02-07 16:48:28 +00:00
parent b77d05d3af
commit d96a77705d


@@ -101,8 +101,10 @@ The table below summarizes key options:
 | `--split-mode` | Split the model across multiple GPUs |
 | `--defrag-threshold` | Defragment the KV cache if holes/size > threshold |
 | `--numa` | Enable NUMA optimizations |
+| `--use-mmap` | Use memory mapping for the model |
 | `--use-mlock` | Use memory locking to prevent swapping |
 | `--offload-kqv` | Enable offloading of KQV operations to the GPU |
+| `--flash-attention` | Enable flash attention for faster inference |
 | `--type-k` | Data type used for K cache |
 | `--type-v` | Data type used for V cache |
 | `--validation-workers` | Number of tokenizer workers used for payload validation and truncation |