From d96a77705dd314e5354af80547a190181ed6a385 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Adrien=20Gallou=C3=ABt?=
Date: Fri, 7 Feb 2025 16:48:28 +0000
Subject: [PATCH] Update doc
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Adrien Gallouët
---
 docs/source/backends/llamacpp.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/backends/llamacpp.md b/docs/source/backends/llamacpp.md
index f5aeb52c..dd4ef7b7 100644
--- a/docs/source/backends/llamacpp.md
+++ b/docs/source/backends/llamacpp.md
@@ -101,8 +101,10 @@ The table below summarizes key options:
 | `--split-mode` | Split the model across multiple GPUs |
 | `--defrag-threshold` | Defragment the KV cache if holes/size > threshold |
 | `--numa` | Enable NUMA optimizations |
+| `--use-mmap` | Use memory mapping for the model |
 | `--use-mlock` | Use memory locking to prevent swapping |
 | `--offload-kqv` | Enable offloading of KQV operations to the GPU |
+| `--flash-attention` | Enable flash attention for faster inference |
 | `--type-k` | Data type used for K cache |
 | `--type-v` | Data type used for V cache |
 | `--validation-workers` | Number of tokenizer workers used for payload validation and truncation |