text-generation-inference/docs/source

Latest commit: 094975c3a8 by Adrien Gallouët
Update the llamacpp backend (#3022)
* Build faster
* Make --model-gguf optional
* Bump llama.cpp
* Enable mmap, offload_kqv & flash_attention by default
* Update doc
* Better error message
* Update doc
* Update installed packages
* Save gguf in models/MODEL_ID/model.gguf
* Fix build with Mach-O
* Quantize without llama-quantize
* Bump llama.cpp and switch to ggml-org
* Remove make-gguf.sh
* Update Cargo.lock
* Support HF_HUB_USER_AGENT_ORIGIN
* Bump llama.cpp
* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
Name                          Last commit                                            Date
backends                      Update the llamacpp backend (#3022)                    2025-03-11 09:19:01 +01:00
basic_tutorials               Fix a tiny typo in monitoring.md tutorial (#3056)      2025-03-04 17:06:26 +01:00
conceptual                    Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
reference                     Update --max-batch-total-tokens description (#3083)    2025-03-07 14:24:26 +01:00
_toctree.yml                  Avoid running neuron integration tests twice (#3054)   2025-02-26 12:15:01 +01:00
architecture.md               Avoid running neuron integration tests twice (#3054)   2025-02-26 12:15:01 +01:00
index.md                      Removing ../ that broke the link (#2789)               2024-12-02 05:48:55 +01:00
installation_amd.md           Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
installation_gaudi.md         MI300 compatibility (#1764)                            2024-05-17 15:30:47 +02:00
installation_inferentia.md    Avoid running neuron integration tests twice (#3054)   2025-02-26 12:15:01 +01:00
installation_intel.md         Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
installation_nvidia.md        Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
installation_tpu.md           Fix typo in TPU docs (#2911)                           2025-01-15 18:32:07 +01:00
installation.md               MI300 compatibility (#1764)                            2024-05-17 15:30:47 +02:00
multi_backend_support.md      Avoid running neuron integration tests twice (#3054)   2025-02-26 12:15:01 +01:00
quicktour.md                  Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
supported_models.md           feat: add initial qwen2.5-vl model and test (#2971)    2025-02-19 12:38:20 +01:00
usage_statistics.md           fix: Telemetry (#2957)                                 2025-01-28 10:29:18 +01:00