text-generation-inference/docs/source

Latest commit: 094975c3a8 by Adrien Gallouët
Update the llamacpp backend (#3022)
* Build faster
* Make --model-gguf optional
* Bump llama.cpp
* Enable mmap, offload_kqv & flash_attention by default
* Update doc
* Better error message
* Update doc
* Update installed packages
* Save gguf in models/MODEL_ID/model.gguf
* Fix build with Mach-O
* Quantize without llama-quantize
* Bump llama.cpp and switch to ggml-org
* Remove make-gguf.sh
* Update Cargo.lock
* Support HF_HUB_USER_AGENT_ORIGIN
* Bump llama.cpp
* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
Name                          Last commit                                            Date
backends                      Update the llamacpp backend (#3022)                    2025-03-11 09:19:01 +01:00
basic_tutorials               Fix a tiny typo in monitoring.md tutorial (#3056)      2025-03-04 17:06:26 +01:00
conceptual                    Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
reference                     Update --max-batch-total-tokens description (#3083)    2025-03-07 14:24:26 +01:00
_toctree.yml                  Avoid running neuron integration tests twice (#3054)   2025-02-26 12:15:01 +01:00
architecture.md               Avoid running neuron integration tests twice (#3054)   2025-02-26 12:15:01 +01:00
index.md                      Removing ../ that broke the link (#2789)               2024-12-02 05:48:55 +01:00
installation_amd.md           Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
installation_gaudi.md         MI300 compatibility (#1764)                            2024-05-17 15:30:47 +02:00
installation_inferentia.md    Avoid running neuron integration tests twice (#3054)   2025-02-26 12:15:01 +01:00
installation_intel.md         Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
installation_nvidia.md        Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
installation_tpu.md           Fix typo in TPU docs (#2911)                           2025-01-15 18:32:07 +01:00
installation.md               MI300 compatibility (#1764)                            2024-05-17 15:30:47 +02:00
multi_backend_support.md      Avoid running neuron integration tests twice (#3054)   2025-02-26 12:15:01 +01:00
quicktour.md                  Preparing for release. (#3060)                         2025-03-04 16:47:10 +01:00
supported_models.md           feat: add initial qwen2.5-vl model and test (#2971)    2025-02-19 12:38:20 +01:00
usage_statistics.md           fix: Telemetry (#2957)                                 2025-01-28 10:29:18 +01:00