text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-09 23:15:23 +00:00

Author	SHA1	Message	Date
Adrien Gallouët	c52f08351f	Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-05 10:57:50 +00:00
Adrien Gallouët	dbee804129	Simplify batching logic Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-05 10:12:39 +00:00
Adrien Gallouët	d3a772a8dd	Update args Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-05 10:10:38 +00:00
Adrien Gallouët	e007529590	Update Cargo.lock Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 17:54:53 +00:00
Adrien Gallouët	906c265aef	Cleanup Dockerfile Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 17:53:47 +00:00
Adrien Gallouët	df2a4fbb8a	Update Dockerfile_llamacpp Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	d883109df6	Disable graceful shutdown in debug mode Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	207041a977	Bump llamacpp to b4623 Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	38b33e9698	Add --type-v & --type-k Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	bfb8e03e9f	Add specific args for batch Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Morgan Funtowicz	e6a8d33902	backend(llama): add CUDA architectures build argument for Dockerfile	2025-02-04 13:32:59 +00:00
Adrien Gallouët	ea28332bb3	Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	104a968d01	Remove warmup Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	8ed362d03a	Clear request cache after completion Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	c8505fb300	Auto-detect n_threads when not provided Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	27534d8ee4	Fix seq iterations Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	96434a1e7e	Fix batching Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:59 +00:00
Adrien Gallouët	2a51e415ff	Output real logprobs Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	161280f313	Only export the latest logits Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Morgan Funtowicz	960c12bd6e	backend(llama): add CUDA Dockerfile_llamacpp for now	2025-02-04 13:32:58 +00:00
Adrien Gallouët	f38c34aeb7	Fix batch_pos Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	e88a527fcf	Add --offload-kqv Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	ae5bb789c2	Enable flash attention by default Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	3f199134f0	Fix args Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	7a3ed4171e	Add --numa Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	390f0ec061	Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	d6ded897a8	Add a stupid batch mechanism Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	e07835c5b5	Add --defrag-threshold Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	f388747985	Add GPU args Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	8d2dfdf668	Handle ctx args & fix sampling Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	a7b4b04cb5	Add some input validation checks Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	e7facf692f	Handle max_batch_size Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	3eb4823f3e	Use max_batch_total_tokens Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	bd0cc9905c	Get rid of llama_batch_get_one() Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:58 +00:00
Adrien Gallouët	95e221eece	Add llamacpp backend Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-02-04 13:32:56 +00:00
Alvaro Bartolome	88fd56f549	Add `strftime_now` callable function for `minijinja` chat templates (#2983) * Add `chrono` and `strftime_now` function callable * Fix `test_chat_template_valid_with_strftime_now` * Fix `test_chat_template_valid_with_strftime_now`	2025-02-03 15:30:48 +01:00
Hugo Larcher	e3f2018cb5	hotfix: fix trtllm CI build on release (#2981) * hotfix: fix trtllm CI build on release * fix: test release. * fix: test release. * fix: test release. env not recognized https://github.com/actions/runner/issues/1661 * fix: test release. Works.	2025-02-03 11:11:15 +01:00
Nicolas Patry	bb69c5b199	Back on nix main. (#2979)	2025-01-31 14:39:52 +01:00
Nicolas Patry	c9d68945cc	Prepare for release 3.1.0 (#2972) * Prepare for release 3.1.0 * Back on main flake. * Fixing stuff. * Upgrade to moe-kernels 0.8.2 for Hip support. * Deactivating the flaky test.	2025-01-31 14:19:01 +01:00
Mohit Sharma	c07a2cc82b	Update moe-kernel to 0.8.2 for rocm (#2977) update moe-kernel for amd	2025-01-31 11:40:00 +01:00
Hugo Larcher	065aabb13d	doc: Update TRTLLM deployment doc. (#2960) * doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI. * doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI. * fix: PR comments	2025-01-30 18:04:42 +01:00
Nicolas Patry	cb747b33da	Add deepseekv3 (#2968) * Add fp8 support moe models add deepseekv3 format codfe' update dockerfile update doc * Small modifications. * Moe kernels 0.8.1 * Upgrade to 0.8.1 * Fixing moe import. * Black. * Apply suggestions from code review Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com> * Fixing Mixtral + Nits. * Put link to ref. * Fix other call locations. * Scoring func `softmax` is the only one that works. --------- Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>	2025-01-30 16:40:25 +01:00
Nicolas Patry	80e7d98f88	Hotfixing intel-cpu (not sure how it was working before). (#2967) * Hotfixing intel-cpu (not sure how it was working before). * Do not fail on missing moe-kernels (Intel-cpu).	2025-01-29 22:34:41 +01:00
Daniël de Kok	ee0dffcd14	Update to moe-kernels 0.8.0 (#2966)	2025-01-29 18:19:55 +01:00
Mohit Sharma	4ef2e045c9	Add fp8 support moe models (#2928) * Add fp8 support moe models * flatten condition	2025-01-29 13:56:32 +01:00
Hugo Larcher	73b7cf83f6	Add backend name to telemetry (#2962) * feat: Add backend name to telemetry	2025-01-28 16:53:16 +01:00
Nicolas Patry	eb3df0f46f	Fixing the oom maybe with 2.5.1 change. (#2958)	2025-01-28 10:30:28 +01:00
Hugo Larcher	c690da5973	fix: Telemetry (#2957) * fix: add telemetry regular pings and fix unhandled errors avoid not sending telemetry stop events. * fix: simplify error handling * fix: update ping delay and update doc. * fix: clippy * doc: Rephrase properly.	2025-01-28 10:29:18 +01:00
Daniël de Kok	db922eb77e	Update to attention-kernels 0.2.0 (#2950) This version removes our patches/custom API. Makes it simpler to get changes from upstream. One of which is that we can enable FP8 KV cache for paged attention as well.	2025-01-27 11:42:36 +01:00
Funtowicz Morgan	40b00275b2	Attempt to remove AWS S3 flaky cache for sccache (#2953) * backend(trtllm): attempt to remove AWS S3 flaky cache for sccache * backend(trtllm): what if we expose ENV instead of inline? * backend(trtllm): and with the right env var for gha sccache * backend(trtllm): relax the way to detect sccache * backend(trtllm): make sccache definition manually * backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present * backend(trtllm): export env variable in run mb? * backend(trtllm): Cache mode max to cache intermediate layers * backend(trtllm): inject ompi_version build arg in dependent step	2025-01-27 11:21:48 +01:00

1289 Commits