Commit Graph

1277 Commits

Author SHA1 Message Date
Adrien Gallouët
104a968d01
Remove warmup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
8ed362d03a
Clear request cache after completion
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
c8505fb300
Auto-detect n_threads when not provided
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
27534d8ee4
Fix seq iterations
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
96434a1e7e
Fix batching
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:59 +00:00
Adrien Gallouët
2a51e415ff
Output real logprobs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
161280f313
Only export the latest logits
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Morgan Funtowicz
960c12bd6e
backend(llama): add CUDA Dockerfile_llamacpp for now 2025-02-04 13:32:58 +00:00
Adrien Gallouët
f38c34aeb7
Fix batch_pos
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
e88a527fcf
Add --offload-kqv
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
ae5bb789c2
Enable flash attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
3f199134f0
Fix args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
7a3ed4171e
Add --numa
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
390f0ec061
Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
d6ded897a8
Add a stupid batch mechanism
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
e07835c5b5
Add --defrag-threshold
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
f388747985
Add GPU args
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
8d2dfdf668
Handle ctx args & fix sampling
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
a7b4b04cb5
Add some input validation checks
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
e7facf692f
Handle max_batch_size
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
3eb4823f3e
Use max_batch_total_tokens
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
bd0cc9905c
Get rid of llama_batch_get_one()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:58 +00:00
Adrien Gallouët
95e221eece
Add llamacpp backend
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-02-04 13:32:56 +00:00
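The run of commits above builds out the llama.cpp backend: replacing llama_batch_get_one() with a manually packed batch, fixing batch positions, and exporting logits only for the last token of each sequence. A minimal Python sketch of that batch-packing idea follows; it is an illustration under assumed names, not the backend's actual code.

```python
# Illustrative sketch of the batching idea behind the commits above
# ("Add a stupid batch mechanism", "Only export the latest logits").
# Not the actual backend code; every name here is hypothetical.
from dataclasses import dataclass


@dataclass
class Request:
    seq_id: int
    tokens: list[int]  # tokens still to be fed to the model
    n_past: int = 0    # positions already consumed for this sequence


def build_batch(requests: list[Request]) -> dict:
    """Pack several sequences into one flat batch, llama.cpp-style:
    parallel arrays of tokens, positions and sequence ids, plus a
    per-token flag so logits are exported only for each sequence's
    last token."""
    batch = {"token": [], "pos": [], "seq_id": [], "logits": []}
    for req in requests:
        for i, tok in enumerate(req.tokens):
            batch["token"].append(tok)
            batch["pos"].append(req.n_past + i)  # the "Fix batch_pos" part
            batch["seq_id"].append(req.seq_id)
            # Only the final token of each sequence needs logits.
            batch["logits"].append(i == len(req.tokens) - 1)
    return batch
```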
Alvaro Bartolome
88fd56f549
Add strftime_now callable function for minijinja chat templates (#2983)
* Add `chrono` and the `strftime_now` callable function

* Fix `test_chat_template_valid_with_strftime_now`

* Fix `test_chat_template_valid_with_strftime_now`
2025-02-03 15:30:48 +01:00
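For context, this PR registers a chrono-backed `strftime_now` callable with the Rust minijinja environment so chat templates can render the current date and time. A rough Python/Jinja2 equivalent of the idea, offered as an assumption for illustration rather than the actual Rust code:

```python
# Python/Jinja2 sketch of exposing a strftime_now callable to chat
# templates; the real change registers a chrono-backed function with
# the Rust minijinja crate.
from datetime import datetime

from jinja2 import Environment

env = Environment()
env.globals["strftime_now"] = lambda fmt: datetime.now().strftime(fmt)

template = env.from_string("Today is {{ strftime_now('%Y-%m-%d') }}.")
print(template.render())  # e.g. "Today is 2025-02-03."
```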
Hugo Larcher
e3f2018cb5
hotfix: fix trtllm CI build on release (#2981)
* hotfix: fix trtllm CI build on release

* fix: test release.

* fix: test release.

* fix: test release. env not recognized https://github.com/actions/runner/issues/1661

* fix: test release. Works.
2025-02-03 11:11:15 +01:00
Nicolas Patry
bb69c5b199
Back on nix main. (#2979) 2025-01-31 14:39:52 +01:00
Nicolas Patry
c9d68945cc
Prepare for release 3.1.0 (#2972)
* Prepare for release 3.1.0

* Back on main flake.

* Fixing stuff.

* Upgrade to moe-kernels 0.8.2 for Hip support.

* Deactivating the flaky test.
2025-01-31 14:19:01 +01:00
Mohit Sharma
c07a2cc82b
Update moe-kernel to 0.8.2 for rocm (#2977)
update moe-kernel for amd
2025-01-31 11:40:00 +01:00
Hugo Larcher
065aabb13d
doc: Update TRTLLM deployment doc. (#2960)
* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.

* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.

* fix: PR comments
2025-01-30 18:04:42 +01:00
Nicolas Patry
cb747b33da
Add deepseekv3 (#2968)
* Add fp8 support moe models

add deepseekv3

format code

update dockerfile

update doc

* Small modifications.

* Moe kernels 0.8.1

* Upgrade to 0.8.1

* Fixing moe import.

* Black.

* Apply suggestions from code review

Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>

* Fixing Mixtral + Nits.

* Put link to ref.

* Fix other call locations.

* Scoring func `softmax` is the only one that works.

---------

Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
2025-01-30 16:40:25 +01:00
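The note that the `softmax` scoring function is the only one that works refers to how router logits are turned into per-token expert weights in MoE models such as DeepSeek-V3. A generic sketch of softmax top-k routing, for illustration only (this is not the moe-kernels code path):

```python
# Generic softmax top-k MoE routing sketch; illustrative, not the
# actual moe-kernels implementation used by the commit above.
import torch


def route(router_logits: torch.Tensor, top_k: int):
    """router_logits: (n_tokens, n_experts) -> expert ids and weights."""
    scores = torch.softmax(router_logits, dim=-1)          # softmax scoring
    weights, expert_ids = torch.topk(scores, top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize
    return expert_ids, weights
```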
Nicolas Patry
80e7d98f88
Hotfixing intel-cpu (not sure how it was working before). (#2967)
* Hotfixing intel-cpu (not sure how it was working before).

* Do not fail on missing moe-kernels (Intel-cpu).
2025-01-29 22:34:41 +01:00
Daniël de Kok
ee0dffcd14
Update to moe-kernels 0.8.0 (#2966) 2025-01-29 18:19:55 +01:00
Mohit Sharma
4ef2e045c9
Add fp8 support moe models (#2928)
* Add fp8 support moe models

* flatten condition
2025-01-29 13:56:32 +01:00
Hugo Larcher
73b7cf83f6
Add backend name to telemetry (#2962)
* feat: Add backend name to telemetry
2025-01-28 16:53:16 +01:00
Nicolas Patry
eb3df0f46f
Fixing the oom maybe with 2.5.1 change. (#2958) 2025-01-28 10:30:28 +01:00
Hugo Larcher
c690da5973
fix: Telemetry (#2957)
* fix: add regular telemetry pings and fix unhandled errors to avoid not sending telemetry stop events.

* fix: simplify error handling

* fix: update ping delay and update doc.

* fix: clippy

* doc: Rephrase properly.
2025-01-28 10:29:18 +01:00
Daniël de Kok
db922eb77e
Update to attention-kernels 0.2.0 (#2950)
This version removes our patches/custom API, which makes it simpler to
get changes from upstream. One of those changes is that we can enable
the FP8 KV cache for paged attention as well.
2025-01-27 11:42:36 +01:00
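The FP8 KV cache mentioned above stores key/value blocks in an 8-bit float format to roughly halve cache memory. A tiny sketch of the storage idea, assuming a PyTorch build with float8 dtypes; the real kernels fuse the conversion into paged attention:

```python
# Sketch of the FP8 KV-cache idea: quantize K/V blocks to float8 on
# write, upcast on read. Illustrative only; shapes are made up and the
# real attention-kernels code fuses this into the paged-attention op.
import torch

n_blocks, block_size, n_heads, head_dim = 64, 16, 8, 128
k_cache = torch.empty(
    n_blocks, block_size, n_heads, head_dim, dtype=torch.float8_e4m3fn
)


def write_block(block_idx: int, k: torch.Tensor) -> None:
    k_cache[block_idx] = k.to(torch.float8_e4m3fn)  # quantize on write


def read_block(block_idx: int) -> torch.Tensor:
    return k_cache[block_idx].to(torch.float16)     # dequantize on read
```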
Funtowicz Morgan
40b00275b2
Attempt to remove AWS S3 flaky cache for sccache (#2953)
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache

* backend(trtllm): what if we expose ENV instead of inline?

* backend(trtllm): and with the right env var for gha sccache

* backend(trtllm): relax the way to detect sccache

* backend(trtllm): make sccache definition manually

* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present

* backend(trtllm): export env variable in run mb?

* backend(trtllm): Cache mode max to cache intermediate layers

* backend(trtllm): inject ompi_version build arg in dependent step
2025-01-27 11:21:48 +01:00
Nicolas Patry
6cb41a80a1
Revert "Remove AWS credentials?"
This reverts commit d2ff68e98d.
2025-01-24 14:34:17 +01:00
Nicolas Patry
d2ff68e98d
Remove AWS credentials? 2025-01-24 12:18:28 +01:00
Nicolas Patry
d9dda11726
Trying to put back the archlist (to fix the oom). (#2947) 2025-01-24 09:32:17 +01:00
Nicolas Patry
d937eb64da
Fixing cargo lock. 2025-01-23 18:54:34 +01:00
Cyril Vallez
18c4607d46
Transformers backend TP fix (#2945)
* init dispatch

* cohere fix
2025-01-23 18:09:57 +01:00
Nicolas Patry
29a0893b67
Tmp tp transformers (#2942)
* Upgrade the version number.

* Remove modifications in Lock.

* Tmp branch to test transformers backend with 2.5.1 and TP>1

* Fixing the transformers backend.

inference_mode forces the use of `aten.matmul` instead of `aten.mm`; the
former doesn't have sharding support, which crashes the transformers TP
support.

`lm_head.forward` also crashes because it skips the hook that
casts/decasts the DTensor.

Torch 2.5.1 is required for sharding support.

* Put back the attention impl.

* Revert the flashinfer (this will fail).

* Building AOT.

* Using 2.5 kernels.

* Remove the archlist, it's defined in the docker anyway.
2025-01-23 18:07:30 +01:00
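The commit body above pins down the crash: under `torch.inference_mode`, matmuls lower to `aten.matmul`, which has no DTensor sharding support, and `lm_head.forward` skips the hook that moves data in and out of the DTensor. One plausible reading of the resulting workaround, as a minimal sketch rather than the actual backend code:

```python
# Minimal sketch of the dispatch issue described above. Under
# torch.inference_mode() matmuls lower to aten.matmul, which cannot
# handle TP-sharded DTensors; torch.no_grad() keeps the aten.mm path.
# Hypothetical wrapper, not the actual backend code; per the commit,
# torch 2.5.1 is required for the sharding support itself.
import torch


def tp_forward(model, input_ids):
    # with torch.inference_mode():  # crashes with TP-sharded DTensors
    with torch.no_grad():           # avoids the aten.matmul dispatch
        return model(input_ids)
```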
Funtowicz Morgan
0a89902663
[TRTLLM] Expose finish reason (#2841)
* feat(trtllm): expose finish reason to Rust

* misc(llamacpp): fix typo

* misc(backend): update deps
2025-01-23 16:48:26 +01:00
Nikolai Kolodziej
4e172028aa
Add NVIDIA A40 to known cards (#2941)
feat: add NVIDIA A40 to known cards
2025-01-23 14:19:21 +01:00
Alvaro Bartolome
6ab02931cf
Set alias for max_completion_tokens in ChatRequest (#2932) 2025-01-23 14:18:47 +01:00
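This alias lets clients send OpenAI's `max_completion_tokens` while the server keeps its existing field name. The real `ChatRequest` is a Rust/serde struct; a pydantic sketch of the same aliasing idea, offered only as an illustration:

```python
# Pydantic sketch of the aliasing idea; the actual ChatRequest is a
# Rust struct, so this Python model is an assumed equivalent.
from pydantic import BaseModel, Field


class ChatRequest(BaseModel):
    model_config = {"populate_by_name": True}

    # Accept both the existing name and OpenAI's newer alias.
    max_tokens: int | None = Field(default=None, alias="max_completion_tokens")


# Both spellings parse to the same field:
ChatRequest.model_validate({"max_completion_tokens": 128})
ChatRequest.model_validate({"max_tokens": 128})
```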
Funtowicz Morgan
cc212154e0
Bump TensorRT-LLM backend dependency to v0.16.0 (#2931)
* backend(trtllm): update to 0.16.0

* backend(trtllm): do not use shallow clone

* backend(trtllm): use tag instead

* backend(trtllm): move to nvidia remote instead of hf

* backend(trtllm): reenable shallow clone

* backend(trtllm): attempt to use ADD instead of RUN for openmpi

* backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile

* backend(trtllm): and correctly untar it
2025-01-23 13:54:40 +01:00
Daniël de Kok
1dd346666a
Clarify FP8-Marlin use on capability 8.9 (#2940)
The log message stated that the GPU does not support FP8 on capability
8.9. However, we use FP8-Marlin on that capability because it is faster.
2025-01-22 18:18:11 +01:00
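In other words, compute capability 8.9 (Ada) GPUs do get FP8 weights, served by FP8-Marlin kernels, and the old log line claiming no FP8 support was misleading. A sketch of the capability check, with a hypothetical helper name:

```python
# Sketch of the clarified behavior: on compute capability 8.9 we still
# use FP8-Marlin because it is faster there. Hypothetical helper name,
# for illustration only.
import torch


def prefers_fp8_marlin() -> bool:
    return torch.cuda.get_device_capability() == (8, 9)
```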
Wang, Yi
1d3c9beba8
fix moe in quantization path (#2935)
update ipex xpu to support moe for mixtral

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-22 14:36:15 +01:00