text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-06-03 21:22:08 +00:00

Author	SHA1	Message	Date
Corentin REGAL	cee44bff7a	Improve message to be useful without spans	2025-03-24 16:01:30 +01:00
Nicolas Patry	54d15462dc	Torch 2.6 (#3134 ) * Torch 2.6 * Upgrade the toolchain. * Don't upgrade just yet. * Upgrade toolchain. * Time upgrade. * TGI-nix main. * Upgrade to transformers 4.50	2025-03-24 11:55:49 +01:00
Baptiste Colle	2e60a8dd65	CI: enable server tests for backends (#3128 ) add test for backends	2025-03-20 16:07:31 +01:00
Erik Kaunismäki	e5503eba78	configurable termination timeout (#3126 ) * make shard and webserver termination timeouts configurable * Updating documentation. * Fmt. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-20 14:25:56 +01:00
Nicolas Patry	e497bc09f6	Minor fixes. (#3125 )	2025-03-18 15:42:35 +01:00
Nicolas Patry	67ce543e04	Intel docker. (#3121 ) * Intel docker. * torchaudio ? * Fixing dockerfile ?	2025-03-18 15:12:11 +01:00
Nicolas Patry	83fe45c15e	Prepare for patch release. (#3124 )	2025-03-18 15:11:55 +01:00
Nicolas Patry	11f2eec10e	Publish nix docker image. (#3122 ) * Publish nix docker image. * Run during PR. * Something else. * Forgot to push. * Build zstd. * Pushing with skopeo * Testing the PR. * Running from nix. * Cleaner tags.	2025-03-18 12:58:21 +01:00
Mohit Sharma	a35fbdb925	Bug Fix: Sliding Window Attention (#3112 ) * (fix) sliding window attention * (fix) flashinfer * (typo) collection link * Add window_size_left param ipex rocm * Update window size rocm flash decoding * fix: bump snapshots and improve exceed window test case * feat: add tests for image types and remove alpha from png * Upgrading `from_env` to get token from file when necessary + fix pali_gemma. * fix: add pillow dependency and bump lock+requirements * fix: bump org name in gemma3 test * Fix qwen2. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-18 10:37:33 +01:00
Baptiste Colle	8c2c348f3c	Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork (#3117 ) feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289	2025-03-18 09:45:52 +01:00
Daniël de Kok	095775e05c	launcher: correctly get the head dimension for VLMs (#3116 ) * launcher: correctly get the head dimension for VLMs For most (?) VLMs, the head dimension is in the `text_config` configuration section. However, since we only queried the top-level `head_dim` (which typically doesn't exist in VLMs), we would never use flashinfer. This change adds a method that gets the head dimension from the top-level `Config` struct or `text_config` when that fails. * fix: bump org name in gemma3 test --------- Co-authored-by: drbh <david.richard.holtz@gmail.com>	2025-03-17 18:19:37 +01:00
Wang, Yi	0b3e3db043	xpu 2.6 update (#3051 ) * xpu 2.6 update Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * install whl Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update get xpu memory api Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * int Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix awq crash if modules_to_not_convert is None Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-03-17 13:48:48 +01:00
Daniël de Kok	f91434e99b	Make the Nix-based Docker container work on non-NixOS (#3109 ) On NixOS, the CUDA driver shim gets mounted on /run/opengl-driver, where Nix packages expect the shim to be. However, on other distributions, some FHS paths are mounted. This is a small change to make the dynamic loader find the shim.	2025-03-13 14:02:45 +01:00
Nicolas Patry	8b91f92978	Fixing the docker build. (#3108 ) * Fixing the docker build. * Apply suggestions from code review	2025-03-13 11:26:44 +01:00
Baptiste Colle	27ed848676	Release of Gaudi Backend for TGI (#3091 ) * feat(gaudi): release ready (docs, docker image and vlm ready) * fix(gaudi): add default argument for the dockerfile * fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices	2025-03-13 10:56:01 +01:00
Nicolas Patry	83ef364177	We need gcc during runtime to enable triton to compile kernels. (#3103 ) * We need gcc during runtime to enable triton to compile kernels. * Fixing the docker build.	2025-03-13 10:45:47 +01:00
Daniël de Kok	83b7b7bb92	Router: add `gemma3-text` model type (#3107 )	2025-03-13 10:41:33 +01:00
Daniël de Kok	c73ae0bd88	Update to `kernels` 0.2.1 (#3084 ) * Update to `kernels` 0.2.1 The package was renamed from `hf-kernels` to `kernels`. The new version also updates the lockfile format. * Download kernels in `install-cuda` target	2025-03-13 10:36:29 +01:00
Nicolas Patry	d4c6faa67b	Try to fix on main CI color. (#3101 )	2025-03-12 10:12:24 +01:00
Nicolas Patry	4ac06ddf56	Preparing release 3.2.0 (#3100 ) * Preparing release 3.2.0 * Forgot the README. * Update doc.	2025-03-12 10:11:33 +01:00
David Corvoysier	f01dc9e743	Update neuron backend (#3098 ) * feat(neuron): use AWS Neuron SDK 2.21.1 * feat(neuron): bump optimum-neuron version * feat(neuron): tag latest image for local tests * test(neuron): simplify sampling test	2025-03-12 09:53:15 +01:00
Nicolas Patry	5c5528e362	Fix tool call4 (#3094 ) * Removing the no_tool content information. * Removing a lot of NO_TOOL shenanigans. * Update the tests.	2025-03-12 09:28:47 +01:00
Mohit Sharma	ed46c2c414	Add gemma3 model (#3099 )	2025-03-12 09:25:51 +01:00
Nicolas Patry	f74c36fe0d	Fix tool call3 (#3086 ) * Fixing the tool calling convention. * Update the doc. * Fixing some corner cases. * Fixing the tool call id. * Fmt. * Snapshot update with the new updated tool_call_id. * More qwen2.	2025-03-12 09:22:53 +01:00
celsowm	ae4451c3da	Update README.md (#3095 ) space between param and value	2025-03-11 11:05:21 +01:00
Nicolas Patry	b447f7e821	Fix qwen vl (#3096 ) * Fixing qwen2.5 VL. * Fixing the CI.	2025-03-11 11:00:41 +01:00
Adrien Gallouët	094975c3a8	Update the llamacpp backend (#3022 ) * Build faster Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Make --model-gguf optional Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Bump llama.cpp Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Enable mmap, offload_kqv & flash_attention by default Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update doc Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Better error message Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update doc Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update installed packages Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Save gguf in models/MODEL_ID/model.gguf Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix build with Mach-O Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Quantize without llama-quantize Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Bump llama.cpp and switch to ggml-org Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove make-gguf.sh Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update Cargo.lock Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Support HF_HUB_USER_AGENT_ORIGIN Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Bump llama.cpp Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-03-11 09:19:01 +01:00
drbh	dc5f05f8e6	Pr 3003 ci branch (#3007 ) * change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API" Moving after tool_calls2 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> add in Buffering.. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> fix: handle usage outside of stream state and add tests Simplifying everything quite a bit. Remove the unused model_dump. Clippy. Clippy ? Ruff. Upgrade the flake for latest transformers. Upgrade after rebase. Remove potential footgun. Fix completion test. * Clippy. * Tweak for multi prompt. * Ruff. * Update the snapshot a bit. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-10 17:56:19 +01:00
Daniël de Kok	124398fa57	hotfix: qwen2 formatting (#3093 ) * hotfix: qwen2 formatting * cargo fmt	2025-03-10 16:19:50 +01:00
Daniël de Kok	c5ecc7a4de	Small test and typing fixes (#3078 ) * test_weights: add modules_to_not_convert * More typing fixes	2025-03-10 15:08:23 +01:00
jiqing-feng	cae0cbe87d	Add modules_to_not_convert in quantized model (#3053 ) * fix modules_to_not_convert Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * fix format Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * fix tp quant skip Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * revert unquantized changes Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * use DefaultWeightsLoader in skip modules Signed-off-by: jiqing-feng <jiqing.feng@intel.com> --------- Signed-off-by: jiqing-feng <jiqing.feng@intel.com>	2025-03-10 15:03:51 +01:00
EachSheep	bbe218a4f7	Add qwen2 multi lora layers support (#3089 ) Add qwen2 multi-LoRA layer support to solve problems like https://github.com/huggingface/text-generation-inference/issues/2881; a similar PR is at https://github.com/huggingface/text-generation-inference/pull/2883 Co-authored-by: hjs <hjs@pku.edu.cn>	2025-03-10 12:42:59 +01:00
Alex Weston	58a65f7914	Add request parameters to OTel span for `/v1/chat/completions` endpoint (#3000 ) Record request parameters in OTel span for /v1/chat/completions endpoint	2025-03-10 12:26:57 +01:00
Daniël de Kok	976eae216f	Nix: the launcher needs a Python env with Torch for GPU detection (#3085 ) This makes `nix run .` in the repository work again. Should fix #3025.	2025-03-10 12:11:10 +01:00
Nicolas Patry	622908deab	Fix tool call2 (#3076 ) * Making `tool_calls` a vector. * Arguments output is a string. * Update all the integration tests. * Add the requirements. * Upgrade other tests. * Clippy. * Update the old test.	2025-03-07 19:45:57 +01:00
Alvaro Bartolome	55a6618434	Update `--max-batch-total-tokens` description (#3083 ) * Update `--max-batch-total-tokens` description * Update docstring in `launcher/src/main.rs` instead	2025-03-07 14:24:26 +01:00
Daniël de Kok	036d802b62	Nix: add `openai` to impure shell for integration tests (#3081 )	2025-03-07 13:04:21 +01:00
Nicolas Patry	8e92942a18	Making `tool_calls` a vector. (#3075 ) * Making `tool_calls` a vector. * Update doc. * Fixing the nix overlay with updated version. * Add openai dependency. * Updating the old tests. * Trying to reduce the logs in the case of errors. * Less spammy logs too.	2025-03-05 22:32:31 +01:00
Nicolas Patry	3208d1cd1d	Revert "Trying to reduce the logs in the case of errors." This reverts commit `cdf70d6a28`.	2025-03-05 20:52:38 +01:00
Nicolas Patry	cdf70d6a28	Trying to reduce the logs in the case of errors.	2025-03-05 20:50:43 +01:00
Nicolas Patry	ab9dafc68f	Making sure Olmo (transformers backend) works. (#3074 )	2025-03-05 17:46:47 +01:00
Nicolas Patry	31766dad77	Force upgrade transformers version for olmo.	2025-03-05 12:17:09 +01:00
Nicolas Patry	ec35976f82	Only add token when it is defined. (#3073 ) * Only add token when it is defined. * Update router/src/server.rs	2025-03-05 11:59:52 +01:00
David Corvoysier	cb42b3ad83	fix(neuron): explicitly install toolchain (#3072 ) * fix(neuron): explicitly install toolchain * ci(neuron): trigger CI when Dockerfile is modified	2025-03-05 11:46:58 +01:00
Nicolas Patry	491ed9e11d	Patch rust release. (#3069 ) * Patch rust release. * Trying to remove the rust-toolchain hardcoded in action. * Upgrade rust toolchain. * Put back the toolchain ? * Fix neuron dockerfile. * Move to the proper version of Rust. * 1.85 since the GH action doesn't respect the override. * Typo. * Fixing the github action. * Fixing docker llamacpp. * Fixing the github action. * Update clippy.	2025-03-04 18:07:33 +01:00
Sadra Barikbin	144d99c147	Fix a tiny typo in `monitoring.md` tutorial (#3056 ) Update monitoring.md	2025-03-04 17:06:26 +01:00
Nicolas Patry	08bbfa16a1	Preparing for release. (#3060 ) * Preparing for release. * Upgrade doc. * Fix docs auto-generated. * Fix update doc along.	2025-03-04 16:47:10 +01:00
Hugo Larcher	d8ff7f2623	feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. (#3061 ) * feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. * fix: Rust version for Neuron * fix: PR comments, use rust-toolchain.toml	2025-03-04 16:43:50 +01:00
Daniël de Kok	e88f6f6ee9	Add property-based testing for `RadixAllocator` (#3068 )	2025-03-04 15:09:46 +01:00
Daniël de Kok	fa4e9511f8	Fix two edge cases in `RadixTrie::find` (#3067 ) - Always return a node, not its parent. - Do not recurse when a node does not represent a full prefix of the input.	2025-03-04 13:23:27 +01:00


1345 Commits
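
Commit 095775e05c describes looking up the head dimension at the top level of the model configuration and falling back to the `text_config` section when that fails. The following is a minimal sketch of that fallback only, written in Python for illustration rather than in the launcher's own language. The function name `get_head_dim`, the dict-based configuration, and the example values are assumptions and do not come from the repository.

```python
from typing import Any, Mapping, Optional


def get_head_dim(config: Mapping[str, Any]) -> Optional[int]:
    """Return `head_dim` from the top level, falling back to `text_config`.

    Hypothetical helper: it mirrors the behaviour described in the commit
    message, not the launcher's actual code.
    """
    head_dim = config.get("head_dim")
    if head_dim is not None:
        return head_dim
    # Per the commit message, most VLMs keep the head dimension in the
    # `text_config` configuration section.
    text_config = config.get("text_config") or {}
    return text_config.get("head_dim")


# Made-up configurations, for illustration only.
text_only = {"head_dim": 128}
vlm = {"model_type": "example_vlm", "text_config": {"head_dim": 256}}
neither = {"model_type": "example"}

print(get_head_dim(text_only))  # 128
print(get_head_dim(vlm))        # 256
print(get_head_dim(neither))    # None
```

Returning `None` when neither location defines the value leaves the decision about what to do next (for example, whether flashinfer can be used) to the caller.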