Wang, Yi A
7900be5ac3
warmup decode
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 20:19:13 -07:00
Wang, Yi A
ba7a131e04
add warmup_decode
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 17:39:26 -07:00
Wang, Yi A
fd70ad703e
warmup prefill
...
remove models where pageattn is not used; set block table to None since it's not used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 03:10:58 -07:00
Wang, Yi A
69773767c5
enable fp8
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-25 05:06:55 -07:00
Wang, Yi A
8d221b7b79
fix gptq issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 20:58:50 -07:00
Wang, Yi A
9914ffe1f1
remove unused quantization code and enable awq/gptq int4
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 19:37:20 -07:00
Wang, Yi A
fdf0733f56
fix incorrect output in qwen2 idefics if hpu graph is used
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-21 01:01:37 -07:00
Wang, Yi A
36b6612f97
adjust warmup and enable vlm
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-20 23:12:52 -07:00
Wang, Yi A
f95aa42660
multi-modality initial PR
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 23:30:12 -07:00
Wang, Yi A
d5b78ba16f
Merge branch 'main' into gaudi_backend_pa
2025-03-19 18:15:08 -07:00
Wang, Yi A
2074d0516b
enable dbrx, remove some unused code
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 03:16:41 -07:00
Wang, Yi A
2cde30de24
gpt_bigcode can also use pageattn
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:59:31 -07:00
Wang, Yi A
073f793976
fix phimoe issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:11:01 -07:00
Nicolas Patry
e497bc09f6
Minor fixes. ( #3125 )
2025-03-18 15:42:35 +01:00
Nicolas Patry
67ce543e04
Intel docker. ( #3121 )
...
* Intel docker.
* torchaudio ?
* Fixing dockerfile ?
2025-03-18 15:12:11 +01:00
Nicolas Patry
83fe45c15e
Prepare for patch release. ( #3124 )
2025-03-18 15:11:55 +01:00
Nicolas Patry
11f2eec10e
Publish nix docker image. ( #3122 )
...
* Publish nix docker image.
* Run during PR.
* Something else.
* Forgot to push.
* Build zstd.
* Pushing with skopeo
* Testing the PR.
* Running from nix.
* Cleaner tags.
2025-03-18 12:58:21 +01:00
Mohit Sharma
a35fbdb925
Bug Fix: Sliding Window Attention ( #3112 )
...
* (fix) sliding window attention (see the sketch after this entry)
* (fix) flashinfer
* (typo) collection link
* Add window_size_left param ipex rocm
* Update window size rocm flash decoding
* fix: bump snapshots and improve exceed window test case
* feat: add tests for image types and remove alpha from png
* Upgrading `from_env` to get token from file when necessary + fix pali_gemma.
* fix: add pillow dependency and bump lock+requirements
* fix: bump org name in gemma3 test
* Fix qwen2.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-18 10:37:33 +01:00
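The fix above concerns sliding-window attention. As a general illustration of the technique (not the repository's implementation; the helper below is hypothetical), a sliding-window mask limits each query to itself plus at most `window_size_left` preceding tokens:

```python
# Minimal sketch of a causal sliding-window attention mask.
# `window_size_left` is the number of past tokens a query may attend to;
# the helper name is hypothetical, not taken from the TGI codebase.
import torch

def sliding_window_mask(seq_len: int, window_size_left: int) -> torch.Tensor:
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (column)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (row)
    causal = k <= q                          # never attend to the future
    in_window = (q - k) <= window_size_left  # stay within the window
    return causal & in_window                # True = position is visible

# With a window of 2, token 5 sees only tokens 3, 4, and itself.
print(sliding_window_mask(seq_len=6, window_size_left=2).int())
```

An "exceed window" test case, like the one mentioned above, would feed a sequence longer than the window and check that tokens beyond it are masked out.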
Baptiste Colle
8c2c348f3c
Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork ( #3117 )
...
feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289
2025-03-18 09:45:52 +01:00
Wang, Yi A
5cd1c93cad
add moe support, fix qwen/mistral/mixtral crash
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 00:45:15 -07:00
Daniël de Kok
095775e05c
launcher: correctly get the head dimension for VLMs ( #3116 )
...
* launcher: correctly get the head dimension for VLMs
For most (?) VLMs, the head dimension is in the `text_config`
configuration section. However, since we only queried the top-level
`head_dim` (which typically doesn't exist in VLMs), we would never use
flashinfer. This change adds a method that gets the head dimension from
the top-level `Config` struct or `text_config` when that fails (see the sketch after this entry).
* fix: bump org name in gemma3 test
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
2025-03-17 18:19:37 +01:00
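The launcher itself is Rust, but the fallback described above is easy to sketch in Python against a parsed `config.json` (the helper name is hypothetical):

```python
# Sketch of the head-dimension lookup described above: prefer the
# top-level `head_dim`, then fall back to the `text_config` section
# that VLM-style configs use. Hypothetical helper, not the launcher code.
from typing import Optional

def get_head_dim(config: dict) -> Optional[int]:
    if "head_dim" in config:
        return config["head_dim"]
    return (config.get("text_config") or {}).get("head_dim")

# A VLM-style config where head_dim only exists under text_config:
vlm_config = {"model_type": "gemma3", "text_config": {"head_dim": 256}}
assert get_head_dim(vlm_config) == 256
```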
Wang, Yi
0b3e3db043
xpu 2.6 update ( #3051 )
...
* xpu 2.6 update
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* install whl
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update get xpu memory api
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* int
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix awq crash if modules_to_not_convert is None
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 13:48:48 +01:00
Wang, Yi A
6bbe24d974
use tensor cache in hpu graph to avoid replay issue
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:36:49 -07:00
Wang, Yi A
a07e7437b6
enable all the models, not tested yet
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:26:32 -07:00
Wang, Yi A
5d3653943c
adjust block table in hpu to improve performance
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-16 20:28:01 -07:00
Wang, Yi A
b7fea6fc2f
fix TP in pageattn
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 18:01:58 -07:00
Wang, Yi A
201dc6294f
clean cuda/rocm code in hpu backend, enable flat_hpu
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 01:25:31 -07:00
Daniël de Kok
f91434e99b
Make the Nix-based Docker container work on non-NixOS ( #3109 )
...
On NixOS, the CUDA driver shim gets mounted on /run/opengl-driver,
where Nix packages expect the shim to be. However, on other
distributions, some FHS paths are mounted. This is a small change
to make the dynamic loader find the shim.
2025-03-13 14:02:45 +01:00
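A hedged sketch of the idea in Python (only /run/opengl-driver comes from the entry above; the FHS paths and the logic are illustrative): probe the NixOS shim location first, then common FHS locations, and extend the loader search path with whatever exists.

```python
# Illustrative only: let the dynamic loader find the CUDA driver shim.
# /run/opengl-driver is where NixOS mounts it (per the entry above);
# the FHS paths below are typical guesses for other distributions.
import os

CANDIDATE_DIRS = [
    "/run/opengl-driver/lib",     # NixOS driver shim
    "/usr/lib/x86_64-linux-gnu",  # Debian/Ubuntu-style FHS path
    "/usr/lib64",                 # Fedora/RHEL-style FHS path
]

present = [d for d in CANDIDATE_DIRS if os.path.isdir(d)]
if present:
    prior = os.environ.get("LD_LIBRARY_PATH")
    parts = present + ([prior] if prior else [])
    os.environ["LD_LIBRARY_PATH"] = ":".join(parts)  # for child processes
```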
Nicolas Patry
8b91f92978
Fixing the docker build. ( #3108 )
...
* Fixing the docker build.
* Apply suggestions from code review
2025-03-13 11:26:44 +01:00
Baptiste Colle
27ed848676
Release of Gaudi Backend for TGI ( #3091 )
...
* feat(gaudi): release ready (docs, docker image and vlm ready)
* fix(gaudi): add default argument for the dockerfile
* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
2025-03-13 10:56:01 +01:00
Nicolas Patry
83ef364177
We need gcc during runtime to enable triton to compile kernels. ( #3103 )
...
* We need gcc during runtime to enable triton to compile kernels.
* Fixing the docker build.
2025-03-13 10:45:47 +01:00
Daniël de Kok
83b7b7bb92
Router: add gemma3-text model type ( #3107 )
2025-03-13 10:41:33 +01:00
Daniël de Kok
c73ae0bd88
Update to kernels 0.2.1 ( #3084 )
...
* Update to `kernels` 0.2.1
The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.
* Download kernels in `install-cuda` target
2025-03-13 10:36:29 +01:00
Nicolas Patry
d4c6faa67b
Try to fix the CI color on main. ( #3101 )
2025-03-12 10:12:24 +01:00
Nicolas Patry
4ac06ddf56
Preparing release 3.2.0 ( #3100 )
...
* Preparing release 3.2.0
* Forgot the README.
* Update doc.
2025-03-12 10:11:33 +01:00
David Corvoysier
f01dc9e743
Update neuron backend ( #3098 )
...
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
2025-03-12 09:53:15 +01:00
Nicolas Patry
5c5528e362
Fix tool call4 ( #3094 )
...
* Removing the no_tool content information.
* Removing a lot of NO_TOOL shenanigans.
* Update the tests.
2025-03-12 09:28:47 +01:00
Mohit Sharma
ed46c2c414
Add gemma3 model ( #3099 )
2025-03-12 09:25:51 +01:00
Nicolas Patry
f74c36fe0d
Fix tool call3 ( #3086 )
...
* Fixing the tool calling convention.
* Update the doc.
* Fixing some corner cases.
* Fixing the tool call id.
* Fmt.
* Snapshot update with the updated tool_call_id.
* More qwen2.
2025-03-12 09:22:53 +01:00
celsowm
ae4451c3da
Update README.md ( #3095 )
...
space between param and value
2025-03-11 11:05:21 +01:00
Nicolas Patry
b447f7e821
Fix qwen vl ( #3096 )
...
* Fixing qwen2.5 VL.
* Fixing the CI.
2025-03-11 11:00:41 +01:00
Adrien Gallouët
094975c3a8
Update the llamacpp backend ( #3022 )
...
* Build faster
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Make --model-gguf optional
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Enable mmap, offload_kqv & flash_attention by default
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Better error message
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update doc
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update installed packages
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Save gguf in models/MODEL_ID/model.gguf
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix build with Mach-O
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Quantize without llama-quantize
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp and switch to ggml-org
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Remove make-gguf.sh
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update Cargo.lock
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Support HF_HUB_USER_AGENT_ORIGIN
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Bump llama.cpp
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
drbh
dc5f05f8e6
Pr 3003 ci branch ( #3007 )
...
* change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API" (see the sketch after this entry)
Moving after tool_calls2
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Add in buffering.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
fix: handle usage outside of stream state and add tests
Simplifying everything quite a bit.
Remove the unused model_dump.
Clippy.
Clippy ?
Ruff.
Upgrade the flake for the latest transformers.
Upgrade after rebase.
Remove potential footgun.
Fix completion test.
* Clippy.
* Tweak for multi prompt.
* Ruff.
* Update the snapshot a bit.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-10 17:56:19 +01:00
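For context on the API shape that PR aligns with: in OpenAI-style streaming, token usage arrives in a final chunk whose `choices` list is empty, outside the per-token stream state. A hand-written sketch of a consumer (the chunk payloads are illustrative):

```python
# Illustrative consumer of OpenAI-style ChatCompletionChunk events.
# The final chunk carries `usage` and an empty `choices` list; earlier
# chunks carry content deltas. Payloads here are hand-written examples.
chunks = [
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"content": "Hel"}}], "usage": None},
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"content": "lo"}}], "usage": None},
    {"object": "chat.completion.chunk",
     "choices": [],  # usage-only chunk, outside the normal stream state
     "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7}},
]

text, usage = "", None
for chunk in chunks:
    for choice in chunk["choices"]:
        text += choice["delta"].get("content", "")
    if chunk["usage"] is not None:
        usage = chunk["usage"]

print(text)   # "Hello"
print(usage)  # token accounting from the final chunk
```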
Daniël de Kok
124398fa57
hotfix: qwen2 formatting ( #3093 )
...
* hotfix: qwen2 formatting
* cargo fmt
2025-03-10 16:19:50 +01:00
Daniël de Kok
c5ecc7a4de
Small test and typing fixes ( #3078 )
...
* test_weights: add modules_to_not_convert
* More typing fixes
2025-03-10 15:08:23 +01:00
jiqing-feng
cae0cbe87d
Add modules_to_not_convert in quantized model ( #3053 )
...
* fix modules_to_not_convert
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix format
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix tp quant skip
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* revert unquantized changes
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* use DefaultWeightsLoader in skip modules
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-03-10 15:03:51 +01:00
EachSheep
bbe218a4f7
Add qwen2 multi lora layers support ( #3089 )
...
add qwen2 multi lora layers support to solve problems like https://github.com/huggingface/text-generation-inference/issues/2881 ; a similar PR is at https://github.com/huggingface/text-generation-inference/pull/2883
Co-authored-by: hjs <hjs@pku.edu.cn>
2025-03-10 12:42:59 +01:00
Alex Weston
58a65f7914
Add request parameters to OTel span for /v1/chat/completions endpoint ( #3000 )
...
Record request parameters in OTel span for /v1/chat/completions endpoint
2025-03-10 12:26:57 +01:00
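As a general illustration of recording request parameters on a span (the attribute names below are hypothetical, not necessarily the ones this PR records), using the standard `opentelemetry` Python API:

```python
# Illustrative sketch: attach selected request parameters to an OTel span.
from opentelemetry import trace

tracer = trace.get_tracer("tgi.router.example")  # hypothetical tracer name

def handle_chat_completion(request: dict) -> None:
    with tracer.start_as_current_span("/v1/chat/completions") as span:
        # Record request parameters for observability.
        span.set_attribute("request.model", request.get("model", ""))
        span.set_attribute("request.temperature", request.get("temperature", 1.0))
        span.set_attribute("request.stream", request.get("stream", False))
        # ... actual request handling would go here ...

handle_chat_completion({"model": "gemma3", "temperature": 0.7, "stream": True})
```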
Daniël de Kok
976eae216f
Nix: the launcher needs a Python env with Torch for GPU detection ( #3085 )
...
This makes `nix run .` in the repository work again. Should fix #3025 .
2025-03-10 12:11:10 +01:00
Nicolas Patry
622908deab
Fix tool call2 ( #3076 )
...
* Making `tool_calls` a vector.
* Arguments output is a string (see the sketch after this entry).
* Update all the integration tests.
* Add the requirements.
* Upgrade other tests.
* Clippy.
* Update the old test.
2025-03-07 19:45:57 +01:00
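The first two bullets describe the OpenAI-compatible shape: `tool_calls` is a list (vector), and each function call's `arguments` field is a JSON-encoded string rather than a nested object. A hand-written example fragment:

```python
import json

# Hand-written example of the response shape after this change:
# `tool_calls` is a list, and `arguments` is a JSON string to be parsed.
message = {
    "role": "assistant",
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                # a string, not a nested object
                "arguments": "{\"city\": \"Paris\", \"unit\": \"celsius\"}",
            },
        }
    ],
}

for call in message["tool_calls"]:
    args = json.loads(call["function"]["arguments"])  # decode the string
    print(call["function"]["name"], args["city"])     # get_weather Paris
```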