text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-09 15:05:24 +00:00

Author	SHA1	Message	Date
baptiste	db98b4611b	wip(ci): rerun ci to debug	2025-04-22 08:15:51 +00:00
baptiste	9fdc67af5c	fix llama failing test	2025-04-22 08:15:51 +00:00
baptiste	1cd3f98ff7	feat(ci): llama3 test working	2025-04-22 08:15:51 +00:00
baptiste	e024f1dd22	feat(ci): llama3 test working	2025-04-22 08:15:51 +00:00
baptiste	23fe77f059	wip: able to launch gaudi tests	2025-04-22 08:15:51 +00:00
baptiste	918b29a0af	wip(test): adding test to ci	2025-04-22 08:15:51 +00:00
Nicolas Patry	8f8819795f	Fixing CI (#3184)	2025-04-18 13:07:18 +02:00
Alvaro Bartolome	95ccba3705	Bump `sccache` to 0.10.0 (#3179 ) * Ensure that `sccache` version is 0.10.0 or higher * Rename `ACTIONS_CACHE_URL` to `ACTIONS_RESULTS_URL`	2025-04-18 12:45:32 +02:00
Hyeongchan Kim	b400c275e4	Get opentelemetry trace id from request headers instead of creating a new trace (#2648 ) feature: get trace id from req headers Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-18 09:06:41 +02:00
Daniël de Kok	84ab88d843	Support flashinfer for Gemma3 prefill (#3167 ) * launcher: ensure correct detection of Gemma 3 head size * Support flashinfer for Gemma3 prefill Gemma3 uses bidirectional attention for images. Flashinfer supports custom masks. Hook up the mask with flashinfer, so that we do not have to use the slower SDPA implementation for prefills with images. * Update Gemma3 test outputs * Fixed unused import	2025-04-17 18:07:41 +02:00
Nicolas Patry	4645678ff0	Hotfix gaudi2 with newer transformers. (#3176)	2025-04-15 12:39:28 +02:00
Nicolas Patry	ad765cd06b	Hotfixing gaudi deps. (#3174)	2025-04-15 11:55:28 +02:00
Nicolas Patry	16b4b7974a	Upgrading the dependencies in Gaudi backend. (#3170) * Upgrading the dependencies in Gaudi backend. * Upgrading transformers version.	2025-04-15 11:49:06 +02:00
Wang, Yi	459fbdebe3	transformers flash llm/vlm enabling in ipex (#3152 ) * transformers flash llm/vlm enabling in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * ipex cpu could also support in function Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-04-15 11:08:01 +02:00
Nicolas Patry	449cee49ca	setuptools <= 70.0 is vulnerable: CVE-2024-6345 (#3171)	2025-04-15 10:09:37 +02:00
Mohit Sharma	73e797528d	L4 fixes (#3161) add fix	2025-04-14 22:13:53 +05:30
Nicolas Patry	fe56f760df	Upgrading the python client deps (still deprecated, but used for integration-tests)	2025-04-14 17:18:43 +02:00
Wang, Yi	d62c941c56	Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu (#3113 ) * clean cuda/rocm code in hpu backend, enable flat_hpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix TP in pageattn Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * adjust block table in hpu to improve performance Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable all the model. not testet yet Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * use tensor cache in hpu graph to avoid replay issue Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * add moe support, fix qwen/mistral/mixtral crash Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix phimoe issue Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * gpt_bigcode could also go pageattn Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable dbrx remove some unused code Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * multi-modality initial PR Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * adjust warmup and enable vlm Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix incorrect output in qwen2 idefics if hpu graph is used Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * remove unused quantization code and enable awq/gptq int4 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix gptq issue Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable fp8 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * warmup prefill remove model where pageattn is not used, set block table to None since it's not used Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * add warmup_decode Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * warmup decode Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * remove block_tables and prefill_cache_indices which will lead to dynamic shape Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix comment Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * missing gptj change... 
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix some issue Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * remove torch.where to fix incorrect output in hpu graph model Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * match the latest vllm_extension ops Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-04-14 15:58:13 +02:00
Nicolas Patry	9a8d0462e1	Fixing tokenization like https://github.com/huggingface/text-embeddin … (#3156 ) Fixing tokenization like https://github.com/huggingface/text-embeddings-inference/issues/525	2025-04-09 18:42:25 +02:00
Nicolas Patry	5861da1ad7	Fixing Qwen 2.5 VL (32B). (#3157) Reduce the config constraints, and use common ground between the 8B and 32B.	2025-04-09 17:07:30 +02:00
Nicolas Patry	0b28aabb94	3.2.3 (#3151)	2025-04-08 10:16:37 +02:00
oOraph	24bec29ffc	fix: compute type typo (#3150 ) Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>	2025-04-07 17:24:11 +02:00
Baptiste Colle	37104acd75	Gaudi: Add Integration Test for Gaudi Backend (#3142 ) * feat(gaudi): add integration test * feat(test): add more models to integration tests * remove debug comments * fix typos	2025-04-07 16:55:03 +02:00
Mohit Sharma	87a0af4ec2	Update transformers to 4.51 (#3148 ) * update transformres * Upgrading the nix deps too. * Forcing torchvision to be in there. * Fixing bug in mllama. * Those tests cannot be run in CI. * Lint. --------- Co-authored-by: Pedro Cuenca <pedro@huggingface.co> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-07 12:55:43 +02:00
Mohit Sharma	9c26b52940	Use ROCM 6.3.1 (#3141 ) * update dockerfile * add updated makefile * fix docker * Lint. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-07 12:55:11 +02:00
Nicolas Patry	d23b385eee	Preparing for release. (#3147 ) * Preparing for release. * Adding hf-xet dependency. * Merged tgi-nix update.	2025-04-06 11:36:00 +02:00
Mohit Sharma	d9bb9bebc9	Add llama4 (#3145 ) * initial changes * Add support for other vlm * cleanup comment * Improve attn_implementation * Add comments for support of models * add model * add model * fixes and improvements * update docker * Add cache position * Add tests * remove redundant changes * remove tr version * Upgrade doc + fix linting. * Fixing the CI. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-06 10:20:22 +02:00
Yuan Wu	3d059f91ab	Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE (#3131 ) * Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE Signed-off-by: yuanwu <yuan.wu@intel.com> * Remove debug modifications Signed-off-by: yuanwu <yuan.wu@intel.com> --------- Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-04-03 10:34:53 +02:00
Corentin REGAL	0142550096	nix-v3.2.1 -> v3.2.1-nix (#3129) make it easier to check for version using semver semantic (same major and minor)	2025-03-26 15:36:43 +01:00
Yuan Wu	f5f14dc660	Gaudi: Fix llava-next and mllama crash issue (#3127) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-03-25 15:08:15 +01:00
Nicolas Patry	54d15462dc	Torch 2.6 (#3134 ) * Torch 2.6 * Upgrade the toolchain. * Don't upgrade just yet. * Upgrade toolchain. * Time upgrade. * TGI-nix main. * Upgrade to transformers 4.50	2025-03-24 11:55:49 +01:00
Baptiste Colle	2e60a8dd65	CI: enable server tests for backends (#3128 ) add test for backends	2025-03-20 16:07:31 +01:00
Erik Kaunismäki	e5503eba78	configurable termination timeout (#3126 ) * make shard and webserver termination timeouts configurable * Updating documentation. * Fmt. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-20 14:25:56 +01:00
Nicolas Patry	e497bc09f6	Minor fixes. (#3125)	2025-03-18 15:42:35 +01:00
Nicolas Patry	67ce543e04	Intel docker. (#3121) * Intel docker. * torchaudio ? * Fixing dockerfile ?	2025-03-18 15:12:11 +01:00
Nicolas Patry	83fe45c15e	Prepare for patch release. (#3124)	2025-03-18 15:11:55 +01:00
Nicolas Patry	11f2eec10e	Publish nix docker image. (#3122 ) * Publish nix docker image. * Run during PR. * Something else. * Forgot to push. * Build zstd. * Pushing with skopeo * Testing the PR. * Runnign from nix. * Cleaner tags.	2025-03-18 12:58:21 +01:00
Mohit Sharma	a35fbdb925	Bug Fix: Sliding Window Attention (#3112 ) * (fix) sliding window attention * (fix) flashinfer * (typo) collection link * Add window_size_left param ipex rocm * Update window size rocm flash decoding * fix: bump snapshots and improve exceed window test case * feat: add tests for image types and remove alpha from png * Upgrading `from_env` to get token from file when necessary + fix pali_gemma. * fix: add pillow dependency and bump lock+requirements * fix: bump org name in gemma3 test * Fix qwen2. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-18 10:37:33 +01:00
Baptiste Colle	8c2c348f3c	Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork (#3117 ) feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289	2025-03-18 09:45:52 +01:00
Daniël de Kok	095775e05c	launcher: correctly get the head dimension for VLMs (#3116 ) * launcher: correctly get the head dimension for VLMs For most (?) VLMs, the head dimension is in the `text_config` configuration section. However, since we only queried the top-level `head_dim` (which typically doesn't exist in VLMs), we would never use flashinfer. This change adds a method that gets the head dimension from the top-level `Config` struct or `text_config` when that fails. * fix: bump org name in gemma3 test --------- Co-authored-by: drbh <david.richard.holtz@gmail.com>	2025-03-17 18:19:37 +01:00
Wang, Yi	0b3e3db043	xpu 2.6 update (#3051 ) * xpu 2.6 update Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * install whl Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update get xpu memory api Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * int Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix awq crash if modules_to_not_convert is None Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-03-17 13:48:48 +01:00
Daniël de Kok	f91434e99b	Make the Nix-based Docker container work on non-NixOS (#3109 ) On NixOS, the CUDA driver shim gets mounted on /run/opengl-driver, where Nix packages expect the shim to be. However, on other distributions, some FHS paths are mounted. This is a small change to make the dynamic loader find the shim.	2025-03-13 14:02:45 +01:00
Nicolas Patry	8b91f92978	Fixing the docker build. (#3108 ) * Fixing the docker build. * Apply suggestions from code review	2025-03-13 11:26:44 +01:00
Baptiste Colle	27ed848676	Release of Gaudi Backend for TGI (#3091 ) * feat(gaudi): release ready (docs, docker image and vlm ready) * fix(gaudi): add default argument for the dockerfile * fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices	2025-03-13 10:56:01 +01:00
Nicolas Patry	83ef364177	We need gcc during runtime to enable triton to compile kernels. (#3103 ) * We need gcc during runtime to enable triton to compile kernels. * Fixing the docker build.	2025-03-13 10:45:47 +01:00
Daniël de Kok	83b7b7bb92	Router: add `gemma3-text` model type (#3107)	2025-03-13 10:41:33 +01:00
Daniël de Kok	c73ae0bd88	Update to `kernels` 0.2.1 (#3084 ) * Update to `kernels` 0.2.1 The package was renamed from `hf-kernels` to `kernels`. The new version also updates the lockfile format. * Download kernels in `install-cuda` target	2025-03-13 10:36:29 +01:00
Nicolas Patry	d4c6faa67b	Try to fix on main CI color. (#3101)	2025-03-12 10:12:24 +01:00
Nicolas Patry	4ac06ddf56	Preparing relase 3.2.0 (#3100 ) * Preparing relase 3.2.0 * Forgot the README. * Update doc.	2025-03-12 10:11:33 +01:00
David Corvoysier	f01dc9e743	Update neuron backend (#3098 ) * feat(neuron): use AWS Neuron SDK 2.21.1 * feat(neuron): bump optimum-neuron version * feat(neuron): tag latest image for local tests * test(neuron): simplify sampling test	2025-03-12 09:53:15 +01:00

1374 Commits