Funtowicz Morgan
85790a19a7
misc(gha): expose action cache url and runtime as secrets ( #2964 )
* misc(gha): expose action cache url and runtime as secrets
* (CI): Move S3 Auth to OIDC
* Fix Typo
* change bucket name
* fix aws auth creds
* misc(gha): fix invalid syntax for secrets
* WIP: Add AWS session token
* Increase session time
* Remove actions_cache_url mount from Dockerfile
Removed an unused mount for actions_cache_url in the Dockerfile.
* WIP
---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2025-11-17 10:50:10 +01:00
Alvaro Moran
efb94e0d3d
Patch version 3.3.6 ( #3329 )
* chore: prepare version 3.3.6
* fix(benchmark): clear up progress_gauge fn signature
Otherwise there is a compiler error.
2025-09-16 19:15:23 -04:00
drbh
5e747f4e30
Revert "feat: bump flake including transformers and huggingface_hub versions" ( #3330 )
Revert "feat: bump flake including transformers and huggingface_hub versions …"
This reverts commit 356de85c29 .
2025-09-16 11:32:19 -04:00
drbh
1b90c508af
Revert "Revert "feat: bump flake including transformers and huggingfa… ( #3326 )
Revert "Revert "feat: bump flake including transformers and huggingface_hub v…"
This reverts commit 9dedeb89ac .
2025-09-09 10:44:25 -04:00
Eliott C.
d2ad7c484e
Update iframe sources for streaming demo ( #3327 )
2025-09-09 15:36:19 +02:00
Daniël de Kok
c6071749db
Fix mask passed to flashinfer ( #3324 )
Custom masks are padded to the shape `[batch_size, max_len, max_len]`.
However, flashinfer expects an unpadded mask of the shape
`[sum(q_len[i] * k_len[i] for i in range(batch_size))]`.
This change unpads the custom mask (currently only used by Gemma 3)
to this shape (assuming q_len == k_len, since we only use the custom
mask during prefill).
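The unpadding described here can be sketched in plain Python (a hypothetical helper for illustration, not the actual TGI code, which operates on tensors):

```python
def unpad_custom_mask(mask, q_lens, k_lens):
    """Flatten a padded [batch_size, max_len, max_len] mask into the
    ragged layout flashinfer expects: one flat sequence of length
    sum(q_len[i] * k_len[i] for i in range(batch_size))."""
    flat = []
    for i, (q_len, k_len) in enumerate(zip(q_lens, k_lens)):
        # Keep only the valid q_len x k_len region of sequence i,
        # row by row, dropping the padding.
        for row in range(q_len):
            flat.extend(mask[i][row][:k_len])
    return flat
```

For prefill with q_len == k_len (as in the Gemma 3 case), both length lists are simply the sequence lengths of the batch.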
2025-09-08 13:47:03 -04:00
drbh
4f067c22c3
fix: remove azure ( #3325 )
2025-09-08 13:41:45 -04:00
drbh
9dedeb89ac
Revert "feat: bump flake including transformers and huggingface_hub versions" ( #3323 )
Revert "feat: bump flake including transformers and huggingface_hub versions …"
This reverts commit 356de85c29 .
2025-09-08 12:17:29 +02:00
Phil
5739b5b088
Add missing backslash ( #3311 )
2025-09-06 09:50:14 +02:00
drbh
356de85c29
feat: bump flake including transformers and huggingface_hub versions ( #3313 )
* feat: bump flake including transformers and huggingface_hub versions
* fix: adjust outline version in overlay
2025-09-02 09:46:41 -04:00
Alvaro Moran
0f79162288
chore: prepare version 3.3.5 ( #3314 )
* chore: prepare version 3.3.5
* black
* neuron: black
* Update hf-xet in uv lockfile
* Attempt to fix API doc check failure
Add `error_type` where missing.
* Pin redocly version
* Sync redocly with Nix for now
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2025-09-02 15:35:42 +02:00
Daniël de Kok
06d9d88b95
Disable Cachix pushes ( #3312 )
* Disable Cachix pushes
This is not safe until we have sandboxed builds. For TGI alone
this might not be a huge issue, but with Cachix caching disabled
in hf-nix, TGI CI would build all the packages and push them to
our cache.
* fix: bump docs
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
2025-08-26 13:27:57 -04:00
Alvaro Moran
8801ba12cf
Optimum neuron 0.3.0 ( #3308 )
* chore(neuron): update to optimum-neuron 0.3.0
Dependencies were changed accordingly, because Neuron SDK was updated to
v2.24.
* test: sample is not deterministic
Also modify the temperature in the decode test to avoid Granite early
stopping.
* test(neuron): adjust expectations after graph changes
* test(neuron): use greedy for stop sequences
---------
Co-authored-by: David Corvoysier <david@huggingface.co>
2025-08-26 11:07:47 +02:00
Wang, Yi
d618424d50
HuggingFaceM4/Idefics3-8B-Llama3 crash fix ( #3267 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-08-21 10:04:30 +02:00
Wang, Yi
c5e6f9a178
Fix outline import issue ( #3282 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-08-21 09:53:04 +02:00
Wang, Yi
6624fec1f9
Some GPTQ cases could not be handled by IPEX but could be handled by Triton ( #3298 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-08-19 09:37:49 +02:00
Wang, Yi
5284b5c654
Multi modality fix ( #3283 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-08-19 09:36:36 +02:00
Wang, Yi
6a2fa83540
XCCL for XPU ( #3252 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-08-19 00:37:27 +02:00
Emmanuel Ferdman
b4386b8c77
Migrate to V2 Pydantic interface ( #3262 )
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-08-18 23:55:21 +02:00
Wang, Yi
24c2bff659
Gaudi gptq gidx support ( #3297 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-07-17 16:00:12 +02:00
Yuan Wu
fc2405c549
[gaudi] Fix the CI test errors ( #3286 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-07-07 11:32:07 +02:00
Wang, Yi
ebb26f0ccd
[gaudi] Deepseek v2 mla and add ep to unquantized moe ( #3287 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-07-07 11:29:39 +02:00
Wang, Yi
778b61c0da
[gaudi] Remove unnecessary reinitialization of HeterogeneousNextTokenChooser to make sampling output correct ( #3284 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
2025-07-03 10:03:16 +02:00
David Corvoysier
3d2e7c8fce
Optimum neuron 0.2.2 ( #3281 )
* chore(neuron): use optimum-neuron 0.2.1
* test(neuron): adjust expectations
Since the latest optimum-neuron uses a new modeling for granite and
qwen, the greedy outputs are slightly different.
* test(neuron): add phi3 and qwen3 tests
* chore(neuron): use optimum-neuron 0.2.2
2025-07-03 07:59:25 +02:00
Wang, Yi
f6005d6813
xpu lora support ( #3232 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-07-02 17:54:25 +02:00
Wang, Yi
429dcd9c64
[gaudi] Gemma3 sliding window support ( #3280 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-07-01 10:06:01 +02:00
Baptiste Colle
9f38d93051
Gaudi: add CI ( #3160 )
Co-authored-by: Pauline Bailly-Masson <155966238+paulinebm@users.noreply.github.com>
2025-06-24 18:51:09 +02:00
Wang, Yi
719907410b
[gaudi] Refine RoPE memory: no need to keep a sin/cos cache per layer ( #3274 )
2025-06-23 11:15:39 +02:00
David Corvoysier
238fbd4d50
Neuron backend fix and patch version 3.3.4 ( #3273 )
* fix(neuron): wrong assertion when batch_size==1
* chore: prepare 3.3.4
2025-06-19 10:52:41 +02:00
Wang, Yi
14ee6e7804
[gaudi] Gemma3 text and VLM model initial support; sliding window support to be added later ( #3270 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-19 09:32:34 +02:00
David Corvoysier
bd1bdebb47
doc: fix README ( #3271 )
2025-06-18 12:35:36 +02:00
regisss
f13e28c98d
[gaudi] Refine logging for Gaudi warmup ( #3222 )
* Refine logging for Gaudi warmup
* Make style
* Make style 2
* Flash causal LM case
* Add log_master & VLM cases
* Black
2025-06-18 12:34:00 +02:00
David Corvoysier
b4d17f18ff
chore: prepare release 3.3.3 ( #3269 )
2025-06-18 11:55:26 +02:00
Wang, Yi
0627983c17
[Gaudi] use pad_token_id to pad input id ( #3268 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-17 09:07:25 +02:00
Yuan Wu
3752143b39
[Gaudi] Fix the integration-test issues ( #3265 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-13 14:47:06 +02:00
Yuan Wu
ded4cb52ac
[Gaudi] Enable Qwen3_moe model ( #3244 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-13 12:03:24 +02:00
Wang, Yi
a220e57f45
[gaudi] HuggingFaceM4/idefics2-8b issue fix ( #3264 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-13 12:00:08 +02:00
Yuan Wu
e07056ab3f
[Gaudi] Remove optimum-habana ( #3261 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-12 22:35:36 +02:00
Yuan Wu
25fdc5f03c
[gaudi] Move the _update_cos_sin_cache into get_cos_sin ( #3254 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-12 22:31:11 +02:00
Wang, Yi
613b8dd647
[gaudi] Vlm rebase and issue fix in benchmark test ( #3263 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-12 22:26:37 +02:00
Wang, Yi
839477670a
[gaudi] Perf optimization ( #3256 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-11 15:00:21 +02:00
David Corvoysier
79183d1647
Bump neuron SDK version ( #3260 )
* chore(neuron): bump version to 0.2.0
* refactor(neuron): use named parameters in inputs helpers
This allows us to hide the differences between the two backends in
terms of input parameters.
* refactor(neuron): remove obsolete code paths
* fix(neuron): use neuron_config whenever possible
* fix(neuron): use new cache import path
* fix(neuron): neuron config is not stored in config anymore
* fix(nxd): adapt model retrieval to new APIs
* fix(generator): emulate greedy in sampling parameters
When on-device sampling is enabled, we need to emulate the greedy
behaviour using top-k=1, top-p=1, temperature=1.
* test(neuron): update models and expectations
* feat(neuron): support on-device sampling
* fix(neuron): adapt entrypoint
* tests(neuron): remove obsolete models
* fix(neuron): adjust test expectations for llama on nxd
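The greedy emulation mentioned above can be sketched as follows (a hypothetical helper; the parameter names are illustrative, not the actual generator API):

```python
def greedy_sampling_params():
    # Greedy decoding emulated through on-device sampling: top-k=1
    # always keeps only the highest-probability token, so top-p and
    # temperature have no further effect and are left neutral.
    return {"top_k": 1, "top_p": 1.0, "temperature": 1.0}
```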
2025-06-10 17:56:25 +02:00
Yuan Wu
1ff9d185d5
Remove useless packages ( #3253 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-03 13:42:29 +02:00
Daniël de Kok
249189d96e
Prepare for 3.3.2 ( #3249 )
2025-05-30 16:16:36 +02:00
Yuan Wu
6b6e30a6f6
[gaudi] Fix the Llama-4-Maverick-17B-128E crash issue ( #3246 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-05-29 11:38:44 +02:00
Yuan Wu
70217ac345
[Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct ( #3245 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-05-29 09:58:24 +02:00
Wang, Yi
f14044009a
fp8 compressed tensors w8a8 support for Gaudi backend ( #3242 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-28 14:54:20 +02:00
Yuan Wu
1883a62a94
Add Qwen3 for Gaudi backend ( #3229 )
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-05-23 08:58:35 +02:00
Daniël de Kok
f58d7cf50e
Nix: switch to hf-nix ( #3240 )
* Nix: switch to hf-nix
* Remove outdated local overrides
2025-05-22 17:09:15 +02:00
Wang, Yi
f08b44ade5
Upgrade to new vllm extension ops for Gaudi backend (fix issue in exponential bucketing) ( #3239 )
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-22 15:29:16 +02:00