text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-09-09 03:14:53 +00:00

Author	SHA1	Message	Date
Daniël de Kok	bd33a23cac	More grpcio shenanigans	2025-07-08 15:00:50 +00:00
Daniël de Kok	df53facda9	AMD grpcio?	2025-07-08 14:15:41 +00:00
Daniël de Kok	a3db7edd67	Set grpcio upper bound to 1.73 (exclusive)	2025-07-08 13:55:00 +00:00
Daniël de Kok	5a6e09e32e	Revert "protobuf < 6.0" This reverts commit `48bb4b4f1e`.	2025-07-08 13:53:13 +00:00
Daniël de Kok	48bb4b4f1e	protobuf < 6.0	2025-07-08 13:32:04 +00:00
Daniël de Kok	bfdaf5773c	Add outlines upper bound Version 1.0.0 and later does not have fsm.guides module. We should update against the 1.0.0 API later.	2025-07-08 13:28:51 +00:00
Yuan Wu	fc2405c549	[gaudi] Fix the CI test errors (#3286 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-07-07 11:32:07 +02:00
Wang, Yi	ebb26f0ccd	[gaudi] Deepseek v2 mla and add ep to unquantized moe (#3287 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-07-07 11:29:39 +02:00
Wang, Yi	778b61c0da	[gaudi] Remove unnecessary reinitialize to HeterogeneousNextTokenChooser to make sampling output correct (#3284 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>	2025-07-03 10:03:16 +02:00
David Corvoysier	3d2e7c8fce	Optimum neuron 0.2.2 (#3281 ) * chore(neuron): use optimum-neuron 0.2.1 * test(neuron): adjust expectations Since the latest optimum-neuron uses a new modeling for granite and qwen, the greedy outputs are slightly different. * test(neuron): add phi3 and qwen3 tests * chore(neuron): use optimum-neuron 0.2.2	2025-07-03 07:59:25 +02:00
Wang, Yi	f6005d6813	xpu lora support (#3232 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-07-02 17:54:25 +02:00
Wang, Yi	429dcd9c64	[gaudi] Gemma3 sliding window support (#3280 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-07-01 10:06:01 +02:00
Baptiste Colle	9f38d93051	Gaudi: add CI (#3160 ) Co-authored-by: Pauline Bailly-Masson <155966238+paulinebm@users.noreply.github.com>	2025-06-24 18:51:09 +02:00
Wang, Yi	719907410b	[gaudi] Refine rope memory, do not need to keep sin/cos cache per layer (#3274 )	2025-06-23 11:15:39 +02:00
David Corvoysier	238fbd4d50	Neuron backend fix and patch version 3.3.4 (#3273 ) * fix(neuron): wrong assertion when batch_size==1 * chore: prepare 3.3.4	2025-06-19 10:52:41 +02:00
Wang, Yi	14ee6e7804	[gaudi] gemma3 text and vlm model initial support. need to add sliding window support later (#3270 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-06-19 09:32:34 +02:00
David Corvoysier	bd1bdebb47	doc: fix README (#3271 )	2025-06-18 12:35:36 +02:00
regisss	f13e28c98d	[gaudi] Refine logging for Gaudi warmup (#3222 ) * Refine logging for Gaudi warmup * Make style * Make style 2 * Flash causal LM case * Add log_master & VLM cases * Black	2025-06-18 12:34:00 +02:00
David Corvoysier	b4d17f18ff	chore: prepare release 3.3.3 (#3269 )	2025-06-18 11:55:26 +02:00
Wang, Yi	0627983c17	[Gaudi] use pad_token_id to pad input id (#3268 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-06-17 09:07:25 +02:00
Yuan Wu	3752143b39	[Gaudi] Fix the integration-test issues (#3265 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-06-13 14:47:06 +02:00
Yuan Wu	ded4cb52ac	[Gaudi] Enable Qwen3_moe model (#3244 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-06-13 12:03:24 +02:00
Wang, Yi	a220e57f45	[gaudi] HuggingFaceM4/idefics2-8b issue fix (#3264 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-06-13 12:00:08 +02:00
Yuan Wu	e07056ab3f	[Gaudi] Remove optimum-habana (#3261 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-06-12 22:35:36 +02:00
Yuan Wu	25fdc5f03c	[gaudi] Move the _update_cos_sin_cache into get_cos_sin (#3254 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-06-12 22:31:11 +02:00
Wang, Yi	613b8dd647	[gaudi] Vlm rebase and issue fix in benchmark test (#3263 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-06-12 22:26:37 +02:00
Wang, Yi	839477670a	[gaudi] Perf optimization (#3256 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-06-11 15:00:21 +02:00
David Corvoysier	79183d1647	Bump neuron SDK version (#3260 ) * chore(neuron): bump version to 0.2.0 * refactor(neuron): use named parameters in inputs helpers This allows to hide the differences between the two backends in terms of input parameters. * refactor(neuron): remove obsolete code paths * fix(neuron): use neuron_config whenever possible * fix(neuron): use new cache import path * fix(neuron): neuron config is not stored in config anymore * fix(nxd): adapt model retrieval to new APIs * fix(generator): emulate greedy in sampling parameters When on-device sampling is enabled, we need to emulate the greedy behaviour using top-k=1, top-p=1, temperature=1. * test(neuron): update models and expectations * feat(neuron): support on-device sampling * fix(neuron): adapt entrypoint * tests(neuron): remove obsolete models * fix(neuron): adjust test expectations for llama on nxd	2025-06-10 17:56:25 +02:00
Yuan Wu	1ff9d185d5	Remove useless packages (#3253 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-06-03 13:42:29 +02:00
Daniël de Kok	249189d96e	Prepare for 3.3.2 (#3249 )	2025-05-30 16:16:36 +02:00
Yuan Wu	6b6e30a6f6	[gaudi] Fix the Llama-4-Maverick-17B-128E crash issue (#3246 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-05-29 11:38:44 +02:00
Yuan Wu	70217ac345	[Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct (#3245 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-05-29 09:58:24 +02:00
Wang, Yi	f14044009a	fp8 compressed tensors w8a8 support for Gaudi backend (#3242 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-05-28 14:54:20 +02:00
Yuan Wu	1883a62a94	Add Qwen3 for Gaudi backend (#3229 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-05-23 08:58:35 +02:00
Daniël de Kok	f58d7cf50e	Nix: switch to hf-nix (#3240 ) * Nix: switch to hf-nix * Remove outdated local overrides	2025-05-22 17:09:15 +02:00
Wang, Yi	f08b44ade5	Upgrade to new vllm extension ops for Gaudi backend (fix issue in exponential bucketing) (#3239 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-05-22 15:29:16 +02:00
Daniël de Kok	674c514d44	Prepare for 3.3.1 (#3238 )	2025-05-22 09:43:55 +02:00
Wang, Yi	9e7e546923	Move input_ids to hpu and remove disposal of adapter_meta (#3237 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-05-22 09:21:31 +02:00
Daniël de Kok	e32528792c	Switch to punica-sgmv kernel from the Hub (#3236 ) * Switch to punica-sgmv kernel from the Hub This also switches (temporarily) to the tgi-nix/kernel-builder merge branch, bumping up to CUDA 12.8 (same as non-Nix Torch). * nix: client depends on aiohttp This probably worked before the nixpkgs bump because a dependency propagated aiohttp.	2025-05-21 15:44:15 +02:00
Wang, Yi	43b1b07fb9	Fix the crash in default ATTENTION path for Gaudi backend (#3235 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-05-20 14:02:32 +02:00
Wang, Yi	000e313a92	Refine warmup and upgrade to synapse AI 1.21.0 (#3234 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-05-20 10:22:43 +02:00
Wang, Yi	d658b5def3	Deepseek R1 for Gaudi backend (#3211 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-05-19 16:36:39 +02:00
drbh	58934c8b61	fix: count gpu uuids if NVIDIA_VISIBLE_DEVICES env set to all (#3230 )	2025-05-16 11:48:58 -04:00
Yuan Wu	18cbecfb38	Enable Llama4 for Gaudi backend (#3223 ) Signed-off-by: yuanwu <yuan.wu@intel.com>	2025-05-15 14:35:37 +02:00
Daniël de Kok	7e531f413d	Update to Torch 2.7.0 (#3221 ) * Update to Torch 2.7.0 * Try to fix typer/click issue * Pin click to fix incompatibility with typer * Fix some test outputs with slight deviations * Attempt again to sync with CI * Mamba too * Fixup mllama Also switch to `unsloth/Llama-3.2-11B-Vision-Instruct` for testing from the EU :).	2025-05-15 11:48:33 +02:00
kaixuanliu	535ce23827	Adjust the `round_up_seq` logic in Gaudi backend (#3224 ) Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>	2025-05-12 09:58:43 +02:00
kaixuanliu	c94f415af4	Change HPU warmup logic: seq length should be with exponential growth (#3217 ) Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>	2025-05-10 15:41:18 +02:00
Daniël de Kok	56c8189467	Prepare for 3.3.0 (#3220 )	2025-05-09 15:50:29 +02:00
Mohit Sharma	329f612e55	Chunked Prefill VLM (#3188 ) * add logic * working * add encoder cache free * fixes * fix idefics * update pixel_values * add improvements * add improvements * improve * nit * fix inputs_embeds * nit * optimizations * add prometheus port * rename vars * rename vars * nit * disable chunking for qwen * review comments * remove port * improve headdim * remove kwargs and redundant args * fix qwen2_5 * fix config image_token_id error * fix test * update paligemma * fix paligemma text * minor fix * fix qwen test * fix qwen test	2025-05-06 18:01:59 +02:00
Wang, Yi	533eee50dc	forward and tokenize chooser use the same shape (#3196 ) * forward and tokenize chooser use the same shape concate or filter happened to cpu tensor to avoid dynamic shape in hpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * use hpu set seed Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-05-06 10:49:32 +02:00

1429 Commits