* enable transformers flash LLM/VLM on XPU
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* IPEX CPU could also be supported in this function
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update transformers
* Upgrading the nix deps too.
* Forcing torchvision to be in there.
* Fixing bug in mllama.
* Those tests cannot be run in CI.
* Lint.
---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* initial changes
* Add support for other vlm
* cleanup comment
* Improve attn_implementation
* Add comments for support of models
* add model
* add model
* fixes and improvements
* update docker
* Add cache position
* Add tests
* remove redundant changes
* remove tr version
* Upgrade doc + fix linting.
* Fixing the CI.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* xpu 2.6 update
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* install whl
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update get xpu memory api
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* int
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix awq crash if `modules_to_not_convert` is None (see the guard sketch after this group)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
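A minimal sketch of the kind of guard implied by the AWQ fix above; the function name and config shape are illustrative, not the repository's exact code:

```python
def should_skip_quantization(layer_name: str, modules_to_not_convert=None) -> bool:
    """Return True if `layer_name` should stay unquantized.

    AWQ configs may omit `modules_to_not_convert` entirely; treating None as
    an empty list avoids crashing while iterating over it.
    """
    modules_to_not_convert = modules_to_not_convert or []
    return any(excluded in layer_name for excluded in modules_to_not_convert)
```

With this guard, `should_skip_quantization("model.lm_head", None)` simply returns False instead of raising.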
* Update to `kernels` 0.2.1
The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.
* Download kernels in `install-cuda` target
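For reference, the rename above is mostly mechanical at the call site. A minimal sketch, assuming the `get_kernel` entry point kept its name across the rename (the kernel repo id below is illustrative):

```python
# Before: the package was published as `hf-kernels` (Python module `hf_kernels`).
# from hf_kernels import get_kernel

# After the rename to `kernels` (0.2.1 also switches to the new lockfile format):
from kernels import get_kernel

# Fetch a compiled kernel from the Hugging Face Hub by repo id.
activation = get_kernel("kernels-community/activation")
```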
* Fixing the tool calling convention.
* Update the doc.
* Fixing some corner cases.
* Fixing the tool call id.
* Fmt.
* Snapshot update with the new updated tool_call_id.
* More qwen2.
* feat: support qwen2.5 vl model
* fix: bump support models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
* fix Qwen VL breakage on the Intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use the PositionRotaryEmbedding impl so ROCm and IPEX both work
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function (see the caching sketch after this group)
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
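The two "hoist kernel loading" items above boil down to resolving Hub kernels once per process instead of inside hot forward paths. A minimal sketch, assuming a `load_kernel(repo_id)` helper backed by `kernels.get_kernel` (the signature and repo id are assumptions):

```python
from functools import lru_cache

from kernels import get_kernel


@lru_cache(maxsize=None)
def load_kernel(repo_id: str):
    """Resolve a Hub kernel once per process; later calls hit the cache."""
    return get_kernel(repo_id)


# Hoisted to module scope: the (potentially slow) resolution happens at import
# time, so per-token forward passes only do an attribute lookup.
_paged_attention = load_kernel("kernels-community/paged-attention")
```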
* feat: refactor model, improve startup and re-enable tests
* fix: improve multimodal rotary embed caching
* fix: limit vision flop calc to qwen2 vl models and update config typing
* fix: include clippy lint
* feat: refactor position ids in warmup and bump tests
* fix: prefer default dtype
* fix: enable all cuda graphs and bump snapshots
* fix: adjust rotary init path
* fix: simplify get position ids and remove unused vision config
* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit
* fix: improve position id init during cuda warmup for mrope and simplify rotary forward
* fix: check existence before accessing rope type in cuda warmup (see the guard sketch after this group)
* fix: check key before access
* fix: improve mrope check in cuda graph warmup
* fix: remove check for default rope type
* fix: add more test and improve model generation
* fix: improve and simplify get_cos_sin, refactors and cleanup get_position_ids
* fix: adjust signatures with types
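The "check before access" fixes above amount to defensive reads of optional rope fields during CUDA graph warmup. A hedged sketch, with the helper name and key fallbacks as illustrative choices:

```python
def uses_mrope(config) -> bool:
    """Detect multimodal rope (mrope) without assuming optional keys exist."""
    rope_scaling = getattr(config, "rope_scaling", None)
    if not rope_scaling:
        return False
    # Newer configs use "rope_type", older ones "type"; guard both.
    rope_type = rope_scaling.get("rope_type") or rope_scaling.get("type")
    return rope_type == "mrope"
```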
This version removes our patches/custom API, which makes it simpler to
pull in changes from upstream. One of those changes is that we can enable
FP8 KV cache for paged attention as well.
* Upgrade the version number.
* Remove modifications in Lock.
* Tmp branch to test transformers backend with 2.5.1 and TP>1
* Fixing the transformers backend.
inference_mode forces the use of `aten.matmul` instead of `aten.mm`; the
former doesn't have sharding support, which crashes the transformers TP
support.
`lm_head.forward` also crashes because it skips the hook that
casts/decasts the DTensor.
Torch 2.5.1 is required for sharding support.
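A minimal sketch of the workaround implied above, swapping `torch.inference_mode` for `torch.no_grad` around the TP forward; whether this is the exact change applied in the backend is an assumption, the failure mode is as described in the message:

```python
import torch


def tp_generate_step(model, input_ids):
    # Per the message above, under torch.inference_mode() the linear
    # projections lower to aten.matmul, which lacks sharding support for
    # DTensor weights on torch 2.5.1 and crashes tensor-parallel runs.
    # torch.no_grad() still skips autograd bookkeeping but keeps the
    # shardable aten.mm path.
    with torch.no_grad():
        return model(input_ids).logits
```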
* Put back the attention impl.
* Revert the flashinfer (this will fail).
* Building AOT.
* Using 2.5 kernels.
* Remove the archlist, it's defined in the docker anyway.
* Fixing TRTLLM dockerfile.
* Fixed.
* Creating a dummy modification to check CI runs.
* Removing the cache directive.
* Modifying this should cache hit.
* Revert "Modifying this should cache hit."
This reverts commit 46a2bde108.
* Modifying this should cache hit.
* Unwanted files.
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* fix the crash of meta-llama/Llama-3.2-1B
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Simpler fix (which doesn't break vlms).
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* flash decoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable xpu flashdecoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set the flashdecoding block size to 64 (see the selection sketch after this group)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable flashdecoding, prefill chunking and prefix caching
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add flashdecoding-ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
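A hedged sketch of the block-size selection referenced above: only the 64 value for the IPEX flashdecoding path comes from these commits; the environment variable name and the other table entries are assumptions.

```python
import os

# Backend -> KV-cache block size. Only the flashdecoding-ipex entry (64) is
# taken from the commits above; the rest of the table is illustrative.
_BLOCK_SIZES = {
    "flashdecoding-ipex": 64,
    "flashdecoding": 256,
    "paged": 16,
}

ATTENTION = os.environ.get("ATTENTION", "paged")  # variable name is an assumption
BLOCK_SIZE = _BLOCK_SIZES.get(ATTENTION, 16)
```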
* feat: improve star coder to support multi lora layers
* feat: improve weights that support adapters and add tests for starcoder with lora
* fix: bump snapshot for added tests
* fix: rerun pre commit lints
* fix: bump adapter test for added layer names
* Baichuan2-13B does not have max_position_embeddings in config
see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
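A minimal sketch of a tolerant lookup for the field the Baichuan2-13B config omits; the fallback chain (including `model_max_length`) is an assumption, not necessarily the fix that was merged:

```python
def get_max_position_embeddings(config, default: int = 4096) -> int:
    """Read the context length without requiring `max_position_embeddings`.

    Baichuan2-13B's config omits that key, so fall back to other common
    fields and finally to a default.
    """
    for key in ("max_position_embeddings", "model_max_length", "seq_length"):
        value = getattr(config, key, None)
        if value is not None:
            return int(value)
    return default
```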
* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* fmt
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
error like "ValueError: Expecting a ProcessGroup, but got a <class
'text_generation_server.utils.dist.FakeGroup'>. rank=0"
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Basic flashinfer 0.2 support
This change does not use any of the new features yet, but makes
some small compatibility changes.
* Update to flashinfer 0.2.0.post1
* flashinfer: remove `contiguous` calls
* Fix flashinfer install
* flashinfer: fixup kv cache dtype
* Fix some annoying perturbations
* More output changes