text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-09 15:05:24 +00:00

Author	SHA1	Message	Date
Daniël de Kok	84ab88d843	Support flashinfer for Gemma3 prefill (#3167 ) * launcher: ensure correct detection of Gemma 3 head size * Support flashinfer for Gemma3 prefill Gemma3 uses bidirectional attention for images. Flashinfer supports custom masks. Hook up the mask with flashinfer, so that we do not have to use the slower SDPA implementation for prefills with images. * Update Gemma3 test outputs * Fixed unused import	2025-04-17 18:07:41 +02:00
Wang, Yi	459fbdebe3	transformers flash llm/vlm enabling in ipex (#3152 ) * transformers flash llm/vlm enabling in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * ipex cpu could also support in function Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-04-15 11:08:01 +02:00
Mohit Sharma	73e797528d	L4 fixes (#3161 ) add fix	2025-04-14 22:13:53 +05:30
Mohit Sharma	87a0af4ec2	Update transformers to 4.51 (#3148 ) * update transformres * Upgrading the nix deps too. * Forcing torchvision to be in there. * Fixing bug in mllama. * Those tests cannot be run in CI. * Lint. --------- Co-authored-by: Pedro Cuenca <pedro@huggingface.co> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-07 12:55:43 +02:00
Mohit Sharma	9c26b52940	Use ROCM 6.3.1 (#3141 ) * update dockerfile * add updated makefile * fix docker * Lint. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-07 12:55:11 +02:00
Mohit Sharma	d9bb9bebc9	Add llama4 (#3145 ) * initial changes * Add support for other vlm * cleanup comment * Improve attn_implementation * Add comments for support of models * add model * add model * fixes and improvements * update docker * Add cache position * Add tests * remove redundant changes * remove tr version * Upgrade doc + fix linting. * Fixing the CI. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-06 10:20:22 +02:00
Mohit Sharma	a35fbdb925	Bug Fix: Sliding Window Attention (#3112 ) * (fix) sliding window attention * (fix) flashinfer * (typo) collection link * Add window_size_left param ipex rocm * Update window size rocm flash decoding * fix: bump snapshots and improve exceed window test case * feat: add tests for image types and remove alpha from png * Upgrading `from_env` to get token from file when necessary + fix pali_gemma. * fix: add pillow dependency and bump lock+requirements * fix: bump org name in gemma3 test * Fix qwen2. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-18 10:37:33 +01:00
Wang, Yi	0b3e3db043	xpu 2.6 update (#3051 ) * xpu 2.6 update Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * install whl Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update get xpu memory api Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * int Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix awq crash if modules_to_not_convert is None Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-03-17 13:48:48 +01:00
Daniël de Kok	c73ae0bd88	Update to `kernels` 0.2.1 (#3084 ) * Update to `kernels` 0.2.1 The package was renamed from `hf-kernels` to `kernels`. The new version also updates the lockfile format. * Download kernels in `install-cuda` target	2025-03-13 10:36:29 +01:00
Mohit Sharma	ed46c2c414	Add gemma3 model (#3099 )	2025-03-12 09:25:51 +01:00
Nicolas Patry	f74c36fe0d	Fix tool call3 (#3086 ) * Fixing the tool calling convention. * Update tehe doc. * Fixing some corner cases. * Fixing the tool call id. * Fmt. * Snapshot update with the new updated tool_call_id. * More qwen2.	2025-03-12 09:22:53 +01:00
Nicolas Patry	b447f7e821	Fix qwen vl (#3096 ) * Fixing qwen2.5 VL. * Fixing the CI.	2025-03-11 11:00:41 +01:00
Daniël de Kok	124398fa57	hotfix: qwen2 formatting (#3093 ) * hotfix: qwen2 formatting * cargo fmt	2025-03-10 16:19:50 +01:00
Daniël de Kok	c5ecc7a4de	Small test and typing fixes (#3078 ) * test_weights: add modules_to_not_convert * More typing fixes	2025-03-10 15:08:23 +01:00
jiqing-feng	cae0cbe87d	Add modules_to_not_convert in quantized model (#3053 ) * fix modules_to_not_convert Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * fix format Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * fix tp quant skip Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * revert unquantized changes Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * use DefaultWeightsLoader in skip modules Signed-off-by: jiqing-feng <jiqing.feng@intel.com> --------- Signed-off-by: jiqing-feng <jiqing.feng@intel.com>	2025-03-10 15:03:51 +01:00
EachSheep	bbe218a4f7	Add qwen2 multi lora layers support (#3089 ) add qwen2 multi lora layers support to solve problem like https://github.com/huggingface/text-generation-inference/issues/2881, the similar PR are at https://github.com/huggingface/text-generation-inference/pull/2883 Co-authored-by: hjs <hjs@pku.edu.cn>	2025-03-10 12:42:59 +01:00
Nicolas Patry	31766dad77	Force upgrade transformers version for olmo.	2025-03-05 12:17:09 +01:00
Hugo Larcher	d8ff7f2623	feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. (#3061 ) * feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. * fix: Rust version for Neuron * fix: PR comments, use rust-toolchain.toml	2025-03-04 16:43:50 +01:00
Wang, Yi	d7a24c03cf	some minor fix (#3048 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-02-25 12:07:55 +01:00
Daniël de Kok	97c5f7e685	Use `rotary` kernel from the Hub (#3041 )	2025-02-21 13:55:31 +01:00
Wang, Yi	06dfe9abfe	fix qwen2 vl crash in continous batching (#3004 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-02-20 18:36:45 -05:00
Daniël de Kok	ed96ba6503	flashinfer 0.2.0.post1 -> post2 (#3040 ) * flashinfer 0.2.0.post1 -> post2 * Fix ruff stuff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-02-20 12:34:20 +01:00
drbh	d6a0c67e2f	feat: add initial qwen2.5-vl model and test (#2971 ) * feat: support qwen2.5 vl model * fix: bump support models doc * feat: check before rope type adjustment and small refactors * fix: add transformer overlay for processor support * fix: vendor processor and config from transformers * fix: refactor/simplify conditionals	2025-02-19 12:38:20 +01:00
Cyril Vallez	a7448661f7	Improve Transformers support (#2970 ) * Much better support * add gpt neox * bump transformers version * bump version	2025-02-18 19:04:34 +01:00
Daniël de Kok	f0ed76583c	Use eetq kernel from the hub (#3029 ) * Use eetq kernel from the hub * Fixing the CI. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-02-18 10:03:53 +01:00
Daniël de Kok	6df0fc0b55	Support sigmoid scoring function in GPTQ-MoE (#3017 )	2025-02-14 11:33:49 +01:00
Nicolas Patry	d6881c37ab	Putting back the NCCL forced upgrade. (#2999 ) * Putting back the NCCL forced upgrade. * . * ... * Ignoring conda. * Dropping conda from the buidl system + torch 2.6 * Cache min. * Rolling back torch version. * Reverting the EETQ modification. * Fix flash attention ? * Actually stay on flash v1. * Patching flash v1. * Torch 2.6, fork of rotary, eetq updated. * Put back nccl latest (override torch). * Slightly more reproducible build and not as scary.	2025-02-14 11:31:59 +01:00
Wang, Yi	76bcb4948d	fix Qwen VL break in intel platform (#3002 ) * fix Qwen VL break in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * could use PositionRotaryEmbedding impl so rocm and ipex could all work Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-02-12 11:31:34 +01:00
Daniël de Kok	571ac9b507	Use kernels from the kernel hub (#2988 ) * Use Hub kernels for Marlin and cutlass quantization kernels * Use hub kernels for MoE/GPTQ-Marlin MoE * Use attention kernels from the Hub * Cache the kernels in the Docker image * Update moe kernels * Support loading local kernels for development * Support latest moe kernels * Update to moe 0.1.1 * CI: download locked kernels for server tests * Fixup some imports * CI: activate venv * Fix unused imports * Nix: add attention/moe/quantization kernels * Update hf-kernels to 0.1.5 * Update kernels * Update tgi-nix flake for hf-kernels * Fix EOF * Take `load_kernel` out of a frequently-called function * Hoist another case of kernel loading out of a somewhat hot function * marlin-kernels -> quantization * attention -> paged-attention * EOF fix * Update hf-kernels, fixup Docker * ipex fix * Remove outdated TODO	2025-02-10 19:19:25 +01:00
Nicolas Patry	0ef8c8a97a	Using the "lockfile". (#2992 ) * Using the "lockfile". * Revert dummy modifications. * Lock on python 3.11 * Another attempt. * .. * Bad cache hits. * The good old monkey. * How in the world... * We need the launcher still. * . * .. * Attempt #42 * Don't break all other builds. * Mode max. * Applying to other builds.	2025-02-06 12:28:24 +01:00
drbh	c1cf36c0dc	Improve qwen vl impl (#2943 ) * feat: refactor model, improve startup and re enable tests * fix: improve multimodal rotary embed caching * fix: limit vision flop calc to qwen2 vl models and update config typing * fix: include clippy lint * feat: refactor position ids in warmup and bump tests * fix: prefer default dtype * fix: enable all cuda graphs and bump snapshots * fix: adjust rotaty init path * fix: simplify get position ids and remove usused vision config * fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit * fix: improve position id init during cuda warmup for mrope and simplfy rotary forward * fix: check existance before accessing rope type in cuda warmup * fix: check key before access * fix: improve mrope check in cuda graph warmup * fix: remove check for default rope type * fix: add more test and improve model generation * fix: improve and simplify get_cos_sin, refactors and cleanup get_position_ids * fix: adjust signatures with types	2025-02-04 12:44:18 -05:00
Nicolas Patry	c9d68945cc	Prepare for release 3.1.0 (#2972 ) * Prepare for release 3.1.0 * Back on main flake. * Fixing stuff. * Upgrade to moe-kernels 0.8.2 for Hip support. * Deactivating the flaky test.	2025-01-31 14:19:01 +01:00
Nicolas Patry	cb747b33da	Add deepseekv3 (#2968 ) * Add fp8 support moe models add deepseekv3 format codfe' update dockerfile update doc * Small modifications. * Moe kernels 0.8.1 * Upgrade to 0.8.1 * Fixing moe import. * Black. * Apply suggestions from code review Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com> * Fixing Mixtral + Nits. * Put link to ref. * Fix other call locations. * Scoring func `softmax` is the only one that works. --------- Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>	2025-01-30 16:40:25 +01:00
Nicolas Patry	80e7d98f88	Hotfixing intel-cpu (not sure how it was working before). (#2967 ) * Hotfixing intel-cpu (not sure how it was working before). * Do not fail on missing moe-kernels (Intel-cpu).	2025-01-29 22:34:41 +01:00
Daniël de Kok	ee0dffcd14	Update to moe-kernels 0.8.0 (#2966 )	2025-01-29 18:19:55 +01:00
Mohit Sharma	4ef2e045c9	Add fp8 support moe models (#2928 ) * Add fp8 support moe models * flatten condition	2025-01-29 13:56:32 +01:00
Nicolas Patry	eb3df0f46f	Fixing the oom maybe with 2.5.1 change. (#2958 )	2025-01-28 10:30:28 +01:00
Daniël de Kok	db922eb77e	Update to attention-kernels 0.2.0 (#2950 ) This version removes our patches/custom API. Makes it simpler to get changes from upstream. One of which is that we can enable FP8 KV cache for paged attention as well.	2025-01-27 11:42:36 +01:00
Nicolas Patry	d9dda11726	Trying to put back the archlist (to fix the oom). (#2947 )	2025-01-24 09:32:17 +01:00
Cyril Vallez	18c4607d46	Transformers backend TP fix (#2945 ) * init dispatch * cohere fix	2025-01-23 18:09:57 +01:00
Nicolas Patry	29a0893b67	Tmp tp transformers (#2942 ) * Upgrade the version number. * Remove modifications in Lock. * Tmp branch to test transformers backend with 2.5.1 and TP>1 * Fixing the transformers backend. inference_mode forces the use of `aten.matmul` instead of `aten.mm` the former doesn't have sharding support crashing the transformers TP support. `lm_head.forward` also crashes because it skips the hook that cast/decast the DTensor. Torch 2.5.1 is required for sharding support. * Put back the attention impl. * Revert the flashinfer (this will fails). * Building AOT. * Using 2.5 kernels. * Remove the archlist, it's defined in the docker anyway.	2025-01-23 18:07:30 +01:00
Daniël de Kok	1dd346666a	Clarify FP8-Marlin use on capability 8.9 (#2940 ) The log message stated that the GPU does not support FP8 on capability 8.9. However we use FP8-Marlin on that capability because it is faster.	2025-01-22 18:18:11 +01:00
Wang, Yi	1d3c9beba8	fix moe in quantization path (#2935 ) update ipex xpu to support moe for mixtral Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-01-22 14:36:15 +01:00
Cyril Vallez	b980848abf	Flash Transformers modeling backend support (#2913 ) * add transformers_flash * inits * switch version to make it work * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * runnable version * working * push change * fix high dim * init * default * latest transformers changes * revert * simplify check * remove flag * improve type hints + required args * Update based on transformers PR * small fix * Remove Warpers for Processor * fix compatibility version issue * raise error if needed * Simplify with monkey patch * revert + style + minor improvements * update comment * device check * move the import to avoid device issue * Update __init__.py * check for non-native models * oupsi --------- Co-authored-by: System administrator <root@ip-10-90-0-159.ec2.internal>	2025-01-21 10:01:51 +01:00
Nicolas Patry	447a5b2f87	Fixing TRTLLM dockerfile. (#2922 ) * Fixing TRTLLM dockerfile. * Fixed. * Creating a dummy modification to chekc CI runs. * Removing the cache directive. * Modifying this should cache hit. * Revert "Modifying this should cache hit." This reverts commit `46a2bde108`. * Modifying this should cache hit. * Unwanted files.	2025-01-20 11:13:46 +01:00
Daniël de Kok	630f198624	flashinfer: switch to plan API (#2904 ) This change doesn't switch `forward` to `run` yet, since it requires that we have access to the softmax scale and the logit softcap outside the model.	2025-01-17 18:18:02 +01:00
drbh	8f6146f11a	Revert "feat: improve qwen2-vl startup " (#2924 ) Revert "feat: improve qwen2-vl startup (#2802)" This reverts commit `eecca27113`.	2025-01-17 12:09:05 -05:00
drbh	eecca27113	feat: improve qwen2-vl startup (#2802 ) * feat: tokenize each request individually and increase warmup image size * feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller * fix: address image resize and rebase changes * feat: update to run qwen2-vl tests * fix: tweak param types	2025-01-17 11:50:41 -05:00
Wang, Yi	6e982f43a1	fix the crash of meta-llama/Llama-3.2-1B (#2918 ) * fix the crash of meta-llama/Llama-3.2-1B Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review Simpler fix (which doesn't break vlms). --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-01-17 15:50:58 +01:00
Mohit Sharma	c20025dbf7	Add fp8 kv cache for ROCm (#2856 ) * add fp8 kv cache for rocm * improvements * update log statement * remove bookkeeping field	2025-01-17 18:43:29 +05:30

711 Commits