* launcher: ensure correct detection of Gemma 3 head size
* Support flashinfer for Gemma3 prefill
Gemma3 uses bidirectional attention for images. Flashinfer
supports custom masks. Hook up the mask with flashinfer, so that we do
not have to use the slower SDPA implementation for prefills with images.
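A minimal sketch of the kind of mask described above, assuming image tokens occupy contiguous spans of the prompt. The function name `build_prefill_mask` and the `image_spans` argument are hypothetical, and the way the mask is handed to flashinfer is not shown here.

```python
from typing import List, Tuple


def build_prefill_mask(
    seq_len: int, image_spans: List[Tuple[int, int]]
) -> List[List[bool]]:
    """Return mask[q][k] == True where query q may attend to key k.

    image_spans holds (start, end) token ranges, end exclusive (assumed layout).
    """
    # Start from a plain causal mask: each token sees itself and the past.
    mask = [[k <= q for k in range(seq_len)] for q in range(seq_len)]
    # Tokens inside the same image span attend to each other in both directions.
    for start, end in image_spans:
        for q in range(start, end):
            for k in range(start, end):
                mask[q][k] = True
    return mask


# Five tokens, with tokens 1 and 2 belonging to one image.
mask = build_prefill_mask(5, [(1, 3)])
```

Text tokens stay strictly causal; only the positions inside an image span gain the extra forward-looking entries.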
* Update Gemma3 test outputs
* Fixed unused import
* transformers flash llm/vlm enabling in xpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* ipex cpu can also be supported in this function
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update transformers
* Upgrading the nix deps too.
* Forcing torchvision to be in there.
* Fixing bug in mllama.
* Those tests cannot be run in CI.
* Lint.
---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* initial changes
* Add support for other vlm
* cleanup comment
* Improve attn_implementation
* Add comments for support of models
* add model
* add model
* fixes and improvements
* update docker
* Add cache position
* Add tests
* remove redundant changes
* remove tr version
* Upgrade doc + fix linting.
* Fixing the CI.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* xpu 2.6 update
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* install whl
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update get xpu memory api
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* int
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix awq crash if modules_to_not_convert is None
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
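A sketch of the kind of guard that avoids this class of crash, assuming the failure came from iterating over `modules_to_not_convert` when it is `None`. The helper `should_convert` is hypothetical and not taken from the actual patch.

```python
from typing import List, Optional


def should_convert(
    module_name: str, modules_to_not_convert: Optional[List[str]]
) -> bool:
    """Return True if the module should be quantized (hypothetical helper)."""
    # Treat a missing list the same as an empty one instead of iterating None.
    skip = modules_to_not_convert or []
    return not any(pattern in module_name for pattern in skip)
```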
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Update to `kernels` 0.2.1
The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.
* Download kernels in `install-cuda` target
* Fixing the tool calling convention.
* Update the doc.
* Fixing some corner cases.
* Fixing the tool call id.
* Fmt.
* Snapshot update with the new updated tool_call_id.
* More qwen2.
* feat: support qwen2.5 vl model
* fix: bump support models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
* fix Qwen VL break in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use the PositionRotaryEmbedding impl so that rocm and ipex both work
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
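The hoisting in the two commits above can be sketched as follows. `fake_load_kernel` is a hypothetical stand-in for an expensive loader, not the real `load_kernel` API, and the counter exists only to make the effect visible.

```python
load_count = 0


def fake_load_kernel(name: str) -> dict:
    """Hypothetical stand-in for an expensive kernel load."""
    global load_count
    load_count += 1
    return {"name": name}


# Hoisted: the kernel is loaded once at import time, not on every call.
_QUANTIZATION_KERNEL = fake_load_kernel("quantization")


def hot_function(x: int) -> int:
    # The hot path only reads the already-loaded module-level handle.
    assert _QUANTIZATION_KERNEL["name"] == "quantization"
    return x * 2
```

However often `hot_function` runs, the loader is invoked only once.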
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
* feat: refactor model, improve startup and re enable tests
* fix: improve multimodal rotary embed caching
* fix: limit vision flop calc to qwen2 vl models and update config typing
* fix: include clippy lint
* feat: refactor position ids in warmup and bump tests
* fix: prefer default dtype
* fix: enable all cuda graphs and bump snapshots
* fix: adjust rotary init path
* fix: simplify get position ids and remove unused vision config
* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit
* fix: improve position id init during cuda warmup for mrope and simplify rotary forward
* fix: check existence before accessing rope type in cuda warmup
* fix: check key before access
* fix: improve mrope check in cuda graph warmup
* fix: remove check for default rope type
* fix: add more test and improve model generation
* fix: improve and simplify get_cos_sin, refactors and cleanup get_position_ids
* fix: adjust signatures with types
This version removes our patches/custom API, which makes it simpler to
get changes from upstream. One such change is that we can enable FP8
KV cache for paged attention as well.
* Upgrade the version number.
* Remove modifications in Lock.
* Tmp branch to test transformers backend with 2.5.1 and TP>1
* Fixing the transformers backend.
`inference_mode` forces the use of `aten.matmul` instead of `aten.mm`; the
former doesn't have sharding support, which crashes the transformers TP
support.
`lm_head.forward` also crashes because it skips the hook that
casts/decasts the DTensor.
Torch 2.5.1 is required for sharding support.
* Put back the attention impl.
* Revert the flashinfer (this will fail).
* Building AOT.
* Using 2.5 kernels.
* Remove the archlist, it's defined in the docker anyway.
* Fixing TRTLLM dockerfile.
* Fixed.
* Creating a dummy modification to check CI runs.
* Removing the cache directive.
* Modifying this should cache hit.
* Revert "Modifying this should cache hit."
This reverts commit 46a2bde108.
* Modifying this should cache hit.
* Unwanted files.
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* fix the crash of meta-llama/Llama-3.2-1B
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Simpler fix (which doesn't break vlms).
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* flash decoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable xpu flashdecoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set flashdecoding blocksize as 64
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable flashdecoding, prefill chunking and prefix caching
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add flashdecoding-ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: improve star coder to support multi lora layers
* feat: improve weight that support adapters and add tests for starcoder with lora
* fix: bump snapshot for added tests
* fix: rerun pre commit lints
* fix: bump adapter test for added layer names
* Baichuan2-13B does not have max_position_embeddings in config
see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
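One way to tolerate a config without this field, as a sketch only: the helper name and the choice of `None` as fallback are assumptions, and the actual change in `flash_causal_lm.py` may differ.

```python
from types import SimpleNamespace
from typing import Optional


def get_max_position_embeddings(config, default: Optional[int] = None):
    """Read max_position_embeddings if present, else return a fallback."""
    # getattr with a default avoids AttributeError on configs lacking the field.
    return getattr(config, "max_position_embeddings", default)


# Stand-in config objects; real configs would come from the model repository.
with_field = SimpleNamespace(max_position_embeddings=4096)
without_field = SimpleNamespace(hidden_size=5120)
```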
* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* fmt
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
error like "ValueError: Expecting a ProcessGroup, but got a <class
'text_generation_server.utils.dist.FakeGroup'>. rank=0"
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Basic flashinfer 0.2 support
This change does not use any of the new features yet, but makes
some small compatibility changes.
* Update to flashinfer 0.2.0.post1
* flashinfer: remove `contiguous` calls
* Fix flashinfer install
* flashinfer: fixup kv cache dtype
* Fix some annoying perturbations
* More output changes