* Enable transformers flash LLM/VLM on XPU
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* ipex CPU could also be supported in this function
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update transformers
* Upgrading the nix deps too.
* Forcing torchvision to be in there.
* Fixing bug in mllama.
* Those tests cannot be run in CI.
* Lint.
---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* initial changes
* Add support for other vlm
* cleanup comment
* Improve attn_implementation
* Add comments for support of models
* add model
* add model
* fixes and improvements
* update docker
* Add cache position
* Add tests
* remove redundant changes
* remove tr version
* Upgrade doc + fix linting.
* Fixing the CI.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Fixing the tool calling convention.
* Update the doc.
* Fixing some corner cases.
* Fixing the tool call id.
* Fmt.
* Snapshot update with the new updated tool_call_id.
* More qwen2.
* feat: support qwen2.5 vl model
* fix: bump supported models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
* feat: refactor model, improve startup and re-enable tests
* fix: improve multimodal rotary embed caching
* fix: limit vision flop calc to qwen2 vl models and update config typing
* fix: include clippy lint
* feat: refactor position ids in warmup and bump tests
* fix: prefer default dtype
* fix: enable all cuda graphs and bump snapshots
* fix: adjust rotary init path
* fix: simplify get position ids and remove unused vision config
* fix: update position ids so the first dim is batch, simplify rotary and bump vlm default token limit
* fix: improve position id init during cuda warmup for mrope and simplify rotary forward
* fix: check existence before accessing rope type in cuda warmup
* fix: check key before access
* fix: improve mrope check in cuda graph warmup
* fix: remove check for default rope type
* fix: add more tests and improve model generation
* fix: improve and simplify get_cos_sin, refactors and cleanup get_position_ids
* fix: adjust signatures with types
* Upgrade the version number.
* Remove modifications in Lock.
* Tmp branch to test transformers backend with 2.5.1 and TP>1
* Fixing the transformers backend.
inference_mode forces the use of `aten.matmul` instead of `aten.mm`; the
former doesn't have sharding support, which crashes the transformers TP
support.
`lm_head.forward` also crashes because it skips the hook that
casts/decasts the DTensor.
Torch 2.5.1 is required for sharding support.
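A minimal sketch of the kind of guard this implies (not the actual patch; `world_size` is an assumed parameter):
```python
import torch

def run_ctx(world_size: int):
    # torch.inference_mode() lowers matmuls to aten.matmul, which has no
    # DTensor sharding support, so only use it when not tensor-parallel.
    return torch.inference_mode() if world_size == 1 else torch.no_grad()
```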
* Put back the attention impl.
* Revert the flashinfer (this will fail).
* Building AOT.
* Using 2.5 kernels.
* Remove the archlist, it's defined in the docker anyway.
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* fix the crash of meta-llama/Llama-3.2-1B
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Simpler fix (which doesn't break vlms).
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* flash decoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable xpu flashdecoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set flashdecoding block size to 64
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable flashdecoding, prefill chunking and prefix caching
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add flashdecoding-ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: improve starcoder to support multi-lora layers
* feat: improve weights that support adapters and add tests for starcoder with lora
* fix: bump snapshot for added tests
* fix: rerun pre commit lints
* fix: bump adapter test for added layer names
* Baichuan2-13B does not have max_position_embeddings in its config;
see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
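A hedged sketch of the fallback this suggests (the attribute names follow the configs above; the default value is an assumption, not TGI's actual fix):
```python
def get_max_position_embeddings(config, assumed_default: int = 2048) -> int:
    # Baichuan2-13B's config exposes `model_max_length` instead of
    # `max_position_embeddings`; read whichever is present.
    return getattr(
        config,
        "max_position_embeddings",
        getattr(config, "model_max_length", assumed_default),
    )
```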
* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* fmt
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Basic flashinfer 0.2 support
This change does not use any of the new features yet, but makes
some small compatibility changes.
* Update to flashinfer 0.2.0.post1
* flashinfer: remove `contiguous` calls
* Fix flashinfer install
* flashinfer: fixup kv cache dtype
* Fix some annoying perturbations
* More output changes
* Fix runtime error when Qwen2-VL was prompted with multiple images
Fix a runtime error when the Qwen2-VL model is prompted with a prompt
containing more than one image. The runtime error was:
File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign
The error was caused by the text_length variable going negative when
multiple images triggered multiple iterations of the main loop in the
get_position_ids function.
The root cause is a simple logic mistake: next_image_pos is initialized
as a relative offset from current_pos, but was used as if it were an
absolute position from zero.
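A hedged sketch of the relationship described above (heavily simplified; the real get_position_ids handles the full mrope layout):
```python
import torch

def text_positions(current_pos: int, next_image_pos: int, device: str = "cpu"):
    # next_image_pos is a *relative* offset from current_pos, so the text span
    # before the next image is simply next_image_pos tokens long. The buggy
    # version subtracted current_pos again, which could turn negative.
    text_length = next_image_pos
    return current_pos + torch.arange(text_length, device=device)
```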
* Fix runtime error when Qwen2-VL was prompted with multiple images
Fix a runtime error when the Qwen2-VL model is prompted with a prompt
containing more than one image. The runtime error was:
File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]
(The shape numbers in the error message can differ depending on the
input image resolutions.)
The error was caused by adding the wrong number of <|image_pad|> tokens
to the tokenized input in the image_text_replacement function.
The root cause is a simple logical mistake: the number of image pad
tokens was taken from the length of the first dimension of the
pixel_value_shape tensor. However, pixel_value_shape contains the
patches from all of the images, so the code added the total number of
image pad tokens required for the whole input at each image's location.
This resulted in extra image pad tokens being present in the tokenized
input.
The fix is to compute the number of required tokens from the
image_grid_thw tensor, which holds the grid_t, grid_h, and grid_w
values for each image. grid_t * grid_h * grid_w gives the total number
of patches for the image [1], and the number of required image pad
tokens is number_of_patches // 4.
[1] 31f9a289a6/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py (L311)
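A hedged sketch of the corrected per-image count (the // 4 follows the description above; reading it as a 2x2 spatial merge is an assumption):
```python
import torch

def image_pad_token_counts(image_grid_thw: torch.Tensor) -> list[int]:
    # image_grid_thw holds one (grid_t, grid_h, grid_w) row per image;
    # grid_t * grid_h * grid_w is the number of patches for that image,
    # and the number of <|image_pad|> tokens is number_of_patches // 4.
    return [int(t * h * w) // 4 for t, h, w in image_grid_thw.tolist()]
```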
---------
Co-authored-by: Janne Alatalo <janne.alatalo@jamk.fi>
* Using both values from the config as they might not be correct.
* Fixing max_position_embeddings for falcon.
* Simple attempt to fix the healthcheck block allocation.
* Much simpler solution.
* Default value for Backend start_health
* Attempt at automatic max batch prefill.
* Taking into account number of shards.
* Adding more cards.
* Adding A100 + H100
* Adding a few more cards.
* Logprobs cost too much.
* Better name for h100, and keep the factor of 2.
* Damn inflated sparse tflops.
* Typo in h100.
* Updated the flops calculation (checked with fvcore).
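Purely as an illustration of the heuristic the bullets above sketch, here is one shape such an estimate could take; the names, the latency budget, and how the factor of 2 is applied are all assumptions, not TGI's actual formula:
```python
def estimate_max_batch_prefill_tokens(
    card_dense_tflops: float,        # per-card dense (non-sparse) TFLOPS
    num_shards: int,                 # tensor-parallel shards
    model_tflops_per_token: float,   # e.g. estimated/checked with fvcore
    prefill_budget_s: float = 1.0,   # assumed latency target
) -> int:
    # Compute budget across shards, keeping a factor of 2 as headroom.
    available = card_dense_tflops * num_shards * prefill_budget_s
    return int(available / (2 * model_tflops_per_token))
```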
* chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduce the issues (workarounds for now).
* Saving some VRAM.
- 8B on 4xL4 with attention=flashdecoding: before 4.28GB left, after 4.32GB
left, so roughly 40MB saved.
- The effect is not as visible with attention=flashinfer and n_shard=1; I
suspect it's linked to the torch allocator.
* Adding assertion.
* Sync (most) server dependencies with Nix
Skipped most grpcio packages because of a protobuf version
incompatibility with the opentelemetry packages.
* Add a primitive script to generate Poetry commands to sync with Nix
This is not fully automated, since getting the Nix versions may be
unresolvable. However, it does take most of the work out of doing
this manually.
* Upgrade eetq ?
* Fmt.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Llama 3 has a list of values as eos_token_id:
"['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"
This breaks the tokenizer since it expects a single value. This
commit uses tokenizer.eos_token_id instead in such a case.
Fixes: #2440
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
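A minimal sketch of the fallback described above (the helper name is illustrative):
```python
def resolve_eos_token_id(config, tokenizer):
    # Llama 3 style configs can carry a list of eos tokens; downstream code
    # expects a single value, so fall back to tokenizer.eos_token_id.
    eos = getattr(config, "eos_token_id", None)
    return tokenizer.eos_token_id if isinstance(eos, (list, tuple)) else eos
```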
The compressed-tensors configuration can specify the configuration of
the KV cache as well. Use an FP8 KV cache when the configuration tells
us to do so (all other options and types are ignored for now).
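A hedged sketch of that check (field names follow one reading of compressed-tensors configs and are assumptions here):
```python
import torch

def kv_cache_dtype(quantization_config: dict, default: torch.dtype) -> torch.dtype:
    # Use an FP8 KV cache only when the compressed-tensors config asks for
    # an 8-bit float scheme; ignore every other option for now.
    scheme = (quantization_config or {}).get("kv_cache_scheme") or {}
    if scheme.get("type") == "float" and scheme.get("num_bits") == 8:
        return torch.float8_e4m3fn
    return default
```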
* add ipex moe implementation to support Mixtral and PhiMoe
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update to ipex xpu 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* torch has xpu support in 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix oneapi basekit version
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Remove vLLM dependency for CUDA
This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.
Test run (since we don't have paged attention in CI):
```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```
* Fix clippy warning