* feat: support qwen2.5 vl model
* fix: bump support models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
* fix Qwen VL breakage on Intel platforms
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use the PositionRotaryEmbedding impl so that rocm and ipex both work
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
* feat: refactor model, improve startup and re-enable tests
* fix: improve multimodal rotary embed caching
* fix: limit vision flop calc to qwen2 vl models and update config typing
* fix: include clippy lint
* feat: refactor position ids in warmup and bump tests
* fix: prefer default dtype
* fix: enable all cuda graphs and bump snapshots
* fix: adjust rotary init path
* fix: simplify get position ids and remove unused vision config
* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit
* fix: improve position id init during cuda warmup for mrope and simplify rotary forward
* fix: check existence before accessing rope type in cuda warmup
* fix: check key before access
* fix: improve mrope check in cuda graph warmup
* fix: remove check for default rope type
* fix: add more tests and improve model generation
* fix: improve and simplify get_cos_sin, refactors and cleanup get_position_ids
* fix: adjust signatures with types
This version removes our patches/custom API, which makes it simpler to
get changes from upstream. One such change is that we can enable an FP8
KV cache for paged attention as well.
* Upgrade the version number.
* Remove modifications in Lock.
* Tmp branch to test transformers backend with 2.5.1 and TP>1
* Fixing the transformers backend.
inference_mode forces the use of `aten.matmul` instead of `aten.mm`. The
former doesn't have sharding support, which crashes the transformers TP
support.
`lm_head.forward` also crashes because it skips the hook that
casts/decasts the DTensor.
Torch 2.5.1 is required for sharding support.
* Put back the attention impl.
* Revert the flashinfer (this will fail).
* Building AOT.
* Using 2.5 kernels.
* Remove the archlist, it's defined in the docker anyway.
* Fixing TRTLLM dockerfile.
* Fixed.
* Creating a dummy modification to check CI runs.
* Removing the cache directive.
* Modifying this should cache hit.
* Revert "Modifying this should cache hit."
This reverts commit 46a2bde108.
* Modifying this should cache hit.
* Unwanted files.
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* fix the crash of meta-llama/Llama-3.2-1B
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Simpler fix (which doesn't break vlms).
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* flash decoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable xpu flashdecoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set flashdecoding blocksize as 64
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable flashdecoding, prefill chunking and prefix caching
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add flashdecoding-ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: improve star coder to support multi lora layers
* feat: improve weights that support adapters and add tests for starcoder with lora
* fix: bump snapshot for added tests
* fix: rerun pre commit lints
* fix: bump adapter test for added layer names
* Baichuan2-13B does not have max_position_embeddings in config
see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* fmt
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
error like "ValueError: Expecting a ProcessGroup, but got a <class
'text_generation_server.utils.dist.FakeGroup'>. rank=0"
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Basic flashinfer 0.2 support
This change does not use any of the new features yet, but makes
some small compatibility changes.
* Update to flashinfer 0.2.0.post1
* flashinfer: remove `contiguous` calls
* Fix flashinfer install
* flashinfer: fixup kv cache dtype
* Fix some annoying perturbations
* More output changes
* Fix runtime error when Qwen2-VL was prompted with multiple images
Fix runtime error when the Qwen2-VL model is given a prompt with more
than one image. The runtime error was:
File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign
The error was caused by the text_length variable becoming negative
when multiple images caused multiple iterations of the main loop in the
get_position_ids function.
The error is a simple logic mistake: next_image_pos is initialized
as a relative offset from current_pos, but was used as if it were an
absolute position from zero.
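The relative-versus-absolute mix-up described above can be illustrated with
a minimal sketch. This is not the code from qwen2_vl.py: the function name,
the `buggy` switch, and the use of a plain token list instead of tensors are
assumptions made for illustration only.

```python
def text_lengths(tokens, image_token, image_len, buggy=False):
    """Count the text tokens that precede each image in a flat token list.

    Hypothetical helper: `image_len` is the number of placeholder tokens
    each image occupies, and `buggy=True` reproduces the described mistake.
    """
    lengths = []
    current_pos = 0
    while image_token in tokens[current_pos:]:
        # The search runs on the remaining slice, so the index it returns
        # is relative to current_pos, not to the start of the list.
        relative_offset = tokens[current_pos:].index(image_token)
        absolute_pos = current_pos + relative_offset
        # Mistake: treating the relative offset as an absolute position.
        next_image_pos = relative_offset if buggy else absolute_pos
        lengths.append(next_image_pos - current_pos)
        # Always advance correctly so the sketch terminates in both modes.
        current_pos = absolute_pos + image_len
    return lengths


tokens = ["t", "t", "<img>", "<img>", "t", "<img>", "<img>"]
fixed = text_lengths(tokens, "<img>", image_len=2)
broken = text_lengths(tokens, "<img>", image_len=2, buggy=True)
```

With a single image both modes agree, which is why the problem only showed
up with multiple images: the second iteration starts from a non-zero
current_pos, and the unfixed computation yields a negative length that
torch.arange rejects.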
* Fix runtime error when Qwen2-VL was prompted with multiple images
Fix runtime error when the Qwen2-VL model is given a prompt with more
than one image. The runtime error was:
File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]
(The shape numbers in the error message can differ depending on the input
image resolutions.)
The error was caused by adding the wrong number of <|image_pad|> tokens
to the tokenized input in the image_text_replacement function.
The error is a simple logic mistake: the number of image pad tokens was
taken from the length of the first dimension of the pixel_value_shape
tensor. However, pixel_value_shape contains the patches from all of the
images, so the code added the total number of image pad tokens required
for the whole input at each image's location. This resulted in extra
image pad tokens in the tokenized input.
The fix is to take the number of required tokens from the
image_grid_thw tensor, which holds the grid_t, grid_h, and grid_w
values for each image. grid_t * grid_h * grid_w gives the total
number of patches for the image [1]. The number of required image pad
tokens is number_of_patches // 4.
[1] 31f9a289a6/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py (L311)
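The per-image token count described above can be sketched as follows. This
is an illustration under stated assumptions, not the code from
image_text_replacement: the function name is hypothetical, and plain tuples
stand in for the rows of the image_grid_thw tensor.

```python
def image_pad_token_counts(image_grid_thw, patches_per_token=4):
    """Return the number of <|image_pad|> tokens needed for each image.

    `image_grid_thw` holds one (grid_t, grid_h, grid_w) triple per image.
    `patches_per_token=4` follows the `number_of_patches // 4` rule above.
    """
    counts = []
    for grid_t, grid_h, grid_w in image_grid_thw:
        num_patches = grid_t * grid_h * grid_w
        counts.append(num_patches // patches_per_token)
    return counts


# Two images with different (made-up) grid sizes.
grids = [(1, 32, 32), (1, 32, 64)]
counts = image_pad_token_counts(grids)
total = sum(counts)
```

Using the combined patch count for every image would insert `total` pad
tokens at each image location instead of that image's own count, which
matches the shape mismatch reported in the traceback.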
---------
Co-authored-by: Janne Alatalo <janne.alatalo@jamk.fi>
* Using both values from the config, as either might be incorrect.
* Fixing max_position_embeddings for falcon.
* Simple attempt to fix the healthcheck block allocation.
* Much simpler solution.
* Default value for Backend start_health
* Attempt at automatic max batch prefill.
* Taking into account number of shards.
* Adding more cards.
* Adding A100 + H100
* Adding a few more cards.
* Logprobs cost too much.
* h100 better name, and keep factor of 2
* Damn inflated sparse tflops.
* Typo in h100.
* Updated the flops calculation (checked with fvcore).
* chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduce the issues (workarounds for now).
* Saving some VRAM.
- 8B on 4xL4 attention=flashdecoding. Before 4.28GB left, after 4.32GB
left, so about 40MB saved.
- Effect not as visible on attention=flashinfer and n_shard=1. I suspect
it's linked to the torch allocator.
* Adding assertion.
* Sync (most) server dependencies with Nix
Skipped most grpcio packages, because of protobuf version
incompatibility with the opentelemetry packages.
* Add a primitive script to generate Poetry commands to sync with Nix
This is not fully automated, since the versions taken from Nix may be
unresolvable. However, it does take most of the work out of doing
this manually.
* Upgrade eetq ?
* Fmt.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Llama 3 has a list of values as eos_token_id:
"['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"
This breaks the tokenizer, since it expects a single value. This
commit uses tokenizer.eos_token_id instead in such a case.
Fixes: #2440
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
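The eos_token_id fallback described in the commit above can be sketched
roughly as follows. The function name and the example values are
hypothetical; the actual change may be structured differently.

```python
def resolve_eos_token_id(config_eos_token_id, tokenizer_eos_token_id):
    """Pick a single EOS value when the config provides a list.

    Hypothetical helper: if the configured value is a list (as described
    for Llama 3), fall back to the tokenizer's single eos_token_id.
    """
    if isinstance(config_eos_token_id, (list, tuple)):
        return tokenizer_eos_token_id
    return config_eos_token_id


# Illustrative values only, not taken from any real model config.
from_list = resolve_eos_token_id(["<a>", "<b>", "<c>"], "<a>")
from_single = resolve_eos_token_id("<b>", "<a>")
```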
The compressed-tensors configuration can specify the configuration of
the KV cache as well. Use an FP8 KV cache when the configuration tells
us to do so (all other options and types are ignored for now).
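A minimal sketch of this decision follows, using a plain dict for the
quantization config. The key names `kv_cache_scheme`, `type`, and
`num_bits`, the returned string `"fp8"`, and the function name are all
assumptions for illustration; check them against the compressed-tensors
configuration format before relying on them.

```python
def kv_cache_dtype_from_config(quant_config):
    """Return "fp8" when the config asks for an FP8 KV cache, else None.

    Hypothetical helper: every other scheme is ignored, mirroring the
    "all other options and types are ignored for now" note above.
    """
    scheme = quant_config.get("kv_cache_scheme")
    if not scheme:
        return None
    if scheme.get("type") == "float" and scheme.get("num_bits") == 8:
        return "fp8"
    return None


fp8_choice = kv_cache_dtype_from_config(
    {"kv_cache_scheme": {"type": "float", "num_bits": 8}}
)
int8_choice = kv_cache_dtype_from_config(
    {"kv_cache_scheme": {"type": "int", "num_bits": 8}}
)
no_scheme = kv_cache_dtype_from_config({})
```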
* Move JSON grammar -> regex grammar conversion to the router
This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:
simple schema time: [5.8293 µs 5.8307 µs 5.8320 µs]
change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
Performance has improved.
complex schema time: [14.875 µs 14.881 µs 14.887 µs]
change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
Performance has improved.
Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py