text-generation-inference/server
janne-alatalo 7eeefa3b57
Qwen2-VL runtime error fix when prompted with multiple images (#2840)
* Fix runtime error when Qwen2-VL was prompted with multiple images

Fix a runtime error that occurred when the Qwen2-VL model was prompted with a
prompt containing more than one image. The runtime error was:

 File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
    text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign

The error was caused by the text_length variable going negative when multiple
images triggered multiple iterations of the main loop in the get_position_ids
function.

The root cause was a simple logic mistake: next_image_pos is initialized as a
relative offset from current_pos, but was then used as if it were an absolute
position from zero.
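
The off-by-relative-offset mistake can be illustrated with a minimal sketch.
The function below is a hypothetical simplification for illustration, not the
actual TGI code; only the variable names next_image_pos, current_pos, and
text_length are taken from the commit message.

```python
def text_segment_lengths(token_is_image, current_pos=0):
    """Return the number of text tokens preceding each image token.

    Hypothetical simplification of the get_position_ids main loop.
    """
    lengths = []
    while current_pos < len(token_is_image):
        try:
            # Offset of the next image token *relative to* current_pos.
            next_image_pos = token_is_image[current_pos:].index(True)
        except ValueError:
            break
        # Buggy version: text_length = next_image_pos - current_pos
        # That treats the relative offset as an absolute index, so on the
        # second image text_length goes negative and
        # torch.arange(text_length) raises the step-sign RuntimeError.
        text_length = next_image_pos  # already relative -> correct
        lengths.append(text_length)
        current_pos += next_image_pos + 1  # advance past the image token
    return lengths

# Two images: 3 text tokens, an image, 2 text tokens, an image.
tokens = [False, False, False, True, False, False, True]
print(text_segment_lengths(tokens))  # [3, 2]
```

With the buggy formula, the second iteration would compute 2 - 4 = -2,
reproducing the negative length described above.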

* Fix runtime error when Qwen2-VL was prompted with multiple images

Fix a runtime error that occurred when the Qwen2-VL model was prompted with a
prompt containing more than one image. The runtime error was:

File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
    inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]

(The shape values in the error message vary with the input image resolutions.)

The error was caused by inserting the wrong number of <|image_pad|> tokens
into the tokenized input in the image_text_replacement function.

The root cause was a simple logic mistake: the number of image pad tokens was
taken from the first dimension of the pixel_value_shape tensor. However, that
tensor contains the patches from all of the images, so the code inserted the
total number of image pad tokens required for the whole input at each image's
location. This resulted in extra image pad tokens being present in the
tokenized input.

The fix was to compute the number of required tokens from the image_grid_thw
tensor, which holds the grid_t, grid_h, and grid_w values for each image.
grid_t * grid_h * grid_w gives the total number of patches for that image [1],
and the number of required image pad tokens is number_of_patches // 4.

[1] 31f9a289a6/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py (L311)
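
The per-image token count can be sketched as follows. The function name and
the spatial_merge_size parameter are illustrative assumptions; the // 4 in the
fix corresponds to merging patches in 2x2 groups (merge_size squared), as
described above.

```python
def num_image_pad_tokens(image_grid_thw, spatial_merge_size=2):
    """Number of <|image_pad|> tokens needed per image.

    Illustrative sketch: grid_t * grid_h * grid_w is the patch count for
    one image; patches are merged spatial_merge_size**2 (= 4) at a time,
    so each image needs patches // 4 pad tokens. The buggy code instead
    used the total patch count over all images for every image.
    """
    return [
        (grid_t * grid_h * grid_w) // (spatial_merge_size ** 2)
        for (grid_t, grid_h, grid_w) in image_grid_thw
    ]

# Two images with (t, h, w) grids of (1, 16, 16) and (1, 8, 8):
print(num_image_pad_tokens([(1, 16, 16), (1, 8, 8)]))  # [64, 16]
```

Using per-image counts keeps the number of <|image_pad|> tokens equal to the
number of image embeddings, avoiding the shape mismatch in the masked
assignment.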

---------

Co-authored-by: Janne Alatalo <janne.alatalo@jamk.fi>
2024-12-16 22:55:11 -05:00
custom_kernels All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
exllama_kernels Update ROCM libs and improvements (#2579) 2024-09-30 10:54:32 +02:00
exllamav2_kernels Update ROCM libs and improvements (#2579) 2024-09-30 10:54:32 +02:00
tests feat: prefill chunking (#2600) 2024-10-16 12:49:33 +02:00
text_generation_server Qwen2-VL runtime error fix when prompted with multiple images (#2840) 2024-12-16 22:55:11 -05:00
.gitignore Impl simple mamba model (#1480) 2024-02-08 10:19:45 +01:00
bounds-from-nix.py Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
Makefile Remove vLLM dependency for CUDA (#2751) 2024-11-17 17:34:50 +01:00
Makefile-awq chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
Makefile-eetq Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
Makefile-exllamav2 Upgrading exl2. (#2415) 2024-08-14 11:58:08 +02:00
Makefile-flash-att Hotfixing make install. (#2008) 2024-06-04 23:34:03 +02:00
Makefile-flash-att-v2 Update ROCM libs and improvements (#2579) 2024-09-30 10:54:32 +02:00
Makefile-flashinfer Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) 2024-09-11 18:10:40 +02:00
Makefile-lorax-punica Enable multiple LoRa adapters (#2010) 2024-06-25 14:46:27 -04:00
Makefile-selective-scan chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
Makefile-vllm Remove vLLM dependency for CUDA (#2751) 2024-11-17 17:34:50 +01:00
poetry.lock Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
pyproject.toml Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
README.md chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
requirements_cuda.txt Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
requirements_intel.txt Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
requirements_rocm.txt Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00

Text Generation Inference Python gRPC Server

A Python gRPC server for Text Generation Inference

Install

make install

Run

make run-dev