text-generation-inference/server/text_generation_server/layers
Latest commit 84ab88d843 by Daniël de Kok (2025-04-17 18:07:41 +02:00):

Support flashinfer for Gemma3 prefill (#3167)

* launcher: ensure correct detection of Gemma 3 head size

* Support flashinfer for Gemma3 prefill

  Gemma3 uses bidirectional attention for images. Flashinfer supports
  custom masks. Hook the mask up to flashinfer so that we do not have to
  fall back to the slower SDPA implementation for prefills with images.

* Update Gemma3 test outputs

* Fix unused import
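The commit above relies on flashinfer's custom-mask support so that image tokens attend bidirectionally during prefill while text tokens stay causal. As a rough illustration only (a hypothetical helper, not code from this repository or from flashinfer), a mask of that shape could be built like this:

```python
import numpy as np

def build_prefill_mask(is_image: np.ndarray) -> np.ndarray:
    """Sketch of a prefill attention mask: causal for text tokens,
    fully bidirectional within each contiguous run of image tokens.
    Illustrative only; the real implementation lives in the TGI layers."""
    n = is_image.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    # Label contiguous spans so separate images do not attend to each other.
    span_id = np.cumsum(np.concatenate(([0], np.diff(is_image.astype(int)) != 0)))
    for s in np.unique(span_id[is_image]):
        idx = np.where((span_id == s) & is_image)[0]
        mask[np.ix_(idx, idx)] = True  # bidirectional block for this image span
    return mask
```

A flattened version of such a mask is what a flashinfer prefill wrapper would consume in place of the default causal pattern.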
Name | Last commit | Date
attention/ | Support flashinfer for Gemma3 prefill (#3167) | 2025-04-17 18:07:41 +02:00
awq/ | fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… (#2717) | 2024-11-04 16:07:51 +01:00
compressed_tensors/ | Use kernels from the kernel hub (#2988) | 2025-02-10 19:19:25 +01:00
gptq/ | Small test and typing fixes (#3078) | 2025-03-10 15:08:23 +01:00
marlin/ | Use kernels from the kernel hub (#2988) | 2025-02-10 19:19:25 +01:00
moe/ | some minor fix (#3048) | 2025-02-25 12:07:55 +01:00
__init__.py | feat: add ruff and resolve issue (#2262) | 2024-07-26 10:29:09 -04:00
bnb.py | feat: add ruff and resolve issue (#2262) | 2024-07-26 10:29:09 -04:00
conv.py | Refactor layers. (#1866) | 2024-05-13 12:44:30 +02:00
eetq.py | Use eetq kernel from the hub (#3029) | 2025-02-18 10:03:53 +01:00
exl2.py | Add support for Deepseek V2 (#2224) | 2024-07-19 17:23:20 +02:00
fp8.py | Use kernels from the kernel hub (#2988) | 2025-02-10 19:19:25 +01:00
layernorm.py | Update vllm kernels for ROCM (#2826) | 2024-12-18 12:44:42 +01:00
linear.py | Update vllm kernels for ROCM (#2826) | 2024-12-18 12:44:42 +01:00
lora.py | feat: add ruff and resolve issue (#2262) | 2024-07-26 10:29:09 -04:00
medusa.py | Prefix caching (#2402) | 2024-08-20 11:15:30 +02:00
mlp.py | Tied embeddings in MLP speculator. (#2473) | 2024-08-29 17:44:54 +02:00
rotary.py | Use rotary kernel from the Hub (#3041) | 2025-02-21 13:55:31 +01:00
speculative.py | feat: add ruff and resolve issue (#2262) | 2024-07-26 10:29:09 -04:00
tensor_parallel.py | feat: add ruff and resolve issue (#2262) | 2024-07-26 10:29:09 -04:00