text-generation-inference/server/text_generation_server/layers/attention
Daniël de Kok c6071749db
Fix mask passed to flashinfer (#3324)
Custom masks are padded to the shape `[batch_size, max_len, max_len]`.
However, flashinfer expects an unpadded mask of the shape
`[sum(q_len[i] * k_len[i] for i in range(batch_size))]`.

This change unpads the custom mask (currently only used by Gemma 3)
to this shape (assuming q_len == k_len, since we only use the custom
mask during prefill).
2025-09-08 13:47:03 -04:00
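For illustration, a minimal sketch of how such a padded custom mask could be flattened before being passed to flashinfer. This is not the actual TGI code; the helper name `unpad_custom_mask` and the `input_lengths` argument are hypothetical, and it assumes q_len == k_len (prefill), as the commit message states.

```python
import torch


def unpad_custom_mask(mask: torch.Tensor, input_lengths: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, not TGI's implementation.
    # mask: padded custom mask of shape [batch_size, max_len, max_len].
    # input_lengths: per-sequence lengths, shape [batch_size].
    # During prefill q_len == k_len, so sequence i contributes
    # len_i * len_i entries; the result is a 1D tensor of length
    # sum(len_i * len_i for all sequences in the batch).
    pieces = [
        mask[i, :length, :length].reshape(-1)
        for i, length in enumerate(input_lengths.tolist())
    ]
    return torch.cat(pieces)


# Example: two prefill sequences with lengths 3 and 2.
# mask = torch.ones(2, 3, 3, dtype=torch.bool)
# lengths = torch.tensor([3, 2])
# unpad_custom_mask(mask, lengths).shape  # torch.Size([13])  (3*3 + 2*2)
```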
__init__.py             Add support for FP8 KV cache scales (#2628)              2024-10-24 16:36:18 +02:00
common.py               feat: prefill chunking (#2600)                           2024-10-16 12:49:33 +02:00
cuda.py                 Bug Fix: Sliding Window Attention (#3112)                2025-03-18 10:37:33 +01:00
flash_attn_triton.py    feat: prefill chunking (#2600)                           2024-10-16 12:49:33 +02:00
flashinfer.py           Fix mask passed to flashinfer (#3324)                    2025-09-08 13:47:03 -04:00
ipex.py                 IPEX support FP8 kvcache/softcap/slidingwindow (#3144)   2025-05-06 10:49:24 +02:00
kv_cache.py             IPEX support FP8 kvcache/softcap/slidingwindow (#3144)   2025-05-06 10:49:24 +02:00
rocm.py                 Bug Fix: Sliding Window Attention (#3112)                2025-03-18 10:37:33 +01:00