text-generation-inference/server/text_generation_server/layers/attention
Daniël de Kok c6071749db
Fix mask passed to flashinfer (#3324)
Custom masks are padded to the shape `[batch_size, max_len, max_len]`.
However, flashinfer expects an unpadded mask of the shape
`[sum(q_len[i] * k_len[i] for i in range(batch_size))]`.

This change unpads the custom mask (currently only used by Gemma 3)
to this shape (assuming q_len == k_len, since we only use the custom
mask during prefill).
2025-09-08 13:47:03 -04:00
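For illustration, a minimal sketch of how such a padded custom mask could be flattened before being passed to flashinfer. This is not the actual TGI code; the helper name `unpad_custom_mask` and the `input_lengths` argument are hypothetical, and it assumes q_len == k_len (prefill), as the commit message states.

```python
import torch


def unpad_custom_mask(mask: torch.Tensor, input_lengths: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, not TGI's implementation.
    # mask: padded custom mask of shape [batch_size, max_len, max_len].
    # input_lengths: per-sequence lengths, shape [batch_size].
    # During prefill q_len == k_len, so sequence i contributes
    # len_i * len_i entries; the result is a 1D tensor of length
    # sum(len_i * len_i for all sequences in the batch).
    pieces = [
        mask[i, :length, :length].reshape(-1)
        for i, length in enumerate(input_lengths.tolist())
    ]
    return torch.cat(pieces)


# Example: two prefill sequences with lengths 3 and 2.
# mask = torch.ones(2, 3, 3, dtype=torch.bool)
# lengths = torch.tensor([3, 2])
# unpad_custom_mask(mask, lengths).shape  # torch.Size([13])  (3*3 + 2*2)
```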
__init__.py             Add support for FP8 KV cache scales (#2628)              2024-10-24 16:36:18 +02:00
common.py               feat: prefill chunking (#2600)                           2024-10-16 12:49:33 +02:00
cuda.py                 Bug Fix: Sliding Window Attention (#3112)                2025-03-18 10:37:33 +01:00
flash_attn_triton.py    feat: prefill chunking (#2600)                           2024-10-16 12:49:33 +02:00
flashinfer.py           Fix mask passed to flashinfer (#3324)                    2025-09-08 13:47:03 -04:00
ipex.py                 IPEX support FP8 kvcache/softcap/slidingwindow (#3144)   2025-05-06 10:49:24 +02:00
kv_cache.py             IPEX support FP8 kvcache/softcap/slidingwindow (#3144)   2025-05-06 10:49:24 +02:00
rocm.py                 Bug Fix: Sliding Window Attention (#3112)                2025-03-18 10:37:33 +01:00