text-generation-inference/server/text_generation_server/models
Latest commit: 6652d6e6e0 "Support flashinfer for Gemma3 prefill" by Daniël de Kok, 2025-04-11 18:20:54 +00:00

    Gemma3 uses bidirectional attention for images. Flashinfer supports
    custom masks. Hook up the mask with flashinfer, so that we do not
    have to use the slower SDPA implementation for prefills with images.
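As a rough illustration of what the commit describes, below is a minimal sketch of the mask construction: text tokens keep causal attention while image tokens attend to each other bidirectionally. The helper name `gemma3_prefill_mask` and the treatment of all image tokens as a single block are assumptions for illustration, not the actual TGI code; flashinfer can consume such a boolean mask through the `custom_mask` argument of its prefill APIs.

```python
# Hypothetical sketch, not the actual TGI implementation.
import torch


def gemma3_prefill_mask(image_token_mask: torch.Tensor) -> torch.Tensor:
    """Build a (seq_len, seq_len) boolean attention mask.

    image_token_mask: bool tensor of shape (seq_len,), True where the
    input id is an image token. True in the result means "may attend".
    """
    seq_len = image_token_mask.numel()
    # Standard causal mask: query position i may attend to keys j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Bidirectional block: if both query and key positions are image
    # tokens, allow attention in both directions. (The real model limits
    # this to tokens of the same image; this sketch treats all image
    # tokens as one block for brevity.)
    bidirectional = image_token_mask[:, None] & image_token_mask[None, :]
    return causal | bidirectional


if __name__ == "__main__":
    # Toy sequence: 2 text tokens, 3 image tokens, 2 text tokens.
    image_token_mask = torch.tensor([0, 0, 1, 1, 1, 0, 0], dtype=torch.bool)
    mask = gemma3_prefill_mask(image_token_mask)
    assert mask[2, 4]      # an image token sees a *later* image token
    assert not mask[1, 4]  # a text token still cannot look ahead
    print(mask.int())
```

For a text-only prefill the bidirectional term is empty and the result reduces to the plain causal mask, which is why only prefills containing images need the custom-mask path.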
Name                             Last commit message                                                         Last commit date
custom_modeling/                 Support flashinfer for Gemma3 prefill                                       2025-04-11 18:20:54 +00:00
__init__.py                      Add llama4 (#3145)                                                          2025-04-06 10:20:22 +02:00
bloom.py                         Refactor dead code - Removing all flash_xxx.py files. (#2166)              2024-07-05 10:29:56 +02:00
causal_lm.py                     Sync (most) server dependencies with Nix (#2782)                            2024-12-03 04:04:06 +01:00
flash_causal_lm.py               Support flashinfer for Gemma3 prefill                                       2025-04-11 18:20:54 +00:00
galactica.py                     feat: add ruff and resolve issue (#2262)                                    2024-07-26 10:29:09 -04:00
globals.py                       Fixing the oom maybe with 2.5.1 change. (#2958)                             2025-01-28 10:30:28 +01:00
idefics_causal_lm.py             feat: prefill chunking (#2600)                                              2024-10-16 12:49:33 +02:00
mamba.py                         Choosing input/total tokens automatically based on available VRAM? (#2673) 2024-10-28 04:59:49 +01:00
metadata_kernels.py              feat: add payload limit (#2726)                                             2024-11-21 18:20:15 +00:00
mllama_causal_lm.py              Update transformers to 4.51 (#3148)                                         2025-04-07 12:55:43 +02:00
model.py                         Bug Fix: Sliding Window Attention (#3112)                                   2025-03-18 10:37:33 +01:00
pali_gemma.py                    feat: add ruff and resolve issue (#2262)                                    2024-07-26 10:29:09 -04:00
seq2seq_lm.py                    feat: prefill chunking (#2600)                                              2024-10-16 12:49:33 +02:00
transformers_flash_causal_lm.py  Add llama4 (#3145)                                                          2025-04-06 10:20:22 +02:00
transformers_flash_vlm.py        Add llama4 (#3145)                                                          2025-04-06 10:20:22 +02:00
types.py                         feat: prefill chunking (#2600)                                              2024-10-16 12:49:33 +02:00
vlm_causal_lm.py                 Support flashinfer for Gemma3 prefill                                       2025-04-11 18:20:54 +00:00