text-generation-inference/integration-tests/models
Latest commit: 84ab88d843 by Daniël de Kok
Support flashinfer for Gemma3 prefill (#3167)
* launcher: ensure correct detection of Gemma 3 head size

* Support flashinfer for Gemma3 prefill

Gemma3 uses bidirectional attention for image tokens. Since flashinfer
supports custom attention masks, hook the mask up with flashinfer so that
we do not have to fall back to the slower SDPA implementation for prefills
that contain images.

* Update Gemma3 test outputs

* Remove unused import
2025-04-17 18:07:41 +02:00
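The masking idea behind this commit can be sketched in plain Python. `build_prefill_mask` below is a hypothetical illustration (not flashinfer's API and not TGI's actual code): text tokens keep the usual causal mask, while tokens belonging to the same contiguous image block may additionally attend to each other in both directions.

```python
def build_prefill_mask(is_image):
    """Boolean attention mask for a prefill of length n.

    mask[q][k] is True when query position q may attend to key
    position k. Text tokens attend causally; tokens inside the same
    contiguous image block also attend to each other bidirectionally.
    Hypothetical helper for illustration only.
    """
    n = len(is_image)
    # Start from a standard causal (lower-triangular) mask.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    # Assign each position a contiguous image-block id (0 = text),
    # so that two separate images do not attend to each other.
    block = [0] * n
    current = 0
    for i in range(n):
        if is_image[i]:
            if i == 0 or not is_image[i - 1]:
                current += 1
            block[i] = current
    # Within an image block, allow full bidirectional attention.
    for q in range(n):
        for k in range(n):
            if block[q] != 0 and block[q] == block[k]:
                mask[q][k] = True
    return mask
```

A flattened version of such a mask is what a backend supporting custom masks (like flashinfer) can consume directly, avoiding the fallback to a slower generic SDPA path for image prefills.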
| Name | Last commit | Date |
|------|-------------|------|
| __snapshots__ | Support flashinfer for Gemma3 prefill (#3167) | 2025-04-17 18:07:41 +02:00 |
| test_bloom_560m_sharded.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_bloom_560m.py | Avoiding timeout for bloom tests. (#2693) | 2024-10-26 05:35:28 +02:00 |
| test_chat_llama.py | Lots of improvements (Still 2 allocators) (#2449) | 2024-08-29 16:29:01 +02:00 |
| test_chat_stream_options.py | Pr 3003 ci branch (#3007) | 2025-03-10 17:56:19 +01:00 |
| test_completion_prompts.py | Pr 3003 ci branch (#3007) | 2025-03-10 17:56:19 +01:00 |
| test_compressed_tensors_w8a8_int_dynamic_weight.py | Improve qwen vl impl (#2943) | 2025-02-04 12:44:18 -05:00 |
| test_compressed_tensors_w8a8_int.py | Add support for compressed-tensors w8a8 int checkpoints (#2745) | 2024-11-18 17:20:31 +01:00 |
| test_compressed_tensors_w8an_fp.py | Add initial support for compressed-tensors checkpoints (#2732) | 2024-11-10 13:54:07 +01:00 |
| test_compressed_tensors_wna16_int_24.py | Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758) | 2024-11-20 18:25:23 +01:00 |
| test_compressed_tensors_wna16_int.py | Add initial support for compressed-tensors checkpoints (#2732) | 2024-11-10 13:54:07 +01:00 |
| test_continue_final_message.py | Support continue final message (#2733) | 2024-11-27 19:13:30 -05:00 |
| test_flash_awq_sharded.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_awq.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_deepseek_v2.py | Add support for Deepseek V2 (#2224) | 2024-07-19 17:23:20 +02:00 |
| test_flash_falcon.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_gemma2.py | Softcapping for gemma2. (#2273) | 2024-07-22 18:27:10 +02:00 |
| test_flash_gemma3.py | Bug Fix: Sliding Window Attention (#3112) | 2025-03-18 10:37:33 +01:00 |
| test_flash_gemma_gptq.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_gemma.py | Intel ci (#2630) | 2024-10-10 16:51:57 +02:00 |
| test_flash_gpt2.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_grammar_llama.py | fix: correctly index into mask when applying grammar (#1618) | 2024-03-01 18:22:01 +01:00 |
| test_flash_llama_exl2.py | Fixing exl2 and other quanize tests again. (#2419) | 2024-08-15 11:12:51 +02:00 |
| test_flash_llama_fp8_kv_cache.py | Add support for FP8 KV cache scales (#2628) | 2024-10-24 16:36:18 +02:00 |
| test_flash_llama_fp8.py | Further fixes. (#2426) | 2024-08-16 13:21:44 +02:00 |
| test_flash_llama_gptq.py | GPTQ CI improvements (#2151) | 2024-07-05 14:12:16 +02:00 |
| test_flash_llama_marlin_24.py | Improve the handling of quantized weights (#2250) | 2024-07-19 09:37:39 +02:00 |
| test_flash_llama_marlin.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_llama_prefix_flashdecoding.py | Attempt for cleverer auto batch_prefill values (some simplifications). (#2808) | 2024-12-09 19:44:32 +01:00 |
| test_flash_llama_prefix.py | Attempt for cleverer auto batch_prefill values (some simplifications). (#2808) | 2024-12-09 19:44:32 +01:00 |
| test_flash_llama.py | Intel ci (#2630) | 2024-10-10 16:51:57 +02:00 |
| test_flash_medusa.py | Revamp medusa implementation so that every model can benefit. (#1588) | 2024-02-26 19:49:28 +01:00 |
| test_flash_mistral.py | fix(router): fix openapi and add jsonschema validation (#1578) | 2024-02-21 11:05:32 +01:00 |
| test_flash_mixtral_awq.py | Add support for fused MoE Marlin for AWQ (#2616) | 2024-10-08 11:56:41 +02:00 |
| test_flash_mixtral_gptq.py | Test Marlin MoE with desc_act=true (#2622) | 2024-10-21 12:50:35 +02:00 |
| test_flash_mixtral.py | Add tests for Mixtral (#2520) | 2024-09-16 12:39:18 +02:00 |
| test_flash_neox_sharded.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_neox.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_pali_gemma2.py | Enable paligemma2 (#2807) | 2024-12-06 14:41:49 -05:00 |
| test_flash_pali_gemma.py | feat: prefill chunking (#2600) | 2024-10-16 12:49:33 +02:00 |
| test_flash_phi35_moe.py | Attempt for cleverer auto batch_prefill values (some simplifications). (#2808) | 2024-12-09 19:44:32 +01:00 |
| test_flash_phi.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_qwen2_5_vl.py | feat: add initial qwen2.5-vl model and test (#2971) | 2025-02-19 12:38:20 +01:00 |
| test_flash_qwen2_vl.py | Improve qwen vl impl (#2943) | 2025-02-04 12:44:18 -05:00 |
| test_flash_qwen2.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_santacoder.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_starcoder2_lora.py | feat: improve star coder to support multi lora layers (#2883) | 2025-01-16 16:23:55 -05:00 |
| test_flash_starcoder2.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_flash_starcoder_gptq.py | Prepare for release 3.1.0 (#2972) | 2025-01-31 14:19:01 +01:00 |
| test_flash_starcoder.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_grammar_llama.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_grammar_response_format_llama.py | Move JSON grammar -> regex grammar conversion to the router (#2772) | 2024-11-25 18:47:34 +01:00 |
| test_idefics2.py | feat: prefill chunking (#2600) | 2024-10-16 12:49:33 +02:00 |
| test_idefics3.py | Improve vlm support (add idefics3 support) (#2437) | 2025-01-09 10:35:32 -05:00 |
| test_idefics.py | feat: prefill chunking (#2600) | 2024-10-16 12:49:33 +02:00 |
| test_llava_next.py | feat: prefill chunking (#2600) | 2024-10-16 12:49:33 +02:00 |
| test_lora_mistral.py | feat: simple mistral lora integration tests (#2180) | 2024-07-15 09:16:15 -04:00 |
| test_mamba.py | We can have a tokenizer anywhere. (#2527) | 2024-10-28 05:00:24 +01:00 |
| test_mllama.py | Update the flaky mllama test. (#3015) | 2025-02-12 12:26:52 +01:00 |
| test_mpt.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_mt0_base.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_neox_sharded.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_neox.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_opt.py | Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371) | 2024-08-07 23:14:02 -04:00 |
| test_smolvlm.py | Improve vlm support (add idefics3 support) (#2437) | 2025-01-09 10:35:32 -05:00 |
| test_t5_sharded.py | Add pytest release marker (#2114) | 2024-06-25 16:53:20 +02:00 |
| test_tools_llama.py | Fix tool call4 (#3094) | 2025-03-12 09:28:47 +01:00 |
| test_transformers_llama4.py | Update transformers to 4.51 (#3148) | 2025-04-07 12:55:43 +02:00 |
| test_transformers_olmo.py | Making sure Olmo (transformers backend) works. (#3074) | 2025-03-05 17:46:47 +01:00 |