text-generation-inference/server
Daniël de Kok 84ab88d843
Support flashinfer for Gemma3 prefill (#3167)
* launcher: ensure correct detection of Gemma 3 head size

* Support flashinfer for Gemma3 prefill

Gemma3 uses bidirectional attention for images. Flashinfer
supports custom masks. Hook up the mask with flashinfer, so that we do
not have to use the slower SDPA implementation for prefills with images.

* Update Gemma3 test outputs

* Fixed unused import
2025-04-17 18:07:41 +02:00
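
The commit message above says that Gemma3 attends bidirectionally over image tokens and that a custom mask is passed to flashinfer for prefill. The sketch below only illustrates what such a mask could look like: causal everywhere, with full attention inside each contiguous run of image tokens. The function name `build_prefill_mask`, the `is_image` flag list, and the rule that one contiguous run equals one image are assumptions made for illustration; this is not the code from the commit and it does not call flashinfer.

```python
from typing import List


def build_prefill_mask(is_image: List[bool]) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key k.

    Hypothetical helper: text tokens attend causally, while tokens inside the
    same contiguous run of image tokens attend to each other in both directions.
    """
    n = len(is_image)

    # Give each contiguous run of image tokens its own span id; text gets -1.
    # Assumption: a contiguous run corresponds to exactly one image.
    span_ids: List[int] = []
    current = -1
    prev_was_image = False
    for flag in is_image:
        if flag and not prev_was_image:
            current += 1
        span_ids.append(current if flag else -1)
        prev_was_image = flag

    # Start from a standard causal (lower-triangular) mask.
    mask = [[k <= q for k in range(n)] for q in range(n)]

    # Open up attention in both directions within each image span.
    for q in range(n):
        for k in range(n):
            if span_ids[q] != -1 and span_ids[q] == span_ids[k]:
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    # One text token, two image tokens, one text token.
    for row in build_prefill_mask([False, True, True, False]):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch an image token can see later tokens of the same image but never the text that follows it, so text generation stays causal.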
custom_kernels All integration tests back everywhere (too many failed CI). (#2428) 2024-08-16 21:19:46 +02:00
exllama_kernels Update ROCM libs and improvements (#2579) 2024-09-30 10:54:32 +02:00
exllamav2_kernels Update ROCM libs and improvements (#2579) 2024-09-30 10:54:32 +02:00
tests Small test and typing fixes (#3078) 2025-03-10 15:08:23 +01:00
text_generation_server Support flashinfer for Gemma3 prefill (#3167) 2025-04-17 18:07:41 +02:00
.gitignore Impl simple mamba model (#1480) 2024-02-08 10:19:45 +01:00
bounds-from-nix.py Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
kernels.lock Update to kernels 0.2.1 (#3084) 2025-03-13 10:36:29 +01:00
Makefile Update to kernels 0.2.1 (#3084) 2025-03-13 10:36:29 +01:00
Makefile-awq chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
Makefile-eetq Putting back the NCCL forced upgrade. (#2999) 2025-02-14 11:31:59 +01:00
Makefile-exllamav2 Upgrading exl2. (#2415) 2024-08-14 11:58:08 +02:00
Makefile-flash-att Putting back the NCCL forced upgrade. (#2999) 2025-02-14 11:31:59 +01:00
Makefile-flash-att-v2 Add Flash decoding kernel ROCm (#2855) 2025-01-13 11:12:35 +01:00
Makefile-flashinfer flashinfer 0.2.0.post1 -> post2 (#3040) 2025-02-20 12:34:20 +01:00
Makefile-lorax-punica Enable multiple LoRa adapters (#2010) 2024-06-25 14:46:27 -04:00
Makefile-selective-scan chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
Makefile-vllm Use ROCM 6.3.1 (#3141) 2025-04-07 12:55:11 +02:00
pyproject.toml Update transformers to 4.51 (#3148) 2025-04-07 12:55:43 +02:00
README.md chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
req.txt Using the "lockfile". (#2992) 2025-02-06 12:28:24 +01:00
requirements_cuda.txt Improve Transformers support (#2970) 2025-02-18 19:04:34 +01:00
requirements_gen.txt Improve Transformers support (#2970) 2025-02-18 19:04:34 +01:00
requirements_intel.txt Improve Transformers support (#2970) 2025-02-18 19:04:34 +01:00
requirements_rocm.txt Improve Transformers support (#2970) 2025-02-18 19:04:34 +01:00
uv.lock Update transformers to 4.51 (#3148) 2025-04-07 12:55:43 +02:00

Text Generation Inference Python gRPC Server

A Python gRPC server for Text Generation Inference

Install

make install

Run

make run-dev