text-generation-inference/server/text_generation_server/models
Daniël de Kok f1f28404e7 Add support for GPTQ Marlin (#2052)
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations (a compatibility check is sketched after the list):

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false
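
Configurations outside this set would presumably fall back to the regular
GPTQ path. A minimal sketch of that gate, under the assumption of a
hypothetical helper (`can_use_gptq_marlin` is illustrative, not the
vendored API):

```python
SUPPORTED_BITS = (4, 8)
SUPPORTED_GROUPSIZES = (-1, 32, 64, 128)


def can_use_gptq_marlin(bits: int, groupsize: int, desc_act: bool) -> bool:
    # Mirrors the configurations listed above. Both desc_act values are
    # supported, so only bits and groupsize gate the Marlin fast path.
    return bits in SUPPORTED_BITS and groupsize in SUPPORTED_GROUPSIZES
```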

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.
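
Conceptually, repacking undoes GPTQ's bit-packing and then lays the values
out in Marlin's tile format; the tile rearrangement itself happens inside
the vendored CUDA kernels. Below is a hedged sketch of just the unpacking
half, assuming the standard GPTQ layout of (32 // bits) values packed per
int32 word along the input dimension:

```python
import torch


def unpack_gptq_qweight(qweight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # qweight has shape (in_features // pack_factor, out_features), dtype
    # int32, with pack_factor quantized values per word, low bits first.
    pack_factor = 32 // bits
    shifts = torch.arange(0, 32, bits, dtype=torch.int32, device=qweight.device)
    # Broadcast-shift each word by every bit offset, then mask off all but
    # the low `bits` bits to recover the individual quantized values.
    vals = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & ((1 << bits) - 1)
    return vals.reshape(qweight.shape[0] * pack_factor, qweight.shape[1])
```

The Marlin-format packing proper is then performed by the vendored kernels,
which is why loading a GPTQ checkpoint through this path involves a
one-time repacking step.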

The kernels were contributed by Neural Magic to vLLM. We vendor them
here for convenience.
2024-09-24 03:43:30 +00:00
custom_modeling Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
__init__.py ROCm and sliding windows fixes (#2033) 2024-09-24 03:42:29 +00:00
bloom.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
causal_lm.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
flash_causal_lm.py ROCm and sliding windows fixes (#2033) 2024-09-24 03:42:29 +00:00
flash_cohere.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_dbrx.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_gemma.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_gpt2.py Add support for Marlin-quantized models 2024-09-24 03:38:05 +00:00
flash_llama.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_mistral.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_mixtral.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
flash_neox.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_phi.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_qwen2.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_rw.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_santacoder.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_starcoder2.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
galactica.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
globals.py Purely refactors paged/attention into layers/attention and makes hardware differences more obvious with one file per hardware backend. (#1986) 2024-09-24 03:19:39 +00:00
gpt_neox.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
idefics2.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
idefics_causal_lm.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
idefics.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
llava_next.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
mamba.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
model.py Align the source code with main branch 2.0.4 2024-09-24 03:06:55 +00:00
mpt.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
opt.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
pali_gemma.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
phi.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
rw.py fix(server): fix OPT implementation (#2061) 2024-09-24 03:42:29 +00:00
santacoder.py Align the source code with main branch 2.0.4 2024-09-24 03:06:55 +00:00
seq2seq_lm.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
t5.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
types.py chore: add pre-commit (#1569) 2024-04-24 15:32:02 +03:00
vlm_causal_lm.py server: use chunked inputs 2024-09-24 03:42:29 +00:00