text-generation-inference/server/text_generation_server/models
Daniël de Kok f1f28404e7 Add support for GPTQ Marlin (#2052)
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations (a compatibility check is sketched after the list):

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false
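
Configurations outside this set would presumably fall back to the regular
GPTQ path. A minimal sketch of that gate, under the assumption of a
hypothetical helper (`can_use_gptq_marlin` is illustrative, not the
vendored API):

```python
SUPPORTED_BITS = (4, 8)
SUPPORTED_GROUPSIZES = (-1, 32, 64, 128)


def can_use_gptq_marlin(bits: int, groupsize: int, desc_act: bool) -> bool:
    # Mirrors the configurations listed above. Both desc_act values are
    # supported, so only bits and groupsize gate the Marlin fast path.
    return bits in SUPPORTED_BITS and groupsize in SUPPORTED_GROUPSIZES
```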

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.
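
Conceptually, repacking undoes GPTQ's bit-packing and then lays the values
out in Marlin's tile format; the tile rearrangement itself happens inside
the vendored CUDA kernels. Below is a hedged sketch of just the unpacking
half, assuming the standard GPTQ layout of (32 // bits) values packed per
int32 word along the input dimension:

```python
import torch


def unpack_gptq_qweight(qweight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # qweight has shape (in_features // pack_factor, out_features), dtype
    # int32, with pack_factor quantized values per word, low bits first.
    pack_factor = 32 // bits
    shifts = torch.arange(0, 32, bits, dtype=torch.int32, device=qweight.device)
    # Broadcast-shift each word by every bit offset, then mask off all but
    # the low `bits` bits to recover the individual quantized values.
    vals = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & ((1 << bits) - 1)
    return vals.reshape(qweight.shape[0] * pack_factor, qweight.shape[1])
```

The Marlin-format packing proper is then performed by the vendored kernels,
which is why loading a GPTQ checkpoint through this path involves a
one-time repacking step.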

The kernels were contributed by Neural Magic to vLLM. We vendor them
here for convenience.
2024-09-24 03:43:30 +00:00
custom_modeling Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
__init__.py ROCm and sliding windows fixes (#2033) 2024-09-24 03:42:29 +00:00
bloom.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
causal_lm.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
flash_causal_lm.py ROCm and sliding windows fixes (#2033) 2024-09-24 03:42:29 +00:00
flash_cohere.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_dbrx.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_gemma.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_gpt2.py Add support for Marlin-quantized models 2024-09-24 03:38:05 +00:00
flash_llama.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_mistral.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_mixtral.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
flash_neox.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_phi.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_qwen2.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_rw.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_santacoder.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
flash_starcoder2.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
galactica.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
globals.py Purely refactors paged/attention into layers/attention and makes hardware differences more obvious with one file per hardware backend. (#1986) 2024-09-24 03:19:39 +00:00
gpt_neox.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
idefics2.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
idefics_causal_lm.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
idefics.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
llava_next.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
mamba.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
model.py Align the source code with main branch 2.0.4 2024-09-24 03:06:55 +00:00
mpt.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
opt.py Add support for GPTQ Marlin (#2052) 2024-09-24 03:43:30 +00:00
pali_gemma.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
phi.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
rw.py fix(server): fix OPT implementation (#2061) 2024-09-24 03:42:29 +00:00
santacoder.py Align the source code with main branch 2.0.4 2024-09-24 03:06:55 +00:00
seq2seq_lm.py server: use chunked inputs 2024-09-24 03:42:29 +00:00
t5.py MLPSpeculator. (#1865) 2024-07-17 05:36:58 +00:00
types.py chore: add pre-commit (#1569) 2024-04-24 15:32:02 +03:00
vlm_causal_lm.py server: use chunked inputs 2024-09-24 03:42:29 +00:00