Commit Graph

3 Commits

Author SHA1 Message Date
fxmarty
b452620c04 fix gptq tests, LLMM1 matrix bound 2024-06-11 07:27:14 +00:00
fxmarty
de6f2cd08d disable marlin tests on rocm/xpu 2024-06-10 13:06:11 +00:00
Daniël de Kok
4594e6faba Add support for Marlin-quantized models
This change adds support for Marlin-quantized models. Marlin is an
FP16xINT4 matmul kernel, which provides good speedups when decoding
batches of 16-32 tokens. It supports 4-bit quantized models that use
symmetric quantization and a groupsize of -1 or 128.

Tested with:

- Llama 2
- Llama 3
- Phi 3
2024-06-06 13:16:52 +02:00
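
The constraints stated in the message of 4594e6faba (4-bit, symmetric quantization, groupsize -1 or 128) can be written down as a small compatibility check. The following is only a sketch under those stated assumptions: `QuantSettings` and `marlin_incompatibilities` are hypothetical names chosen for illustration and are not taken from the commit or from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantSettings:
    """Hypothetical container for a checkpoint's quantization settings."""

    bits: int
    groupsize: int
    symmetric: bool


def marlin_incompatibilities(settings: QuantSettings) -> list[str]:
    """Return the reasons the settings fall outside the limits named in
    the commit message; an empty list means no conflict was found."""
    problems = []
    if settings.bits != 4:
        problems.append(f"bits={settings.bits}, expected 4")
    if settings.groupsize not in (-1, 128):
        problems.append(f"groupsize={settings.groupsize}, expected -1 or 128")
    if not settings.symmetric:
        problems.append("asymmetric quantization, expected symmetric")
    return problems


if __name__ == "__main__":
    # One configuration inside the stated limits, one outside them.
    print(marlin_incompatibilities(QuantSettings(bits=4, groupsize=128, symmetric=True)))
    print(marlin_incompatibilities(QuantSettings(bits=8, groupsize=64, symmetric=False)))
```

Returning a list of reasons rather than a single boolean makes it easier to report every mismatch at once when a checkpoint cannot use the kernel.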