This change adds support for Marlin-quantized models. Marlin is an FP16xINT4 matmul kernel that provides good speedups when decoding batches of 16-32 tokens. It supports 4-bit quantized models with symmetric quantization and group size -1 or 128. A sketch of the quantization scheme follows the list below.

Tested with:

- Llama 2
- Llama 3
- Phi 3
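The description does not spell out what "symmetric quantization with group size -1 or 128" means in practice, so here is a minimal, hypothetical PyTorch sketch of that scheme. The function names are illustrative only, not TGI's or Marlin's actual API; the key assumptions are that symmetric quantization uses no zero-point, that each group of weights along the input dimension shares one FP16 scale, and that group size -1 means one group spanning the whole row.

```python
# Illustrative sketch of symmetric 4-bit group quantization, the weight
# format consumed by Marlin-style FP16xINT4 kernels. Hypothetical helper
# names; not part of TGI or the Marlin kernel itself.
import torch


def quantize_symmetric_int4(w: torch.Tensor, group_size: int = 128):
    """Quantize an FP16 weight matrix to signed INT4 with per-group scales.

    Symmetric quantization has no zero-point: q = round(w / scale),
    clamped to the signed 4-bit range [-8, 7]. group_size == -1 means
    a single group (one scale) per row.
    """
    out_features, in_features = w.shape
    if group_size == -1:
        group_size = in_features
    assert in_features % group_size == 0

    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Per-group scale chosen so the largest magnitude maps to 7;
    # clamp avoids division by zero for all-zero groups.
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)


def dequantize_int4(q: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    """Reconstruct FP16 weights from INT4 values and per-group scales."""
    out_features, in_features = q.shape
    if group_size == -1:
        group_size = in_features
    groups = q.reshape(out_features, in_features // group_size, group_size)
    w = groups.float() * scales.unsqueeze(-1)
    return w.reshape(out_features, in_features).half()


# Round-trip check: quantize, dequantize, and inspect the error.
w = torch.randn(256, 512, dtype=torch.float16)
q, s = quantize_symmetric_int4(w, group_size=128)
w_hat = dequantize_int4(q, s, group_size=128)
print("max abs error:", (w.float() - w_hat.float()).abs().max().item())
```

The real kernel never materializes the dequantized matrix as above; it fuses the `scale * int4` reconstruction into the FP16 matmul, which is where the decoding speedup comes from.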