This change adds support for Marlin-quantized models. Marlin is an FP16xINT4 matmul kernel that provides good speedups when decoding batches of 16-32 tokens. It supports 4-bit quantized models with symmetric quantization and a group size of -1 or 128.

Tested with:

- Llama 2
- Llama 3
- Phi 3
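For intuition, here is a minimal sketch of the weight format such a kernel consumes: symmetric per-group 4-bit quantization, with group size 128 or -1 (one group spanning the whole input dimension). This illustrates only the quantization scheme named above, not Marlin's actual bit-packing or kernel code; the function name and tensor layout are hypothetical.

```python
import torch


def quantize_symmetric_int4(
    weight: torch.Tensor, group_size: int = 128
) -> tuple[torch.Tensor, torch.Tensor]:
    """Illustrative symmetric 4-bit quantization with per-group scales.

    group_size=-1 means a single group per output row, matching the
    two group sizes the PR description mentions (-1 or 128).
    """
    out_features, in_features = weight.shape
    if group_size == -1:
        group_size = in_features
    assert in_features % group_size == 0, "input dim must divide evenly into groups"

    # Reshape so the last dimension is one quantization group.
    grouped = weight.reshape(out_features, in_features // group_size, group_size)

    # Symmetric quantization: the scale maps each group's max magnitude to
    # the int4 limit (7). There is no zero point, so 0.0 always maps to 0.
    scales = grouped.abs().amax(dim=-1, keepdim=True) / 7.0
    scales = scales.clamp(min=1e-8)  # avoid division by zero for all-zero groups

    q = torch.clamp(torch.round(grouped / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)


# Quick check on a random weight matrix with group size 128.
w = torch.randn(256, 512)
q, s = quantize_symmetric_int4(w, group_size=128)
dequant = q.reshape(256, 512 // 128, 128).float() * s.unsqueeze(-1)
print((dequant.reshape(256, 512) - w).abs().max())  # small quantization error
```

Note that the real kernel stores these int4 values packed and permuted for efficient tensor-core access; the unpacked int8 tensor above is purely for illustration.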