compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because:

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight quantizers.
- Exclusions from quantization are configurable.

This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality; a sketch of both follows below.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and CUTLASS kernels.

Support for other quantization types will be added in subsequent PRs.
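For illustration, here is a hedged sketch of what a checkpoint's `quantization_config` can look like in this format: one config group applies FP8 quantization to both the weights and the input activations of `Linear` targets, while `lm_head` is excluded. The field values are an example of the compressed-tensors schema, not taken from this PR.

```json
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {
          "num_bits": 8,
          "type": "float",
          "strategy": "tensor",
          "symmetric": true,
          "dynamic": false
        },
        "input_activations": {
          "num_bits": 8,
          "type": "float",
          "strategy": "tensor",
          "dynamic": false
        }
      }
    },
    "ignore": ["lm_head"]
  }
}
```

Because quantizers are attached per config group, a second group could, for example, use 4-bit integer weights with `"strategy": "group"` and a `"group_size"` for a different set of targets, which is how the W4A16 case above would be expressed.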
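And a minimal sketch of how the `compressed-tensors` package can be used for the configuration parsing and layer matching mentioned above, assuming its pydantic v2 API (`QuantizationConfig.model_validate`). `scheme_for_layer` is a hypothetical helper written for this example; real matching additionally supports `re:`-prefixed regex targets.

```python
from compressed_tensors.quantization import QuantizationConfig

# Parse the `quantization_config` section of a checkpoint's config.json
# (a W4A16 example; values are illustrative, not from this PR).
raw = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "strategy": "group",
                "group_size": 128,
                "symmetric": True,
            },
        }
    },
    "ignore": ["lm_head"],
}
config = QuantizationConfig.model_validate(raw)


def scheme_for_layer(name: str, cls_name: str, config: QuantizationConfig):
    """Hypothetical helper: return the quantization scheme matching a layer,
    or None if the layer is ignored or matches no config group."""
    if name in (config.ignore or []):
        return None
    for scheme in config.config_groups.values():
        # Targets may name a module class (e.g. "Linear") or a specific layer.
        if cls_name in scheme.targets or name in scheme.targets:
            return scheme
    return None


scheme = scheme_for_layer("model.layers.0.self_attn.q_proj", "Linear", config)
if scheme is not None:
    print(scheme.weights.num_bits, scheme.weights.strategy)
```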
Directory listing (`server/text_generation_server/layers/`):

- attention/
- awq/
- compressed_tensors/
- gptq/
- marlin/
- moe/
- __init__.py
- bnb.py
- conv.py
- eetq.py
- exl2.py
- fp8.py
- layernorm.py
- linear.py
- lora.py
- medusa.py
- mlp.py
- rotary.py
- speculative.py
- tensor_parallel.py