Update quantization.md

2025-09-10 20:04:52 +00:00 · 2023-08-25 12:32:50 +03:00 · 5f4dcd5a4b
commit 5f4dcd5a4b
parent 7c2db76b89
1 changed file with 1 addition and 1 deletion
--- a/docs/source/conceptual/quantization.md
+++ b/docs/source/conceptual/quantization.md
@@ -39,7 +39,7 @@ You can learn more about GPTQ from the [paper](https://arxiv.org/pdf/2210.17323.

 ## Quantization with bitsandbytes

-bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. It can be used during training for mixed-precision training or before inference to make the model smaller. Unlike GPTQ quantization, bitsandbytes quantization doesn't require a calibration dataset. One caveat of bitsandbytes 8-bit quantization is that the inference speed is slower compared to GPTQ.
+bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. It can be used during training for mixed-precision training or before inference to make the model smaller. Unlike GPTQ quantization, bitsandbytes quantization doesn't require a calibration dataset or pre-quantized weights. One caveat of bitsandbytes 8-bit quantization is that the inference speed is slower compared to GPTQ or FP16 precision.

 8-bit quantization enables multi-billion parameter scale models to fit in smaller hardware without degrading performance too much. 
 In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇