From e82259106c1bb8b26341671a6705f74f31af5e71 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Fri, 8 Sep 2023 12:55:45 +0200
Subject: [PATCH] Update docs/source/conceptual/quantization.md

Co-authored-by: Pedro Cuenca
---
 docs/source/conceptual/quantization.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/conceptual/quantization.md b/docs/source/conceptual/quantization.md
index 0185039c..d6f96751 100644
--- a/docs/source/conceptual/quantization.md
+++ b/docs/source/conceptual/quantization.md
@@ -39,7 +39,7 @@ You can learn more about GPTQ from the [paper](https://arxiv.org/pdf/2210.17323.
 
 ## Quantization with bitsandbytes
 
-bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. It can be used during training for mixed-precision training or before inference to make the model smaller. Unlike GPTQ quantization, bitsandbytes quantization doesn't require a calibration dataset or pre-quantized weights. One caveat of bitsandbytes 8-bit quantization is that the inference speed is slower compared to GPTQ or FP16 precision.
+bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.
 
 8-bit quantization enables multi-billion parameter scale models to fit in smaller hardware without degrading performance too much.
 In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇
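
Note on the hunk's trailing context: the "like below 👇" line refers to a launch command that sits further down in quantization.md and is not part of this diff. A minimal sketch of that kind of invocation, assuming the standard TGI Docker launch with placeholder `$model` and `$volume` values (not taken from this patch), would look like:

```bash
# Sketch: launch TGI with 8-bit bitsandbytes quantization.
# $model and $volume are placeholders; any model id and host cache path work.
model=bigscience/bloom-560m
volume=$PWD/data  # shared with the container so downloaded weights are cached

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model --quantize bitsandbytes
```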