From e9ae678699f20eac30ad60be539838ceb2ac248b Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 12 Sep 2023 15:52:46 +0200
Subject: [PATCH] Quantization docs (#911)

Co-authored-by: Nicolas Patry
Co-authored-by: Pedro Cuenca
---
 docs/source/_toctree.yml                       |  2 +
 .../source/basic_tutorials/preparing_model.md  |  2 +-
 docs/source/conceptual/quantization.md         | 59 +++++++++++++++++++
 3 files changed, 62 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/conceptual/quantization.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 4a81f48f..25f3815e 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -21,6 +21,8 @@
- sections:
  - local: conceptual/streaming
    title: Streaming
+  - local: conceptual/quantization
+    title: Quantization
  - local: conceptual/tensor_parallelism
    title: Tensor Parallelism
  - local: conceptual/paged_attention

diff --git a/docs/source/basic_tutorials/preparing_model.md b/docs/source/basic_tutorials/preparing_model.md
index 65a2a197..6b622d99 100644
--- a/docs/source/basic_tutorials/preparing_model.md
+++ b/docs/source/basic_tutorials/preparing_model.md
@@ -4,7 +4,7 @@ Text Generation Inference improves the model in several aspects.

## Quantization

-TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323) quantization. To speed up inference with quantization, simply set `quantize` flag to `bitsandbytes` or `gptq` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq).
+TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323) quantization. To speed up inference with quantization, simply set the `quantize` flag to `bitsandbytes` or `gptq` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq). To get more information about quantization, please refer to the [quantization guide](./conceptual/quantization.md).

## RoPE Scaling

diff --git a/docs/source/conceptual/quantization.md b/docs/source/conceptual/quantization.md
new file mode 100644
index 00000000..1a44e3c2
--- /dev/null
+++ b/docs/source/conceptual/quantization.md
@@ -0,0 +1,59 @@
# Quantization

TGI offers GPTQ and bits-and-bytes quantization to quantize large language models.

## Quantization with GPTQ

GPTQ is a post-training quantization method that makes the model smaller. It quantizes each layer by finding a compressed version of its weights that yields the minimum mean squared error, as shown below 👇

Given a layer \\(l\\) with weight matrix \\(W_{l}\\) and layer input \\(X_{l}\\), find the quantized weight \\(\\hat{W}_{l}\\):

$$\hat{W}_{l}^{*} = \underset{\hat{W}_{l}}{\text{argmin}} ||W_{l}X_{l}-\hat{W}_{l}X_{l}||^{2}_{2}$$

TGI allows you to either run an already GPTQ-quantized model (see the available models [here](https://huggingface.co/models?search=gptq)) or quantize a model of your choice using the quantization script. You can run a quantized model by simply passing `--quantize gptq` like below 👇

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize gptq
```

Note that TGI's GPTQ implementation doesn't use [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) under the hood. However, models quantized using AutoGPTQ or Optimum can still be served by TGI.
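For instance, a checkpoint that was already quantized with AutoGPTQ and pushed to the Hub can be served by pointing `--model-id` at it directly. The snippet below is a minimal sketch of the same command; `TheBloke/Llama-2-7B-Chat-GPTQ` is only an assumed example repository, and any GPTQ model from the link above works the same way 👇

```bash
# Example: serve an already GPTQ-quantized checkpoint from the Hub
model=TheBloke/Llama-2-7B-Chat-GPTQ   # assumed example repository; swap in any GPTQ model
volume=$PWD/data                      # where downloaded weights are cached

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize gptq
```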
To quantize a given model using GPTQ with a calibration dataset, simply run

```bash
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
# Add --upload-to-model-id MYUSERNAME/falcon-40b to push the created model to the hub directly
```

This will create a new directory with the quantized files, which you can then serve like below 👇

```bash
text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
```

You can learn more about the quantization options by running `text-generation-server quantize --help`.

If you wish to do more with GPTQ models (e.g. train an adapter on top), you can read about the Transformers GPTQ integration [here](https://huggingface.co/blog/gptq-integration).
You can learn more about GPTQ from the [paper](https://arxiv.org/pdf/2210.17323.pdf).

## Quantization with bitsandbytes

bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.

8-bit quantization enables multi-billion-parameter models to fit on smaller hardware without degrading performance too much.
In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes
```

4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (`fp4`) or 4-bit `NormalFloat` (`nf4`). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.

In TGI, you can use 4-bit quantization by adding `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` like below 👇

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes-nf4
```

You can get more information about 8-bit quantization by reading this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), and about 4-bit quantization by reading [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
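Whichever quantization flag you choose, you can sanity-check the running server with a short generation request. The snippet below is a minimal sketch that assumes one of the containers above is listening on port 8080 of the local machine 👇

```bash
# Send a test prompt to the quantized model through TGI's /generate endpoint
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is quantization?", "parameters": {"max_new_tokens": 32}}'
```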