Mirror of https://github.com/huggingface/text-generation-inference.git, synced 2025-09-12 12:54:52 +00:00

Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

Commit dea2b747d1, parent 3c9a840362
@@ -55,7 +55,7 @@ Options:
 ## QUANTIZE
 ```shell
       --quantize <QUANTIZE>
-          Whether you want the model to be quantized/ load a pre-quantized model.
+          Whether you want the model to be quantized / load a pre-quantized model.

           [env: QUANTIZE=]

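For context on the option documented in this hunk, a minimal launch sketch follows. It assumes `text-generation-launcher` is installed locally; the model ID and port are illustrative placeholders, not values taken from the diff.

```shell
# Illustrative only: quantize a model on the fly with bitsandbytes.
# The model ID and port are placeholders.
text-generation-launcher \
    --model-id meta-llama/Llama-2-7b-hf \
    --quantize bitsandbytes \
    --port 8080

# The same option can be set via the environment variable shown in the hunk.
QUANTIZE=bitsandbytes text-generation-launcher --model-id meta-llama/Llama-2-7b-hf
```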
@@ -4,7 +4,7 @@ Text Generation Inference improves the model in several aspects.

 ## Quantization

-TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes), [GPT-Q](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [Marlin](https://github.com/IST-DASLab/marlin), [EETQ](https://github.com/NetEase-FuXi/EETQ), [EXL2](https://github.com/turboderp/exllamav2), and [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) quantization. To speed up inference with quantization, simply set `quantize` flag to `bitsandbytes`, `gptq`, `awq`, `marlin`, `exl2`, `eetq` or `fp8` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq) when using AWQ quantization, you need to point to one of the models [here](https://huggingface.co/models?search=awq). To get more information about quantization, please refer to [quantization guide](./../conceptual/quantization)
+TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes), [GPT-Q](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [Marlin](https://github.com/IST-DASLab/marlin), [EETQ](https://github.com/NetEase-FuXi/EETQ), [EXL2](https://github.com/turboderp/exllamav2), and [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) quantization. To speed up inference with quantization, simply set `quantize` flag to `bitsandbytes`, `gptq`, `awq`, `marlin`, `exl2`, `eetq` or `fp8` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq). Similarly, when using AWQ quantization, you need to point to one of [these models](https://huggingface.co/models?search=awq). To get more information about quantization, please refer to [quantization guide](./../conceptual/quantization)


 ## RoPE Scaling
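As a sketch of what the revised sentence describes: serving a pre-quantized checkpoint means pointing `--model-id` at a GPTQ or AWQ repository from the Hub search it links to. The model ID below is one such example chosen for illustration, not something prescribed by the diff.

```shell
# Illustrative only: serve an already-quantized GPTQ checkpoint with TGI's Docker image.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-7B-GPTQ \
    --quantize gptq
```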
@@ -2,14 +2,14 @@

 TGI offers many quantization schemes to run LLMs effectively and fast based on your use-case. TGI supports GPTQ, AWQ, bits-and-bytes, EETQ, Marlin, EXL2 and fp8 quantization.

-To leverage GPTQ, AWQ, Marlin and EXL2 quants, you must provide pre-quantized weights. Where as for bits-and-bytes, EETQ and fp8 the weights are quantized on the fly by TGI.
+To leverage GPTQ, AWQ, Marlin and EXL2 quants, you must provide pre-quantized weights. Whereas for bits-and-bytes, EETQ and fp8, weights are quantized by TGI on the fly.

 We recommend using the official quantization scripts for creating your quants:
 1. [AWQ](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py)
 2. [GPTQ/ Marlin](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/examples/quantization/basic_usage.py)
 3. [EXL2](https://github.com/turboderp/exllamav2/blob/master/doc/convert.md)

-For on-the-fly quantization you simply need to pass a the quantization type for TGI and it takes care of the rest.
+For on-the-fly quantization you simply need to pass one of the supported quantization types and TGI takes care of the rest.

 ## Quantization with bitsandbytes

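To make the distinction in this hunk concrete, a short sketch contrasting the two paths. Model IDs are placeholders, and EETQ is only one of the on-the-fly options named above.

```shell
# Pre-quantized weights (GPTQ, AWQ, Marlin, EXL2): point TGI at a repo that
# already contains quantized weights, e.g. one produced by the scripts above.
text-generation-launcher --model-id TheBloke/Llama-2-7B-AWQ --quantize awq

# On-the-fly (bitsandbytes, EETQ, fp8): any regular checkpoint works; TGI
# quantizes the weights as it loads them.
text-generation-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq
```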