Mirror of https://github.com/huggingface/text-generation-inference.git
Update Quantization docs and minor doc fix. (#2368)
* Update Quantization docs and minor doc fix.
* update readme with latest quants info
* Apply suggestions from code review
* up

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
This commit is contained in:
parent 853fb96fec, commit b1bc0ecb7f
.redocly.lint-ignore.yaml (new file, 79 lines added)
@@ -0,0 +1,79 @@
```yaml
# This file instructs Redocly's linter to ignore the rules contained for specific parts of your API.
# See https://redoc.ly/docs/cli/ for more information.
docs/openapi.json:
  no-empty-servers:
    - '#/openapi'
  spec:
    - >-
      #/components/schemas/GenerateParameters/properties/best_of/exclusiveMinimum
    - >-
      #/components/schemas/GenerateParameters/properties/frequency_penalty/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/grammar/nullable'
    - >-
      #/components/schemas/GenerateParameters/properties/repetition_penalty/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/seed/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/temperature/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/top_k/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/top_n_tokens/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/top_p/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/typical_p/exclusiveMinimum
    - '#/components/schemas/GenerateResponse/properties/details/nullable'
    - '#/components/schemas/StreamResponse/properties/details/nullable'
    - '#/components/schemas/ChatRequest/properties/response_format/nullable'
    - '#/components/schemas/ChatRequest/properties/tool_choice/nullable'
    - '#/components/schemas/ToolChoice/nullable'
    - '#/components/schemas/ChatCompletionComplete/properties/logprobs/nullable'
    - '#/components/schemas/ChatCompletionChoice/properties/logprobs/nullable'
  no-invalid-media-type-examples:
    - '#/paths/~1/post/responses/422/content/application~1json/example'
    - '#/paths/~1/post/responses/424/content/application~1json/example'
    - '#/paths/~1/post/responses/429/content/application~1json/example'
    - '#/paths/~1/post/responses/500/content/application~1json/example'
    - '#/paths/~1generate/post/responses/422/content/application~1json/example'
    - '#/paths/~1generate/post/responses/424/content/application~1json/example'
    - '#/paths/~1generate/post/responses/429/content/application~1json/example'
    - '#/paths/~1generate/post/responses/500/content/application~1json/example'
    - >-
      #/paths/~1generate_stream/post/responses/422/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/424/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/429/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/500/content/text~1event-stream/example
    - '#/paths/~1tokenize/post/responses/404/content/application~1json/example'
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/422/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/424/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/429/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/500/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/422/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/424/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/429/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/500/content/application~1json/example
  operation-4xx-response:
    - '#/paths/~1health/get/responses'
    - '#/paths/~1info/get/responses'
    - '#/paths/~1metrics/get/responses'
  no-unused-components:
    - '#/components/schemas/Completion'
  security-defined:
    - '#/paths/~1/post'
    - '#/paths/~1generate/post'
    - '#/paths/~1generate_stream/post'
    - '#/paths/~1health/get'
    - '#/paths/~1info/get'
    - '#/paths/~1metrics/get'
    - '#/paths/~1tokenize/post'
    - '#/paths/~1v1~1chat~1completions/post'
    - '#/paths/~1v1~1completions/post'
```
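The comment at the top of the new file points at Redocly's CLI documentation. As a hedged sketch of how an ignore file like this is normally produced and consumed (the commit does not show the exact command this repo uses; the package name and flags below follow Redocly's public CLI docs):

```bash
# Lint the spec; findings listed in .redocly.lint-ignore.yaml are suppressed.
npx @redocly/cli lint docs/openapi.json

# Regenerate the ignore file from the current set of lint findings.
npx @redocly/cli lint docs/openapi.json --generate-ignore-file
```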
docs/openapi.json
@@ -2080,4 +2080,4 @@
      "description": "Hugging Face Text Generation Inference API"
    }
  ]
}
}
@@ -4,7 +4,7 @@ Text Generation Inference improves the model in several aspects.

## Quantization

-TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes), [GPT-Q](https://arxiv.org/abs/2210.17323) and [AWQ](https://arxiv.org/abs/2306.00978) quantization. To speed up inference with quantization, simply set `quantize` flag to `bitsandbytes`, `gptq` or `awq` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq) when using AWQ quantization, you need to point to one of the models [here](https://huggingface.co/models?search=awq). To get more information about quantization, please refer to [quantization guide](./../conceptual/quantization)
+TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes), [GPT-Q](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [Marlin](https://github.com/IST-DASLab/marlin), [EETQ](https://github.com/NetEase-FuXi/EETQ), [EXL2](https://github.com/turboderp/exllamav2), and [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) quantization. To speed up inference with quantization, simply set the `quantize` flag to `bitsandbytes`, `gptq`, `awq`, `marlin`, `exl2`, `eetq` or `fp8`, depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq). Similarly, when using AWQ quantization, you need to point to one of [these models](https://huggingface.co/models?search=awq). To get more information about quantization, please refer to the [quantization guide](./../conceptual/quantization).


## RoPE Scaling

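To make the flag usage above concrete, here is a minimal launch sketch for a pre-quantized checkpoint; the model id is only an illustrative pick from the GPTQ search link, and the Docker invocation mirrors the one used in the quantization guide below.

```bash
# Serve a pre-quantized GPTQ checkpoint (illustrative model id; any model from
# https://huggingface.co/models?search=gptq should work the same way).
model=TheBloke/Llama-2-7B-GPTQ
volume=$PWD/data   # cache downloaded weights between runs

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model --quantize gptq
```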
@@ -84,7 +84,7 @@ print(chat)

```

-or with OpenAi's library:
+or with OpenAI's [client library](https://github.com/openai/openai-python):

```python
from openai import OpenAI

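If the OpenAI client is not installed, the same `/v1/chat/completions` endpoint that the snippet above targets can be exercised with plain `curl`; this is a minimal sketch that assumes a TGI server is already listening on localhost:8080.

```bash
# Query TGI's OpenAI-compatible Messages API directly with curl.
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "tgi",
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
          ],
          "stream": false
        }'
```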
@@ -1,6 +1,40 @@
# Quantization

-TGI offers GPTQ and bits-and-bytes quantization to quantize large language models.
+TGI offers many quantization schemes to run LLMs effectively and fast, based on your use case. TGI supports GPTQ, AWQ, bits-and-bytes, EETQ, Marlin, EXL2 and fp8 quantization.

+To leverage GPTQ, AWQ, Marlin and EXL2 quants, you must provide pre-quantized weights, whereas for bits-and-bytes, EETQ and fp8 the weights are quantized by TGI on the fly.

+We recommend using the official quantization scripts for creating your quants:
+1. [AWQ](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py)
+2. [GPTQ/Marlin](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/examples/quantization/basic_usage.py)
+3. [EXL2](https://github.com/turboderp/exllamav2/blob/master/doc/convert.md)

+For on-the-fly quantization you simply need to pass one of the supported quantization types, and TGI takes care of the rest.

+## Quantization with bitsandbytes

+bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.

+8-bit quantization enables multi-billion-parameter models to fit on smaller hardware without degrading performance too much.
+In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇

+```bash
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes
+```

+4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (`fp4`) or 4-bit `NormalFloat` (`nf4`). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.

+In TGI, you can use 4-bit quantization by adding `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` like below 👇

+```bash
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes-nf4
+```

+You can get more information about 8-bit quantization by reading this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), and about 4-bit quantization by reading [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

+Use `eetq` or `fp8` for the other on-the-fly quantization schemes.

+In addition to this, TGI allows you to create GPTQ quants directly by passing the model weights and a calibration dataset.

## Quantization with GPTQ

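To make the on-the-fly path described in the hunk above concrete: there is no pre-quantized checkpoint to point at, you only change the `--quantize` value. A minimal sketch, reusing the same `$model` and `$volume` convention as the bitsandbytes examples:

```bash
# On-the-fly quantization: TGI quantizes the weights while loading the model.
# Swap eetq for fp8 (or bitsandbytes) depending on the scheme you want.
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model --quantize eetq
```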
@@ -35,25 +69,4 @@ text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-
You can learn more about the quantization options by running `text-generation-server quantize --help`.

If you wish to do more with GPTQ models (e.g. train an adapter on top), you can read about transformers GPTQ integration [here](https://huggingface.co/blog/gptq-integration).
+You can learn more about GPTQ from the [paper](https://arxiv.org/pdf/2210.17323.pdf).

-## Quantization with bitsandbytes

-bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.

-8-bit quantization enables multi-billion parameter scale models to fit in smaller hardware without degrading performance too much.
-In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇

-```bash
-docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes
-```

-4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (`fp4`), or 4-bit `NormalFloat` (`nf4`). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.

-In TGI, you can use 4-bit quantization by adding `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` like below 👇

-```bash
-docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes-nf4
-```

-You can get more information about 8-bit quantization by reading this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), and 4-bit quantization by reading [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
-You can learn more about GPTQ from the [paper](https://arxiv.org/pdf/2210.17323.pdf).
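Since the kept lines above point at `text-generation-server quantize --help`, here is a hedged sketch of the create-then-serve flow for a GPTQ quant; the model id, output directory and shard count are illustrative, and the exact options should be checked against the `--help` output of your TGI version.

```bash
# 1) Quantize full-precision weights to GPTQ (GPTQ calibration is handled by
#    the command; see `text-generation-server quantize --help` for options).
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq

# 2) Serve the freshly created quant, sharded across two GPUs.
text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
```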