Add info about FP8 support (#137)

Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Karol Damaszke, 2024-05-06 11:03:14 +02:00, committed by GitHub
parent f82da93318
commit 4169ff8e6f


@@ -20,6 +20,7 @@ limitations under the License.
 - [Running TGI on Gaudi](#running-tgi-on-gaudi)
 - [Adjusting TGI parameters](#adjusting-tgi-parameters)
+- [Running TGI with FP8 precision](#running-tgi-with-fp8-precision)
 - [Currently supported configurations](#currently-supported-configurations)
 - [Environment variables](#environment-variables)
 - [Profiler](#profiler)
@@ -74,8 +75,8 @@ To use [🤗 text-generation-inference](https://github.com/huggingface/text-gene
 ## Adjusting TGI parameters
 
 Maximum sequence length is controlled by two arguments:
-- `--max-input-length` is the maximum possible input prompt length. Default value is `1024`.
-- `--max-total-tokens` is the maximum possible total length of the sequence (input and output). Default value is `2048`.
+- `--max-input-length` is the maximum possible input prompt length. Default value is `4095`.
+- `--max-total-tokens` is the maximum possible total length of the sequence (input and output). Default value is `4096`.
 
 Maximum batch size is controlled by two arguments:
 - For prefill operation, please set `--max-prefill-total-tokens` as `bs * max-input-length`, where `bs` is your expected maximum prefill batch size.
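
For orientation, the sketch below shows where these arguments go in a docker-based launch of TGI on Gaudi; the image tag, model id, volume path, and the prefill batch size behind `--max-prefill-total-tokens` are illustrative placeholders, not values from this commit.

```bash
# Illustrative sketch only: substitute your own image tag, model id and sizes.
model=meta-llama/Llama-2-7b-hf   # placeholder model id
volume=$PWD/data                 # mounted into the container as /data

docker run -p 8080:80 -v $volume:/data --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all --cap-add=sys_nice --ipc=host \
    ghcr.io/huggingface/tgi-gaudi:latest \
    --model-id $model \
    --max-input-length 1024 \
    --max-total-tokens 2048 \
    --max-prefill-total-tokens 4096   # bs=4 prefill batches of 1024-token prompts
```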
@@ -91,23 +92,33 @@ Except those already mentioned, there are other parameters that need to be prope
 For more information and documentation about Text Generation Inference, check out [the README](https://github.com/huggingface/text-generation-inference#text-generation-inference) of the original repo.
 
+## Running TGI with FP8 precision
+
+TGI supports FP8 precision runs within the limits provided by the [Habana Quantization Toolkit](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html). Models can be run in FP8 by properly setting the `QUANT_CONFIG` environment variable. Detailed instructions on how to use that variable can be found in the [Optimum Habana FP8 guide](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8). Summarising those instructions for TGI:
+1. Measure the quantization statistics of the requested model with the [Optimum Habana measurement script](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8:~:text=use_deepspeed%20%2D%2Dworld_size%208-,run_lm_eval.py,-%5C%0A%2Do%20acc_70b_bs1_measure.txt).
+2. Run the requested model in TGI with the proper `QUANT_CONFIG` setting, e.g. `QUANT_CONFIG=./quantization_config/maxabs_quant.json`.
+
+> [!NOTE]
+> Only models listed in the [currently supported configurations](#currently-supported-configurations) are guaranteed to work with FP8.
+
+Additional hints for quantizing a model for TGI when using `run_lm_eval.py`:
+* use the `--limit_hpu_graphs` flag to save memory
+* try to mirror your use case by adjusting `--batch_size`, `--max_new_tokens 512` and `--max_input_tokens 512`; in case of memory issues, lower those values
+* use datasets/tasks suitable for your use case (see `--help` for defining tasks/datasets)
+
 ## Currently supported configurations
 
 Not all features of TGI are currently supported as this is still a work in progress.
 Currently supported and validated configurations (other configurations are not guaranteed to work or to offer reasonable performance):
-* LLaMA 70b:
-  * Num cards: 8
-  * Decode batch size: 128
-  * Dtype: bfloat16
-  * Max input tokens: 1024
-  * Max total tokens: 2048
-* LLaMA 7b:
-  * Num cards: 1
-  * Decode batch size: 16
-  * Dtype: bfloat16
-  * Max input tokens: 1024
-  * Max total tokens: 2048
+<div align="left">
+
+| Model     | Cards | Decode batch size | Dtype        | Max input tokens | Max total tokens |
+|:---------:|:-----:|:-----------------:|:------------:|:----------------:|:----------------:|
+| LLaMA 70b | 8     | 128               | bfloat16/FP8 | 1024             | 2048             |
+| LLaMA 7b  | 1/8   | 16                | bfloat16/FP8 | 1024             | 2048             |
+
+</div>
 
 Other sequence lengths can be used with proportionally decreased/increased batch size (the higher sequence length, the lower batch size).
 Support for other models from Optimum Habana will be added successively.
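
Putting the two FP8 steps from the section above together, a minimal sketch of the workflow might look as follows; the measurement config name (`maxabs_measure.json`), the mount path, the model id, the image tag and the extra `run_lm_eval.py` flags are assumptions based on the linked Optimum Habana guide, not part of this commit.

```bash
# Step 1 (sketch): collect quantization statistics with the Optimum Habana
# measurement script. The config file name and most flags here are assumptions
# taken from the linked guide; adapt them to your model and use case.
cd optimum-habana/examples/text-generation
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_lm_eval.py \
    -o measure_llama7b_bs1.txt \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --use_hpu_graphs --use_kv_cache --bf16 \
    --batch_size 1 --max_new_tokens 512 --max_input_tokens 512 \
    --limit_hpu_graphs

# Step 2 (sketch): launch TGI with QUANT_CONFIG pointing at the quantization
# config; the measurement output and config directory must be visible inside
# the container (the mount path below is an assumption).
docker run -p 8080:80 -v $PWD:/usr/src/quant --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all --cap-add=sys_nice --ipc=host \
    -e QUANT_CONFIG=/usr/src/quant/quantization_config/maxabs_quant.json \
    ghcr.io/huggingface/tgi-gaudi:latest \
    --model-id meta-llama/Llama-2-7b-hf
```

For a LLaMA 70b deployment as listed in the table above, the same launch would additionally shard the model across 8 cards (e.g. `--sharded true --num-shard 8`).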