Add info about FP8 support (#137)

Co-authored-by: jkaniecki <153085639+jkaniecki@users.noreply.github.com>
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Karol Damaszke, 2024-05-06 11:03:14 +02:00, committed by GitHub
parent f82da93318
commit 4169ff8e6f


@@ -20,6 +20,7 @@ limitations under the License.
 - [Running TGI on Gaudi](#running-tgi-on-gaudi)
 - [Adjusting TGI parameters](#adjusting-tgi-parameters)
+- [Running TGI with FP8 precision](#running-tgi-with-fp8-precision)
 - [Currently supported configurations](#currently-supported-configurations)
 - [Environment variables](#environment-variables)
 - [Profiler](#profiler)
@@ -74,8 +75,8 @@ To use [🤗 text-generation-inference](https://github.com/huggingface/text-gene
 ## Adjusting TGI parameters
 
 Maximum sequence length is controlled by two arguments:
-- `--max-input-length` is the maximum possible input prompt length. Default value is `1024`.
-- `--max-total-tokens` is the maximum possible total length of the sequence (input and output). Default value is `2048`.
+- `--max-input-length` is the maximum possible input prompt length. Default value is `4095`.
+- `--max-total-tokens` is the maximum possible total length of the sequence (input and output). Default value is `4096`.
 
 Maximum batch size is controlled by two arguments:
 - For prefill operation, please set `--max-prefill-total-tokens` as `bs * max-input-length`, where `bs` is your expected maximum prefill batch size.
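
For orientation, the sketch below shows where these arguments go in a docker-based launch of TGI on Gaudi; the image tag, model id, volume path, and the prefill batch size behind `--max-prefill-total-tokens` are illustrative placeholders, not values from this commit.

```bash
# Illustrative sketch only: substitute your own image tag, model id and sizes.
model=meta-llama/Llama-2-7b-hf   # placeholder model id
volume=$PWD/data                 # mounted into the container as /data

docker run -p 8080:80 -v $volume:/data --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all --cap-add=sys_nice --ipc=host \
    ghcr.io/huggingface/tgi-gaudi:latest \
    --model-id $model \
    --max-input-length 1024 \
    --max-total-tokens 2048 \
    --max-prefill-total-tokens 4096   # bs=4 prefill batches of 1024-token prompts
```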
@@ -91,23 +92,33 @@ Except those already mentioned, there are other parameters that need to be prope
 For more information and documentation about Text Generation Inference, check out [the README](https://github.com/huggingface/text-generation-inference#text-generation-inference) of the original repo.
 
+## Running TGI with FP8 precision
+
+TGI supports FP8 precision runs within the limits provided by the [Habana Quantization Toolkit](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html). Models can be run in FP8 by properly setting the `QUANT_CONFIG` environment variable. Detailed instructions on how to use that variable can be found in the [Optimum Habana FP8 guide](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8). Summarising those instructions for TGI:
+1. Measure the quantization statistics of the requested model with the [Optimum Habana measurement script](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8:~:text=use_deepspeed%20%2D%2Dworld_size%208-,run_lm_eval.py,-%5C%0A%2Do%20acc_70b_bs1_measure.txt).
+2. Run the requested model in TGI with the proper `QUANT_CONFIG` setting, e.g. `QUANT_CONFIG=./quantization_config/maxabs_quant.json`.
+
+> [!NOTE]
+> Only models listed in the [currently supported configurations](#currently-supported-configurations) are guaranteed to work with FP8.
+
+Additional hints for quantizing a model for TGI when using `run_lm_eval.py`:
+* use the `--limit_hpu_graphs` flag to save memory
+* try to mirror your use case by adjusting `--batch_size`, `--max_new_tokens 512` and `--max_input_tokens 512`; in case of memory issues, lower those values
+* use datasets/tasks suitable for your use case (see `--help` for defining tasks/datasets)
+
 ## Currently supported configurations
 
 Not all features of TGI are currently supported as this is still a work in progress.
 Currently supported and validated configurations (other configurations are not guaranteed to work or to offer reasonable performance):
-* LLaMA 70b:
-  * Num cards: 8
-  * Decode batch size: 128
-  * Dtype: bfloat16
-  * Max input tokens: 1024
-  * Max total tokens: 2048
-* LLaMA 7b:
-  * Num cards: 1
-  * Decode batch size: 16
-  * Dtype: bfloat16
-  * Max input tokens: 1024
-  * Max total tokens: 2048
+<div align="left">
+
+| Model     | Cards | Decode batch size | Dtype        | Max input tokens | Max total tokens |
+|:---------:|:-----:|:-----------------:|:------------:|:----------------:|:----------------:|
+| LLaMA 70b | 8     | 128               | bfloat16/FP8 | 1024             | 2048             |
+| LLaMA 7b  | 1/8   | 16                | bfloat16/FP8 | 1024             | 2048             |
+
+</div>
 
 Other sequence lengths can be used with proportionally decreased/increased batch size (the higher sequence length, the lower batch size).
 Support for other models from Optimum Habana will be added successively.
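
Putting the two FP8 steps from the section above together, a minimal sketch of the workflow might look as follows; the measurement config name (`maxabs_measure.json`), the mount path, the model id, the image tag and the extra `run_lm_eval.py` flags are assumptions based on the linked Optimum Habana guide, not part of this commit.

```bash
# Step 1 (sketch): collect quantization statistics with the Optimum Habana
# measurement script. The config file name and most flags here are assumptions
# taken from the linked guide; adapt them to your model and use case.
cd optimum-habana/examples/text-generation
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_lm_eval.py \
    -o measure_llama7b_bs1.txt \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --use_hpu_graphs --use_kv_cache --bf16 \
    --batch_size 1 --max_new_tokens 512 --max_input_tokens 512 \
    --limit_hpu_graphs

# Step 2 (sketch): launch TGI with QUANT_CONFIG pointing at the quantization
# config; the measurement output and config directory must be visible inside
# the container (the mount path below is an assumption).
docker run -p 8080:80 -v $PWD:/usr/src/quant --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all --cap-add=sys_nice --ipc=host \
    -e QUANT_CONFIG=/usr/src/quant/quantization_config/maxabs_quant.json \
    ghcr.io/huggingface/tgi-gaudi:latest \
    --model-id meta-llama/Llama-2-7b-hf
```

For a LLaMA 70b deployment as listed in the table above, the same launch would additionally shard the model across 8 cards (e.g. `--sharded true --num-shard 8`).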