# Gaudi Backend for Text Generation Inference
## Overview
Text Generation Inference (TGI) has been optimized to run on Gaudi hardware via the Gaudi backend for TGI.
## Supported Hardware
- **Gaudi1**: Available on [AWS EC2 DL1 instances](https://aws.amazon.com/ec2/instance-types/dl1/)
- **Gaudi2**: Available on [Intel Cloud](https://console.cloud.intel.com/docs/reference/ai_instances.html)
- **Gaudi3**: Available on [Intel Cloud](https://console.cloud.intel.com/docs/reference/ai_instances.html)
## Tutorial: Getting Started with TGI on Gaudi
### Basic Usage
The easiest way to run TGI on Gaudi is to use the official Docker image:
```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_HF_ACCESS_TOKEN
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model
```
Once you see the `Connected` log line, the server is ready to accept requests:
> 2024-05-22T19:31:48.302239Z INFO text_generation_router: router/src/main.rs:378: Connected
You can create your Hugging Face access token (`YOUR_HF_ACCESS_TOKEN` above) at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). A token is required to access gated models such as Llama 3.1.
### Making Your First Request
You can send a request from a separate terminal:
```bash
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
-H 'Content-Type: application/json'
```
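TGI also exposes an OpenAI-compatible Messages API on the same port. Assuming the server started above, a chat-style request looks like the following sketch (the `model` field is commonly set to the placeholder value `tgi`):
```bash
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "tgi",
      "messages": [{"role": "user", "content": "What is Deep Learning?"}],
      "max_tokens": 32,
      "stream": false
    }'
```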
## How-to Guides
You can view the full list of supported models in the [Supported Models](https://huggingface.co/docs/text-generation-inference/backends/gaudi#supported-models) section.
For example, to run Llama3.1-8B, you can use the following command:
```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_ACCESS_TOKEN
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model \
<text-generation-inference-launcher-arguments>
```
For the full list of service parameters, refer to the [launcher-arguments page](https://huggingface.co/docs/text-generation-inference/reference/launcher).
The validated docker commands can be found in the [examples/docker_commands folder](https://github.com/huggingface/text-generation-inference/tree/main/backends/gaudi/examples/docker_commands).
> Note: `--runtime=habana --cap-add=sys_nice --ipc=host` is required so that Docker can use the Gaudi hardware (more details [here](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Docker_Installation.html)).
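For instance, the `<text-generation-inference-launcher-arguments>` placeholder above can be replaced with concrete launcher flags; the values below are purely illustrative and should be tuned for your model and workload:
```bash
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-concurrent-requests 128
```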
### How to Enable Multi-Card Inference (Sharding)
TGI-Gaudi supports sharding for multi-card inference, allowing you to distribute the load across multiple Gaudi cards. This is recommended to run large models and to speed up inference.
For example, on a machine with 8 Gaudi cards, you can run:
```bash
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model --sharded true --num-shard 8
```
<Tip>
We recommend always using sharding when running on a multi-card machine.
</Tip>
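Before sharding across all cards, it can help to confirm how many Gaudi devices the host actually exposes. The `hl-smi` tool shipped with the Habana driver stack lists them, similar to `nvidia-smi` on NVIDIA systems:
```bash
# List the Gaudi devices visible on this machine (requires the habanalabs driver/tools)
hl-smi
```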
### How to Use Different Precision Formats
#### BF16 Precision (Default)
By default, all models run with BF16 precision on Gaudi hardware.
#### FP8 Precision
TGI-Gaudi supports FP8 precision inference with [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html).
To run FP8 Inference:
1. Measure statistics using [Optimum Habana measurement script](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8)
2. Run the model in TGI with the QUANT_CONFIG setting, e.g. `-e QUANT_CONFIG=./quantization_config/maxabs_quant.json` (a sketch of this file is shown below).
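For reference, a `maxabs_quant.json` similar to the one used in the Optimum Habana text-generation examples looks roughly like the sketch below; treat the exact keys as an assumption and verify them against the measurement guide linked in step 1:
```bash
# Sketch only: keys mirror the Optimum Habana example config and may differ in your INC version
mkdir -p quantization_config
cat > quantization_config/maxabs_quant.json <<'EOF'
{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "dump_stats_path": "./hqt_output/measure"
}
EOF
```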
The following command example for FP8 inference assumes that the measurement described in step 1 above has already been completed.
Example for Llama3.1-70B on 8 cards with FP8 precision:
```bash
model=meta-llama/Meta-Llama-3.1-70B-Instruct
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e BATCH_BUCKET_SIZE=256 \
-e PREFILL_BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 4096 --max-batch-size 256 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512
```
### How to Run Vision-Language Models (VLMs)
Gaudi supports VLM inference.
Example for Llava-v1.6-Mistral-7B on 1 card:
Start the TGI server via the following command:
```bash
model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-e PREFILL_BATCH_BUCKET_SIZE=1 \
-e BATCH_BUCKET_SIZE=1 \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model \
--max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
--max-total-tokens 8192 --max-batch-size 4
```
You can then send a request to the server via the following command:
```bash
curl -N 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":32}}' \
-H 'Content-Type: application/json'
```
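If you prefer the OpenAI-compatible Messages API, the image can also be passed as an `image_url` content part, assuming the served model's chat template accepts image inputs (the payload shape below is a sketch):
```bash
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "tgi",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"}},
          {"type": "text", "text": "What is this a picture of?"}
        ]
      }],
      "max_tokens": 32
    }'
```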
> Note: In Llava-v1.6-Mistral-7B, an image typically accounts for around 2000 input tokens; for example, an image of size 512x512 is represented by 2800 tokens. Thus, `max-input-tokens` must be larger than the number of tokens associated with the image, otherwise the image may be truncated. We set `BASE_IMAGE_TOKENS=2048` as the default image token value; this is also the minimum value of `max-input-tokens`. You can override the environment variable `BASE_IMAGE_TOKENS` to change this value. The warmup will generate graphs with input lengths from `BASE_IMAGE_TOKENS` to `max-input-tokens`. For Llava-v1.6-Mistral-7B, `max-batch-prefill-tokens` is set to 16384, which is calculated as `prefill_batch_size = max-batch-prefill-tokens / max-input-tokens`, i.e. 16384 / 4096 = 4.
### How to Benchmark Performance
We recommend using the [inference-benchmarker tool](https://github.com/huggingface/inference-benchmarker) to benchmark performance on Gaudi hardware.
This benchmarking tool simulates user requests and measures the model's performance in realistic scenarios.
To run it on the same machine, you can do the following:
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>
# run a benchmark to evaluate the performance of the model for chat use case
# we mount results to the current directory
docker run \
--rm \
-it \
--net host \
-v $(pwd):/opt/inference-benchmarker/results \
-e "HF_TOKEN=$HF_TOKEN" \
ghcr.io/huggingface/inference-benchmarker:latest \
inference-benchmarker \
--tokenizer-name "$MODEL" \
--url http://localhost:8080 \
--profile chat
```
Please refer to the [inference-benchmarker README](https://github.com/huggingface/inference-benchmarker) for more details.
### How to Profile Performance
To collect performance profiling, you need to set the following environment variables:
| Name | Value(s) | Default | Description |
|--------------------| :--------- | :--------------- | :------------------------------------------------------- |
| PROF_WAITSTEP | integer | 0 | Control profile wait steps |
| PROF_WARMUPSTEP | integer | 0 | Control profile warmup steps |
| PROF_STEP | integer | 0 | Enable/disable profile, control profile active steps |
| PROF_PATH | string | /tmp/hpu_profile | Define profile folder |
| PROF_RANKS | string | 0 | Comma-separated list of ranks to profile |
| PROF_RECORD_SHAPES | True/False | False | Control record_shapes option in the profiler |
To use these environment variables, add them to your `docker run` command with the `-e` flag. For example:
```bash
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
-e PROF_WAITSTEP=10 \
-e PROF_WARMUPSTEP=10 \
-e PROF_STEP=1 \
-e PROF_PATH=/tmp/hpu_profile \
-e PROF_RANKS=0 \
-e PROF_RECORD_SHAPES=True \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model
```
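The traces are written to `PROF_PATH` inside the container, so they disappear with it unless that directory is bind-mounted to the host. A minimal sketch for keeping them (paths are illustrative):
```bash
# Persist HPU profiling traces on the host by mounting the profile directory
mkdir -p $PWD/hpu_profile
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
-v $PWD/hpu_profile:/tmp/hpu_profile \
-e PROF_STEP=1 \
-e PROF_PATH=/tmp/hpu_profile \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model
```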
## Explanation: Understanding TGI on Gaudi
### The Warmup Process
To ensure optimal performance, warmup is performed at the beginning of each server run. This process creates queries with various input shapes based on provided parameters and runs basic TGI operations (prefill, decode, concatenate).
Note: Model warmup can take several minutes, especially for FP8 inference. For faster subsequent runs, refer to [Disk Caching Eviction Policy](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html#disk-caching-eviction-policy).
### Understanding Parameter Tuning
#### Sequence Length Parameters
- `--max-input-tokens` is the maximum possible input prompt length. Default value is `4095`.
- `--max-total-tokens` is the maximum possible total length of the sequence (input and output). Default value is `4096`.
#### Batch Size Parameters
- For prefill operation, please set `--max-batch-prefill-tokens` as `bs * max-input-tokens`, where `bs` is your expected maximum prefill batch size.
- For decode operation, please set `--max-batch-size` as `bs`, where `bs` is your expected maximum decode batch size.
- Please note that the batch size will always be padded to the nearest multiple of `BATCH_BUCKET_SIZE` and `PREFILL_BATCH_BUCKET_SIZE` (a worked example follows this list).
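As a worked example with illustrative values: for `--max-input-tokens 1024` and an expected maximum prefill batch size of 4, set `--max-batch-prefill-tokens` to `4 * 1024 = 4096`; for an expected maximum decode batch size of 32, set `--max-batch-size 32`, which is already a multiple of the default `BATCH_BUCKET_SIZE` of 8:
```bash
# Illustrative tuning: prefill batch size 4, decode batch size 32, max input length 1024
# max-batch-prefill-tokens = 4 * 1024 = 4096
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
ghcr.io/huggingface/text-generation-inference:3.2.3-gaudi \
--model-id $model \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 4096 --max-batch-size 32
```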
#### Performance and Memory Parameters
- `PAD_SEQUENCE_TO_MULTIPLE_OF` determines sizes of input length buckets. Since warmup creates several graphs for each bucket, it's important to adjust that value proportionally to input sequence length. Otherwise, some out of memory issues can be observed.
- `ENABLE_HPU_GRAPH` enables HPU graphs usage, which is crucial for performance results. Recommended value to keep is `true`.
## Reference
This section contains reference information about the Gaudi backend.
### Supported Models
Text Generation Inference enables serving optimized models on Gaudi hardware. The following sections list which models (VLMs & LLMs) are supported on Gaudi.
**Large Language Models (LLMs)**
- [Llama2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
- [Llama2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
- [Llama3-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- [Llama3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
- [LLama3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- [LLama3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
- [CodeLlama-13B](https://huggingface.co/codellama/CodeLlama-13b-hf)
- [Opt-125m](https://huggingface.co/facebook/opt-125m)
- [OpenAI-gpt2](https://huggingface.co/openai-community/gpt2)
- [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- [Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B-Instruct)
- [Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
- [Phi-1.5](https://huggingface.co/microsoft/phi-1_5)
- [Gemma-7b](https://huggingface.co/google/gemma-7b-it)
- [Starcoder2-3b](https://huggingface.co/bigcode/starcoder2-3b)
- [Starcoder2-15b](https://huggingface.co/bigcode/starcoder2-15b)
- [Starcoder](https://huggingface.co/bigcode/starcoder)
- [falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
- [Falcon-180B](https://huggingface.co/tiiuae/falcon-180B-chat)
- [GPT-2](https://huggingface.co/openai-community/gpt2)
- [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b)
**Vision-Language Models (VLMs)**
- [LLaVA-v1.6-Mistral-7B](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
- [Mllama (Multimodal Llama from Meta)](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- [Idefics](https://huggingface.co/HuggingFaceM4/idefics-9b)
- [Idefics 2](https://huggingface.co/HuggingFaceM4/idefics2-8b)
- [Idefics 2.5](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3)
- [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)
- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
We also support, on a best-effort basis, models with different parameter counts that use the same architectures as those listed above, although these variants have not been validated. For example, the Gaudi backend supports `meta-llama/Llama-3.2-1B`, since it uses the standard Llama 3 architecture. If you encounter a problem with a model, please open an issue on the [Gaudi backend repository](https://github.com/huggingface/text-generation-inference/issues).
### Environment Variables
The following table contains the environment variables that can be used to configure the Gaudi backend:
| Name | Value(s) | Default | Description | Usage |
|-----------------------------| :--------- | :--------------- | :------------------------------------------------------------------------------------------------------------------------------- | :--------------------------- |
| ENABLE_HPU_GRAPH | True/False | True | Enable HPU graphs or not | add -e in docker run command |
| LIMIT_HPU_GRAPH | True/False | True | Skip HPU graph usage for prefill to save memory; set to `True` for large sequence/decoding lengths (e.g. 300/212) | add -e in docker run command |
| BATCH_BUCKET_SIZE | integer | 8 | Batch size for decode operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs | add -e in docker run command |
| PREFILL_BATCH_BUCKET_SIZE | integer | 4 | Batch size for prefill operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs | add -e in docker run command |
| PAD_SEQUENCE_TO_MULTIPLE_OF | integer | 128 | For prefill operation, sequences will be padded to a multiple of provided value. | add -e in docker run command |
| SKIP_TOKENIZER_IN_TGI | True/False | False | Skip tokenizer for input/output processing | add -e in docker run command |
| WARMUP_ENABLED | True/False | True | Enable warmup during server initialization to recompile all graphs. This can increase TGI setup time. | add -e in docker run command |
| QUEUE_THRESHOLD_MS | integer | 120 | Controls the threshold beyond which requests are considered overdue and handled with priority. Shorter requests are prioritized otherwise. | add -e in docker run command |
| USE_FLASH_ATTENTION | True/False | True | Whether to enable Habana Flash Attention, provided that the model supports it. Please refer to https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html?highlight=fusedsdpa#using-fused-scaled-dot-product-attention-fusedsdpa | add -e in docker run command |
| FLASH_ATTENTION_RECOMPUTE | True/False | True | Whether to enable Habana Flash Attention in recompute mode on first token generation. | add -e in docker run command |
## Contributing
Contributions to the TGI-Gaudi project are welcome. Please refer to the [contributing guide](https://github.com/huggingface/text-generation-inference/blob/main/CONTRIBUTING.md).
**Guidelines for contributing to Gaudi on TGI:** All changes should be made within the `backends/gaudi` folder. In general, you should avoid modifying the router, launcher, or benchmark to accommodate Gaudi hardware, as all Gaudi-specific logic should be contained within the `backends/gaudi` folder.
### Building the Docker Image from Source
To build the Docker image from source:
```bash
make -C backends/gaudi image
```
This builds the image and saves it as `tgi-gaudi`. You can then run TGI-Gaudi with this image:
```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data
hf_token=YOUR_ACCESS_TOKEN
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
tgi-gaudi \
--model-id $model
```
For more details, see the [README of the Gaudi backend](https://github.com/huggingface/text-generation-inference/blob/main/backends/gaudi/README.md) and the [Makefile of the Gaudi backend](https://github.com/huggingface/text-generation-inference/blob/main/backends/gaudi/Makefile).