feat(gaudi): release ready (docs, docker image and vlm ready)

baptiste 2025-03-10 09:01:45 +00:00
parent d4c6faa67b
commit 02fa6adb20
13 changed files with 862 additions and 10 deletions

View File

@ -124,6 +124,15 @@ jobs:
export extra_pytest="--neuron"
export target=""
;;
gaudi)
export dockerfile="Dockerfile_gaudi"
export label_extension="-gaudi"
export docker_volume="/mnt/cache"
export docker_devices=""
export runs_on="ubuntu-latest"
export platform=""
export extra_pytest=""
export target=""
esac
echo $dockerfile
echo "Dockerfile=${dockerfile}"

View File

@ -21,6 +21,7 @@ on:
- "Dockerfile_amd"
- "Dockerfile_intel"
- "Dockerfile.neuron"
- "Dockerfile_gaudi"
branches:
- "main"
workflow_dispatch:
@ -38,7 +39,7 @@ jobs:
# fail-fast is true by default
fail-fast: false
matrix:
hardware: ["cuda", "cuda-trtllm", "rocm", "intel-xpu", "intel-cpu", "neuron"]
hardware: ["cuda", "cuda-trtllm", "rocm", "intel-xpu", "intel-cpu", "neuron", "gaudi"]
uses: ./.github/workflows/build.yaml # calls the one above ^
permissions:
contents: write

.gitignore
View File

@ -28,3 +28,4 @@ server/fbgemmm
hl-smi_log*.txt
.graph_dumps
out
hqt_output

View File

@ -53,3 +53,6 @@ run-falcon-7b-instruct-quantize:
clean:
rm -rf target aml
preview_doc:
doc-builder preview text-generation-inference docs/source --not_python_module

View File

@ -0,0 +1,39 @@
# TGI-Gaudi example
This example shows a simple way to use `tgi-gaudi` with continuous batching. It uses the small [DIBT/10k_prompts_ranked](https://huggingface.co/datasets/DIBT/10k_prompts_ranked) dataset and reports basic performance numbers.
## Get started
### Install
```bash
pip install -r requirements.txt
```
### Setup TGI server
More details on running the TGI server are available [here](https://github.com/huggingface/tgi-gaudi/blob/habana-main/README.md#running-tgi-on-gaudi).
### Run benchmark
To run the benchmark, use the command below:
```bash
python run_generation.py --model_id MODEL_ID
```
where `MODEL_ID` should be set to the same value as in the TGI server instance.
> For gated models such as [Llama](https://huggingface.co/meta-llama) or [StarCoder](https://huggingface.co/bigcode/starcoder), you will need to set the environment variable `HF_TOKEN=<token>` to a valid Hugging Face Hub read token.
All possible parameters are described in the table below:
<div align="left">
| Name | Default value | Description |
| ------------------------- | :---------------------------- | :------------------------------------------------------------ |
| SERVER_ADDRESS | http://localhost:8080 | The address and port at which the TGI server is available. |
| MODEL_ID | meta-llama/Llama-2-7b-chat-hf | Model ID used in the TGI server instance. |
| MAX_INPUT_LENGTH | 1024 | Maximum input length supported by the TGI server. |
| MAX_OUTPUT_LENGTH | 1024 | Maximum output length supported by the TGI server. |
| TOTAL_SAMPLE_COUNT | 2048 | Number of samples to run. |
| MAX_CONCURRENT_REQUESTS | 256 | The number of requests sent simultaneously to the TGI server. |
</div>

View File

@ -0,0 +1,4 @@
huggingface_hub==0.23.5
requests==2.31.0
datasets==2.18.0
transformers>=4.37.0

View File

@ -0,0 +1,110 @@
# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company.
import argparse
import requests
import time
from typing import List
from datasets import load_dataset
from transformers import AutoTokenizer
from tgi_client import TgiClient
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--server_address",
type=str,
default="http://localhost:8083",
help="Address of the TGI server",
)
parser.add_argument(
"--model_id",
type=str,
default="meta-llama/Llama-2-7b-chat-hf",
help="Model id used in TGI server",
)
parser.add_argument(
"--max_input_length",
type=int,
default=1024,
help="Max input length for TGI model",
)
parser.add_argument(
"--max_output_length",
type=int,
default=1024,
help="Max output length for TGI model",
)
parser.add_argument(
"--total_sample_count",
type=int,
default=2048,
help="Total number of samples to generate",
)
parser.add_argument(
"--max_concurrent_requests",
type=int,
default=256,
help="Max number of concurrent requests",
)
parser.add_argument("--seed", type=int, default=42, help="Random seed for datasets")
return parser.parse_args()
def read_dataset(
max_input_length: int,
total_sample_count: int,
model_id: str,
seed: int,
) -> List[str]:
"""
Loads public dataset from HF: https://huggingface.co/datasets/DIBT/10k_prompts_ranked
and filters out too long samples.
"""
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset(
"DIBT/10k_prompts_ranked", split="train", trust_remote_code=True
)
dataset = dataset.filter(
lambda x: len(tokenizer(x["prompt"])["input_ids"]) < max_input_length
)
if len(dataset) > total_sample_count:
dataset = dataset.select(range(total_sample_count))
dataset = dataset.shuffle(seed=seed)
return [sample["prompt"] for sample in dataset]
def is_tgi_available(server_address: str) -> bool:
"""
Checks if TGI server is available under the specified address.
"""
try:
info = requests.get(f"{server_address}/info")
return info.status_code == 200
except Exception:
return False
def main():
args = get_args()
dataset = read_dataset(
args.max_input_length, args.total_sample_count, args.model_id, args.seed
)
if not is_tgi_available(args.server_address):
raise RuntimeError("Cannot connect with TGI server!")
tgi_client = TgiClient(args.server_address, args.max_concurrent_requests)
timestamp = time.perf_counter_ns()
tgi_client.run_generation(dataset, args.max_output_length)
duration_s = (time.perf_counter_ns() - timestamp) * 1e-9
tgi_client.print_performance_metrics(duration_s)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,88 @@
# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company.
import os
import statistics
import threading
import time
import tqdm
from typing import List
from huggingface_hub import InferenceClient
def except_hook(args):
print(f"Thread failed with error: {args.exc_value}")
os._exit(1)
threading.excepthook = except_hook
class TgiClient:
def __init__(self, server_address: str, max_num_threads: int) -> None:
self._lock = threading.Lock()
self._semaphore = threading.Semaphore(max_num_threads)
self._client = InferenceClient(server_address)
        self._ttft = []  # time to first token for each request, in nanoseconds
        self._tpot = []  # per-output-token latencies (time per output token), in nanoseconds
        self._generated_tokens = []  # number of tokens generated per request
def run_generation(self, samples: List[str], max_new_tokens: int) -> None:
"""
Run generation for every sample in dataset.
Creates a separate thread for every sample.
"""
threads: List[threading.Thread] = []
for sample in tqdm.tqdm(samples):
self._semaphore.acquire()
threads.append(
threading.Thread(
target=self._process_sample, args=[sample, max_new_tokens]
)
)
threads[-1].start()
for thread in threads:
if thread is not None:
thread.join()
def _process_sample(self, sample: str, max_new_tokens: int) -> None:
"""
Generates response stream for a single sample.
Collects performance metrics.
"""
timestamp = time.perf_counter_ns()
response_stream = self._client.text_generation(
sample, max_new_tokens=max_new_tokens, stream=True, details=True
)
out = ""
for id, response in enumerate(response_stream):
if id == 0:
self._ttft.append(time.perf_counter_ns() - timestamp)
else:
self._tpot.append(time.perf_counter_ns() - timestamp)
timestamp = time.perf_counter_ns()
out += response.token.text
if response.details:
self._generated_tokens.append(response.details.generated_tokens)
self._semaphore.release()
def print_performance_metrics(self, duration_s: float) -> None:
def line():
print(32 * "-")
line()
print("----- Performance summary -----")
line()
print(f"Throughput: {sum(self._generated_tokens) / duration_s:.1f} tokens/s")
print(f"Throughput: {len(self._generated_tokens) / duration_s:.1f} queries/s")
line()
print("First token latency:")
print(f"\tMedian: \t{statistics.median(self._ttft)*1e-6:.2f}ms")
print(f"\tAverage: \t{statistics.fmean(self._ttft)*1e-6:.2f}ms")
line()
print("Output token latency:")
print(f"\tMedian: \t{statistics.median(self._tpot)*1e-6:.2f}ms")
print(f"\tAverage: \t{statistics.fmean(self._tpot)*1e-6:.2f}ms")
line()

View File

@ -0,0 +1,283 @@
# Examples of Docker Commands for Gaudi Backend
This page provides example `docker run` commands for some of the most popular models.
> **Note:** The parameters are tuned for Gaudi2 hardware to maximize performance on it; please adjust them for your hardware. For example, if you are using Gaudi3, you may want to increase the batch size.
## Default Precision (BF16)
### Llama3.1-8B on 1 card (BF16)
```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e PREFILL_BATCH_BUCKET_SIZE=2 \
-e BATCH_BUCKET_SIZE=32 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 2048 --max-batch-size 32 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
```
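Once the server reports it is connected, you can optionally check the endpoint from the host before sending traffic. Below is a minimal sketch (illustrative only, not part of the shipped examples) that queries the server's `/info` route, assuming the default port mapping from the command above:
```python
# Minimal readiness check for the TGI endpoint started above (illustrative).
import requests

SERVER = "http://localhost:8080"  # adjust if you mapped a different host port

# /info returns the model id and server configuration once TGI is up
response = requests.get(f"{SERVER}/info", timeout=10)
response.raise_for_status()
print("Serving model:", response.json().get("model_id"))
```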
### Llama3.1-70B on 8 cards (BF16)
```bash
model=meta-llama/Meta-Llama-3.1-70B-Instruct
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e BATCH_BUCKET_SIZE=256 \
-e PREFILL_BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 4096 --max-batch-size 256 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512
```
### Llama2-7B on 1 card (BF16)
```bash
model=meta-llama/Llama-2-7b-chat-hf
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e PREFILL_BATCH_BUCKET_SIZE=2 \
-e BATCH_BUCKET_SIZE=32 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 2048 --max-batch-size 32 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
```
### Llama2-70B on 8 cards (BF16)
```bash
model=meta-llama/Llama-2-70b-chat-hf
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e BATCH_BUCKET_SIZE=256 \
-e PREFILL_BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 4096 --max-batch-size 256 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512
```
### Llava-v1.6-Mistral-7B on 1 card (BF16)
```bash
model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-e PREFILL_BATCH_BUCKET_SIZE=1 \
-e BATCH_BUCKET_SIZE=1 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
--max-total-tokens 8192 --max-batch-size 4
```
## FP8 Precision
Please refer to the [FP8 Precision](https://huggingface.co/docs/text-generation-inference/backends/gaudi_new#how-to-use-different-precision-formats) section for more details. You need to measure the model's statistics before running it in FP8 precision.
### Llama3.1-8B on 1 card (FP8)
```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e PREFILL_BATCH_BUCKET_SIZE=2 \
-e BATCH_BUCKET_SIZE=32 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 2048 --max-batch-size 32 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
```
### Llama3.1-70B on 8 cards (FP8)
```bash
model=meta-llama/Meta-Llama-3.1-70B-Instruct
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e BATCH_BUCKET_SIZE=256 \
-e PREFILL_BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 4096 --max-batch-size 256 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512
```
### Llama2-7B on 1 card (FP8)
```bash
model=meta-llama/Llama-2-7b-chat-hf
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e PREFILL_BATCH_BUCKET_SIZE=2 \
-e BATCH_BUCKET_SIZE=32 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 2048 --max-batch-size 32 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
```
### Llama2-70B on 8 cards (FP8)
```bash
model=meta-llama/Llama-2-70b-chat-hf
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e BATCH_BUCKET_SIZE=256 \
-e PREFILL_BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 4096 --max-batch-size 256 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512
```
### Llava-v1.6-Mistral-7B on 1 card (FP8)
```bash
model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e PREFILL_BATCH_BUCKET_SIZE=1 \
-e BATCH_BUCKET_SIZE=1 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
--max-total-tokens 8192 --max-batch-size 4
```
### Llava-v1.6-Mistral-7B on 8 cards (FP8)
```bash
model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e PREFILL_BATCH_BUCKET_SIZE=1 \
-e BATCH_BUCKET_SIZE=1 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
--max-total-tokens 8192 --max-batch-size 4
```

View File

@ -68,10 +68,14 @@ IDEFICS2_IMAGE_TOKEN = "<image>"
IMAGES = re.compile(r"!\[[^\]]*\]\((.*?)\s*(\"(?:.*[^\"])\")?\s*\)")
BASE_IMAGE_TOKENS = int(os.environ.get("BASE_IMAGE_TOKENS", 2048))
MAX_TOTAL_TOKENS = int(os.environ.get("MAX_TOTAL_TOKENS", 8192))
MAX_BATCH_TOTAL_TOKENS = int(os.environ.get("MAX_BATCH_TOTAL_TOKENS", 131072))
PAD_SEQUENCE_TO_MULTIPLE_OF = int(os.environ.get("PAD_SEQUENCE_TO_MULTIPLE_OF", 256))
CHUNK_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
LAZY_MODE = int(os.environ.get("PT_HPU_LAZY_MODE", 1))
MAX_BATCH_SIZE = (
int(os.environ.get("MAX_BATCH_SIZE"))
if os.environ.get("MAX_BATCH_SIZE") is not None
else None
)
PREFILL_WARMUP_BATCH_SIZE_LIST = []
PREFILL_WARMUP_SEQLEN_LIST = []
@ -1197,7 +1201,9 @@ class VlmCausalLM(Model):
return self.batch_from_pb(batch, is_warmup)
def warmup(self, request) -> None:
def warmup(
self, request: generate_pb2.WarmupRequest
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
is_warmup = True
batch = self.batch_from_pb(request.batch, is_warmup)
@ -1210,7 +1216,8 @@ class VlmCausalLM(Model):
f"You need to decrease `--max-batch-prefill-tokens`"
)
global BASE_IMAGE_TOKENS, MAX_TOTAL_TOKENS, MAX_BATCH_TOTAL_TOKENS, PREFILL_WARMUP_BATCH_SIZE_LIST, PREFILL_WARMUP_SEQLEN_LIST, DECODE_WARMUP_BATCH_SIZE_LIST
global BASE_IMAGE_TOKENS, MAX_TOTAL_TOKENS, PREFILL_WARMUP_BATCH_SIZE_LIST, PREFILL_WARMUP_SEQLEN_LIST, DECODE_WARMUP_BATCH_SIZE_LIST
MAX_TOTAL_TOKENS = request.max_total_tokens
max_input_length = batch.input_ids.shape[1]
max_prefill_batch_size = batch.input_ids.shape[0]
PREFILL_WARMUP_BATCH_SIZE_LIST = []
@ -1264,7 +1271,7 @@ class VlmCausalLM(Model):
f"Memory stats: {mem_stats} "
)
max_decode_batch_size = math.floor(MAX_BATCH_TOTAL_TOKENS / MAX_TOTAL_TOKENS)
max_decode_batch_size = MAX_BATCH_SIZE
batch_size = max_prefill_batch_size * 2
# Decode warmup with bigger batch_size
try:
@ -1310,14 +1317,12 @@ class VlmCausalLM(Model):
batches.append(prefill_batch)
_, decode_batch, _ = self.generate_token(batches, is_warmup)
DECODE_WARMUP_BATCH_SIZE_LIST.append(max_decode_batch_size)
max_batch_total_tokens = max_decode_batch_size * MAX_TOTAL_TOKENS
MAX_BATCH_TOTAL_TOKENS = max_batch_total_tokens
except Exception:
raise RuntimeError(
f"Not enough memory to handle batch_size({batch_size}) decode warmup."
f"Decode batch size list:{DECODE_WARMUP_BATCH_SIZE_LIST}"
f"max_decode_batch_size is {max_decode_batch_size}"
f"You need to decrease env `MAX_BATCH_TOTAL_TOKENS` or '--max_batch_total_tokens'"
f"You need to decrease env `MAX_BATCH_SIZE` or '--max_batch_size'"
)
mem_stats = get_hpu_memory_stats(self.device)
@ -1327,4 +1332,8 @@ class VlmCausalLM(Model):
f"Memory stats: {mem_stats}"
)
return MAX_BATCH_TOTAL_TOKENS
max_supported_total_tokens = MAX_BATCH_SIZE * MAX_TOTAL_TOKENS
max_input_tokens = max_input_length
max_total_tokens = MAX_TOTAL_TOKENS
return max_supported_total_tokens, max_input_tokens, max_total_tokens

View File

@ -52,6 +52,8 @@
- sections:
- local: backends/neuron
title: Neuron
- local: backends/gaudi
title: Gaudi
- local: backends/trtllm
title: TensorRT-LLM
- local: backends/llamacpp

View File

@ -0,0 +1,303 @@
# Gaudi Backend for Text Generation Inference
## Overview
Text Generation Inference (TGI) has been optimized to run on Gaudi hardware via a dedicated Gaudi backend.
## Supported Hardware
- **Gaudi1**: Available on [AWS EC2 DL1 instances](https://aws.amazon.com/ec2/instance-types/dl1/)
- **Gaudi2**: Available on [Intel Cloud](https://console.cloud.intel.com/docs/reference/ai_instances.html)
- **Gaudi3**: Available on [Intel Cloud](https://console.cloud.intel.com/docs/reference/ai_instances.html)
## Tutorial: Getting Started with TGI on Gaudi
### Basic Usage
The easiest way to run TGI on Gaudi is to use the official Docker image:
```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_HF_ACCESS_TOKEN
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model
```
Once you see the `Connected` log line, the server is ready to accept requests:
> 2024-05-22T19:31:48.302239Z INFO text_generation_router: router/src/main.rs:378: Connected
You can find your `YOUR_HF_ACCESS_TOKEN` at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). A token is required to access gated models such as Llama 3.1.
### Making Your First Request
You can send a request from a separate terminal:
```bash
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
-H 'Content-Type: application/json'
```
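If you prefer Python over curl, the same request can be sent with the `huggingface_hub` client. The snippet below is a minimal sketch, assuming the server started above is reachable on port 8080:
```python
# Send the same request as the curl example above (minimal sketch).
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:8080")

# Non-streaming call: returns the generated text as a single string
print(client.text_generation("What is Deep Learning?", max_new_tokens=32))

# Streaming call: yields tokens as they are produced
for token in client.text_generation(
    "What is Deep Learning?", max_new_tokens=32, stream=True
):
    print(token, end="", flush=True)
```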
## How-to Guides
### How to Run Specific Models
The following models have been validated on Gaudi2:
| Model | Model ID | BF16 Single Card | BF16 Multi-Card | FP8 Single Card | FP8 Multi-Card |
|-----------------------|----------------------------------------|:----------------:|:---------------:|:---------------:|:---------------:|
| Llama2-7B | meta-llama/Llama-2-7b-chat-hf | ✔ | ✔ | ✔ | ✔ |
| Llama2-70B | meta-llama/Llama-2-70b-chat-hf | | ✔ | | ✔ |
| Llama3-8B | meta-llama/Meta-Llama-3-8B-Instruct | ✔ | ✔ | ✔ | ✔ |
| Llama3-70B | meta-llama/Meta-Llama-3-70B-Instruct | | ✔ | | ✔ |
| Llama3.1-8B | meta-llama/Meta-Llama-3.1-8B-Instruct | ✔ | ✔ | ✔ | ✔ |
| Llama3.1-70B | meta-llama/Meta-Llama-3.1-70B-Instruct | | ✔ | | ✔ |
| CodeLlama-13B | codellama/CodeLlama-13b-hf | ✔ | ✔ | ✔ | ✔ |
| Mixtral-8x7B | mistralai/Mixtral-8x7B-Instruct-v0.1 | ✔ | ✔ | ✔ | ✔ |
| Mistral-7B | mistralai/Mistral-7B-Instruct-v0.3 | ✔ | ✔ | ✔ | ✔ |
| Falcon-180B | tiiuae/falcon-180B-chat | | ✔ | | ✔ |
| Qwen2-72B | Qwen/Qwen2-72B-Instruct | | ✔ | | ✔ |
| Starcoder2-3b | bigcode/starcoder2-3b | ✔ | ✔ | ✔ | |
| Starcoder2-15b | bigcode/starcoder2-15b | ✔ | ✔ | ✔ | |
| Starcoder | bigcode/starcoder | ✔ | ✔ | ✔ | ✔ |
| Gemma-7b | google/gemma-7b-it | ✔ | ✔ | ✔ | ✔ |
| Llava-v1.6-Mistral-7B | llava-hf/llava-v1.6-mistral-7b-hf | ✔ | ✔ | ✔ | ✔ |
To run any of these models:
```bash
model=MODEL_ID_THAT_YOU_WANT_TO_RUN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_ACCESS_TOKEN
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
<text-generation-inference-launcher-arguments>
```
For the full list of service parameters, refer to the [launcher-arguments page](https://huggingface.co/docs/text-generation-inference/reference/launcher).
The validated docker commands can be found in the [examples/docker_commands folder](https://github.com/huggingface/text-generation-inference/tree/main/backends/gaudi/examples/docker_commands).
> Note: `--runtime=habana --cap-add=sys_nice --ipc=host` is required for Docker to access the Gaudi hardware (more details [here](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Docker_Installation.html)).
### How to Enable Multi-Card Inference (Sharding)
TGI-Gaudi supports sharding for multi-card inference, allowing you to distribute the load across multiple Gaudi cards.
For example, on a machine with 8 Gaudi cards (reusing `$model`, `$volume`, and `$hf_token` from the basic usage example above), you can run:
```bash
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
tgi-gaudi \
--model-id $model --sharded true --num-shard 8
```
<Tip>
We recommend always using sharding when running on a multi-card machine.
</Tip>
### How to Use Different Precision Formats
#### BF16 Precision (Default)
By default, all models run with BF16 precision on Gaudi hardware.
#### FP8 Precision
TGI-Gaudi supports FP8 precision inference with [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html).
To run FP8 Inference:
1. Measure statistics using [Optimum Habana measurement script](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8)
2. Run the model in TGI with the `QUANT_CONFIG` setting, e.g. `-e QUANT_CONFIG=./quantization_config/maxabs_quant.json`.
The following command example for FP8 inference assumes that the measurement from step 1 has already been completed.
Example for Llama3.1-70B on 8 cards with FP8 precision:
```bash
model=meta-llama/Meta-Llama-3.1-70B-Instruct
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e HF_TOKEN=$hf_token \
-e MAX_TOTAL_TOKENS=2048 \
-e BATCH_BUCKET_SIZE=256 \
-e PREFILL_BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 1024 --max-total-tokens 2048 \
--max-batch-prefill-tokens 4096 --max-batch-size 256 \
--max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512
```
### How to Run Vision-Language Models (VLMs)
Gaudi supports VLM inference.
Example for Llava-v1.6-Mistral-7B on 1 card:
Start the TGI server via the following command:
```bash
model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 \
--runtime=habana \
--cap-add=sys_nice \
--ipc=host \
-v $volume:/data \
-e PREFILL_BATCH_BUCKET_SIZE=1 \
-e BATCH_BUCKET_SIZE=1 \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model \
--max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
--max-total-tokens 8192 --max-batch-size 4
```
You can then send a request to the server via the following command:
```bash
curl -N 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":32}}' \
-H 'Content-Type: application/json'
```
> Note: In Llava-v1.6-Mistral-7B, an image usually accounts for around 2000 input tokens; an image of size 512x512, for example, is represented by 2800 tokens. `max-input-tokens` must therefore be larger than the number of tokens associated with the image, otherwise the image may be truncated. We set `BASE_IMAGE_TOKENS=2048` as the default image token value; this is also the minimum value of `max-input-tokens`. You can override the environment variable `BASE_IMAGE_TOKENS` to change this value. The warmup will generate graphs with input lengths from `BASE_IMAGE_TOKENS` to `max-input-tokens`. For Llava-v1.6-Mistral-7B, the value of `max-batch-prefill-tokens` is 16384, which follows from `prefill_batch_size = max-batch-prefill-tokens / max-input-tokens`.
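To make that relationship concrete, here is a short sketch using the values from the command above (illustrative only):
```python
# Values from the Llava-v1.6-Mistral-7B command above (illustrative only)
max_input_tokens = 4096
max_batch_prefill_tokens = 16384

# prefill_batch_size = max-batch-prefill-tokens / max-input-tokens
prefill_batch_size = max_batch_prefill_tokens // max_input_tokens
print(prefill_batch_size)  # 4 prefill sequences of up to 4096 tokens each
```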
### How to Benchmark Performance
#### Option 1: Static Batching Benchmark (Recommended)
To run the static batching benchmark, please refer to [TGI's benchmark tool](https://github.com/huggingface/text-generation-inference/tree/main/benchmark).
To run it on the same machine, you can do the following:
* `docker exec -it <docker name> bash`: pick the container started earlier using `docker ps`
* `text-generation-benchmark -t <model-id>`: pass the same model id as in the `docker run` command
* After the tests complete, hit Ctrl+C to see the performance data summary.
> Note: By default, this benchmark runs the model with bs=[1, 2, 4, 8, 16, 32], sequence_length=10 and decode_length=8. To run other configurations, check `text-generation-benchmark -h` and adjust the parameters.
#### Option 2: Continuous Batching Benchmark
To run the continuous batching benchmark, please refer to the [README in the examples/benchmark folder of the Gaudi backend](https://github.com/huggingface/text-generation-inference/tree/main/backends/gaudi/examples/benchmark).
### How to Profile Performance
To collect a performance profile, set the following environment variables:
| Name | Value(s) | Default | Description |
|--------------------| :--------- | :--------------- | :------------------------------------------------------- |
| PROF_WAITSTEP | integer | 0 | Control profile wait steps |
| PROF_WARMUPSTEP | integer | 0 | Control profile warmup steps |
| PROF_STEP | integer | 0 | Enable/disable profile, control profile active steps |
| PROF_PATH | string | /tmp/hpu_profile | Define profile folder |
| PROF_RANKS | string | 0 | Comma-separated list of ranks to profile |
| PROF_RECORD_SHAPES | True/False | False | Control record_shapes option in the profiler |
To use these environment variables, add them to your `docker run` command with the `-e` flag. For example:
```bash
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
-e PROF_WAITSTEP=10 \
-e PROF_WARMUPSTEP=10 \
-e PROF_STEP=1 \
-e PROF_PATH=/tmp/hpu_profile \
-e PROF_RANKS=0 \
-e PROF_RECORD_SHAPES=True \
ghcr.io/huggingface/text-generation-inference:latest-gaudi \
--model-id $model
```
## Explanation: Understanding TGI on Gaudi
### The Warmup Process
To ensure optimal performance, warmup is performed at the beginning of each server run. This process creates queries with various input shapes based on provided parameters and runs basic TGI operations (prefill, decode, concatenate).
Note: Model warmup can take several minutes, especially for FP8 inference. For faster subsequent runs, refer to [Disk Caching Eviction Policy](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html#disk-caching-eviction-policy).
### Understanding Parameter Tuning
#### Sequence Length Parameters
- `--max-input-tokens` is the maximum possible input prompt length. Default value is `4095`.
- `--max-total-tokens` is the maximum possible total length of the sequence (input and output). Default value is `4096`.
#### Batch Size Parameters
- For prefill operation, please set `--max-batch-prefill-tokens` as `bs * max-input-tokens`, where `bs` is your expected maximum prefill batch size.
- For decode operation, please set `--max-batch-size` as `bs`, where `bs` is your expected maximum decode batch size.
- Please note that the batch size will always be padded to the nearest multiple of `BATCH_BUCKET_SIZE` (for decode) and `PREFILL_BATCH_BUCKET_SIZE` (for prefill), as illustrated in the sketch below.
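As a rough illustration of this bucketing behaviour (a simplified sketch under assumed parameter values, not the exact server logic), requested shapes are rounded up to the next bucket boundary:
```python
# Simplified illustration of shape bucketing; parameter values are assumed
# here and the real server logic differs in details.
import math

PREFILL_BATCH_BUCKET_SIZE = 4      # prefill batch sizes padded to multiples of this
BATCH_BUCKET_SIZE = 32             # decode batch sizes padded to multiples of this
PAD_SEQUENCE_TO_MULTIPLE_OF = 256  # prefill sequence lengths padded to multiples of this

def round_up(value: int, multiple: int) -> int:
    """Pad `value` up to the nearest multiple of `multiple`."""
    return int(math.ceil(value / multiple) * multiple)

# A prefill batch of 3 sequences of 700 tokens is padded to a 4 x 768 shape,
# and a decode batch of 40 requests is padded to 64.
print(round_up(3, PREFILL_BATCH_BUCKET_SIZE))      # 4
print(round_up(700, PAD_SEQUENCE_TO_MULTIPLE_OF))  # 768
print(round_up(40, BATCH_BUCKET_SIZE))             # 64
```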
#### Performance and Memory Parameters
- `PAD_SEQUENCE_TO_MULTIPLE_OF` determines the sizes of the input-length buckets. Since warmup creates several graphs for each bucket, it is important to adjust this value proportionally to the input sequence length; otherwise, out-of-memory issues can occur.
- `ENABLE_HPU_GRAPH` enables the use of HPU graphs, which is crucial for performance. It is recommended to keep it set to `true`.
## Reference
This section contains reference information about the Gaudi backend.
### Environment Variables
The following table contains the environment variables that can be used to configure the Gaudi backend:
| Name | Value(s) | Default | Description | Usage |
|-----------------------------| :--------- | :--------------- | :------------------------------------------------------------------------------------------------------------------------------- | :--------------------------- |
| ENABLE_HPU_GRAPH | True/False | True | Enable hpu graph or not | add -e in docker run command |
| LIMIT_HPU_GRAPH | True/False | True | Skip HPU graph usage for prefill to save memory; set to `True` for large sequence/decode lengths (e.g. 300/212) | add -e in docker run command |
| BATCH_BUCKET_SIZE | integer | 8 | Batch size for decode operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs | add -e in docker run command |
| PREFILL_BATCH_BUCKET_SIZE | integer | 4 | Batch size for prefill operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs | add -e in docker run command |
| PAD_SEQUENCE_TO_MULTIPLE_OF | integer | 128 | For prefill operation, sequences will be padded to a multiple of the provided value. | add -e in docker run command |
| SKIP_TOKENIZER_IN_TGI | True/False | False | Skip tokenizer for input/output processing | add -e in docker run command |
| WARMUP_ENABLED | True/False | True | Enable warmup during server initialization to recompile all graphs. This can increase TGI setup time. | add -e in docker run command |
| QUEUE_THRESHOLD_MS | integer | 120 | Controls the threshold beyond which requests are considered overdue and handled with priority. Shorter requests are prioritized otherwise. | add -e in docker run command |
| USE_FLASH_ATTENTION | True/False | True | Whether to enable Habana Flash Attention, provided that the model supports it. Please refer to https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html?highlight=fusedsdpa#using-fused-scaled-dot-product-attention-fusedsdpa | add -e in docker run command |
| FLASH_ATTENTION_RECOMPUTE | True/False | True | Whether to enable Habana Flash Attention in recompute mode on first token generation. | add -e in docker run command |
## Contributing
Contributions to the TGI-Gaudi project are welcome. Please refer to the [contributing guide](https://github.com/huggingface/text-generation-inference/blob/main/CONTRIBUTING.md).
### Building the Docker Image from Source
To build the Docker image from source:
```bash
make -C backends/gaudi image
```
This builds the image and saves it as `tgi-gaudi`. You can then run TGI-Gaudi with this image:
```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data
hf_token=YOUR_ACCESS_TOKEN
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
-p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
tgi-gaudi \
--model-id $model
```
For more details, see the [README of the Gaudi backend](https://github.com/huggingface/text-generation-inference/blob/main/backends/gaudi/README.md) and the [Makefile of the Gaudi backend](https://github.com/huggingface/text-generation-inference/blob/main/backends/gaudi/Makefile).

View File

@ -1,3 +1,3 @@
# Using TGI with Intel Gaudi
Check out this [repository](https://github.com/huggingface/tgi-gaudi) to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index).
You can use TGI on Intel Gaudi using the [TGI gaudi backend](https://huggingface.co/docs/text-generation-inference/backends/gaudi).