# Text-generation-inference - Gaudi backend
## Description
This is the TGI backend for Intel Gaudi. It consists of the TGI server optimized for Gaudi hardware.
## Build your own image
The simplest way to build TGI with the Gaudi backend is to use the provided `Makefile`:

Option 1: From the project root directory:

```bash
make -C backends/gaudi image
```

Option 2: From the Gaudi backend directory:

```bash
cd backends/gaudi
make image
```
You can now run the server with one of the following commands:

Option 1: Sharded:

```bash
model=meta-llama/Llama-3.1-8B-Instruct
hf_token=$(cat ${HOME}/.cache/huggingface/token)
volume=${HOME}/.cache/huggingface

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -p 8080:80 -v $volume:/data \
  -e LOG_LEVEL=debug -e HF_TOKEN=$hf_token \
  tgi-gaudi --model-id $model \
  --sharded true --num-shard 8 \
  --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 8 --max-batch-prefill-tokens 2048
```
Option 2: Non-sharded:

```bash
model=meta-llama/Llama-3.1-8B-Instruct
hf_token=$(cat ${HOME}/.cache/huggingface/token)
volume=${HOME}/.cache/huggingface

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -p 8080:80 -v $volume:/data \
  -e LOG_LEVEL=debug -e HF_TOKEN=$hf_token \
  tgi-gaudi --model-id $model \
  --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 4 --max-batch-prefill-tokens 2048
```
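Model loading and warmup can take a while before the server starts accepting requests. Below is a minimal readiness-poll sketch, assuming the server is mapped to port 8080 on the local host and exposes TGI's `/health` route; adjust the URL and timeouts for your setup.

```python
import time
import urllib.error
import urllib.request

# Illustrative readiness poll (not part of the repository): waits until the
# server's /health route answers with HTTP 200.
HEALTH_URL = "http://127.0.0.1:8080/health"


def wait_until_ready(timeout_s: int = 1800, interval_s: int = 10) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                if resp.status == 200:
                    print("TGI server is ready")
                    return
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet; keep polling
        time.sleep(interval_s)
    raise TimeoutError("TGI server did not become ready in time")


if __name__ == "__main__":
    wait_until_ready()
```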
## Contributing
### Local Development
This is useful if you want to run the server locally for better debugging.

```bash
make -C backends/gaudi run-local-dev-container
```
Then run the following command inside the container to install TGI for Gaudi:

```bash
make -C backends/gaudi local-dev-install
```
Add Rust to your PATH:

```bash
. "$HOME/.cargo/env"
```
Option 1: Run the server (sharded model):

```bash
LOG_LEVEL=debug text-generation-launcher \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --sharded true \
  --num-shard 8 \
  --max-input-tokens 512 \
  --max-total-tokens 1024 \
  --max-batch-size 8 \
  --max-batch-prefill-tokens 2048
```
Option 2: Run the server (non-sharded model):

```bash
LOG_LEVEL=debug text-generation-launcher \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-tokens 512 \
  --max-total-tokens 1024 \
  --max-batch-size 4 \
  --max-batch-prefill-tokens 2048
```
You can then test the server with the following curl command from another terminal (which can be outside the container):

```bash
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
  -H 'Content-Type: application/json'
```
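The same request can also be sent from Python. This is a minimal sketch, assuming the `requests` package is installed and the server is reachable on port 8080; the `/generate` response is a JSON object whose `generated_text` field holds the completion, but verify the shape against your TGI version.

```python
import requests

# Illustrative client for the /generate route shown above; assumes the server
# is listening on 127.0.0.1:8080.
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 20},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```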
### Integration tests
To run the integration tests, you need to first build the image:

```bash
make -C backends/gaudi image
```
Then run the integration tests with:

```bash
make -C backends/gaudi run-integration-tests
```
To capture the expected outputs for the integration tests, you can run the following command:

```bash
make -C backends/gaudi capture-expected-outputs-for-integration-tests
```
#### How the integration tests work

The integration tests work as follows:
1. Start a TGI server in a container, with a command similar to:

```bash
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -p 8080:80 -v $volume:/data \
  -e LOG_LEVEL=debug -e HF_TOKEN=$hf_token \
  tgi-gaudi --model-id $model \
  --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 4 --max-batch-prefill-tokens 2048
```
2. Send a /generate request to the server, with a command similar to:

```bash
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
  -H 'Content-Type: application/json'
```
3. Check the output of the server against the expected output:

```python
assert curl_output == expected_output
```
This is then repeated for a set of models and configurations.
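For illustration only, here is a hedged sketch of what one such parametrized check could look like; the `tgi_server` fixture, the model list, and the expected-output file layout are hypothetical stand-ins, not the actual test harness in this repository.

```python
import json

import pytest
import requests

# Hypothetical model matrix; the real list lives in the integration test suite.
MODELS = ["meta-llama/Llama-3.1-8B-Instruct"]


@pytest.mark.parametrize("model_id", MODELS)
def test_generate_matches_expected(model_id, tgi_server):
    # `tgi_server` is assumed to be a fixture that starts the container for
    # `model_id` and yields its base URL, e.g. "http://127.0.0.1:8080".
    response = requests.post(
        f"{tgi_server}/generate",
        json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}},
        timeout=120,
    )
    response.raise_for_status()

    # Expected outputs are assumed to have been captured beforehand with the
    # capture-expected-outputs-for-integration-tests target (file layout is illustrative).
    expected_path = f"expected_outputs/{model_id.replace('/', '__')}.json"
    with open(expected_path) as f:
        expected_output = json.load(f)

    assert response.json() == expected_output
```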