Text-generation-inference - Gaudi backend

Description

This is the TGI backend for Intel Gaudi. It consists of the TGI server optimized for Gaudi hardware.

Build your own image

The simplest way to build TGI with the Gaudi backend is to use the provided Makefile:

Option 1: From the project root directory:

make -C backends/gaudi image

Option 2: From the Gaudi backend directory:

cd backends/gaudi
make image
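Both options produce a local image. You can confirm the build succeeded before starting a container; this check assumes the image is tagged tgi-gaudi, as used in the run commands below (the exact tag may differ depending on the Makefile):

docker image ls tgi-gaudi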

You can now run the server with one of the following commands:

Option 1: Sharded:

model=meta-llama/Llama-3.1-8B-Instruct
hf_token=$(cat ${HOME}/.cache/huggingface/token)
volume=${HOME}/.cache/huggingface

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -p 8080:80 -v $volume:/data \
  -e LOG_LEVEL=debug -e HF_TOKEN=$hf_token \
  tgi-gaudi --model-id $model \
  --sharded true --num-shard 8 \
  --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 8 --max-batch-prefill-tokens 2048

Option 2: Non-sharded:

model=meta-llama/Llama-3.1-8B-Instruct
hf_token=$(cat ${HOME}/.cache/huggingface/token)
volume=${HOME}/.cache/huggingface

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -p 8080:80 -v $volume:/data \
  -e LOG_LEVEL=debug -e HF_TOKEN=$hf_token \
  tgi-gaudi --model-id $model \
  --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 4 --max-batch-prefill-tokens 2048
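In both cases, port 80 inside the container is mapped to port 8080 on the host. Once the model has finished loading (watch the container logs), you can check that the server is ready using the standard TGI routes:

# Returns 200 once the model is loaded and the server is ready
curl -i 127.0.0.1:8080/health

# Returns model and deployment metadata as JSON
curl 127.0.0.1:8080/info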

Contributing

Local Development

Running the server locally is useful for easier debugging. First, start the development container:

make -C backends/gaudi run-local-dev-container

Then run the following command inside the container to install TGI for Gaudi:

make -C backends/gaudi local-dev-install

Add Rust to your PATH:

. "$HOME/.cargo/env"

Option 1: Run the server (sharded model):

LOG_LEVEL=debug text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --sharded true \
    --num-shard 8 \
    --max-input-tokens 512 \
    --max-total-tokens 1024 \
    --max-batch-size 8 \
    --max-batch-prefill-tokens 2048

Option 2: Run the server (non-sharded model):

LOG_LEVEL=debug text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-tokens 512 \
    --max-total-tokens 1024 \
    --max-batch-size 4 \
    --max-batch-prefill-tokens 2048

You can then test the server from another terminal (it can be outside the container) with the following curl command:

curl 127.0.0.1:8080/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'
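For token-by-token output you can hit the streaming route, and TGI also exposes an OpenAI-compatible Messages API. Both examples below assume the same server as above and use "tgi" as a placeholder model name for the chat endpoint:

# Server-sent events, one token per event
curl 127.0.0.1:8080/generate_stream \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'

# OpenAI-compatible chat completions
curl 127.0.0.1:8080/v1/chat/completions \
     -X POST \
     -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}],"max_tokens":20}' \
     -H 'Content-Type: application/json'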