Text-generation-inference - Gaudi backend

Description

This is the TGI backend for Intel Gaudi. It consists of the TGI server optimized for Intel Gaudi hardware.

Build your own image

The simplest way to build TGI with the Gaudi backend is to use the provided Makefile:

Option 1: From the project root directory:

make -C backends/gaudi image

Option 2: From the Gaudi backend directory:

cd backends/gaudi
make image
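
Either target builds a local Docker image. As a quick sanity check, you can confirm the image is available (a minimal sketch, assuming the Makefile tags the image as tgi-gaudi, the name used in the run commands below):

# List locally built images with the assumed tag
docker images tgi-gaudi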

You can now run the server with one of the following commands:

Option 1: Sharded:

model=meta-llama/Llama-3.1-8B-Instruct
hf_token=$(cat ${HOME}/.cache/huggingface/token)
volume=${HOME}/.cache/huggingface

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -p 8080:80 -v $volume:/data \
  -e LOG_LEVEL=debug -e HF_TOKEN=$hf_token \
  tgi-gaudi --model-id $model \
  --sharded true --num-shard 8 \
  --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 8 --max-batch-prefill-tokens 2048

Option 2: Non-sharded:

model=meta-llama/Llama-3.1-8B-Instruct
hf_token=$(cat ${HOME}/.cache/huggingface/token)
volume=${HOME}/.cache/huggingface

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -p 8080:80 -v $volume:/data \
  -e LOG_LEVEL=debug -e HF_TOKEN=$hf_token \
  tgi-gaudi --model-id $model \
  --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 4 --max-batch-prefill-tokens 2048
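
With either option, the first startup can take a while because the model weights are downloaded and the Gaudi warmup runs before the server accepts requests. A minimal readiness check, assuming the port mapping above and TGI's standard /health route:

# Poll the health endpoint until the server is ready to accept requests
until curl -sf 127.0.0.1:8080/health > /dev/null; do
  sleep 5
done
echo "server is ready"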

Contributing

Local Development

This is useful if you want to run the server locally for easier debugging. Start a development container with:

make -C backends/gaudi run-local-dev-container

Then run the following command inside the container to install TGI for Gaudi:

make -C backends/gaudi local-dev-install

Add Rust to your PATH:

. "$HOME/.cargo/env"
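
You can then sanity-check the development install before launching (a minimal check; it assumes local-dev-install puts both CLIs on your PATH):

# Both commands should print usage information if the install succeeded
text-generation-launcher --help
text-generation-server --help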

Option 1: Run the server (sharded model):

LOG_LEVEL=debug text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --sharded true \
    --num-shard 8 \
    --max-input-tokens 512 \
    --max-total-tokens 1024 \
    --max-batch-size 8 \
    --max-batch-prefill-tokens 2048

Option 2: Run the server (non-sharded model):

LOG_LEVEL=debug text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-tokens 512 \
    --max-total-tokens 1024 \
    --max-batch-size 4 \
    --max-batch-prefill-tokens 2048

You can then test the server with the following curl command from another terminal (which can be outside the container):

curl 127.0.0.1:8080/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'
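
If you prefer token-by-token output, TGI also exposes a streaming variant of the same route; a minimal example, assuming the server setup above:

# Stream tokens back as server-sent events instead of a single JSON response
curl 127.0.0.1:8080/generate_stream \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'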

Integration tests

To run the integration tests, first build the image:

make -C backends/gaudi image

Then run the integration tests:

make -C backends/gaudi run-integration-tests

To capture the expected outputs for the integration tests, run:

make -C backends/gaudi capture-expected-outputs-for-integration-tests

How the integration tests work

The integration tests work as follows:

  1. Start a TGI server in a container, similar to the following command:
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
  -p 8080:80 -v $volume:/data \
  -e LOG_LEVEL=debug -e HF_TOKEN=$hf_token \
  tgi-gaudi --model-id $model \
  --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 4 --max-batch-prefill-tokens 2048
  2. Send a /generate request to the server, similar to the following command:
curl 127.0.0.1:8080/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'
  3. Check the server output against the expected output:
assert curl_output == expected_output

This is then repeated for a set of models and configurations.
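
Condensed into shell form, the flow for each model looks roughly like the sketch below. This is an illustration, not the actual test harness; the container name, expected-output file layout, and jq-based field extraction are assumptions:

# Hypothetical sketch of the integration-test loop described above
for model in meta-llama/Llama-3.1-8B-Instruct; do
  # 1. Start a TGI server in a container
  docker run -d --name tgi-test --runtime=habana --ipc=host --cap-add=sys_nice \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    tgi-gaudi --model-id $model \
    --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 4 --max-batch-prefill-tokens 2048
  until curl -sf 127.0.0.1:8080/health > /dev/null; do sleep 5; done

  # 2. Send a /generate request
  output=$(curl -s 127.0.0.1:8080/generate -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq -r '.generated_text')

  # 3. Compare with the captured expected output (hypothetical file layout)
  expected=$(cat expected_outputs/$(echo $model | tr '/' '_').txt)
  [ "$output" = "$expected" ] && echo "PASS $model" || echo "FAIL $model"

  docker rm -f tgi-test
done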