# Text-generation-inference - Gaudi backend

## Description

This is the TGI backend for Intel Gaudi. This backend is composed of the TGI server optimized for Gaudi hardware.

## Build your own image

The simplest way to build TGI with the Gaudi backend is to use the provided `Makefile`:

Option 1: From the project root directory:
```bash
make -C backends/gaudi image
```

Option 2: From the Gaudi backend directory:
```bash
cd backends/gaudi
make image
```

You can now run the server with the following command:
```bash
model=meta-llama/Llama-3.1-8B-Instruct
hf_token=$(cat ${HOME}/.cache/huggingface/token)
volume=${HOME}/.cache/huggingface

docker run -p 8080:80 -v $volume:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e LOG_LEVEL=debug \
  -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HF_TOKEN=$hf_token -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true \
  -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice \
  --ipc=host tgi-gaudi --model-id $model --sharded true \
  --num-shard 8 --max-input-tokens 512 --max-total-tokens 1024 --max-batch-size 8 --max-batch-prefill-tokens 2048 --max-batch-total-tokens 8192
```

## Contributing

### Local Development

This is useful if you want to run the server locally for easier debugging. Start a development container:

```bash
make -C backends/gaudi run-local-dev-container
```

Then run the following command inside the container to install TGI for Gaudi:
```bash
make -C backends/gaudi local-dev-install
```

Add Rust to your PATH:
```bash
. "$HOME/.cargo/env"
```

Option 1: Run the server (sharded model):
```bash
LOG_LEVEL=debug text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --sharded true \
    --num-shard 8 \
    --max-input-tokens 512 \
    --max-total-tokens 1024 \
    --max-batch-size 8 \
    --max-batch-prefill-tokens 2048
```

Option 2: Run the server (non-sharded model):
```bash
LOG_LEVEL=debug text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-tokens 512 \
    --max-total-tokens 1024 \
    --max-batch-size 4 \
    --max-batch-prefill-tokens 2048
```

You can then test the server with the following curl command from another terminal (it can be run outside the container):

```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
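
Besides `/generate`, TGI also exposes a `/health` route, and recent versions ship an OpenAI-compatible `/v1/chat/completions` route (Messages API). The sketch below assumes your TGI version includes that route and that the served model has a chat template (the Llama 3.1 Instruct checkpoint does):

```bash
# Readiness check: /health returns 200 once the model shards are loaded.
curl -s -o /dev/null -w "%{http_code}\n" 127.0.0.1:8080/health

# Chat-style request via the OpenAI-compatible Messages API
# (assumes a TGI version that exposes /v1/chat/completions).
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}],"max_tokens":20}' \
    -H 'Content-Type: application/json'
```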