# TensorRT-LLM backend

The NVIDIA TensorRT-LLM (TRTLLM) backend is a high-performance backend for LLMs
that uses NVIDIA's TensorRT library for inference acceleration.
It makes use of specific optimizations for NVIDIA GPUs, such as custom kernels.

To use the TRTLLM backend, you need to compile `engines` for the models you want to use.
Each `engine` must be compiled on the same GPU architecture that you will use for inference.
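If you are unsure which architecture a machine has, one way to check is with `nvidia-smi` (a quick sketch; the `compute_cap` query field needs a reasonably recent driver):

```bash
# Print the GPU name and its compute capability
# (e.g. 8.0 = Ampere A100, 8.9 = Ada L4/L40S, 9.0 = Hopper H100)
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```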
## Supported models

Check the [support matrix](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html) to see which models are
supported.

## Compiling engines

You can use [Optimum-NVIDIA](https://github.com/huggingface/optimum-nvidia) to compile engines for the models you
want to use.

```bash
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"

# Install the Hugging Face Hub CLI
python -m pip install "huggingface_hub[cli,hf_transfer]"

# Login to the Hugging Face Hub
huggingface-cli login

# Create a directory to store the model
mkdir -p /tmp/models/$MODEL_NAME

# Create a directory to store the compiled engine
mkdir -p /tmp/engines/$MODEL_NAME

# Download the model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --local-dir /tmp/models/$MODEL_NAME $MODEL_NAME

# Compile the engine using Optimum-NVIDIA
docker run \
    --rm \
    -it \
    --gpus=1 \
    -v /tmp/models/$MODEL_NAME:/model \
    -v /tmp/engines/$MODEL_NAME:/engine \
    huggingface/optimum-nvidia \
    optimum-cli export trtllm \
    --tp=1 \
    --pp=1 \
    --max-batch-size=128 \
    --max-input-length 4096 \
    --max-output-length 8192 \
    --max-beams-width=1 \
    --destination /engine \
    $MODEL_NAME
```
Your compiled engine will be saved in the `/tmp/engines/$MODEL_NAME` directory.
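As a quick sanity check, you can list that directory before serving it (using the paths from the example above):

```bash
# Inspect the compiled engine directory produced by the export step
ls -lh /tmp/engines/$MODEL_NAME
```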
## Using the TRTLLM backend

Run the TGI-TRTLLM Docker image with the compiled engine:

```bash
docker run \
    --gpus 1 \
    -it \
    --rm \
    -p 3000:3000 \
    -e MODEL=$MODEL_NAME \
    -e PORT=3000 \
    -e HF_TOKEN='hf_XXX' \
    -v /tmp/engines/$MODEL_NAME:/data \
    ghcr.io/huggingface/text-generation-inference:latest-trtllm \
    --executor-worker executorWorker \
    --model-id /data/$MODEL_NAME
```
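Once the server reports it is ready, you can verify the backend end to end. A minimal sketch against TGI's `/generate` endpoint (the prompt and generation parameters below are only placeholders):

```bash
curl http://localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'
```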
## Development

To develop the TRTLLM backend, you can use [dev containers](https://containers.dev/) located in the
`.devcontainer` directory.
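For example, if you have the [Dev Containers CLI](https://github.com/devcontainers/cli) installed, a sketch of bringing the environment up from the repository root looks like this (any dev-container-aware editor works just as well):

```bash
# Build and start the dev container defined in .devcontainer/
devcontainer up --workspace-folder .

# Open a shell inside the running container
devcontainer exec --workspace-folder . bash
```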