# TensorRT-LLM backend

The NVIDIA TensorRT-LLM (TRTLLM) backend is a high-performance backend for LLMs
that uses NVIDIA's TensorRT library for inference acceleration.
It makes use of specific optimizations for NVIDIA GPUs, such as custom kernels.

To use the TRTLLM backend, you need to compile `engines` for the models you want to use.
Each `engine` must be compiled on the same GPU architecture that you will use for inference.
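To verify that the build and inference GPUs match, you can compare their compute capabilities. A minimal check, assuming a driver recent enough for `nvidia-smi` to support the `compute_cap` query field:

```bash
# Print the GPU name and compute capability (e.g. "NVIDIA A100-SXM4-80GB, 8.0").
# Run this on both the build machine and the inference machine and compare the results.
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```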
## Supported models
Check the [support matrix](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html) to see which models are
supported.
## Compiling engines
You can use [Optimum-NVIDIA](https://github.com/huggingface/optimum-nvidia) to compile engines for the models you
want to use.
```bash
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
# Install the Hugging Face Hub CLI
python -m pip install "huggingface_hub[hf_transfer]"
# Login to the Hugging Face Hub
huggingface-cli login
# Create a directory to store the model
mkdir -p /tmp/models/$MODEL_NAME
# Create a directory to store the compiled engine
mkdir -p /tmp/engines/$MODEL_NAME
# Download the model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --local-dir /tmp/models/$MODEL_NAME $MODEL_NAME
# Compile the engine using Optimum-NVIDIA
docker run \
  --rm \
  -it \
  --gpus=1 \
  -v /tmp/models/$MODEL_NAME:/model \
  -v /tmp/engines/$MODEL_NAME:/engine \
  huggingface/optimum-nvidia \
  optimum-cli export trtllm \
    --tp=1 \
    --pp=1 \
    --max-batch-size=128 \
    --max-input-length 4096 \
    --max-output-length 8192 \
    --max-beams-width=1 \
    --destination /engine \
    $MODEL_NAME
```
Your compiled engine will be saved in the `/tmp/engines/$MODEL_NAME` directory.
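After the export finishes, a quick sanity check is to list the output directory; the exact file names vary by model and TensorRT-LLM version, so treat this only as a rough confirmation that artifacts were written:

```bash
# List the exported engine artifacts (names vary by model and TensorRT-LLM version)
ls -lh /tmp/engines/$MODEL_NAME
```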
## Using the TRTLLM backend

Run the TGI-TRTLLM Docker image with the compiled engine:
```bash
docker run \
  --gpus 1 \
  -it \
  --rm \
  -p 3000:3000 \
  -e MODEL=$MODEL_NAME \
  -e PORT=3000 \
  -e HF_TOKEN='hf_XXX' \
  -v /tmp/engines/$MODEL_NAME:/data \
  ghcr.io/huggingface/text-generation-inference:latest-trtllm \
  --executor-worker executorWorker \
  --model-id /data/$MODEL_NAME
```
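Once the container is running, you can send a request to TGI's `generate` endpoint to confirm the engine is serving. A minimal example, assuming the server is reachable on `localhost:3000` as configured above:

```bash
curl http://localhost:3000/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}'
```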
## Development

To develop the TRTLLM backend, you can use the [dev containers](https://containers.dev/) located in
the `.devcontainer` directory.