# TensorRT-LLM backend

The NVIDIA TensorRT-LLM (TRTLLM) backend is a high-performance backend for LLMs
that uses NVIDIA's TensorRT library for inference acceleration.
It makes use of specific optimizations for NVIDIA GPUs, such as custom kernels.

To use the TRTLLM backend, you need to compile `engines` for the models you want to use.
Each `engine` must be compiled on the same GPU architecture that you will use for inference.
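If you are unsure which architecture a machine has, one way to check is with `nvidia-smi` (a quick sketch; the `compute_cap` query field needs a reasonably recent driver):

```bash
# Print the GPU name and its compute capability
# (e.g. 8.0 = Ampere A100, 8.9 = Ada L4/L40S, 9.0 = Hopper H100)
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```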
## Supported models

Check the [support matrix](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html) to see which models are
supported.

## Compiling engines

You can use [Optimum-NVIDIA](https://github.com/huggingface/optimum-nvidia) to compile engines for the models you
want to use.

```bash
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"

# Install the Hugging Face Hub CLI
python -m pip install "huggingface_hub[cli,hf_transfer]"

# Login to the Hugging Face Hub
huggingface-cli login

# Create a directory to store the model
mkdir -p /tmp/models/$MODEL_NAME

# Create a directory to store the compiled engine
mkdir -p /tmp/engines/$MODEL_NAME

# Download the model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --local-dir /tmp/models/$MODEL_NAME $MODEL_NAME

# Compile the engine using Optimum-NVIDIA
docker run \
    --rm \
    -it \
    --gpus=1 \
    -v /tmp/models/$MODEL_NAME:/model \
    -v /tmp/engines/$MODEL_NAME:/engine \
    huggingface/optimum-nvidia \
    optimum-cli export trtllm \
    --tp=1 \
    --pp=1 \
    --max-batch-size=128 \
    --max-input-length 4096 \
    --max-output-length 8192 \
    --max-beams-width=1 \
    --destination /engine \
    $MODEL_NAME
```
Your compiled engine will be saved in the `/tmp/engines/$MODEL_NAME` directory.
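As a quick sanity check, you can list that directory before serving it (using the paths from the example above):

```bash
# Inspect the compiled engine directory produced by the export step
ls -lh /tmp/engines/$MODEL_NAME
```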
## Using the TRTLLM backend

Run the TGI-TRTLLM Docker image with the compiled engine:

```bash
docker run \
    --gpus 1 \
    -it \
    --rm \
    -p 3000:3000 \
    -e MODEL=$MODEL_NAME \
    -e PORT=3000 \
    -e HF_TOKEN='hf_XXX' \
    -v /tmp/engines/$MODEL_NAME:/data \
    ghcr.io/huggingface/text-generation-inference:latest-trtllm \
    --executor-worker executorWorker \
    --model-id /data/$MODEL_NAME
```
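Once the server reports it is ready, you can verify the backend end to end. A minimal sketch against TGI's `/generate` endpoint (the prompt and generation parameters below are only placeholders):

```bash
curl http://localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'
```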
## Development

To develop the TRTLLM backend, you can use [dev containers](https://containers.dev/) located in the
`.devcontainer` directory.
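For example, if you have the [Dev Containers CLI](https://github.com/devcontainers/cli) installed, a sketch of bringing the environment up from the repository root looks like this (any dev-container-aware editor works just as well):

```bash
# Build and start the dev container defined in .devcontainer/
devcontainer up --workspace-folder .

# Open a shell inside the running container
devcontainer exec --workspace-folder . bash
```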