
# TensorRT-LLM backend

The NVIDIA TensorRT-LLM (TRTLLM) backend is a high-performance backend for LLMs that uses NVIDIA's TensorRT library for inference acceleration. It makes use of specific optimizations for NVIDIA GPUs, such as custom kernels.

To use the TRTLLM backend, you need to compile engines for the models you want to serve. Each engine must be compiled for the same GPU architecture that you will use for inference.
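If you are not sure which architecture your inference GPUs expose, a quick check (assuming a driver recent enough to support the `compute_cap` query field) is:

```bash
# Print each GPU's name and compute capability (e.g. 9.0 for Hopper)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```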

## Supported models

Check the TensorRT-LLM support matrix to see which models are supported.

## Compiling engines

You can use Optimum-NVIDIA to compile engines for the models you want to use.

```bash
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"

# Install the Hugging Face Hub CLI with fast-transfer support
python -m pip install "huggingface_hub[hf_transfer]"

# Log in to the Hugging Face Hub
huggingface-cli login

# Create a directory to store the model
mkdir -p /tmp/models/$MODEL_NAME

# Create a directory to store the compiled engine
mkdir -p /tmp/engines/$MODEL_NAME

# Download the model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --local-dir /tmp/models/$MODEL_NAME $MODEL_NAME

# Compile the engine using Optimum-NVIDIA
docker run \
  --rm \
  -it \
  --gpus=1 \
  -v /tmp/models/$MODEL_NAME:/model \
  -v /tmp/engines/$MODEL_NAME:/engine \
  huggingface/optimum-nvidia \
    optimum-cli export trtllm \
    --tp=1 \
    --pp=1 \
    --max-batch-size=128 \
    --max-input-length 4096 \
    --max-output-length 8192 \
    --max-beams-width=1 \
    --destination /engine \
    $MODEL_NAME
```

Your compiled engine will be saved in the `/tmp/engines/$MODEL_NAME` directory.
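Before serving, you can sanity-check the export by listing the engine directory. The exact file names depend on the TensorRT-LLM version, but you should typically see a `config.json` next to one engine file per rank:

```bash
# Inspect the exported engine artifacts (file names vary by TensorRT-LLM version)
ls -lh /tmp/engines/$MODEL_NAME
```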

## Using the TRTLLM backend

Run the TGI-TRTLLM Docker image with the compiled engine:

```bash
docker run \
  --gpus 1 \
  -it \
  --rm \
  -p 3000:3000 \
  -e MODEL=$MODEL_NAME \
  -e PORT=3000 \
  -e HF_TOKEN='hf_XXX' \
  -v /tmp/engines/$MODEL_NAME:/data \
  ghcr.io/huggingface/text-generation-inference:latest-trtllm \
  --executor-worker executorWorker \
  --model-id /data/$MODEL_NAME
```
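Once the container is running, you can send a quick request to TGI's standard `/generate` endpoint (this sketch assumes the server is reachable on `localhost:3000`, as mapped above):

```bash
# Minimal smoke test against the TGI HTTP API
curl http://localhost:3000/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}'
```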

## Development

To develop the TRTLLM backend, you can use the dev containers located in the `.devcontainer` directory.
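For example, one possible workflow uses the Dev Containers CLI (VS Code's "Reopen in Container" command works as well); the exact configuration to pick depends on how the repository's `.devcontainer` folder is laid out:

```bash
# Assumes the Dev Containers CLI is installed: npm install -g @devcontainers/cli
git clone https://github.com/huggingface/text-generation-inference.git
cd text-generation-inference
# If several configurations exist, point --config at the TRTLLM one
devcontainer up --workspace-folder .
```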