# TensorRT-LLM backend

The NVIDIA TensorRT-LLM (TRTLLM) backend is a high-performance backend for LLMs
that uses NVIDIA's TensorRT library for inference acceleration.
It makes use of specific optimizations for NVIDIA GPUs, such as custom kernels.

To use the TRTLLM backend you need to compile `engines` for the models you want to use.
Each `engine` must be compiled on the same GPU architecture that you will use for inference.

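If you are unsure which architecture your inference GPUs use, `nvidia-smi` can report the device name and compute
capability (the `compute_cap` query field requires a reasonably recent NVIDIA driver):

```bash
# Print each GPU's name and compute capability (e.g. 9.0 for H100, 8.0 for A100)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```
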
## Supported models

Check the [support matrix](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html) to see which models
are supported.

## Compiling engines

You can use [Optimum-NVIDIA](https://github.com/huggingface/optimum-nvidia) to compile engines for the models you
want to use.

```bash
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"

# Install the Hugging Face Hub CLI (provides the `huggingface-cli` command)
python -m pip install "huggingface_hub[hf_transfer]"

# Log in to the Hugging Face Hub
huggingface-cli login

# Create a directory to store the model
mkdir -p /tmp/models/$MODEL_NAME

# Create a directory to store the compiled engine
mkdir -p /tmp/engines/$MODEL_NAME

# Download the model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --local-dir /tmp/models/$MODEL_NAME $MODEL_NAME

# Compile the engine using Optimum-NVIDIA
docker run \
    --rm \
    -it \
    --gpus=1 \
    -v /tmp/models/$MODEL_NAME:/model \
    -v /tmp/engines/$MODEL_NAME:/engine \
    huggingface/optimum-nvidia \
    optimum-cli export trtllm \
    --tp=1 \
    --pp=1 \
    --max-batch-size=128 \
    --max-input-length 4096 \
    --max-output-length 8192 \
    --max-beams-width=1 \
    --destination /engine \
    $MODEL_NAME
```

Your compiled engine will be saved in the `/tmp/engines/$MODEL_NAME` directory.

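As a quick sanity check, you can list the output directory. The exact file names depend on the Optimum-NVIDIA and
TensorRT-LLM versions, but you should see the serialized engine file(s) alongside the engine configuration:

```bash
# Inspect the compiled engine artifacts
ls -lh /tmp/engines/$MODEL_NAME
```
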
## Using the TRTLLM backend

Run the TGI-TRTLLM Docker image with the compiled engine:

```bash
docker run \
    --gpus 1 \
    -it \
    --rm \
    -p 3000:3000 \
    -e MODEL=$MODEL_NAME \
    -e PORT=3000 \
    -e HF_TOKEN='hf_XXX' \
    -v /tmp/engines/$MODEL_NAME:/data \
    ghcr.io/huggingface/text-generation-inference:latest-trtllm \
    --executor-worker executorWorker \
    --model-id /data/$MODEL_NAME
```

## Development

To develop the TRTLLM backend, you can use [dev containers](https://containers.dev/) located in the
`.devcontainer` directory.
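
If you do not use the VS Code "Reopen in Container" flow, the same configuration can also be started from a terminal
with the Dev Containers CLI (a sketch, assuming Node.js and Docker are available):

```bash
# Install the Dev Containers CLI and start the container defined in .devcontainer
npm install -g @devcontainers/cli
devcontainer up --workspace-folder .
```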