mirror of https://github.com/huggingface/text-generation-inference.git
synced 2025-09-11 12:24:53 +00:00

chore: Add doc and CI for TRTLLM (#2799)

* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* doc: Formatting

parent f0cd4742c2
commit ab6591e759
.github/workflows/build.yaml (10 changes)
@@ -8,6 +8,7 @@ on:
       description: Hardware
       # options:
       # - cuda
+      # - cuda-trtllm
       # - rocm
       # - intel
       required: true
@@ -52,6 +53,15 @@ jobs:
             export platform=""
             export extra_pytest=""
             ;;
+          cuda-trtllm)
+            export dockerfile="Dockerfile_trtllm"
+            export label_extension="-trtllm"
+            export docker_volume="/mnt/cache"
+            export docker_devices=""
+            export runs_on="ubuntu-latest"
+            export platform=""
+            export extra_pytest=""
+            ;;
           rocm)
             export dockerfile="Dockerfile_amd"
             export label_extension="-rocm"
.github/workflows/ci_build.yaml (2 changes)
@@ -37,7 +37,7 @@ jobs:
       # fail-fast is true by default
       fail-fast: false
       matrix:
-        hardware: ["cuda", "rocm", "intel-xpu", "intel-cpu"]
+        hardware: ["cuda", "cuda-trtllm", "rocm", "intel-xpu", "intel-cpu"]
     uses: ./.github/workflows/build.yaml # calls the one above ^
     permissions:
       contents: write
docs/source/_toctree.yml
@@ -17,6 +17,8 @@
     title: Using TGI with Intel GPUs
   - local: installation
     title: Installation from source
+  - local: multi_backend_support
+    title: Multi-backend support
   - local: architecture
     title: Internal Architecture
@@ -45,6 +47,10 @@
   - local: basic_tutorials/train_medusa
     title: Train Medusa
   title: Tutorials
+- sections:
+  - local: backends/trtllm
+    title: TensorRT-LLM
+  title: Backends
 - sections:
   - local: reference/launcher
     title: All TGI CLI options
docs/source/architecture.md
@@ -9,8 +9,10 @@ A high-level architecture diagram can be seen here:

 This diagram shows well there are these separate components:

 - **The router**, also named `webserver`, that receives the client requests, buffers them, creates some batches, and prepares gRPC calls to a model server.
-- **The model server**, responsible of receiving the gRPC requests and to process the inference on the model. If the model is sharded across multiple accelerators (e.g.: multiple GPUs), the model server shards might be synchronized via NCCL or equivalent.
 - **The launcher** is a helper that will be able to launch one or several model servers (if model is sharded), and it launches the router with the compatible arguments.
+- **The model server**, responsible for receiving the gRPC requests and processing the inference on the model. If the model is sharded across multiple accelerators (e.g. multiple GPUs), the model server shards might be synchronized via NCCL or equivalent.
+
+Note that for other backends (e.g. TRTLLM) the model server and launcher are specific to the backend.

 The router and the model server can be two different machines, they do not need to be deployed together.
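To make the component description above concrete, here is a minimal sketch (ours, with an example model) of the default-backend flow: the launcher spawns the model-server shard(s) and then starts the router with matching arguments.

```bash
# Illustrative only: the launcher starts the model server shard(s), then the
# router (webserver) that buffers and batches client requests for them.
text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --num-shard 2 \
    --port 8080
```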
docs/source/backends/trtllm.md (new file, 81 lines)
@@ -0,0 +1,81 @@
# TensorRT-LLM backend

The NVIDIA TensorRT-LLM (TRTLLM) backend is a high-performance backend for LLMs
that uses NVIDIA's TensorRT library for inference acceleration.
It makes use of specific optimizations for NVIDIA GPUs, such as custom kernels.

To use the TRTLLM backend you need to compile `engines` for the models you want to use.
Each `engine` must be compiled on the same GPU architecture that you will use for inference.
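Not part of the original doc: if you want to confirm which architecture a machine exposes before compiling, recent NVIDIA drivers let `nvidia-smi` report the compute capability directly.

```bash
# Print GPU name and compute capability so you can verify the build machine
# matches the inference machine (requires a reasonably recent driver).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```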

## Supported models

Check the [support matrix](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html) to see which models are supported.

## Compiling engines

You can use [Optimum-NVIDIA](https://github.com/huggingface/optimum-nvidia) to compile engines for the models you want to use.
```bash
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"

# Install the Hugging Face Hub CLI with fast-transfer support
python -m pip install "huggingface_hub[hf_transfer]"

# Login to the Hugging Face Hub
huggingface-cli login

# Create a directory to store the model
mkdir -p /tmp/models/$MODEL_NAME

# Create a directory to store the compiled engine
mkdir -p /tmp/engines/$MODEL_NAME

# Download the model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --local-dir /tmp/models/$MODEL_NAME $MODEL_NAME

# Compile the engine using Optimum-NVIDIA
docker run \
    --rm \
    -it \
    --gpus=1 \
    -v /tmp/models/$MODEL_NAME:/model \
    -v /tmp/engines/$MODEL_NAME:/engine \
    huggingface/optimum-nvidia \
    optimum-cli export trtllm \
    --tp=1 \
    --pp=1 \
    --max-batch-size=128 \
    --max-input-length 4096 \
    --max-output-length 8192 \
    --max-beams-width=1 \
    --destination /engine \
    $MODEL_NAME
```

Your compiled engine will be saved in the `/tmp/engines/$MODEL_NAME` directory.
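As an optional sanity check (our addition, not in the original doc), list the export directory before serving it; the exact artifact names depend on the Optimum-NVIDIA/TensorRT-LLM version.

```bash
# Inspect the compiled artifacts; expect engine and config files here,
# though their exact names vary between versions.
ls -lh /tmp/engines/$MODEL_NAME
```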

## Using the TRTLLM backend

Run the TGI-TRTLLM Docker image with the compiled engine:
```bash
docker run \
    --gpus 1 \
    -it \
    --rm \
    -p 3000:3000 \
    -e MODEL=$MODEL_NAME \
    -e PORT=3000 \
    -e HF_TOKEN='hf_XXX' \
    -v /tmp/engines/$MODEL_NAME:/data \
    ghcr.io/huggingface/text-generation-inference:latest-trtllm \
    --executor-worker executorWorker \
    --model-id /data/$MODEL_NAME
```
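Once the container is up, you can exercise it like any other TGI deployment; this request is our illustration, using the port mapping above and TGI's `/generate` endpoint.

```bash
# Send a simple generation request to the running server (port 3000 as mapped above).
curl http://localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}'
```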

## Development

To develop the TRTLLM backend, you can use [dev containers](https://containers.dev/) located in the `.devcontainer` directory.
docs/source/multi_backend_support.md (new file, 13 lines)
@@ -0,0 +1,13 @@
# Multi-backend support

TGI (Text Generation Inference) offers flexibility by supporting multiple backends for serving large language models (LLMs).
With multi-backend support, you can choose the backend that best suits your needs,
whether you prioritize performance, ease of use, or compatibility with specific hardware. API interaction with
TGI remains consistent across backends, allowing you to switch between them seamlessly.

**Supported backends:**
* **TGI CUDA backend**: This high-performance backend is optimized for NVIDIA GPUs and serves as the default option
  within TGI. Developed in-house, it boasts numerous optimizations and is used in production by various projects, including those by Hugging Face.
* **[TGI TRTLLM backend](./backends/trtllm)**: This backend leverages NVIDIA's TensorRT library to accelerate LLM inference.
  It utilizes specialized optimizations and custom kernels for enhanced performance.
  However, it requires a model-specific compilation step for each GPU architecture.
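To illustrate that consistency, here is a sketch (ours, with assumed volume paths and an example model) of serving on the default CUDA backend; the request at the end is the same one you would send to a TRTLLM deployment such as the one in `docs/source/backends/trtllm.md`.

```bash
# Default CUDA backend: serve a model straight from the Hugging Face Hub.
docker run --gpus all --shm-size 1g -p 8080:80 -v /tmp/tgi-data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct

# The client-facing API is the same whichever backend serves the model.
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}'
```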