mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-05 15:30:19 +00:00

Funtowicz Morgan ea7f4082c4

TensorRT-LLM backend bump to latest version + misc fixes (#2791 )

* misc(cmake) update dependencies

* feat(hardware) enable new hardware.hpp and unittests

* test(ctest) enable address sanitizer

* feat(backend): initial rewrite of the backend for simplicity

* feat(backend): remove all the logs from hardware.hpp

* feat(backend): added some logging

* feat(backend): enable compiler warning if support for RVO not applying

* feat(backend): missing return statement

* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder

* feat(backend): delete previous backend impl

* feat(backend): more impl

* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility

* feat(backend): allow overriding which Python to use

* feat(backend): fix backend_exception_t -> backend_error_t naming

* feat(backend): impl missing generation_step_t as return value of pull_tokens

* feat(backend): make backend_workspace_t::engines_folder constexpr

* feat(backend): fix main.rs retrieving the tokenizer

* feat(backend): add guard to multiple header definitions

* test(backend): add more unittest

* feat(backend): remove constexpr from par

* feat(backend): remove constexpig

* test(backend): more test coverage

* chore(trtllm): update dependency towards 0.15.0

* effectively cancel the request on the executor

* feat(backend) fix moving backend when pulling

* feat(backend): make sure we can easily cancel request on the executor

* feat(backend): fix missing "0" field access

* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut

* chore: Add doc and CI for TRTLLM (#2799)

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* doc: Formatting

* misc(backend): indent

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

2024-12-13 15:50:59 +01:00

1018 B

Raw Blame History

Multi-backend support

TGI (Text Generation Inference) offers flexibility by supporting multiple backends for serving large language models (LLMs). With multi-backend support, you can choose the backend that best suits your needs, whether you prioritize performance, ease of use, or compatibility with specific hardware. API interaction with TGI remains consistent across backends, allowing you to switch between them seamlessly.

Supported backends:

TGI CUDA backend: This high-performance backend is optimized for NVIDIA GPUs and serves as the default option within TGI. Developed in-house, it boasts numerous optimizations and is used in production by various projects, including those by Hugging Face.
TGI TRTLLM backend: This backend leverages NVIDIA's TensorRT library to accelerate LLM inference. It utilizes specialized optimizations and custom kernels for enhanced performance. However, it requires a model-specific compilation step for each GPU architecture.

1018 B Raw Blame History

Multi-backend support

1018 B

Raw Blame History