# Text Generation Inference - TensorRT-LLM Backend Implementation

## Description

This folder contains the sources of the TensorRT-LLM backend implementation, powered by the new TensorRT-LLM Executor API.
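
Concretely, each validated request is handed to the TensorRT-LLM Executor, which owns in-flight batching on the GPU. The snippet below is a minimal, hypothetical sketch of that submission path using the `tensorrt_llm::executor` C++ API (names as of TRT-LLM ~0.15); the engine path and token ids are illustrative, and the actual backend wraps this logic in `csrc/` and drives it from the Rust router.

```cpp
// Hypothetical sketch: submitting one generation request through the
// TensorRT-LLM Executor API. The real backend lives in csrc/ and is driven
// from the Rust router; the path and token ids below are illustrative only.
#include <filesystem>

#include <tensorrt_llm/executor/executor.h>

namespace tle = tensorrt_llm::executor;

int main() {
    // Folder containing the compiled TensorRT-LLM engines (illustrative path).
    std::filesystem::path engines_folder{"/repository/engines"};

    // Start the in-flight batching executor over a decoder-only engine,
    // using the default executor configuration.
    tle::Executor executor{engines_folder, tle::ModelType::kDECODER_ONLY, tle::ExecutorConfig{}};

    // Already-tokenized prompt (token ids are made up for the example).
    tle::VecTokens input_tokens{1, 15043, 3186};

    // Ask for up to 64 new tokens, streamed step by step.
    tle::Request request{std::move(input_tokens), /* max new tokens */ 64, /* streaming */ true};
    auto const request_id = executor.enqueueRequest(request);

    // Responses are pulled asynchronously; see the polling sketch after the
    // sequence diagram below.
    (void) request_id;
    return 0;
}
```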

## Simplified Request Sequence

```mermaid
sequenceDiagram
    actor User
    participant TextGenerationInference.HttpServer
    participant TextGenerationInference.TensorRtLlmBackend
    participant TextGenerationInference.TensorRtLlmWorkerThread
    participant TensorRtLlm.Executor
    participant Nvidia.Gpu

    User ->> TextGenerationInference.HttpServer: POST /generate
    TextGenerationInference.HttpServer ->> TextGenerationInference.TensorRtLlmBackend: Validate and forward inputs & parameters
    TextGenerationInference.TensorRtLlmBackend ->> TextGenerationInference.TensorRtLlmWorkerThread: Allocate a new context and spawn a new thread to handle the request
    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Submit the request to the In-Flight Batcher

    activate Nvidia.Gpu

    TensorRtLlm.Executor ->> Nvidia.Gpu: Add the request to the pool for execution
    TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Respond with a unique request identifier

    rect rgb(10, 92, 54)
        loop every 100us
            rect rgb(15, 81, 50)
                alt Acquire lock to query executor
                    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Poll the number of new token(s) generated for the request
                else There are new generated tokens
                    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Retrieve the newly generated tokens
                    TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Return decoded token information and potential error (omitted)

                    rect rgb(11, 110, 79)
                        alt Generated token is final
                            TensorRtLlm.Executor ->> Nvidia.Gpu: Remove the request from the scheduler and from the GPU
                            TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream the remaining decoded tokens and flush the connection
                        else Generated token is not final
                            TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream tokens back to the user as they are decoded
                        end
                    end
                end
            end

            deactivate Nvidia.Gpu
        end
    end
```
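
The `loop every 100us` block above is the part the worker thread implements: it repeatedly asks the executor whether new tokens are ready, streams them back, and stops once the final token has been produced. The sketch below is a hypothetical illustration of that loop built on `Executor::awaitResponses` (TRT-LLM ~0.15 naming); the callback type, the timeout value, and the `poll_request` function are assumptions made for the example, not the backend's actual FFI surface.

```cpp
// Hypothetical worker-thread polling loop mirroring the diagram above.
// The callback, timeout, and function name are illustrative; the real backend
// exposes pull_tokens / generation_step_t through its FFI layer instead.
#include <chrono>
#include <functional>

#include <tensorrt_llm/executor/executor.h>

namespace tle = tensorrt_llm::executor;

// Invoked with the newly decoded tokens and whether the request is finished.
using TokenCallback = std::function<void(tle::IdType, tle::VecTokens const&, bool)>;

void poll_request(tle::Executor& executor, tle::IdType request_id, TokenCallback const& on_tokens) {
    bool finished = false;
    while (!finished) {
        // "Poll the number of new token(s) generated": wait briefly for the
        // in-flight batcher to produce new responses (the diagram polls every ~100us).
        auto responses = executor.awaitResponses(std::chrono::milliseconds(1));
        for (auto const& response : responses) {
            if (response.getRequestId() != request_id) {
                continue;  // Response belongs to another in-flight request.
            }
            if (response.hasError()) {
                finished = true;  // Error handling is omitted in the diagram as well.
                break;
            }

            auto const result = response.getResult();
            // Stream the newly decoded token(s) back to the user. When isFinal is set,
            // the executor has already removed the request from the GPU scheduler.
            on_tokens(request_id, result.outputTokenIds.front(), result.isFinal);
            finished = result.isFinal;
        }
    }
}
```

In the backend itself, the equivalent of this loop runs on the per-request thread spawned in the diagram and streams tokens back to the HTTP layer as they arrive.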