# Text Generation Inference - TensorRT-LLM Backend Implementation

## Description

This folder contains the sources of the TensorRT-LLM backend implementation, powered by the new TensorRT-LLM Executor API.

## Simplified Request Sequence

```mermaid
sequenceDiagram
    actor User
    participant TextGenerationInference.HttpServer
    participant TextGenerationInference.TensorRtLlmBackend
    participant TextGenerationInference.TensorRtLlmWorkerThread
    participant TensorRtLlm.Executor
    participant Nvidia.Gpu
    User ->> TextGenerationInference.HttpServer: POST /generate
    TextGenerationInference.HttpServer ->> TextGenerationInference.TensorRtLlmBackend: Validate and forward inputs & parameters
    TextGenerationInference.TensorRtLlmBackend ->> TextGenerationInference.TensorRtLlmWorkerThread: Allocate a new context and spawn a new thread to handle the request
    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Submit the request to the In-Flight Batcher
    activate Nvidia.Gpu
    TensorRtLlm.Executor ->> Nvidia.Gpu: Add the request to the pool for execution
    TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Respond with a unique request identifier
    rect rgb(10, 92, 54)
        loop every 100us
            rect rgb(15, 81, 50)
                alt Acquire lock to query executor
                    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Poll the number of new token(s) generated for the request
                else There are new generated tokens
                    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Retrieve newly generated tokens
                    TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Return decoded token information and potential error (omitted)
                    rect rgb(11, 110, 79)
                        alt Generated token is final
                            TensorRtLlm.Executor ->> Nvidia.Gpu: Remove request from the scheduler and from the GPU
                            TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream the remaining decoded tokens and flush the connection
                        else Generated token is not final
                            TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream tokens back to the user as they get decoded
                        end
                    end
                end
            end
            deactivate Nvidia.Gpu
        end
    end

```
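
The polling loop in the diagram can be illustrated with a small, self-contained Python sketch. This is not the backend's actual code and does not use the TensorRT-LLM Executor API: `FakeExecutor`, `submit`, `num_ready`, `fetch`, `worker` and `generate` are hypothetical names introduced only for illustration, and the fake executor simply releases one pre-defined token per poll. It only shows the control flow: submit a request, receive an identifier, then poll under a lock every 100 µs and stream tokens until the final one arrives.

```python
import threading
import time
from collections import deque


class FakeExecutor:
    """Hypothetical stand-in for the executor; NOT the TensorRT-LLM API."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pending = {}
        self._next_id = 0

    def submit(self, prompt):
        # Register the request and hand back a unique request identifier.
        request_id = self._next_id
        self._next_id += 1
        self._pending[request_id] = deque(self._tokens)
        return request_id

    def num_ready(self, request_id):
        # The fake makes at most one token available per poll.
        return min(1, len(self._pending[request_id]))

    def fetch(self, request_id):
        # Return (token, is_final) pairs; drop the request once it is done.
        queue = self._pending[request_id]
        token = queue.popleft()
        is_final = not queue
        if is_final:
            del self._pending[request_id]
        return [(token, is_final)]


def worker(executor, lock, prompt, on_token, poll_interval=100e-6):
    """Submit one request, then poll until the final token is streamed."""
    request_id = executor.submit(prompt)
    while True:
        # Acquire the lock before querying the executor.
        with lock:
            ready = executor.num_ready(request_id)
            responses = executor.fetch(request_id) if ready else []
        for token, is_final in responses:
            on_token(token)  # stream the token back to the caller
            if is_final:
                return
        time.sleep(poll_interval)


def generate(prompt, tokens):
    """Run one request on its own thread and collect the streamed tokens."""
    if not tokens:
        raise ValueError("the fake executor needs at least one token")
    streamed = []
    lock = threading.Lock()
    executor = FakeExecutor(tokens)
    thread = threading.Thread(
        target=worker, args=(executor, lock, prompt, streamed.append)
    )
    thread.start()
    thread.join()
    return streamed
```

Calling `generate("Hello", ["a", "b", "c"])` collects the three tokens in order. In the real backend the callback would write to the HTTP response stream rather than append to a list.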