Morgan Funtowicz
e82dc30e8a
expose information about potential error happening while decoding
2024-07-18 22:07:59 +00:00
Morgan Funtowicz
a19d318947
define a shared struct to hold the result of a decoding step
2024-07-18 21:33:04 +00:00
Morgan Funtowicz
a036574a86
add some more validation about grammar not supported
2024-07-18 20:57:29 +00:00
Morgan Funtowicz
b643a436f3
forward tgi parameters rep/freq penalty
2024-07-18 20:56:58 +00:00
Morgan Funtowicz
95847c6587
expose the internal missing start/queue timestamp
2024-07-18 15:57:33 +00:00
Morgan Funtowicz
fd021e5461
refactor Stream impl for Generation to factorise code
2024-07-18 14:21:43 +00:00
Morgan Funtowicz
b56c43ec30
remove unneeded scope variable for now
2024-07-18 12:57:10 +00:00
Morgan Funtowicz
0212b1774a
correctly forward back the log probabilities
2024-07-17 22:33:10 +00:00
Morgan Funtowicz
bcb96feea6
update invalid doc in cpp file
2024-07-17 22:23:22 +00:00
Morgan Funtowicz
69674a3a2d
add all the necessary plumbing to return the generated content
2024-07-17 22:12:49 +00:00
Morgan Funtowicz
ce715c76f8
remove unnecessary log
2024-07-17 22:09:50 +00:00
Morgan Funtowicz
e983ee5bb8
make sure the context is not dropped in the middle of the async decoding.
2024-07-17 21:56:50 +00:00
Morgan Funtowicz
9220340ff7
compute the number of maximum new tokens for each request independently
2024-07-17 13:55:29 +00:00
Morgan Funtowicz
a01cd030d4
oops missing c++ backend definitions
2024-07-16 20:11:59 +00:00
Morgan Funtowicz
7784a21d48
impl RwLock scenario for TensorRtLlmBackend
2024-07-16 20:08:10 +00:00
Morgan Funtowicz
31d9f4d5dc
expose shutdown function at ffi layer
2024-07-15 07:36:01 +00:00
Morgan Funtowicz
b291be64a0
impl the rust backend, which currently cannot move the actual computation to a background thread
2024-07-12 19:26:32 +00:00
Morgan Funtowicz
518d9a9e0b
make sure to track include/ffi.h to trigger rebuild from cargo
2024-07-12 19:26:04 +00:00
Morgan Funtowicz
344f33f398
end to end ffi flow working
2024-07-12 19:25:40 +00:00
Morgan Funtowicz
b846ae2d9e
use external fmt lib
2024-07-12 19:24:59 +00:00
Morgan Funtowicz
1972669f49
remove fmt import
2024-07-12 19:24:09 +00:00
Morgan Funtowicz
50e9fc89c8
working setup of the ffi layer
2024-07-11 21:24:32 +00:00
Morgan Funtowicz
5aede911f8
include guard to build example in cmakelists
2024-07-11 21:24:01 +00:00
Morgan Funtowicz
ed14bd6818
use correct include for spdlog
2024-07-10 13:57:31 +00:00
Morgan Funtowicz
42748d5960
allow converting huggingface::tokenizers error to TensorRtLlmBackendError
2024-07-10 13:56:57 +00:00
Morgan Funtowicz
40fe2ec0ff
add auth_token CLI argument to provide hf hub authentication token
2024-07-10 13:50:28 +00:00
Morgan Funtowicz
ca9da2dd49
create cmake install target to put everything relevant in installation folder
2024-07-10 13:48:59 +00:00
Morgan Funtowicz
4272b8cf51
correctly tell cmake to build dependent tensorrt-llm required libraries
2024-07-10 13:48:44 +00:00
Morgan Funtowicz
6c92ebe6a8
update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
2024-07-10 13:47:56 +00:00
Morgan Funtowicz
7b9f92a0aa
use spdlog release 1.14.1 moving forward
2024-07-10 13:47:31 +00:00
Morgan Funtowicz
13eabfabcb
implement the Stream method to send new tokens through a callback
2024-07-09 13:46:48 +00:00
Morgan Funtowicz
09292b06a0
updated logic and comment to detect cuda compute capabilities
2024-07-09 12:15:41 +00:00
Morgan Funtowicz
bec188ff73
bind to CUDA::nvml to retrieve compute capabilities at runtime
2024-07-08 22:32:41 +00:00
Morgan Funtowicz
68a0247a2c
unconditionally call InitializeBackend on the FFI layer
2024-07-08 22:09:09 +00:00
Morgan Funtowicz
da926feaa1
make leader executor mode work
2024-07-08 22:08:49 +00:00
Morgan Funtowicz
f53ffa886d
Specify which default log level to use depending on CMake build type
2024-07-08 22:06:49 +00:00
Morgan Funtowicz
4113d6d51b
Move to latest TensorRT-LLM version
2024-07-08 22:06:30 +00:00
Morgan Funtowicz
29c7cb36e5
Remembering to check how we can detect support for chunked context
2024-07-03 21:38:17 +00:00
Morgan Funtowicz
f57f2a4521
First version loading engines and making it ready for inference
2024-07-03 21:12:24 +00:00
Morgan Funtowicz
f8a1463915
Enable end to end CMake build
2024-07-03 10:27:53 +02:00
Morgan Funtowicz
818162e0c2
Overall build TRTLLM and deps through CMake build system
2024-07-02 17:16:27 +02:00
Morgan Funtowicz
6dc98abe46
Remove unused parameters and force tokenizer name to be set
2024-07-01 16:11:59 +02:00
Morgan Funtowicz
47ac5c654d
Working FFI call for TGI and TRTLLM backend
2024-07-01 15:53:23 +02:00
Morgan Funtowicz
dc402dc9ac
Initial setup for CXX binding to TRTLLM
2024-06-30 23:37:20 +02:00
OlivierDehaene
230f2a415a
refactor
2024-06-26 14:12:01 +02:00
OlivierDehaene
93e0a7de8b
refactor
2024-06-26 14:00:03 +02:00
OlivierDehaene
b562680be4
wip
2024-06-26 13:13:32 +02:00
OlivierDehaene
504754861f
wip
2024-06-26 12:08:56 +02:00
drbh
be2d38032a
fix: simplify kserve endpoint and fix imports (#2119)
2024-06-25 19:30:10 -04:00
Daniël de Kok
f1f98e369f
Add support for Marlin 2:4 sparsity (#2102)
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:
* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.
Fixes #2098.
2024-06-25 21:09:42 +02:00
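
The selection rule described in the Marlin 2:4 commit body above can be read as a simple conjunction of two conditions. The sketch below only illustrates that rule; the function name `uses_marlin_24_kernel` and its parameter names are hypothetical and are not taken from the repository's code.

```python
def uses_marlin_24_kernel(quantizer: str, checkpoint_format: str) -> bool:
    """Return True when both conditions from the commit message hold.

    Hypothetical helper for illustration only: the real dispatch logic
    in the repository may be structured differently.
    """
    # Condition 1: the quantizer is `marlin`.
    # Condition 2: the quantizer checkpoint format is `marlin_24`.
    return quantizer == "marlin" and checkpoint_format == "marlin_24"


if __name__ == "__main__":
    # A Marlin checkpoint in the 2:4 sparse format selects the 2:4 kernel.
    print(uses_marlin_24_kernel("marlin", "marlin_24"))
    # Any other checkpoint format does not.
    print(uses_marlin_24_kernel("marlin", "other_format"))
```

Both conditions must hold together: under this reading, a `marlin_24` checkpoint paired with a different quantizer would not select the 2:4 kernel.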