* Publish nix docker image.
* Run during PR.
* Something else.
* Forgot to push.
* Build zstd.
* Pushing with skopeo
* Testing the PR.
* Runnign from nix.
* Cleaner tags.
* feat(gaudi): release ready (docs, docker image and vlm ready)
* fix(gaudi): add default argument for the dockerfile
* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
* Update to `kernels` 0.2.1
The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.
* Download kernels in `install-cuda` target
* Patch rust release.
* Trying to remove the rust-toolchain hardcoded in action.
* Upgrade rust toolchain.
* Put back the toolchain ?
* Fix neuron dockerfile.
* Move to the proper version of Rust.
* 1.85 since the GH action doesn't respect the override.
* Typo.
* Fixing the github action.
* Fixing docker llamacpp.
* Fixing the github action.
* Update clippy.
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image
The base image used to compile the rust components seems to have a low
ulimit for opened files, which leads to errors during compilation.
* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are ran for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
* refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: doing a precompilation step (with a different token).
* test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only hwen the neuron backend changes),
which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
* test(neuron): added a small script to prune test models
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable to add info about the environment running TGI. That is useful to track usage in case of collaborations for example.
* fix: trufflehog
* Putting back the NCCL forced upgrade.
* .
* ...
* Ignoring conda.
* Dropping conda from the buidl system + torch 2.6
* Cache min.
* Rolling back torch version.
* Reverting the EETQ modification.
* Fix flash attention ?
* Actually stay on flash v1.
* Patching flash v1.
* Torch 2.6, fork of rotary, eetq updated.
* Put back nccl latest (override torch).
* Slightly more reproducible build and not as scary.
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
* Using the "lockfile".
* Revert dummy modifications.
* Lock on python 3.11
* Another attempt.
* ..
* Bad cache hits.
* The good old monkey.
* How in the world...
* We need the launcher still.
* .
* ..
* Attempt #42
* Don't break all other builds.
* Mode max.
* Applying to other builds.
* hotfix: fix trtllm CI build on release
* fix: test release.
* fix: test release.
* fix: test release. env not recognized https://github.com/actions/runner/issues/1661
* fix: test release. Works.
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache
* backend(trtllm): what if we expose ENV instead of inline?
* backend(trtllm): and with the right env var for gha sccache
* backend(trtllm): relax the way to detect sccache
* backend(trtllm): make sccache definition manually
* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present
* backend(trtllm): export env variable in run mb?
* backend(trtllm): Cache mode max to cache intermediate layers
* backend(trtllm): inject ompi_version build arg in dependent step
* Moving to `uv` instead of `poetry`.
More in the standard, faster, seemingly better lockfile.
* Creating venv if not created.
* Create the venv.
* Fix ?
* Fixing the test by activating the environment ?
* Install system ?
* Add the cli entry point.
* docker install on system
* Monkeying this...
* `--system` is redundant.
* Trying to force-include this pb folder.
* TRying to check that pb is imported correctly.
* Editable install necessary ?
* Non editable?
* Editable it is.
* misc(cmake) update dependencies
* feat(hardware) enable new hardware.hpp and unittests
* test(ctest) enable address sanitizer
* feat(backend): initial rewrite of the backend for simplicity
* feat(backend): remove all the logs from hardware.hpp
* feat(backend): added some logging
* feat(backend): enable compiler warning if support for RVO not applying
* feat(backend): missing return statement
* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder
* feat(backend): delete previous backend impl
* feat(backend): more impl
* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility
* feat(backend): allow overriding which Python to use
* feat(backend): fix backend_exception_t -> backend_error_t naming
* feat(backend): impl missing generation_step_t as return value of pull_tokens
* feat(backend): make backend_workspace_t::engines_folder constexpr
* feat(backend): fix main.rs retrieving the tokenizer
* feat(backend): add guard to multiple header definitions
* test(backend): add more unittest
* feat(backend): remove constexpr from par
* feat(backend): remove constexpig
* test(backend): more test coverage
* chore(trtllm): update dependency towards 0.15.0
* effectively cancel the request on the executor
* feat(backend) fix moving backend when pulling
* feat(backend): make sure we can easily cancel request on the executor
* feat(backend): fix missing "0" field access
* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut
* chore: Add doc and CI for TRTLLM (#2799)
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* doc: Formatting
* misc(backend): indent
---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
* nix: build and cache all devshells
* nix: add poetry to the impure shell
This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.
* Fix Nix build, disable pure shell (covered by Nix tests)
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
* Move to moe-kernels package and switch to common MoE layer
This change introduces the new `moe-kernels` package:
- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
models.
- Port over Mixtral and Deepseek.
* Make `cargo check` pass
* Update runner
* Add nix test.
* Modifying yourself means you need to rerun.
* Fixing the test + adding click (needed for pre-commit hooks).
* Try thuis.
* Our runner + pure test (not written)
* Reemove server.
* Root user.
* Different user ?
* Add the actual test target.
* Forgot this modification.
* Add a formatter.
* Add the secrets.
* Fixed the auth token ?
* Adding the other tests.
* Missing pre-commit.
* Test requires cargo for cargo fmt.
* Update it a bit.
* Up.
* Attempting to use a cache location for the models.
* Ignore the cache for now.
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).
* Fixing the builds ?
* Fix the gh action?
* Fixing the location ?
* Validation is odd.
* Try a faster runner
* Upgrade python version.
* Remove sccache
* No sccache.
* Getting libpython maybe ?
* List stuff.
* Monkey it up.
* have no idea at this point
* Tmp.
* Shot in the dark.
* Tmate the hell out of this.
* Desperation.
* WTF.
* -y.
* Apparently 3.10 is not available anymore.
* Updating the dockerfile to make libpython discoverable at runtime too.
* Put back rust tests.
* Why do we want mkl on AMD ?
* Forcing 3.11 ?
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD.
Remove paged as a default too, and using FD everywhere.
* Update cargo lock ?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Updated flake lock
* Tmp
* Upgrade resolution system for less errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* OVerride the env in server tests.
* Is this enough to make it work ?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops this doesn't belong here.
* Put back default pure shell.
* Update server tests
- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room
* Only n_heads / process_group.size() are necessary.
* Revert the integrationt tests change (seem linked to head_size
modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is
smaller.
* Apply suggestions from code review
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Update server/text_generation_server/layers/attention/common.py
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Fix disabling prefix caching - Fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* All integration tests back everywhere (too many failed CI).
* Upgrade integration tests after 12.4
* Attempt to remove the specifed compute cap.
* Common arch list.
* Punica uses raw ASM which is not valid on 9.0 apparently.
* doc: Add metrics documentation and add a 'Reference' section
* doc: Add API reference
* doc: Refactor API reference
* fix: Message API link
* Bad rebase
* Moving the docs.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Fix cache block size for flash decoding
This seems to have been accidentally dropped during the TRT-LLM
PR rebase.
* Also run CI on changes to `backends`
* wip
wip
refacto
refacto
Initial setup for CXX binding to TRTLLM
Working FFI call for TGI and TRTLLM backend
Remove unused parameters annd force tokenizer name to be set
Overall build TRTLLM and deps through CMake build system
Enable end to end CMake build
First version loading engines and making it ready for inference
Remembering to check how we can detect support for chunked context
Move to latest TensorRT-LLM version
Specify which default log level to use depending on CMake build type
make leader executor mode working
unconditionally call InitializeBackend on the FFI layer
bind to CUDA::nvml to retrieve compute capabilities at runtime
updated logic and comment to detect cuda compute capabilities
implement the Stream method to send new tokens through a callback
use spdlog release 1.14.1 moving forward
update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
correctly tell cmake to build dependent tensorrt-llm required libraries
create cmake install target to put everything relevant in installation folder
add auth_token CLI argument to provide hf hub authentification token
allow converting huggingface::tokenizers error to TensorRtLlmBackendError
use correct include for spdlog
include guard to build example in cmakelists
working setup of the ffi layer
remove fmt import
use external fmt lib
end to end ffi flow working
make sure to track include/ffi.h to trigger rebuild from cargo
impl the rust backend which currently cannot move the actual computation in background thread
expose shutdown function at ffi layer
impl RwLock scenario for TensorRtLllmBackend
oops missing c++ backend definitions
compute the number of maximum new tokens for each request independently
make sure the context is not dropped in the middle of the async decoding.
remove unnecessary log
add all the necessary plumbery to return the generated content
update invalid doc in cpp file
correctly forward back the log probabilities
remove unneeded scope variable for now
refactor Stream impl for Generation to factorise code
expose the internal missing start/queue timestamp
forward tgi parameters rep/freq penalty
add some more validation about grammar not supported
define a shared struct to hold the result of a decoding step
expose information about potential error happening while decoding
remove logging
add logging in case of decoding error
make sure executor_worker is provided
add initial Dockerfile for TRTLLM backend
add some more information in CMakeLists.txt to correctly install executorWorker
add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
simplify prebuilt trtllm libraries name definition
do the same name definition stuff for tensorrt_llm_executor_static
leverage pkg-config to probe libraries paths and reuse new install structure from cmake
fix bad copy/past missing nvinfer linkage direction
align all the linker search dependency
add missing pkgconfig folder for MPI in Dockerfile
correctly setup linking search path for runtime layer
fix missing / before tgi lib path
adding missing ld_library_path for cuda stubs in Dockerfile
update tgi entrypoint
commenting out Python part for TensorRT installation
refactored docker image
move to TensorRT-LLM v0.11.0
make docker linter happy with same capitalization rule
fix typo
refactor the compute capabilities detection along with num gpus
update TensorRT-LLM to latest version
update TensorRT install script to latest
update build.rs to link to cuda 12.5
add missing dependant libraries for linking
clean up a bit
install to decoder_attention target
add some custom stuff for nccl linkage
fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
use std::env::const::ARCH
make sure variable live long enough...
look for cuda 12.5
add some more basic info in README.md
* Rebase.
* Fix autodocs.
* Let's try to enable trtllm backend.
* Ignore backends/v3 by default.
* Fixing client.
* Fix makefile + autodocs.
* Updating the schema thing + redocly.
* Fix trtllm lint.
* Adding pb files ?
* Remove cargo fmt temporarily.
* ?
* Tmp.
* Remove both check + clippy ?
* Backporting telemetry.
* Backporting 457fb0a1
* Remove PB from git.
* Fixing PB with default member backends/client
* update TensorRT-LLM to latest version
* provided None for api_key
* link against libtensorrt_llm and not libtensorrt-llm
---------
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>