* feat(gaudi): release ready (docs, docker image and vlm ready)
* fix(gaudi): add default argument for the dockerfile
* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
* Making `tool_calls` a vector.
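  For illustration, the shape of this change is roughly as follows (a minimal Rust sketch; the type and field names are hypothetical, not the exact API):

  ```rust
  // Hypothetical sketch: a chat message used to carry at most one tool
  // call; it now carries a vector, matching the OpenAI-style response
  // shape where `tool_calls` is an array.
  struct ToolCall {
      id: String,
      name: String,
      arguments: String,
  }

  struct ChatMessage {
      role: String,
      content: Option<String>,
      tool_calls: Vec<ToolCall>, // previously a single optional value
  }
  ```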
* Update doc.
* Fixing the nix overlay with updated version.
* Add openai dependency.
* Updating the old tests.
* Trying to reduce the logs in the case of errors.
* Less spammy logs too.
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.
* fix: Rust version for Neuron
* fix: PR comments, use rust-toolchain.toml
* wip(gaudi): import server and dockerfile from tgi-gaudi fork
* feat(gaudi): new gaudi backend working
* fix: fix style
* fix pre-commit hooks issues
* fix(gaudi): refactor server and implement requested changes
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image
The base image used to compile the Rust components seems to have a low
ulimit for open files, which leads to errors during compilation.
* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind), it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
* refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: doing a precompilation step (with a different token).
* test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
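  A minimal sketch of how such a combined hash could be computed (hypothetical code assuming the `sha2` crate; the actual implementation may differ):

  ```rust
  use sha2::{Digest, Sha256};
  use std::{fs, io, path::{Path, PathBuf}};

  // Hash every file under the backend directory plus the Dockerfile, in a
  // deterministic order, so the id only changes when the backend changes.
  fn backend_hash(backend_dir: &Path, dockerfile: &Path) -> io::Result<String> {
      let mut files = collect_files(backend_dir)?;
      files.sort(); // deterministic ordering across machines
      let mut hasher = Sha256::new();
      for file in files {
          hasher.update(fs::read(&file)?);
      }
      hasher.update(fs::read(dockerfile)?);
      Ok(hasher
          .finalize()
          .iter()
          .map(|b| format!("{b:02x}"))
          .collect())
  }

  fn collect_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
      let mut out = Vec::new();
      for entry in fs::read_dir(dir)? {
          let path = entry?.path();
          if path.is_dir() {
              out.extend(collect_files(&path)?);
          } else {
              out.push(path);
          }
      }
      Ok(out)
  }
  ```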
* test(neuron): added a small script to prune test models
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* backend(trtllm): bump TRTLLM to v0.17.0
* backend(trtllm): forget to bump dockerfile
* backend(trtllm): use arg instead of env
* backend(trtllm): use correct library reference decoder_attention_src
* backend(trtllm): link against decoder_attention_{0|1}
* backend(trtllm): build against gcc-14 with cuda12.8
* backend(trtllm): use return value optimization flag as an error if available
* backend(trtllm): make sure we escalate all warnings as errors on the backend impl in debug mode
* backend(trtllm): link against CUDA 12.8
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache
* backend(trtllm): what if we expose ENV instead of inline?
* backend(trtllm): and with the right env var for gha sccache
* backend(trtllm): relax the way to detect sccache
* backend(trtllm): make sccache definition manually
* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present
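  Conceptually, that approach looks something like this (a hypothetical build.rs sketch assuming the `cmake` crate and a `csrc` source directory; not the backend's exact code):

  ```rust
  // build.rs sketch: when cargo is invoked with a RUSTC_WRAPPER such as
  // sccache, forward it to the CMake build as compiler launchers so the
  // C++/CUDA compilation is cached too.
  fn main() {
      let mut config = cmake::Config::new("csrc"); // assumed source dir
      if let Ok(wrapper) = std::env::var("RUSTC_WRAPPER") {
          config.define("CMAKE_C_COMPILER_LAUNCHER", &wrapper);
          config.define("CMAKE_CXX_COMPILER_LAUNCHER", &wrapper);
          config.define("CMAKE_CUDA_COMPILER_LAUNCHER", &wrapper);
      }
      config.build();
  }
  ```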
* backend(trtllm): export env variable in run, maybe?
* backend(trtllm): Cache mode max to cache intermediate layers
* backend(trtllm): inject ompi_version build arg in dependent step
* backend(trtllm): update to 0.16.0
* backend(trtllm): do not use shallow clone
* backend(trtllm): use tag instead
* backend(trtllm): move to nvidia remote instead of hf
* backend(trtllm): reenable shallow clone
* backend(trtllm): attempt to use ADD instead of RUN for openmpi
* backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile
* backend(trtllm): and correctly untar it
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* misc(cmake): update dependencies
* feat(hardware): enable new hardware.hpp and unittests
* test(ctest): enable address sanitizer
* feat(backend): initial rewrite of the backend for simplicity
* feat(backend): remove all the logs from hardware.hpp
* feat(backend): added some logging
* feat(backend): enable compiler warning if RVO does not apply
* feat(backend): missing return statement
* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder
* feat(backend): delete previous backend impl
* feat(backend): more impl
* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility
* feat(backend): allow overriding which Python to use
* feat(backend): fix backend_exception_t -> backend_error_t naming
* feat(backend): impl missing generation_step_t as return value of pull_tokens
* feat(backend): make backend_workspace_t::engines_folder constexpr
* feat(backend): fix main.rs retrieving the tokenizer
* feat(backend): add guard to multiple header definitions
* test(backend): add more unittest
* feat(backend): remove constexpr from par
* feat(backend): remove constexpr
* test(backend): more test coverage
* chore(trtllm): update dependency towards 0.15.0
* effectively cancel the request on the executor
* feat(backend): fix moving backend when pulling
* feat(backend): make sure we can easily cancel request on the executor
* feat(backend): fix missing "0" field access
* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut
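  The pattern from the linked documentation, in a minimal form (illustrative names, not the backend's actual code):

  ```rust
  use std::pin::Pin;

  fn consume<T: ?Sized>(_p: Pin<&mut T>) {}

  // `Pin<&mut T>` is not Copy, so passing it by value moves it out of
  // the binding. `as_mut()` reborrows the pin for the duration of the
  // call, leaving the original binding usable afterwards.
  fn use_twice<T: ?Sized>(mut pinned: Pin<&mut T>) {
      consume(pinned.as_mut()); // reborrow: `pinned` is still valid here
      consume(pinned);          // the last use may move it
  }
  ```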
* chore: Add doc and CI for TRTLLM (#2799)
* chore: Add doc and CI for TRTLLM
* doc: Formatting
* misc(backend): indent
---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
* Using both values from config as they might not be correct.
* Fixing max_position_embeddings for falcon.
* Simple attempt to fix the healthcheck block allocation.
* Much simpler solution.
* Default value for Backend start_health