* clean cuda/rocm code in hpu backend, enable flat_hpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix TP in pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* adjust block table in hpu to improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable all the models. not tested yet
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use tensor cache in hpu graph to avoid replay issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add moe support, fix qwen/mistral/mixtral crash
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix phimoe issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* gpt_bigcode could also go pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable dbrx, remove some unused code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* multi-modality initial PR
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* adjust warmup and enable vlm
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix incorrect output in qwen2 idefics if hpu graph is used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* remove unused quantization code and enable awq/gptq int4
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix gptq issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable fp8
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* warmup prefill
remove model where pageattn is not used, set block table to None since it's not used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add warmup_decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* warmup decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* remove block_tables and prefill_cache_indices which will lead to dynamic shape
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix comment
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* missing gptj change...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* remove torch.where to fix incorrect output in hpu graph model
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* match the latest vllm_extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat(gaudi): release ready (docs, docker image and vlm ready)
* fix(gaudi): add default argument for the dockerfile
* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
* Making `tool_calls` a vector.
* Update doc.
* Fixing the nix overlay with updated version.
* Add openai dependency.
* Updating the old tests.
* Trying to reduce the logs in the case of errors.
* Less spammy logs too.
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.
* fix: Rust version for Neuron
* fix: PR comments, use rust-toolchain.toml
* wip(gaudi): import server and dockerfile from tgi-gaudi fork
* feat(gaudi): new gaudi backend working
* fix: fix style
* fix prehooks issues
* fix(gaudi): refactor server and implement requested changes
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image
The base image used to compile the rust components seems to have a low
ulimit for opened files, which leads to errors during compilation.
* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
* refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: doing a precompilation step (with a different token).
* test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
which means fewer neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
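A minimal sketch of how such a combined hash could be evaluated, assuming SHA-256 and a sorted walk over the backend directory. The helper names `directory_hash` and `backend_hash` are hypothetical and are not taken from the test fixtures; the actual implementation may differ.

```python
import hashlib
from pathlib import Path


def directory_hash(root: Path) -> str:
    """Hash every file under root, in sorted order so the result is stable."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so that renames also change the hash.
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def backend_hash(backend_dir: Path, dockerfile: Path) -> str:
    """Combine the backend directory hash and the Dockerfile hash (hypothetical helper)."""
    digest = hashlib.sha256()
    digest.update(directory_hash(backend_dir).encode())
    digest.update(hashlib.sha256(dockerfile.read_bytes()).hexdigest().encode())
    return digest.hexdigest()
```

Because the result depends only on file contents and relative paths, it can be computed identically on a developer machine and in CI, which is what allows a local test run to export the models ahead of the CI.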
* test(neuron): added a small script to prune test models
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* backend(trtllm): bump TRTLLM to v.0.17.0
* backend(trtllm): forgot to bump dockerfile
* backend(trtllm): use arg instead of env
* backend(trtllm): use correct library reference decoder_attention_src
* backend(trtllm): link against decoder_attention_{0|1}
* backend(trtllm): build against gcc-14 with cuda12.8
* backend(trtllm): use return value optimization flag as an error if available
* backend(trtllm): make sure we escalate all warnings as errors on the backend impl in debug mode
* backend(trtllm): link against CUDA 12.8
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache
* backend(trtllm): what if we expose ENV instead of inline?
* backend(trtllm): and with the right env var for gha sccache
* backend(trtllm): relax the way to detect sccache
* backend(trtllm): make sccache definition manually
* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present
* backend(trtllm): export env variable in run mb?
* backend(trtllm): Cache mode max to cache intermediate layers
* backend(trtllm): inject ompi_version build arg in dependent step
* backend(trtllm): update to 0.16.0
* backend(trtllm): do not use shallow clone
* backend(trtllm): use tag instead
* backend(trtllm): move to nvidia remote instead of hf
* backend(trtllm): reenable shallow clone
* backend(trtllm): attempt to use ADD instead of RUN for openmpi
* backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile
* backend(trtllm): add correctly untar it
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* misc(cmake) update dependencies
* feat(hardware) enable new hardware.hpp and unittests
* test(ctest) enable address sanitizer
* feat(backend): initial rewrite of the backend for simplicity
* feat(backend): remove all the logs from hardware.hpp
* feat(backend): added some logging
* feat(backend): enable compiler warning if support for RVO not applying
* feat(backend): missing return statement
* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder
* feat(backend): delete previous backend impl
* feat(backend): more impl
* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility
* feat(backend): allow overriding which Python to use
* feat(backend): fix backend_exception_t -> backend_error_t naming
* feat(backend): impl missing generation_step_t as return value of pull_tokens
* feat(backend): make backend_workspace_t::engines_folder constexpr
* feat(backend): fix main.rs retrieving the tokenizer
* feat(backend): add guard to multiple header definitions
* test(backend): add more unittest
* feat(backend): remove constexpr from par
* feat(backend): remove constexpig
* test(backend): more test coverage
* chore(trtllm): update dependency towards 0.15.0
* effectively cancel the request on the executor
* feat(backend) fix moving backend when pulling
* feat(backend): make sure we can easily cancel request on the executor
* feat(backend): fix missing "0" field access
* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut
* chore: Add doc and CI for TRTLLM (#2799)
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* chore: Add doc and CI for TRTLLM
* doc: Formatting
* misc(backend): indent
---------
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
* Using both values from config as they might not be correct.
* Fixing max_position_embeddings for falcon.
* Simple attempt to fix the healthcheck block allocation.
* Much simpler solution.
* Default value for Backend start_health
* Incomplete generation stream fix (#2754)
entries.len() could be > batch.size in prefill, so entries need to be filtered as well.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* entries was wrongly extended for models that did not support chunking
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
* feat(trtllm): rewrite health to not account for current state
* chore(looper): cleanup a bit more
* feat(post_processing): max_new_tokens is const evaluated now
* chore(ffi):formatting
* feat(trtllm): add stop words handling
# Conflicts:
# backends/trtllm/lib/backend.cpp
* chore(trtllm): create specific parallelconfig factory and logging init methods
* chore(trtllm): define a macro for SizeType cast
* chore(trtllm): use GetParallelConfig
* chore(trtllm): minor refactoring
* chore(trtllm): validate there are enough GPUs on the system for the desired model
* chore(trtllm): ensure max throughput scheduling policy is selected
* chore(trtllm): minor fix
* chore(router): minor refactorings
* feat(docker): build with-slurm ompi
* feat(docker): add python3.10 dev to runtime deps
* chore(docker): add mpi to ld_library_path
* chore(docker): install transformers
* feat(trtllm): detect stop_words from generation_config.json
* (backend) use parking_lot crate for RwLock fairness
# Conflicts:
# backends/trtllm/src/backend.rs
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
* (ffi) use const for GetSamplingConfig
* (server) expose new SchedulingError
* (trt)
* (build) setup ccache if available
* (ffi) add max_new_tokens parameters
* (backend) cleanup a bit
* (backend) expose PullNewTokens
* (ffi) cleanup again
* (ffi) add missing headers imports
* (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>
* (looper) new looper initial implementation
* (ffi) remove narrowing type warning
* (ffi) encode the provided user prompt within each request thread
* (misc) change scope identifiers
* (backend) implement the post_processor background thread
* (misc) missing Result types for Rust
* use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step
* (server) forward auth_token to server::run
* (build) fetchcontent use archives instead of git
* (ffi) fix usage of wrong vector constructor making a capacity fill call
* (ffi) missing namespace for tle::Response
* (ffi) do not use reference capture in lambda as we are not capturing anything
* (backend) refactor & cleanup
* (Dockerfile.trtllm) delete for now
* (misc) simplify [make_]move_iterator by using c++20 type inference
* (misc) no need to move for uint32_t items
* (scheduler) rework submit/pull logic
* (post) impl postprocessing
* (misc) delete backend.rs
* (misc) rerun-if-changed all the cmake modules
* (misc) move to latest trtllm
* (fix): HOPPER_SM_MAJOR is 9 not 8
* (misc): build for sm_{75,80,86,89,90} by default
* (misc): build with trtllm 0.13.0
* (misc): increase verbosity of spdlog
* (fix): do not recreate the stateful hashmap at every it
* (misc): update dependency in trtllm dockerfile
* (misc): update dependency in trtllm dockerfile
* (misc): disable logging in release mode
* (misc): improve trtllm download script robustness
* (fix): more fixes for Dockerfile
* misc(cuda): require 12.6
* chore(cmake): use correct policy for download_timestamp
* feat(looper): check engine and executorWorker paths exist before creating the backend
* chore(cmake): download timestamp should be before URL
* feat(looper): minor optimizations to avoid growing the containers too much
* chore(trtllm): move dockerfile to right place
* chore(trtllm): disable tokenizer parallelism by default
* chore(trtllm): fmt
* chore(trtllm): post-rebase commit
* chore(trtllm): remove unused method
* feat(trtllm): cache maxNumTokens to avoid calling JSON every time
* misc(router): remove SchedulingError
* feat(trtllm): do not tokenize twice
* Revert "chore(trtllm): remove unused method"
This reverts commit 31747163
* chore(rebase): fix invalid references
* chore(router): add python dependency
* Lint.
* Fix bad rebase
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* adding max_token_capacity_metric
* added tgi to name of metric
* Adding max capacity metric.
* Add description for the metrics
---------
Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1
* Last reference.
* Use an actual hash.
* Update hash for slice.len() == 1
* Update the locks.
* Increasing docker timeout.
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).
* Fixing the builds ?
* Fix the gh action?
* Fixing the location ?
* Validation is odd.
* Try a faster runner
* Upgrade python version.
* Remove sccache
* No sccache.
* Getting libpython maybe ?
* List stuff.
* Monkey it up.
* have no idea at this point
* Tmp.
* Shot in the dark.
* Tmate the hell out of this.
* Desperation.
* WTF.
* -y.
* Apparently 3.10 is not available anymore.
* Updating the dockerfile to make libpython discoverable at runtime too.
* Put back rust tests.
* Why do we want mkl on AMD ?
* Forcing 3.11 ?