We now manually evaluate the apparent hash of the neuron backend by
combining the hashes of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image SHA.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
which means fewer neuron models are pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but that package is not necessarily present on the system.
Instead, the export is now done through the neuron TGI image itself,
since it contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind), it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
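A rough sketch of the on-the-fly export idea using docker-py; the base
image tag, entrypoint, and export command below are illustrative
assumptions, not the actual fixture code:

```python
import io

import docker

client = docker.from_env()

# Build a one-off image whose entrypoint performs the export, so everything
# happens inside the container and no shared volume is needed under dind.
dockerfile = io.BytesIO(
    b"FROM ghcr.io/huggingface/text-generation-inference:latest-neuron\n"
    b'ENTRYPOINT ["optimum-cli", "export", "neuron"]\n'
)
image, _ = client.images.build(fileobj=dockerfile, tag="tgi-neuron-export:tmp")

# Run the export; the model id and export arguments are placeholders.
client.containers.run(
    image.id,
    command=[
        "--model", "Qwen/Qwen2.5-0.5B",
        "--batch_size", "1",
        "--sequence_length", "1024",
        "/data/exported_model",
    ],
    environment={"HF_TOKEN": "<token>"},
    remove=True,
)
```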
* feat: parse the HF_HUB_USER_AGENT_ORIGIN environment variable to add information about the environment running TGI; this is useful to track usage, for example in the case of collaborations.
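A minimal sketch of the idea; the exact user-agent format is an
assumption, only the environment variable name comes from the commit:

```python
import os


def hub_user_agent(base: str = "text-generation-inference") -> str:
    # Append the origin declared by the deployment, if any, so that
    # Hub-side usage can be attributed (e.g. in collaborations).
    origin = os.environ.get("HF_HUB_USER_AGENT_ORIGIN")
    return f"{base}; origin/{origin}" if origin else base
```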
* fix: trufflehog
* feat: support qwen2.5 vl model
* fix: bump supported models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
We found that on some machines, downloading the config with hf_hub::api::sync::Api fails, which makes warmup fail since attributes like max_position_embeddings cannot be retrieved. Updating hf-hub to the latest version fixes it.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
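The fix itself is on the Rust side (the hf_hub crate), but the dependency
it protects is easy to see in a Python analogue: warmup needs values such
as max_position_embeddings from the Hub-hosted config, so a failed
download breaks it (the model id is illustrative):

```python
import json

from huggingface_hub import hf_hub_download

# If this download fails, fields such as max_position_embeddings are
# unknown and warmup cannot size its requests correctly.
config_path = hf_hub_download("Qwen/Qwen2.5-0.5B", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(config["max_position_embeddings"])
```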
* Putting back the NCCL forced upgrade.
* Ignoring conda.
* Dropping conda from the build system + torch 2.6
* Cache min.
* Rolling back torch version.
* Reverting the EETQ modification.
* Fix flash attention?
* Actually stay on flash v1.
* Patching flash v1.
* Torch 2.6, fork of rotary, eetq updated.
* Put back nccl latest (override torch).
* Slightly more reproducible build and not as scary.
* fix Qwen VL break on the Intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use the PositionRotaryEmbedding impl so ROCm and IPEX both work
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
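The two commits above apply a simple pattern: resolve the Hub kernel once
at module import and keep only the cached call in the hot path. A minimal
sketch, assuming the hf-kernels get_kernel entry point and the
kernels-community/activation repository (names and call signature are
assumptions):

```python
import torch
from hf_kernels import get_kernel

# Resolved from the Hub once, at import time, not inside the hot loop.
activation = get_kernel("kernels-community/activation")


def fast_gelu(x: torch.Tensor) -> torch.Tensor:
    # Hot path: only the cached kernel call remains here.
    out = torch.empty_like(x)
    activation.gelu_fast(out, x)
    return out
```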
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
* Updating mllama after strftime.
* Town instead of village.
* Forgot the integration snapshot.
* Attempt to fix intel CPU.
* Intel extension fix.
* Workaround intel.
* Moving those deps directly into pyproject.
* Revert "Moving those deps directly into pyproject."
This reverts commit 98c1496ea6.
* Non system uv.
* Fixing the docker environment hopefully.
* Missed a step.
* Move workdir up a bit.
* Bailing out of reproducible python env.
* Triton version.
* backend(trtllm): bump TRTLLM to v0.17.0
* backend(trtllm): forgot to bump dockerfile
* backend(trtllm): use arg instead of env
* backend(trtllm): use correct library reference decoder_attention_src
* backend(trtllm): link against decoder_attention_{0|1}
* backend(trtllm): build against gcc-14 with cuda12.8
* backend(trtllm): use the return value optimization flag as an error if available
* backend(trtllm): make sure we escalate all warnings to errors on the backend impl in debug mode
* backend(trtllm): link against CUDA 12.8