* wip(gaudi): import server and dockerfile from tgi-gaudi fork
* feat(gaudi): new gaudi backend working
* fix: fix style
* fix prehooks issues
* fix(gaudi): refactor server and implement requested changes
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image
The base image used to compile the Rust components seems to have a low
ulimit for open files, which leads to errors during compilation.
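A quick way to see the limit in question, as an illustration only (the actual fix raises the ulimit for the image build; the threshold below is arbitrary):

```python
# Inspect the open-files limit the build would run under (Unix only).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# Arbitrary threshold, just to illustrate the failure mode: a low soft limit
# makes highly parallel rust/cargo compilations fail with "too many open files".
if soft < 4096:
    print("warning: low open-files limit, parallel compilation may fail")
```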
* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
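As a rough sketch of that mechanism (the image tag, entrypoint script and invocation below are hypothetical, for illustration only):

```python
# Build a throwaway image on top of the neuron TGI image whose entrypoint
# exports a model, then run it; the export happens inside that container and
# is pushed to the hub, since no volume can be shared under docker-in-docker.
import subprocess

def export_neuron_model(base_image: str, model_id: str) -> None:
    dockerfile = f"""
    FROM {base_image}
    ENTRYPOINT ["python", "/usr/local/bin/export_model.py"]
    """  # hypothetical export entrypoint shipped in the image
    subprocess.run(
        ["docker", "build", "-t", "tgi-neuron-export", "-f", "-", "."],
        input=dockerfile.encode(), check=True,
    )
    subprocess.run(
        ["docker", "run", "--rm", "tgi-neuron-export", model_id],
        check=True,
    )
```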
* refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: doing a precompilation step (with a different token).
* test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
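A minimal sketch of how such a combined hash can be computed (the directory and Dockerfile paths are assumptions):

```python
# Hash every file in the neuron backend directory plus its Dockerfile; the
# resulting digest identifies the exported models instead of the image sha.
import hashlib
from pathlib import Path

def neuron_backend_hash(backend_dir: str = "backends/neuron",
                        dockerfile: str = "Dockerfile.neuron") -> str:
    digest = hashlib.sha256()
    for path in sorted(Path(backend_dir).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(backend_dir).as_posix().encode())
            digest.update(path.read_bytes())
    digest.update(Path(dockerfile).read_bytes())
    return digest.hexdigest()[:16]
```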
* test(neuron): added a small script to prune test models
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* make content field optional in chat request
* add tool_calls field to Message struct
* feat: add test and serialize tool messages
* fix: bump utoipa, openapi doc version and improve test
* fix: rerun update docs
* fix: support tool call id in template and remove unnecessary changes
* fix: ruff lint remove unused import
* fix: adjust message types in tests
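An illustrative payload showing why `content` must be optional and why the Message struct needs `tool_calls` and a tool call id (the endpoint URL assumes a local TGI exposing the OpenAI-compatible chat API):

```python
# An assistant turn that only calls a tool has no content, and the tool result
# message refers back to the call via tool_call_id.
import requests

messages = [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
        "role": "assistant",
        "content": None,  # content is optional here
        "tool_calls": [{
            "id": "0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }],
    },
    {"role": "tool", "tool_call_id": "0", "content": "Sunny, 21C"},
]

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "tgi", "messages": messages},
)
print(response.json())
```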
---------
Co-authored-by: sailesh duddupudi <saileshradar@gmail.com>
The current code does not work and gives the following message:
UserWarning: You have not specified a value for the `type` parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
warnings.warn(
Traceback (most recent call last):
File "/Users/angt/hf/tgi/test-gradio.py", line 22, in <module>
gr.ChatInterface(
TypeError: ChatInterface.__init__() got an unexpected keyword argument 'retry_btn'
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
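A hedged sketch of the corrected example (newer Gradio releases removed the `retry_btn` keyword and expect openai-style history via `type="messages"`; the local TGI URL is an assumption):

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:8080")

def respond(message, history):
    # With type="messages", history is a list of {"role", "content"} dicts.
    messages = [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    output = client.chat_completion(messages=messages, max_tokens=256)
    return output.choices[0].message.content

gr.ChatInterface(respond, type="messages").launch()
```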
* feat: add parsing of the HF_HUB_USER_AGENT_ORIGIN environment variable to add info about the environment running TGI. This is useful to track usage, for example in case of collaborations.
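A hedged illustration of the intent; the user-agent format below is made up, not the one TGI actually sends:

```python
# Append the origin, when set, to the user agent used for hub requests so
# that traffic can be attributed to the environment running TGI.
import os

def user_agent(base: str = "text-generation-inference") -> str:
    origin = os.environ.get("HF_HUB_USER_AGENT_ORIGIN")
    return f"{base}; origin/{origin}" if origin else base
```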
* fix: trufflehog
* feat: support qwen2.5 vl model
* fix: bump support models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
On some machines, downloading the config with hf_hub::api::sync::Api fails, which makes warmup fail since attributes like max_position_embeddings cannot be retrieved. Updating hf-hub to the latest version fixes it.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
* Putting back the NCCL forced upgrade.
* .
* ...
* Ignoring conda.
* Dropping conda from the build system + torch 2.6
* Cache min.
* Rolling back torch version.
* Reverting the EETQ modification.
* Fix flash attention ?
* Actually stay on flash v1.
* Patching flash v1.
* Torch 2.6, fork of rotary, eetq updated.
* Put back nccl latest (override torch).
* Slightly more reproducible build and not as scary.
* fix Qwen VL break in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* use the PositionRotaryEmbedding impl so that ROCm and IPEX both work
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Use Hub kernels for Marlin and cutlass quantization kernels
* Use hub kernels for MoE/GPTQ-Marlin MoE
* Use attention kernels from the Hub
* Cache the kernels in the Docker image
* Update moe kernels
* Support loading local kernels for development
* Support latest moe kernels
* Update to moe 0.1.1
* CI: download locked kernels for server tests
* Fixup some imports
* CI: activate venv
* Fix unused imports
* Nix: add attention/moe/quantization kernels
* Update hf-kernels to 0.1.5
* Update kernels
* Update tgi-nix flake for hf-kernels
* Fix EOF
* Take `load_kernel` out of a frequently-called function
* Hoist another case of kernel loading out of a somewhat hot function
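The hoisting pattern these two commits refer to, as a generic sketch (`load_kernel` below is a stand-in stub, not TGI's real helper):

```python
# Resolve a kernel once and cache it instead of re-resolving it inside a hot,
# per-token code path.
from functools import lru_cache

def load_kernel(name: str):
    # Stand-in for the real loader, which imports/loads a compiled extension.
    print(f"loading kernel {name!r}")
    return object()

@lru_cache(maxsize=None)
def get_kernel(name: str):
    # The expensive load only happens on the first call for a given name.
    return load_kernel(name)

def attention_step():
    kernel = get_kernel("paged-attention")  # cached after the first call
    return kernel

attention_step()
attention_step()  # "loading kernel" is printed only once
```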
* marlin-kernels -> quantization
* attention -> paged-attention
* EOF fix
* Update hf-kernels, fixup Docker
* ipex fix
* Remove outdated TODO
* Updating mllama after strftime.
* Town instead of village.
* Forgot the integration snapshot.
* Attempt to fix intel CPU.
* Intel extension fix.
* Workaround intel.
* Moving those deps directly into pyproject.
* Revert "Moving those deps directly into pyproject."
This reverts commit 98c1496ea6.
* Non system uv.
* Fixing the docker environment hopefully.
* Missed a step.
* Move workdir up a bit.
* Bailing out of reproducible python env.
* Triton version.
* backend(trtllm): bump TRTLLM to v.0.17.0
* backend(trtllm): forget to bump dockerfile
* backend(trtllm): use arg instead of env
* backend(trtllm): use correct library reference decoder_attention_src
* backend(trtllm): link against decoder_attention_{0|1}
* backend(trtllm): build against gcc-14 with cuda12.8
* backend(trtllm): use return value optimization flag as error if available
* backend(trtllm): make sure we escalate all warnings as errors on the backend impl in debug mode
* backend(trtllm): link against CUDA 12.8
* Using the "lockfile".
* Revert dummy modifications.
* Lock on python 3.11
* Another attempt.
* ..
* Bad cache hits.
* The good old monkey.
* How in the world...
* We need the launcher still.
* .
* ..
* Attempt #42
* Don't break all other builds.
* Mode max.
* Applying to other builds.
* feat: refactor model, improve startup and re enable tests
* fix: improve multimodal rotary embed caching
* fix: limit vision flop calc to qwen2 vl models and update config typing
* fix: include clippy lint
* feat: refactor position ids in warmup and bump tests
* fix: prefer default dtype
* fix: enable all cuda graphs and bump snapshots
* fix: adjust rotary init path
* fix: simplify get position ids and remove unused vision config
* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit
* fix: improve position id init during cuda warmup for mrope and simplify rotary forward
* fix: check existence before accessing rope type in cuda warmup
* fix: check key before access
* fix: improve mrope check in cuda graph warmup
* fix: remove check for default rope type
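A hedged sketch of the defensive check these fixes describe (the attribute and key names follow the usual transformers config layout, but are assumptions here):

```python
# Only treat the model as mrope when the config actually carries a rope
# section declaring that type; avoids KeyError/AttributeError during the
# cuda graph warmup.
def uses_mrope(config) -> bool:
    rope_scaling = getattr(config, "rope_scaling", None)
    if not isinstance(rope_scaling, dict):
        return False
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    return rope_type == "mrope"
```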
* fix: add more test and improve model generation
* fix: improve and simplify get_cos_sin, refactor and clean up get_position_ids
* fix: adjust signatures with types
* hotfix: fix trtllm CI build on release
* fix: test release.
* fix: test release.
* fix: test release. env not recognized https://github.com/actions/runner/issues/1661
* fix: test release. Works.