Commit Graph

1304 Commits

Author SHA1 Message Date
David Corvoysier
cd477d800c test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only hwen the neuron backend changes),
  which means less neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
  locally will export the models before the CI uses them.
2025-02-23 14:17:02 +01:00
Nicolas Patry
05ca5e4c0f ci: doing a precompilation step (with a different token). 2025-02-23 14:17:02 +01:00
David Corvoysier
10b57727c2 test(neuron): no error anymore when requesting too many tokens 2025-02-23 14:17:02 +01:00
David Corvoysier
4c0fa92cb4 feat(neuron): avoid installing CUDA in image 2025-02-23 14:17:02 +01:00
David Corvoysier
b5e98a6d5a test(neuron): use smaller llama model 2025-02-23 14:17:02 +01:00
David Corvoysier
6f92198eb9 fix(neuron): avoid using Levenshtein 2025-02-23 14:17:02 +01:00
David Corvoysier
88a0948692 refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
2025-02-23 14:17:02 +01:00
David Corvoysier
ae37890eef fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are ran for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
2025-02-23 14:17:02 +01:00
drbh
bb51c5138c feat: add neuron case to build ci 2025-02-23 14:17:02 +01:00
David Corvoysier
3bcc523e76 review: --privileged should be the exception 2025-02-23 14:17:02 +01:00
David Corvoysier
a053523e93 review: remove ureq pinned version 2025-02-23 14:17:02 +01:00
David Corvoysier
00931438ea review: do not use latest tag 2025-02-23 14:17:02 +01:00
David Corvoysier
9c998f9f7e test: add --neuron option 2025-02-23 14:17:02 +01:00
David Corvoysier
a3dcdab706 test(neuron): merge integration tests and fixtures 2025-02-23 14:17:02 +01:00
David Corvoysier
68e1c608f6 fix(neuron): increase ulimit when building image
The base image used to compile the rust components seems to have a low
ulimit for opened files, which leads to errors during compilation.
2025-02-23 14:17:02 +01:00
David Corvoysier
90578bfc65 feat(neuron): add server and integration tests 2025-02-23 14:17:02 +01:00
David Corvoysier
27526a55bc feat(neuron): add server standalone installation 2025-02-23 14:17:02 +01:00
David Corvoysier
d0ed1918d7 feat: add neuron backend 2025-02-23 14:17:02 +01:00
Daniël de Kok
97c5f7e685
Use rotary kernel from the Hub (#3041) 2025-02-21 13:55:31 +01:00
drbh
1cae3197c4
Improve tool call message processing (#3036)
* make content field optional in chat request

* add tool_calls field to Message struct

* feat: add test and serialize tool messages

* fix: bump utopia, openapi doc version and improve test

* fix: rerun update docs

* fix: suppoer tool call id in template and remove unnecessary changes

* fix: ruff lint remove unused import

* fix: adjust message types in tests

---------

Co-authored-by: sailesh duddupudi <saileshradar@gmail.com>
2025-02-21 10:30:29 +01:00
Adrien Gallouët
3498f6085e
Update Gradio ChatInterface configuration in consuming_tgi.md (#3042)
The current code does not work and gives the following message:

    UserWarning: You have not specified a value for the `type` parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
      warnings.warn(
    Traceback (most recent call last):
      File "/Users/angt/hf/tgi/test-gradio.py", line 22, in <module>
        gr.ChatInterface(
    TypeError: ChatInterface.__init__() got an unexpected keyword argument 'retry_btn'

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2025-02-21 10:11:28 +01:00
Nicolas Patry
142a49a80d
Simplify logs2. (#3045)
* Simplify logs2.

* Changing the scope from module to session to fix the event_loop issue.
2025-02-21 10:03:40 +01:00
Wang, Yi
06dfe9abfe
fix qwen2 vl crash in continous batching (#3004)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-20 18:36:45 -05:00
Daniël de Kok
ed96ba6503
flashinfer 0.2.0.post1 -> post2 (#3040)
* flashinfer 0.2.0.post1 -> post2

* Fix ruff stuff.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-20 12:34:20 +01:00
Wang, Yi
feaa2477b7
update ipex and torch to 2.6 for cpu (#3039)
ipex cpu 2.6 support topk_group in moe fusion ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-20 09:12:28 +01:00
Hugo Larcher
230aa25641
feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable for telemetry (#3027)
* feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable to add info about the environment running TGI. That is useful to track usage in case of collaborations for example.

* fix: trufflehog
2025-02-19 21:09:12 +01:00
Nicolas Patry
9c89d0070e
Having less logs in case of failure for checking CI more easily. (#3037)
* Having less logs in case of failure for checking CI more easily.

* Cleaning up the versions to uv for the client.

* Ignore entirely the API.
2025-02-19 17:01:33 +01:00
Nicolas Patry
fde3234cbc
Using public external registry (to use external runners for CI). (#3031)
* Using public external registry (to use external runners for CI).

* Fix build.

* Fixing the external registry.

* Fixing trtllm tests.
2025-02-19 14:53:14 +01:00
drbh
d6a0c67e2f
feat: add initial qwen2.5-vl model and test (#2971)
* feat: support qwen2.5 vl model

* fix: bump support models doc

* feat: check before rope type adjustment and small refactors

* fix: add transformer overlay for processor support

* fix: vendor processor and config from transformers

* fix: refactor/simplify conditionals
2025-02-19 12:38:20 +01:00
Cyril Vallez
a7448661f7
Improve Transformers support (#2970)
* Much better support

* add gpt neox

* bump transformers version

* bump version
2025-02-18 19:04:34 +01:00
Nicolas Patry
5543fdc765
It's find in some machine. using hf_hub::api::sync::Api to download c… (#3030)
It's find in some machine. using hf_hub::api::sync::Api to download config is not successful which will make warmup fail since attribute like max_position_embeddings could not be got. update hf-hub to the latest version could fix it

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-18 12:19:51 +01:00
Nicolas Patry
b8a4928d0e
Pinning trufflehog. (#3032) 2025-02-18 12:03:41 +01:00
Alvaro Bartolome
8a1cfd6122
Add loop_controls feature to minijinja to handle {% break %} (#2998)
* Add `loop_controls` feature to `minijinja`

* Add `test_chat_template_loop_controls` to test `break`
2025-02-18 10:33:22 +01:00
celsowm
794ec58b75
Update README.md (#3024)
only way to avoid:
error: experimental Nix feature 'nix-command' is disabled; add '--extra-experimental-features nix-command' to enable it
2025-02-18 10:08:28 +01:00
Daniël de Kok
f0ed76583c
Use eetq kernel from the hub (#3029)
* Use eetq kernel from the hub

* Fixing the CI.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-18 10:03:53 +01:00
Adrien Gallouët
cfd4fbb479
[Backend] Add Llamacpp backend (#2975)
* Add llamacpp backend

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Get rid of llama_batch_get_one()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use max_batch_total_tokens

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Handle max_batch_size

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add some input validation checks

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Handle ctx args & fix sampling

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add GPU args

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --defrag-threshold

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add a stupid batch mechanism

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --numa

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix args

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Enable flash attention by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --offload-kqv

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix batch_pos

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* backend(llama): add CUDA Dockerfile_llamacpp for now

* Only export the latest logits

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Output real logprobs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix batching

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix seq iterations

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Auto-detect n_threads when not provided

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Clear request cache after completion

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove warmup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* backend(llama): add CUDA architectures build argument for Dockerfile

* Add specific args for batch

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --type-v & --type-k

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llamacpp to b4623

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Disable graceful shutdown in debug mode

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update Dockerfile_llamacpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup Dockerfile

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update Cargo.lock

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update args

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Simplify batching logic

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Rename bindings

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove n_ctx

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Make max_batch_total_tokens optional

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Ensure all samplers are freed on error

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Initialize penalty_last_n with llamacpp default value

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Improve default settings

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update docs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Thanks clippy

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Thanks cargo fmt

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update docs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Do not use HOSTNAME env

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp & cuda

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix requirements.txt

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix fmt

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Enable KQV offload by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove Ngrok tunneling

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove .cargo/config.toml

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix Dockerfile

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add missing cuda prefix

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Handle custom llama.cpp dir

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add README.md

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add HF transfer

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix bool args

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2025-02-14 13:40:57 +01:00
Daniël de Kok
6df0fc0b55
Support sigmoid scoring function in GPTQ-MoE (#3017) 2025-02-14 11:33:49 +01:00
Nicolas Patry
d6881c37ab
Putting back the NCCL forced upgrade. (#2999)
* Putting back the NCCL forced upgrade.

* .

* ...

* Ignoring conda.

* Dropping conda from the buidl system + torch 2.6

* Cache min.

* Rolling back torch version.

* Reverting the EETQ modification.

* Fix flash attention ?

* Actually stay on flash v1.

* Patching flash v1.

* Torch 2.6, fork of rotary, eetq updated.

* Put back nccl latest (override torch).

* Slightly more reproducible build and not as scary.
2025-02-14 11:31:59 +01:00
Nicolas Patry
8a211dc7fc
Preventing single user hugging the server to death by asking (#3016)
for way too many tokens.
2025-02-13 11:23:17 +01:00
Nicolas Patry
4cccce4b44
Update the flaky mllama test. (#3015) 2025-02-12 12:26:52 +01:00
Wang, Yi
76bcb4948d
fix Qwen VL break in intel platform (#3002)
* fix Qwen VL break in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* could use PositionRotaryEmbedding impl so rocm and ipex could all work

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-12 11:31:34 +01:00
Nicolas Patry
b86c3947ab
Revert "Update the flaky mllama test."
This reverts commit 8a870b31b9.
2025-02-11 17:13:06 +01:00
Nicolas Patry
8a870b31b9
Update the flaky mllama test. 2025-02-11 17:10:36 +01:00
Daniël de Kok
571ac9b507
Use kernels from the kernel hub (#2988)
* Use Hub kernels for Marlin and cutlass quantization kernels

* Use hub kernels for MoE/GPTQ-Marlin MoE

* Use attention kernels from the Hub

* Cache the kernels in the Docker image

* Update moe kernels

* Support loading local kernels for development

* Support latest moe kernels

* Update to moe 0.1.1

* CI: download locked kernels for server tests

* Fixup some imports

* CI: activate venv

* Fix unused imports

* Nix: add attention/moe/quantization kernels

* Update hf-kernels to 0.1.5

* Update kernels

* Update tgi-nix flake for hf-kernels

* Fix EOF

* Take `load_kernel` out of a frequently-called function

* Hoist another case of kernel loading out of a somewhat hot function

* marlin-kernels -> quantization

* attention -> paged-attention

* EOF fix

* Update hf-kernels, fixup Docker

* ipex fix

* Remove outdated TODO
2025-02-10 19:19:25 +01:00
Nicolas Patry
4b8cda684b
Updating mllama after strftime. (#2993)
* Updating mllama after strftime.

* Town instead village.

* Forgot the integration snapshot.

* Attempt to fix intel CPU.

* Intel extension fix.

* Workaround intel.

* Moving those deps directly into pyproject.

* Revert "Moving those deps directly into pyproject."

This reverts commit 98c1496ea6.

* Non system uv.

* Fixing the docker environment hopefully.

* Missed a step.

* Move workdir up a bit.

* Bailing out of reproducible python env.

* Triton version.
2025-02-07 10:38:13 +01:00
Funtowicz Morgan
856709d5c3
[Backend] Bump TRTLLM to v.0.17.0 (#2991)
* backend(trtllm): bump TRTLLM to v.0.17.0

* backend(trtllm): forget to bump dockerfile

* backend(trtllm): use arg instead of env

* backend(trtllm): use correct library reference decoder_attention_src

* backend(trtllm): link against decoder_attention_{0|1}

* backend(trtllm): build against gcc-14 with cuda12.8

* backend(trtllm): use return value optimization flag as as error if available

* backend(trtllm): make sure we escalade all warnings as errors on the backend impl in debug mode

* backend(trtllm): link against CUDA 12.8
2025-02-06 16:45:03 +01:00
Wang, Yi
36223f834e
Triton fix (#2995)
fix triton to 3.1.0 to fix ipex import issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-06 12:28:41 +01:00
Nicolas Patry
0ef8c8a97a
Using the "lockfile". (#2992)
* Using the "lockfile".

* Revert dummy modifications.

* Lock on python 3.11

* Another attempt.

* ..

* Bad cache hits.

* The good old monkey.

* How in the world...

* We need the launcher still.

* .

* ..

* Attempt #42

* Don't break all other builds.

* Mode max.

* Applying to other builds.
2025-02-06 12:28:24 +01:00
drbh
c1cf36c0dc
Improve qwen vl impl (#2943)
* feat: refactor model, improve startup and re enable tests

* fix: improve multimodal rotary embed caching

* fix: limit vision flop calc to qwen2 vl models and update config typing

* fix: include clippy lint

* feat: refactor position ids in warmup and bump tests

* fix: prefer default dtype

* fix: enable all cuda graphs and bump snapshots

* fix: adjust rotaty init path

* fix: simplify get position ids and remove usused vision config

* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit

* fix: improve position id init during cuda warmup for mrope and simplfy rotary forward

* fix: check existance before accessing rope type in cuda warmup

* fix: check key before access

* fix: improve mrope check in cuda graph warmup

* fix: remove check for default rope type

* fix: add more test and improve model generation

* fix: improve and simplify get_cos_sin, refactors and cleanup  get_position_ids

* fix: adjust signatures with types
2025-02-04 12:44:18 -05:00
Daniël de Kok
dd2bd5fdb3
impureWithCuda: fix gcc version (#2990)
* impureWithCuda: fix gcc version

* trufflehog: do not fail on unverified results
2025-02-04 17:01:59 +01:00