Commit Graph

1288 Commits

Author SHA1 Message Date
Wang, Yi A
a38e65a9f9 install whl
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-11 04:27:05 -07:00
Wang, Yi A
3e7a879773 xpu 2.6 update
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-23 19:01:55 -08:00
Daniël de Kok
97c5f7e685
Use rotary kernel from the Hub (#3041) 2025-02-21 13:55:31 +01:00
drbh
1cae3197c4
Improve tool call message processing (#3036)
* make content field optional in chat request

* add tool_calls field to Message struct

* feat: add test and serialize tool messages

* fix: bump utoipa, openapi doc version and improve test

* fix: rerun update docs

* fix: support tool call id in template and remove unnecessary changes

* fix: ruff lint remove unused import

* fix: adjust message types in tests
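
Taken together, these changes target openai-style tool messages. An illustrative exchange (a sketch — the field shapes follow the OpenAI chat-completion schema this PR mirrors, not necessarily TGI's exact Rust structs):

```python
# Hypothetical tool-call conversation, per the openai-style schema.
messages = [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
        "role": "assistant",
        "content": None,  # content is now optional when tool_calls is present
        "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }],
    },
    # The tool result is matched back to the call via tool_call_id in the template.
    {"role": "tool", "tool_call_id": "call_0", "content": '{"temp_c": 12}'},
]
```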

---------

Co-authored-by: sailesh duddupudi <saileshradar@gmail.com>
2025-02-21 10:30:29 +01:00
Adrien Gallouët
3498f6085e
Update Gradio ChatInterface configuration in consuming_tgi.md (#3042)
The current code does not work and gives the following message:

    UserWarning: You have not specified a value for the `type` parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
      warnings.warn(
    Traceback (most recent call last):
      File "/Users/angt/hf/tgi/test-gradio.py", line 22, in <module>
        gr.ChatInterface(
    TypeError: ChatInterface.__init__() got an unexpected keyword argument 'retry_btn'
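
A working configuration along the lines of this fix might look as follows (a sketch, assuming a local TGI endpoint at http://localhost:8080 and a Gradio version that supports type="messages"; the `inference` helper name is illustrative):

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080")  # assumed TGI address

def inference(message, history):
    # With type="messages", history is already a list of openai-style
    # {"role": ..., "content": ...} dicts, so the new turn appends directly.
    messages = history + [{"role": "user", "content": message}]
    response = client.chat_completion(messages=messages, max_tokens=512)
    return response.choices[0].message.content

gr.ChatInterface(inference, type="messages").launch()
```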

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2025-02-21 10:11:28 +01:00
Nicolas Patry
142a49a80d
Simplify logs2. (#3045)
* Simplify logs2.

* Changing the scope from module to session to fix the event_loop issue.
2025-02-21 10:03:40 +01:00
Wang, Yi
06dfe9abfe
fix qwen2 vl crash in continuous batching (#3004)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-20 18:36:45 -05:00
Daniël de Kok
ed96ba6503
flashinfer 0.2.0.post1 -> post2 (#3040)
* flashinfer 0.2.0.post1 -> post2

* Fix ruff stuff.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-20 12:34:20 +01:00
Wang, Yi
feaa2477b7
update ipex and torch to 2.6 for cpu (#3039)
ipex cpu 2.6 supports topk_group in moe fusion ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-20 09:12:28 +01:00
Hugo Larcher
230aa25641
feat: Add parsing of the HF_HUB_USER_AGENT_ORIGIN environment variable for telemetry (#3027)
* feat: Add parsing of the HF_HUB_USER_AGENT_ORIGIN environment variable to add info about the environment running TGI. This is useful for tracking usage in the case of collaborations, for example.

* fix: trufflehog
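
For deployments that want their usage attributed, the variable is simply read from the environment; a minimal sketch (the value format below is an assumption, not a documented convention):

```python
import os

# Hypothetical origin tag; TGI forwards it in the Hub user agent for telemetry.
os.environ["HF_HUB_USER_AGENT_ORIGIN"] = "my-org/my-collab"
```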
2025-02-19 21:09:12 +01:00
Nicolas Patry
9c89d0070e
Having fewer logs in case of failure, for checking CI more easily. (#3037)
* Having fewer logs in case of failure, for checking CI more easily.

* Cleaning up the versions to uv for the client.

* Ignore entirely the API.
2025-02-19 17:01:33 +01:00
Nicolas Patry
fde3234cbc
Using public external registry (to use external runners for CI). (#3031)
* Using public external registry (to use external runners for CI).

* Fix build.

* Fixing the external registry.

* Fixing trtllm tests.
2025-02-19 14:53:14 +01:00
drbh
d6a0c67e2f
feat: add initial qwen2.5-vl model and test (#2971)
* feat: support qwen2.5 vl model

* fix: bump support models doc

* feat: check before rope type adjustment and small refactors

* fix: add transformer overlay for processor support

* fix: vendor processor and config from transformers

* fix: refactor/simplify conditionals
2025-02-19 12:38:20 +01:00
Cyril Vallez
a7448661f7
Improve Transformers support (#2970)
* Much better support

* add gpt neox

* bump transformers version

* bump version
2025-02-18 19:04:34 +01:00
Nicolas Patry
5543fdc765
On some machines, using hf_hub::api::sync::Api to download c… (#3030)
On some machines, using hf_hub::api::sync::Api to download the config fails, which makes warmup fail since attributes like max_position_embeddings cannot be fetched; updating hf-hub to the latest version fixes it.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-18 12:19:51 +01:00
Nicolas Patry
b8a4928d0e
Pinning trufflehog. (#3032) 2025-02-18 12:03:41 +01:00
Alvaro Bartolome
8a1cfd6122
Add loop_controls feature to minijinja to handle {% break %} (#2998)
* Add `loop_controls` feature to `minijinja`

* Add `test_chat_template_loop_controls` to test `break`
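
With loop_controls enabled, a chat template can stop iterating early. An illustrative fragment (hypothetical template, shown here as a Python string):

```python
# Hypothetical minijinja fragment using {% break %} to emit only the
# first system message from a conversation.
template = (
    "{%- for message in messages -%}"
    "{%- if message.role == 'system' -%}{{ message.content }}{% break %}{%- endif -%}"
    "{%- endfor -%}"
)
```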
2025-02-18 10:33:22 +01:00
celsowm
794ec58b75
Update README.md (#3024)
This is the only way to avoid:
error: experimental Nix feature 'nix-command' is disabled; add '--extra-experimental-features nix-command' to enable it
2025-02-18 10:08:28 +01:00
Daniël de Kok
f0ed76583c
Use eetq kernel from the hub (#3029)
* Use eetq kernel from the hub

* Fixing the CI.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-02-18 10:03:53 +01:00
Adrien Gallouët
cfd4fbb479
[Backend] Add Llamacpp backend (#2975)
* Add llamacpp backend

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Get rid of llama_batch_get_one()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use max_batch_total_tokens

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Handle max_batch_size

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add some input validation checks

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Handle ctx args & fix sampling

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add GPU args

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --defrag-threshold

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add a stupid batch mechanism

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --numa

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix args

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Enable flash attention by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --offload-kqv

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix batch_pos

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* backend(llama): add CUDA Dockerfile_llamacpp for now

* Only export the latest logits

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Output real logprobs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix batching

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix seq iterations

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Auto-detect n_threads when not provided

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Clear request cache after completion

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove warmup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* backend(llama): add CUDA architectures build argument for Dockerfile

* Add specific args for batch

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --type-v & --type-k

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llamacpp to b4623

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Disable graceful shutdown in debug mode

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update Dockerfile_llamacpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup Dockerfile

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update Cargo.lock

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update args

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Simplify batching logic

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Rename bindings

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove n_ctx

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Make max_batch_total_tokens optional

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Ensure all samplers are freed on error

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Initialize penalty_last_n with llamacpp default value

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Improve default settings

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update docs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Thanks clippy

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Thanks cargo fmt

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update docs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Do not use HOSTNAME env

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp & cuda

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix requirements.txt

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix fmt

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Enable KQV offload by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove Ngrok tunneling

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove .cargo/config.toml

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix Dockerfile

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add missing cuda prefix

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Handle custom llama.cpp dir

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add README.md

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add HF transfer

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix bool args

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2025-02-14 13:40:57 +01:00
Daniël de Kok
6df0fc0b55
Support sigmoid scoring function in GPTQ-MoE (#3017) 2025-02-14 11:33:49 +01:00
Nicolas Patry
d6881c37ab
Putting back the NCCL forced upgrade. (#2999)
* Putting back the NCCL forced upgrade.

* .

* ...

* Ignoring conda.

* Dropping conda from the build system + torch 2.6

* Cache min.

* Rolling back torch version.

* Reverting the EETQ modification.

* Fix flash attention ?

* Actually stay on flash v1.

* Patching flash v1.

* Torch 2.6, fork of rotary, eetq updated.

* Put back nccl latest (override torch).

* Slightly more reproducible build and not as scary.
2025-02-14 11:31:59 +01:00
Nicolas Patry
8a211dc7fc
Preventing a single user from hugging the server to death by asking (#3016)
for way too many tokens.
2025-02-13 11:23:17 +01:00
Nicolas Patry
4cccce4b44
Update the flaky mllama test. (#3015) 2025-02-12 12:26:52 +01:00
Wang, Yi
76bcb4948d
fix Qwen VL break in intel platform (#3002)
* fix Qwen VL break in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* use the PositionRotaryEmbedding impl so rocm and ipex can both work

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-12 11:31:34 +01:00
Nicolas Patry
b86c3947ab
Revert "Update the flaky mllama test."
This reverts commit 8a870b31b9.
2025-02-11 17:13:06 +01:00
Nicolas Patry
8a870b31b9
Update the flaky mllama test. 2025-02-11 17:10:36 +01:00
Daniël de Kok
571ac9b507
Use kernels from the kernel hub (#2988)
* Use Hub kernels for Marlin and cutlass quantization kernels

* Use hub kernels for MoE/GPTQ-Marlin MoE

* Use attention kernels from the Hub

* Cache the kernels in the Docker image

* Update moe kernels

* Support loading local kernels for development

* Support latest moe kernels

* Update to moe 0.1.1

* CI: download locked kernels for server tests

* Fixup some imports

* CI: activate venv

* Fix unused imports

* Nix: add attention/moe/quantization kernels

* Update hf-kernels to 0.1.5

* Update kernels

* Update tgi-nix flake for hf-kernels

* Fix EOF

* Take `load_kernel` out of a frequently-called function

* Hoist another case of kernel loading out of a somewhat hot function

* marlin-kernels -> quantization

* attention -> paged-attention

* EOF fix

* Update hf-kernels, fixup Docker

* ipex fix

* Remove outdated TODO
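
For context, pulling a pre-built kernel from the Hub looks roughly like this (a sketch; the repo id is illustrative, and the library — hf-kernels at the time of this commit — is nowadays published as `kernels`):

```python
from kernels import get_kernel

# Downloads (and caches) a pre-built, locked kernel from the Hugging Face Hub.
activation = get_kernel("kernels-community/activation")
```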
2025-02-10 19:19:25 +01:00
Nicolas Patry
4b8cda684b
Updating mllama after strftime. (#2993)
* Updating mllama after strftime.

* Town instead of village.

* Forgot the integration snapshot.

* Attempt to fix intel CPU.

* Intel extension fix.

* Workaround intel.

* Moving those deps directly into pyproject.

* Revert "Moving those deps directly into pyproject."

This reverts commit 98c1496ea6.

* Non system uv.

* Fixing the docker environment hopefully.

* Missed a step.

* Move workdir up a bit.

* Bailing out of reproducible python env.

* Triton version.
2025-02-07 10:38:13 +01:00
Funtowicz Morgan
856709d5c3
[Backend] Bump TRTLLM to v.0.17.0 (#2991)
* backend(trtllm): bump TRTLLM to v.0.17.0

* backend(trtllm): forget to bump dockerfile

* backend(trtllm): use arg instead of env

* backend(trtllm): use correct library reference decoder_attention_src

* backend(trtllm): link against decoder_attention_{0|1}

* backend(trtllm): build against gcc-14 with cuda12.8

* backend(trtllm): use return value optimization flag as error if available

* backend(trtllm): make sure we escalate all warnings to errors on the backend impl in debug mode

* backend(trtllm): link against CUDA 12.8
2025-02-06 16:45:03 +01:00
Wang, Yi
36223f834e
Triton fix (#2995)
pin triton to 3.1.0 to fix the ipex import issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-02-06 12:28:41 +01:00
Nicolas Patry
0ef8c8a97a
Using the "lockfile". (#2992)
* Using the "lockfile".

* Revert dummy modifications.

* Lock on python 3.11

* Another attempt.

* ..

* Bad cache hits.

* The good old monkey.

* How in the world...

* We need the launcher still.

* .

* ..

* Attempt #42

* Don't break all other builds.

* Mode max.

* Applying to other builds.
2025-02-06 12:28:24 +01:00
drbh
c1cf36c0dc
Improve qwen vl impl (#2943)
* feat: refactor model, improve startup and re-enable tests

* fix: improve multimodal rotary embed caching

* fix: limit vision flop calc to qwen2 vl models and update config typing

* fix: include clippy lint

* feat: refactor position ids in warmup and bump tests

* fix: prefer default dtype

* fix: enable all cuda graphs and bump snapshots

* fix: adjust rotary init path

* fix: simplify get position ids and remove unused vision config

* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit

* fix: improve position id init during cuda warmup for mrope and simplify rotary forward

* fix: check existence before accessing rope type in cuda warmup

* fix: check key before access

* fix: improve mrope check in cuda graph warmup

* fix: remove check for default rope type

* fix: add more test and improve model generation

* fix: improve and simplify get_cos_sin, refactor and clean up get_position_ids

* fix: adjust signatures with types
2025-02-04 12:44:18 -05:00
Daniël de Kok
dd2bd5fdb3
impureWithCuda: fix gcc version (#2990)
* impureWithCuda: fix gcc version

* trufflehog: do not fail on unverified results
2025-02-04 17:01:59 +01:00
Alvaro Bartolome
88fd56f549
Add strftime_now callable function for minijinja chat templates (#2983)
* Add `chrono` and `strftime_now` function callable

* Fix `test_chat_template_valid_with_strftime_now`

* Fix `test_chat_template_valid_with_strftime_now`
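
The new callable mirrors Python's datetime formatting; a minimal sketch of the semantics (the template line in the comment is illustrative):

```python
from datetime import datetime

# In a chat template: {{- strftime_now("%d %b %Y") }}
# which behaves like the Python expression below.
print(datetime.now().strftime("%d %b %Y"))  # e.g. "03 Feb 2025"
```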
2025-02-03 15:30:48 +01:00
Hugo Larcher
e3f2018cb5
hotfix: fix trtllm CI build on release (#2981)
* hotfix: fix trtllm CI build on release

* fix: test release.

* fix: test release.

* fix: test release. env not recognized https://github.com/actions/runner/issues/1661

* fix: test release. Works.
2025-02-03 11:11:15 +01:00
Nicolas Patry
bb69c5b199
Back on nix main. (#2979) 2025-01-31 14:39:52 +01:00
Nicolas Patry
c9d68945cc
Prepare for release 3.1.0 (#2972)
* Prepare for release 3.1.0

* Back on main flake.

* Fixing stuff.

* Upgrade to moe-kernels 0.8.2 for Hip support.

* Deactivating the flaky test.
2025-01-31 14:19:01 +01:00
Mohit Sharma
c07a2cc82b
Update moe-kernel to 0.8.2 for rocm (#2977)
update moe-kernel for amd
2025-01-31 11:40:00 +01:00
Hugo Larcher
065aabb13d
doc: Update TRTLLM deployment doc. (#2960)
* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.

* doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI.

* fix: PR comments
2025-01-30 18:04:42 +01:00
Nicolas Patry
cb747b33da
Add deepseekv3 (#2968)
* Add fp8 support moe models

add deepseekv3

format code

update dockerfile

update doc

* Small modifications.

* Moe kernels 0.8.1

* Upgrade to 0.8.1

* Fixing moe import.

* Black.

* Apply suggestions from code review

Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>

* Fixing Mixtral + Nits.

* Put link to ref.

* Fix other call locations.

* Scoring func `softmax` is the only one that works.

---------

Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
2025-01-30 16:40:25 +01:00
Nicolas Patry
80e7d98f88
Hotfixing intel-cpu (not sure how it was working before). (#2967)
* Hotfixing intel-cpu (not sure how it was working before).

* Do not fail on missing moe-kernels (Intel-cpu).
2025-01-29 22:34:41 +01:00
Daniël de Kok
ee0dffcd14
Update to moe-kernels 0.8.0 (#2966) 2025-01-29 18:19:55 +01:00
Mohit Sharma
4ef2e045c9
Add fp8 support moe models (#2928)
* Add fp8 support moe models

* flatten condition
2025-01-29 13:56:32 +01:00
Hugo Larcher
73b7cf83f6
Add backend name to telemetry (#2962)
* feat: Add backend name to telemetry
2025-01-28 16:53:16 +01:00
Nicolas Patry
eb3df0f46f
Fixing the oom maybe with 2.5.1 change. (#2958) 2025-01-28 10:30:28 +01:00
Hugo Larcher
c690da5973
fix: Telemetry (#2957)
* fix: add regular telemetry pings and fix unhandled errors to avoid not sending telemetry stop events.

* fix: simplify error handling

* fix: update ping delay and update doc.

* fix: clippy

* doc: Rephrase properly.
2025-01-28 10:29:18 +01:00
Daniël de Kok
db922eb77e
Update to attention-kernels 0.2.0 (#2950)
This version removes our patches/custom API, which makes it simpler to
pull in changes from upstream. One of those upstream changes lets us
enable FP8 KV cache for paged attention as well.
2025-01-27 11:42:36 +01:00
Funtowicz Morgan
40b00275b2
Attempt to remove AWS S3 flaky cache for sccache (#2953)
* backend(trtllm): attempt to remove AWS S3 flaky cache for sccache

* backend(trtllm): what if we expose ENV instead of inline?

* backend(trtllm): and with the right env var for gha sccache

* backend(trtllm): relax the way to detect sccache

* backend(trtllm): define sccache manually

* backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present

* backend(trtllm): export env variable in run mb?

* backend(trtllm): Cache mode max to cache intermediate layers

* backend(trtllm): inject ompi_version build arg in dependent step
2025-01-27 11:21:48 +01:00
Nicolas Patry
6cb41a80a1
Revert "Remove AWS credentials?"
This reverts commit d2ff68e98d.
2025-01-24 14:34:17 +01:00