Commit Graph

1416 Commits

David Corvoysier
d4bd5cac79
chore: version 3.3.4 2025-06-19 09:08:38 +00:00
David Corvoysier
238fbd4d50
Neuron backend fix and patch version 3.3.4 (#3273)
* fix(neuron): wrong assertion when batch_size==1

* chore: prepare 3.3.4
2025-06-19 10:52:41 +02:00
Wang, Yi
14ee6e7804
[gaudi] gemma3 text and vlm model initial support. need to add sliding window support later (#3270)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-19 09:32:34 +02:00
David Corvoysier
bd1bdebb47
doc: fix README (#3271) 2025-06-18 12:35:36 +02:00
regisss
f13e28c98d
[gaudi] Refine logging for Gaudi warmup (#3222)
* Refine logging for Gaudi warmup

* Make style

* Make style 2

* Flash causal LM case

* Add log_master & VLM cases

* Black
2025-06-18 12:34:00 +02:00
David Corvoysier
b4d17f18ff
chore: prepare release 3.3.3 (#3269) 2025-06-18 11:55:26 +02:00
Wang, Yi
0627983c17
[Gaudi] use pad_token_id to pad input id (#3268)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-17 09:07:25 +02:00
Yuan Wu
3752143b39
[Gaudi] Fix the integration-test issues (#3265)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-13 14:47:06 +02:00
Yuan Wu
ded4cb52ac
[Gaudi] Enable Qwen3_moe model (#3244)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-13 12:03:24 +02:00
Wang, Yi
a220e57f45
[gaudi] HuggingFaceM4/idefics2-8b issue fix (#3264)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-13 12:00:08 +02:00
Yuan Wu
e07056ab3f
[Gaudi] Remove optimum-habana (#3261)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-12 22:35:36 +02:00
Yuan Wu
25fdc5f03c
[gaudi] Move the _update_cos_sin_cache into get_cos_sin (#3254)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-12 22:31:11 +02:00
Wang, Yi
613b8dd647
[gaudi] Vlm rebase and issue fix in benchmark test (#3263)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-12 22:26:37 +02:00
Wang, Yi
839477670a
[gaudi] Perf optimization (#3256)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-06-11 15:00:21 +02:00
David Corvoysier
79183d1647
Bump neuron SDK version (#3260)
* chore(neuron): bump version to 0.2.0

* refactor(neuron): use named parameters in inputs helpers

This makes it possible to hide the differences between the two backends in
terms of input parameters.

* refactor(neuron): remove obsolete code paths

* fix(neuron): use neuron_config whenever possible

* fix(neuron): use new cache import path

* fix(neuron): neuron config is not stored in config anymore

* fix(nxd): adapt model retrieval to new APIs

* fix(generator): emulate greedy in sampling parameters

When on-device sampling is enabled, we need to emulate the greedy
behaviour using top-k=1, top-p=1, temperature=1.

* test(neuron): update models and expectations

* feat(neuron): support on-device sampling

* fix(neuron): adapt entrypoint

* tests(neuron): remove obsolete models

* fix(neuron): adjust test expectations for llama on nxd
2025-06-10 17:56:25 +02:00
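
For context on the greedy-emulation fix in the entry above: a toy sketch of why top-k=1, top-p=1, temperature=1 reproduces greedy decoding. This is illustrative only, not TGI's actual sampler, and the helper name is hypothetical.

```python
import torch

def sample_next_token(logits, top_k=1, top_p=1.0, temperature=1.0):
    # Temperature scaling; 1.0 leaves the distribution unchanged.
    logits = logits / temperature
    # Top-k filtering: with k=1 only the argmax token survives.
    topk_vals, topk_idx = torch.topk(logits, top_k)
    # top_p=1.0 is a no-op (no nucleus truncation), so it is omitted here.
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

logits = torch.tensor([0.1, 2.5, -1.0, 0.7])
# With top_k=1 the "sample" is always the argmax, i.e. greedy decoding.
assert sample_next_token(logits) == int(logits.argmax())
```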
Yuan Wu
1ff9d185d5
Remove useless packages (#3253)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-06-03 13:42:29 +02:00
Daniël de Kok
249189d96e
Prepare for 3.3.2 (#3249) 2025-05-30 16:16:36 +02:00
Yuan Wu
6b6e30a6f6
[gaudi] Fix the Llama-4-Maverick-17B-128E crash issue (#3246)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-05-29 11:38:44 +02:00
Yuan Wu
70217ac345
[Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct (#3245)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-05-29 09:58:24 +02:00
Wang, Yi
f14044009a
fp8 compressed tensors w8a8 support for Gaudi backend (#3242)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-28 14:54:20 +02:00
Yuan Wu
1883a62a94
Add Qwen3 for Gaudi backend (#3229)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-05-23 08:58:35 +02:00
Daniël de Kok
f58d7cf50e
Nix: switch to hf-nix (#3240)
* Nix: switch to hf-nix

* Remove outdated local overrides
2025-05-22 17:09:15 +02:00
Wang, Yi
f08b44ade5
Upgrade to new vllm extension ops for Gaudi backend (fix issue in exponential bucketing) (#3239)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-22 15:29:16 +02:00
Daniël de Kok
674c514d44
Prepare for 3.3.1 (#3238) 2025-05-22 09:43:55 +02:00
Wang, Yi
9e7e546923
Move input_ids to hpu and remove disposal of adapter_meta (#3237)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-22 09:21:31 +02:00
Daniël de Kok
e32528792c
Switch to punica-sgmv kernel from the Hub (#3236)
* Switch to punica-sgmv kernel from the Hub

This also switches (temporarily) to the tgi-nix/kernel-builder merge
branch, bumping up to CUDA 12.8 (same as non-Nix Torch).

* nix: client depends on aiohttp

This probably worked before the nixpkgs bump because a dependency
propagated aiohttp.
2025-05-21 15:44:15 +02:00
Wang, Yi
43b1b07fb9
Fix the crash in default ATTENTION path for Gaudi backend (#3235)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-20 14:02:32 +02:00
Wang, Yi
000e313a92
Refine warmup and upgrade to synapse AI 1.21.0 (#3234)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-20 10:22:43 +02:00
Wang, Yi
d658b5def3
Deepseek R1 for Gaudi backend (#3211)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-19 16:36:39 +02:00
drbh
58934c8b61
fix: count gpu uuids if NVIDIA_VISIBLE_DEVICES env set to all (#3230) 2025-05-16 11:48:58 -04:00
Yuan Wu
18cbecfb38
Enable Llama4 for Gaudi backend (#3223)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-05-15 14:35:37 +02:00
Daniël de Kok
7e531f413d
Update to Torch 2.7.0 (#3221)
* Update to Torch 2.7.0

* Try to fix typer/click issue

* Pin click to fix incompatibility with typer

* Fix some test outputs with slight deviations

* Attempt again to sync with CI

* Mamba too

* Fixup mllama

Also switch to `unsloth/Llama-3.2-11B-Vision-Instruct` for testing
from the EU :).
2025-05-15 11:48:33 +02:00
kaixuanliu
535ce23827
Adjust the round_up_seq logic in Gaudi backend (#3224)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-05-12 09:58:43 +02:00
kaixuanliu
c94f415af4
Change HPU warmup logic: seq length should grow exponentially (#3217)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
2025-05-10 15:41:18 +02:00
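
To illustrate the warmup change above (and the related round_up_seq adjustment in #3224): with exponential bucketing, warmup only compiles a geometric series of sequence lengths, and each request is padded up to the nearest bucket. A minimal sketch, assuming power-of-two style growth; the helper names are hypothetical, not the actual Gaudi backend code.

```python
import math

def exponential_buckets(base: int, maximum: int, growth: float = 2.0) -> list[int]:
    # Geometric series of sequence lengths, capped at the maximum.
    buckets, current = [], base
    while current < maximum:
        buckets.append(current)
        current = math.ceil(current * growth)
    buckets.append(maximum)
    return buckets

def round_up_seq(seq_len: int, buckets: list[int]) -> int:
    # Pad to the smallest warmed-up bucket that fits, so HPU graphs
    # are replayed instead of recompiled for every new shape.
    return next(b for b in buckets if b >= seq_len)

buckets = exponential_buckets(base=128, maximum=4096)
# buckets == [128, 256, 512, 1024, 2048, 4096]
assert round_up_seq(300, buckets) == 512
```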
Daniël de Kok
56c8189467
Prepare for 3.3.0 (#3220) 2025-05-09 15:50:29 +02:00
Mohit Sharma
329f612e55
Chunked Prefill VLM (#3188)
* add logic

* working

* add encoder cache free

* fixes

* fix idefics

* update pixel_values

* add improvements

* add improvements

* improve

* nit

* fix inputs_embeds

* nit

* optimizations

* add prometheus port

* rename vars

* rename vars

* nit

* disable chunking for qwen

* review comments

* remove port

* improve headdim

* remove kwargs and redundant args

* fix qwen2_5

* fix config image_token_id error

* fix test

* update paligemma

* fix paligemma text

* minor fix

* fix qwen test

* fix qwen test
2025-05-06 18:01:59 +02:00
Wang, Yi
533eee50dc
forward and tokenize chooser use the same shape (#3196)
* forward and tokenize chooser use the same shape
concatenation and filtering happen on CPU tensors to avoid dynamic shapes on HPU

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* use hpu set seed

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-06 10:49:32 +02:00
Wang, Yi
51a0b9d11c
IPEX support FP8 kvcache/softcap/slidingwindow (#3144)
* IPEX support FP8 kvcache

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add kvcache dtype

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add softcap and slidingwindow

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* kv scale in pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove triton installation; it will be installed with torch

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* install xelink lib

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* softcap default -1.0

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* softcap default -1.0

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-05-06 10:49:24 +02:00
regisss
f208ba6afc
Fix HF_HUB_OFFLINE=1 for Gaudi backend (#3193)
* Fix `HF_HUB_OFFLINE=1` for Gaudi backend

* Fix HF cache default value in server.rs

* Format
2025-05-06 10:47:53 +02:00
Julien Chaumond
7253be349a
Update client SDK snippets (#3207)
* Update client SDK snippets

* good catch from copilot
2025-05-01 17:10:51 +02:00
drbh
d303c1e37e
fix: bump snaps for mllama (#3202) 2025-05-01 10:20:45 -04:00
drbh
12ea8d74c7
Pr 2982 ci branch (#3046)
* Add json_schema alias for GrammarType

* Add tests for all aliases

* fix: various linter adjustments

* fix: end-of-file-fixer lint

* fix: add test snapshots and avoid docs change

* fix: another end-of-file-fixer lint

* feat: support json_schema grammar constraining and add tests

* fix: bump openapi doc with new grammar option

* fix: adjust test payload

* fix: bump test snaps

---------

Co-authored-by: Alex Weston <alexw@alkymi.io>
2025-05-01 10:17:16 -04:00
Julien Chaumond
6afe4307ab
doc typo (#3206)
typo
2025-05-01 14:31:48 +02:00
Alvaro Bartolome
40dfce644a
Skip {% generation %} and {% endgeneration %} template handling (#3204)
* Add `.DS_Store` file to `.gitignore`

* Skip `{% generation %}` and `{% endgeneration %}`

Custom syntax within the chat template for the Phi-4 reasoning models,
e.g. https://huggingface.co/microsoft/Phi-4-reasoning-plus, which is
AFAIK not handled natively yet, so it is skipped for now

* Update explanation on `{% generation %}` and `{% endgeneration %}` removal

* Revert "Add `.DS_Store` file to `.gitignore`"

This reverts commit d64d6d2f7f.
2025-05-01 12:13:17 +02:00
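
As a rough illustration of the template fix above: the unsupported `{% generation %}` / `{% endgeneration %}` tags are skipped before rendering, keeping the text between them. The router is written in Rust, so this Python regex is only a sketch of the idea, with a hypothetical helper name.

```python
import re

# Hypothetical sketch: drop the {% generation %} / {% endgeneration %}
# markers, which standard Jinja-style renderers do not understand, while
# keeping the content they wrap.
def strip_generation_tags(template: str) -> str:
    return re.sub(r"\{%-?\s*(?:end)?generation\s*-?%\}", "", template)

assert strip_generation_tags(
    "{% generation %}{{ message.content }}{% endgeneration %}"
) == "{{ message.content }}"
```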
Nicolas Patry
e7329fec18
Fixing the router + template for Qwen3. (#3200) 2025-04-29 16:29:26 +02:00
Nicolas Patry
39cfe232fd
Put more wiggle room. (#3189)
* Put more wiggle room.

* Fixing the makefile by using lockfile.

* Pre commit
2025-04-24 17:23:32 +02:00
Wang, Yi
375802948d
Warmup gaudi backend (#3172)
* clean cuda/rocm code in hpu backend, enable flat_hpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix TP in pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust block table in hpu to improve performance

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable all the models, not tested yet

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* use tensor cache in hpu graph to avoid replay issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add moe support, fix qwen/mistral/mixtral crash

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix phimoe issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* gpt_bigcode can also use pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable dbrx, remove some unused code

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* multi-modality initial PR

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust warmup and enable vlm

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix incorrect output in qwen2 idefics if hpu graph is used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove unused quantization code and enable awq/gptq int4

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix gptq issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable fp8

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup prefill

remove models where pageattn is not used, set block table to None since it's not used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add warmup_decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove block_tables and prefill_cache_indices, which would lead to dynamic shapes

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* missing gptj change...

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix some issues

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove torch.where to fix incorrect output in hpu graph model

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* LLM warmup logic

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* multi-modality warmup

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* optimize code

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine logs and fix some issues

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix warmup issue for mllama

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* pingpong optimization

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* match the latest vllm_extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* work with the latest vllm extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove block_scales which is not needed anymore

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* improve performance

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* prefill bypass graph

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* pingpong optimization issue fix

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-24 09:57:08 +02:00
Mohit Sharma
02715dc53f
Add option to configure prometheus port (#3187)
* add prometheus port

* fix doc

* add port for trtllm and llamacpp

* Fixing format after rebase.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-23 20:43:25 +05:30
Nicolas Patry
8f8819795f
Fixing CI (#3184) 2025-04-18 13:07:18 +02:00
Alvaro Bartolome
95ccba3705
Bump sccache to 0.10.0 (#3179)
* Ensure that `sccache` version is 0.10.0 or higher

* Rename `ACTIONS_CACHE_URL` to `ACTIONS_RESULTS_URL`
2025-04-18 12:45:32 +02:00