Commit Graph

1410 Commits

Author SHA1 Message Date
Wang, Yi
87840ab374
Merge 01f17d526c into 8f8819795f 2025-04-18 19:54:59 +05:30
Nicolas Patry
8f8819795f
Fixing CI (#3184) 2025-04-18 13:07:18 +02:00
Alvaro Bartolome
95ccba3705
Bump sccache to 0.10.0 (#3179)
* Ensure that `sccache` version is 0.10.0 or higher

* Rename `ACTIONS_CACHE_URL` to `ACTIONS_RESULTS_URL`
2025-04-18 12:45:32 +02:00
Hyeongchan Kim
b400c275e4
Get opentelemetry trace id from request headers instead of creating a new trace (#2648)
feature: get trace id from req headers

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-18 09:06:41 +02:00
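This fix continues an upstream trace from the incoming request instead of opening a new root trace. The actual change lives in the Rust router; below is a minimal Python sketch of the same idea, assuming W3C `traceparent` propagation (tracer name and attributes are illustrative):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("tgi.sketch")

def handle_request(headers: dict, inputs: str):
    # extract() parses traceparent/tracestate from the header carrier; if the
    # headers carry no trace context, it returns an empty context and a new
    # trace starts, matching the old behaviour.
    ctx = extract(headers)
    with tracer.start_as_current_span("generate", context=ctx) as span:
        span.set_attribute("request.input_length", len(inputs))
        # ... run generation ...
```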
Daniël de Kok
84ab88d843
Support flashinfer for Gemma3 prefill (#3167)
* launcher: ensure correct detection of Gemma 3 head size

* Support flashinfer for Gemma3 prefill

Gemma3 uses bidirectional attention for images. Flashinfer
supports custom masks. Hook up the mask with flashinfer, so that we do
not have to use the slower SDPA implementation for prefills with images.

* Update Gemma3 test outputs

* Fixed unused import
2025-04-17 18:07:41 +02:00
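The commit body gives the reasoning: image tokens attend bidirectionally while text remains causal, and flashinfer can consume a custom mask where the slower SDPA path was previously required. A sketch of how such a mask could be built (the function and span representation are illustrative, not the actual TGI code):

```python
import torch

def gemma3_prefill_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    # Default: causal attention, position i may attend to positions j <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Image tokens attend bidirectionally within their own image: unmask the
    # upper triangle inside each half-open [start, end) image span.
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask  # flattened, this is the kind of custom mask handed to the kernel
```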
Wang, Yi A
01f17d526c Merge branch 'main' into warmup_gaudi_backend 2025-04-15 22:16:42 -07:00
Wang, Yi A
bf3987e25e pingpong optimization issue fix
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-15 21:56:51 -07:00
Nicolas Patry
4645678ff0
Hotfix gaudi2 with newer transformers. (#3176) 2025-04-15 12:39:28 +02:00
Nicolas Patry
ad765cd06b
Hotfixing gaudi deps. (#3174) 2025-04-15 11:55:28 +02:00
Nicolas Patry
16b4b7974a
Upgrading the dependencies in Gaudi backend. (#3170)
* Upgrading the dependencies in Gaudi backend.

* Upgrading transformers version.
2025-04-15 11:49:06 +02:00
Wang, Yi
459fbdebe3
transformers flash llm/vlm enabling in ipex (#3152)
* transformers flash llm/vlm enabling in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex cpu could also be supported in the function

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-15 11:08:01 +02:00
Nicolas Patry
449cee49ca
setuptools <= 70.0 is vulnerable: CVE-2024-6345 (#3171) 2025-04-15 10:09:37 +02:00
Wang, Yi A
5ec7f15d0c prefill bypass graph
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-15 00:27:07 -07:00
Wang, Yi A
6b21985c95 Merge branch 'main' into warmup_gaudi_backend
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-14 18:24:34 -07:00
Mohit Sharma
73e797528d
L4 fixes (#3161)
add fix
2025-04-14 22:13:53 +05:30
Nicolas Patry
fe56f760df
Upgrading the python client deps (still deprecated, but used for integration-tests)
2025-04-14 17:18:43 +02:00
Wang, Yi
d62c941c56
Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu (#3113)
* clean cuda/rocm code in hpu backend, enable flat_hpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix TP in pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust block table in hpu to improve performance

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable all the models, not tested yet

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* use tensor cache in hpu graph to avoid replay issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add moe support, fix qwen/mistral/mixtral crash

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix phimoe issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* gpt_bigcode could also go pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable dbrx, remove some unused code

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* multi-modality initial PR

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust warmup and enable vlm

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix incorrect output in qwen2 idefics if hpu graph is used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove unused quantization code and enable awq/gptq int4

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix gptq issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable fp8

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup prefill

remove models where pageattn is not used; set block table to None since it's not used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add warmup_decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove block_tables and prefill_cache_indices, which would lead to dynamic shapes

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* missing gptj change...

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix some issues

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove torch.where to fix incorrect output in hpu graph model

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* match the latest vllm_extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-14 15:58:13 +02:00
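Several bullets above (warmup prefill/decode, dropping block_tables and prefill_cache_indices) serve one goal: keeping input shapes static so HPU graphs can be recorded once and replayed. A simplified sketch of the bucketing idea, with illustrative bucket sizes rather than the backend's real configuration:

```python
import torch
import torch.nn.functional as F

# HPU graphs are recorded per input shape, so inputs are padded up to a fixed
# set of buckets and each bucket is exercised once before serving.
PREFILL_BUCKETS = [128, 256, 512, 1024, 2048]

def pad_to_bucket(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    seq_len = input_ids.shape[-1]
    bucket = next(b for b in PREFILL_BUCKETS if b >= seq_len)
    return F.pad(input_ids, (0, bucket - seq_len), value=pad_token_id)

def warmup_prefill(model, pad_token_id: int = 0) -> None:
    # One dummy prefill per bucket records every graph up front; real
    # requests then always replay an already-recorded static-shape graph.
    for bucket in PREFILL_BUCKETS:
        dummy = torch.full((1, bucket), pad_token_id, dtype=torch.long)
        model(input_ids=dummy)
```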
Wang, Yi A
ba049c9d49 improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-13 20:00:27 -07:00
Wang, Yi A
76cc129796 remove block_scales which is not needed anymore
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-11 01:28:14 -07:00
Wang, Yi A
a83e9fe003 work with the latest vllm extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:56:58 -07:00
Wang, Yi A
4de8fb0127 Merge branch 'gaudi_backend_pa' into warmup_gaudi_backend 2025-04-10 19:42:22 -07:00
Wang, Yi A
4cdc34ec4d match the latest vllm_extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:32:32 -07:00
Wang, Yi A
610dd200e5 Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:20:28 -07:00
Wang, Yi A
cd900c3b72 pingpong optimization
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:16:05 -07:00
Nicolas Patry
9a8d0462e1
Fixing tokenization like https://github.com/huggingface/text-embeddings-inference/issues/525 (#3156)
Fixing tokenization like https://github.com/huggingface/text-embeddings-inference/issues/525
2025-04-09 18:42:25 +02:00
Nicolas Patry
5861da1ad7
Fixing Qwen 2.5 VL (32B). (#3157)
Reduce the config constraints, and use common ground between the 8B and 32B.
2025-04-09 17:07:30 +02:00
Nicolas Patry
0b28aabb94
3.2.3 (#3151) 2025-04-08 10:16:37 +02:00
oOraph
24bec29ffc
fix: compute type typo (#3150)
Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com>
Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>
2025-04-07 17:24:11 +02:00
Baptiste Colle
37104acd75
Gaudi: Add Integration Test for Gaudi Backend (#3142)
* feat(gaudi): add integration test

* feat(test): add more models to integration tests

* remove debug comments

* fix typos
2025-04-07 16:55:03 +02:00
Mohit Sharma
87a0af4ec2
Update transformers to 4.51 (#3148)
* update transformers

* Upgrading the nix deps too.

* Forcing torchvision to be in there.

* Fixing bug in mllama.

* Those tests cannot be run in CI.

* Lint.

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-07 12:55:43 +02:00
Mohit Sharma
9c26b52940
Use ROCM 6.3.1 (#3141)
* update dockerfile

* add updated makefile

* fix docker

* Lint.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-07 12:55:11 +02:00
Nicolas Patry
d23b385eee
Preparing for release. (#3147)
* Preparing for release.

* Adding hf-xet dependency.

* Merged tgi-nix update.
2025-04-06 11:36:00 +02:00
Mohit Sharma
d9bb9bebc9
Add llama4 (#3145)
* initial changes

* Add support for other vlm

* cleanup comment

* Improve attn_implementation

* Add comments for support of models

* add model

* add model

* fixes and improvements

* update docker

* Add cache position

* Add tests

* remove redundant changes

* remove tr version

* Upgrade doc + fix linting.

* Fixing the CI.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-06 10:20:22 +02:00
Wang, Yi A
29703dbd27 fix warmup issue for mllama
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-04 20:25:01 -07:00
Yuan Wu
3d059f91ab
Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE (#3131)
* Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Remove debug modifications

Signed-off-by: yuanwu <yuan.wu@intel.com>

---------

Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-04-03 10:34:53 +02:00
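The difference from the fixed BATCH_BUCKET_SIZE step is that bucket boundaries now double, so covering large batch sizes needs far fewer warmed-up graph shapes. Schematically (step size and growth base are illustrative):

```python
def linear_bucket(batch_size: int, step: int = 8) -> int:
    # Old scheme: round up to the next multiple of BATCH_BUCKET_SIZE.
    return ((batch_size + step - 1) // step) * step

def exponential_bucket(batch_size: int) -> int:
    # New scheme: round up to the next power of two; covering batch sizes up
    # to N needs only O(log N) distinct shapes instead of N / step.
    bucket = 1
    while bucket < batch_size:
        bucket *= 2
    return bucket

assert linear_bucket(33) == 40
assert exponential_bucket(33) == 64
```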
Wang, Yi A
8591687561 refine log and fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-03 00:11:22 -07:00
Wang, Yi A
a84da5b698 optimize code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:56:15 -07:00
Wang, Yi A
705cc0b619 multi-modality warmup
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:09:16 -07:00
Wang, Yi A
9d85ac9485 LLM warmup logic
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 23:07:14 -07:00
Wang, Yi A
c55a8caea2 remove torch.where to fix incorrect output in hpu graph model
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 22:51:54 -07:00
Wang, Yi A
f0e5faec1a fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 07:01:06 -07:00
Wang, Yi A
376e0507b7 missing gptj change...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 01:08:40 -07:00
Wang, Yi A
787dbe98a8 fix comment
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 00:09:26 -07:00
Wang, Yi A
7914e980e2 Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 00:03:49 -07:00
Wang, Yi A
1508ee8de1 remove block_tables and prefill_cache_indices, which would lead to dynamic shapes
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-27 23:57:59 -07:00
Wang, Yi A
7900be5ac3 warmup decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 20:19:13 -07:00
Wang, Yi A
ba7a131e04 add warmup_decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 17:39:26 -07:00
Corentin REGAL
0142550096
nix-v3.2.1 -> v3.2.1-nix (#3129)
make it easier to check the version using semver semantics (same major and minor)
2025-03-26 15:36:43 +01:00
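With the suffix moved behind the version, the tag matches the semver grammar (`-nix` becomes a pre-release identifier), so major and minor can be read off with a standard parse. An illustrative check, not the project's actual tooling:

```python
import re

# Semver-ish tag: optional leading "v", MAJOR.MINOR.PATCH, optional
# pre-release suffix such as "-nix".
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-[0-9A-Za-z.\-]+)?$")

def major_minor(tag: str) -> tuple[int, int] | None:
    m = SEMVER.match(tag)
    return (int(m.group(1)), int(m.group(2))) if m else None

assert major_minor("v3.2.1-nix") == (3, 2)  # new tag parses
assert major_minor("nix-v3.2.1") is None    # old tag does not
```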
Wang, Yi A
fd70ad703e warmup prefill
remove models where pageattn is not used; set block table to None since it's not used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 03:10:58 -07:00
Yuan Wu
f5f14dc660
Gaudi: Fix llava-next and mllama crash issue (#3127)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-25 15:08:15 +01:00