Commit Graph

1398 Commits

Wang, Yi A
5ec7f15d0c prefill bypass graph
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-15 00:27:07 -07:00
Wang, Yi A
6b21985c95 Merge branch 'main' into warmup_gaudi_backend
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-14 18:24:34 -07:00
Mohit Sharma
73e797528d
L4 fixes (#3161)
add fix
2025-04-14 22:13:53 +05:30
Nicolas Patry
fe56f760df
Upgrading the python client deps (still deprecated, but used for
integration-tests)
2025-04-14 17:18:43 +02:00
Wang, Yi
d62c941c56
Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu (#3113)
* clean cuda/rocm code in hpu backend, enable flat_hpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix TP in pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust block table in hpu to improve performance

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable all the models, not tested yet

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* use tensor cache in hpu graph to avoid replay issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add moe support, fix qwen/mistral/mixtral crash

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix phimoe issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* gpt_bigcode can also use pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable dbrx, remove some unused code

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* multi-modality initial PR

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust warmup and enable vlm

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix incorrect output in qwen2/idefics if hpu graph is used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove unused quantization code and enable awq/gptq int4

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix gptq issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable fp8

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup prefill

remove models where pageattn is not used; set the block table to None since it's not used (warmup is sketched after this entry)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add warmup_decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove block_tables and prefill_cache_indices, which would lead to dynamic shapes

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* missing gptj change...

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix some issues

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove torch.where to fix incorrect output in hpu graph model

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* match the latest vllm_extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-14 15:58:13 +02:00
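Several bullets in the PR above (warmup prefill, add warmup_decode, removing inputs that lead to dynamic shape) circle one constraint: HPU graphs are compiled per tensor shape, so every shape the server may see at runtime should be run once during warmup. A minimal sketch of that idea, assuming hypothetical bucket lists and a generic `model` callable rather than the backend's actual code:

```python
# Illustrative warmup loop; `model`, the bucket lists, and `pad_token_id`
# are assumptions for the sketch, not names from the Gaudi backend.
import torch

BATCH_BUCKETS = [1, 2, 4, 8]    # assumed decode batch-size buckets
SEQ_BUCKETS = [128, 256, 512]   # assumed prefill sequence-length buckets


def warmup(model, pad_token_id: int, device: str = "hpu") -> None:
    """Run one forward pass per bucketed shape so graphs compile up front."""
    with torch.no_grad():
        # Prefill: one pass per (batch=1, seq_bucket) shape.
        for seq_len in SEQ_BUCKETS:
            input_ids = torch.full((1, seq_len), pad_token_id, device=device)
            model(input_ids)
        # Decode: one pass per (batch_bucket, new_token=1) shape.
        for batch_size in BATCH_BUCKETS:
            input_ids = torch.full((batch_size, 1), pad_token_id, device=device)
            model(input_ids)
```

After warmup, a request padded to one of these buckets replays an already-compiled graph instead of triggering a fresh compilation, which is also why shape-varying inputs such as block_tables are dropped.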
Wang, Yi A
ba049c9d49 improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-13 20:00:27 -07:00
Wang, Yi A
76cc129796 remove block_scales, which is not needed anymore
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-11 01:28:14 -07:00
Wang, Yi A
a83e9fe003 work with the latest vllm extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:56:58 -07:00
Wang, Yi A
4de8fb0127 Merge branch 'gaudi_backend_pa' into warmup_gaudi_backend 2025-04-10 19:42:22 -07:00
Wang, Yi A
4cdc34ec4d match the latest vllm_extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:32:32 -07:00
Wang, Yi A
610dd200e5 Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:20:28 -07:00
Wang, Yi A
cd900c3b72 pingpong optimization
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:16:05 -07:00
Nicolas Patry
9a8d0462e1
Fixing tokenization like https://github.com/huggingface/text-embeddin… (#3156)
Fixing tokenization like https://github.com/huggingface/text-embeddings-inference/issues/525
2025-04-09 18:42:25 +02:00
Nicolas Patry
5861da1ad7
Fixing Qwen 2.5 VL (32B). (#3157)
Reduce the config constraints, and use common ground between the 8B and
32B.
2025-04-09 17:07:30 +02:00
Nicolas Patry
0b28aabb94
3.2.3 (#3151) 2025-04-08 10:16:37 +02:00
oOraph
24bec29ffc
fix: compute type typo (#3150)
Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com>
Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>
2025-04-07 17:24:11 +02:00
Baptiste Colle
37104acd75
Gaudi: Add Integration Test for Gaudi Backend (#3142)
* feat(gaudi): add integration test

* feat(test): add more models to integration tests

* remove debug comments

* fix typos
2025-04-07 16:55:03 +02:00
Mohit Sharma
87a0af4ec2
Update transformers to 4.51 (#3148)
* update transformers

* Upgrading the nix deps too.

* Forcing torchvision to be in there.

* Fixing bug in mllama.

* Those tests cannot be run in CI.

* Lint.

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-07 12:55:43 +02:00
Mohit Sharma
9c26b52940
Use ROCM 6.3.1 (#3141)
* update dockerfile

* add updated makefile

* fix docker

* Lint.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-07 12:55:11 +02:00
Nicolas Patry
d23b385eee
Preparing for release. (#3147)
* Preparing for release.

* Adding hf-xet dependency.

* Merged tgi-nix update.
2025-04-06 11:36:00 +02:00
Mohit Sharma
d9bb9bebc9
Add llama4 (#3145)
* initial changes

* Add support for other vlm

* cleanup comment

* Improve attn_implementation

* Add comments for support of models

* add model

* add model

* fixes and improvements

* update docker

* Add cache position

* Add tests

* remove redundant changes

* remove tr version

* Upgrade doc + fix linting.

* Fixing the CI.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-06 10:20:22 +02:00
Wang, Yi A
29703dbd27 fix warmup issue for mllama
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-04 20:25:01 -07:00
Yuan Wu
3d059f91ab
Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE (#3131)
* Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Remove debug modifications

Signed-off-by: yuanwu <yuan.wu@intel.com>

---------

Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-04-03 10:34:53 +02:00
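The bucketing idea: decode batches are padded up to bucket boundaries, and with exponential growth those boundaries double instead of stepping by a fixed BATCH_BUCKET_SIZE, so far fewer shapes need warming up. A hedged sketch of the scheme (function names are illustrative, not the backend's):

```python
def exponential_buckets(max_batch_size: int) -> list[int]:
    """Bucket boundaries that double up to max_batch_size: 1, 2, 4, ..."""
    buckets, size = [], 1
    while size < max_batch_size:
        buckets.append(size)
        size *= 2
    buckets.append(max_batch_size)
    return buckets


def round_up_to_bucket(batch_size: int, buckets: list[int]) -> int:
    """Pad a live batch size up to the nearest pre-warmed bucket."""
    for bucket in buckets:
        if batch_size <= bucket:
            return bucket
    raise ValueError(f"batch size {batch_size} exceeds largest bucket")
```

With max_batch_size=64 this yields [1, 2, 4, 8, 16, 32, 64]: seven shapes to warm up, where a fixed step of 4 would need sixteen.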
Wang, Yi A
8591687561 refine logging and fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-03 00:11:22 -07:00
Wang, Yi A
a84da5b698 optimize code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:56:15 -07:00
Wang, Yi A
705cc0b619 multi-modality warmup
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:09:16 -07:00
Wang, Yi A
9d85ac9485 LLM warmup logic
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 23:07:14 -07:00
Wang, Yi A
c55a8caea2 remove torch.where to fix incorrect output in hpu graph model
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 22:51:54 -07:00
Wang, Yi A
f0e5faec1a fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 07:01:06 -07:00
Wang, Yi A
376e0507b7 missing gptj change...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 01:08:40 -07:00
Wang, Yi A
787dbe98a8 fix comment
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 00:09:26 -07:00
Wang, Yi A
7914e980e2 Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 00:03:49 -07:00
Wang, Yi A
1508ee8de1 remove block_tables and prefill_cache_indices, which would lead to dynamic shapes
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-27 23:57:59 -07:00
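On HPU, any input whose shape changes between steps forces a graph recompilation, which is why block_tables and prefill_cache_indices go away here. The complementary move is to pad the inputs that remain to fixed bucket sizes; a small sketch with assumed bucket values (not the backend's actual code):

```python
import torch
import torch.nn.functional as F

SEQ_BUCKETS = [128, 256, 512, 1024]  # assumed sequence-length buckets


def pad_to_bucket(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Right-pad (batch, seq) input_ids to the nearest bucket length."""
    seq_len = input_ids.shape[1]
    bucket = next((b for b in SEQ_BUCKETS if b >= seq_len), None)
    if bucket is None:
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket")
    return F.pad(input_ids, (0, bucket - seq_len), value=pad_token_id)
```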
Wang, Yi A
7900be5ac3 warmup decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 20:19:13 -07:00
Wang, Yi A
ba7a131e04 add warmup_decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 17:39:26 -07:00
Corentin REGAL
0142550096
nix-v3.2.1 -> v3.2.1-nix (#3129)
make it easier to check the version using semver semantics (same major and minor)
2025-03-26 15:36:43 +01:00
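Leading with the semver means a plain vMAJOR.MINOR.PATCH prefix parse works on the tag. A sketch of the "same major and minor" check the rename enables (assumed logic, not code from the repo):

```python
import re

_SEMVER_PREFIX = re.compile(r"^v(\d+)\.(\d+)\.")


def same_major_minor(tag_a: str, tag_b: str) -> bool:
    """Compare tags like 'v3.2.1-nix' on their major.minor components."""
    a, b = _SEMVER_PREFIX.match(tag_a), _SEMVER_PREFIX.match(tag_b)
    if a is None or b is None:
        raise ValueError("tag does not start with a semver version")
    return a.groups() == b.groups()


assert same_major_minor("v3.2.1-nix", "v3.2.3")
```

The old nix-v3.2.1 form fails the prefix parse, which is exactly what the rename fixes.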
Wang, Yi A
fd70ad703e warmup prefill
remove models where pageattn is not used; set the block table to None since it's not used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 03:10:58 -07:00
Yuan Wu
f5f14dc660
Gaudi: Fix llava-next and mllama crash issue (#3127)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-25 15:08:15 +01:00
Wang, Yi A
69773767c5 enable fp8
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-25 05:06:55 -07:00
Nicolas Patry
54d15462dc
Torch 2.6 (#3134)
* Torch 2.6

* Upgrade the toolchain.

* Don't upgrade just yet.

* Upgrade toolchain.

* Time upgrade.

* TGI-nix main.

* Upgrade to transformers 4.50
2025-03-24 11:55:49 +01:00
Wang, Yi A
8d221b7b79 fix gptq issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 20:58:50 -07:00
Wang, Yi A
9914ffe1f1 remove unused quantization code and enable awq/gptq int4
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 19:37:20 -07:00
Wang, Yi A
fdf0733f56 fix incorrect output in qwen2/idefics if hpu graph is used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-21 01:01:37 -07:00
Wang, Yi A
36b6612f97 adjust warmup and enable vlm
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-20 23:12:52 -07:00
Baptiste Colle
2e60a8dd65
CI: enable server tests for backends (#3128)
add test for backends
2025-03-20 16:07:31 +01:00
Erik Kaunismäki
e5503eba78
configurable termination timeout (#3126)
* make shard and webserver termination timeouts configurable

* Updating documentation.

* Fmt.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-20 14:25:56 +01:00
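The shutdown being made configurable here follows the common terminate-then-kill pattern: send SIGTERM, wait up to the grace period, then force-kill. The launcher itself is Rust; this is a language-neutral sketch of the pattern with an illustrative timeout parameter:

```python
import subprocess


def terminate(proc: subprocess.Popen, timeout_secs: float) -> None:
    """Ask a shard/webserver process to exit; escalate after the timeout."""
    proc.terminate()  # SIGTERM: give the process a chance to clean up
    try:
        proc.wait(timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL once the configured grace period elapses
        proc.wait()
```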
Wang, Yi A
f95aa42660 multi-modality initial PR
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 23:30:12 -07:00
Wang, Yi A
d5b78ba16f Merge branch 'main' into gaudi_backend_pa 2025-03-19 18:15:08 -07:00
Wang, Yi A
2074d0516b enable dbrx, remove some unused code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 03:16:41 -07:00
Wang, Yi A
2cde30de24 gpt_bigcode can also use pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:59:31 -07:00