Commit Graph

98 Commits

Author SHA1 Message Date
Wang, Yi A
bf3987e25e pingpong optimization issue fix
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-15 21:56:51 -07:00
Wang, Yi A
5ec7f15d0c prefill bypass graph
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-15 00:27:07 -07:00
Wang, Yi A
ba049c9d49 improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-13 20:00:27 -07:00
Wang, Yi A
76cc129796 remove block_scales which is not needed anymore
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-11 01:28:14 -07:00
Wang, Yi A
a83e9fe003 work with the latest vllm extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:56:58 -07:00
Wang, Yi A
4de8fb0127 Merge branch 'gaudi_backend_pa' into warmup_gaudi_backend
2025-04-10 19:42:22 -07:00
Wang, Yi A
4cdc34ec4d match the latest vllm_extension ops
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 19:32:32 -07:00
Wang, Yi A
610dd200e5 Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:20:28 -07:00
Wang, Yi A
cd900c3b72 pingpong optimization
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-10 18:16:05 -07:00
Baptiste Colle
37104acd75
Gaudi: Add Integration Test for Gaudi Backend (#3142)
* feat(gaudi): add integration test

* feat(test): add more models to integration tests

* remove debug comments

* fix typos
2025-04-07 16:55:03 +02:00
Wang, Yi A
29703dbd27 fix warmup issue for mllama
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-04 20:25:01 -07:00
Yuan Wu
3d059f91ab
Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE (#3131)
* Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Remove debug modifications

Signed-off-by: yuanwu <yuan.wu@intel.com>

---------

Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-04-03 10:34:53 +02:00
Wang, Yi A
8591687561 refine log and fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-03 00:11:22 -07:00
Wang, Yi A
a84da5b698 optimize code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:56:15 -07:00
Wang, Yi A
705cc0b619 multi-modality warmup
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-02 00:09:16 -07:00
Wang, Yi A
9d85ac9485 LLM warmup logic
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 23:07:14 -07:00
Wang, Yi A
c55a8caea2 remove torch.where to fix incorrect output in hpu graph model
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-31 22:51:54 -07:00
Wang, Yi A
f0e5faec1a fix some issues
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 07:01:06 -07:00
Wang, Yi A
376e0507b7 missing gptj change...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 01:08:40 -07:00
Wang, Yi A
7914e980e2 Merge branch 'main' into gaudi_backend_pa
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-28 00:03:49 -07:00
Wang, Yi A
1508ee8de1 remove block_tables and prefill_cache_indices which will lead to dynamic shape
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-27 23:57:59 -07:00
Wang, Yi A
7900be5ac3 warmup decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 20:19:13 -07:00
Wang, Yi A
ba7a131e04 add warmup_decode
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 17:39:26 -07:00
Wang, Yi A
fd70ad703e warmup prefill
remove model where pageattn is not used, set block table to None since it's not used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-26 03:10:58 -07:00
Yuan Wu
f5f14dc660
Gaudi: Fix llava-next and mllama crash issue (#3127)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2025-03-25 15:08:15 +01:00
Wang, Yi A
69773767c5 enable fp8
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-25 05:06:55 -07:00
Wang, Yi A
8d221b7b79 fix gptq issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 20:58:50 -07:00
Wang, Yi A
9914ffe1f1 remove unused quantization code and enable awq/gptq int4
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-22 19:37:20 -07:00
Wang, Yi A
fdf0733f56 fix incorrect output in qwen2 idefics if hpu graph is used
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-21 01:01:37 -07:00
Wang, Yi A
36b6612f97 adjust warmup and enable vlm
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-20 23:12:52 -07:00
Wang, Yi A
f95aa42660 multi-modality initial PR
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 23:30:12 -07:00
Wang, Yi A
d5b78ba16f Merge branch 'main' into gaudi_backend_pa
2025-03-19 18:15:08 -07:00
Wang, Yi A
2074d0516b enable dbrx, remove some unused code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 03:16:41 -07:00
Wang, Yi A
2cde30de24 gpt_bigcode could also go pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:59:31 -07:00
Wang, Yi A
073f793976 fix phimoe issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 23:11:01 -07:00
Baptiste Colle
8c2c348f3c
Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork (#3117)
feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289
2025-03-18 09:45:52 +01:00
Wang, Yi A
5cd1c93cad add moe support, fix qwen/mistral/mixtral crash
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-18 00:45:15 -07:00
Wang, Yi A
6bbe24d974 use tensor cache in hpu graph to avoid replay issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:36:49 -07:00
Wang, Yi A
a07e7437b6 enable all the models, not tested yet
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 01:26:32 -07:00
Wang, Yi A
5d3653943c adjust block table in hpu to improve performance
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-16 20:28:01 -07:00
Wang, Yi A
b7fea6fc2f fix TP in pageattn
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 18:01:58 -07:00
Wang, Yi A
201dc6294f clean cuda/rocm code in hpu backend, enable flat_hpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-14 01:25:31 -07:00
Baptiste Colle
27ed848676
Release of Gaudi Backend for TGI (#3091)
* feat(gaudi): release ready (docs, docker image and vlm ready)

* fix(gaudi): add default argument for the dockerfile

* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
2025-03-13 10:56:01 +01:00
David Corvoysier
f01dc9e743
Update neuron backend (#3098)
* feat(neuron): use AWS Neuron SDK 2.21.1

* feat(neuron): bump optimum-neuron version

* feat(neuron): tag latest image for local tests

* test(neuron): simplify sampling test
2025-03-12 09:53:15 +01:00
Adrien Gallouët
094975c3a8
Update the llamacpp backend (#3022)
* Build faster

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Make --model-gguf optional

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Enable mmap, offload_kqv & flash_attention by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Better error message

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update installed packages

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Save gguf in models/MODEL_ID/model.gguf

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix build with Mach-O

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Quantize without llama-quantize

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp and switch to ggml-org

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove make-gguf.sh

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update Cargo.lock

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Support HF_HUB_USER_AGENT_ORIGIN

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
Nicolas Patry
8e92942a18
Making tool_calls a vector. (#3075)
* Making `tool_calls` a vector.

* Update doc.

* Fixing the nix overlay with updated version.

* Add openai dependency.

* Updating the old tests.

* Trying to reduce the logs in the case of errors.

* Less spammy logs too.
2025-03-05 22:32:31 +01:00
Hugo Larcher
d8ff7f2623
feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. (#3061)
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.

* fix: Rust version for Neuron

* fix: PR comments, use rust-toolchain.toml
2025-03-04 16:43:50 +01:00
Daniël de Kok
e88f6f6ee9
Add property-based testing for RadixAllocator (#3068)
2025-03-04 15:09:46 +01:00
Daniël de Kok
fa4e9511f8
Fix two edge cases in RadixTrie::find (#3067)
- Always return a node, not its parent.
- Do not recurse when a node does not represent a full prefix of the
  input.
2025-03-04 13:23:27 +01:00
Baptiste Colle
683ff53fa3
Add Gaudi Backend (#3055)
* wip(gaudi): import server and dockerfile from tgi-gaudi fork

* feat(gaudi): new gaudi backend working

* fix: fix style

* fix prehooks issues

* fix(gaudi): refactor server and implement requested changes
2025-02-28 12:14:58 +01:00