text-generation-inference/backends/gaudi/server
Wang, Yi 375802948d
Warmup gaudi backend (#3172)
* clean cuda/rocm code in hpu backend, enable flat_hpu
* fix TP in pageattn
* adjust block table in hpu to improve performance
* enable all the models (not tested yet)
* use tensor cache in hpu graph to avoid replay issue
* add moe support, fix qwen/mistral/mixtral crash
* fix phimoe issue
* gpt_bigcode can also use pageattn
* enable dbrx, remove some unused code
* multi-modality initial PR
* adjust warmup and enable vlm
* fix incorrect output in qwen2 idefics if hpu graph is used
* remove unused quantization code and enable awq/gptq int4
* fix gptq issue
* enable fp8
* warmup prefill: remove models where pageattn is not used, set block table to None since it's not used
* add warmup_decode
* warmup decode
* remove block_tables and prefill_cache_indices, which lead to dynamic shapes
* fix comment
* missing gptj change
* fix some issues
* remove torch.where to fix incorrect output in hpu graph model
* LLM warmup logic
* multi-modality warmup
* optimize code
* refine log and fix some issues
* fix warmup issue for mllama
* pingpong optimization
* match the latest vllm_extension ops
* work with the latest vllm extension ops
* remove block_scales, which is not needed anymore
* improve performance
* prefill bypass graph
* pingpong optimization issue fix

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-24 09:57:08 +02:00
| Name | Last commit | Date |
| --- | --- | --- |
| integration-tests | Gaudi: Add Integration Test for Gaudi Backend (#3142) | 2025-04-07 16:55:03 +02:00 |
| text_generation_server | Warmup gaudi backend (#3172) | 2025-04-24 09:57:08 +02:00 |
| .gitignore | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| dill-0.3.7-patch.sh | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| dill-0.3.8-patch.sh | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| Makefile | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| Makefile-awq | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| Makefile-eetq | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| Makefile-fbgemm | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| Makefile-flash-att | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| Makefile-flash-att-v2 | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| Makefile-selective-scan | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| Makefile-vllm | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| poetry.lock | Upgrading the dependencies in Gaudi backend. (#3170) | 2025-04-15 11:49:06 +02:00 |
| pyproject.toml | Upgrading the dependencies in Gaudi backend. (#3170) | 2025-04-15 11:49:06 +02:00 |
| README.md | Add Gaudi Backend (#3055) | 2025-02-28 12:14:58 +01:00 |
| requirements.txt | Hotfixing gaudi deps. (#3174) | 2025-04-15 11:55:28 +02:00 |

Text Generation Inference Python gRPC Server

A Python gRPC server for Text Generation Inference

Install

make install

Run

make run-dev