text-generation-inference/backends
Wang, Yi 375802948d
Warmup gaudi backend (#3172)
* clean up CUDA/ROCm code in the HPU backend, enable flat_hpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix tensor parallelism (TP) in paged attention

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust the block table on HPU to improve performance

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable all the models; not tested yet

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* use tensor cache in hpu graph to avoid replay issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
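
Not the actual backend code, but a minimal sketch of the tensor-cache pattern this refers to, in plain PyTorch with hypothetical names: a captured graph replays fixed memory addresses, so fresh inputs must be copied into persistent buffers rather than rebound as new tensors.

```python
import torch

class GraphInputCache:
    """Hypothetical sketch: persistent input buffers for graph replay."""

    def __init__(self):
        self._cache = {}  # name -> preallocated tensor

    def update(self, name: str, value: torch.Tensor) -> torch.Tensor:
        cached = self._cache.get(name)
        if cached is None or cached.shape != value.shape:
            # First capture (or a new shape bucket): allocate the buffer.
            cached = value.clone()
            self._cache[name] = cached
        else:
            # Replay path: write in place so the captured graph sees new data.
            cached.copy_(value)
        return cached
```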

* add MoE support; fix Qwen/Mistral/Mixtral crashes

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix phimoe issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* gpt_bigcode can also use paged attention

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable DBRX; remove some unused code

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* multi-modality initial PR

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust warmup and enable VLM

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix incorrect output in Qwen2/Idefics when HPU graph is used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove unused quantization code and enable AWQ/GPTQ int4

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix GPTQ issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable FP8

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup prefill

remove models where paged attention is not used; set the block table to None since it is not used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add warmup_decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
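
As a rough illustration of the prefill/decode warmup idea (the bucket values, device handling, and forward signature below are assumptions, not the real TGI Gaudi code): run one forward pass per shape bucket so every static graph is compiled before real traffic arrives.

```python
import itertools
import torch

BATCH_BUCKETS = [1, 2, 4, 8]           # assumed bucket values
PREFILL_LEN_BUCKETS = [128, 256, 512]

def warmup(model, device="cpu"):
    # Prefill warmup: whole prompt in one pass, per (batch, length) bucket.
    for batch, seq_len in itertools.product(BATCH_BUCKETS, PREFILL_LEN_BUCKETS):
        input_ids = torch.zeros(batch, seq_len, dtype=torch.long, device=device)
        model(input_ids)
    # Decode warmup: a single new token per sequence, per batch bucket.
    for batch in BATCH_BUCKETS:
        input_ids = torch.zeros(batch, 1, dtype=torch.long, device=device)
        model(input_ids)
```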

* remove block_tables and prefill_cache_indices, which would lead to dynamic shapes

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
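
Keeping shapes static is the point of this change; a minimal sketch of the usual alternative, padding every input up to the nearest bucket boundary (the function name and pad handling here are hypothetical):

```python
import math
import torch

def pad_to_bucket(input_ids: torch.Tensor, bucket: int, pad_id: int = 0) -> torch.Tensor:
    seq_len = input_ids.shape[-1]
    target = math.ceil(seq_len / bucket) * bucket  # next bucket boundary
    if target == seq_len:
        return input_ids
    pad = torch.full(
        (*input_ids.shape[:-1], target - seq_len),
        pad_id, dtype=input_ids.dtype, device=input_ids.device,
    )
    return torch.cat([input_ids, pad], dim=-1)
```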

* fix comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add missing gptj change

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix some issues

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove torch.where to fix incorrect output when the model runs under HPU graphs

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
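
The commit does not show what replaced torch.where; one common drop-in equivalent, assumed here for illustration only, is plain mask arithmetic:

```python
import torch

def where_via_arithmetic(mask: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Equivalent to torch.where(mask, a, b) for broadcastable shapes,
    # expressed without a select op.
    m = mask.to(a.dtype)
    return a * m + b * (1 - m)
```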

* LLM warmup logic

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* multi-modality warmup

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* optimize code

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine logging and fix some issues

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix warmup issue for mllama

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ping-pong optimization

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* match the latest vllm_extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* work with the latest vllm extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove block_scales, which is no longer needed

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* improve performance

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* let prefill bypass the HPU graph

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix ping-pong optimization issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-24 09:57:08 +02:00
client Revert "feat: improve qwen2-vl startup " (#2924) 2025-01-17 12:09:05 -05:00
gaudi Warmup gaudi backend (#3172) 2025-04-24 09:57:08 +02:00
grpc-metadata Upgrading our rustc version. (#2908) 2025-01-15 17:04:03 +01:00
llamacpp Add option to configure prometheus port (#3187) 2025-04-23 20:43:25 +05:30
neuron setuptools <= 70.0 is vulnerable: CVE-2024-6345 (#3171) 2025-04-15 10:09:37 +02:00
trtllm Add option to configure prometheus port (#3187) 2025-04-23 20:43:25 +05:30
v2 Add option to configure prometheus port (#3187) 2025-04-23 20:43:25 +05:30
v3 Warmup gaudi backend (#3172) 2025-04-24 09:57:08 +02:00