text-generation-inference/backends/v3
Wang, Yi 375802948d
Warmup gaudi backend (#3172)
* clean cuda/rocm code in hpu backend, enable flat_hpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix TP in pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust block table in hpu to improve performance

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable all the models. not tested yet

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* use tensor cache in hpu graph to avoid replay issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add moe support, fix qwen/mistral/mixtral crash

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix phimoe issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* gpt_bigcode could also go pageattn

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable dbrx, remove some unused code

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* multi-modality initial PR

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* adjust warmup and enable vlm

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix incorrect output in qwen2 idefics if hpu graph is used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove unused quantization code and enable awq/gptq int4

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix gptq issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable fp8

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup prefill

remove models where pageattn is not used, set block table to None since it's not used

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add warmup_decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* warmup decode

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove block_tables and prefill_cache_indices, which lead to dynamic shapes

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* missing gptj change...

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix some issues

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove torch.where to fix incorrect output in hpu graph model

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* LLM warmup logic

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* multi-modality warmup

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* optimize code

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine log and fix some issues

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix warmup issue for mllama

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* pingpong optimization

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* match the latest vllm_extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* work with the latest vllm extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* remove block_scales, which is no longer needed

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* improve performance

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* prefill bypass graph

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* pingpong optimization issue fix

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-04-24 09:57:08 +02:00
benches     Keeping the benchmark somewhere (#2401)                    2024-08-12 15:22:02 +02:00
src         Warmup gaudi backend (#3172)                               2025-04-24 09:57:08 +02:00
build.rs    Rebase TRT-llm (#2331)                                     2024-07-31 10:33:10 +02:00
Cargo.toml  Add property-based testing for RadixAllocator (#3068)      2025-03-04 15:09:46 +01:00