Mirror of https://github.com/huggingface/text-generation-inference.git (synced 2025-04-21 23:12:07 +00:00)
* clean cuda/rocm code in hpu backend, enable flat_hpu
* fix TP in pageattn
* adjust block table in hpu to improve performance
* enable all the models (not tested yet)
* use tensor cache in hpu graph to avoid replay issue
* add moe support, fix qwen/mistral/mixtral crash
* fix phimoe issue
* gpt_bigcode could also go pageattn
* enable dbrx, remove some unused code
* multi-modality initial PR
* adjust warmup and enable vlm
* fix incorrect output in qwen2 idefics if hpu graph is used
* remove unused quantization code and enable awq/gptq int4
* fix gptq issue
* enable fp8
* warmup prefill: remove models where pageattn is not used, set block table to None since it's not used
* add warmup_decode
* warmup decode
* remove block_tables and prefill_cache_indices, which would lead to dynamic shapes
* fix comment
* missing gptj change...
* fix some issues
* remove torch.where to fix incorrect output in hpu graph model
* match the latest vllm_extension ops

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
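A recurring theme in these commits (warmup_decode, dropping block_tables/prefill_cache_indices) is keeping tensor shapes static, so that HPU graphs captured during warmup can be replayed at serve time without recompilation. Below is a minimal sketch of the underlying bucketing idea in plain PyTorch; `pad_to_bucket` and the bucket sizes are illustrative assumptions, not TGI's actual API:

```python
import torch
import torch.nn.functional as F

# Illustrative bucket sizes, not TGI's real configuration.
PREFILL_BUCKETS = [128, 256, 512, 1024]

def pad_to_bucket(input_ids: torch.Tensor, buckets: list, pad_id: int = 0) -> torch.Tensor:
    """Pad the sequence dimension up to the next bucket size so the model
    only ever sees a small, fixed set of shapes. Static shapes are what
    allow a graph captured during warmup to be replayed later."""
    seq_len = input_ids.shape[-1]
    target = next((b for b in buckets if b >= seq_len), None)
    if target is None:
        raise ValueError(f"sequence length {seq_len} exceeds the largest bucket")
    # Right-pad the last dimension to the bucket boundary.
    return F.pad(input_ids, (0, target - seq_len), value=pad_id)

# A request of length 300 is padded to the 512 bucket, so it reuses the
# graph that was already warmed up for that shape.
ids = torch.randint(0, 32000, (1, 300))
print(pad_to_bucket(ids, PREFILL_BUCKETS).shape)  # torch.Size([1, 512])
```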
__init__.py
bloom_modeling.py
clip.py
flash_cohere_modeling.py
flash_dbrx_modeling.py
flash_deepseek_v2_modeling.py
flash_deepseek_v3_modeling.py
flash_gemma2_modeling.py
flash_gemma_modeling.py
flash_gpt2_modeling.py
flash_gptj_modeling.py
flash_llama_modeling.py
flash_llava_next.py
flash_mistral_modeling.py
flash_mixtral_modeling.py
flash_mllama.py
flash_neox_modeling.py
flash_pali_gemma_modeling.py
flash_phi_modeling.py
flash_phi_moe_modeling.py
flash_qwen2_modeling.py
flash_rw_modeling.py
flash_santacoder_modeling.py
flash_starcoder2_modeling.py
idefics2.py
idefics3.py
idefics_config.py
idefics_image_processing.py
idefics_modeling.py
idefics_perceiver.py
idefics_processing.py
idefics_vision.py
llava_next.py
mamba_modeling.py
mllama.py
qwen2_5_vl.py
qwen2_vl.py
siglip.py
vlm.py