text-generation-inference/server/text_generation_server/layers/attention
fxmarty 5e035063cf ROCm and sliding windows fixes (#2033)
* update vllm commit & fix models using sliding window
* update
* update commit
* fix bug where TunableOp is bound to CUDA graphs even when CUDA graphs are disabled
* enable TunableOp by default
* fix sliding window
* address review
* dead code
* precise comment
* is it flaky?
2024-09-24 03:42:29 +00:00
__init__.py           Purely refactors paged/attention into layers/attention and makes hardware differences more obvious with 1 file per hardware. (#1986)  2024-09-24 03:19:39 +00:00
cuda.py               Purely refactors paged/attention into layers/attention and makes hardware differences more obvious with 1 file per hardware. (#1986)  2024-09-24 03:19:39 +00:00
flash_attn_triton.py  Purely refactors paged/attention into layers/attention and makes hardware differences more obvious with 1 file per hardware. (#1986)  2024-09-24 03:19:39 +00:00
rocm.py               ROCm and sliding windows fixes (#2033)  2024-09-24 03:42:29 +00:00
xpu.py                ROCm and sliding windows fixes (#2033)  2024-09-24 03:42:29 +00:00