text-generation-inference/server/text_generation_server/layers/attention
Daniël de Kok fe41e13b45 Unify attention output handling
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention`
  functions.

This removes the difference in how the output is handled between
attention (output parameter) and paged attention (return value), as
sketched below. It also removes the assumption that the attention
implementation can write into a caller-provided output tensor (in
preparation for FlashInfer).
2024-08-01 13:41:34 +00:00
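A minimal sketch of the unified calling convention described in the commit message. The signatures are simplified assumptions for illustration; the real `attention` and `paged_attention` functions in cuda.py, rocm.py, and ipex.py take additional arguments (KV cache, block tables, sequence lengths, softmax scale, etc.).

```python
import torch


def attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    # The output tensor is allocated inside the function and returned,
    # instead of being written into a caller-provided `out` buffer.
    out = torch.empty_like(query)
    # ... kernel-specific prefill attention writes into `out` ...
    return out


def paged_attention(query: torch.Tensor, kv_cache: torch.Tensor) -> torch.Tensor:
    # Same convention for the decode path: allocate and return, so callers
    # treat both code paths identically and a backend such as FlashInfer
    # is not required to write into an externally allocated tensor.
    out = torch.empty_like(query)
    # ... paged attention kernel writes into `out` ...
    return out
```

With both functions returning their result, call sites no longer need to pre-allocate an output buffer for the prefill path while consuming a return value on the decode path.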
__init__.py feat: add ruff and resolve issue (#2262) 2024-07-26 10:29:09 -04:00
common.py [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940) 2024-07-01 23:28:00 +02:00
cuda.py Unify attention output handling 2024-08-01 13:41:34 +00:00
flash_attn_triton.py feat: add ruff and resolve issue (#2262) 2024-07-26 10:29:09 -04:00
ipex.py Unify attention output handling 2024-08-01 13:41:34 +00:00
rocm.py Unify attention output handling 2024-08-01 13:41:34 +00:00