text-generation-inference/server/text_generation_server/layers/attention
Daniël de Kok 59ea38cbca
Simplify the attention function (#2609)
* Simplify the `attention` function

- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
  `PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args);
  see the sketch below.

* Fixup flashinfer support
2024-10-17 10:42:52 +02:00
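
The shape of the change is easiest to see in code. Below is a minimal, self-contained sketch of the kwargs-only pattern the commit message describes, not TGI's actual implementation: the `*` marker and the explicit `query`/`key`/`value` arguments come from the commit message, while `softmax_scale` and `causal` are assumed parameters and the reference math stands in for the backend kernels (cuda.py, rocm.py, ipex.py, flashinfer.py) that the real function dispatches to.

```python
# Illustrative sketch only; parameter set beyond query/key/value is assumed.
import torch


def attention(
    *,  # kwargs-only: callers must name every argument, so the
        # identically-shaped Tensor arguments cannot be swapped silently
    query: torch.Tensor,  # [seq_len, num_heads, head_dim]
    key: torch.Tensor,    # [seq_len, num_heads, head_dim]
    value: torch.Tensor,  # [seq_len, num_heads, head_dim]
    softmax_scale: float,
    causal: bool = True,
) -> torch.Tensor:
    # Because the caller passes key/value explicitly, the function never has
    # to decide between fresh tensors and KV-cache contents itself; that is
    # what makes a per-backend PREFILL_IN_KVCACHE constant unnecessary.
    q = query.transpose(0, 1)             # [num_heads, seq_len, head_dim]
    k = key.transpose(0, 1)
    v = value.transpose(0, 1)
    scores = (q @ k.transpose(-2, -1)) * softmax_scale
    if causal:
        seq_len = scores.shape[-1]
        mask = torch.triu(
            scores.new_full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        scores = scores + mask
    out = torch.softmax(scores, dim=-1) @ v
    return out.transpose(0, 1)            # [seq_len, num_heads, head_dim]
```

A call site then names every tensor, which is the point of the kwargs-only design:

```python
q, k, v = (torch.randn(8, 4, 64) for _ in range(3))
out = attention(query=q, key=k, value=v, softmax_scale=64 ** -0.5)
# attention(q, k, v, ...) would raise a TypeError: positional use is blocked.
```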
File                  Last commit                                     Date
__init__.py           Simplify the attention function (#2609)         2024-10-17 10:42:52 +02:00
common.py             feat: prefill chunking (#2600)                  2024-10-16 12:49:33 +02:00
cuda.py               Simplify the attention function (#2609)         2024-10-17 10:42:52 +02:00
flash_attn_triton.py  feat: prefill chunking (#2600)                  2024-10-16 12:49:33 +02:00
flashinfer.py         flashinfer: pass window size and dtype (#2574)  2024-09-28 18:41:41 +02:00
ipex.py               Simplify the attention function (#2609)         2024-10-17 10:42:52 +02:00
kv_cache.py           Simplify the attention function (#2609)         2024-10-17 10:42:52 +02:00
rocm.py               Simplify the attention function (#2609)         2024-10-17 10:42:52 +02:00