From 08bf10ca17fe877825b96afe1435de68290956a7 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Mon, 21 Aug 2023 00:23:10 +0300
Subject: [PATCH] initial commit

---
 docs/source/conceptual/flash_attention.md | 7 +++++++
 1 file changed, 7 insertions(+)
 create mode 100644 docs/source/conceptual/flash_attention.md

diff --git a/docs/source/conceptual/flash_attention.md b/docs/source/conceptual/flash_attention.md
new file mode 100644
index 00000000..5214d96d
--- /dev/null
+++ b/docs/source/conceptual/flash_attention.md
@@ -0,0 +1,7 @@
+# Flash Attention
+
+Scaling the transformer architecture is heavily bottlenecked by the self-attention mechanism, which has quadratic time and memory complexity. Recent developments in accelerator hardware mainly focus on enhancing compute capacities, not on memory and data transfer between hardware components. This leaves the attention operation bottlenecked by memory access, i.e. it is _memory-bound_. Flash Attention is an attention algorithm used to overcome this problem and scale transformer-based models more efficiently, enabling faster training and inference.
+In the standard attention implementation, the cost of loading and writing keys, queries, and values from High Bandwidth Memory (HBM) is high. It loads keys, queries, and values from HBM to the GPU's on-chip SRAM, performs a single step of the attention mechanism, writes the result back to HBM, and repeats this for every step. Instead, Flash Attention loads the keys, queries, and values only once, fuses the operations of the attention mechanism, and writes the result back.
+It is implemented for supported models with custom kernels; you can check out the full list of models that support Flash Attention [here](https://github.com/huggingface/text-generation-inference/tree/main/server/text_generation_server/models).
+
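+The snippet below is a minimal sketch, not taken from the TGI codebase, contrasting the standard implementation with a fused kernel. It assumes PyTorch >= 2.0 and a CUDA GPU; the tensor shapes and the `naive_attention` helper are purely illustrative.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def naive_attention(q, k, v):
+    # Standard implementation: the full (seq_len x seq_len) score matrix is
+    # materialized and round-tripped through HBM for the softmax and the
+    # final matmul, which is what makes the operation memory-bound.
+    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
+    return F.softmax(scores, dim=-1) @ v
+
+# Illustrative shapes: (batch, num_heads, seq_len, head_dim)
+q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))
+
+out_naive = naive_attention(q, k, v)
+
+# PyTorch >= 2.0 can dispatch the same computation to a fused Flash Attention
+# kernel that streams tiles of the queries, keys, and values through on-chip
+# SRAM and never materializes the full attention matrix in HBM.
+out_fused = F.scaled_dot_product_attention(q, k, v)
+```
+
+Note that TGI relies on its own per-model custom kernels rather than this generic PyTorch entry point; the snippet only illustrates why fusing the attention operations avoids the HBM bottleneck.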