Added paper

# Flash Attention
Scaling the transformer architecture is heavily bottlenecked by the self-attention mechanism, which has quadratic time and memory complexity in the sequence length. Recent developments in accelerator hardware have mainly focused on increasing compute capacity rather than memory and the bandwidth for moving data between memory and compute units. As a result, the attention operation is bottlenecked by memory access, i.e. it is _memory-bound_. Flash Attention is an attention algorithm that reduces this problem and scales transformer-based models more efficiently, enabling faster training and inference.
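To see where the quadratic cost comes from, here is a minimal NumPy sketch of standard single-head attention (the function name `naive_attention` is just for illustration): the full `N x N` score matrix is materialized before the softmax and the weighted sum.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Q, K, V: (N, d) arrays for one attention head
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])        # (N, N): quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (N, d) output
```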
In the standard attention implementation, the cost of loading and writing keys, queries, and values from High Bandwidth Memory (HBM) is high. It loads keys, queries, and values from HBM to on-chip GPU SRAM, performs a single step of the attention mechanism, writes the result back to HBM, and repeats this for every step of the attention computation. Instead, Flash Attention loads keys, queries, and values once, fuses the operations of the attention mechanism into a single kernel, and writes the result back.
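As an illustration of the tiling idea, the sketch below (plain NumPy, with illustrative names such as `flash_attention_sketch` and `block_size`) processes keys and values block by block with an online softmax, so the `N x N` score matrix is never stored. The actual Flash Attention implementation does this in a fused CUDA kernel that keeps the tiles in on-chip SRAM and also tiles the queries.

```python
import numpy as np

def flash_attention_sketch(Q, K, V, block_size=64):
    # Simplified single-head tiling with online softmax; the real kernel
    # fuses these steps on-chip and never materializes the (N, N) scores.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(N, -np.inf)   # running row-wise max of the scores
    row_sum = np.zeros(N)           # running softmax denominator
    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale             # only an (N, block) tile of scores
        block_max = scores.max(axis=-1)
        new_max = np.maximum(row_max, block_max)
        # Rescale previously accumulated output and sum to the new running max
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ Vb
        row_sum = row_sum * correction + p.sum(axis=-1)
        row_max = new_max
    return out / row_sum[:, None]
```

Given the same inputs, this produces the same output as the naive version above while only holding one `(N, block_size)` block of scores at a time.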
It is implemented for models with custom kernels; you can check out the full list of models that support Flash Attention [here](https://github.com/huggingface/text-generation-inference/tree/main/server/text_generation_server/models).
You can learn more about Flash Attention by reading the [Flash Attention paper](https://arxiv.org/abs/2205.14135).