From 3fe2836a5498a09380c1d4244afb181df334022f Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 22 Aug 2023 21:18:50 +0300
Subject: [PATCH] paged attention initial commit

---
 docs/source/conceptual/paged_attention.md | 9 +++++++++
 1 file changed, 9 insertions(+)
 create mode 100644 docs/source/conceptual/paged_attention.md

diff --git a/docs/source/conceptual/paged_attention.md b/docs/source/conceptual/paged_attention.md
new file mode 100644
index 00000000..6ac86fdb
--- /dev/null
+++ b/docs/source/conceptual/paged_attention.md
@@ -0,0 +1,9 @@
+# Paged Attention
+
+LLMs struggle with memory limitations during generation. In the decoding phase of generation, the attention keys and values computed for all previous tokens are stored in GPU memory, also referred to as the _KV cache_. The KV cache consumes a large amount of memory, which causes inefficiencies in LLM serving.
+
+PagedAttention addresses this memory waste by partitioning the KV cache into fixed-size blocks, allowing keys and values to be stored in non-contiguous memory. This approach improves GPU utilization and throughput.
+
+PagedAttention also enables memory sharing across sequences, which is useful for parallel sampling. It keeps track of shared blocks through a block table and uses a copy-on-write mechanism to ensure safe sharing.
+
+You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).
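
To make the block-table and copy-on-write ideas described in the doc concrete, here is a minimal toy sketch in Python. The `BlockAllocator` and `Sequence` classes, the block size, and the reference-counting scheme are invented for illustration only and are not vLLM's or TGI's actual implementation.

```python
# Illustrative sketch only: a toy block table with copy-on-write sharing.
# Names, block size, and the reference-counting scheme are hypothetical and
# do not reflect any real serving framework's implementation.

BLOCK_SIZE = 4  # tokens per KV-cache block (hypothetical value)


class BlockAllocator:
    """Hands out fixed-size KV-cache blocks and reference-counts them."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_counts[block] += 1

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)


class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # The last block is full (or none exists yet): grab a new one.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            # Copy-on-write: if the partially filled block is shared with
            # another sequence, replace it with a private copy before writing
            # (a real implementation would also copy the block's contents).
            if self.allocator.ref_counts[last] > 1:
                new_block = self.allocator.allocate()
                self.allocator.free(last)
                self.block_table[-1] = new_block
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Parallel sampling: the child reuses the parent's blocks, no copy yet.
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.allocator.share(block)
        return child


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=16)
    prompt = Sequence(allocator)
    for _ in range(6):       # 6 prompt tokens -> 2 blocks used
        prompt.append_token()

    sample = prompt.fork()   # shares both blocks with the prompt
    sample.append_token()    # triggers copy-on-write on the last block
    print("prompt blocks:", prompt.block_table)
    print("sample blocks:", sample.block_table)
```

In this sketch the forked sequence keeps pointing at the prompt's blocks until it writes into a shared, partially filled block, at which point only that block is duplicated; this is the behavior the doc describes as safe sharing via copy-on-write.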