From 3fe2836a5498a09380c1d4244afb181df334022f Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 22 Aug 2023 21:18:50 +0300
Subject: [PATCH] paged attention initial commit

---
 docs/source/conceptual/paged_attention.md | 9 +++++++++
 1 file changed, 9 insertions(+)
 create mode 100644 docs/source/conceptual/paged_attention.md

diff --git a/docs/source/conceptual/paged_attention.md b/docs/source/conceptual/paged_attention.md
new file mode 100644
index 00000000..6ac86fdb
--- /dev/null
+++ b/docs/source/conceptual/paged_attention.md
@@ -0,0 +1,9 @@
+# Paged Attention
+
+LLMs struggle with memory limitations during generation. In the decoding phase of generation, the attention keys and values computed for all previous tokens are stored in GPU memory, also referred to as the _KV cache_. The KV cache consumes a large amount of memory, which causes inefficiencies in LLM serving.
+
+PagedAttention addresses this memory waste by partitioning the KV cache into fixed-size blocks, allowing keys and values to be stored in non-contiguous memory. This approach improves GPU utilization and throughput.
+
+PagedAttention also enables memory sharing across sequences, which is useful for parallel sampling. It keeps track of shared blocks through a block table and uses a copy-on-write mechanism to ensure safe sharing.
+
+You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).
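
To make the block-table and copy-on-write ideas described in the doc concrete, here is a minimal toy sketch in Python. The `BlockAllocator` and `Sequence` classes, the block size, and the reference-counting scheme are invented for illustration only and are not vLLM's or TGI's actual implementation.

```python
# Illustrative sketch only: a toy block table with copy-on-write sharing.
# Names, block size, and the reference-counting scheme are hypothetical and
# do not reflect any real serving framework's implementation.

BLOCK_SIZE = 4  # tokens per KV-cache block (hypothetical value)


class BlockAllocator:
    """Hands out fixed-size KV-cache blocks and reference-counts them."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_counts[block] += 1

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)


class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # The last block is full (or none exists yet): grab a new one.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            # Copy-on-write: if the partially filled block is shared with
            # another sequence, replace it with a private copy before writing
            # (a real implementation would also copy the block's contents).
            if self.allocator.ref_counts[last] > 1:
                new_block = self.allocator.allocate()
                self.allocator.free(last)
                self.block_table[-1] = new_block
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Parallel sampling: the child reuses the parent's blocks, no copy yet.
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.allocator.share(block)
        return child


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=16)
    prompt = Sequence(allocator)
    for _ in range(6):       # 6 prompt tokens -> 2 blocks used
        prompt.append_token()

    sample = prompt.fork()   # shares both blocks with the prompt
    sample.append_token()    # triggers copy-on-write on the last block
    print("prompt blocks:", prompt.block_table)
    print("sample blocks:", sample.block_table)
```

In this sketch the forked sequence keeps pointing at the prompt's blocks until it writes into a shared, partially filled block, at which point only that block is duplicated; this is the behavior the doc describes as safe sharing via copy-on-write.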