From 41cd2e350c77e408ceb523b5578b4437e81767ad Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Thu, 7 Sep 2023 14:46:21 +0200
Subject: [PATCH] Update docs/source/conceptual/paged_attention.md

Co-authored-by: Pedro Cuenca
---
 docs/source/conceptual/paged_attention.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/conceptual/paged_attention.md b/docs/source/conceptual/paged_attention.md
index e1b20f4d..90e68bb3 100644
--- a/docs/source/conceptual/paged_attention.md
+++ b/docs/source/conceptual/paged_attention.md
@@ -1,6 +1,6 @@
 # PagedAttention
 
-LLMs struggle with memory limitations during generation. In the decoding part of generation, all input tokens generated keys and values are stored in GPU memory, also referred to as _KV cache_. KV cache is exhaustive for memory, which causes inefficiencies in LLM serving.
+LLMs struggle with memory limitations during generation. In the decoding part of generation, all the attention keys and values generated for previous tokens are stored in GPU memory for reuse. This is called _KV cache_, and it may take up a large amount of memory for large models and long sequences.
 
 PagedAttention addresses the memory waste by partitioning the KV cache into blocks, allowing keys and values to be stored in non-contiguous memory. This approach improves GPU utilization and throughput.
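
For readers of the paragraph changed above, a toy sketch of the idea may help: the KV cache is split into fixed-size blocks drawn from a shared pool, and a per-sequence block table maps logical token positions to physical blocks, so a sequence's cache need not be contiguous in memory. This is an illustrative sketch only, not the PagedAttention kernel used by TGI or vLLM; the block size, pool size, and helper names (`append_kv`, `gather_kv`) are assumptions made for the example.

```python
# Toy paged KV cache: fixed-size blocks from a shared pool, indexed
# through a per-sequence block table (illustration only).
import torch

BLOCK_SIZE = 16          # tokens per block (assumed value for illustration)
NUM_BLOCKS = 256         # total blocks in the shared memory pool
NUM_HEADS, HEAD_DIM = 8, 64

# One shared pool of KV blocks; dim 1 holds keys (0) and values (1).
kv_pool = torch.zeros(NUM_BLOCKS, 2, BLOCK_SIZE, NUM_HEADS, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))
block_tables: dict[int, list[int]] = {}   # sequence id -> physical block ids

def append_kv(seq_id: int, pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
    """Store the key/value of token `pos` (appended in order) for `seq_id`."""
    table = block_tables.setdefault(seq_id, [])
    if pos % BLOCK_SIZE == 0:
        # Sequence grew past its last block: grab any free block,
        # which need not be adjacent to the previous one.
        table.append(free_blocks.pop())
    block_id, offset = table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
    kv_pool[block_id, 0, offset] = k
    kv_pool[block_id, 1, offset] = v

def gather_kv(seq_id: int, length: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Reassemble the logical KV cache for attention over `length` tokens."""
    table = block_tables[seq_id]
    keys = torch.cat([kv_pool[b, 0] for b in table])[:length]
    values = torch.cat([kv_pool[b, 1] for b in table])[:length]
    return keys, values

# Example: cache the KV pairs of a 40-token prompt for sequence 0.
for pos in range(40):
    append_kv(0, pos, torch.randn(NUM_HEADS, HEAD_DIM), torch.randn(NUM_HEADS, HEAD_DIM))
keys, values = gather_kv(0, 40)   # shape: (40, NUM_HEADS, HEAD_DIM)
```

Because any free block can back the next `BLOCK_SIZE` tokens of any sequence, memory is allocated on demand and waste is limited to at most one partially filled block per sequence, which is what lets more sequences fit in GPU memory at once.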