From 2ec5436f9c6f19b352b6a234c7f33c204d75faf2 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Wed, 23 Aug 2023 17:00:40 +0300
Subject: [PATCH] Removed internal implementation details and clarified

---
 docs/source/conceptual/paged_attention.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/conceptual/paged_attention.md b/docs/source/conceptual/paged_attention.md
index 6ac86fdb..b5fe8e08 100644
--- a/docs/source/conceptual/paged_attention.md
+++ b/docs/source/conceptual/paged_attention.md
@@ -1,9 +1,9 @@
 # Paged Attention
 
-LLMs struggle with memory limitations during generation. In the decoding part of generation, all input tokens generated keys and values are stored in GPU memory, also referred as _KV cache_. KV cache is exhaustive for memory which causes inefficiencies in LLM serving.
+LLMs struggle with memory limitations during generation. In the decoding part of generation, the keys and values generated for all input tokens are stored in GPU memory, also referred to as the _KV cache_. The KV cache is memory-intensive, which causes inefficiencies in LLM serving.
 
 PagedAttention addresses the memory waste by partitioning the KV cache into blocks, allowing keys and values to be stored in non-contiguous memory. This approach improves GPU utilization and throughput.
 
-PagedAttention also enables memory sharing, useful for parallel sampling. PagedAttention keeps track of shared memory through a block table and implements the Copy-on-Write mechanism to ensure safe sharing.
+PagedAttention keeps a block table for memory sharing. This enables, for example, parallel sampling, where multiple outputs are generated for a given prompt and the computation and memory are shared between those outputs.
 
-You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).
\ No newline at end of file
+You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).
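
The revised doc text above describes PagedAttention at a high level: the KV cache is split into fixed-size blocks, a block table maps each sequence's logical blocks to possibly non-contiguous physical blocks, and parallel samples of the same prompt can point at the same physical blocks. The following is a minimal Python sketch of that block-table idea; all names (`PagedKVCache`, `allocate`, `fork`, `BLOCK_SIZE`) are hypothetical and do not reflect vLLM's or TGI's actual implementation.

```python
# Illustrative block table for a paged KV cache (conceptual sketch only).

BLOCK_SIZE = 4  # tokens stored per KV-cache block (assumed value)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        # Pool of free physical block ids and a reference count per block.
        self.free_blocks = list(range(num_physical_blocks))
        self.ref_counts = [0] * num_physical_blocks
        # Block table: sequence id -> physical block ids, in logical order.
        self.block_tables: dict[int, list[int]] = {}

    def allocate(self, seq_id: int, num_tokens: int) -> None:
        """Reserve enough (possibly non-contiguous) blocks for a sequence."""
        num_blocks = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        for b in blocks:
            self.ref_counts[b] = 1
        self.block_tables[seq_id] = blocks

    def fork(self, parent_id: int, child_id: int) -> None:
        """Let a new sequence share the parent's blocks (parallel sampling)."""
        blocks = self.block_tables[parent_id]
        for b in blocks:
            self.ref_counts[b] += 1
        self.block_tables[child_id] = list(blocks)


cache = PagedKVCache(num_physical_blocks=16)
cache.allocate(seq_id=0, num_tokens=10)  # a 10-token prompt needs 3 blocks
cache.fork(parent_id=0, child_id=1)      # a second sample reuses those blocks
print(cache.block_tables)                # both sequences map to the same physical blocks
```

In this sketch, `fork` is what makes parallel sampling cheap: the prompt's blocks are stored once and merely referenced by each output sequence, rather than copied per sample.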