From 5d27a467eb32463c8320eb45efed7130148d991b Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Thu, 7 Sep 2023 14:46:29 +0200
Subject: [PATCH] Update docs/source/conceptual/paged_attention.md

Co-authored-by: Pedro Cuenca
---
 docs/source/conceptual/paged_attention.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/conceptual/paged_attention.md b/docs/source/conceptual/paged_attention.md
index 90e68bb3..f5062767 100644
--- a/docs/source/conceptual/paged_attention.md
+++ b/docs/source/conceptual/paged_attention.md
@@ -4,6 +4,6 @@ LLMs struggle with memory limitations during generation. In the decoding part of
 
 PagedAttention addresses the memory waste by partitioning the KV cache into blocks, allowing keys and values to be stored in non-contiguous memory. This approach improves GPU utilization and throughput.
 
-PagedAttention keeps a block table for memory sharing. This enables e.g. parallel sampling, where for a given prompt, multiple outputs are generated, and the computation and memory are shared between the outputs.
+The use of a lookup table to access the memory blocks can also help with KV sharing across multiple generations. This is helpful for techniques such as _parallel sampling_, where multiple outputs are generated simultaneously for the same prompt. In this case, the cached KV blocks can be shared among the generations.
 
 You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).
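The paragraph added by this patch describes how a lookup (block) table lets parallel samples share cached KV blocks. As a rough illustration only, not vLLM's actual implementation or API (`BLOCK_SIZE`, `BlockAllocator`, `Sequence`, and all other names below are hypothetical), a minimal Python sketch of a block table with reference-counted physical blocks and copy-on-write sharing might look like this:

```python
# Hypothetical sketch of a PagedAttention-style block table. The KV cache is
# split into fixed-size physical blocks; each sequence keeps a table mapping
# logical block indices to physical ones. Parallel samples forked from one
# prompt reuse the prompt's blocks via reference counts, and a shared block
# is only replaced with a private one when a sharer writes to it.

BLOCK_SIZE = 16  # tokens stored per KV block (illustrative value)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_counts[block] = 1
        return block

    def fork(self, block: int) -> int:
        # Share an existing block instead of copying its contents.
        self.ref_counts[block] += 1
        return block

    def free_block(self, block: int):
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free.append(block)

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical index -> physical block
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Current block is full (or the sequence is empty): grab a new one.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_counts[last] > 1:
                # Copy-on-write: appending into a block shared with another
                # sequence requires a private copy of that block.
                self.allocator.free_block(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Parallel sampling: a new sample reuses the prompt's KV blocks.
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.fork(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

# Usage: one prompt, two samples sharing its cached KV blocks.
alloc = BlockAllocator(num_blocks=64)
prompt = Sequence(alloc)
for _ in range(40):        # 40 prompt tokens -> 3 physical blocks
    prompt.append_token()
sample = prompt.fork()     # shares all 3 blocks; no prompt KV is recomputed
assert sample.block_table == prompt.block_table
```

In this sketch the table indirection is what allows blocks to live anywhere in memory and to be shared: only the per-sequence tables differ, while the physical blocks holding the prompt's keys and values are stored once.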