From 5ec7b1a2afc7c94f15cd220c24b683d33bfb5822 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Thu, 7 Sep 2023 14:46:03 +0200
Subject: [PATCH] Update docs/source/conceptual/paged_attention.md

Co-authored-by: Pedro Cuenca
---
 docs/source/conceptual/paged_attention.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/conceptual/paged_attention.md b/docs/source/conceptual/paged_attention.md
index b5fe8e08..e1b20f4d 100644
--- a/docs/source/conceptual/paged_attention.md
+++ b/docs/source/conceptual/paged_attention.md
@@ -1,4 +1,4 @@
-# Paged Attention
+# PagedAttention
 
 LLMs struggle with memory limitations during generation. In the decoding part of generation, the keys and values generated for all input tokens are stored in GPU memory, also referred to as the _KV cache_. The KV cache is very memory-intensive, which causes inefficiencies in LLM serving.
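
To make the memory cost of the KV cache concrete, here is a minimal back-of-the-envelope sketch. The layer count, head count, head dimension, sequence length, and fp16 precision are illustrative assumptions (roughly a 7B-parameter decoder); they are not values taken from the patch above.

```python
# Rough KV cache size for a single sequence (illustrative assumptions,
# roughly a 7B-parameter decoder served in fp16).
num_layers = 32        # decoder layers
num_kv_heads = 32      # attention heads whose keys/values are cached
head_dim = 128         # per-head hidden dimension
seq_len = 2048         # prompt + generated tokens
bytes_per_elem = 2     # fp16

# The factor of 2 accounts for storing both keys and values.
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache per sequence: {kv_cache_bytes / 1024**3:.2f} GiB")  # ~1.00 GiB
```

Under these assumptions a single 2048-token sequence already needs on the order of a gibibyte of GPU memory for its cache, so a batch of concurrent requests quickly consumes tens of gigabytes, which is the kind of serving inefficiency the PagedAttention guide above is concerned with.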