Update docs/source/conceptual/paged_attention.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Merve Noyan authored on 2023-09-07 14:46:03 +02:00, committed by GitHub
parent 2ec5436f9c
commit 5ec7b1a2af

@@ -1,4 +1,4 @@
-# Paged Attention
+# PagedAttention
 LLMs struggle with memory limitations during generation. In the decoding phase of generation, the keys and values generated for all input tokens are stored in GPU memory; this store is referred to as the _KV cache_. The KV cache takes up a large amount of memory, which causes inefficiencies in LLM serving.
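
To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of KV cache size. It assumes a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16 values); these numbers are illustrative assumptions, not taken from this document.

```python
# Rough KV cache size estimate. The model shape below is an assumed
# Llama-2-7B-like configuration, used purely for illustration.
num_layers = 32      # transformer layers
num_kv_heads = 32    # key/value heads per layer
head_dim = 128       # dimension per head
bytes_per_elem = 2   # fp16

# Both keys and values are cached, hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

seq_len = 2048
batch_size = 8
cache_bytes = bytes_per_token * seq_len * batch_size

print(f"{bytes_per_token / 2**20:.2f} MiB per token")        # ~0.50 MiB
print(f"{cache_bytes / 2**30:.2f} GiB for the whole batch")  # ~8.00 GiB
```

At roughly half a mebibyte per token under these assumptions, one batch of long sequences can claim several gibibytes of GPU memory on top of the model weights, which is the kind of serving inefficiency PagedAttention is designed to reduce.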