Mirror of https://github.com/huggingface/text-generation-inference.git (synced 2025-09-11 04:14:52 +00:00)
Update docs/source/conceptual/paged_attention.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

parent 2ec5436f9c · commit 5ec7b1a2af
@@ -1,4 +1,4 @@
-# Paged Attention
+# PagedAttention
 
 LLMs struggle with memory limitations during generation. In the decoding phase of generation, the keys and values generated for all input tokens are kept in GPU memory; this store is referred to as the _KV cache_. The KV cache consumes a large amount of memory, which causes inefficiencies in LLM serving.
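To make the memory pressure concrete, the back-of-envelope sketch below estimates the KV cache size for a single sequence. The model dimensions (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions roughly in the range of a 7B-parameter model; they are not taken from the commit above.

```python
# Back-of-envelope KV cache size for one sequence (illustrative numbers only).

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # Each layer stores one key tensor and one value tensor per token,
    # hence the factor of 2. dtype_bytes=2 corresponds to fp16/bf16.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 1024**3:.2f} GiB per sequence")  # 2.00 GiB at fp16
```

At this scale, a handful of long concurrent sequences can exhaust GPU memory on top of the model weights, which is the inefficiency PagedAttention targets by allocating the cache in fixed-size blocks rather than one contiguous buffer per sequence.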