paged attention initial commit
This commit is contained in:
parent
c4422e5678
commit
3fe2836a54
9
docs/source/conceptual/paged_attention.md
Normal file
@@ -0,0 +1,9 @@
# Paged Attention
LLMs struggle with memory limitations during generation. In the decoding phase, the attention keys and values generated for all previous tokens are stored in GPU memory, also referred to as the _KV cache_. The KV cache grows with sequence length and batch size, and its memory footprint is a major source of inefficiency in LLM serving.
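To get a sense of the scale, here is a rough back-of-the-envelope estimate (not taken from TGI) for a hypothetical 32-layer model with 32 key/value heads of dimension 128, storing the cache in fp16:

```python
# Hypothetical KV cache size estimate; the model dimensions below are assumptions,
# roughly matching a 7B-parameter decoder-only model.
num_layers = 32          # transformer layers
num_kv_heads = 32        # key/value heads per layer
head_dim = 128           # dimension per head
bytes_per_value = 2      # fp16

def kv_cache_bytes(seq_len: int, batch_size: int) -> int:
    # 2x for keys and values, stored for every layer and every token
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len * batch_size

# A single 2048-token sequence already needs about 1 GiB of GPU memory:
print(kv_cache_bytes(seq_len=2048, batch_size=1) / 2**30, "GiB")
```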
PagedAttention addresses the memory waste by partitioning the KV cache into blocks, allowing keys and values to be stored in non-contiguous memory. This approach improves GPU utilization and throughput.
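A minimal sketch of the idea, assuming a hypothetical block size of 16 tokens and a simple free-block pool (illustrative only, not TGI's or vLLM's actual implementation):

```python
# Sketch: map a sequence's logical KV positions onto non-contiguous physical blocks.
BLOCK_SIZE = 16                        # tokens per KV block (assumed)

free_blocks = list(range(1000))        # pool of physical block ids
block_table: list[int] = []            # logical block index -> physical block id

def append_token(seq_len: int) -> tuple[int, int]:
    """Return (physical_block, offset) where the new token's KV is stored."""
    logical_block, offset = divmod(seq_len, BLOCK_SIZE)
    if logical_block == len(block_table):       # sequence needs a new block
        block_table.append(free_blocks.pop())   # any free block, not necessarily adjacent
    return block_table[logical_block], offset

for i in range(40):                    # 40 tokens span 3 blocks scattered in memory
    append_token(i)
print(block_table)
```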
PagedAttention also enables memory sharing across sequences, which is useful for parallel sampling, where multiple completions are generated from the same prompt. It keeps track of shared blocks through the block table and uses a copy-on-write mechanism to make sharing safe.
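A hedged sketch of how copy-on-write sharing might look, using a hypothetical per-block reference count (again illustrative, not the actual TGI/vLLM bookkeeping):

```python
from collections import defaultdict

ref_count: dict[int, int] = defaultdict(int)    # physical block id -> sequences using it

def fork(block_table: list[int]) -> list[int]:
    """A new sample of the same prompt reuses the prompt's physical blocks."""
    for block in block_table:
        ref_count[block] += 1
    return list(block_table)                     # copy of the table, same physical blocks

def write(block_table: list[int], logical_block: int, free_blocks: list[int]) -> int:
    """Copy a shared block before modifying it, so other sequences are unaffected."""
    block = block_table[logical_block]
    if ref_count[block] > 1:                     # block is shared: copy on write
        ref_count[block] -= 1
        new_block = free_blocks.pop()
        ref_count[new_block] += 1
        # ...copy the KV contents of `block` into `new_block` on the GPU...
        block_table[logical_block] = new_block
    return block_table[logical_block]

prompt_blocks = [7, 8]                           # blocks holding the shared prompt's KV
for b in prompt_blocks:
    ref_count[b] = 1                             # allocated by the first sample
sample_b = fork(prompt_blocks)                   # second sample shares blocks 7 and 8
write(sample_b, 1, free_blocks=[9])              # sample B's first write copies block 8 into 9
```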
You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).