Removed internal implementation details and clarified
This commit is contained in:
parent 98afdbbc1d
commit 2ec5436f9c
@@ -1,9 +1,9 @@
 # Paged Attention
 
-LLMs struggle with memory limitations during generation. In the decoding part of generation, all input tokens generated keys and values are stored in GPU memory, also referred as _KV cache_. KV cache is exhaustive for memory which causes inefficiencies in LLM serving.
+LLMs struggle with memory limitations during generation. In the decoding part of generation, all input tokens generated keys and values are stored in GPU memory, also referred to as _KV cache_. KV cache is exhaustive for memory, which causes inefficiencies in LLM serving.
 
 PagedAttention addresses the memory waste by partitioning the KV cache into blocks, allowing keys and values to be stored in non-contiguous memory. This approach improves GPU utilization and throughput.
 
-PagedAttention also enables memory sharing, useful for parallel sampling. PagedAttention keeps track of shared memory through a block table and implements the Copy-on-Write mechanism to ensure safe sharing.
+PagedAttention keeps a block table for memory sharing. This enables e.g. parallel sampling, where for a given prompt, multiple outputs are generated, and the computation and memory are shared between the outputs.
 
 You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).
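To make the block table and Copy-on-Write sharing described above concrete, here is a minimal Python sketch of the idea. It is purely illustrative and not TGI's or vLLM's actual code; `PagedKVCache`, `Block`, `fork`, and `BLOCK_SIZE` are invented names. The KV cache is split into fixed-size blocks, each sequence maps its tokens to blocks through a per-sequence block table, and a block shared between sampled outputs is only copied when one of them writes to it.

```python
# Illustrative toy model only (not TGI's or vLLM's implementation).
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens whose K/V entries fit in one block (hypothetical value)


@dataclass
class Block:
    id: int
    ref_count: int = 1                              # sequences referencing this block
    tokens: list = field(default_factory=list)      # stands in for the K/V tensors


class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = [Block(i) for i in range(num_blocks)]   # pool of free blocks
        self.block_tables = {}                               # seq_id -> list[Block]

    def _allocate(self) -> Block:
        block = self.free.pop()                  # raises IndexError when out of blocks
        block.ref_count, block.tokens = 1, []
        return block

    def append(self, seq_id: int, token_kv) -> None:
        """Store one token's K/V entry for a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        if not table or len(table[-1].tokens) == BLOCK_SIZE:
            table.append(self._allocate())       # blocks need not be contiguous
        last = table[-1]
        if last.ref_count > 1:                   # block is shared with another output:
            last.ref_count -= 1                  # copy-on-write before mutating it
            private = self._allocate()
            private.tokens = list(last.tokens)
            table[-1] = private
            last = private
        last.tokens.append(token_kv)

    def fork(self, parent_id: int, child_id: int) -> None:
        """Let a new output share the parent's blocks (parallel sampling)."""
        parent_table = self.block_tables[parent_id]
        for block in parent_table:
            block.ref_count += 1
        self.block_tables[child_id] = list(parent_table)     # shared until written
```

In this sketch, parallel sampling would `fork` the prompt's sequence once per sample; the continuations share the prompt's blocks and only allocate (or copy) blocks of their own once they diverge.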