Mirror of https://github.com/huggingface/text-generation-inference.git (synced 2025-09-10 20:04:52 +00:00)
Update docs/source/conceptual/paged_attention.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
parent 5d27a467eb · commit 9973f4041c
@@ -6,4 +6,4 @@ PagedAttention addresses the memory waste by partitioning the KV cache into bloc
 
 The use of a lookup table to access the memory blocks can also help with KV sharing across multiple generations. This is helpful for techniques such as _parallel sampling_, where multiple outputs are generated simultaneously for the same prompt. In this case, the cached KV blocks can be shared among the generations.
 
-You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).
+TGI's PagedAttention implementation leverages the custom cuda kernels developed by the [vLLM Project](https://github.com/vllm-project/vllm). You can learn more about this technique in the [project's page](https://vllm.ai/).
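The context lines above describe the core mechanism: a per-sequence lookup table maps logical token positions to physical KV-cache blocks, and parallel samples for the same prompt can share the prompt's blocks rather than copying them. A minimal illustrative sketch of that idea (class and variable names are assumptions for exposition, not TGI or vLLM internals):

```python
# Illustrative sketch of a PagedAttention-style block table.
# BLOCK_SIZE and all names are assumptions, not TGI/vLLM internals.
BLOCK_SIZE = 4  # tokens stored per physical KV-cache block

class BlockTable:
    """Maps a sequence's logical token positions to physical block ids."""

    def __init__(self, blocks=None):
        self.blocks = list(blocks) if blocks else []

    def physical_location(self, token_idx):
        # Lookup: which physical block, and which slot within it,
        # holds this token's key/value entry.
        return self.blocks[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def fork(self):
        # Parallel sampling: a new generation for the same prompt starts
        # by referencing the prompt's physical blocks instead of copying
        # the KV cache.
        return BlockTable(self.blocks)

# An 8-token prompt occupies physical blocks 3 and 7 (non-contiguous).
prompt = BlockTable([3, 7])
sample_a = prompt.fork()
sample_b = prompt.fork()
sample_a.blocks.append(12)  # each sample appends its own blocks as it decodes
sample_b.blocks.append(19)
print(sample_a.physical_location(9))               # (12, 1)
print(sample_b.physical_location(9))               # (19, 1)
print(sample_a.blocks[:2] == sample_b.blocks[:2])  # True: prompt blocks shared
```

Because the prompt's blocks appear in both tables, only the newly generated tokens of each sample consume fresh memory.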