Update docs/source/conceptual/paged_attention.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Merve Noyan 2023-09-07 14:46:39 +02:00 committed by GitHub
parent 5d27a467eb
commit 9973f4041c


@@ -6,4 +6,4 @@ PagedAttention addresses the memory waste by partitioning the KV cache into blocks
The use of a lookup table to access the memory blocks can also help with KV sharing across multiple generations. This is helpful for techniques such as _parallel sampling_, where multiple outputs are generated simultaneously for the same prompt. In this case, the cached KV blocks can be shared among the generations.
-You can learn more about PagedAttention by reading the documentation [here](https://vllm.ai/).
+TGI's PagedAttention implementation leverages the custom CUDA kernels developed by the [vLLM Project](https://github.com/vllm-project/vllm). You can learn more about this technique on the [project's page](https://vllm.ai/).
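
To make the lookup-table idea described above concrete, here is a minimal, hypothetical Python sketch (not TGI's or vLLM's actual code): each sequence keeps a block table mapping its logical KV-cache blocks to physical blocks, and forking a sequence for parallel sampling shares the prompt's physical blocks through reference counting instead of copying them. The names `BlockAllocator`, `fork_for_parallel_sampling`, and `BLOCK_SIZE` are illustrative assumptions, not identifiers from either project.

```python
# Toy sketch of PagedAttention-style block tables (illustrative only).
BLOCK_SIZE = 16  # hypothetical number of tokens stored per KV-cache block


class BlockAllocator:
    """Hands out physical block ids and reference-counts them so blocks can be shared."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Another sequence starts pointing at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


def fork_for_parallel_sampling(block_table: list[int], allocator: BlockAllocator) -> list[int]:
    """A new sample reuses (shares) the prompt's physical blocks instead of copying them."""
    return [allocator.share(block) for block in block_table]


# Usage: a 40-token prompt occupies 3 logical blocks; two parallel samples share them.
allocator = BlockAllocator(num_blocks=1024)
prompt_blocks = [allocator.allocate() for _ in range((40 + BLOCK_SIZE - 1) // BLOCK_SIZE)]
sample_a = fork_for_parallel_sampling(prompt_blocks, allocator)
sample_b = fork_for_parallel_sampling(prompt_blocks, allocator)
print(prompt_blocks, sample_a, sample_b)  # all three tables point at the same physical blocks
```

In a real serving engine, releasing a block would also free the GPU memory backing it; here the reference count alone stands in for that bookkeeping.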