text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-04-24 16:32:12 +00:00

Daniël de Kok 5b6b74e21d Improve support for GPUs with capability < 8 (#2575)	2024-09-27 16:19:42 +02:00
* Improve support for GPUs with capability < 8
  - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8.
  - Disable prefix caching when using paged attention.
  - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
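
The selection rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function `choose_attention`, the `AttentionChoice` type, and all string labels are hypothetical names invented for this sketch. Only the case the message describes (compute capability below 8, model cannot use flashinfer) is grounded in the text; the fallback branch is a placeholder assumption.

```python
from typing import NamedTuple


class AttentionChoice(NamedTuple):
    """Hypothetical summary of an attention setup (not a TGI type)."""
    kernel: str            # label for the attention kernel in use
    paged_attention: bool  # whether paged attention is used
    prefix_caching: bool   # whether prefix caching stays enabled
    kernel_input: str      # what is handed to the kernel: "key/value" or "cache"


def choose_attention(capability_major: int,
                     can_use_flashinfer: bool,
                     prefix_caching_requested: bool = True) -> AttentionChoice:
    """Sketch of the rule in the commit message (all names are assumptions)."""
    if capability_major < 8 and not can_use_flashinfer:
        # Described case: flash-attn v1 + paged attention.
        # Prefix caching is disabled when paged attention is used, and
        # key/value tensors are passed instead of the cache because
        # flash-attn v1 cannot use block tables.
        return AttentionChoice(
            kernel="flash-attn-v1",
            paged_attention=True,
            prefix_caching=False,
            kernel_input="key/value",
        )
    # Assumption: the commit message does not describe the other cases,
    # so this branch is only a placeholder that leaves the request unchanged.
    return AttentionChoice(
        kernel="unspecified",
        paged_attention=False,
        prefix_caching=prefix_caching_requested,
        kernel_input="cache",
    )


if __name__ == "__main__":
    # Example: a GPU with compute capability 7.x and a model without
    # flashinfer support.
    print(choose_attention(7, can_use_flashinfer=False))
```

The sketch takes only the major compute capability as an integer; how the real server obtains or represents the capability is not shown here.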
adapters	feat: add ruff and resolve issue (#2262)	2024-07-26 10:29:09 -04:00
layers	Improve support for GPUs with capability < 8 (#2575)	2024-09-27 16:19:42 +02:00
models	Improve support for GPUs with capability < 8 (#2575)	2024-09-27 16:19:42 +02:00
pb	chore: add pre-commit (#1569)	2024-02-16 11:58:58 +01:00
utils	Micro cleanup. (#2555)	2024-09-24 11:19:24 +02:00
__init__.py	feat(clients): Python client (#103)	2023-03-07 18:52:22 +01:00
cache.py	fix(server): decrease memory fragmentation (#557)	2023-07-06 14:28:33 +02:00
cli.py	feat: add ruff and resolve issue (#2262)	2024-07-26 10:29:09 -04:00
interceptor.py	v2.0.0 (#1736)	2024-04-12 18:38:34 +02:00
server.py	Upgrading exl2. (#2415)	2024-08-14 11:58:08 +02:00
tracing.py	Add OTLP Service Name Environment Variable (#2076)	2024-06-25 09:33:01 +02:00