text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-09 15:05:24 +00:00

Daniël de Kok fe41e13b45 Unify attention output handling		2024-08-01 13:41:34 +00:00
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention` functions.
This removes the difference in how the output is handled between attention (output parameter) and paged attention (return value). It also removes the assumption that the attention implementation can write to an output tensor (in preparation for FlashInfer).
adapters	feat: add ruff and resolve issue (#2262)	2024-07-26 10:29:09 -04:00
layers	Unify attention output handling	2024-08-01 13:41:34 +00:00
models	Unify attention output handling	2024-08-01 13:41:34 +00:00
pb	chore: add pre-commit (#1569)	2024-02-16 11:58:58 +01:00
utils	Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300)	2024-07-31 13:08:41 +02:00
__init__.py	feat(clients): Python client (#103)	2023-03-07 18:52:22 +01:00
cache.py	fix(server): decrease memory fragmentation (#557)	2023-07-06 14:28:33 +02:00
cli.py	feat: add ruff and resolve issue (#2262)	2024-07-26 10:29:09 -04:00
interceptor.py	v2.0.0 (#1736)	2024-04-12 18:38:34 +02:00
server.py	Pr 2290 ci run (#2329)	2024-07-31 10:27:15 -04:00
tracing.py	Add OTLP Service Name Environment Variable (#2076)	2024-06-25 09:33:01 +02:00