text-generation-inference/server/text_generation_server
Daniël de Kok 48fec7b198 Unify attention output handling (#2343)
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention`
  functions.

This removes the difference in how the output is handled between
attention (output parameter) and paged attention (return value). It
also removes the assumption that the attention implementation can
write to an output tensor (in preparation for FlashInfer).
2024-09-25 05:55:39 +00:00
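The change described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual text-generation-server API: function names, shapes, and the NumPy stand-in for the attention kernel are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


# Before: the caller pre-allocates `out` and the kernel writes into it.
# This assumes the backend can write to a caller-provided buffer.
def attention_old(q, k, v, out):
    out[:] = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


# After: the output tensor is created inside the function and returned,
# matching how paged attention already returned its result. Backends that
# cannot write into an external buffer (e.g. FlashInfer) then fit the
# same interface.
def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
```

With both `attention` and `paged_attention` returning their output, callers no longer need to branch on which implementation they are talking to.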
adapters feat: add ruff and resolve issue (#2262) 2024-09-25 05:46:24 +00:00
layers Unify attention output handling (#2343) 2024-09-25 05:55:39 +00:00
models Unify attention output handling (#2343) 2024-09-25 05:55:39 +00:00
pb chore: add pre-commit (#1569) 2024-04-24 15:32:02 +03:00
utils Handle GPTQ-Marlin loading in GPTQMarlinWeightLoader (#2300) 2024-09-25 05:55:39 +00:00
__init__.py feat(clients): Python client (#103) 2023-03-07 18:52:22 +01:00
cache.py fix(server): decrease memory fragmentation (#557) 2023-07-06 14:28:33 +02:00
cli.py feat: add ruff and resolve issue (#2262) 2024-09-25 05:46:24 +00:00
interceptor.py Align the source code with main branch 2.0.4 2024-09-24 03:06:55 +00:00
server.py Pr 2290 ci run (#2329) 2024-09-25 05:55:39 +00:00
tracing.py Add OTLP Service Name Environment Variable (#2076) 2024-09-24 03:51:26 +00:00