text-generation-inference/router/src

Latest commit: 52e48739a5 by Daniël de Kok (2024-11-17 17:34:50 +01:00)

Remove vLLM dependency for CUDA (#2751)
* Remove vLLM dependency for CUDA

This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.

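For context, here is a minimal sketch of what a CUDA cache-write call site can look like once the kernels come from `attention-kernels` rather than vLLM. The module name `attention_kernels`, the `reshape_and_cache` signature, and the `store_kv` wrapper are assumptions for illustration, not the package's verified API.

```python
# Minimal sketch (assumptions: module name `attention_kernels` and a
# vLLM-style `reshape_and_cache(key, value, key_cache, value_cache,
# slots, kv_cache_dtype)` kernel; check the package for the real API).
import torch

try:
    import attention_kernels  # CUDA paged-attention / KV-cache kernels
except ImportError:
    attention_kernels = None


def store_kv(
    key: torch.Tensor,          # [num_tokens, num_kv_heads, head_size]
    value: torch.Tensor,        # [num_tokens, num_kv_heads, head_size]
    key_cache: torch.Tensor,    # paged key cache
    value_cache: torch.Tensor,  # paged value cache
    slots: torch.Tensor,        # flat cache slot index per token
) -> None:
    """Write new key/value vectors into the paged KV cache."""
    if attention_kernels is None:
        raise ImportError(
            "attention-kernels is required for paged attention on CUDA"
        )
    attention_kernels.reshape_and_cache(
        key, value, key_cache, value_cache, slots, "auto"
    )
```
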
Test run (since paged attention is not covered in CI):

```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```

* Fix clippy warning
| Name | Latest commit | Last updated |
|------|---------------|--------------|
| infer | feat: return streaming errors as an event formatted for openai's client (#2668) | 2024-11-15 14:49:19 +01:00 |
| config.rs | Support qwen2 vl (#2689) | 2024-10-30 12:40:51 -04:00 |
| kserve.rs | fix: simplify kserve endpoint and fix imports (#2119) | 2024-06-25 19:30:10 -04:00 |
| lib.rs | Remove vLLM dependency for CUDA (#2751) | 2024-11-17 17:34:50 +01:00 |
| logging.rs | Rebase TRT-llm (#2331) | 2024-07-31 10:33:10 +02:00 |
| sagemaker.rs | feat: allow any supported payload on /invocations (#2683) | 2024-10-23 11:26:01 +00:00 |
| server.rs | feat: return streaming errors as an event formatted for openai's client (#2668) | 2024-11-15 14:49:19 +01:00 |
| usage_stats.rs | feat: allow any supported payload on /invocations (#2683) | 2024-10-23 11:26:01 +00:00 |
| validation.rs | add trust_remote_code in tokenizer to fix baichuan issue (#2725) | 2024-11-07 14:43:38 +01:00 |
| vertex.rs | Rollback to ChatRequest for Vertex AI Chat instead of VertexChat (#2651) | 2024-10-15 18:11:59 +02:00 |