text-generation-inference/router/src

Latest commit: 52e48739a5 by Daniël de Kok (2024-11-17 17:34:50 +01:00)

Remove vLLM dependency for CUDA (#2751)
* Remove vLLM dependency for CUDA

This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.

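For context, here is a minimal sketch of what a CUDA cache-write call site can look like once the kernels come from `attention-kernels` rather than vLLM. The module name `attention_kernels`, the `reshape_and_cache` signature, and the `store_kv` wrapper are assumptions for illustration, not the package's verified API.

```python
# Minimal sketch (assumptions: module name `attention_kernels` and a
# vLLM-style `reshape_and_cache(key, value, key_cache, value_cache,
# slots, kv_cache_dtype)` kernel; check the package for the real API).
import torch

try:
    import attention_kernels  # CUDA paged-attention / KV-cache kernels
except ImportError:
    attention_kernels = None


def store_kv(
    key: torch.Tensor,          # [num_tokens, num_kv_heads, head_size]
    value: torch.Tensor,        # [num_tokens, num_kv_heads, head_size]
    key_cache: torch.Tensor,    # paged key cache
    value_cache: torch.Tensor,  # paged value cache
    slots: torch.Tensor,        # flat cache slot index per token
) -> None:
    """Write new key/value vectors into the paged KV cache."""
    if attention_kernels is None:
        raise ImportError(
            "attention-kernels is required for paged attention on CUDA"
        )
    attention_kernels.reshape_and_cache(
        key, value, key_cache, value_cache, slots, "auto"
    )
```
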
Test run (since paged attention is not covered in CI):

```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```

* Fix clippy warning
| Name | Latest commit | Last updated |
|------|---------------|--------------|
| infer | feat: return streaming errors as an event formatted for openai's client (#2668) | 2024-11-15 14:49:19 +01:00 |
| config.rs | Support qwen2 vl (#2689) | 2024-10-30 12:40:51 -04:00 |
| kserve.rs | fix: simplify kserve endpoint and fix imports (#2119) | 2024-06-25 19:30:10 -04:00 |
| lib.rs | Remove vLLM dependency for CUDA (#2751) | 2024-11-17 17:34:50 +01:00 |
| logging.rs | Rebase TRT-llm (#2331) | 2024-07-31 10:33:10 +02:00 |
| sagemaker.rs | feat: allow any supported payload on /invocations (#2683) | 2024-10-23 11:26:01 +00:00 |
| server.rs | feat: return streaming errors as an event formatted for openai's client (#2668) | 2024-11-15 14:49:19 +01:00 |
| usage_stats.rs | feat: allow any supported payload on /invocations (#2683) | 2024-10-23 11:26:01 +00:00 |
| validation.rs | add trust_remote_code in tokenizer to fix baichuan issue (#2725) | 2024-11-07 14:43:38 +01:00 |
| vertex.rs | Rollback to ChatRequest for Vertex AI Chat instead of VertexChat (#2651) | 2024-10-15 18:11:59 +02:00 |