Commit Graph

12 Commits

Author SHA1 Message Date
Nicolas Patry
e30fb25444
Fixing the default for vlm. 2024-08-27 20:06:11 +02:00
Nicolas Patry
27b566baa8
Downgrade some logs. 2024-08-27 20:06:11 +02:00
Nicolas Patry
26e5037de4
This seems to be working. 2024-08-27 20:06:10 +02:00
Nicolas Patry
f5182c188c
Is this enough to make it work? 2024-08-27 20:06:10 +02:00
Nicolas Patry
1568e82548
Override the env in server tests. 2024-08-27 20:06:10 +02:00
Nicolas Patry
682db34b6a
Handling debugger. 2024-08-27 20:06:10 +02:00
Nicolas Patry
32f6416358
Upgrade the resolution system to produce fewer resolution errors. 2024-08-27 20:06:10 +02:00
Nicolas Patry
5eb6ea0063
Tmp 2024-08-27 20:06:09 +02:00
Nicolas Patry
f55278de2d
Allowing window_left_size (dummy version). 2024-08-27 20:05:29 +02:00
Nicolas Patry
b70ae0969f
Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-08-20 11:15:30 +02:00
Nicolas Patry
136bcc8128
Keeping the benchmark somewhere (#2401)
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-08-12 15:22:02 +02:00
Daniël de Kok
8deeaca4ff
Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router
switches to `RadixAllocator`. This allocator uses a radix trie to keep
track of previously seen prefills. If a new prefill is a prefix of a
previously seen prefill, the router will send a request with
`prefix_len>0`, which the backend can use to decide to reuse KV blocks
from the cache rather than recomputing them.

Even though backend support is not added in this PR, the backend will
still work with prefix caching enabled; the prefix lengths are simply
ignored.
2024-08-12 14:59:17 +02:00
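
To make the idea in that last commit message concrete, here is a minimal sketch of the kind of prefix bookkeeping the router performs. This is not the actual `RadixAllocator` (which also manages KV-block ownership, reference counts, and eviction); the `PrefixIndex`, `TrieNode`, `insert`, and `longest_prefix` names are hypothetical, and the sketch assumes prefills are sequences of token IDs:

```rust
use std::collections::HashMap;

/// Toy trie node keyed by token ID.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
}

/// Toy prefix index: remembers past prefills and reports how long a
/// prefix of a new prefill has already been seen.
#[derive(Default)]
struct PrefixIndex {
    root: TrieNode,
}

impl PrefixIndex {
    /// Record the tokens of a prefill so later requests can match against it.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    /// Length of the longest previously seen prefix of `tokens`.
    /// A non-zero result corresponds to the `prefix_len>0` the router
    /// would send, telling the backend how much KV it may reuse.
    fn longest_prefix(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut len = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(child) => {
                    node = child;
                    len += 1;
                }
                None => break,
            }
        }
        len
    }
}

fn main() {
    let mut index = PrefixIndex::default();

    // First prefill: nothing cached yet, so prefix_len == 0.
    let first = [1u32, 2, 3, 4, 5];
    assert_eq!(index.longest_prefix(&first), 0);
    index.insert(&first);

    // Second prefill shares its first three tokens with the first one,
    // so prefix_len == 3 and the backend could skip recomputing that KV.
    let second = [1u32, 2, 3, 9, 9];
    assert_eq!(index.longest_prefix(&second), 3);
    println!("prefix_len = {}", index.longest_prefix(&second));
}
```

As the commit message notes, the backend is free to ignore this value; a backend without prefix-caching support simply recomputes the full prefill.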