OlivierDehaene
|
8c3669b287
|
feat: auto max_new_tokens (#2803)
* feat: auto max_new_tokens
* update default
* Fixing the tests.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
|
2024-12-06 05:50:35 +01:00 |
|
OlivierDehaene
|
ab7ccf5bc3
|
feat: add payload limit (#2726)
* feat: add payload limit
* update launcher
|
2024-11-21 18:20:15 +00:00 |
|
Nicolas Patry
|
ed87b464b4
|
Fixing "deadlock" when python prompts for trust_remote_code by always (#2664)
specifying a value.
|
2024-10-25 06:39:21 +02:00 |
|
OlivierDehaene
|
41c2623735
|
feat: allow any supported payload on /invocations (#2683)
* feat: allow any supported payload on /invocations
* update openAPI
* update doc
|
2024-10-23 11:26:01 +00:00 |
|
OlivierDehaene
|
a6a0c97ed9
|
feat: prefill chunking (#2600)
* wip
* rollback
* refactor to use prefix/postfix naming + fix all_input_ids_tensor
* maybe patching vlms?
* fix filter and concat
* wip, no filter, no concat
* current
* add prepare_for_prefill
* working
* load tested
* re-create slots
* re-create slots
* fix slot_filtering_indices
* feedback loop
* remove log
* fix benchmarker
* fix vlm and seq2seq
* rename to cache and input lengths
* fix prefill logprobs
* fix launcher
* fix logprobs?
* idk at this point
* max input length
* omfg
* remove debugging lines
* fix tests
* fix mllama
* fix cargo tests
* remove support chunking for paged
* Fixing non blocked attentions
* Fixing dtype + AMD, Ipex targets.
* lint fix.
* rename
* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the times).
* Add simple resolution when user specifies ATTENTION=paged.
* Put back non default simple tests.
* Fix env name
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
|
2024-10-16 12:49:33 +02:00 |
|
Nicolas Patry
|
0ff6ff60ad
|
Hotfixing main (#2556)
|
2024-09-24 11:51:14 +02:00 |
|
OlivierDehaene
|
10e6f29295
|
chore: Add old V2 backend (#2551)
* wip
* added v2
|
2024-09-24 08:38:17 +02:00 |
|