Commit Graph

55 Commits

Author SHA1 Message Date
Morgan Funtowicz
188e4dc64f (misc: build for sm_{75,80,86,89,90} by default 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
544c9d9dba (fix): HOPPER_SM_MAJOR is 9 not 8 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
213acc6e34 (misc) move to latest trtllm 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
507ff66692 (misc) rerun-if-changed all the cmake modules 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
b242f45c04 (misc) delete backend.rs 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
984ae9798f (post) impl postprocessing 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
fa63db0d07 (scheduler) rework submit/pull logic 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
42ccf4e77c (misc) no need to move for uint32_t items 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
b41875c139 (misc) simplify [make_]move_iterator by using c++20 type inference 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
b1846fb4e6 (backend) refactor & cleanup 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
483f172938 (ffi) do not use reference capture in lambda as we are not capturing anything 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
3d0e90b631 (ffi) missing namespace for tle::Response 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
8e648ce425 (ffi) fix usage of wrong vector constructor making a capacity fill call 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
dddc9a44bd (build) fetchcontent use archives instead of git 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
089c5fe668 (server) forward auth_token to server::run 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
291eaa99fb use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
7bebc629af (misc) missing Result types for Rust 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
c2e21d8725 (backend) implement the post_processor background thread 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
0dca168bcb (misc) change scope identifiers 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
933ab67aa1 (ffi) encode the provided user prompt within each request thread 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
0b0c30fe8b (ffi) remove narrowing type warning 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
fb759bdd2a (looper) new looper initial implementation 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
5f7c0b67c3 (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException> 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
33c962ef41 (ffi) add missing headers imports 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
2883c042ed (ffi) cleanup again 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
f4a74be384 (backend) expose PullNewTokens 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
b8a40a0af3 (backend) cleanup a bit 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
38b5263c61 (ffi) add max_new_tokens parameters 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
f6f689f509 (build) setup ccache if available 2024-10-21 10:00:27 +02:00
Morgan Funtowicz
2a339f99dd (trt) 2024-10-21 10:00:25 +02:00
Morgan Funtowicz
0cd7538a48 (ffi) use const for GetSamplingConfig 2024-10-21 09:57:26 +02:00
Morgan Funtowicz
cea64e234f (chore) fmt ... why? 2024-10-21 09:57:26 +02:00
Morgan Funtowicz
a3f7d76f7b (launcher) default new server::run parameters to false for now 2024-10-21 09:57:24 +02:00
Morgan Funtowicz
25b20cba2a (backend) use parking_lot crate for RwLock fairness
# Conflicts:
#	backends/trtllm/src/backend.rs
2024-10-21 09:57:16 +02:00
OlivierDehaene
a6a0c97ed9
feat: prefill chunking (#2600)
* wip

* rollback

* refactor to use prefix/postfix namming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the times).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-16 12:49:33 +02:00
Nicolas Patry
0204946d26
Max token capacity metric (#2595)
* adding max_token_capacity_metric

* added tgi to name of metric

* Adding max capacity metric.

* Add description for the metrics

---------

Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
2024-10-02 16:32:36 +02:00
Nicolas Patry
0ff6ff60ad
Hotfixing main (#2556) 2024-09-24 11:51:14 +02:00
OlivierDehaene
10e6f29295
chore: Add old V2 backend (#2551)
* wip

* added v2
2024-09-24 08:38:17 +02:00
Nicolas Patry
38fcafcf96
Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-16 17:00:54 +02:00
Nicolas Patry
dae3bf1d87
Fix tokenization yi (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
2024-09-11 22:41:56 +02:00
Nicolas Patry
a4e3e8c608
Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-11 18:10:40 +02:00
Nicolas Patry
c1fe28d694
Fixing more correctly the invalid drop of the batch. (#2498) 2024-09-06 17:35:49 +02:00
Daniël de Kok
379472c4c2
radix trie: add assertions (#2491)
These should all be cheap assertions.

Also:

* Fixup some comments.
* Delete a `remove` that was done unnecessarily twice.
2024-09-06 11:55:23 +02:00
Daniël de Kok
deec30f893
hotfix: avoid non-prefilled block use when using prefix caching (#2489)
The minimum batch size logic could cause prefix blocks to be
deallocated without prefill. The next allocation of the same
prefix would then use garbage blocks.
2024-09-05 15:09:29 +02:00
Nicolas Patry
e415b690a6
Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and using FD everywhere.

* Update cargo lock ?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for less errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* OVerride the env in server tests.

* Is this enough to make it work ?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integrationt tests change (seem linked to head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-08-29 16:29:01 +02:00
drbh
21187c27c9
fix: bump minijinja version and add test for llama 3.1 tools (#2463)
* fix: support tojson and avoid message indexing issue in template

* fix: prefer minijinja native methods and prefer workspace level dependency

* fix: adjust comment typo
2024-08-27 13:31:08 -04:00
Nicolas Patry
b70ae0969f
Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-08-20 11:15:30 +02:00
Funtowicz Morgan
3f385991b0
More fixes trtllm (#2342)
* (backend) use parking_lot crate for RwLock fairness

* (docker) let's put rust in the TRTLLM folder when building

* (docker) build ompi with SLURM support

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?
2024-08-14 12:02:05 +02:00
Nicolas Patry
136bcc8128
Keeping the benchmark somewhere (#2401)
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-08-12 15:22:02 +02:00
Daniël de Kok
8deeaca4ff
Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`
in this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill is a prefix of a previously-seen
prefil, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
2024-08-12 14:59:17 +02:00