Commit Graph

22 Commits

Author SHA1 Message Date
Daniël de Kok
c1a99e2f15 Update to moe-kernels 0.3.1 (#2535)
* Update to moe-kernels 0.3.1

* Attempt to fix apt failure
2024-09-25 06:19:20 +00:00
Nicolas Patry
2d470c8282 Stream options. (#2533)
* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow
2024-09-25 06:19:20 +00:00
Nicolas Patry
0110b83aff Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-25 06:17:09 +00:00
Nicolas Patry
f32fa568b6 Fix truffle (#2514)
* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.
2024-09-25 06:15:35 +00:00
Nicolas Patry
510d1c76c8 Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-25 06:14:07 +00:00
Daniël de Kok
3e17cb7866 nix: add punica-kernels (#2477)
Enables LoRA support.
2024-09-25 06:13:11 +00:00
Nicolas Patry
4e1ca8d7bd Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and using FD everywhere.

* Update cargo lock ?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for fewer errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work ?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-09-25 06:13:11 +00:00
Daniël de Kok
622c9c367a nix: build Torch against MKL and various other improvements (#2469)
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids
  compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
2024-09-25 06:11:21 +00:00
Daniël de Kok
b7d1adc3e9 nix: add awq-inference-engine as server dependency (#2442) 2024-09-25 06:10:59 +00:00
Nicolas Patry
6654c2d11b Adding eetq to flake. (#2438) 2024-09-25 06:10:59 +00:00
Daniël de Kok
516392d790 nix: add pure server to flake, add both pure and impure devshells (#2430)
* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

It is not wired up and we now have a pure server.

* nix: add ipdb to impure devshell
2024-09-25 06:10:59 +00:00
Nicolas Patry
635dde8af9 Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:10:59 +00:00
Daniël de Kok
ddba272a66 nix: update to CUDA 12.4 (#2429)
* Update to CUDA 12.4

* poetry2nix: follow tgi-nix nixpkgs
2024-09-25 06:10:59 +00:00
Daniël de Kok
20ed7b598e nix: try to reduce the number of Rust rebuilds (#2424)
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
2024-09-25 06:08:38 +00:00
Daniël de Kok
e5c39a5545 nix: build router incrementally (#2422) 2024-09-25 06:08:00 +00:00
Daniël de Kok
bae161ab84 nix: partial incremental build of the router (#2416)
This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.
2024-09-25 06:06:17 +00:00
Nicolas Patry
c5e4c1877b Adding more kernels to flake. (#2411) 2024-09-25 06:06:17 +00:00
Daniël de Kok
eb561bb715 nix: incremental build of the launcher (#2410) 2024-09-25 06:06:17 +00:00
Nicolas Patry
18d6be6af4 Updating the flake. (#2404) 2024-09-25 06:06:17 +00:00
Daniël de Kok
bb833389e0 Update flake for 9.0a capability in Torch (#2394) 2024-09-25 06:04:51 +00:00
Daniël de Kok
df719fd527 flake: use rust-overlay (#2390) 2024-09-25 06:04:51 +00:00
Daniël de Kok
dc0fa60f55 Add experimental flake (#2384)
Add flake.nix
2024-09-25 06:01:59 +00:00