Commit Graph

1138 Commits

Author SHA1 Message Date
Nicolas Patry
0110b83aff Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-25 06:17:09 +00:00
Daniël de Kok
e8c329372b Add tests for Mixtral (#2520)
Disable by default because CI runners do not have enough GPUs.
2024-09-25 06:16:08 +00:00
Alex Strick van Linschoten
afe5cae8fc Use ratatui not (deprecated) tui (#2521)
* use ratatui not archived tui

* bump ratatui all the way with options
2024-09-25 06:16:07 +00:00
Wang, Yi
cbfe9e5185 hotfix : enable intel ipex cpu and xpu in python3.11 (#2517)
enable intel ipex cpu and xpu in python3.11

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:15:35 +00:00
drbh
5fc0e0c589 fix: pass missing revision arg for lora adapter when loading multiple… (#2510)
fix: pass missing revision arg for lora adapter when loading multiple adapters
2024-09-25 06:15:35 +00:00
Nicolas Patry
7d897188d5 Add nix test. (#2513)
* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try thuis.

* Our runner + pure test (not written)

* Reemove server.

* Root user.

* Different user ?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token ?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.
2024-09-25 06:15:35 +00:00
Daniël de Kok
7be7ab7015 nix: support Python tokenizer conversion in the router (#2515)
Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
2024-09-25 06:15:35 +00:00
Nicolas Patry
f32fa568b6 Fix truffle (#2514)
* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.
2024-09-25 06:15:35 +00:00
Nicolas Patry
c6b568b892 Fix tokenization yi (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
2024-09-25 06:15:35 +00:00
Nicolas Patry
510d1c76c8 Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-25 06:14:07 +00:00
Vallepu Vamsi Krishna
b67a0cd37b Add Directory Check to Prevent Redundant Cloning in Build Process (#2486)
Update Makefile-fbgemm

Added Directory check for FBGEMM repository cloning.
2024-09-25 06:14:07 +00:00
Nicolas Patry
eb54d956ef Fixing more correctly the invalid drop of the batch. (#2498) 2024-09-25 06:14:07 +00:00
Martin Iglesias Goyanes
7c2ed55b2e Add links to Adyen blogpost (#2500)
* Add links to Adyen blogpost

* Adding to toctree.

* Update external.md

* Update _toctree.yml

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 06:14:07 +00:00
Daniël de Kok
0198db125e hotfix: add syrupy to the right subproject (#2499) 2024-09-25 06:13:36 +00:00
Daniël de Kok
67f44cce0d radix trie: add assertions (#2491)
These should all be cheap assertions.

Also:

* Fixup some comments.
* Delete a `remove` that was done unnecessarily twice.
2024-09-25 06:13:36 +00:00
Daniël de Kok
8ba790a14e Fix incompatibility with latest syrupy and update in Poetry (#2497) 2024-09-25 06:13:36 +00:00
Daniël de Kok
1e14a94721 nix: add pyright/ruff for proper LSP in the impure devshell (#2496)
We need this to ensure that pyright/ruff are part of the same
interpreter/venv.
2024-09-25 06:13:36 +00:00
Wang, Yi
938a7f3c3a hotfix: fix regression of attention api change in intel platform (#2439)
fix regression caused by attention api change. ipex.varlen_attention does not support paged-cache
format kv input now.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:13:36 +00:00
Daniël de Kok
d8610a6219 Add two handy gitignores for Nix environments (#2484) 2024-09-25 06:13:36 +00:00
Nicolas Patry
556a87030b Adding links to Adyen blogpost. (#2492) 2024-09-25 06:13:36 +00:00
Daniël de Kok
c7b495f97d hotfix: avoid non-prefilled block use when using prefix caching (#2489)
The minimum batch size logic could cause prefix blocks to be
deallocated without prefill. The next allocation of the same
prefix would then use garbage blocks.
2024-09-25 06:13:11 +00:00
drbh
34a6399a50 feat: support lora revisions and qkv_proj weights (#2482)
* feat: support lora revisions and qkv_proj weights

* fix: add qkv_proj weights to weight test
2024-09-25 06:13:11 +00:00
drbh
be5cb0cf7f fix: enable chat requests in vertex endpoint (#2481)
* fix: enable chat requests in vertex endpoint

* feat: avoid unwrap and pre allocate future vec
2024-09-25 06:13:11 +00:00
Daniël de Kok
3e17cb7866 nix: add punica-kernels (#2477)
Enables LoRA support.
2024-09-25 06:13:11 +00:00
Daniël de Kok
07c70e7840 nix: improve impure devshell (#2478)
- Add some test dependencies.
- Install server in venv.
- Install Python client in venv.
2024-09-25 06:13:11 +00:00
Nicolas Patry
a313355d2b Tied embeddings in MLP speculator. (#2473)
* Tied embeddings in MLP speculator.

* Fixing the scale_weight when users decide to not use the speculation as
much as defined in the config.

* Adding scaling support + optimize some ops.
2024-09-25 06:13:11 +00:00
Wang, Yi
61b2f493a8 update doc with intel cpu part (#2420)
* update doc with intel cpu part

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release.

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 06:13:11 +00:00
drbh
990478b285 feat: add /v1/models endpoint (#2433)
* feat: add /v1/models endpoint

* feat: add /v1/models endpoint

* fix: remove unused type import

* fix: revert route typo

* fix: update docs with new endpoint

* fix: add to redocly ignore and lint
2024-09-25 06:13:11 +00:00
Nicolas Patry
4e1ca8d7bd Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and using FD everywhere.

* Update cargo lock ?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for less errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* OVerride the env in server tests.

* Is this enough to make it work ?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integrationt tests change (seem linked to head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-09-25 06:13:11 +00:00
Daniël de Kok
622c9c367a nix: build Torch against MKL and various other improvements (#2469)
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs cumpute capabilities with Torch (avoids
  compiling too mana capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
2024-09-25 06:11:21 +00:00
drbh
08834e0cfd fix: improve regex expression (#2468) 2024-09-25 06:11:21 +00:00
drbh
e80b2c21dc fix: bump minijinja version and add test for llama 3.1 tools (#2463)
* fix: support tojson and avoid message indexing issue in template

* fix: prefer minijinja native methods and prefer workspace level dependency

* fix: adjust comment typo
2024-09-25 06:11:21 +00:00
Nicolas Patry
6793b720ba Fixing CI. (#2462) 2024-09-25 06:11:21 +00:00
drbh
73ebbd05f8 Pr 2451 ci branch (#2454)
* fix[router]: Fix tools not passed in chat template

Signed-off-by: GitHub <noreply@github.com>

* feat: improve default tool serialization and lints

* feat: refactor tool logic to include notify_error in prompt and adjust typing

* fix: adjust non tool template apply

* fix: simplify tool grammar logic and improve schema

* feat: avoid skip tool test and avoid empty tool prompts

* fix: increase test client timeout for grammar compilation tests

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-09-25 06:10:59 +00:00
drbh
7aebb953e2 Fix: don't apply post layernorm in SiglipVisionTransformer (#2459)
* Fix: don't apply post layernorm in SiglipVisionTransformer

This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813).

This also makes Siglip consistent with the existing Clip implementation:

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

---------

Co-authored-by: Travis Addair <tgaddair@gmail.com>
2024-09-25 06:10:59 +00:00
Daniël de Kok
92ac02e4f2 nix: add default package (#2453)
The default package wraps the launcher and puts the server/router in the
path.

As a result, TGI can be started using something like:

```
nix run .# -- \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --port 8080
```
2024-09-25 06:10:59 +00:00
Daniël de Kok
b7d1adc3e9 nix: add awq-inference-engine as server dependency (#2442) 2024-09-25 06:10:59 +00:00
Nicolas Patry
6654c2d11b Adding eetq to flake. (#2438) 2024-09-25 06:10:59 +00:00
Daniël de Kok
a5af557359 nix: add text-generation-benchmark to pure devshell (#2431)
nix: add text-generation-benchmark to pure devshell
2024-09-25 06:10:59 +00:00
Daniël de Kok
516392d790 nix: add pure server to flake, add both pure and impure devshells (#2430)
* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

It is not wired up and we now have a pure server.

* nix: add ipdb to impure devshell
2024-09-25 06:10:59 +00:00
Nicolas Patry
635dde8af9 Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:10:59 +00:00
Daniël de Kok
ddba272a66 nix: update to CUDA 12.4 (#2429)
* Update to CUDA 12.4

* poetry2nix: follow tgi-nix nixpkgs
2024-09-25 06:10:59 +00:00
Nicolas Patry
cd208c5043 All integration tests back everywhere (too many failed CI). (#2428)
* All integration tests back everywhere (too many failed CI).

* Upgrade integration tests after 12.4

* Attempt to remove the specifed compute cap.

* Common arch list.

* Punica uses raw ASM which is not valid on 9.0 apparently.
2024-09-25 06:10:59 +00:00
Hugo Larcher
53fdbe617d doc: Add metrics documentation and add a 'Reference' section (#2230)
* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 06:10:13 +00:00
Nicolas Patry
11d25a4bd3 FIxing the CI. 2024-09-25 06:09:22 +00:00
Nicolas Patry
85df9fc2db Further fixes. (#2426)
* Further fixes.

* Update the conftest to allow NaN (first logprob).

* Fix the condition.
2024-09-25 06:09:22 +00:00
Vaibhav Srivastav
df0e650891 Improve the Consuming TGI + Streaming docs. (#2412)
* Improve the Consuming TGI docs.

* Fix erronous update to .

* add info about Open AI client.

* More updates.

* Apply suggestions from code review

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>

* Suggestions from Lucain.

* Update Gradio snippet.

* Up.

* Apply suggestions from code review

Co-authored-by: Lucain <lucainp@gmail.com>

* Update docs/source/basic_tutorials/consuming_tgi.md

Co-authored-by: Lucain <lucainp@gmail.com>

* Up.

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Up.

* Up.

* Doc review from Nico.

* Doc review from Nico. x2

* Last nit

---------

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
2024-09-25 06:08:38 +00:00
Daniël de Kok
20ed7b598e nix: try to reduce the number of Rust rebuilds (#2424)
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
2024-09-25 06:08:38 +00:00
Nicolas Patry
f0181ed2d7 Upgrading the tests to match the current workings. (#2423) 2024-09-25 06:08:38 +00:00
Nicolas Patry
df6ea89da9 Fixing exl2 and other quanize tests again. (#2419)
* Fixing exl2 and other quanize tests again.

* Mark exl2 as non release (so CI tests them, needs to be removed latet).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-09-25 06:08:38 +00:00