Commit Graph

1109 Commits

Author SHA1 Message Date
Daniël de Kok
622c9c367a nix: build Torch against MKL and various other improvements (#2469)
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids
  compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
2024-09-25 06:11:21 +00:00
drbh
08834e0cfd fix: improve regex expression (#2468) 2024-09-25 06:11:21 +00:00
drbh
e80b2c21dc fix: bump minijinja version and add test for llama 3.1 tools (#2463)
* fix: support tojson and avoid message indexing issue in template

* fix: prefer minijinja native methods and prefer workspace level dependency

* fix: adjust comment typo
2024-09-25 06:11:21 +00:00
Nicolas Patry
6793b720ba Fixing CI. (#2462) 2024-09-25 06:11:21 +00:00
drbh
73ebbd05f8 Pr 2451 ci branch (#2454)
* fix[router]: Fix tools not passed in chat template

Signed-off-by: GitHub <noreply@github.com>

* feat: improve default tool serialization and lints

* feat: refactor tool logic to include notify_error in prompt and adjust typing

* fix: adjust non tool template apply

* fix: simplify tool grammar logic and improve schema

* feat: avoid skip tool test and avoid empty tool prompts

* fix: increase test client timeout for grammar compilation tests

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-09-25 06:10:59 +00:00
drbh
7aebb953e2 Fix: don't apply post layernorm in SiglipVisionTransformer (#2459)
* Fix: don't apply post layernorm in SiglipVisionTransformer

This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813).

This also makes Siglip consistent with the existing Clip implementation:

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

---------

Co-authored-by: Travis Addair <tgaddair@gmail.com>
2024-09-25 06:10:59 +00:00
Daniël de Kok
92ac02e4f2 nix: add default package (#2453)
The default package wraps the launcher and puts the server/router in the
path.

As a result, TGI can be started using something like:

```
nix run .# -- \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --port 8080
```
2024-09-25 06:10:59 +00:00
Daniël de Kok
b7d1adc3e9 nix: add awq-inference-engine as server dependency (#2442) 2024-09-25 06:10:59 +00:00
Nicolas Patry
6654c2d11b Adding eetq to flake. (#2438) 2024-09-25 06:10:59 +00:00
Daniël de Kok
a5af557359 nix: add text-generation-benchmark to pure devshell (#2431)
nix: add text-generation-benchmark to pure devshell
2024-09-25 06:10:59 +00:00
Daniël de Kok
516392d790 nix: add pure server to flake, add both pure and impure devshells (#2430)
* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

It is not wired up and we now have a pure server.

* nix: add ipdb to impure devshell
2024-09-25 06:10:59 +00:00
Nicolas Patry
635dde8af9 Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:10:59 +00:00
Daniël de Kok
ddba272a66 nix: update to CUDA 12.4 (#2429)
* Update to CUDA 12.4

* poetry2nix: follow tgi-nix nixpkgs
2024-09-25 06:10:59 +00:00
Nicolas Patry
cd208c5043 All integration tests back everywhere (too many failed CI). (#2428)
* All integration tests back everywhere (too many failed CI).

* Upgrade integration tests after 12.4

* Attempt to remove the specified compute cap.

* Common arch list.

* Punica uses raw ASM which is not valid on 9.0 apparently.
2024-09-25 06:10:59 +00:00
Hugo Larcher
53fdbe617d doc: Add metrics documentation and add a 'Reference' section (#2230)
* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-25 06:10:13 +00:00
Nicolas Patry
11d25a4bd3 Fixing the CI. 2024-09-25 06:09:22 +00:00
Nicolas Patry
85df9fc2db Further fixes. (#2426)
* Further fixes.

* Update the conftest to allow NaN (first logprob).

* Fix the condition.
2024-09-25 06:09:22 +00:00
Vaibhav Srivastav
df0e650891 Improve the Consuming TGI + Streaming docs. (#2412)
* Improve the Consuming TGI docs.

* Fix erroneous update to .

* add info about Open AI client.

* More updates.

* Apply suggestions from code review

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>

* Suggestions from Lucain.

* Update Gradio snippet.

* Up.

* Apply suggestions from code review

Co-authored-by: Lucain <lucainp@gmail.com>

* Update docs/source/basic_tutorials/consuming_tgi.md

Co-authored-by: Lucain <lucainp@gmail.com>

* Up.

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Up.

* Up.

* Doc review from Nico.

* Doc review from Nico. x2

* Last nit

---------

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
2024-09-25 06:08:38 +00:00
Daniël de Kok
20ed7b598e nix: try to reduce the number of Rust rebuilds (#2424)
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
2024-09-25 06:08:38 +00:00
Nicolas Patry
f0181ed2d7 Upgrading the tests to match the current workings. (#2423) 2024-09-25 06:08:38 +00:00
Nicolas Patry
df6ea89da9 Fixing exl2 and other quantize tests again. (#2419)
* Fixing exl2 and other quantize tests again.

* Mark exl2 as non release (so CI tests them, needs to be removed later).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-09-25 06:08:38 +00:00
Daniël de Kok
e5c39a5545 nix: build router incrementally (#2422) 2024-09-25 06:08:00 +00:00
Funtowicz Morgan
c3401e0b99 More fixes trtllm (#2342)
* (backend) use parking_lot crate for RwLock fairness

* (docker) let's put rust in the TRTLLM folder when building

* (docker) build ompi with SLURM support

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?
2024-09-25 06:08:00 +00:00
Nicolas Patry
4baa6ff59f Upgrading exl2. (#2415)
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
2024-09-25 06:07:40 +00:00
Daniël de Kok
bae161ab84 nix: partial incremental build of the router (#2416)
This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.
2024-09-25 06:06:17 +00:00
drbh
ffc8fb0850 fix: adds causal to attention params (#2408)
fix: adds causal to attention params to check when using flash attn v1
2024-09-25 06:06:17 +00:00
Wang, Yi
7a4d831d17 add numa to improve cpu inference perf (#2330)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:06:17 +00:00
Nicolas Patry
c5e4c1877b Adding more kernels to flake. (#2411) 2024-09-25 06:06:17 +00:00
Daniël de Kok
eb561bb715 nix: incremental build of the launcher (#2410) 2024-09-25 06:06:17 +00:00
drbh
10b2be6536 fix: include create_exllama_buffers and set_device for exllama (#2407) 2024-09-25 06:06:17 +00:00
drbh
1f8c0f83e3 Pr 2395 ci run (#2406)
* fix(router): Fix appending to message content

* feat: add message and chat template test

---------

Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-09-25 06:06:17 +00:00
Nicolas Patry
18d6be6af4 Updating the flake. (#2404) 2024-09-25 06:06:17 +00:00
drbh
96e8fa37b0 fix: improve completions to send a final chunk with usage details (#2336)
* fix: improve completions to send a final chunk with usage details

* fix: include finish reason string

* fix: remove dev debug trait and unneeded mut

* fix: update openapi schema
2024-09-25 06:06:17 +00:00
drbh
3079865b60 fix: allocate tmp based on sgmv kernel if available (#2345)
* fix: allocate tmp based on sgmv kernel if available

* fix: re add copy build artifacts step for punica kernels
2024-09-25 06:06:17 +00:00
drbh
8e6bfa2fc5 feat: validate template variables before apply and improve sliding wi… (#2403)
* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test
2024-09-25 06:05:43 +00:00
Nicolas Patry
6393cdee63 Keeping the benchmark somewhere (#2401)
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:05:43 +00:00
Daniël de Kok
f586cc7f0c Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
in this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill is a prefix of a previously-seen
prefill, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
2024-09-25 06:05:08 +00:00
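
The prefix lookup described in the commit above (f586cc7f0c) can be illustrated with a minimal sketch. This is not the router's code: it uses a plain, uncompressed trie over token ids rather than a real radix trie, and it does no block allocation or eviction. The names `PrefixTrie`, `longest_prefix_len`, and `prefix_len_for_request` are hypothetical. It also reports partially shared prefixes, which is an assumption that goes beyond what the commit message states.

```python
from typing import Dict, List


class _Node:
    """One trie node; children are keyed by token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class PrefixTrie:
    """Hypothetical stand-in for a radix-trie-backed allocator.

    Assumption: a real radix trie compresses runs of tokens into
    single edges and tracks KV blocks; this sketch does neither.
    """

    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, tokens: List[int]) -> None:
        """Record a prefill so that later requests can match against it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def longest_prefix_len(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already seen."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


def prefix_len_for_request(cache: PrefixTrie, tokens: List[int]) -> int:
    """Compute a prefix length for a new prefill, then remember the prefill."""
    prefix_len = cache.longest_prefix_len(tokens)
    cache.insert(tokens)
    return prefix_len
```

In this sketch, a first request always gets a prefix length of 0, and a later request whose tokens begin with an earlier prefill gets a positive value. A backend could use that value to reuse cached blocks, or ignore it as the commit message describes.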
Wang, Yi
b8efd6d00c Cpu dockerimage (#2367)
add intel-cpu docker image

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:05:08 +00:00
Nicolas Patry
1daaddd072 Fixing import exl2 (#2399) 2024-09-25 06:04:51 +00:00
Nicolas Patry
fbe59c6267 Adding launcher to build. (#2397) 2024-09-25 06:04:51 +00:00
Nicolas Patry
8750dc878e Upgrade fbgemm (#2398)
* Upgrade fbgemm

* Fix fbgemm version
2024-09-25 06:04:51 +00:00
Daniël de Kok
197dd3af12 nix: add router to the devshell (#2396) 2024-09-25 06:04:51 +00:00
Daniël de Kok
bb833389e0 Update flake for 9.0a capability in Torch (#2394) 2024-09-25 06:04:51 +00:00
drbh
959add5e9b feat: add guideline to chat request and template (#2391)
* feat: add guideline to chat request and template

* fix: add template test and update docs
2024-09-25 06:04:51 +00:00
Nicolas Patry
849bd93dc3 Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385)
* Using an enum for flash backends (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.
2024-09-25 06:04:51 +00:00
Daniël de Kok
df719fd527 flake: use rust-overlay (#2390) 2024-09-25 06:04:51 +00:00
Vaibhav Srivastav
1d4a35a23c Update documentation for Supported models (#2386)
* Minor doc fixes

* up.

* Other minor updates.
2024-09-25 06:04:51 +00:00
Daniël de Kok
e9ba044250 flake: add fmt and clippy (#2389) 2024-09-25 06:03:56 +00:00
Nicolas Patry
afa14b7595 Using HF_HOME instead of CACHE to get token read in addition to models. (#2288) 2024-09-25 06:03:56 +00:00
Daniël de Kok
dc0fa60f55 Add experimental flake (#2384)
Add flake.nix
2024-09-25 06:01:59 +00:00