Commit Graph

1186 Commits

Author SHA1 Message Date
Nicolas Patry
6f0b8c947d
New arg. (#2845) 2024-12-16 10:34:50 +01:00
Hugo Larcher
1708865fdc
Feat/trtllm cancellation dev container (#2795)
Add devcontainers for TRTLLM backend.

---------

Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
2024-12-13 16:19:06 +01:00
Funtowicz Morgan
ea7f4082c4
TensorRT-LLM backend bump to latest version + misc fixes (#2791)
* misc(cmake) update dependencies

* feat(hardware) enable new hardware.hpp and unittests

* test(ctest) enable address sanitizer

* feat(backend): initial rewrite of the backend for simplicity

* feat(backend): remove all the logs from hardware.hpp

* feat(backend): added some logging

* feat(backend): enable compiler warning if support for RVO not applying

* feat(backend): missing return statement

* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder

* feat(backend): delete previous backend impl

* feat(backend): more impl

* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility

* feat(backend): allow overriding which Python to use

* feat(backend): fix backend_exception_t -> backend_error_t naming

* feat(backend): impl missing generation_step_t as return value of pull_tokens

* feat(backend): make backend_workspace_t::engines_folder constexpr

* feat(backend): fix main.rs retrieving the tokenizer

* feat(backend): add guard to multiple header definitions

* test(backend): add more unittest

* feat(backend): remove constexpr from par

* feat(backend): remove constexpig

* test(backend): more test coverage

* chore(trtllm): update dependency towards 0.15.0

* effectively cancel the request on the executor

* feat(backend) fix moving backend when pulling

* feat(backend): make sure we can easily cancel request on the executor

* feat(backend): fix missing "0" field access

* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut

* chore: Add doc and CI for TRTLLM (#2799)

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* doc: Formatting

* misc(backend): indent

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
2024-12-13 15:50:59 +01:00
Nicolas Patry
3bb3fd19ae
Fixup opt to reduce the amount of odd if statements. (#2833)
* Fixup opt to reduce the amount of odd if statements.

* Fixing cargo lock
2024-12-12 18:20:13 +01:00
Wang, Yi
bf59118a93
fix facebook/opt-125m not working issue (#2824)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-12-12 14:41:30 +01:00
Nicolas Patry
c3bd7212c2
Fixing latest flavor by disabling it. (#2831) 2024-12-12 14:09:35 +01:00
Guspan Tanadi
f01f2fb6e7
docs(README): supported hardware links TGI AMD GPUs (#2814) 2024-12-12 13:49:33 +01:00
Nicolas Patry
07b01293c5
Prepare patch release. (#2829) 2024-12-11 21:03:50 +01:00
RodriMora
cc66dccbe8
Update README.md (#2827)
Added instructions to clone the repo and change directory into it. 

In following steps there is a "make install" step that would fail if people have not cloned the repo and cd into it, so it may be confusing for some

Added python venv alternative to conda too.
2024-12-11 19:45:49 +01:00
Nicolas Patry
82c24f7420
Using both value from config as they might not be correct. (#2817)
* Using both value from config as they might not be correct.

* Fixing max_position_embeddings for falcon.

* Simple attempt to fix the healthcheck block allocation.

* Much simpler solution.

* Default value for Backend start_health
2024-12-10 19:37:09 +01:00
Nicolas Patry
a2d878fa0f
Small update to docs (#2816) 2024-12-10 10:46:26 +01:00
Nicolas Patry
b2fac5d947
Hotfix link2 (#2812)
2nd hotfix ?
2024-12-09 20:57:18 +01:00
Nicolas Patry
a70dd2998b
Hotfixing the link. (#2811) 2024-12-09 20:50:07 +01:00
Nicolas Patry
042791fbd5
Prep new version (#2810)
* New version.

* Link fixup.

* Update docs.

* FIxup.
2024-12-09 20:42:42 +01:00
Nicolas Patry
27fa83ca5b
V3 doc (#2809)
* V3 document.

* Updating asset.
2024-12-09 19:58:07 +01:00
Nicolas Patry
a04356fb8c
Attempt for cleverer auto batch_prefill values (some simplifications). (#2808)
* Attempt for cleverer auto batch_prefill values (some simplifications).

* Less flaky tests.

* Fixing typo insertion.

* Update launcher/src/main.rs

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Adding small comment for source of calculation.

* Adding L40.

* Adding L40s.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-12-09 19:44:32 +01:00
drbh
9f5c9a5e22
Enable paligemma2 (#2807)
* feat: support loading gemma2 as vlm text model

* feat: add test for paligemma2
2024-12-06 14:41:49 -05:00
Nicolas Patry
08f6fa0b59
Removing experimental to prefill chunking. 2024-12-06 19:09:40 +01:00
Nicolas Patry
d96dcb1797
Adding A100 compute. (#2806) 2024-12-06 18:19:15 +01:00
Nicolas Patry
5df8059037
Auto max prefill (#2797)
* Attempt at automatic max batch prefill.

* Taking into account number of shards.

* Adding more cards.

* Adding A100 + H100

* Adding a few more cards.

* Logprobs cost too much.

* h100 better name, and keep factor of 2

* Damn inflated sparse tflops.

* Typo in h100.

* Updated the flops calculation (checked with fvcore).

* chunking by default.

* Fix prefix caching for chat completion since we removed logprobs.

* More tests.

* Dropping all the prefill logprobs.

* Add a flag that enables users to get logprobs back.

* Repairing prompt token counting.

* Fixing a few tests.

* Remove some scaffolding.

* Attempting to reduces the issues (workarounds for now).
2024-12-06 05:52:00 +01:00
OlivierDehaene
8c3669b287
feat: auto max_new_tokens (#2803)
* feat: auto max_new_tokens

* update default

* Fixing the tests.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-06 05:50:35 +01:00
Wang, Yi
6685e8fcda
use oneapi 2024 docker image directly for xpu (#2793)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-12-06 09:36:23 +05:30
drbh
e0db633396
fix: avoid setting use_sgmv if no kernels present (#2796) 2024-12-04 15:26:09 -05:00
Nicolas Patry
b57f370386
Saving some VRAM. (#2790)
* Saving some VRAM.

- 8B on 4xL4 attention=flashdecoding . Before 4.28GB left, After 4.32GB
  left, so 400MB saved.

- Effect not as visible on attention=flashinfer and n_shard=1. I suspect
  it's linked to the torch allocator.

* Adding assertion.
2024-12-03 04:04:21 +01:00
Daniël de Kok
2003d8be0c
Sync (most) server dependencies with Nix (#2782)
* Sync (most) server dependencies with Nix

Skipped most grpcio packages, because of protobuf version
incompatibility with the opentelemetry packages.

* Add a primitive script to generate Poetry commands to sync with Nix

This is not fully automated, since getting the Nix versions may be
unresolvable. However, it does take most of the work out of doing
this manually.

* Upgrade eetq ?

* Fmt.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-03 04:04:06 +01:00
Dmitry Rogozhkin
535149d872
fix: only use eos_token_id as pad_token_id if int (#2774)
LLama 3 has a list of values as eos_token_id:
  "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"
This breaks tokenizer since it expects single value. This
commit uses tokenizer.eos_token_id instead in such a case.

Fixes: #2440

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2024-12-02 06:26:37 +01:00
drbh
2c74c55637
fix: add merge-lora arg for model id (#2788) 2024-12-02 05:52:02 +01:00
Torsten Raudssus
a35d1e6fe5
Removing ../ that broke the link (#2789) 2024-12-02 05:48:55 +01:00
Nicolas Patry
1d2cb356b9
Fix doc. (#2792) 2024-12-02 05:28:26 +01:00
drbh
d471805134
Support continue final message (#2733)
* feat: support continue_final_message param in chat request

* feat: add test for continue final message

* fix: bump openapi docs

* fix: remove continue_final_message chat request param

* fix: remove unneeded launcher args in continue test

* fix: bump test output

* fix: remove accidentally included guideline from rebase

* fix: remove guideline tests

* fix: adjust continuation tests expected text

* fix: replace expected output for continue test
2024-11-27 19:13:30 -05:00
jp
caff779dd4
Fix: docs typo (#2777)
Fix: typo in model loading code

Fix typo in model loading code
2024-11-26 14:28:58 +01:00
Wang, Yi
892a26e549
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… (#2778)
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)

Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-11-26 14:28:11 +01:00
Daniël de Kok
72ab60fdd5
Use FP8 KV cache when specified by compressed-tensors (#2761)
The compressed-tensors configuration can specify the configuration of
the KV cache as well. Use an FP8 KV cache when the configuration tells
us to do so (all other options and types are ignored for now).
2024-11-26 08:27:41 +01:00
Daniël de Kok
289aa48554
Move JSON grammar -> regex grammar conversion to the router (#2772)
* Move JSON grammar -> regex grammar conversion to the router

This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:

simple schema           time:   [5.8293 µs 5.8307 µs 5.8320 µs]
                        change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                        Performance has improved.

complex schema          time:   [14.875 µs 14.881 µs 14.887 µs]
                        change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                        Performance has improved.

Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
2024-11-25 18:47:34 +01:00
drbh
c637d68d74
feat: concat the adapter id to the model id in chat response (#2779)
* feat: concat the adapter id to the model id in chat response

* fix: updated to include only the adapter id in chat response
2024-11-25 12:36:31 -05:00
OlivierDehaene
780531ec77
chore: prepare 2.4.1 release (#2773)
* chore: prepare 2.4.1 release

* fix tests

* fmt
2024-11-22 17:26:15 +00:00
Daniël de Kok
e87893d38e
chore: Update to marlin-kernels 0.3.6 (#2771)
This fixes a bug in 2:4 Marlin:
https://github.com/vllm-project/vllm/pull/10464
2024-11-22 14:44:47 +00:00
OlivierDehaene
ab7ccf5bc3
feat: add payload limit (#2726)
* feat: add payload limit

* update launcher
2024-11-21 18:20:15 +00:00
Hugo Larcher
d5bc6a20bd
feat: Add automatic nightly benchmarks (#2591)
* feat: Add automatic nightly benchmarks

* fix: Update runners group

* fix: add created_at field to results

* fix: Add variable results file location
2024-11-21 17:11:42 +00:00
Lucain
d012f229c6
Remove guideline from API (#2762) 2024-11-21 16:56:38 +00:00
Daniël de Kok
c5b5b3a11c
docs: Add a README section about using Nix (#2767) 2024-11-21 16:53:27 +00:00
drbh
faa10ad0bc
fix: tweak grammar test response (#2769) 2024-11-21 16:46:00 +00:00
OlivierDehaene
8e0c161d0a
fix: incomplete generations w/ single tokens generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754)

entries.len() could > batch.size in prefill, so need to filter as well.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* entries was wrongly extended for model that did not support chunking

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
2024-11-21 16:37:55 +00:00
Daniël de Kok
3c54488638
nix: downgrade to outlines 0.1.3 (#2768) 2024-11-21 13:00:26 +01:00
drbh
6ee8d6dd3b
fix: set outlines version to 0.1.3 to avoid caching serialization issue (#2766)
fix: set outlines version to 0.1.3 to avoid bug
2024-11-20 18:09:39 -05:00
Daniël de Kok
07bed530f7
nix: build and cache impure devshells (#2765)
* nix: build and cache all devshells

* nix: add poetry to the impure shell

This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.

* Fix Nix build, disable pure shell (covered by Nix tests)
2024-11-20 20:56:11 +01:00
Daniël de Kok
46a5a7e73e
Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758)
This change adds support for wNa16 int checkpoints with 2:4 sparsity
using Marlin 2:4 kernels.
2024-11-20 18:25:23 +01:00
Daniël de Kok
2fda8845a7
nix: update for outlines 0.1.4 (#2764) 2024-11-20 18:24:29 +01:00
Daniël de Kok
45013b60a4 Install compressed-tensors in Docker CPU builds 2024-11-20 14:17:47 +00:00
drbh
bd6e8b3c13
fix: adjust llama MLP name from dense to mlp to correctly apply lora (#2760) 2024-11-19 15:10:22 -05:00