Commit Graph

1184 Commits

Author SHA1 Message Date
Morgan Funtowicz
62530649b8 feat(backend): remove constexpig 2024-12-03 16:47:48 +01:00
Morgan Funtowicz
881527a544 feat(backend): remove constexpr from par 2024-12-03 16:46:59 +01:00
Morgan Funtowicz
ad3ed0d1a1 test(backend): add more unittest 2024-12-03 14:39:10 +01:00
Morgan Funtowicz
c94b9de445 feat(backend): add guard to multiple header definitions 2024-12-03 14:07:49 +01:00
Morgan Funtowicz
16ba2f5a2b feat(backend): fix main.rs retrieving the tokenizer 2024-12-03 12:11:17 +01:00
Morgan Funtowicz
874bc28d6c feat(backend): make backend_workspace_t::engines_folder constexpr 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
2f8634ec01 feat(backend): impl missing generation_step_t as return value of pull_tokens 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
a7bad25c41 feat(backend): fix backend_exception_t -> backend_error_t naming 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
879e1a4178 feat(backend): allow overriding which Python to use 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
71e700a6ea feat(backend): use latest trtllm main version to have g++ >= 13 compatibility 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
fd7e2b5bbd feat(backend): more impl 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
df99164dc1 feat(backend): delete previous backend impl 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
25c6bbe142 feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
702dc9cd05 feat(backend): missing return statement 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
87272ffe39 feat(backend): enable compiler warning if support for RVO not applying 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
9bb6309712 feat(backend): added some logging 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
6d3565759a feat(backend): remove all the logs from hardware.hpp 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
3a2698fb79 feat(backend): initial rewrite of the backend for simplicity 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
1830fe8833 test(ctest) enable address sanitizer 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
7a81040d1a feat(hardware) enable new hardware.hpp and unittests 2024-12-03 09:43:03 +01:00
Morgan Funtowicz
0f17415d54 misc(cmake) update dependencies 2024-12-03 09:43:03 +01:00
Nicolas Patry
b57f370386
Saving some VRAM. (#2790)
* Saving some VRAM.

- 8B on 4xL4 attention=flashdecoding . Before 4.28GB left, After 4.32GB
  left, so 400MB saved.

- Effect not as visible on attention=flashinfer and n_shard=1. I suspect
  it's linked to the torch allocator.

* Adding assertion.
2024-12-03 04:04:21 +01:00
Daniël de Kok
2003d8be0c
Sync (most) server dependencies with Nix (#2782)
* Sync (most) server dependencies with Nix

Skipped most grpcio packages, because of protobuf version
incompatibility with the opentelemetry packages.

* Add a primitive script to generate Poetry commands to sync with Nix

This is not fully automated, since getting the Nix versions may be
unresolvable. However, it does take most of the work out of doing
this manually.

* Upgrade eetq ?

* Fmt.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-03 04:04:06 +01:00
Dmitry Rogozhkin
535149d872
fix: only use eos_token_id as pad_token_id if int (#2774)
LLama 3 has a list of values as eos_token_id:
  "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"
This breaks tokenizer since it expects single value. This
commit uses tokenizer.eos_token_id instead in such a case.

Fixes: #2440

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2024-12-02 06:26:37 +01:00
drbh
2c74c55637
fix: add merge-lora arg for model id (#2788) 2024-12-02 05:52:02 +01:00
Torsten Raudssus
a35d1e6fe5
Removing ../ that broke the link (#2789) 2024-12-02 05:48:55 +01:00
Nicolas Patry
1d2cb356b9
Fix doc. (#2792) 2024-12-02 05:28:26 +01:00
drbh
d471805134
Support continue final message (#2733)
* feat: support continue_final_message param in chat request

* feat: add test for continue final message

* fix: bump openapi docs

* fix: remove continue_final_message chat request param

* fix: remove unneeded launcher args in continue test

* fix: bump test output

* fix: remove accidentally included guideline from rebase

* fix: remove guideline tests

* fix: adjust continuation tests expected text

* fix: replace expected output for continue test
2024-11-27 19:13:30 -05:00
jp
caff779dd4
Fix: docs typo (#2777)
Fix: typo in model loading code

Fix typo in model loading code
2024-11-26 14:28:58 +01:00
Wang, Yi
892a26e549
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… (#2778)
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)

Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-11-26 14:28:11 +01:00
Daniël de Kok
72ab60fdd5
Use FP8 KV cache when specified by compressed-tensors (#2761)
The compressed-tensors configuration can specify the configuration of
the KV cache as well. Use an FP8 KV cache when the configuration tells
us to do so (all other options and types are ignored for now).
2024-11-26 08:27:41 +01:00
Daniël de Kok
289aa48554
Move JSON grammar -> regex grammar conversion to the router (#2772)
* Move JSON grammar -> regex grammar conversion to the router

This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:

simple schema           time:   [5.8293 µs 5.8307 µs 5.8320 µs]
                        change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                        Performance has improved.

complex schema          time:   [14.875 µs 14.881 µs 14.887 µs]
                        change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                        Performance has improved.

Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
2024-11-25 18:47:34 +01:00
drbh
c637d68d74
feat: concat the adapter id to the model id in chat response (#2779)
* feat: concat the adapter id to the model id in chat response

* fix: updated to include only the adapter id in chat response
2024-11-25 12:36:31 -05:00
OlivierDehaene
780531ec77
chore: prepare 2.4.1 release (#2773)
* chore: prepare 2.4.1 release

* fix tests

* fmt
2024-11-22 17:26:15 +00:00
Daniël de Kok
e87893d38e
chore: Update to marlin-kernels 0.3.6 (#2771)
This fixes a bug in 2:4 Marlin:
https://github.com/vllm-project/vllm/pull/10464
2024-11-22 14:44:47 +00:00
OlivierDehaene
ab7ccf5bc3
feat: add payload limit (#2726)
* feat: add payload limit

* update launcher
2024-11-21 18:20:15 +00:00
Hugo Larcher
d5bc6a20bd
feat: Add automatic nightly benchmarks (#2591)
* feat: Add automatic nightly benchmarks

* fix: Update runners group

* fix: add created_at field to results

* fix: Add variable results file location
2024-11-21 17:11:42 +00:00
Lucain
d012f229c6
Remove guideline from API (#2762) 2024-11-21 16:56:38 +00:00
Daniël de Kok
c5b5b3a11c
docs: Add a README section about using Nix (#2767) 2024-11-21 16:53:27 +00:00
drbh
faa10ad0bc
fix: tweak grammar test response (#2769) 2024-11-21 16:46:00 +00:00
OlivierDehaene
8e0c161d0a
fix: incomplete generations w/ single tokens generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754)

entries.len() could > batch.size in prefill, so need to filter as well.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* entries was wrongly extended for model that did not support chunking

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
2024-11-21 16:37:55 +00:00
Daniël de Kok
3c54488638
nix: downgrade to outlines 0.1.3 (#2768) 2024-11-21 13:00:26 +01:00
drbh
6ee8d6dd3b
fix: set outlines version to 0.1.3 to avoid caching serialization issue (#2766)
fix: set outlines version to 0.1.3 to avoid bug
2024-11-20 18:09:39 -05:00
Daniël de Kok
07bed530f7
nix: build and cache impure devshells (#2765)
* nix: build and cache all devshells

* nix: add poetry to the impure shell

This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.

* Fix Nix build, disable pure shell (covered by Nix tests)
2024-11-20 20:56:11 +01:00
Daniël de Kok
46a5a7e73e
Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758)
This change adds support for wNa16 int checkpoints with 2:4 sparsity
using Marlin 2:4 kernels.
2024-11-20 18:25:23 +01:00
Daniël de Kok
2fda8845a7
nix: update for outlines 0.1.4 (#2764) 2024-11-20 18:24:29 +01:00
Daniël de Kok
45013b60a4 Install compressed-tensors in Docker CPU builds 2024-11-20 14:17:47 +00:00
drbh
bd6e8b3c13
fix: adjust llama MLP name from dense to mlp to correctly apply lora (#2760) 2024-11-19 15:10:22 -05:00
drbh
5489406c4a
PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme (#2645)
* add OpenAI like tool_choice for named choice

* add tests

* fix: run linter and bump api docs

* fix: consolidate changes and remove old tool type

* feat: improve, simplify and rename tool choice struct add required support and refactor

* fix: simplify tool choice logic, improve tests, openapi and rust docs

* fix: refactor away prepare_chat_input and improve tool grammar apply control flow

* feat: update docs and add tool choice configuration section

* fix: simplify naming, tool choice default and improve test

* fix: adjust tool choice none logic, add test and small refactors

* fix: add missing snapshot file

* fix: adjust tool choice type in test

* fix: adjust default when json tool choice is

* fix: remove trailing space lint after rebase

* fix: remove mostly mocked unit test

---------

Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
2024-11-19 13:31:59 -05:00
Daniël de Kok
2007a9473a
Update to moe-kernels 0.7.0 (#2720)
This version syncs with the vLLM kernels and brings some performance
improvements.
2024-11-19 14:55:29 +01:00