Commit Graph

65 Commits

Author SHA1 Message Date
Morgan Funtowicz
11c593dc69 feat(backend): make eog clearer on c++ side 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
06424aa9ff feat(backend): correctly handle the max_new_tokens case for is_eos 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
05ff551950 feat(backend): add number of generated tokens in the callback 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
188442f67d misc(lint): make clippy happier 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
31d9254776 feat(backend): remove static from inner_fw visitor as it leads to invalid memory locations 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
7b0a56f40f feat(backend): fix memory leaking on llama_sampler when the decode ends 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
86a2ae6ba2 chore: unused variables 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
2cdfed94d9 feat(backend): correctly link to shared fmt and spdlog instead of static 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
bd8f0f15e1 feat(backend): fix invalid reference to ctx instead of context in release build 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
3e82f14f57 feat(backend): somewhat generates the final infer response 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
b50dcddbb8 feat(backend): avoid dropping the boxed stream at the end of the callback 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
612f2f939f feat(backend): bind incoming request to the server 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
d4aee42fd8 feat(backend): add logit parameter in the callback fn 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
f39edc72ff feat(backend): add mapping for ignore_eos_token stopping criteria 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
3af2c6837c misc(offline): match rework 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
d52b4c4978 feat(backend): full rework of the backend internal to safer c++ 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
6a5f6b0755 misc(offline): update offline tester 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
b98c635781 feat(backend): entirely rewrite backend 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
611590440d misc(offline): expose more parameters for generate 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
dbc5b7a0f7 misc(offline): link correctly 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
0c1dd0ed2b feat(llamacpp): wip explosion 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
a316c53255 feat(llamacpp): expose number of threads for the backend when constructing the model 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
179309b364 misc(build): refactor build type detection in cmake 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
f0859c247f misc(build): handle different lib destination folder lib/lib64 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
e4d803c94e feat(backend): build and link through build.rs 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
355d8a55b4 feat(backend): wip Rust binding 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
f9c248657d chore(backend): minor formatting 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
37faeb34b2 feat(backend): expose frequency and repetition penalties 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
d4b5be10f9 feat(backend): minor refactor 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
92bb113653 feat(backend): use llama_token as TokenId type 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
45d5a6a8c5 feat(backend): add some initial decoding steps 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
098c66920d feat(backend): tell cmake to build llama-common and link to it 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
0911076320 feat(backend): correctly load llama.cpp model from llama api and not gpt2 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
05ad684676 feat(llamacpp): enable cuda 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
fa89d1e613 misc(cmake): wut 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
e4432d36b1 misc(cmake): add parameter to build specific cuda arch 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
52d57dca79 feat(llamacpp): initial end2end build 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
7d1f8a2bd6 feat(llamacpp): correctly handle CMAKE_BUILD_TYPE for spdlog macros 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
aa1fcba59f feat(llamacpp): initial commit
# Conflicts:
#	Cargo.lock
2024-11-14 08:42:01 +01:00
Nicolas Patry
0c9b6cdd76
Choosing input/total tokens automatically based on available VRAM? (#2673)
* Choosing input/total tokens automatically based on available VRAM?

* Update doc.

* Remove generated files.

* Trying to fix non chunking targets.

* Attempt #2

* fix.

* QuantLinear is rocm compatible.

* Much simpler logic after the overhead.

* Updating logic + non flash.

* Revert doc text.

* Simple updates.

* Fix integration mt0 (transformers update).
2024-10-28 04:59:49 +01:00
Funtowicz Morgan
ba5fc7d922
Add support for stop words in TRTLLM (#2678)
* feat(trtllm): rewrite health to not account for current state

* chore(looper): cleanup a bit more

* feat(post_processing): max_new_tokens is const evaluated now

* chore(ffi): formatting

* feat(trtllm): add stop words handling

# Conflicts:
#	backends/trtllm/lib/backend.cpp

* chore(trtllm): create specific parallelconfig factory and logging init methods

* chore(trtllm): define a macro for SizeType cast

* chore(trtllm): use GetParallelConfig

* chore(trtllm): minor refactoring

* chore(trtllm): validate there are enough GPUs on the system for the desired model

* chore(trtllm): ensure max throughput scheduling policy is selected

* chore(trtllm): minor fix

* chore(router): minor refactorings

* feat(docker): build with-slurm ompi

* feat(docker): add python3.10 dev to runtime deps

* chore(docker): add mpi to ld_library_path

* chore(docker): install transformers

* feat(trtllm): detect stop_words from generation_config.json
2024-10-25 10:58:34 +02:00
Funtowicz Morgan
43df056eee
[TENSORRT-LLM] - Implement new looper thread based backend (#2357)
* (backend) use parking_lot crate for RwLock fairness

# Conflicts:
#	backends/trtllm/src/backend.rs

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?

* (ffi) use const for GetSamplingConfig

* (server) expose new SchedulingError

* (trt)

* (build) setup ccache if available

* (ffi) add max_new_tokens parameters

* (backend) cleanup a bit

* (backend) expose PullNewTokens

* (ffi) cleanup again

* (ffi) add missing headers imports

* (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>

* (looper) new looper initial implementation

* (ffi) remove narrowing type warning

* (ffi) encode the provided user prompt within each request thread

* (misc) change scope identifiers

* (backend) implement the post_processor background thread

* (misc) missing Result types for Rust

* use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step

* (server) forward auth_token to server::run

* (build) fetchcontent use archives instead of git

* (ffi) fix usage of wrong vector constructor making a capacity fill call

* (ffi) missing namespace for tle::Response

* (ffi) do not use reference capture in lambda as we are not capturing anything

* (backend) refactor & cleanup

* (Dockerfile.trtllm) delete for now

* (misc) simplify [make_]move_iterator by using c++20 type inference

* (misc) no need to move for uint32_t items

* (scheduler) rework submit/pull logic

* (post) impl postprocessing

* (misc) delete backend.rs

* (misc) rerun-if-changed all the cmake modules

* (misc) move to latest trtllm

* (fix): HOPPER_SM_MAJOR is 9 not 8

* (misc): build for sm_{75,80,86,89,90} by default

* (misc): build with trtllm 0.13.0

* (misc): increase verbosity of spdlog

* (fix): do not recreate the stateful hashmap at every it

* (misc): update dependency in trtllm dockerfile

* (misc): update dependency in trtllm dockerfile

* (misc): disable logging in release mode

* (misc): improve trtllm download script robustness

* (fix): more fixes for Dockerfile

* misc(cuda): require 12.6

* chore(cmake): use correct policy for download_timestamp

* feat(looper): check engine and executorWorker paths exist before creating the backend

* chore(cmake): download timestamp should be before URL

* feat(looper): minor optimizations to avoid growing too much the containers

* chore(trtllm): move dockerfile to right place

* chore(trtllm): disable tokenizer parallelism by default

* chore(trtllm): fmt

* chore(trtllm): post-rebase commit

* chore(trtllm): remove unused method

* feat(trtllm): cache maxNumTokens to avoid calling JSON every time

* misc(router): remove SchedulingError

* feat(trtllm): do not tokenize twice

* Revert "chore(trtllm): remove unused method"

This reverts commit 31747163

* chore(rebase): fix invalid references

* chore(router): add python dependency

* Lint.

* Fix bad rebase

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 07:17:14 +02:00
Nicolas Patry
ed87b464b4
Fixing "deadlock" when python prompts for trust_remote_code by always (#2664)
specifying a value.
2024-10-25 06:39:21 +02:00
OlivierDehaene
41c2623735
feat: allow any supported payload on /invocations (#2683)
* feat: allow any supported payload on /invocations

* update openAPI

* update doc
2024-10-23 11:26:01 +00:00
OlivierDehaene
a6a0c97ed9
feat: prefill chunking (#2600)
* wip

* rollback

* refactor to use prefix/postfix naming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the times).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-16 12:49:33 +02:00
Nicolas Patry
0204946d26
Max token capacity metric (#2595)
* adding max_token_capacity_metric

* added tgi to name of metric

* Adding max capacity metric.

* Add description for the metrics

---------

Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
2024-10-02 16:32:36 +02:00
Nicolas Patry
0ff6ff60ad
Hotfixing main (#2556) 2024-09-24 11:51:14 +02:00
OlivierDehaene
10e6f29295
chore: Add old V2 backend (#2551)
* wip

* added v2
2024-09-24 08:38:17 +02:00
Nicolas Patry
38fcafcf96
Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-16 17:00:54 +02:00
Nicolas Patry
dae3bf1d87
Fix tokenization yi (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
2024-09-11 22:41:56 +02:00