Commit Graph

65 Commits

Author SHA1 Message Date
Morgan Funtowicz
11c593dc69 feat(backend): make eog clearer on c++ side 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
06424aa9ff feat(backend): correctly handle the max_new_tokens case for is_eos 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
05ff551950 feat(backend): add number of generated tokens in the callback 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
188442f67d misc(lint): make clippy happier 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
31d9254776 feat(backend): remove static from inner_fw visitor as it leads to invalid memory locations 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
7b0a56f40f feat(backend): fix memory leaking on llama_sampler when the decode ends 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
86a2ae6ba2 chore: unused variables 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
2cdfed94d9 feat(backend): correctly link to shared fmt and spdlog instead of static 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
bd8f0f15e1 feat(backend): fix invalid reference to ctx instead of context in release build 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
3e82f14f57 feat(backend): somewhat generates the final infer response 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
b50dcddbb8 feat(backend): avoid dropping the boxed stream at the end of the callback 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
612f2f939f feat(backend): bind incoming request to the server 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
d4aee42fd8 feat(backend): add logit parameter in the callback fn 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
f39edc72ff feat(backend): add mapping for ignore_eos_token stopping criteria 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
3af2c6837c misc(offline): match rework 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
d52b4c4978 feat(backend): full rework of the backend internal to safer c++ 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
6a5f6b0755 misc(offline): update offline tester 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
b98c635781 feat(backend): entirely rewrite backend 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
611590440d misc(offline): expose more parameters for generate 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
dbc5b7a0f7 misc(offline): link correctly 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
0c1dd0ed2b feat(llamacpp): wip explosion 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
a316c53255 feat(llamacpp): expose number of threads for the backend when constructing the model 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
179309b364 misc(build): refactor build type detection in cmake 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
f0859c247f misc(build): handle different lib destination folder lib/lib64 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
e4d803c94e feat(backend): build and link through build.rs 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
355d8a55b4 feat(backend): wip Rust binding 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
f9c248657d chore(backend): minor formatting 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
37faeb34b2 feat(backend): expose frequency and repetition penalties 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
d4b5be10f9 feat(backend): minor refactor 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
92bb113653 feat(backend): use llama_token as TokenId type 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
45d5a6a8c5 feat(backend): add some initial decoding steps 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
098c66920d feat(backend): tell cmake to build llama-common and link to it 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
0911076320 feat(backend): correctly load llama.cpp model from llama api and not gpt2 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
05ad684676 feat(llamacpp): enable cuda 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
fa89d1e613 misc(cmake): wut 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
e4432d36b1 misc(cmake): add parameter to build specific cuda arch 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
52d57dca79 feat(llamacpp): initial end2end build 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
7d1f8a2bd6 feat(llamacpp): correctly handle CMAKE_BUILD_TYPE for spdlog macros 2024-11-14 08:42:01 +01:00
Morgan Funtowicz
aa1fcba59f feat(llamacpp): initial commit
# Conflicts:
#	Cargo.lock
2024-11-14 08:42:01 +01:00
Nicolas Patry
0c9b6cdd76
Choosing input/total tokens automatically based on available VRAM? (#2673)
* Choosing input/total tokens automatically based on available VRAM?

* Update doc.

* Remove generated files.

* Trying to fix non chunking targets.

* Attempt #2

* fix.

* QuantLinear is rocm compatible.

* Much simpler logic after the overhead.

* Updating logic + non flash.

* Revert doc text.

* Simple updates.

* Fix integration mt0 (transformers update).
2024-10-28 04:59:49 +01:00
Funtowicz Morgan
ba5fc7d922
Add support for stop words in TRTLLM (#2678)
* feat(trtllm): rewrite health to not account for current state

* chore(looper): cleanup a bit more

* feat(post_processing): max_new_tokens is const evaluated now

* chore(ffi): formatting

* feat(trtllm): add stop words handling

# Conflicts:
#	backends/trtllm/lib/backend.cpp

* chore(trtllm): create specific parallelconfig factory and logging init methods

* chore(trtllm): define a macro for SizeType cast

* chore(trtllm): use GetParallelConfig

* chore(trtllm): minor refactoring

* chore(trtllm): validate there are enough GPUs on the system for the desired model

* chore(trtllm): ensure max throughput scheduling policy is selected

* chore(trtllm): minor fix

* chore(router): minor refactorings

* feat(docker): build with-slurm ompi

* feat(docker): add python3.10 dev to runtime deps

* chore(docker): add mpi to ld_library_path

* chore(docker): install transformers

* feat(trtllm): detect stop_words from generation_config.json
2024-10-25 10:58:34 +02:00
Funtowicz Morgan
43df056eee
[TENSORRT-LLM] - Implement new looper thread based backend (#2357)
* (backend) use parking_lot crate for RwLock fairness

# Conflicts:
#	backends/trtllm/src/backend.rs

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?

* (ffi) use const for GetSamplingConfig

* (server) expose new SchedulingError

* (trt)

* (build) setup ccache if available

* (ffi) add max_new_tokens parameters

* (backend) cleanup a bit

* (backend) expose PullNewTokens

* (ffi) cleanup again

* (ffi) add missing headers imports

* (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>

* (looper) new looper initial implementation

* (ffi) remove narrowing type warning

* (ffi) encode the provided user prompt within each request thread

* (misc) change scope identifiers

* (backend) implement the post_processor background thread

* (misc) missing Result types for Rust

* use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step

* (server) forward auth_token to server::run

* (build) fetchcontent use archives instead of git

* (ffi) fix usage of wrong vector constructor making a capacity fill call

* (ffi) missing namespace for tle::Response

* (ffi) do not use reference capture in lambda as we are not capturing anything

* (backend) refactor & cleanup

* (Dockerfile.trtllm) delete for now

* (misc) simplify [make_]move_iterator by using c++20 type inference

* (misc) no need to move for uint32_t items

* (scheduler) rework submit/pull logic

* (post) impl postprocessing

* (misc) delete backend.rs

* (misc) rerun-if-changed all the cmake modules

* (misc) move to latest trtllm

* (fix): HOPPER_SM_MAJOR is 9 not 8

* (misc): build for sm_{75,80,86,89,90} by default

* (misc): build with trtllm 0.13.0

* (misc): increase verbosity of spdlog

* (fix): do not recreate the stateful hashmap at every it

* (misc): update dependency in trtllm dockerfile

* (misc): update dependency in trtllm dockerfile

* (misc): disable logging in release mode

* (misc): improve trtllm download script robustness

* (fix): more fixes for Dockerfile

* misc(cuda): require 12.6

* chore(cmake): use correct policy for download_timestamp

* feat(looper): check engine and executorWorker paths exist before creating the backend

* chore(cmake): download timestamp should be before URL

* feat(looper): minor optimizations to avoid growing too much the containers

* chore(trtllm): move dockerfile to right place

* chore(trtllm): disable tokenizer parallelism by default

* chore(trtllm): fmt

* chore(trtllm): post-rebase commit

* chore(trtllm): remove unused method

* feat(trtllm): cache maxNumTokens to avoid calling JSON every time

* misc(router): remove SchedulingError

* feat(trtllm): do not tokenize twice

* Revert "chore(trtllm): remove unused method"

This reverts commit 31747163

* chore(rebase): fix invalid references

* chore(router): add python dependency

* Lint.

* Fix bad rebase

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-25 07:17:14 +02:00
Nicolas Patry
ed87b464b4
Fixing "deadlock" when python prompts for trust_remote_code by always (#2664)
specifying a value.
2024-10-25 06:39:21 +02:00
OlivierDehaene
41c2623735
feat: allow any supported payload on /invocations (#2683)
* feat: allow any supported payload on /invocations

* update openAPI

* update doc
2024-10-23 11:26:01 +00:00
OlivierDehaene
a6a0c97ed9
feat: prefill chunking (#2600)
* wip

* rollback

* refactor to use prefix/postfix naming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the times).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-16 12:49:33 +02:00
Nicolas Patry
0204946d26
Max token capacity metric (#2595)
* adding max_token_capacity_metric

* added tgi to name of metric

* Adding max capacity metric.

* Add description for the metrics

---------

Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
2024-10-02 16:32:36 +02:00
Nicolas Patry
0ff6ff60ad
Hotfixing main (#2556) 2024-09-24 11:51:14 +02:00
OlivierDehaene
10e6f29295
chore: Add old V2 backend (#2551)
* wip

* added v2
2024-09-24 08:38:17 +02:00
Nicolas Patry
38fcafcf96
Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-16 17:00:54 +02:00
Nicolas Patry
dae3bf1d87
Fix tokenization yi (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
2024-09-11 22:41:56 +02:00