text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-10 15:35:24 +00:00

Author	SHA1	Message	Date
Morgan Funtowicz	c8a99af6c9	(fix): do not recreate the stateful hashmap at every it	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	eb13d8d1f3	(misc): increase verbosity of spdlog	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	ce0cd1fce8	(misc): build with trtllm 0.13.0	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	188e4dc64f	(misc: build for sm_{75,80,86,89,90} by default	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	544c9d9dba	(fix): HOPPER_SM_MAJOR is 9 not 8	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	213acc6e34	(misc) move to latest trtllm	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	507ff66692	(misc) rerun-if-changed all the cmake modules	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b242f45c04	(misc) delete backend.rs	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	984ae9798f	(post) impl postprocessing	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	fa63db0d07	(scheduler) rework submit/pull logic	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	42ccf4e77c	(misc) no need to move for uint32_t items	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b41875c139	(misc) simplify [make_]move_iterator by using c++20 type inference	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b1846fb4e6	(backend) refactor & cleanup	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	483f172938	(ffi) do not use reference capture in lambda as we are not capturing anything	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	3d0e90b631	(ffi) missing namespace for tle::Response	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	8e648ce425	(ffi) fix usage of wrong vector constructor making a capacity fill call	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	dddc9a44bd	(build) fetchcontent use archives instead of git	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	089c5fe668	(server) forward auth_token to server::run	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	291eaa99fb	use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	7bebc629af	(misc) missing Result types for Rust	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	c2e21d8725	(backend) implement the post_processor background thread	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	0dca168bcb	(misc) change scope identifiers	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	933ab67aa1	(ffi) encode the provided user prompt within each request thread	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	0b0c30fe8b	(ffi) remove narrowing type warning	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	fb759bdd2a	(looper) new looper initial implementation	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	5f7c0b67c3	(ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	33c962ef41	(ffi) add missing headers imports	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	2883c042ed	(ffi) cleanup again	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	f4a74be384	(backend) expose PullNewTokens	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b8a40a0af3	(backend) cleanup a bit	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	38b5263c61	(ffi) add max_new_tokens parameters	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	f6f689f509	(build) setup ccache if available	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	2a339f99dd	(trt)	2024-10-21 10:00:25 +02:00
Morgan Funtowicz	0cd7538a48	(ffi) use const for GetSamplingConfig	2024-10-21 09:57:26 +02:00
Morgan Funtowicz	cea64e234f	(chore) fmt ... why?	2024-10-21 09:57:26 +02:00
Morgan Funtowicz	a3f7d76f7b	(launcher) default new server::run parameters to false for now	2024-10-21 09:57:24 +02:00
Morgan Funtowicz	25b20cba2a	(backend) use parking_lot crate for RwLock fairness # Conflicts: # backends/trtllm/src/backend.rs	2024-10-21 09:57:16 +02:00
OlivierDehaene	a6a0c97ed9	feat: prefill chunking (#2600) * wip * rollback * refactor to use prefix/postfix namming + fix all_input_ids_tensor * maybe patching vlms? * fix filter and concat * wip, no filter, no concat * current * add prepare_for_prefill * working * load tested * re-create slots * re-create slots * fix slot_filtering_indices * feedback loop * remove log * fix benchmarker * fix vlm and seq2seq * rename to cache and input lengths * fix prefill logprobs * fix launcher * fix logprobs? * idk at this point * max input length * omfg * remove debugging lines * fix tests * fix mllama * fix cargo tests * remove support chunking for paged * Fixing non blocked attentions * Fixing dtype + AMD, Ipex targets. * lint fix. * rename * Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). * Add simple resolution when user specifies ATTENTION=paged. * Put back non default simple tests. * Fix env name --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-16 12:49:33 +02:00
Nicolas Patry	0204946d26	Max token capacity metric (#2595) * adding max_token_capacity_metric * added tgi to name of metric * Adding max capacity metric. * Add description for the metrics --------- Co-authored-by: Edwinhr716 <Edandres249@gmail.com>	2024-10-02 16:32:36 +02:00
Nicolas Patry	0ff6ff60ad	Hotfixing main (#2556)	2024-09-24 11:51:14 +02:00
OlivierDehaene	10e6f29295	chore: Add old V2 backend (#2551) * wip * added v2	2024-09-24 08:38:17 +02:00
Nicolas Patry	38fcafcf96	Adding a test for FD. (#2516) * Adding a test for FD. * Fixing flashdecoding (empty batch doesn't work). * Fixing the invalid popping. * Fixing radix with block_size > 1 * Last reference. * Use an actual hash. * Update hash for slice.len() == 1 * Update the locks. * Increasing docker timeout.	2024-09-16 17:00:54 +02:00
Nicolas Patry	dae3bf1d87	Fix tokenization yi (#2507) * Fixing odd tokenization self modifications on the Rust side (load and resave in Python). * Fixing the builds ? * Fix the gh action? * Fixing the location ? * Validation is odd. * Try a faster runner * Upgrade python version. * Remove sccache * No sccache. * Getting libpython maybe ? * List stuff. * Monkey it up. * have no idea at this point * Tmp. * Shot in the dark. * Tmate the hell out of this. * Desperation. * WTF. * -y. * Apparently 3.10 is not available anymore. * Updating the dockerfile to make libpython discoverable at runtime too. * Put back rust tests. * Why do we want mkl on AMD ? * Forcing 3.11 ?	2024-09-11 22:41:56 +02:00
Nicolas Patry	a4e3e8c608	Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) * Adding prefix test. * [WIP] tmp dump of integration load tests. * Remove other tensor creation. * Fixed the radix tree. Used a slice everywhere in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids. * Fix parsing * Is it really flashinfer version ? * Remove some comments. * Revert the max prefix hit. * Adding numpy to diff. * Upgraded flashinfer. * Upgrading some stuff. * Are we done yet ? * Minor fixup * Remove 1 log and put back the other. * Add comment for why slot 0 is OK. * Mounting on the job. * Get me a debug branch * Debugging CIs is fun. * Attempt #28 * wip * Tmate. * Praying. * Updating VLM causal model with updated context. * Important line got squashed. * Tmate again. * Fingers crossed. * We want only 1 run of integration tests..... --------- Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>	2024-09-11 18:10:40 +02:00
Nicolas Patry	c1fe28d694	Fixing more correctly the invalid drop of the batch. (#2498)	2024-09-06 17:35:49 +02:00
Daniël de Kok	379472c4c2	radix trie: add assertions (#2491) These should all be cheap assertions. Also: * Fixup some comments. * Delete a `remove` that was done unnecessarily twice.	2024-09-06 11:55:23 +02:00
Daniël de Kok	deec30f893	hotfix: avoid non-prefilled block use when using prefix caching (#2489) The minimum batch size logic could cause prefix blocks to be deallocated without prefill. The next allocation of the same prefix would then use garbage blocks.	2024-09-05 15:09:29 +02:00
Nicolas Patry	e415b690a6	Lots of improvements (Still 2 allocators) (#2449) * Making prefix/flashinfer the default and testing the full release tests. * Include flashinfer in the docker. * Using prebuilt. * Allowing window_left_size (dummy version). * Disabling flashinfer/prefix caching on odd head_dim * Disable prefix caching for lora. * More specific codes. * Update lock * Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere. * Update cargo lock ? * Upgrade to 1.80 because of bitstream... * Everywhere 1.80 * Forgot last default place. * Apply suggestions from code review Co-authored-by: drbh <david.richard.holtz@gmail.com> * Updated flake lock * Tmp * Upgrade resolution system for less errors in resolution. * Remove lambda for cleaner function. * Handling debugger. * OVerride the env in server tests. * Is this enough to make it work ? * This seems to be working. * Downgrade some logs. * Fixing the default for vlm. * Don't enable prefix caching on VLM just yet. * Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now) * Fixing prefix caching for flashdecoding. * Update all models. * Fixed flashinfer version. * add_special_tokens is internal only * Fixing seqlen with the new vlms. * Fixing the issue with `add_special_tokens` not being passed around. * Fixing the test. * Removing encoder_decoder (seq2seq). * Update the chat test. * Fixing the batching tokenization in flash causal lm. * Truncating left for radix purposes. * Oops this doesn't belong here. * Put back default pure shell. * Update server tests - Default to throughput test in k6 - Use TGI_WIGGLE_ROOM to adjust wiggle room * Only n_heads / process_group.size() are necessary. * Revert the integrationt tests change (seem linked to head_size modification). * Adding error message when assert is violated. * Fixing the free algorithm to handle times where the common prefix is smaller. * Apply suggestions from code review Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Update server/text_generation_server/layers/attention/common.py Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Fix disabling prefix caching - Fix windowing checks. * Revert the Cohere tokenizer change (for now using a revision instead). * Fmt. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2024-08-29 16:29:01 +02:00
drbh	21187c27c9	fix: bump minijinja version and add test for llama 3.1 tools (#2463) * fix: support tojson and avoid message indexing issue in template * fix: prefer minijinja native methods and prefer workspace level dependency * fix: adjust comment typo	2024-08-27 13:31:08 -04:00
Nicolas Patry	b70ae0969f	Prefix caching (#2402) * Prefix caching WIP * Fixing prefix attention. * Fixing flashinfer import. * Fixing black. * Fixing medusa (still wrong outputs, but functional). * Just medusa values now. * Fixing medusa without prefix caching. * Fixing prefix caching. * Medusa requires reshaping. * Removing the logs. * Remove router.nix * Fixup: - Remove logs - Disable VLMs (they do not work) - Disable prefix caching when user wants prefill logprobs. * Update flake.lock --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-08-20 11:15:30 +02:00

58 Commits