text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-11 07:55:24 +00:00

Author	SHA1	Message	Date
Morgan Funtowicz	0c3ba932cc	(misc): disable logging in release mode	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	437c2aa142	(misc): update dependency in trtllm dockerfile	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	cb69c9a967	(misc): update dependency in trtllm dockerfile	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	c8a99af6c9	(fix): do not recreate the stateful hashmap at every it	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	eb13d8d1f3	(misc): increase verbosity of spdlog	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	ce0cd1fce8	(misc): build with trtllm 0.13.0	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	188e4dc64f	(misc: build for sm_{75,80,86,89,90} by default	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	544c9d9dba	(fix): HOPPER_SM_MAJOR is 9 not 8	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	213acc6e34	(misc) move to latest trtllm	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	507ff66692	(misc) rerun-if-changed all the cmake modules	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b242f45c04	(misc) delete backend.rs	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	984ae9798f	(post) impl postprocessing	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	fa63db0d07	(scheduler) rework submit/pull logic	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	42ccf4e77c	(misc) no need to move for uint32_t items	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b41875c139	(misc) simplify [make_]move_iterator by using c++20 type inference	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	0f50539b77	(Dockerfile.trtllm) delete for now	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b1846fb4e6	(backend) refactor & cleanup	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	483f172938	(ffi) do not use reference capture in lambda as we are not capturing anything	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	3d0e90b631	(ffi) missing namespace for tle::Response	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	8e648ce425	(ffi) fix usage of wrong vector constructor making a capacity fill call	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	dddc9a44bd	(build) fetchcontent use archives instead of git	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	089c5fe668	(server) forward auth_token to server::run	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	291eaa99fb	use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	7bebc629af	(misc) missing Result types for Rust	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	c2e21d8725	(backend) implement the post_processor background thread	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	0dca168bcb	(misc) change scope identifiers	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	933ab67aa1	(ffi) encode the provided user prompt within each request thread	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	0b0c30fe8b	(ffi) remove narrowing type warning	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	fb759bdd2a	(looper) new looper initial implementation	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	5f7c0b67c3	(ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	33c962ef41	(ffi) add missing headers imports	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	2883c042ed	(ffi) cleanup again	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	f4a74be384	(backend) expose PullNewTokens	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b8a40a0af3	(backend) cleanup a bit	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	38b5263c61	(ffi) add max_new_tokens parameters	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	f6f689f509	(build) setup ccache if available	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	2a339f99dd	(trt)	2024-10-21 10:00:25 +02:00
Morgan Funtowicz	169e1f452f	(server) expose new SchedulingError	2024-10-21 10:00:04 +02:00
Morgan Funtowicz	0cd7538a48	(ffi) use const for GetSamplingConfig	2024-10-21 09:57:26 +02:00
Morgan Funtowicz	cea64e234f	(chore) fmt ... why?	2024-10-21 09:57:26 +02:00
Morgan Funtowicz	a3f7d76f7b	(launcher) default new server::run parameters to false for now	2024-10-21 09:57:24 +02:00
Morgan Funtowicz	25b20cba2a	(backend) use parking_lot crate for RwLock fairness # Conflicts: # backends/trtllm/src/backend.rs	2024-10-21 09:57:16 +02:00
Daniël de Kok	5e0fb46821	Make handling of FP8 scales more consisent (#2666) Change `fp8_quantize` so that we can pass around reciprocals everywhere, so scales are always passed around in the checkpoint format. I also noticed that we ignore any input scales that we might have when fbgemm is available. Skip this path if we already have a scale.	2024-10-19 09:05:01 +02:00
Nicolas Patry	153ff3740b	CI job. Gpt awq 4 (#2665) * add gptq and awq int4 support in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix ci failure Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * set kv cache dtype Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * refine the code according to the review command Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Simplifying conditionals + reverting integration tests values. * Unused import * Fix redundant import. * Revert change after rebase. * Upgrading the tests (TP>1 fix changes to use different kernels.) * Update server/text_generation_server/layers/gptq/__init__.py --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-18 17:55:53 +02:00
Daniël de Kok	8ec57558cd	Break cycle between the attention implementations and KV cache (#2627)	2024-10-17 14:54:22 +02:00
drbh	5f32dea1e2	fix: prefer inplace softmax to avoid copy (#2661) * fix: prefer inplace softmax to avoid copy * Update server/text_generation_server/models/flash_causal_lm.py Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-17 08:49:02 -04:00
oOraph	1b97e084bf	fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process (#2663) tgi-entrypoint: exec instead of spawning a child process reason: otherwise parent will receive the signals when we'd like tgi to receive them keeping the parent/child is not necessary and would require the parent to handle signals to forward them properly to the child Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>	2024-10-17 11:15:26 +02:00
Daniël de Kok	59ea38cbca	Simplify the `attention` function (#2609) * Simplify the `attention` function - Use one definition rather than multiple. - Add `key`/`value` arguments, so that we don't need the `PREFILL_IN_KVCACHE` constant. - Make it kwargs-only (to avoid mixing up the various `Tensor` args). * Fixup flashinfer support	2024-10-17 10:42:52 +02:00
Daniël de Kok	5bbe1ce028	Support `e4m3fn` KV cache (#2655) * Support `e4m3fn` KV cache * Make check more obvious	2024-10-17 10:42:16 +02:00
OlivierDehaene	a6a0c97ed9	feat: prefill chunking (#2600) * wip * rollback * refactor to use prefix/postfix namming + fix all_input_ids_tensor * maybe patching vlms? * fix filter and concat * wip, no filter, no concat * current * add prepare_for_prefill * working * load tested * re-create slots * re-create slots * fix slot_filtering_indices * feedback loop * remove log * fix benchmarker * fix vlm and seq2seq * rename to cache and input lengths * fix prefill logprobs * fix launcher * fix logprobs? * idk at this point * max input length * omfg * remove debugging lines * fix tests * fix mllama * fix cargo tests * remove support chunking for paged * Fixing non blocked attentions * Fixing dtype + AMD, Ipex targets. * lint fix. * rename * Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). * Add simple resolution when user specifies ATTENTION=paged. * Put back non default simple tests. * Fix env name --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-16 12:49:33 +02:00


1124 Commits