text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-10 23:45:23 +00:00

Author	SHA1	Message	Date
Morgan Funtowicz	fa63db0d07	(scheduler) rework submit/pull logic	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	42ccf4e77c	(misc) no need to move for uint32_t items	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b41875c139	(misc) simplify [make_]move_iterator by using c++20 type inference	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	0f50539b77	(Dockerfile.trtllm) delete for now	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b1846fb4e6	(backend) refactor & cleanup	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	483f172938	(ffi) do not use reference capture in lambda as we are not capturing anything	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	3d0e90b631	(ffi) missing namespace for tle::Response	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	8e648ce425	(ffi) fix usage of wrong vector constructor making a capacity fill call	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	dddc9a44bd	(build) fetchcontent use archives instead of git	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	089c5fe668	(server) forward auth_token to server::run	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	291eaa99fb	use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	7bebc629af	(misc) missing Result types for Rust	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	c2e21d8725	(backend) implement the post_processor background thread	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	0dca168bcb	(misc) change scope identifiers	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	933ab67aa1	(ffi) encode the provided user prompt within each request thread	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	0b0c30fe8b	(ffi) remove narrowing type warning	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	fb759bdd2a	(looper) new looper initial implementation	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	5f7c0b67c3	(ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	33c962ef41	(ffi) add missing headers imports	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	2883c042ed	(ffi) cleanup again	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	f4a74be384	(backend) expose PullNewTokens	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	b8a40a0af3	(backend) cleanup a bit	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	38b5263c61	(ffi) add max_new_tokens parameters	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	f6f689f509	(build) setup ccache if available	2024-10-21 10:00:27 +02:00
Morgan Funtowicz	2a339f99dd	(trt)	2024-10-21 10:00:25 +02:00
Morgan Funtowicz	169e1f452f	(server) expose new SchedulingError	2024-10-21 10:00:04 +02:00
Morgan Funtowicz	0cd7538a48	(ffi) use const for GetSamplingConfig	2024-10-21 09:57:26 +02:00
Morgan Funtowicz	cea64e234f	(chore) fmt ... why?	2024-10-21 09:57:26 +02:00
Morgan Funtowicz	a3f7d76f7b	(launcher) default new server::run parameters to false for now	2024-10-21 09:57:24 +02:00
Morgan Funtowicz	25b20cba2a	(backend) use parking_lot crate for RwLock fairness # Conflicts: # backends/trtllm/src/backend.rs	2024-10-21 09:57:16 +02:00
Daniël de Kok	5e0fb46821	Make handling of FP8 scales more consistent (#2666) Change `fp8_quantize` so that we can pass around reciprocals everywhere, so scales are always passed around in the checkpoint format. I also noticed that we ignore any input scales that we might have when fbgemm is available. Skip this path if we already have a scale.	2024-10-19 09:05:01 +02:00
Nicolas Patry	153ff3740b	CI job. Gpt awq 4 (#2665) * add gptq and awq int4 support in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix ci failure Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * set kv cache dtype Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * refine the code according to the review command Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Simplifying conditionals + reverting integration tests values. * Unused import * Fix redundant import. * Revert change after rebase. * Upgrading the tests (TP>1 fix changes to use different kernels.) * Update server/text_generation_server/layers/gptq/__init__.py --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-18 17:55:53 +02:00
Daniël de Kok	8ec57558cd	Break cycle between the attention implementations and KV cache (#2627)	2024-10-17 14:54:22 +02:00
drbh	5f32dea1e2	fix: prefer inplace softmax to avoid copy (#2661) * fix: prefer inplace softmax to avoid copy * Update server/text_generation_server/models/flash_causal_lm.py Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-17 08:49:02 -04:00
oOraph	1b97e084bf	fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process (#2663) tgi-entrypoint: exec instead of spawning a child process reason: otherwise parent will receive the signals when we'd like tgi to receive them keeping the parent/child is not necessary and would require the parent to handle signals to forward them properly to the child Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>	2024-10-17 11:15:26 +02:00
Daniël de Kok	59ea38cbca	Simplify the `attention` function (#2609) * Simplify the `attention` function - Use one definition rather than multiple. - Add `key`/`value` arguments, so that we don't need the `PREFILL_IN_KVCACHE` constant. - Make it kwargs-only (to avoid mixing up the various `Tensor` args). * Fixup flashinfer support	2024-10-17 10:42:52 +02:00
Daniël de Kok	5bbe1ce028	Support `e4m3fn` KV cache (#2655) * Support `e4m3fn` KV cache * Make check more obvious	2024-10-17 10:42:16 +02:00
OlivierDehaene	a6a0c97ed9	feat: prefill chunking (#2600) * wip * rollback * refactor to use prefix/postfix naming + fix all_input_ids_tensor * maybe patching vlms? * fix filter and concat * wip, no filter, no concat * current * add prepare_for_prefill * working * load tested * re-create slots * re-create slots * fix slot_filtering_indices * feedback loop * remove log * fix benchmarker * fix vlm and seq2seq * rename to cache and input lengths * fix prefill logprobs * fix launcher * fix logprobs? * idk at this point * max input length * omfg * remove debugging lines * fix tests * fix mllama * fix cargo tests * remove support chunking for paged * Fixing non blocked attentions * Fixing dtype + AMD, Ipex targets. * lint fix. * rename * Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). * Add simple resolution when user specifies ATTENTION=paged. * Put back non default simple tests. * Fix env name --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-16 12:49:33 +02:00
Mohit Sharma	704a58c807	Fp8 e4m3_fnuz support for rocm (#2588) * (feat) fp8 fnuz support for rocm * (review comments) Fix compression_config load, type hints * (bug) update all has_tensor * (review_comments) fix typo and added comments * (nit) improved comment	2024-10-16 09:54:50 +02:00
Alvaro Bartolome	ffe05ccd05	Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` (#2651) As spotted by @philschmid, the payload was compliant with Vertex AI, but just partially, since ideally the most compliant version would be with the generation kwargs flattened to be on the same level as the `messages`; meaning that Vertex AI would still expect a list of instances, but each instance would be an OpenAI-compatible instance, which is more clear; and more aligned with the SageMaker integration too, so kudos to him for spotting that; and sorry from my end for any inconvenience @Narsil.	2024-10-15 18:11:59 +02:00
Daniël de Kok	ce7e356561	Use flashinfer for Gemma 2.	2024-10-15 13:49:32 +00:00
Nicolas Patry	cf04a43fb1	Fixing linters. (#2650)	2024-10-15 12:43:49 +02:00
Dmitry Rogozhkin	58848cb471	feat: enable pytorch xpu support for non-attention models (#2561) XPU backend is available natively (without IPEX) in pytorch starting from pytorch 2.4. This commit extends TGI to cover the case when user has XPU support thru pytorch 2.4, but does not have IPEX installed. Models which don't require attention can work. For attention required models more work is needed to provide attention implementation. Tested with the following models: * teknium/OpenHermes-2.5-Mistral-7B * bigscience/bloom-560m * google/gemma-7b * google/flan-t5-xxl Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>	2024-10-14 18:28:49 +02:00
Wang, Yi	7a82ddcbd0	update ipex to fix incorrect output of mllama in cpu (#2640) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-14 16:32:33 +02:00
Omar Sanseviero	51f5401893	Clarify gated description and quicktour (#2631) Update quicktour.md	2024-10-14 16:31:37 +02:00
Nicolas Patry	3ea82d008c	Cpu perf (#2596) * break when there's nothing to read Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Different approach, only listen on stdin when `LOG_LEVEL=debug` (which is where dropping to a debugger is important). --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-14 15:34:08 +02:00
Omar Sanseviero	ce28ee88d5	Small fixes for supported models (#2471) * Small improvements for docs * Update _toctree.yml * Updating the doc (we keep the list actually). --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-14 15:26:39 +02:00
Nicolas Patry	0c478846c5	Fixing intel Supports windowing. (#2637)	2024-10-11 21:47:03 +02:00
Nicolas Patry	3dbdf63ec5	Intel ci (#2630) * Intel CI ? * Let's try non sharded gemma. * Snapshot rename * Apparently container can be gone already.	2024-10-10 16:51:57 +02:00
vb	d912f0bf55	Update documentation to most recent stable version of TGI. (#2625) Update to most recent stable version of TGI.	2024-10-10 16:00:25 +02:00

1112 Commits