Commit Graph

1089 Commits

Author SHA1 Message Date
Nicolas Patry
df6ea89da9 Fixing exl2 and other quantize tests again. (#2419)
* Fixing exl2 and other quantize tests again.

* Mark exl2 as non-release (so CI tests them; needs to be removed later).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes, plus upgrade info in the warning.
2024-09-25 06:08:38 +00:00
Daniël de Kok
e5c39a5545 nix: build router incrementally (#2422) 2024-09-25 06:08:00 +00:00
Funtowicz Morgan
c3401e0b99 More fixes trtllm (#2342)
* (backend) use parking_lot crate for RwLock fairness

* (docker) let's put rust in the TRTLLM folder when building

* (docker) build ompi with SLURM support

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?
2024-09-25 06:08:00 +00:00
Nicolas Patry
4baa6ff59f Upgrading exl2. (#2415)
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
2024-09-25 06:07:40 +00:00
Daniël de Kok
bae161ab84 nix: partial incremental build of the router (#2416)
This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.
2024-09-25 06:06:17 +00:00
drbh
ffc8fb0850 fix: adds causal to attention params (#2408)
fix: adds `causal` to attention params so it can be checked when using flash attn v1
2024-09-25 06:06:17 +00:00
Wang, Yi
7a4d831d17 add numa to improve cpu inference perf (#2330)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:06:17 +00:00
Nicolas Patry
c5e4c1877b Adding more kernels to flake. (#2411) 2024-09-25 06:06:17 +00:00
Daniël de Kok
eb561bb715 nix: incremental build of the launcher (#2410) 2024-09-25 06:06:17 +00:00
drbh
10b2be6536 fix: include create_exllama_buffers and set_device for exllama (#2407) 2024-09-25 06:06:17 +00:00
drbh
1f8c0f83e3 Pr 2395 ci run (#2406)
* fix(router): Fix appending to message content

* feat: add message and chat template test

---------

Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-09-25 06:06:17 +00:00
Nicolas Patry
18d6be6af4 Updating the flake. (#2404) 2024-09-25 06:06:17 +00:00
drbh
96e8fa37b0 fix: improve completions to send a final chunk with usage details (#2336)
* fix: improve completions to send a final chunk with usage details

* fix: include finish reason string

* fix: remove dev debug trait and unneeded mut

* fix: update openapi schema
2024-09-25 06:06:17 +00:00
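The final usage chunk described in #2336 above can be illustrated with a hypothetical, OpenAI-style payload. The field names follow the OpenAI completions schema; the exact chunk TGI emits is defined by its OpenAPI schema and may differ.

```python
# Hypothetical final streaming chunk for a completion: it carries the
# finish reason string and the token usage details. Illustrative only;
# the exact TGI payload may differ.
final_chunk = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": "", "finish_reason": "stop"},
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 34,
        "total_tokens": 46,
    },
}
```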
drbh
3079865b60 fix: allocate tmp based on sgmv kernel if available (#2345)
* fix: allocate tmp based on sgmv kernel if available

* fix: re add copy build artifacts step for punica kernels
2024-09-25 06:06:17 +00:00
drbh
8e6bfa2fc5 feat: validate template variables before apply and improve sliding wi… (#2403)
* feat: validate template variables before apply and improve sliding window check

* fix: improve missing template var test
2024-09-25 06:05:43 +00:00
Nicolas Patry
6393cdee63 Keeping the benchmark somewhere (#2401)
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-09-25 06:05:43 +00:00
Daniël de Kok
f586cc7f0c Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.

For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
in this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen before. If a new prefill is a prefix of a previously seen
prefill, the router will send a request with `prefix_len>0`, which
the backend can use to decide to reuse KV blocks from the
cache rather than recomputing them (see the sketch after this entry).

Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
2024-09-25 06:05:08 +00:00
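A minimal sketch of the prefix-matching idea from #2392 above. This is not the router's `RadixAllocator` (which is Rust, block-based, and handles eviction); it is a simplified, hypothetical token-level trie that only shows how a new prefill can be matched against previously seen ones to obtain a `prefix_len`.

```python
# Minimal sketch of prefix matching over token IDs. Hypothetical
# illustration only; the real RadixAllocator tracks KV blocks and
# eviction, which is omitted here.
from typing import Dict, List


class PrefixTrie:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrie"] = {}

    def insert(self, tokens: List[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrie())

    def longest_prefix(self, tokens: List[int]) -> int:
        """Return the length of the longest previously seen prefix."""
        node, length = self, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            length += 1
        return length


trie = PrefixTrie()
trie.insert([1, 2, 3, 4, 5])                    # a prefill seen earlier
prefix_len = trie.longest_prefix([1, 2, 3, 9])  # -> 3
# A prefix_len > 0 lets the backend reuse cached KV blocks for the first
# prefix_len tokens instead of recomputing them.
```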
Wang, Yi
b8efd6d00c CPU docker image (#2367)
add intel-cpu docker image

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 06:05:08 +00:00
Nicolas Patry
1daaddd072 Fixing import exl2 (#2399) 2024-09-25 06:04:51 +00:00
Nicolas Patry
fbe59c6267 Adding launcher to build. (#2397) 2024-09-25 06:04:51 +00:00
Nicolas Patry
8750dc878e Upgrade fbgemm (#2398)
* Upgrade fbgemm

* Fix fbgemm version
2024-09-25 06:04:51 +00:00
Daniël de Kok
197dd3af12 nix: add router to the devshell (#2396) 2024-09-25 06:04:51 +00:00
Daniël de Kok
bb833389e0 Update flake for 9.0a capability in Torch (#2394) 2024-09-25 06:04:51 +00:00
drbh
959add5e9b feat: add guideline to chat request and template (#2391)
* feat: add guideline to chat request and template

* fix: add template test and update docs
2024-09-25 06:04:51 +00:00
Nicolas Patry
849bd93dc3 Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385)
* Using an enum for flash backends (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.
2024-09-25 06:04:51 +00:00
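A hedged sketch of the enum approach in #2385 above: replacing ad-hoc string or boolean flags with a single enum and failing early on unknown values. The names below (`AttentionBackend`, `select_backend`, `ATTENTION_BACKEND`) are placeholders, not TGI's actual identifiers.

```python
# Hypothetical sketch of selecting the attention backend via an enum;
# names and the environment variable are illustrative, not TGI's API.
import os
from enum import Enum


class AttentionBackend(Enum):
    PAGED = "paged"
    FLASHDECODING = "flashdecoding"
    FLASHINFER = "flashinfer"


def select_backend() -> AttentionBackend:
    # Early exit with a clear error instead of silently falling back.
    raw = os.environ.get("ATTENTION_BACKEND", "paged")
    try:
        return AttentionBackend(raw)
    except ValueError as err:
        raise ValueError(
            f"Unknown attention backend {raw!r}; "
            f"expected one of {[b.value for b in AttentionBackend]}"
        ) from err


BACKEND = select_backend()
```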
Daniël de Kok
df719fd527 flake: use rust-overlay (#2390) 2024-09-25 06:04:51 +00:00
Vaibhav Srivastav
1d4a35a23c Update documentation for Supported models (#2386)
* Minor doc fixes

* up.

* Other minor updates.
2024-09-25 06:04:51 +00:00
Daniël de Kok
e9ba044250 flake: add fmt and clippy (#2389) 2024-09-25 06:03:56 +00:00
Nicolas Patry
afa14b7595 Using HF_HOME instead of CACHE to get token read in addition to models. (#2288) 2024-09-25 06:03:56 +00:00
Daniël de Kok
dc0fa60f55 Add experimental flake (#2384)
Add flake.nix
2024-09-25 06:01:59 +00:00
Daniël de Kok
4a16da5d49 Add FlashInfer support (#2354)
This change adds support for FlashInfer. FlashInfer can be enabled using
`FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
Since this functionality is currently only for testing, FlashInfer is
not installed anywhere yet.

The FlashInfer API is quite different from FlashAttention/vLLM in that
it requires more global bookkeeping:

* A wrapper class needs to be constructed (which we just call *state*).
  Since this is fairly expensive (due to pinned host memory allocation),
  we only do this once in a FlashCausalLM instance or for each CUDA
  Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and
  `end_forward`. This sets up data structures that can be reused for all
  calls to attention for that forward call.

When calling attention, we need access to the state object. To avoid
passing an argument down the call chain (which would require changes to
all models), we use a context variable.

Each model forward call is wrapped using a context manager that does all
the bookkeeping for such a call:

* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.

We cannot use a single shared global variable for this, since e.g. CUDA
Graphs of different sizes each have their own state.
2024-09-25 06:01:59 +00:00
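The bookkeeping pattern described in #2354 above (a per-forward `begin_forward`/`end_forward` pair around a state object exposed through a context variable) can be sketched roughly as follows. The class and function names are placeholders, not the actual TGI or FlashInfer API.

```python
# Rough sketch: wrap a model forward call with begin/end bookkeeping and
# expose the state through a context variable so attention code can reach
# it without threading an argument through every model. Names are placeholders.
from contextlib import contextmanager
from contextvars import ContextVar

_forward_state: ContextVar = ContextVar("forward_state", default=None)


class ForwardState:
    """Stand-in for a FlashInfer wrapper; expensive to build, so reused."""

    def begin_forward(self, batch_metadata) -> None:
        # Set up per-forward data structures reused by all attention calls.
        self.metadata = batch_metadata

    def end_forward(self) -> None:
        self.metadata = None


@contextmanager
def use_forward_state(state: ForwardState, batch_metadata):
    token = _forward_state.set(state)    # 1. set the context variable
    state.begin_forward(batch_metadata)  # 2. begin_forward on the state
    try:
        yield                            # 3. run the model forward call
    finally:
        state.end_forward()              # 4. end_forward on the state
        _forward_state.reset(token)      # 5. reset the context variable


def attention(*args, **kwargs):
    state = _forward_state.get()  # attention code reads the current state
    ...
```

Because each CUDA graph size installs its own state before running, a single shared global would not work; the context manager makes the per-call state explicit.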
drbh
6f2a468a64 Pr 2352 ci branch (#2382)
* Fix unsigned integer underflow

Passing --max-batch-size to the launcher actually had no effect
because after a few requests the max_size passed to State::next_batch
would underflow, becoming a large positive number.

In the scheduler, as soon as the cached batch size reached the
max_batch_size the max_size passed to next_batch becomes 0.
Since the only check in that function is
```
if Some(batch_requests.len()) == max_size {
    break;
}
```
and it is only evaluated after `batch_requests.len()` has
already become 1, it never prevents more than zero
requests from being batched.

Now we have a cached batch in the server that is larger than
max_batch_size, and `max_size - batch_size as usize`
underflows (see the clamping sketch after this entry).

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

* fix: update v3 scheduler and ensure max_batch_size > 0

---------

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
2024-09-25 06:01:59 +00:00
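A tiny sketch of the clamping idea behind the fix above. The real scheduler is Rust and uses unsigned arithmetic; this hypothetical helper only shows why the remaining budget must be clamped at zero rather than allowed to wrap around.

```python
# Hypothetical illustration: compute how many more requests may join a batch.
# In Rust, `max_batch_size - cached_batch_size` on unsigned integers wraps
# around to a huge value; clamping (saturating subtraction) avoids that.
from typing import Optional


def remaining_batch_budget(
    max_batch_size: Optional[int], cached_batch_size: int
) -> Optional[int]:
    if max_batch_size is None:
        return None  # no limit configured
    return max(max_batch_size - cached_batch_size, 0)


assert remaining_batch_budget(4, 3) == 1
assert remaining_batch_budget(4, 5) == 0   # full: admit nothing, don't wrap
assert remaining_batch_budget(None, 7) is None
```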
Vaibhav Srivastav
b1bc0ecb7f Update Quantization docs and minor doc fix. (#2368)
* Update Quantization docs and minor doc fix.

* update readme with latest quants info

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* up

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2024-09-25 06:01:59 +00:00
drbh
853fb96fec fix: prefer hidden_activation over hidden_act in gemma2 (#2381) 2024-09-25 05:55:39 +00:00
drbh
1057f28128 Pr 2337 ci branch (#2379)
* hotfix: fix xpu crash brought in by the code refactor; torch.xpu relies on importing ipex

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* re-enable gemma2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix a regression in ipex flash attention

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
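The "torch.xpu relies on importing ipex" note above refers to Intel Extension for PyTorch registering the XPU backend when it is imported. A hedged sketch of a guarded import (exact package and version behavior may differ from this setup):

```python
# Hedged sketch: make sure intel_extension_for_pytorch is imported before
# touching torch.xpu, since (in this setup) IPEX registers the XPU backend.
import torch

try:
    import intel_extension_for_pytorch as ipex  # noqa: F401
except ImportError:
    ipex = None

if ipex is not None and hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")
else:
    device = torch.device("cpu")
```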
Wang, Yi
3893d00927 fix EleutherAI/gpt-neox-20b does not work in tgi (#2346)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
drbh
06b638f310 Pr 2374 ci branch (#2378)
* Update __init__.py

Fix issue with NoneType comparison for max_input_tokens and sliding_window

- Add default values for max_input_tokens and sliding_window to handle None cases.
- Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError.
- This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'.

* Update __init__.py

Handle NoneType in sliding_window comparison to fix TypeError in __init__.py by ensuring the comparison logic accounts for NoneType values, preventing errors and improving code robustness.

* fix: syntax/style tweak

---------

Co-authored-by: Praz <prazanth2006@gmail.com>
2024-09-25 05:55:39 +00:00
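A minimal sketch of the None-safe comparison described in #2378 above. The parameter names mirror the commit description, but the actual check in `__init__.py` may be structured differently.

```python
# Hypothetical sketch: avoid `TypeError: '<=' not supported between
# instances of 'int' and 'NoneType'` by treating None as "no sliding window"
# (or "no input-token limit") instead of comparing it directly.
from typing import Optional


def sliding_window_limits_input(
    max_input_tokens: Optional[int], sliding_window: Optional[int]
) -> bool:
    if sliding_window is None or max_input_tokens is None:
        return False  # nothing to compare; no restriction applies
    return sliding_window <= max_input_tokens
```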
drbh
9b1b545bb4 Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371)
* Fix the bug

* fix: run lints

* fix: small syntax tweak

---------

Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
2024-09-25 05:55:39 +00:00
drbh
3ea8e8a2d5 add gptj modeling in TGI #2366 (CI RUN) (#2372)
* add gptj modeling

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: update docs for model addition

* fix: adjust syntax typo

* fix: adjust syntax typo again

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-25 05:55:39 +00:00
almersawi
11fab8a20c fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350)
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
2024-09-25 05:55:39 +00:00
drbh
3ccde430d9 fix: prefer original layernorm names for 180B (#2365) 2024-09-25 05:55:39 +00:00
drbh
db873be177 fix: default num_ln_in_parallel_attn to one if not supplied (#2364) 2024-09-25 05:55:39 +00:00
drbh
5400c7155d feat: return the generated text when parsing fails (#2353) 2024-09-25 05:55:39 +00:00
drbh
b4562e1369 feat: prefer stop over eos_token to align with openai finish_reason (#2344) 2024-09-25 05:55:39 +00:00
drbh
88e07f12cc feat: implement a templated endpoint for visibility into chat requests (#2333)
* feat: implement a templated endpoint for visibility into chat requests

* feat: improve to tokenize too

* fix: adjust return type

* feat: simplify prepare_chat_input logic and adjust start stop chars
2024-09-25 05:55:39 +00:00
drbh
83d1f23fea fix: return the out tensor rather than the function's return value (#2361) 2024-09-25 05:55:39 +00:00
drbh
8b0f5feb02 feat: include local lora adapter loading docs (#2359) 2024-09-25 05:55:39 +00:00
drbh
688321bcc4 fix: attempt forward on flash attn2 to check hardware support (#2335)
* fix: attempt forward on flash attn2 to check hardware support

* fix: warn window_size_left when using flash attn 1

* fix: prefer version check over test op and avoid window_size_left if not flash attn2

* fix: improve conditional and error message

* fix: update sliding window conditional

* fix: simplify changes and revert model changes

* fix: avoid changing conditional

* fix: typo tweak
2024-09-25 05:55:39 +00:00
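A rough sketch of the probe idea from #2335 above: attempt a tiny attention call and fall back to another attention path if the kernel rejects the hardware (the commit later prefers a version check over the test op). The `flash_attn_func` import is the public flash-attn v2 entry point; the shapes and fallback logic here are illustrative.

```python
# Rough sketch (not TGI's exact check): probe whether the installed
# flash-attn v2 kernel supports the current GPU by attempting a tiny
# forward pass, and fall back if it fails.
import torch


def flash_attn_v2_supported() -> bool:
    if not torch.cuda.is_available():
        return False
    try:
        from flash_attn import flash_attn_func  # flash-attn v2 API
        # (batch, seqlen, nheads, headdim) in fp16 on the GPU.
        q = torch.zeros(1, 1, 1, 64, dtype=torch.float16, device="cuda")
        flash_attn_func(q, q, q, causal=True)
        return True
    except Exception:
        # ImportError, unsupported compute capability, etc.
        return False

# Sliding windows (window_size_left) are only honored by flash-attn v2, so a
# caller falling back to v1 would also warn about and ignore that parameter.
```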
Daniël de Kok
48fec7b198 Unify attention output handling (#2343)
- Always return the hidden states.
- Create the output tensor inside the `attention` and `paged_attention`
  functions.

This removes the difference between how the output is handled between
attention (output parameter) and paged attention (return value). This
also removes the assumption that the attention implementation can
write to an output tensor (in preparation of FlashInfer).
2024-09-25 05:55:39 +00:00
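A minimal sketch of the unified output handling described in #2343 above: the attention helpers allocate and return their own output tensor instead of writing into a caller-provided buffer. Names and shapes are illustrative only.

```python
# Illustrative only: both helpers create the output tensor themselves and
# return it, so callers no longer pass a preallocated `out` parameter.
import torch


def attention(query, key, value):
    out = torch.empty_like(query)  # output allocated inside the function
    # ... run the attention kernel, writing into `out` ...
    return out


def paged_attention(query, key_cache, value_cache, block_tables):
    out = torch.empty_like(query)  # same convention as `attention`
    # ... run the paged attention kernel, writing into `out` ...
    return out
```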
Daniël de Kok
ccddb30c02 Fix cache block size for flash decoding (#2351)
* Fix cache block size for flash decoding

This seems to have been accidentally dropped during the TRT-LLM
PR rebase.

* Also run CI on changes to `backends`
2024-09-25 05:55:39 +00:00