text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-10-10 15:35:24 +00:00

Author	SHA1	Message	Date
Daniël de Kok	e9ba044250	flake: add fmt and clippy (#2389 )	2024-09-25 06:03:56 +00:00
Nicolas Patry	afa14b7595	Using HF_HOME instead of CACHE to get token read in addition to models. (#2288 )	2024-09-25 06:03:56 +00:00
Daniël de Kok	dc0fa60f55	Add experimental flake (#2384 ) Add flake.nix	2024-09-25 06:01:59 +00:00
Daniël de Kok	4a16da5d49	Add FlashInfer support (#2354 ) This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet. The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping: * A wrapper class needs to be constructed (which we just call state). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size. * Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call. When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call: * Set the context variable to the forward call's state. * Call `begin_forward` on the state. * Yield. * Call `end_forward` on the state. * Reset the context variable. We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.	2024-09-25 06:01:59 +00:00
drbh	6f2a468a64	Pr 2352 ci branch (#2382 ) * Fix unsigned integer underflow Passing --max-batch-size to the launcher actually had no effect because after a few requests the max_size passed to State::next_batch would underflow becoming a large positive number. In the scheduler, as soon as the cached batch size reached the max_batch_size the max_size passed to next_batch becomes 0. Since the only check in that function is ``` if Some(batch_requests.len()) == max_size { break; } ``` and it's called after the `batch_requests.len()` has become 1, it doesn't do anything to prevent more than 0 requests from being batched. Now we have a cached batch in the server that is larger than max_batch_size and `max_size - batch_size as usize` underflows. Signed-off-by: Max de Bayser <mbayser@br.ibm.com> * fix: update v3 scheduler and ensure max_batch_size > 0 --------- Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Max de Bayser <mbayser@br.ibm.com>	2024-09-25 06:01:59 +00:00
Vaibhav Srivastav	b1bc0ecb7f	Update Quantization docs and minor doc fix. (#2368 ) * Update Quantization docs and minor doc fix. * update readme with latest quants info * Apply suggestions from code review Co-authored-by: Pedro Cuenca <pedro@huggingface.co> * up --------- Co-authored-by: Pedro Cuenca <pedro@huggingface.co>	2024-09-25 06:01:59 +00:00
drbh	853fb96fec	fix: prefer hidden_activation over hidden_act in gemma2 (#2381 )	2024-09-25 05:55:39 +00:00
drbh	1057f28128	Pr 2337 ci branch (#2379 ) * hotfix: fix xpu crash brought by code refine. torch.xpu rely on import ipex Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * reable gemma2 in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix in regression in ipex flashattention Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:55:39 +00:00
Wang, Yi	3893d00927	fix EleutherAI/gpt-neox-20b does not work in tgi (#2346 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:55:39 +00:00
drbh	06b638f310	Pr 2374 ci branch (#2378 ) * Update __init__.py Fix issue with NoneType comparison for max_input_tokens and sliding_window - Add default values for max_input_tokens and sliding_window to handle None cases. - Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError. - This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'. * Update __init__.py Handle NoneType in sliding_window comparison to fix TypeError in __init__.py by ensuring the comparison logic accounts for NoneType values, preventing errors and improving code robustness. * fix: syntax/style tweak --------- Co-authored-by: Praz <prazanth2006@gmail.com>	2024-09-25 05:55:39 +00:00
drbh	9b1b545bb4	Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371 ) * Fix the bug * fix: run lints * fix: small syntax tweak --------- Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>	2024-09-25 05:55:39 +00:00
drbh	3ea8e8a2d5	add gptj modeling in TGI #2366 (CI RUN) (#2372 ) * add gptj modeling Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: update docs for model addition * fix: adjust syntax typo * fix: adjust syntax typo again --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:55:39 +00:00
almersawi	11fab8a20c	fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350 ) Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>	2024-09-25 05:55:39 +00:00
drbh	3ccde430d9	fix: prefer original layernorm names for 180B (#2365 )	2024-09-25 05:55:39 +00:00
drbh	db873be177	fix: default num_ln_in_parallel_attn to one if not supplied (#2364 )	2024-09-25 05:55:39 +00:00
drbh	5400c7155d	feat: return the generated text when parsing fails (#2353 )	2024-09-25 05:55:39 +00:00
drbh	b4562e1369	feat: prefer stop over eos_token to align with openai finish_reason (#2344 )	2024-09-25 05:55:39 +00:00
drbh	88e07f12cc	feat: implement a templated endpoint for visibility into chat requests (#2333 ) * feat: implement a templated endpoint for visibility into chat requests * feat: improve to tokenize too * fix: adjust return type * feat: simplify prepare_chat_input logic and adjust start stop chars	2024-09-25 05:55:39 +00:00
drbh	83d1f23fea	fix: return the out tensor rather than the function's return value (#2361 )	2024-09-25 05:55:39 +00:00
drbh	8b0f5feb02	feat: include local lora adapter loading docs (#2359 )	2024-09-25 05:55:39 +00:00
drbh	688321bcc4	fix: attempt forward on flash attn2 to check hardware support (#2335 ) * fix: attempt forward on flash attn2 to check hardware support * fix: warn window_size_left when using flash attn 1 * fix: prefer version check over test op and avoid window_size_left if not flash attn2 * fix: improve conditional and error message * fix: update sliding window conditional * fix: simplify changes and revert model changes * fix: avoid changing conditional * fix: typo tweak	2024-09-25 05:55:39 +00:00
Daniël de Kok	48fec7b198	Unify attention output handling (#2343 ) - Always return the hidden states. - Create the output tensor inside the `attention` and `paged_attention` functions. This removes the difference between how the output is handled between attention (output parameter) and paged attention (return value). This also removes the assumption that the attention implementation can write to an output tensor (in preparation of FlashInfer).	2024-09-25 05:55:39 +00:00
Daniël de Kok	ccddb30c02	Fix cache block size for flash decoding (#2351 ) * Fix cache block size for flash decoding This seems to have been accidentally dropped during the TRT-LLM PR rebase. * Also run CI on changes to `backends`	2024-09-25 05:55:39 +00:00
Wang, Yi	d70da59c25	enable HuggingFaceM4/idefics-9b in intel gpu (#2338 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:55:39 +00:00
Erik Kaunismäki	3c4f816ae3	refactor usage stats (#2339 ) * refactor usage stats * Update docs/source/usage_statistics.md Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * Update router/src/server.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * changes based on feedback * run python3 udpate_doc.py * fix pre-commit * Update router/src/server.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * delete option around usage stats arg --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-25 05:55:39 +00:00
drbh	c73d1d604f	Pr 2290 ci run (#2329 ) * MODEL_ID propagation fix * fix: remove global model id --------- Co-authored-by: root <root@tw031.pit.tensorwave.lan>	2024-09-25 05:55:39 +00:00
Daniël de Kok	468e5c6874	Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300 ) The `GPTWeightLoader` was structured like this in pseudocode: if marlin: Set up tensors in a way that GPTQ-Marlin expects else: Set up tensors in a way that ExLlama/GPTQ/AWQ expect However, the GPT-Marlin implementation details should really be in the `marlin` module. So move the former part out to a separate `GPTQMarlinWeightsLoader`.	2024-09-25 05:55:39 +00:00
Nicolas Patry	120d5773e8	Rebase TRT-llm (#2331 ) * wip wip refacto refacto Initial setup for CXX binding to TRTLLM Working FFI call for TGI and TRTLLM backend Remove unused parameters annd force tokenizer name to be set Overall build TRTLLM and deps through CMake build system Enable end to end CMake build First version loading engines and making it ready for inference Remembering to check how we can detect support for chunked context Move to latest TensorRT-LLM version Specify which default log level to use depending on CMake build type make leader executor mode working unconditionally call InitializeBackend on the FFI layer bind to CUDA::nvml to retrieve compute capabilities at runtime updated logic and comment to detect cuda compute capabilities implement the Stream method to send new tokens through a callback use spdlog release 1.14.1 moving forward update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c correctly tell cmake to build dependent tensorrt-llm required libraries create cmake install target to put everything relevant in installation folder add auth_token CLI argument to provide hf hub authentification token allow converting huggingface::tokenizers error to TensorRtLlmBackendError use correct include for spdlog include guard to build example in cmakelists working setup of the ffi layer remove fmt import use external fmt lib end to end ffi flow working make sure to track include/ffi.h to trigger rebuild from cargo impl the rust backend which currently cannot move the actual computation in background thread expose shutdown function at ffi layer impl RwLock scenario for TensorRtLllmBackend oops missing c++ backend definitions compute the number of maximum new tokens for each request independently make sure the context is not dropped in the middle of the async decoding. 
remove unnecessary log add all the necessary plumbery to return the generated content update invalid doc in cpp file correctly forward back the log probabilities remove unneeded scope variable for now refactor Stream impl for Generation to factorise code expose the internal missing start/queue timestamp forward tgi parameters rep/freq penalty add some more validation about grammar not supported define a shared struct to hold the result of a decoding step expose information about potential error happening while decoding remove logging add logging in case of decoding error make sure executor_worker is provided add initial Dockerfile for TRTLLM backend add some more information in CMakeLists.txt to correctly install executorWorker add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper simplify prebuilt trtllm libraries name definition do the same name definition stuff for tensorrt_llm_executor_static leverage pkg-config to probe libraries paths and reuse new install structure from cmake fix bad copy/past missing nvinfer linkage direction align all the linker search dependency add missing pkgconfig folder for MPI in Dockerfile correctly setup linking search path for runtime layer fix missing / before tgi lib path adding missing ld_library_path for cuda stubs in Dockerfile update tgi entrypoint commenting out Python part for TensorRT installation refactored docker image move to TensorRT-LLM v0.11.0 make docker linter happy with same capitalization rule fix typo refactor the compute capabilities detection along with num gpus update TensorRT-LLM to latest version update TensorRT install script to latest update build.rs to link to cuda 12.5 add missing dependant libraries for linking clean up a bit install to decoder_attention target add some custom stuff for nccl linkage fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time use std::env::const::ARCH make sure variable live long enough... 
look for cuda 12.5 add some more basic info in README.md * Rebase. * Fix autodocs. * Let's try to enable trtllm backend. * Ignore backends/v3 by default. * Fixing client. * Fix makefile + autodocs. * Updating the schema thing + redocly. * Fix trtllm lint. * Adding pb files ? * Remove cargo fmt temporarily. * ? * Tmp. * Remove both check + clippy ? * Backporting telemetry. * Backporting `457fb0a1` * Remove PB from git. * Fixing PB with default member backends/client * update TensorRT-LLM to latest version * provided None for api_key * link against libtensorrt_llm and not libtensorrt-llm --------- Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>	2024-09-25 05:55:39 +00:00
Daniël de Kok	247a29f77c	server quantize: store quantizer config in standard format (#2299 ) - Create `quantization_config` option in the model config. - Don't store the quantizer config in tensors anymore.	2024-09-25 05:50:17 +00:00
drbh	bafab73f76	fix: adjust test snapshots and small refactors (#2323 ) * fix: adjust test snapshots and small refactors * fix: revert non snapshot changes	2024-09-25 05:50:17 +00:00
Erik Kaunismäki	b1d1d26559	patch-error-on-invalid-grammar (#2282 ) * quick fix * allow silent failure * explicit todo that this is only short term	2024-09-25 05:50:17 +00:00
drbh	a574381cb4	fix: reject grammars without properties (#2309 )	2024-09-25 05:50:17 +00:00
Daniël de Kok	23a3927eb6	Install Marlin from standalone package (#2320 )	2024-09-25 05:50:17 +00:00
Erik Kaunismäki	2c1d280fae	Run ci api key (#2315 ) * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints. * change name to info routes * Fix comment * convert strings to lowercase for case insensitive comparison * convert header to string * fixes and update docs * update docs again * revert wrong update --------- Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>	2024-09-25 05:46:41 +00:00
drbh	a87791d7c9	feat: add ruff and resolve issue (#2262 ) * feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check	2024-09-25 05:46:24 +00:00
Daniël de Kok	fc6d80fdb8	Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313 )	2024-09-25 05:41:43 +00:00
Adrien	1674f441d0	Fix registry name (#2307 )	2024-09-25 05:41:43 +00:00
Nicolas Patry	d5e054342e	Fixing idefics on g6 tests. (#2306 )	2024-09-25 05:40:25 +00:00
Daniël de Kok	64ffd642fa	Some small fixes for the Torch 2.4.0 update (#2304 ) * Fix GPTQ autotune data type to be compatible with Torch 2.4.0 * Update poetry lock file * Fix small PaliGemma logprob differences after the torch update	2024-09-25 05:40:25 +00:00
Nicolas Patry	69db13e5e5	Using g6 instead of g5. (#2281 ) * Using g6 instead of g5. * Update the idefics2 snapshot.	2024-09-25 05:40:25 +00:00
drbh	7ebee37641	fix: refactor adapter weight loading and mapping (#2193 ) * fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: improve adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo	2024-09-25 05:39:58 +00:00
Daniël de Kok	457791f511	Split up `layers.marlin` into several files (#2292 ) The marlin.py file was getting large, split it up.	2024-09-25 05:39:58 +00:00
Wang, Yi	d93931567d	fix of use of unquantized weights in cohere GQA loading, also enable … (#2291 ) fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:39:58 +00:00
Wang, Yi	204142153f	fix crash in multi-modal (#2245 ) * fix crash in multi-modal Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update according to review comment Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix llava_next regression in latest main Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-25 05:39:58 +00:00
OlivierDehaene	a994f6aedd	hotfix: update nccl	2024-09-25 05:39:58 +00:00
OlivierDehaene	34c472bd64	chore: update to torch 2.4 (#2259 ) * chore: update to torch 2.4 * remove un-necessary patch * fix	2024-09-25 05:39:14 +00:00
Daniël de Kok	b1077b077c	hotfix: pin numpy (#2289 )	2024-09-25 05:38:48 +00:00
Daniël de Kok	43f49141fd	Add support for Llama 3 rotary embeddings (#2286 ) * Add support for Llama 3 rotary embeddings * Update transformers to 4.43	2024-09-25 05:38:48 +00:00
Nicolas Patry	5390973c09	Preparing for release. (#2285 ) * Preparing for release. * Updating docs. * Fixing token within the docker image for the launcher.	2024-09-25 05:38:48 +00:00
shaltielshmid	69b67b7add	Add support for Mistral-Nemo by supporting head_dim through config (#2254 ) * Support passing head_dim through config * Using `head_dim` as a fallback is necessary since it's a non standard key in mistralConfig (as defined in transformers). * Shorter diff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-25 05:31:31 +00:00

1062 Commits
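
Note on 6f2a468a64 (#2382): the underflow described in that commit message can be illustrated with a small model. This is a sketch under stated assumptions, not the scheduler's code: the unsigned subtraction is modelled as 64-bit wrapping arithmetic, and `wrapping_sub`, `next_batch_len` and `next_batch_len_guarded` are hypothetical helpers written for this note. Only the equality check quoted in the commit message is taken from the source.

```python
from typing import Optional

USIZE_MAX = (1 << 64) - 1  # assumption: a 64-bit unsigned size type


def wrapping_sub(a: int, b: int) -> int:
    """Model unsigned subtraction that wraps around instead of going negative."""
    return (a - b) & USIZE_MAX


def next_batch_len(queue_len: int, max_size: Optional[int]) -> int:
    """Model of the only size check: it runs after a request has been added."""
    batch = 0
    for _ in range(queue_len):
        batch += 1             # the batch length is already 1 at the first check
        if batch == max_size:  # mirrors `Some(batch_requests.len()) == max_size`
            break
    return batch


def next_batch_len_guarded(queue_len: int, max_size: Optional[int]) -> int:
    """Same loop, but a remaining capacity of 0 admits no requests."""
    if max_size == 0:
        return 0
    return next_batch_len(queue_len, max_size)


# With no capacity left, the equality check never fires and every queued
# request is batched anyway.
print(next_batch_len(5, 0))
# Once the cached batch exceeds the limit, the subtraction wraps.
print(wrapping_sub(4, 5) == USIZE_MAX)
```

In this model a limit of 0 lets all five queued requests through, and subtracting a batch size of 5 from a limit of 4 yields the largest representable value, which matches the "large positive number" the commit message reports.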