text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-12 02:40:16 +00:00

Author	SHA1	Message	Date
drbh	34a3d09bb8	Merge `e721574729` into `0b28aabb94`	2025-04-08 16:25:23 +08:00
Mohit Sharma	d9bb9bebc9	Add llama4 (#3145 ) * initial changes * Add support for other vlm * cleanup comment * Improve attn_implementation * Add comments for support of models * add model * add model * fixes and improvements * update docker * Add cache position * Add tests * remove redundant changes * remove tr version * Upgrade doc + fix linting. * Fixing the CI. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-04-06 10:20:22 +02:00
drbh	e721574729	fix: update test for tool_call_id in Message	2025-03-21 11:15:32 -04:00
drbh	af78f46c3d	feat: align function id with tool call response	2025-03-21 11:15:32 -04:00
Mohit Sharma	ed46c2c414	Add gemma3 model (#3099 )	2025-03-12 09:25:51 +01:00
Nicolas Patry	f74c36fe0d	Fix tool call3 (#3086 ) * Fixing the tool calling convention. * Update tehe doc. * Fixing some corner cases. * Fixing the tool call id. * Fmt. * Snapshot update with the new updated tool_call_id. * More qwen2.	2025-03-12 09:22:53 +01:00
drbh	dc5f05f8e6	Pr 3003 ci branch (#3007 ) * change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API" Moving after tool_calls2 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> add in Buffering.. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> fix: handle usage outside of stream state and add tests Simplifying everything quite a bit. Remove the unused model_dump. Clippy. Clippy ? Ruff. Uppgrade the flake for latest transformers. Upgrade after rebase. Remove potential footgun. Fix completion test. * Clippy. * Tweak for multi prompt. * Ruff. * Update the snapshot a bit. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-03-10 17:56:19 +01:00
Nicolas Patry	622908deab	Fix tool call2 (#3076 ) * Making `tool_calls` a vector. * Arguments output is a string. * Update all the integration tests. * Add the requirements. * Upgrade other tests. * Clippy. * Update the old test.	2025-03-07 19:45:57 +01:00
Nicolas Patry	8e92942a18	Making `tool_calls` a vector. (#3075 ) * Making `tool_calls` a vector. * Update doc. * Fixing the nix overlay with updated version. * Add openai dependency. * Updating the old tests. * Trying to reduce the logs in the case of errors. * Less spammy logs too.	2025-03-05 22:32:31 +01:00
Nicolas Patry	491ed9e11d	Patch rust release. (#3069 ) * Patch rust release. * Trying to remove the rust-toolchain hardcoded in action. * Upgrade rust toolchain. * Put back the toolchain ? * Fix neuron dockerfile. * Move to the proper version of Rust. * 1.85 since the GH action doesn't respect the override. * Typo. * Fixing the github action. * Fixing docker llamacpp. * Fixing the github action. * Update clippy.	2025-03-04 18:07:33 +01:00
drbh	1cae3197c4	Improve tool call message processing (#3036 ) * make content field optional in chat request * add tool_calls field to Message struct * feat: add test and serialize tool messages * fix: bump utopia, openapi doc version and improve test * fix: rerun update docs * fix: suppoer tool call id in template and remove unnecessary changes * fix: ruff lint remove unused import * fix: adjust message types in tests --------- Co-authored-by: sailesh duddupudi <saileshradar@gmail.com>	2025-02-21 10:30:29 +01:00
Alvaro Bartolome	6ab02931cf	Set `alias` for `max_completion_tokens` in `ChatRequest` (#2932 )	2025-01-23 14:18:47 +01:00
Nicolas Patry	203cade244	Upgrading our rustc version. (#2908 ) * Upgrading our rustc version. * Fixing the rust tests to proper version. * Clippy everything.	2025-01-15 17:04:03 +01:00
drbh	da5ab46705	Improve vlm support (add idefics3 support) (#2437 ) * feat: expand vlm support and add image token logic and tests * fix: avoid unused perceiver config * feat: integrate image tokens into inputs embeds * feat: add simple idefics3 test * feat: update docs, image token logic and weight names * fix: improve image processing * feat: improve prefix for idefics3 * fix: bump idefics3 tests and snapshots * fix: improve text model loading * feat: consolidate changes with existing vlms and add support and test for smolvlm * fix: create new idefic3 file, simplify logic and adjust llama weight loading * fix: lint with ruff * fix: clean up idefics 3 and improve prefix handling * fix: improve typing * fix: improve prompt_split_image with ref to original impl * fix: adjust ruff lints and small refactors * fix: adjust FlashLlamaModel prefix logic	2025-01-09 10:35:32 -05:00
Nicolas Patry	6f0b8c947d	New arg. (#2845 )	2024-12-16 10:34:50 +01:00
Nicolas Patry	5df8059037	Auto max prefill (#2797 ) * Attempt at automatic max batch prefill. * Taking into account number of shards. * Adding more cards. * Adding A100 + H100 * Adding a few more cards. * Logprobs cost too much. * h100 better name, and keep factor of 2 * Damn inflated sparse tflops. * Typo in h100. * Updated the flops calculation (checked with fvcore). * chunking by default. * Fix prefix caching for chat completion since we removed logprobs. * More tests. * Dropping all the prefill logprobs. * Add a flag that enables users to get logprobs back. * Repairing prompt token counting. * Fixing a few tests. * Remove some scaffolding. * Attempting to reduces the issues (workarounds for now).	2024-12-06 05:52:00 +01:00
OlivierDehaene	8c3669b287	feat: auto max_new_tokens (#2803 ) * feat: auto max_new_tokens * update default * Fixing the tests. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-12-06 05:50:35 +01:00
Lucain	d012f229c6	Remove guideline from API (#2762 )	2024-11-21 16:56:38 +00:00
drbh	5489406c4a	PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme (#2645 ) * add OpenAI like tool_choice for named choice * add tests * fix: run linter and bump api docs * fix: consolidate changes and remove old tool type * feat: improve, simplify and rename tool choice struct add required support and refactor * fix: simplify tool choice logic, improve tests, openapi and rust docs * fix: refactor away prepare_chat_input and improve tool grammar apply control flow * feat: update docs and add tool choice configuration section * fix: simplify naming, tool choice default and improve test * fix: adjust tool choice none logic, add test and small refactors * fix: add missing snapshot file * fix: adjust tool choice type in test * fix: adjust default when json tool choice is * fix: remove trailing space lint after rebase * fix: remove mostly mocked unit test --------- Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>	2024-11-19 13:31:59 -05:00
Daniël de Kok	52e48739a5	Remove vLLM dependency for CUDA (#2751 ) * Remove vLLM dependency for CUDA This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA. Tested run (since we don't have paged attention in CI): ``` ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release [...] 5 snapshots passed. ``` * Fix clippy warning	2024-11-17 17:34:50 +01:00
Wang, Yi	97f7a22f0b	add trust_remote_code in tokenizer to fix baichuan issue (#2725 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-07 14:43:38 +01:00
Nicolas Patry	90b226db29	We can have a tokenizer anywhere. (#2527 ) * We can have a tokenizer anywhere. * Handling potential lack of offsets (python tokenizer) * Remove redundancy. * Fixing the tests. * Flake.lock update ? * Fixing the GIL locking. * Fixing mamba by using the transformers version. * Adding the legacy handle. * Ellide lifetime. * Lint. * Deprecation message. * Fixing bad rebase.	2024-10-28 05:00:24 +01:00
OlivierDehaene	41c2623735	feat: allow any supported payload on /invocations (#2683 ) * feat: allow any supported payload on /invocations * update openAPI * update doc	2024-10-23 11:26:01 +00:00
OlivierDehaene	a6a0c97ed9	feat: prefill chunking (#2600 ) * wip * rollback * refactor to use prefix/postfix namming + fix all_input_ids_tensor * maybe patching vlms? * fix filter and concat * wip, no filter, no concat * current * add prepare_for_prefill * working * load tested * re-create slots * re-create slots * fix slot_filtering_indices * feedback loop * remove log * fix benchmarker * fix vlm and seq2seq * rename to cache and input lengths * fix prefill logprobs * fix launcher * fix logprobs? * idk at this point * max input length * omfg * remove debugging lines * fix tests * fix mllama * fix cargo tests * remove support chunking for paged * Fixing non blocked attentions * Fixing dtype + AMD, Ipex targets. * lint fix. * rename * Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). * Add simple resolution when user specifies ATTENTION=paged. * Put back non default simple tests. * Fix env name --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-16 12:49:33 +02:00
drbh	8ad20daf33	CI (2599): Update ToolType input schema (#2601 ) * Update ToolType input schema * lint * fix: run formatter * fix: allow tool choide to be null --------- Co-authored-by: Wauplin <lucainp@gmail.com>	2024-10-08 12:35:48 -04:00
Nicolas Patry	c032280b17	Cleanup Vertex + Chat (#2553 ) * Cleanup Vertex + Chat * logprobs defaults to false. * Parameters are optional * Fix docs. * Changing back this logprobs default. * Fixup doc. * Let's debug that. * Not unstable. * Updating Cargo ? * Wat? * Dummy change. * Trying some other install. * Trying smething. * Revert everything. * Update Cargo lock. * Fixing the pre-commit after rebase.	2024-09-24 23:37:17 +02:00
Nicolas Patry	f512021e77	Stream options. (#2533 ) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-19 20:50:37 +02:00
drbh	47d7e34458	fix: enable chat requests in vertex endpoint (#2481 ) * fix: enable chat requests in vertex endpoint * feat: avoid unwrap and pre allocate future vec	2024-09-02 10:00:52 -04:00
drbh	d5202c46f7	feat: add /v1/models endpoint (#2433 ) * feat: add /v1/models endpoint * feat: add /v1/models endpoint * fix: remove unused type import * fix: revert route typo * fix: update docs with new endpoint * fix: add to redocly ignore and lint	2024-08-29 16:32:38 +02:00
Nicolas Patry	e415b690a6	Lots of improvements (Still 2 allocators) (#2449 ) * Making prefix/flashinfer the default and testing the full release tests. * Include flashinfer in the docker. * Using prebuilt. * Allowing window_left_size (dummy version). * Disabling flashinfer/prefix caching on odd head_dim * Disable prefix caching for lora. * More specific codes. * Update lock * Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere. * Update cargo lock ? * Upgrade to 1.80 because of bitstream... * Everywhere 1.80 * Forgot last default place. * Apply suggestions from code review Co-authored-by: drbh <david.richard.holtz@gmail.com> * Updated flake lock * Tmp * Upgrade resolution system for less errors in resolution. * Remove lambda for cleaner function. * Handling debugger. * OVerride the env in server tests. * Is this enough to make it work ? * This seems to be working. * Downgrade some logs. * Fixing the default for vlm. * Don't enable prefix caching on VLM just yet. * Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now) * Fixing prefix caching for flashdecoding. * Update all models. * Fixed flashinfer version. * add_special_tokens is internal only * Fixing seqlen with the new vlms. * Fixing the issue with `add_special_tokens` not being passed around. * Fixing the test. * Removing encoder_decoder (seq2seq). * Update the chat test. * Fixing the batching tokenization in flash causal lm. * Truncating left for radix purposes. * Oops this doesn't belong here. * Put back default pure shell. * Update server tests - Default to throughput test in k6 - Use TGI_WIGGLE_ROOM to adjust wiggle room * Only n_heads / process_group.size() are necessary. * Revert the integrationt tests change (seem linked to head_size modification). * Adding error message when assert is violated. 
* Fixing the free algorithm to handle times where the common prefix is smaller. * Apply suggestions from code review Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Update server/text_generation_server/layers/attention/common.py Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Fix disabling prefix caching - Fix windowing checks. * Revert the Cohere tokenizer change (for now using a revision instead). * Fmt. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2024-08-29 16:29:01 +02:00
drbh	cfa73b5c99	Pr 2451 ci branch (#2454 ) * fix[router]: Fix tools not passed in chat template Signed-off-by: GitHub <noreply@github.com> * feat: improve default tool serialization and lints * feat: refactor tool logic to include notify_error in prompt and adjust typing * fix: adjust non tool template apply * fix: simplify tool grammar logic and improve schema * feat: avoid skip tool test and avoid empty tool prompts * fix: increase test client timeout for grammar compilation tests --------- Signed-off-by: GitHub <noreply@github.com> Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>	2024-08-26 20:19:38 -04:00
drbh	9a7830bd28	Pr 2395 ci run (#2406 ) * fix(router): Fix appending to message content * feat: add message and chat template test --------- Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>	2024-08-12 14:38:59 -04:00
drbh	30395b09f4	fix: improve completions to send a final chunk with usage details (#2336 ) * fix: improve completions to send a final chunk with usage details * fix: include finish reason string * fix: remove dev debug trait and unneeded mut * fix: update openapi schema	2024-08-12 17:26:11 +02:00
drbh	0d06aed02d	feat: add guideline to chat request and template (#2391 ) * feat: add guideline to chat request and template * fix: add template test and update docs	2024-08-09 10:56:45 -04:00
Nicolas Patry	7a48a84784	Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backens (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-08-09 16:41:17 +02:00
drbh	f8a5b381fe	feat: prefer stop over eos_token to align with openai finish_reason (#2344 )	2024-08-06 13:09:50 -04:00
drbh	e11f5f1c38	feat: implement a templated endpoint for visibility into chat requests (#2333 ) * feat: implement a templated endpoint for visibility into chat requests * feat: improve to tokenize too * fix: adjust return type * feat: simplify prepare_chat_input logic and adjust start stop chars	2024-08-06 13:51:32 +02:00
Nicolas Patry	2b19d671b4	Rebase TRT-llm (#2331 ) * wip wip refacto refacto Initial setup for CXX binding to TRTLLM Working FFI call for TGI and TRTLLM backend Remove unused parameters annd force tokenizer name to be set Overall build TRTLLM and deps through CMake build system Enable end to end CMake build First version loading engines and making it ready for inference Remembering to check how we can detect support for chunked context Move to latest TensorRT-LLM version Specify which default log level to use depending on CMake build type make leader executor mode working unconditionally call InitializeBackend on the FFI layer bind to CUDA::nvml to retrieve compute capabilities at runtime updated logic and comment to detect cuda compute capabilities implement the Stream method to send new tokens through a callback use spdlog release 1.14.1 moving forward update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c correctly tell cmake to build dependent tensorrt-llm required libraries create cmake install target to put everything relevant in installation folder add auth_token CLI argument to provide hf hub authentification token allow converting huggingface::tokenizers error to TensorRtLlmBackendError use correct include for spdlog include guard to build example in cmakelists working setup of the ffi layer remove fmt import use external fmt lib end to end ffi flow working make sure to track include/ffi.h to trigger rebuild from cargo impl the rust backend which currently cannot move the actual computation in background thread expose shutdown function at ffi layer impl RwLock scenario for TensorRtLllmBackend oops missing c++ backend definitions compute the number of maximum new tokens for each request independently make sure the context is not dropped in the middle of the async decoding. 
remove unnecessary log add all the necessary plumbery to return the generated content update invalid doc in cpp file correctly forward back the log probabilities remove unneeded scope variable for now refactor Stream impl for Generation to factorise code expose the internal missing start/queue timestamp forward tgi parameters rep/freq penalty add some more validation about grammar not supported define a shared struct to hold the result of a decoding step expose information about potential error happening while decoding remove logging add logging in case of decoding error make sure executor_worker is provided add initial Dockerfile for TRTLLM backend add some more information in CMakeLists.txt to correctly install executorWorker add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper simplify prebuilt trtllm libraries name definition do the same name definition stuff for tensorrt_llm_executor_static leverage pkg-config to probe libraries paths and reuse new install structure from cmake fix bad copy/past missing nvinfer linkage direction align all the linker search dependency add missing pkgconfig folder for MPI in Dockerfile correctly setup linking search path for runtime layer fix missing / before tgi lib path adding missing ld_library_path for cuda stubs in Dockerfile update tgi entrypoint commenting out Python part for TensorRT installation refactored docker image move to TensorRT-LLM v0.11.0 make docker linter happy with same capitalization rule fix typo refactor the compute capabilities detection along with num gpus update TensorRT-LLM to latest version update TensorRT install script to latest update build.rs to link to cuda 12.5 add missing dependant libraries for linking clean up a bit install to decoder_attention target add some custom stuff for nccl linkage fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time use std::env::const::ARCH make sure variable live long enough... 
look for cuda 12.5 add some more basic info in README.md * Rebase. * Fix autodocs. * Let's try to enable trtllm backend. * Ignore backends/v3 by default. * Fixing client. * Fix makefile + autodocs. * Updating the schema thing + redocly. * Fix trtllm lint. * Adding pb files ? * Remove cargo fmt temporarily. * ? * Tmp. * Remove both check + clippy ? * Backporting telemetry. * Backporting `457fb0a1` * Remove PB from git. * Fixing PB with default member backends/client * update TensorRT-LLM to latest version * provided None for api_key * link against libtensorrt_llm and not libtensorrt-llm --------- Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>	2024-07-31 10:33:10 +02:00
drbh	68a9685f1b	fix: adjust default tool choice (#2244 ) * fix: adjust default tool choice * feat: improve tool choice syntax and response parsing/errors * fix: remove dev tests * feat: add ToolChoice to docs	2024-07-19 11:12:02 -04:00
Erik Kaunismäki	4c19593a90	usage stats and crash reports (#2220 ) * draft of usage stats * fix wrong link * launcher doesn't need sysinfo dep * only tokenizer class instead of hole struct * unused import * fix clippy errors * update openAPI doc * cargo fmt * fix error in passing flags to router * try again to update docs * run pre-commit locally * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * on crash use anonymous error event * delete json_output and ngrok * more robust way of checking if is in container * more robust nvidia smi * parse xpu more robustly * fix errors * add nvidia-smi details in docs * cargo fmt * fix clippy * should make docs check pass * Update router/src/usage_stats.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * error reason can't be in nested json * cargo fmt --------- Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>	2024-07-19 16:17:56 +02:00
Nicolas Patry	4c976fb406	Updating the self check (#2209 ) * Updating the self check * Fix. * Revert the CLI . * cli. * Space. * Revert cargo update.	2024-07-09 17:23:48 +02:00
drbh	87ebb6477b	feat: use model name as adapter id in chat endpoints (#2128 )	2024-07-08 16:06:49 +02:00
Nicolas Patry	5ad41aa2a6	Fixing missing `object` field for regular completions. (#2175 ) * Fixing missing `object` field for regular completions. * Fixing docs by re-adding missing `Prompt`.	2024-07-03 12:56:27 +02:00
Nicolas Patry	be4a4c47f9	Revert "Fixing missing `object` field for regular completions." This reverts commit `2bbb7fa4b2`.	2024-07-03 10:41:39 +00:00
Nicolas Patry	2bbb7fa4b2	Fixing missing `object` field for regular completions.	2024-07-03 10:40:22 +00:00
drbh	9eefb2f672	fix: prefer serde structs over custom functions (#2127 ) * fix: prefer enum for chat object * fix: adjust typo * fix: enum CompletionType not ObjectType * fix: adjust typo * feat: leverage serde for conditional deser * fix: adjust HubTokenizerConfig after rebase * fix: update create_post_processor logic for token type * fix: adjust unwrap syntax in template * Fixing the post processor. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-01 15:08:05 +02:00
Nicolas Patry	0e4ab6d31c	Fixing malformed rust tokenizers (#2134 ) * Fixing malformed rust tokenizers * Fix for deepseek too.	2024-06-27 16:04:03 +02:00
Daniël de Kok	dd2d91b043	Idefics2: sync added image tokens with transformers (#2080 ) Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.	2024-06-27 15:54:35 +02:00
drbh	04e1af94d7	Enable multiple LoRa adapters (#2010 ) * feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: perfer loraxs custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by: Derek <datavistics@gmail.com>	2024-06-25 14:46:27 -04:00
sunxichen	b69f078041	fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089 ) Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>	2024-06-25 10:59:50 +02:00

129 Commits