Commit Graph

158 Commits

Author SHA1 Message Date
Mohit Sharma
87a0af4ec2
Update transformers to 4.51 (#3148)
* update transformers

* Upgrading the nix deps too.

* Forcing torchvision to be in there.

* Fixing bug in mllama.

* Those tests cannot be run in CI.

* Lint.

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-07 12:55:43 +02:00
Mohit Sharma
d9bb9bebc9
Add llama4 (#3145)
* initial changes

* Add support for other vlm

* cleanup comment

* Improve attn_implementation

* Add comments for support of models

* add model

* add model

* fixes and improvements

* update docker

* Add cache position

* Add tests

* remove redundant changes

* remove tr version

* Upgrade doc + fix linting.

* Fixing the CI.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-04-06 10:20:22 +02:00
Mohit Sharma
a35fbdb925
Bug Fix: Sliding Window Attention (#3112)
* (fix) sliding window attention

* (fix) flashinfer

* (typo) collection link

* Add window_size_left param ipex rocm

* Update window size rocm flash decoding

* fix: bump snapshots and improve exceed window test case

* feat: add tests for image types and remove alpha from png

* Upgrading `from_env` to get token from file when necessary + fix
pali_gemma.

* fix: add pillow dependency and bump lock+requirements

* fix: bump org name in gemma3 test

* Fix qwen2.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-18 10:37:33 +01:00
Daniël de Kok
095775e05c
launcher: correctly get the head dimension for VLMs (#3116)
* launcher: correctly get the head dimension for VLMs

For most (?) VLMs, the head dimension is in the `text_config`
configuration section. However, since we only queried the top-level
`head_dim` (which typically doesn't exist in VLMs), we would never use
flashinfer. This change adds a method that gets the head dimension from
the top-level `Config` struct or `text_config` when that fails.

* fix: bump org name in gemma3 test

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
2025-03-17 18:19:37 +01:00
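The fallback described in this commit is simple enough to sketch. Below is a minimal Python illustration of the lookup order (the launcher itself implements this on its Rust `Config` struct; the dict layout mirrors a parsed `config.json` and the example values are illustrative only):

```python
def get_head_dim(config: dict):
    """Return the head dimension, preferring the top-level value.

    Illustrative sketch only: for most VLMs `head_dim` lives under
    `text_config`, so fall back there when the top-level key is missing.
    """
    if config.get("head_dim") is not None:
        return config["head_dim"]
    text_config = config.get("text_config") or {}
    return text_config.get("head_dim")  # may still be None


# Example: a VLM-style config with the head dimension nested in text_config.
vlm_config = {"model_type": "gemma3", "text_config": {"head_dim": 256}}
assert get_head_dim(vlm_config) == 256
```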
Nicolas Patry
5c5528e362
Fix tool call4 (#3094)
* Removing the no_tool content information.

* Removing a lot of NO_TOOL shenanigans.

* Update the tests.
2025-03-12 09:28:47 +01:00
Mohit Sharma
ed46c2c414
Add gemma3 model (#3099) 2025-03-12 09:25:51 +01:00
Nicolas Patry
f74c36fe0d
Fix tool call3 (#3086)
* Fixing the tool calling convention.

* Update the doc.

* Fixing some corner cases.

* Fixing the tool call id.

* Fmt.

* Snapshot update with the new updated tool_call_id.

* More qwen2.
2025-03-12 09:22:53 +01:00
drbh
dc5f05f8e6
Pr 3003 ci branch (#3007)
* change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API"

Moving after tool_calls2

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

add in Buffering..

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

fix: handle usage outside of stream state and add tests

Simplifying everything quite a bit.

Remove the unused model_dump.

Clippy.

Clippy ?

Ruff.

Upgrade the flake for latest transformers.

Upgrade after rebase.

Remove potential footgun.

Fix completion test.

* Clippy.

* Tweak for multi prompt.

* Ruff.

* Update the snapshot a bit.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-10 17:56:19 +01:00
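For reference, the shape this PR aligns with is the public OpenAI Chat Completions streaming format. The sketch below is illustrative (TGI's exact fields may differ slightly); the second dict shows the kind of final usage chunk that has to be handled outside of the normal stream state:

```python
# Illustrative OpenAI-style streaming chunk: incremental text arrives in
# `choices[].delta`, with `finish_reason` set only on the last content chunk.
content_chunk = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion.chunk",
    "created": 1741600000,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {"index": 0, "delta": {"role": "assistant", "content": "Hel"}, "finish_reason": None}
    ],
}

# When usage reporting is requested, a final chunk carries the token counts
# and an empty `choices` list, after the chunk that set `finish_reason`.
usage_chunk = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion.chunk",
    "created": 1741600000,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [],
    "usage": {"prompt_tokens": 12, "completion_tokens": 34, "total_tokens": 46},
}
```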
Nicolas Patry
622908deab
Fix tool call2 (#3076)
* Making `tool_calls` a vector.

* Arguments output is a string.

* Update all the integration tests.

* Add the requirements.

* Upgrade other tests.

* Clippy.

* Update the old test.
2025-03-07 19:45:57 +01:00
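The two changes named in this PR ("tool_calls is a vector", "arguments output is a string") correspond to the standard OpenAI response shape, sketched here as an illustrative Python dict:

```python
import json

# Illustrative OpenAI-style assistant message: `tool_calls` is a list, and each
# call's `arguments` is a JSON-encoded string rather than a nested object.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": json.dumps({"location": "Paris", "format": "celsius"}),
            },
        }
    ],
}
```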
Nicolas Patry
8e92942a18
Making tool_calls a vector. (#3075)
* Making `tool_calls` a vector.

* Update doc.

* Fixing the nix overlay with updated version.

* Add openai dependency.

* Updating the old tests.

* Trying to reduce the logs in the case of errors.

* Less spammy logs too.
2025-03-05 22:32:31 +01:00
Nicolas Patry
ab9dafc68f
Making sure Olmo (transformers backend) works. (#3074) 2025-03-05 17:46:47 +01:00
drbh
1cae3197c4
Improve tool call message processing (#3036)
* make content field optional in chat request

* add tool_calls field to Message struct

* feat: add test and serialize tool messages

* fix: bump utoipa, openapi doc version and improve test

* fix: rerun update docs

* fix: support tool call id in template and remove unnecessary changes

* fix: ruff lint remove unused import

* fix: adjust message types in tests

---------

Co-authored-by: sailesh duddupudi <saileshradar@gmail.com>
2025-02-21 10:30:29 +01:00
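A hedged sketch of the conversation shape this PR has to accept: an assistant turn whose `content` is omitted in favour of `tool_calls`, followed by a `tool` message that echoes the call id into the template. Field names follow the OpenAI chat format; TGI's validation may differ in detail.

```python
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        # Assistant turn with no `content` field, only tool calls.
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_current_weather", "arguments": "{\"location\": \"Paris\"}"},
            }
        ],
    },
    # Tool result referencing the call id, rendered by the chat template.
    {"role": "tool", "tool_call_id": "call_0", "content": "22°C and sunny"},
    {"role": "user", "content": "Should I bring a jacket?"},
]
```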
drbh
d6a0c67e2f
feat: add initial qwen2.5-vl model and test (#2971)
* feat: support qwen2.5 vl model

* fix: bump support models doc

* feat: check before rope type adjustment and small refactors

* fix: add transformer overlay for processor support

* fix: vendor processor and config from transformers

* fix: refactor/simplify conditionals
2025-02-19 12:38:20 +01:00
Nicolas Patry
4cccce4b44
Update the flaky mllama test. (#3015) 2025-02-12 12:26:52 +01:00
Nicolas Patry
b86c3947ab
Revert "Update the flaky mllama test."
This reverts commit 8a870b31b9.
2025-02-11 17:13:06 +01:00
Nicolas Patry
8a870b31b9
Update the flaky mllama test. 2025-02-11 17:10:36 +01:00
Nicolas Patry
4b8cda684b
Updating mllama after strftime. (#2993)
* Updating mllama after strftime.

* Town instead of village.

* Forgot the integration snapshot.

* Attempt to fix intel CPU.

* Intel extension fix.

* Workaround intel.

* Moving those deps directly into pyproject.

* Revert "Moving those deps directly into pyproject."

This reverts commit 98c1496ea6.

* Non system uv.

* Fixing the docker environment hopefully.

* Missed a step.

* Move workdir up a bit.

* Bailing out of reproducible python env.

* Triton version.
2025-02-07 10:38:13 +01:00
drbh
c1cf36c0dc
Improve qwen vl impl (#2943)
* feat: refactor model, improve startup and re enable tests

* fix: improve multimodal rotary embed caching

* fix: limit vision flop calc to qwen2 vl models and update config typing

* fix: include clippy lint

* feat: refactor position ids in warmup and bump tests

* fix: prefer default dtype

* fix: enable all cuda graphs and bump snapshots

* fix: adjust rotary init path

* fix: simplify get position ids and remove unused vision config

* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit

* fix: improve position id init during cuda warmup for mrope and simplify rotary forward

* fix: check existence before accessing rope type in cuda warmup

* fix: check key before access

* fix: improve mrope check in cuda graph warmup

* fix: remove check for default rope type

* fix: add more test and improve model generation

* fix: improve and simplify get_cos_sin, refactor and clean up get_position_ids

* fix: adjust signatures with types
2025-02-04 12:44:18 -05:00
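One of the points in this commit ("position ids so first dim is batch") is easy to illustrate for the text-only case, where Qwen2-VL-style mrope uses three identical position components (temporal, height, width). The sketch below assumes a batch-first `(batch, 3, seq_len)` layout and is illustrative only, not the model code itself:

```python
import torch

def text_only_mrope_position_ids(batch_size: int, seq_len: int) -> torch.Tensor:
    # For pure text, all three mrope axes carry the same sequential positions.
    base = torch.arange(seq_len)                # 0, 1, ..., seq_len - 1
    pos = base.unsqueeze(0).expand(3, seq_len)  # same positions for all 3 axes
    return pos.unsqueeze(0).expand(batch_size, 3, seq_len).contiguous()

position_ids = text_only_mrope_position_ids(batch_size=2, seq_len=5)
print(position_ids.shape)  # torch.Size([2, 3, 5])
```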
Nicolas Patry
c9d68945cc
Prepare for release 3.1.0 (#2972)
* Prepare for release 3.1.0

* Back on main flake.

* Fixing stuff.

* Upgrade to moe-kernels 0.8.2 for Hip support.

* Deactivating the flaky test.
2025-01-31 14:19:01 +01:00
drbh
8f6146f11a
Revert "feat: improve qwen2-vl startup " (#2924)
Revert "feat: improve qwen2-vl startup  (#2802)"

This reverts commit eecca27113.
2025-01-17 12:09:05 -05:00
drbh
eecca27113
feat: improve qwen2-vl startup (#2802)
* feat: tokenize each request individually and increase warmup image size

* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller

* fix: address image resize and rebase changes

* feat: update to run qwen2-vl tests

* fix: tweak param types
2025-01-17 11:50:41 -05:00
drbh
82f6ea1b71
feat: improve star coder to support multi lora layers (#2883)
* feat: improve star coder to support multi lora layers

* feat: improve weight that support adapters and add tests for starcoder with lora

* fix: bump snapshot for added tests

* fix: rerun pre commit lints

* fix: bump adapter test for added layer names
2025-01-16 16:23:55 -05:00
drbh
da5ab46705
Improve vlm support (add idefics3 support) (#2437)
* feat: expand vlm support and add image token logic and tests

* fix: avoid unused perceiver config

* feat: integrate image tokens into inputs embeds

* feat: add simple idefics3 test

* feat: update docs, image token logic and weight names

* fix: improve image processing

* feat: improve prefix for idefics3

* fix: bump idefics3 tests and snapshots

* fix: improve text model loading

* feat: consolidate changes with existing vlms and add support and test for smolvlm

* fix: create new idefic3 file, simplify logic and adjust llama weight loading

* fix: lint with ruff

* fix: clean up idefics 3 and improve prefix handling

* fix: improve typing

* fix: improve prompt_split_image with ref to original impl

* fix: adjust ruff lints and small refactors

* fix: adjust FlashLlamaModel prefix logic
2025-01-09 10:35:32 -05:00
Daniël de Kok
a9c7d2e3b6
Basic flashinfer 0.2 support (#2862)
* Basic flashinfer 0.2 support

This change does not use any of the new features yet, but makes
some small compatibility changes.

* Update to flashinfer 0.2.0.post1

* flashinfer: remove `contiguous` calls

* Fix flashinfer install

* flashinfer: fixup kv cache dtype

* Fix some annoying perturbations

* More output changes
2025-01-09 16:25:00 +01:00
Nicolas Patry
a04356fb8c
Attempt for cleverer auto batch_prefill values (some simplifications). (#2808)
* Attempt for cleverer auto batch_prefill values (some simplifications).

* Less flaky tests.

* Fixing typo insertion.

* Update launcher/src/main.rs

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Adding small comment for source of calculation.

* Adding L40.

* Adding L40s.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-12-09 19:44:32 +01:00
drbh
9f5c9a5e22
Enable paligemma2 (#2807)
* feat: support loading gemma2 as vlm text model

* feat: add test for paligemma2
2024-12-06 14:41:49 -05:00
Nicolas Patry
5df8059037
Auto max prefill (#2797)
* Attempt at automatic max batch prefill.

* Taking into account number of shards.

* Adding more cards.

* Adding A100 + H100

* Adding a few more cards.

* Logprobs cost too much.

* h100 better name, and keep factor of 2

* Damn inflated sparse tflops.

* Typo in h100.

* Updated the flops calculation (checked with fvcore).

* chunking by default.

* Fix prefix caching for chat completion since we removed logprobs.

* More tests.

* Dropping all the prefill logprobs.

* Add a flag that enables users to get logprobs back.

* Repairing prompt token counting.

* Fixing a few tests.

* Remove some scaffolding.

* Attempting to reduces the issues (workarounds for now).
2024-12-06 05:52:00 +01:00
drbh
d471805134
Support continue final message (#2733)
* feat: support continue_final_message param in chat request

* feat: add test for continue final message

* fix: bump openapi docs

* fix: remove continue_final_message chat request param

* fix: remove unneeded launcher args in continue test

* fix: bump test output

* fix: remove accidentally included guideline from rebase

* fix: remove guideline tests

* fix: adjust continuation tests expected text

* fix: replace expected output for continue test
2024-11-27 19:13:30 -05:00
Daniël de Kok
289aa48554
Move JSON grammar -> regex grammar conversion to the router (#2772)
* Move JSON grammar -> regex grammar conversion to the router

This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:

simple schema           time:   [5.8293 µs 5.8307 µs 5.8320 µs]
                        change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                        Performance has improved.

complex schema          time:   [14.875 µs 14.881 µs 14.887 µs]
                        change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                        Performance has improved.

Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
2024-11-25 18:47:34 +01:00
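For intuition, the conversion benchmarked above turns a JSON schema into a regular expression that only matches conforming JSON. The snippet below is a greatly simplified, hand-written illustration of that idea for one tiny schema; the real `outlines-core` conversion handles nesting, numbers, enums, optional fields, and much more:

```python
import json
import re

schema = {"type": "object", "properties": {"name": {"type": "string"}}, "required": ["name"]}

# Hand-written regex for this one schema, for illustration only.
name_regex = r'\{\s*"name"\s*:\s*"[^"]*"\s*\}'

assert re.fullmatch(name_regex, json.dumps({"name": "tgi"}))
assert re.fullmatch(name_regex, '{"name": "x"}')
assert not re.fullmatch(name_regex, '{"name": 42}')  # wrong type is rejected
```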
OlivierDehaene
780531ec77
chore: prepare 2.4.1 release (#2773)
* chore: prepare 2.4.1 release

* fix tests

* fmt
2024-11-22 17:26:15 +00:00
drbh
faa10ad0bc
fix: tweak grammar test response (#2769) 2024-11-21 16:46:00 +00:00
Daniël de Kok
46a5a7e73e
Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758)
This change adds support for wNa16 int checkpoints with 2:4 sparsity
using Marlin 2:4 kernels.
2024-11-20 18:25:23 +01:00
drbh
5489406c4a
PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAI's scheme (#2645)
* add OpenAI like tool_choice for named choice

* add tests

* fix: run linter and bump api docs

* fix: consolidate changes and remove old tool type

* feat: improve, simplify and rename tool choice struct add required support and refactor

* fix: simplify tool choice logic, improve tests, openapi and rust docs

* fix: refactor away prepare_chat_input and improve tool grammar apply control flow

* feat: update docs and add tool choice configuration section

* fix: simplify naming, tool choice default and improve test

* fix: adjust tool choice none logic, add test and small refactors

* fix: add missing snapshot file

* fix: adjust tool choice type in test

* fix: adjust default when json tool choice is

* fix: remove trailing space lint after rebase

* fix: remove mostly mocked unit test

---------

Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
2024-11-19 13:31:59 -05:00
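The OpenAI scheme this PR adapts accepts either a string mode or a named function object for `tool_choice`. The request below is an illustrative sketch (exact acceptance in TGI may differ slightly):

```python
# String modes and the named-function form, following the OpenAI scheme.
tool_choice_named = {"type": "function", "function": {"name": "get_current_weather"}}
tool_choice_auto = "auto"          # let the model decide whether to call a tool
tool_choice_none = "none"          # never call a tool
tool_choice_required = "required"  # force the model to call some tool

request = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }
    ],
    "tool_choice": tool_choice_named,
}
```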
Daniël de Kok
3c9df21ff8
Add support for compressed-tensors w8a8 int checkpoints (#2745)
* Add support for compressed-tensors w8a8 int checkpoints

This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.

Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

|     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
|               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
|ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
|               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
|               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|

Which is the same ballpark as vLLM.

As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int

It's far less flaky and gives better output.

* Use marlin-kernels 0.3.5

* Fix a typo

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Small fixes

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-11-18 17:20:31 +01:00
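The "always use dynamic input quantization" change above refers to computing activation scales at runtime instead of calibrating them offline. A minimal eager-mode sketch of per-token int8 absmax quantization follows; the real path uses fused cutlass/Marlin kernels, not this code:

```python
import torch

def dynamic_int8_quantize(x: torch.Tensor):
    """Per-token dynamic int8 quantization (absmax scaling) — sketch only."""
    # One scale per token (row), derived from that row's absolute maximum.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 16)
q, scale = dynamic_int8_quantize(x)
x_hat = q.float() * scale  # dequantized approximation of x
```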
Daniël de Kok
a785000842
Add initial support for compressed-tensors checkpoints (#2732)
compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
  quantizers.
- Configurable exclusions for quantization.

This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.
2024-11-10 13:54:07 +01:00
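To make the three bullet points above concrete, here is an approximate illustration of the kind of configuration the compressed-tensors format expresses: per-target quantizer groups, separate weight and input-activation quantizers, and exclusions. Field names are indicative only, not an exact schema.

```python
quantization_config = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],  # which module types/names this group applies to
            "weights": {"type": "int", "num_bits": 8, "symmetric": True, "strategy": "channel"},
            "input_activations": {"type": "int", "num_bits": 8, "dynamic": True, "strategy": "token"},
        }
    },
    "ignore": ["lm_head"],  # modules excluded from quantization
}
```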
drbh
01dacf8e8f
fix cuda graphs for qwen2-vl (#2708)
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl

* fix: only check model type if config exists

* fix: adjust sharding and lm head logic

* fix qwen2 failure in intel cpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: return correct shape logits and add streaming test

* fix: remove unused import and refactor test

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-11-01 03:05:34 +01:00
drbh
befd9f6735
Support qwen2 vl (#2689)
* feat: add support for qwen2 vl model

* feat: fix token padding, enable warmup and process basic request

* fix: improve get_position_ids, add lift embed_tokens

* fix: remove get_cos_sin_hack dev function

* feat: add simple test chat with message and text

* fix: lint test

* fix: adjust positional embeddings for multi dimensional position ids

* fix: update docs and lint unused vars

* fix: include linted file

* fix: add norm after text output

* fix: format model file

* fix: adjust for ruff lints

* fix: remove unused rotate_half

* feat: refactors and calc num features

* fix: prefer position_ids passed from vlm causal lm and reset ids on batch

* fix: adjust get_position_ids if not available and add required args to signatures

* fix: adjust resize case for qwen2_vl warmup

* fix: avoid qwen2 vl specific paths with qwen2
2024-10-30 12:40:51 -04:00
Nicolas Patry
98330df65e
Monkey patching as a desperate measure. (#2704)
* Monkey patching as a desperate measure.

* New snapshot ?
2024-10-28 11:25:13 +01:00
Nicolas Patry
513d19b955
More timeout on docker start ? (#2701)
* More timeout on docker start ?

* Latest upgrade.
2024-10-28 08:57:22 +01:00
Nicolas Patry
90b226db29
We can have a tokenizer anywhere. (#2527)
* We can have a tokenizer anywhere.

* Handling potential lack of offsets (python tokenizer)

* Remove redundancy.

* Fixing the tests.

* Flake.lock update ?

* Fixing the GIL locking.

* Fixing mamba by using the transformers version.

* Adding the legacy handle.

* Ellide lifetime.

* Lint.

* Deprecation message.

* Fixing bad rebase.
2024-10-28 05:00:24 +01:00
Nicolas Patry
2e4f4ba1bb
Green main (#2697) 2024-10-28 04:59:32 +01:00
Nicolas Patry
8a8794a672
Avoiding timeout for bloom tests. (#2693)
* Avoiding timeout for bloom tests.

* Skip the test let's see if it's always the first tests that fails.

* Fail early.

* Pulling ?

* No early exit.
2024-10-26 05:35:28 +02:00
OlivierDehaene
a6b02da971
chore: prepare 2.4.0 release (#2695) 2024-10-25 21:10:49 +00:00
Daniël de Kok
0f346a3296
Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688)
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.

* Update test snapshots
2024-10-25 16:40:47 +02:00
Nicolas Patry
db68bd0524
Fixing mt0 test. (#2692) 2024-10-25 09:46:39 +02:00
Daniël de Kok
eab07f746c
Add support for FP8 KV cache scales (#2628)
* Add support for FP8 KV cache scales

Since FP8 only has limited dynamic range, we can scale keys/values
before storing them into the cache (and unscale them in attention). To
avoid rescaling the cache as the absmax values change, good scales are
usually determined per layer using calibration data and stored
in the checkpoint.

This change adds support for using key-value scales and loading them
from checkpoints in the two most common formats:

- Separate per-layer `k_scale` and `v_scale` scalars.
- Per-layer `kv_scale` scalar (older format).

Currently, scales are only used with an `float8_e4m3fn` cache.

Besides adding support for key/value scales, the `fp8_quantize` function
is also extended to support quantization with a kernel vendored from
vLLM. This is slightly faster than the PyTorch implementation, but also
scales in FP32, potentially improving accuracy.

* Update FP8 KV cache test to use checkpoint with scales

* `can_scale`: check that the attention is flashinfer
2024-10-24 16:36:18 +02:00
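A minimal sketch of the scale-on-store idea described in this commit, assuming per-layer `k_scale`/`v_scale` scalars taken from the checkpoint. This is illustrative only; the real code writes into a paged KV cache through a kernel rather than eager tensor ops:

```python
import torch

def quantize_kv_fp8(key: torch.Tensor, value: torch.Tensor, k_scale: float, v_scale: float):
    # Scale into FP8's limited dynamic range before storing in the cache.
    key_fp8 = (key / k_scale).to(torch.float8_e4m3fn)
    value_fp8 = (value / v_scale).to(torch.float8_e4m3fn)
    return key_fp8, value_fp8

def dequantize_kv_fp8(key_fp8, value_fp8, k_scale, v_scale, dtype=torch.float16):
    # Unscale when reading back for attention.
    return key_fp8.to(dtype) * k_scale, value_fp8.to(dtype) * v_scale

key = torch.randn(4, 8, 64, dtype=torch.float16)
value = torch.randn(4, 8, 64, dtype=torch.float16)
k_fp8, v_fp8 = quantize_kv_fp8(key, value, k_scale=0.75, v_scale=0.75)
k_restored, v_restored = dequantize_kv_fp8(k_fp8, v_fp8, 0.75, 0.75)
```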
Daniël de Kok
14a0df3a38
Fix Phi 3.5 MoE tests (#2684)
PR #2682 also fixed an issue in Phi MoE, but it changes the test
outputs a bit. Fix this.
2024-10-24 15:21:50 +02:00
Daniël de Kok
7f54b7336a
Test Marlin MoE with desc_act=true (#2622)
Update the Mixtral GPTQ test to use a model with `desc_act=true` and
`group_size!=-1` to ensure that we are checking activation
sorting/non-full K (with tensor parallelism). The `desc_act=false` case
is already checked by the Mixtral AWQ test.
2024-10-21 12:50:35 +02:00
Nicolas Patry
153ff3740b
CI job. Gpt awq 4 (#2665)
* add gptq and awq int4 support in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix ci failure

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* set kv cache dtype

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine the code according to the review command

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Simplifying conditionals + reverting integration tests values.

* Unused import

* Fix redundant import.

* Revert change after rebase.

* Upgrading the tests (TP>1 fix changes to use different kernels.)

* Update server/text_generation_server/layers/gptq/__init__.py

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-10-18 17:55:53 +02:00
OlivierDehaene
a6a0c97ed9
feat: prefill chunking (#2600)
* wip

* rollback

* refactor to use prefix/postfix naming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the time).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-16 12:49:33 +02:00