* Update to Torch 2.7.0
* Try to fix typer/click issue
* Pin click to fix incompatibility with typer
* Fix some test outputs with slight deviations
* Attempt again to sync with CI
* Mamba too
* Fixup mllama
Also switch to `unsloth/Llama-3.2-11B-Vision-Instruct` for testing, since it is accessible from the EU :).
* Add json_schema alias for GrammarType
* Add tests for all aliases
* fix: various linter adjustments
* fix: end-of-file-fixer lint
* fix: add test snapshots and avoid docs change
* fix: another end-of-file-fixer lint
* feat: support json_schema grammar constraining and add tests
* fix: bump openapi doc with new grammar option
* fix: adjust test payload
* fix: bump test snaps
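A minimal sketch of what the new alias enables, assuming the `json_schema` grammar type accepts the same `value` payload as the existing `json` type (endpoint, schema, and prompt below are illustrative, not from the tests):

```python
import requests

# Illustrative JSON schema to constrain generation against.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

response = requests.post(
    "http://localhost:8080/generate",  # hypothetical local TGI endpoint
    json={
        "inputs": "Introduce yourself as JSON:",
        "parameters": {
            # "json_schema" is accepted as an alias for the "json" grammar type.
            "grammar": {"type": "json_schema", "value": schema},
            "max_new_tokens": 64,
        },
    },
)
print(response.json()["generated_text"])
```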
---------
Co-authored-by: Alex Weston <alexw@alkymi.io>
* launcher: ensure correct detection of Gemma 3 head size
* Support flashinfer for Gemma3 prefill
Gemma3 uses bidirectional attention for image tokens, and flashinfer
supports custom masks. Hook the mask up with flashinfer so that we do
not have to fall back to the slower SDPA implementation for prefills that contain images.
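A rough sketch of the kind of mask involved (illustrative only, not the TGI code; the real logic also restricts bidirectionality to tokens of the same image):

```python
import torch

def gemma3_prefill_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Build a boolean [seq, seq] mask: causal everywhere, plus
    bidirectional attention between image tokens.

    `is_image` is a [seq] bool tensor marking image-token positions.
    """
    seq = is_image.shape[0]
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    # Image tokens may additionally attend to image tokens ahead of them.
    bidirectional = is_image.unsqueeze(0) & is_image.unsqueeze(1)
    return causal | bidirectional

# The flattened mask can then be handed to flashinfer's prefill wrapper
# through its custom-mask argument instead of falling back to SDPA.
```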
* Update Gemma3 test outputs
* Fixed unused import
* update transformers
* Upgrading the nix deps too.
* Forcing torchvision to be in there.
* Fixing bug in mllama.
* Those tests cannot be run in CI.
* Lint.
---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* initial changes
* Add support for other vlm
* cleanup comment
* Improve attn_implementation
* Add comments for support of models
* add model
* add model
* fixes and improvements
* update docker
* Add cache position
* Add tests
* remove redundant changes
* remove pinned transformers version
* Upgrade doc + fix linting.
* Fixing the CI.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* launcher: correctly get the head dimension for VLMs
For most (?) VLMs, the head dimension is in the `text_config`
configuration section. However, since we only queried the top-level
`head_dim` (which typically doesn't exist in VLMs), we would never use
flashinfer. This change adds a method that gets the head dimension from
the top-level `Config` struct, falling back to `text_config` when that fails.
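The lookup amounts to something like this (sketched in Python for illustration; the launcher itself is Rust, and the names here are not the actual ones):

```python
def get_head_dim(config: dict):
    """Return head_dim from the top-level config, falling back to
    text_config for VLMs where it lives in the nested section."""
    if "head_dim" in config:
        return config["head_dim"]
    text_config = config.get("text_config") or {}
    return text_config.get("head_dim")
```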
* fix: bump org name in gemma3 test
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* feat(neuron): use AWS Neuron SDK 2.21.1
* feat(neuron): bump optimum-neuron version
* feat(neuron): tag latest image for local tests
* test(neuron): simplify sampling test
* Fixing the tool calling convention.
* Update the doc.
* Fixing some corner cases.
* Fixing the tool call id.
* Fmt.
* Snapshot update with the new updated tool_call_id.
* More qwen2.
* change ChatCompletionChunk to align with the "OpenAI Chat Completions streaming API"
Moving after `tool_calls`.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Add in buffering.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
fix: handle usage outside of stream state and add tests
Simplifying everything quite a bit.
Remove the unused model_dump.
Clippy.
Clippy?
Ruff.
Upgrade the flake for latest transformers.
Upgrade after rebase.
Remove potential footgun.
Fix completion test.
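For reference, the streaming chunk shape being matched looks roughly like this (field values illustrative):

```python
# OpenAI-style chat.completion.chunk carrying a tool call delta.
chunk = {
    "id": "chatcmpl-abc123",  # illustrative id
    "object": "chat.completion.chunk",
    "created": 1710000000,
    "model": "example-model",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "tool_calls": [
                    {
                        "index": 0,
                        "id": "call_0",
                        "type": "function",
                        "function": {"name": "get_weather", "arguments": "{\"city\""},
                    }
                ],
            },
            "finish_reason": None,
        }
    ],
    # usage only appears on the final chunk (when requested), hence the
    # need to handle it outside of the per-token stream state.
    "usage": None,
}
```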
* Clippy.
* Tweak for multi prompt.
* Ruff.
* Update the snapshot a bit.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Making `tool_calls` a vector.
* Arguments output is a string (see the sketch below).
* Update all the integration tests.
* Add the requirements.
* Upgrade other tests.
* Clippy.
* Update the old test.
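The resulting non-streaming message shape, sketched for reference (values illustrative):

```python
message = {
    "role": "assistant",
    "content": None,
    # tool_calls is now a vector, matching OpenAI's shape.
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                # arguments is a JSON-encoded string, not a nested object.
                "arguments": "{\"city\": \"Paris\"}",
            },
        }
    ],
}
```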
* Making `tool_calls` a vector.
* Update doc.
* Fixing the nix overlay with updated version.
* Add openai dependency.
* Updating the old tests.
* Trying to reduce the logs in the case of errors.
* Less spammy logs too.
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image
The base image used to compile the rust components seems to have a low
ulimit for opened files, which leads to errors during compilation.
* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures
The neuron tests require models to have been previously exported and
cached on the hub. This is done automatically by the neuron.model
fixture the first time the tests are run for a specific version.
This fixture used to export the models using optimum-neuron directly,
but this package is not necessarily present on the system.
Instead, it is now done through the neuron TGI itself, since it
contains all the tools required to export the models.
Note that since the CI runs docker in docker (dind) it does not seem
possible to share a volume between the CI container and the container
used to export the model.
For that reason, a specific image with a modified entrypoint is built
on-the-fly when a model export is required.
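A simplified sketch of the fixture idea (image name, env vars, and helper are hypothetical; the real fixture also builds the temporary image with the modified entrypoint):

```python
import subprocess

def export_neuron_model(model_id: str, export_image: str, hub_cache_repo: str) -> None:
    """Run a (hypothetical) export image so the container itself performs the
    optimum-neuron export and pushes the artifacts to the hub cache.
    Everything happens inside the container because the CI cannot share a
    volume under docker-in-docker."""
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-e", f"MODEL_ID={model_id}",
            "-e", f"HUB_CACHE_REPO={hub_cache_repo}",
            export_image,
        ],
        check=True,
    )
```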
* refactor: remove sagemaker entry-point
The SageMaker image is built differently anyway.
* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: doing a precompilation step (with a different token).
* test(neuron): avoid using image sha when exporting models
We now manually evaluate the apparent hash of the neuron backend by
combining the hash of the neuron backend directory and Dockerfile.
This new hash is used to identify exported neuron models instead of the
image sha.
This has two benefits:
- it changes less frequently (only when the neuron backend changes),
which means fewer neuron models being pushed to the hub,
- it can be evaluated locally, meaning that running the tests once
locally will export the models before the CI uses them.
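Roughly, the combined hash can be computed like this (paths and function name illustrative):

```python
import hashlib
from pathlib import Path

def neuron_backend_hash(backend_dir: str, dockerfile: str) -> str:
    """Combine the content hash of the neuron backend directory with the
    Dockerfile's hash, as a stable identifier for exported models."""
    h = hashlib.sha256()
    for path in sorted(Path(backend_dir).rglob("*")):
        if path.is_file():
            # Hash both the relative path and the file contents.
            h.update(path.relative_to(backend_dir).as_posix().encode())
            h.update(path.read_bytes())
    h.update(Path(dockerfile).read_bytes())
    return h.hexdigest()
```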
* test(neuron): added a small script to prune test models
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* make content field optional in chat request
* add tool_calls field to Message struct
* feat: add test and serialize tool messages
* fix: bump utoipa, openapi doc version and improve test
* fix: rerun update docs
* fix: support tool call id in template and remove unnecessary changes
* fix: ruff lint remove unused import
* fix: adjust message types in tests
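A sketch of the message shapes this enables (values illustrative): an assistant turn with `tool_calls` and no `content`, followed by a `tool` turn referencing the call id.

```python
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "content": None,  # content is now optional
        "tool_calls": [
            {
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
            }
        ],
    },
    # The tool result references the call via tool_call_id, which the chat
    # template can now render.
    {"role": "tool", "tool_call_id": "call_0", "content": "22C, sunny"},
]
```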
---------
Co-authored-by: sailesh duddupudi <saileshradar@gmail.com>
* feat: support qwen2.5 vl model
* fix: bump support models doc
* feat: check before rope type adjustment and small refactors
* fix: add transformer overlay for processor support
* fix: vendor processor and config from transformers
* fix: refactor/simplify conditionals
* Updating mllama after strftime.
* Town instead of village.
* Forgot the integration snapshot.
* Attempt to fix intel CPU.
* Intel extension fix.
* Workaround intel.
* Moving those deps directly into pyproject.
* Revert "Moving those deps directly into pyproject."
This reverts commit 98c1496ea6.
* Non system uv.
* Fixing the docker environment hopefully.
* Missed a step.
* Move workdir up a bit.
* Bailing out of reproducible python env.
* Triton version.
* feat: refactor model, improve startup and re-enable tests
* fix: improve multimodal rotary embed caching
* fix: limit vision flop calc to qwen2 vl models and update config typing
* fix: include clippy lint
* feat: refactor position ids in warmup and bump tests
* fix: prefer default dtype
* fix: enable all cuda graphs and bump snapshots
* fix: adjust rotary init path
* fix: simplify get position ids and remove unused vision config
* fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit
* fix: improve position id init during cuda warmup for mrope and simplify rotary forward
* fix: check existence before accessing rope type in cuda warmup
* fix: check key before access
* fix: improve mrope check in cuda graph warmup
* fix: remove check for default rope type
* fix: add more test and improve model generation
* fix: improve and simplify get_cos_sin, refactor and clean up get_position_ids
* fix: adjust signatures with types
* Trying to avoid the random timeout.
* More read timeout?
* Longer timeout?
* Remove legacy ENV directive.
* Remove the dummy test, only increase the read timeout.
* Wat?
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
* feat: improve starcoder to support multi-lora layers
* feat: improve weights that support adapters and add tests for starcoder with lora
* fix: bump snapshot for added tests
* fix: rerun pre commit lints
* fix: bump adapter test for added layer names