Commit Graph

1324 Commits

Funtowicz Morgan
17367438f3
Give TensorRT-LLM a proper CI/CD 😍 (#2886)
* test(ctest) enable address sanitizer

* feat(trtllm): expose finish reason to Rust

* feat(trtllm): fix logits retrieval

* misc(ci): enable building tensorrt-llm

* misc(ci): update Rust action toolchain

* misc(ci): let's try to build the Dockerfile for trtllm

# Conflicts:
#	Dockerfile_trtllm

* misc(ci): provide mechanism to cache inside container

* misc(ci): export aws creds as output of step

* misc(ci): let's try this way

* misc(ci): again

* misc(ci): again

* misc(ci): add debug profile

* misc(ci): add debug profile

* misc(ci): lets actually use sccache ...

* misc(ci): do not build with ssl enabled

* misc(ci): WAT

* misc(ci): WAT

* misc(ci): WAT

* misc(ci): WAT

* misc(ci): WAT

* misc(backend): test with TGI S3 conf

* misc(backend): test with TGI S3 conf

* misc(backend): once more?

* misc(backend): let's try with GHA

* misc(backend): missing env directive

* misc(backend): make sure to correctly set IS_GHA_BUILD=true in wf

* misc(backend): ok let's debug smtg

* misc(backend): WWWWWWWWWWWWWAAAAAAAA

* misc(backend): kthxbye retry s3

* misc(backend): use session token

* misc(backend): add more info

* misc(backend): lets try 1h30

* misc(backend): lets try 1h30

* misc(backend): increase to 2h

* misc(backend): lets try...

* misc(backend): lets try...

* misc(backend): let's build for ci-runtime

* misc(backend): let's add some more tooling

* misc(backend): add some tags

* misc(backend): disable Werror for now

* misc(backend): added automatic gha detection

* misc(backend): remove leak sanitizer which is included in asan

* misc(backend): forward env

* misc(backend): forward env

* misc(backend): let's try

* misc(backend): let's try

* misc(backend): again

* misc(backend): again

* misc(backend): again

* misc(backend): again

* misc(backend): again

* misc(backend): fix sscache -> sccache

* misc(backend): fix sscache -> sccache

* misc(backend): fix sscache -> sccache

* misc(backend): let's actually cache things now

* misc(backend): let's actually cache things now

* misc(backend): attempt to run the testS?

* misc(backend): attempt to run the tests?

* misc(backend): attempt to run the tests?

* change runner size

* fix: Correctly tag docker images (#2878)

* fix: Correctly tag docker images

* fix: Correctly tag docker images

* misc(llamacpp): maybe?

* misc(llamacpp): maybe?

* misc(llamacpp): maybe?

* misc(ci): gogogo

* misc(ci): gogogo

* misc(ci): gogogo

* misc(ci): gogogo

* misc(ci): gogogo

* misc(ci): gogogo

* misc(ci): go

* misc(ci): go

* misc(ci): go

* misc(ci): use bin folder

* misc(ci): make the wf callable for reuse

* misc(ci): make the wf callable for reuse (bis)

* misc(ci): make the wf callable for reuse (bis)

* misc(ci): give the wf a name

* Create test-trtllm.yml

* Update test-trtllm.yml

* Create build-trtllm2

* Rename build-trtllm2 to 1-build-trtllm2

* Rename test-trtllm.yml to 1-test-trtllm2.yml

* misc(ci): fw secrets

* Update 1-test-trtllm2.yml

* Rename 1-build-trtllm2 to 1-build-trtllm2.yml

* Update 1-test-trtllm2.yml

* misc(ci): use ci-build.yaml as main dispatcher

* Delete .github/workflows/1-test-trtllm2.yml

* Delete .github/workflows/1-build-trtllm2.yml

* misc(ci): rights?

* misc(ci): rights?

* misc(ci): once more?

* misc(ci): once more?

* misc(ci): baby more time?

* misc(ci): baby more time?

* misc(ci): try the permission above again?

* misc(ci): try the permission above again?

* misc(ci): try the permission scoped again?

* misc(ci): install tensorrt_llm_executor_static

* misc(ci): attempt to rebuild with sccache?

* misc(ci): run the tests on a GPU instance

* misc(ci): let's actually setup sccache in the build.rs

* misc(ci): reintroduce variables

* misc(ci): enforce sccache

* misc(ci): use the right job name dependency

* misc(ci): detect dev profile for debug

* misc(ci): detect gha build

* misc(ci): detect gha build

* misc(ci): ok debug

* misc(ci): wtf

* misc(ci): wtf2

* misc(ci): wtf3

* misc(ci): use commit HEAD instead of merge commit for image id

* misc(ci): wtfinfini

* misc(ci): wtfinfini

* misc(ci): KAMEHAMEHA

* Merge TRTLLM into standard CI

* misc(ci): remove input machine

* misc(ci): missing id-token for AWS auth

* misc(ci): missing id-token for AWS auth

* misc(ci): missing id-token for AWS auth

* misc(ci): again...

* misc(ci): again...

* misc(ci): again...

* misc(ci): again...

* misc(ci): missing benchmark

* misc(ci): missing backends

* misc(ci): missing launcher

* misc(ci): give everything aws needs

* misc(ci): give everything aws needs

* misc(ci): fix warnings

* misc(ci): attempt to fix sccache not building trtllm

* misc(ci): attempt to fix sccache not building trtllm again

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Pauline Bailly-Masson <155966238+paulinebm@users.noreply.github.com>
2025-01-21 10:19:16 +01:00
Cyril Vallez
b980848abf
Flash Transformers modeling backend support (#2913)
* add transformers_flash

* inits

* switch version to make it work

* Update Makefile-flash-att-v2

* Update Makefile-flash-att-v2

* Update Makefile-flash-att-v2

* Update Makefile-flash-att-v2

* Update Makefile-flash-att-v2

* Update Makefile-flash-att-v2

* runnable version

* working

* push change

* fix high dim

* init

* default

* latest transformers changes

* revert

* simplify check

* remove flag

* improve type hints + required args

* Update based on transformers PR

* small fix

* Remove Warpers for Processor

* fix compatibility version issue

* raise error if needed

* Simplify with monkey patch

* revert + style + minor improvements

* update comment

* device check

* move the import to avoid device issue

* Update __init__.py

* check for non-native models

* oops

---------

Co-authored-by: System administrator <root@ip-10-90-0-159.ec2.internal>
2025-01-21 10:01:51 +01:00
Nicolas Patry
447a5b2f87
Fixing TRTLLM dockerfile. (#2922)
* Fixing TRTLLM dockerfile.

* Fixed.

* Creating a dummy modification to check CI runs.

* Removing the cache directive.

* Modifying this should cache hit.

* Revert "Modifying this should cache hit."

This reverts commit 46a2bde108.

* Modifying this should cache hit.

* Unwanted files.
2025-01-20 11:13:46 +01:00
Daniël de Kok
630f198624
flashinfer: switch to plan API (#2904)
This change doesn't switch `forward` to `run` yet, since it requires
that we have access to the softmax scale and the logit softcap outside
the model.
2025-01-17 18:18:02 +01:00
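
For context, a minimal sketch of what the plan API looks like, assuming flashinfer 0.2's documented BatchPrefillWithPagedKVCacheWrapper interface; the exact TGI call sites may differ:

```python
# Hedged sketch: plan() replaces begin_forward() and precomputes scheduling
# metadata once per batch, before the model runs. Argument names follow the
# flashinfer docs, not TGI's code.
import torch
import flashinfer


def plan_prefill(qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
                 num_qo_heads, num_kv_heads, head_dim, page_size):
    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")
    wrapper.plan(
        qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
        num_qo_heads, num_kv_heads, head_dim, page_size,
    )
    return wrapper
```

Switching `forward` to `run` would additionally mean passing the softmax scale and logit softcap at plan time, i.e. before the model's forward pass, which is why this change stops short of it.
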
drbh
8f6146f11a
Revert "feat: improve qwen2-vl startup " (#2924)
Revert "feat: improve qwen2-vl startup  (#2802)"

This reverts commit eecca27113.
2025-01-17 12:09:05 -05:00
drbh
eecca27113
feat: improve qwen2-vl startup (#2802)
* feat: tokenize each request individually and increase warmup image size

* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller

* fix: address image resize and rebase changes

* feat: update to run qwen2-vl tests

* fix: tweak param types
2025-01-17 11:50:41 -05:00
Wang, Yi
6e982f43a1
fix the crash of meta-llama/Llama-3.2-1B (#2918)
* fix the crash of meta-llama/Llama-3.2-1B

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

Simpler fix (which doesn't break vlms).

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-01-17 15:50:58 +01:00
Mohit Sharma
c20025dbf7
Add fp8 kv cache for ROCm (#2856)
* add fp8 kv cache for rocm

* improvements

* update log statement

* remove bookkeeping field
2025-01-17 18:43:29 +05:30
Nicolas Patry
de19e7e844
Moving to uv instead of poetry. (#2919)
* Moving to `uv` instead of `poetry`.

More standard, faster, and with a seemingly better lockfile.

* Creating venv if not created.

* Create the venv.

* Fix ?

* Fixing the test by activating the environment ?

* Install system  ?

* Add the cli entry point.

* docker install on system

* Monkeying this...

* `--system` is redundant.

* Trying to force-include this pb folder.

* Trying to check that pb is imported correctly.

* Editable install necessary ?

* Non editable?

* Editable it is.
2025-01-17 12:32:00 +01:00
Daniël de Kok
d61f14f271
nix: update to PyTorch 2.5.1 (#2921) 2025-01-17 12:12:11 +01:00
Wang, Yi
885144166f
Add flash decoding kernel and enable prefill chunking and prefix caching on Intel CPU/XPU (#2815)
* flash decoding

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable xpu flashdecoding

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* set flashdecoding blocksize as 64

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable flashdecoding, prefill chunking and prefix caching

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add flashdecoding-ipex

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-17 12:04:57 +01:00
drbh
82f6ea1b71
feat: improve StarCoder to support multiple LoRA layers (#2883)
* feat: improve StarCoder to support multiple LoRA layers

* feat: improve weights that support adapters and add tests for StarCoder with LoRA

* fix: bump snapshot for added tests

* fix: rerun pre commit lints

* fix: bump adapter test for added later names
2025-01-16 16:23:55 -05:00
Daniël de Kok
5f78ec32a5
Do not convert weight scale to e4m3fnuz on CUDA (#2917) 2025-01-16 13:44:32 +01:00
Nicolas Patry
922cc38fbc
Upgrading bitsandbytes. (#2910)
* Upgrading bitsandbytes.

Co-Authored-By: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>

* Tighter lock.

---------

Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
2025-01-15 20:07:21 +01:00
Nicolas Patry
120bd3e3bb
Removing the github runner. (#2912) 2025-01-15 19:20:44 +01:00
Baptiste Colle
1470aec9d9
Fix typo in TPU docs (#2911)
docs(tpu): fix typo
2025-01-15 18:32:07 +01:00
Nicolas Patry
203cade244
Upgrading our rustc version. (#2908)
* Upgrading our rustc version.

* Fixing the Rust tests to the proper version.

* Clippy everything.
2025-01-15 17:04:03 +01:00
Baptiste Colle
46994b34fb
📝 add guide on using TPU with TGI in the docs (#2907) 2025-01-15 16:26:11 +01:00
Alvaro Bartolome
dc9b8e9814
Fix docker run in README.md (#2861)
* Fix `docker run` in `README.md`

* Add line-break in `docker run` for readability

Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>

* Add line-break in `docker run` for readability

Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>

---------

Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>
2025-01-15 16:07:10 +01:00
Guspan Tanadi
3c7ae48f7f
docs(conceptual/speculation): available links Train Medusa (#2863) 2025-01-15 16:05:54 +01:00
Wang, Yi
cc8b9650bd
Baichuan2-13B does not have max_position_embeddings in config (#2903)
* Baichuan2-13B does not have max_position_embeddings in config
see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/models/flash_causal_lm.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* fmt

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2025-01-15 15:56:52 +01:00
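
A hedged sketch of the fallback this implies, assuming the attribute names from the linked config.json (the actual flash_causal_lm.py logic may differ):

```python
# Illustrative only: resolve a maximum sequence length when the config lacks
# max_position_embeddings. Baichuan2-13B exposes model_max_length (4096) instead.
def resolve_max_position_embeddings(config, default: int = 4096) -> int:
    value = getattr(config, "max_position_embeddings", None)
    if value is None:
        value = getattr(config, "model_max_length", default)
    return value
```
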
Mohit Sharma
e07acc7f68
Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm (#2825)
* (feat) convert tscales to tensorwise

* (fix) fp8 scaling for cuda

* (kernel) add marlin-kernels

* add moe-kernels

* fix moe kernel commit

* fix scaling

* nm changes
2025-01-15 11:38:58 +05:30
Mohit Sharma
880ab9c2f3
Add Flash decoding kernel ROCm (#2855)
* (vllm) updated vllm rocm kernels

* revert silu

* update partition size

* remove grouped_topk

* (nit) remove log

* add flash decoding
2025-01-13 11:12:35 +01:00
Wang, Yi
1660154ae6
fix crash in torch2.6 if TP=1 (#2885)
error like "ValueError: Expecting a ProcessGroup, but got a <class
'text_generation_server.utils.dist.FakeGroup'>. rank=0"

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-13 11:11:31 +01:00
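
As a hedged illustration of the failure mode (the actual fix may differ): torch 2.6 validates the type of the group argument, so the single-rank FakeGroup placeholder must not reach torch.distributed at all.

```python
import torch
import torch.distributed as dist


# Sketch under that assumption: skip collectives entirely when TP=1 instead of
# handing torch.distributed the FakeGroup stand-in it now rejects.
def all_reduce_sum(tensor: torch.Tensor, group, world_size: int) -> torch.Tensor:
    if world_size == 1:
        return tensor  # single rank: nothing to reduce
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    return tensor
```
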
Nicholas Broad
2e22164d4a
Update using_guidance.md (#2901)
Deletes one copy of a sentence that appeared twice.
2025-01-13 11:09:35 +01:00
lazariv
83624a07be
Add possible variants for A100 and H100 GPUs for auto-detecting flops (#2837)
* Update main.rs with A100 and H100 variants

* Add another variant "nvidia-h100-nvl"

* Update main.rs

Add nvidia-a100-sxm4-40gb
2025-01-10 16:12:02 +01:00
Dmitry Dygalo
01067f8ba8
chore: Update jsonschema to 0.28.0 (#2870)
* chore: Update jsonschema to 0.28.0

Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev>

* chore: Enable blocking feature for reqwest

Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev>

---------

Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev>
2025-01-10 15:01:54 +01:00
Daniël de Kok
4f7e00f4ce
Update to marlin-kernels 0.3.7 (#2882)
This fixes a race condition. See:

https://github.com/vllm-project/vllm/pull/11493
2025-01-10 12:43:44 +01:00
drbh
da5ab46705
Improve vlm support (add idefics3 support) (#2437)
* feat: expand vlm support and add image token logic and tests

* fix: avoid unused perceiver config

* feat: integrate image tokens into inputs embeds

* feat: add simple idefics3 test

* feat: update docs, image token logic and weight names

* fix: improve image processing

* feat: improve prefix for idefics3

* fix: bump idefics3 tests and snapshots

* fix: improve text model loading

* feat: consolidate changes with existing vlms and add support and test for smolvlm

* fix: create new idefics3 file, simplify logic and adjust llama weight loading

* fix: lint with ruff

* fix: clean up idefics 3 and improve prefix handling

* fix: improve typing

* fix: improve prompt_split_image with ref to original impl

* fix: adjust ruff lints and small refactors

* fix: adjust FlashLlamaModel prefix logic
2025-01-09 10:35:32 -05:00
Daniël de Kok
a9c7d2e3b6
Basic flashinfer 0.2 support (#2862)
* Basic flashinfer 0.2 support

This change does not use any of the new features yet, but makes
some small compatibility changes.

* Update to flashinfer 0.2.0.post1

* flashinfer: remove `contiguous` calls

* Fix flashinfer install

* flashinfer: fixup kv cache dtype

* Fix some annoying perturbations

* More output changes
2025-01-09 16:25:00 +01:00
Wang, Yi
afb6c728d8
update ipex xpu to fix issue in ARC770 (#2884)
* update ipex xpu to fix issue in ARC770

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add ats support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-09 10:11:03 +01:00
Ruida Zeng
d37a43e581
chore: fixed some typos and attribute issues in README (#2891)
* chore: fixed repeated HTML attribute in README

* chore: fix minor grammar/capitalization

* chore: fixed spelling mistakes in README
2025-01-09 10:09:23 +01:00
drbh
23bc38b10d
fix: include add_special_tokens in kserve request (#2859)
Merging as this patch is already used, and fully limited to the kserve feature.
2024-12-19 16:55:17 -05:00
Wang, Yi
ab5f616920
change xpu lib download link (#2852)
Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-12-19 12:18:58 +01:00
Mohit Sharma
8f66d323d0
Update vllm kernels for ROCM (#2826)
* (vllm) updated vllm rocm kernels

* revert silu

* update partition size

* remove grouped_topk

* (nit) remove log

* update moe-kernels commit
2024-12-18 12:44:42 +01:00
janne-alatalo
7eeefa3b57
Qwen2-VL runtime error fix when prompted with multiple images (#2840)
* Fix runtime error when Qwen2-VL was prompted with multiple images

Fix a runtime error when the Qwen2-VL model is prompted with a prompt containing more than one image. The runtime error was:

 File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
    text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign

The error was caused by the text_length variable going negative when multiple images caused multiple iterations of the main loop in the get_position_ids function.

The error is a simple logic mistake: next_image_pos is initialized as a relative offset from current_pos, but was used as if it were an absolute position from zero.

* Fix runtime error when Qwen2-VL was prompted with multiple images

Fix a runtime error when the Qwen2-VL model is prompted with a prompt containing more than one image. The runtime error was:

File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
    inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]

(The shape numbers in the error message can differ depending on the input image resolutions.)

The error was caused by adding the wrong number of <|image_pad|> tokens
to the tokenized input in the image_text_replacement function.

The error is a simple logical mistake: the number of image pad tokens was read from the length of the pixel_value_shape tensor's first dimension. However, pixel_value_shape contains patches from all of the images, so the code added the total number of image pad tokens required for the whole input at each image's location. This resulted in extra image pad tokens being present in the tokenized input.

The fix was to take the number of required tokens from the image_grid_thw tensor, which includes grid_t, grid_h, and grid_w values for each image. grid_t * grid_h * grid_w gives the total number of patches for the image [1]. The number of required image pad tokens is number_of_patches // 4.

[1] 31f9a289a6/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py (L311)

---------

Co-authored-by: Janne Alatalo <janne.alatalo@jamk.fi>
2024-12-16 22:55:11 -05:00
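
The arithmetic of the second fix, as a hedged sketch (the helper name is illustrative, not TGI's exact code):

```python
# Count <|image_pad|> tokens per image from image_grid_thw rather than from
# pixel_values' first dimension, which mixes patches from all images.
def image_pad_token_count(grid_t: int, grid_h: int, grid_w: int) -> int:
    num_patches = grid_t * grid_h * grid_w  # patches for this one image only
    # Qwen2-VL merges patches in 2x2 groups, hence one pad token per 4 patches.
    return num_patches // 4
```
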
drbh
a72f339c79
fix: lint backend and doc files (#2850) 2024-12-16 16:12:34 -05:00
Nicolas Patry
11ab329883
Fixing CI. (#2846) 2024-12-16 10:58:15 +01:00
Nicolas Patry
6f0b8c947d
New arg. (#2845) 2024-12-16 10:34:50 +01:00
Hugo Larcher
1708865fdc
Feat/trtllm cancellation dev container (#2795)
Add devcontainers for TRTLLM backend.

---------

Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
2024-12-13 16:19:06 +01:00
Funtowicz Morgan
ea7f4082c4
TensorRT-LLM backend bump to latest version + misc fixes (#2791)
* misc(cmake): update dependencies

* feat(hardware): enable new hardware.hpp and unittests

* test(ctest) enable address sanitizer

* feat(backend): initial rewrite of the backend for simplicity

* feat(backend): remove all the logs from hardware.hpp

* feat(backend): added some logging

* feat(backend): enable a compiler warning when RVO does not apply

* feat(backend): missing return statement

* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder

* feat(backend): delete previous backend impl

* feat(backend): more impl

* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility

* feat(backend): allow overriding which Python to use

* feat(backend): fix backend_exception_t -> backend_error_t naming

* feat(backend): impl missing generation_step_t as return value of pull_tokens

* feat(backend): make backend_workspace_t::engines_folder constexpr

* feat(backend): fix main.rs retrieving the tokenizer

* feat(backend): add guard to multiple header definitions

* test(backend): add more unittest

* feat(backend): remove constexpr from par

* feat(backend): remove constexpig

* test(backend): more test coverage

* chore(trtllm): update dependency towards 0.15.0

* effectively cancel the request on the executor

* feat(backend): fix moving backend when pulling

* feat(backend): make sure we can easily cancel request on the executor

* feat(backend): fix missing "0" field access

* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut

* chore: Add doc and CI for TRTLLM (#2799)

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* doc: Formatting

* misc(backend): indent

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
2024-12-13 15:50:59 +01:00
Nicolas Patry
3bb3fd19ae
Fixup opt to reduce the number of odd if statements. (#2833)
* Fixup opt to reduce the number of odd if statements.

* Fixing cargo lock
2024-12-12 18:20:13 +01:00
Wang, Yi
bf59118a93
fix facebook/opt-125m not working issue (#2824)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-12-12 14:41:30 +01:00
Nicolas Patry
c3bd7212c2
Fixing latest flavor by disabling it. (#2831) 2024-12-12 14:09:35 +01:00
Guspan Tanadi
f01f2fb6e7
docs(README): supported hardware links TGI AMD GPUs (#2814) 2024-12-12 13:49:33 +01:00
Nicolas Patry
07b01293c5
Prepare patch release. (#2829) 2024-12-11 21:03:50 +01:00
RodriMora
cc66dccbe8
Update README.md (#2827)
Added instructions to clone the repo and change directory into it.

In the following steps there is a "make install" step that would fail if people have not cloned the repo and cd'd into it, so this may be confusing for some.

Also added a Python venv alternative to conda.
2024-12-11 19:45:49 +01:00
Nicolas Patry
82c24f7420
Using both values from the config as they might not be correct. (#2817)
* Using both values from the config as they might not be correct.

* Fixing max_position_embeddings for falcon.

* Simple attempt to fix the healthcheck block allocation.

* Much simpler solution.

* Default value for Backend start_health
2024-12-10 19:37:09 +01:00
Nicolas Patry
a2d878fa0f
Small update to docs (#2816) 2024-12-10 10:46:26 +01:00
Nicolas Patry
b2fac5d947
Hotfix link2 (#2812)
2nd hotfix ?
2024-12-09 20:57:18 +01:00