Commit Graph

1204 Commits

Author SHA1 Message Date
Wang, Yi
cc8b9650bd
Baichuan2-13B does not have max_position_embeddings in config (#2903)
* Baichuan2-13B does not have max_position_embeddings in config
see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/models/flash_causal_lm.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* fmt

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2025-01-15 15:56:52 +01:00
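The fix above works around configs that omit max_position_embeddings. A minimal sketch of that kind of fallback, assuming a model_max_length field like the one in Baichuan2-13B's config.json (the default value here is illustrative, not TGI's actual code path):

```python
from transformers import AutoConfig

def resolve_max_position_embeddings(model_id: str, default: int = 4096) -> int:
    """Fall back gracefully when a config lacks max_position_embeddings."""
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    value = getattr(config, "max_position_embeddings", None)
    if value is None:
        # Baichuan2-13B exposes model_max_length instead (assumption based on its config.json).
        value = getattr(config, "model_max_length", None)
    return value if value is not None else default
```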
Mohit Sharma
e07acc7f68
Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm (#2825)
* (feat) convert tscales to tensorwise

* (fix) fp8 scaling for cuda

* (kernel) add marlin-kernels

* add moe-kernels

* fix moe kernel commit

* fix scaling

* nm changes
2025-01-15 11:38:58 +05:30
Mohit Sharma
880ab9c2f3
Add Flash decoding kernel ROCm (#2855)
* (vllm) updated vllm rocm kernels

* revert silu

* update partition size

* remove grouped_topk

* (nit) remove log

* add flash decoding
2025-01-13 11:12:35 +01:00
Wang, Yi
1660154ae6
fix crash in torch2.6 if TP=1 (#2885)
error like "ValueError: Expecting a ProcessGroup, but got a <class
'text_generation_server.utils.dist.FakeGroup'>. rank=0"

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-13 11:11:31 +01:00
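For context, torch 2.6 validates that the group argument of collectives is a real ProcessGroup, so the FakeGroup placeholder used when TP=1 is rejected. A minimal sketch of the kind of guard this implies (the function name and call site are illustrative, not the actual patch):

```python
import torch
import torch.distributed as dist

def maybe_all_reduce(tensor: torch.Tensor, group, world_size: int) -> torch.Tensor:
    """Skip the collective entirely for a single shard instead of passing a
    placeholder group that torch 2.6 rejects as 'not a ProcessGroup'."""
    if world_size == 1:
        return tensor  # nothing to reduce across one process
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    return tensor
```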
Nicholas Broad
2e22164d4a
Update using_guidance.md (#2901)
deletes one copy of a sentence that was repeated twice
2025-01-13 11:09:35 +01:00
lazariv
83624a07be
Add possible variants for A100 and H100 GPUs for auto-detecting flops (#2837)
* Update main.rs with A100 and H100 variants

* Add another variant "nvidia-h100-nvl"

* Update main.rs

Add nvidia-a100-sxm4-40gb
2025-01-10 16:12:02 +01:00
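The actual lookup lives in the Rust launcher (launcher/src/main.rs); the sketch below only illustrates the idea of mapping device-name variants to a peak-compute figure for auto-detection, with approximate dense BF16 numbers rather than the launcher's real table:

```python
# Illustrative peak dense BF16 FLOPs per card variant (approximate values).
FLOPS_BY_CARD = {
    "nvidia-a100-sxm4-40gb": 312e12,
    "nvidia-a100-sxm4-80gb": 312e12,
    "nvidia-h100-80gb-hbm3": 989e12,
    "nvidia-h100-nvl": 989e12,
}

def lookup_flops(card_name: str) -> float | None:
    """Return a peak-FLOPs estimate for a normalized GPU product name, or None
    if the variant is unknown (in which case auto-detection is skipped)."""
    return FLOPS_BY_CARD.get(card_name.lower())
```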
Dmitry Dygalo
01067f8ba8
chore: Update jsonschema to 0.28.0 (#2870)
* chore: Update jsonschema to 0.28.0

Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev>

* chore: Enable blocking feature for reqwest

Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev>

---------

Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev>
2025-01-10 15:01:54 +01:00
Daniël de Kok
4f7e00f4ce
Update to marlin-kernels 0.3.7 (#2882)
This fixes a race condition. See:

https://github.com/vllm-project/vllm/pull/11493
2025-01-10 12:43:44 +01:00
drbh
da5ab46705
Improve vlm support (add idefics3 support) (#2437)
* feat: expand vlm support and add image token logic and tests

* fix: avoid unused perceiver config

* feat: integrate image tokens into inputs embeds

* feat: add simple idefics3 test

* feat: update docs, image token logic and weight names

* fix: improve image processing

* feat: improve prefix for idefics3

* fix: bump idefics3 tests and snapshots

* fix: improve text model loading

* feat: consolidate changes with existing vlms and add support and test for smolvlm

* fix: create new idefic3 file, simplify logic and adjust llama weight loading

* fix: lint with ruff

* fix: clean up idefics 3 and improve prefix handling

* fix: improve typing

* fix: improve prompt_split_image with ref to original impl

* fix: adjust ruff lints and small refactors

* fix: adjust FlashLlamaModel prefix logic
2025-01-09 10:35:32 -05:00
Daniël de Kok
a9c7d2e3b6
Basic flashinfer 0.2 support (#2862)
* Basic flashinfer 0.2 support

This change does not use any of the new features yet, but makes
some small compatibility changes.

* Update to flashinfer 0.2.0.post1

* flashinfer: remove `contiguous` calls

* Fix flashinfer install

* flashinfer: fixup kv cache dtype

* Fix some annoying perturbations

* More output changes
2025-01-09 16:25:00 +01:00
Wang, Yi
afb6c728d8
update ipex xpu to fix issue in ARC770 (#2884)
* update ipex xpu to fix issue in ARC770

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add ats support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-01-09 10:11:03 +01:00
Ruida Zeng
d37a43e581
chore: fixed some typos and attribute issues in README (#2891)
* chore: fixed html repeated attribute in README

* chore: fix minor grammar/capitalization

* chore: fixed spelling mistakes in README
2025-01-09 10:09:23 +01:00
drbh
23bc38b10d
fix: include add_special_tokens in kserve request (#2859)
merging as this patch is already in use, and is fully limited to the kserve feature
2024-12-19 16:55:17 -05:00
Wang, Yi
ab5f616920
change xpu lib download link (#2852)
Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-12-19 12:18:58 +01:00
Mohit Sharma
8f66d323d0
Update vllm kernels for ROCM (#2826)
* (vllm) updated vllm rocm kernels

* revert silu

* update partition size

* remove grouped_topk

* (nit) remove log

* update moe-kernels commit
2024-12-18 12:44:42 +01:00
janne-alatalo
7eeefa3b57
Qwen2-VL runtime error fix when prompted with multiple images (#2840)
* Fix runtime error when Qwen2-VL was prompted with multiple images

Fix a runtime error that occurred when the Qwen2-VL model was prompted with
more than one image. The runtime error was:

 File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
    text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign

The error was caused by the text_length variable going negative when multiple
images triggered multiple iterations of the get_position_ids function's main
loop.

The error is a simple logic mistake: next_image_pos is initialized as a
relative offset from current_pos, but was used as if it were an absolute
position from zero.

* Fix runtime error when Qwen2-VL was prompted with multiple images

Fix a runtime error that occurred when the Qwen2-VL model was prompted with
more than one image. The runtime error was:

File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
    inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]

(The shape numbers in the error message can differ depending on the input
image resolutions.)

The error was caused by adding the wrong number of <|image_pad|> tokens
to the tokenized input in the image_text_replacement function.

The error is a simple logical mistake: the number of image pad tokens was
taken from the length of the pixel_value_shape tensor's first dimension.
However, pixel_value_shape contains the patches from all of the images, so the
code inserted the total number of image pad tokens required for the whole
input at each image's location. This resulted in extra image pad tokens being
present in the tokenized input.

The fix is to derive the number of required tokens from the image_grid_thw
tensor, which holds the grid_t, grid_h, and grid_w values for each image.
grid_t * grid_h * grid_w gives the total number of patches for the image [1],
and the number of required image pad tokens is number_of_patches // 4 (see the
sketch after this entry).

[1] 31f9a289a6/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py (L311)

---------

Co-authored-by: Janne Alatalo <janne.alatalo@jamk.fi>
2024-12-16 22:55:11 -05:00
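A minimal sketch of the per-image pad-token count described in that fix, assuming image_grid_thw is a (num_images, 3) tensor of (grid_t, grid_h, grid_w) rows:

```python
import torch

def image_pad_token_counts(image_grid_thw: torch.Tensor) -> list[int]:
    """For each image, derive how many <|image_pad|> tokens to insert.

    Each row's product grid_t * grid_h * grid_w is the number of patches for
    that image, and patches are merged 2x2, so the pad-token count is
    patches // 4."""
    counts = []
    for grid_t, grid_h, grid_w in image_grid_thw.tolist():
        num_patches = grid_t * grid_h * grid_w
        counts.append(num_patches // 4)
    return counts
```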
drbh
a72f339c79
fix: lint backend and doc files (#2850) 2024-12-16 16:12:34 -05:00
Nicolas Patry
11ab329883
Fixing CI. (#2846) 2024-12-16 10:58:15 +01:00
Nicolas Patry
6f0b8c947d
New arg. (#2845) 2024-12-16 10:34:50 +01:00
Hugo Larcher
1708865fdc
Feat/trtllm cancellation dev container (#2795)
Add devcontainers for TRTLLM backend.

---------

Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
2024-12-13 16:19:06 +01:00
Funtowicz Morgan
ea7f4082c4
TensorRT-LLM backend bump to latest version + misc fixes (#2791)
* misc(cmake) update dependencies

* feat(hardware) enable new hardware.hpp and unittests

* test(ctest) enable address sanitizer

* feat(backend): initial rewrite of the backend for simplicity

* feat(backend): remove all the logs from hardware.hpp

* feat(backend): added some logging

* feat(backend): enable compiler warning if support for RVO not applying

* feat(backend): missing return statement

* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder

* feat(backend): delete previous backend impl

* feat(backend): more impl

* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility

* feat(backend): allow overriding which Python to use

* feat(backend): fix backend_exception_t -> backend_error_t naming

* feat(backend): impl missing generation_step_t as return value of pull_tokens

* feat(backend): make backend_workspace_t::engines_folder constexpr

* feat(backend): fix main.rs retrieving the tokenizer

* feat(backend): add guard to multiple header definitions

* test(backend): add more unittest

* feat(backend): remove constexpr from par

* feat(backend): remove constexpig

* test(backend): more test coverage

* chore(trtllm): update dependency towards 0.15.0

* effectively cancel the request on the executor

* feat(backend) fix moving backend when pulling

* feat(backend): make sure we can easily cancel request on the executor

* feat(backend): fix missing "0" field access

* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut

* chore: Add doc and CI for TRTLLM (#2799)

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* chore: Add doc and CI for TRTLLM

* doc: Formatting

* misc(backend): indent

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
2024-12-13 15:50:59 +01:00
Nicolas Patry
3bb3fd19ae
Fixup opt to reduce the amount of odd if statements. (#2833)
* Fixup opt to reduce the amount of odd if statements.

* Fixing cargo lock
2024-12-12 18:20:13 +01:00
Wang, Yi
bf59118a93
fix facebook/opt-125m not working issue (#2824)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-12-12 14:41:30 +01:00
Nicolas Patry
c3bd7212c2
Fixing latest flavor by disabling it. (#2831) 2024-12-12 14:09:35 +01:00
Guspan Tanadi
f01f2fb6e7
docs(README): supported hardware links TGI AMD GPUs (#2814) 2024-12-12 13:49:33 +01:00
Nicolas Patry
07b01293c5
Prepare patch release. (#2829) 2024-12-11 21:03:50 +01:00
RodriMora
cc66dccbe8
Update README.md (#2827)
Added instructions to clone the repo and change directory into it. 

In the following steps there is a "make install" step that would fail if people have not cloned the repo and cd'd into it, so it may be confusing for some.

Added a Python venv alternative to conda too.
2024-12-11 19:45:49 +01:00
Nicolas Patry
82c24f7420
Using both values from the config as they might not be correct. (#2817)
* Using both values from the config as they might not be correct.

* Fixing max_position_embeddings for falcon.

* Simple attempt to fix the healthcheck block allocation.

* Much simpler solution.

* Default value for Backend start_health
2024-12-10 19:37:09 +01:00
Nicolas Patry
a2d878fa0f
Small update to docs (#2816) 2024-12-10 10:46:26 +01:00
Nicolas Patry
b2fac5d947
Hotfix link2 (#2812)
2nd hotfix ?
2024-12-09 20:57:18 +01:00
Nicolas Patry
a70dd2998b
Hotfixing the link. (#2811) 2024-12-09 20:50:07 +01:00
Nicolas Patry
042791fbd5
Prep new version (#2810)
* New version.

* Link fixup.

* Update docs.

* FIxup.
2024-12-09 20:42:42 +01:00
Nicolas Patry
27fa83ca5b
V3 doc (#2809)
* V3 document.

* Updating asset.
2024-12-09 19:58:07 +01:00
Nicolas Patry
a04356fb8c
Attempt for cleverer auto batch_prefill values (some simplifications). (#2808)
* Attempt for cleverer auto batch_prefill values (some simplifications).

* Less flaky tests.

* Fixing typo insertion.

* Update launcher/src/main.rs

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Adding small comment for source of calculation.

* Adding L40.

* Adding L40s.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-12-09 19:44:32 +01:00
drbh
9f5c9a5e22
Enable paligemma2 (#2807)
* feat: support loading gemma2 as vlm text model

* feat: add test for paligemma2
2024-12-06 14:41:49 -05:00
Nicolas Patry
08f6fa0b59
Removing experimental to prefill chunking. 2024-12-06 19:09:40 +01:00
Nicolas Patry
d96dcb1797
Adding A100 compute. (#2806) 2024-12-06 18:19:15 +01:00
Nicolas Patry
5df8059037
Auto max prefill (#2797)
* Attempt at automatic max batch prefill.

* Taking into account number of shards.

* Adding more cards.

* Adding A100 + H100

* Adding a few more cards.

* Logprobs cost too much.

* h100 better name, and keep factor of 2

* Damn inflated sparse tflops.

* Typo in h100.

* Updated the flops calculation (checked with fvcore).

* chunking by default.

* Fix prefix caching for chat completion since we removed logprobs.

* More tests.

* Dropping all the prefill logprobs.

* Add a flag that enables users to get logprobs back.

* Repairing prompt token counting.

* Fixing a few tests.

* Remove some scaffolding.

* Attempting to reduces the issues (workarounds for now).
2024-12-06 05:52:00 +01:00
OlivierDehaene
8c3669b287
feat: auto max_new_tokens (#2803)
* feat: auto max_new_tokens

* update default

* Fixing the tests.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-06 05:50:35 +01:00
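A plausible sketch of the "auto" behavior described above: when a request does not set max_new_tokens, default it to the remaining context budget (the field names here are illustrative, not TGI's exact ones):

```python
def resolve_max_new_tokens(
    requested: int | None,
    input_length: int,
    max_total_tokens: int,
) -> int:
    """If the request sets max_new_tokens, honor it; otherwise allow
    generation up to whatever context remains after the prompt."""
    if requested is not None:
        return requested
    return max(1, max_total_tokens - input_length)
```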
Wang, Yi
6685e8fcda
use oneapi 2024 docker image directly for xpu (#2793)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-12-06 09:36:23 +05:30
drbh
e0db633396
fix: avoid setting use_sgmv if no kernels present (#2796) 2024-12-04 15:26:09 -05:00
Nicolas Patry
b57f370386
Saving some VRAM. (#2790)
* Saving some VRAM.

- 8B on 4xL4 attention=flashdecoding . Before 4.28GB left, After 4.32GB
  left, so 400MB saved.

- Effect not as visible on attention=flashinfer and n_shard=1. I suspect
  it's linked to the torch allocator.

* Adding assertion.
2024-12-03 04:04:21 +01:00
Daniël de Kok
2003d8be0c
Sync (most) server dependencies with Nix (#2782)
* Sync (most) server dependencies with Nix

Skipped most grpcio packages, because of protobuf version
incompatibility with the opentelemetry packages.

* Add a primitive script to generate Poetry commands to sync with Nix

This is not fully automated, since getting the Nix versions may be
unresolvable. However, it does take most of the work out of doing
this manually.

* Upgrade eetq ?

* Fmt.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-03 04:04:06 +01:00
Dmitry Rogozhkin
535149d872
fix: only use eos_token_id as pad_token_id if int (#2774)
Llama 3 has a list of values as eos_token_id:
  "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"
This breaks the tokenizer since it expects a single value. This
commit uses tokenizer.eos_token_id instead in such a case.

Fixes: #2440

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2024-12-02 06:26:37 +01:00
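A minimal sketch of the guard this commit describes, assuming a Hugging Face config/tokenizer pair (the function name is illustrative):

```python
def pick_pad_token_id(config, tokenizer) -> int:
    """Only reuse config.eos_token_id as pad_token_id when it is a single int;
    Llama 3 style configs expose a list, which the tokenizer cannot use
    directly, so fall back to tokenizer.eos_token_id in that case."""
    eos = getattr(config, "eos_token_id", None)
    if isinstance(eos, int):
        return eos
    # eos_token_id is a list (or missing): use the tokenizer's single value instead.
    return tokenizer.eos_token_id
```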
drbh
2c74c55637
fix: add merge-lora arg for model id (#2788) 2024-12-02 05:52:02 +01:00
Torsten Raudssus
a35d1e6fe5
Removing ../ that broke the link (#2789) 2024-12-02 05:48:55 +01:00
Nicolas Patry
1d2cb356b9
Fix doc. (#2792) 2024-12-02 05:28:26 +01:00
drbh
d471805134
Support continue final message (#2733)
* feat: support continue_final_message param in chat request

* feat: add test for continue final message

* fix: bump openapi docs

* fix: remove continue_final_message chat request param

* fix: remove unneeded launcher args in continue test

* fix: bump test output

* fix: remove accidentally included guideline from rebase

* fix: remove guideline tests

* fix: adjust continuation tests expected text

* fix: replace expected output for continue test
2024-11-27 19:13:30 -05:00
jp
caff779dd4
Fix: docs typo (#2777)
Fix: typo in model loading code

Fix typo in model loading code
2024-11-26 14:28:58 +01:00
Wang, Yi
892a26e549
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… (#2778)
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)

Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-11-26 14:28:11 +01:00