* Upgrade the version number.
* Remove modifications in Lock.
* Tmp branch to test transformers backend with 2.5.1 and TP>1
* Fixing the transformers backend.
inference_mode forces the use of `aten.matmul` instead of `aten.mm`;
the former doesn't have sharding support, which crashes transformers'
TP support.
`lm_head.forward` also crashes because it skips the hook that
casts/decasts the DTensor.
Torch 2.5.1 is required for sharding support.
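A minimal sketch of the workaround described above, assuming the generation step previously ran under `torch.inference_mode()`; the `generate_step` helper is hypothetical and only illustrates the dispatch difference:
```python
import torch
import torch.nn as nn

def generate_step(model: nn.Module, input_ids: torch.Tensor) -> torch.Tensor:
    # Sketch, not the actual patch: running under no_grad instead of
    # inference_mode keeps linears on the shardable aten.mm/aten.addmm path
    # (aten.matmul has no DTensor sharding rule in torch 2.5.1, per the
    # note above).
    with torch.no_grad():  # instead of torch.inference_mode()
        return model(input_ids).logits
```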
* Put back the attention impl.
* Revert the flashinfer (this will fail).
* Building AOT.
* Using 2.5 kernels.
* Remove the archlist, it's defined in the docker anyway.
* backend(trtllm): update to 0.16.0
* backend(trtllm): do not use shallow clone
* backend(trtllm): use tag instead
* backend(trtllm): move to nvidia remote instead of hf
* backend(trtllm): reenable shallow clone
* backend(trtllm): attempt to use ADD instead of RUN for openmpi
* backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile
* backend(trtllm): correctly untar it
* Trying to avoid the random timeout.
* More read timeout ?
* Longer timeout ?
* Remove legacy ENV directive.
* Remove the dummy test, only increase the read timeout.
* Wat?
* Fixing TRTLLM dockerfile.
* Fixed.
* Creating a dummy modification to check CI runs.
* Removing the cache directive.
* Modifying this should cache hit.
* Revert "Modifying this should cache hit."
This reverts commit 46a2bde108.
* Modifying this should cache hit.
* Unwanted files.
* feat: tokenize each request individually and increase warmup image size
* feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller
* fix: address image resize and rebase changes
* feat: update to run qwen2-vl tests
* fix: tweak param types
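A minimal sketch of filtering out the small CUDA graph sizes mentioned above; the `CUDA_GRAPHS` environment variable and default list are assumptions for illustration:
```python
import os

# Illustrative only: skip capturing CUDA graphs for batch sizes 2 and smaller.
_requested = os.environ.get("CUDA_GRAPHS", "1,2,4,8,16,32")
CUDA_GRAPH_SIZES = [s for s in (int(x) for x in _requested.split(",")) if s > 2]

print(CUDA_GRAPH_SIZES)  # [4, 8, 16, 32] with the default list above
```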
* fix the crash of meta-llama/Llama-3.2-1B
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Simpler fix (which doesn't break VLMs).
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Moving to `uv` instead of `poetry`.
More standard, faster, and with a seemingly better lockfile.
* Creating venv if not created.
* Create the venv.
* Fix ?
* Fixing the test by activating the environment ?
* Install system ?
* Add the cli entry point.
* docker install on system
* Monkeying this...
* `--system` is redundant.
* Trying to force-include this pb folder.
* Trying to check that pb is imported correctly.
* Editable install necessary ?
* Non editable?
* Editable it is.
* flash decoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable xpu flashdecoding
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set flashdecoding blocksize as 64
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable flashdecoding, prefill chunking and prefix caching
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add flashdecoding-ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
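A minimal sketch of how the 64-token flashdecoding block size above could be selected; `ATTENTION` and `BLOCK_SIZE` here are illustrative stand-ins, not necessarily the exact variables used in the server:
```python
import os

# Illustrative selection of the KV-cache block size per attention backend.
ATTENTION = os.environ.get("ATTENTION", "paged")

if ATTENTION in ("flashdecoding", "flashdecoding-ipex"):
    BLOCK_SIZE = 64   # the 64-token block size picked for flash decoding above
else:
    BLOCK_SIZE = 16   # classic paged-attention block size
```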
* feat: improve star coder to support multi lora layers
* feat: improve weight that support adapters and add tests for starcoder with lora
* fix: bump snapshot for added tests
* fix: rerun pre commit lints
* fix: bump adapter test for added later names
* Upgrading bitsandbytes.
Co-Authored-By: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
* Tighter lock.
---------
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
* Fix `docker run` in `README.md`
* Add line-break in `docker run` for readability
Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>
* Add line-break in `docker run` for readability
Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>
---------
Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>
* Baichuan2-13B does not have max_position_embeddings in config
see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
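A minimal sketch of the kind of fallback this calls for; `model_max_length` is what Baichuan2's config.json exposes instead, and the default value is an assumption, not necessarily the exact patch:
```python
def get_max_position_embeddings(config, default: int = 4096) -> int:
    # Fall back gracefully when the checkpoint's config.json (e.g.
    # Baichuan2-13B-Chat) does not define max_position_embeddings.
    value = getattr(config, "max_position_embeddings", None)
    return value or getattr(config, "model_max_length", None) or default
```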
* Update server/text_generation_server/models/flash_causal_lm.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* fmt
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
error like "ValueError: Expecting a ProcessGroup, but got a <class
'text_generation_server.utils.dist.FakeGroup'>. rank=0"
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
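A minimal sketch of one way to avoid handing the FakeGroup to torch.distributed, assuming the single-rank path can simply skip the collective; the helper name is hypothetical:
```python
import torch
import torch.distributed as dist

def maybe_all_reduce(tensor: torch.Tensor, group) -> torch.Tensor:
    # FakeGroup stands in for a real ProcessGroup on single-rank runs, so
    # skip the collective instead of passing it to torch.distributed,
    # which raises the ValueError quoted above.
    if group is not None and group.size() > 1:
        dist.all_reduce(tensor, group=group)
    return tensor
```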
* Basic flashinfer 0.2 support
This change does not use any of the new features yet, but makes
some small compatibility changes.
* Update to flashinfer 0.2.0.post1
* flashinfer: remove `contiguous` calls
* Fix flashinfer install
* flashinfer: fixup kv cache dtype
* Fix some annoying perturbations
* More output changes
* update ipex xpu to fix issue in ARC770
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add ats support
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Fix runtime error when Qwen2-VL was prompted with multiple images
Fix a runtime error when the Qwen2-VL model is prompted with a prompt
containing more than one image. The runtime error was:
File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign
The error was caused by the text_length variable going negative when
multiple images triggered multiple iterations of the main loop in the
get_position_ids function.
The error is a simple logic mistake: next_image_pos is initialized as a
relative offset from current_pos, but it was used as if it were an
absolute position from zero.
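A simplified, hypothetical sketch of the fix: treat next_image_pos as the relative offset it is, so the text span length never goes negative:
```python
import torch

def text_span_ids(current_pos: int, next_image_pos: int, device: str = "cpu"):
    # next_image_pos is an offset *relative to* current_pos, so it already
    # is the length of the text span. The buggy version computed
    #   text_length = next_image_pos - current_pos
    # which goes negative once the loop has advanced past the first image.
    text_length = next_image_pos
    return current_pos + torch.arange(text_length, device=device)

# e.g. 5 text tokens between position 100 and the next image token
print(text_span_ids(100, 5))  # tensor([100, 101, 102, 103, 104])
```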
* Fix runtime error when Qwen2-VL was prompted with multiple images
Fix a runtime error when the Qwen2-VL model is prompted with a prompt
containing more than one image. The runtime error was:
File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]
(The shape numbers in the error message can differ depending on the
input image resolutions.)
The error was caused by adding the wrong number of <|image_pad|> tokens
to the tokenized input in the image_text_replacement function.
The error is a simple logic mistake: the number of image pad tokens was
taken from the length of the first dimension of the pixel_value_shape
tensor. However, that tensor contains the patches from all of the
images, so the code added the total number of image pad tokens required
for the whole input at each image's location. This resulted in extra
image pad tokens being present in the tokenized input.
The fix is to compute the number of required tokens from the
image_grid_thw tensor, which contains grid_t, grid_h, and grid_w values
for each image. grid_t * grid_h * grid_w equals the total number of
patches for that image [1], and the number of required image pad tokens
is number_of_patches // 4.
[1] 31f9a289a6/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py (L311)
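A minimal sketch of the corrected per-image count described above; the helper name is illustrative:
```python
import torch

def image_pad_token_count(image_grid_thw: torch.Tensor, image_index: int) -> int:
    # image_grid_thw holds one (grid_t, grid_h, grid_w) row per image, so
    # the patch count must come from that row rather than from
    # pixel_values.shape[0], which mixes patches from every image.
    grid_t, grid_h, grid_w = image_grid_thw[image_index].tolist()
    return (grid_t * grid_h * grid_w) // 4  # 2x2 patch merge -> one pad token

# e.g. a single-frame image on a 16x16 patch grid needs 64 <|image_pad|> tokens
print(image_pad_token_count(torch.tensor([[1, 16, 16]]), 0))  # 64
```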
---------
Co-authored-by: Janne Alatalo <janne.alatalo@jamk.fi>