text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-07-05 23:40:17 +00:00

Author	SHA1	Message	Date
Mohit Sharma	4ef2e045c9	Add fp8 support moe models (#2928) * Add fp8 support moe models * flatten condition	2025-01-29 13:56:32 +01:00
Hugo Larcher	73b7cf83f6	Add backend name to telemetry (#2962) * feat: Add backend name to telemetry	2025-01-28 16:53:16 +01:00
Nicolas Patry	eb3df0f46f	Fixing the oom maybe with 2.5.1 change. (#2958)	2025-01-28 10:30:28 +01:00
Hugo Larcher	c690da5973	fix: Telemetry (#2957 ) * fix: add telemetry regular pings and fix unhandled errors avoid not sending telemetry stop events. * fix: simplify error handling * fix: update ping delay and update doc. * fix: clippy * doc: Rephrase properly.	2025-01-28 10:29:18 +01:00
Daniël de Kok	db922eb77e	Update to attention-kernels 0.2.0 (#2950 ) This version removes our patches/custom API. Makes it simpler to get changes from upstream. One of which is that we can enable FP8 KV cache for paged attention as well.	2025-01-27 11:42:36 +01:00
Funtowicz Morgan	40b00275b2	Attempt to remove AWS S3 flaky cache for sccache (#2953 ) * backend(trtllm): attempt to remove AWS S3 flaky cache for sccache * backend(trtllm): what if we expose ENV instead of inline? * backend(trtllm): and with the right env var for gha sccache * backend(trtllm): relax the way to detect sccache * backend(trtllm): make sccache definition manually * backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present * backend(trtllm): export env variable in run mb? * backend(trtllm): Cache mode max to cache intermediate layers * backend(trtllm): inject ompi_version build arg in dependent step	2025-01-27 11:21:48 +01:00
Nicolas Patry	6cb41a80a1	Revert "Remove AWS credentials?" This reverts commit `d2ff68e98d`.	2025-01-24 14:34:17 +01:00
Nicolas Patry	d2ff68e98d	Remove AWS credentials?	2025-01-24 12:18:28 +01:00
Nicolas Patry	d9dda11726	Trying to put back the archlist (to fix the oom). (#2947)	2025-01-24 09:32:17 +01:00
Nicolas Patry	d937eb64da	Fixing cargo lock.	2025-01-23 18:54:34 +01:00
Cyril Vallez	18c4607d46	Transformers backend TP fix (#2945 ) * init dispatch * cohere fix	2025-01-23 18:09:57 +01:00
Nicolas Patry	29a0893b67	Tmp tp transformers (#2942 ) * Upgrade the version number. * Remove modifications in Lock. * Tmp branch to test transformers backend with 2.5.1 and TP>1 * Fixing the transformers backend. inference_mode forces the use of `aten.matmul` instead of `aten.mm` the former doesn't have sharding support crashing the transformers TP support. `lm_head.forward` also crashes because it skips the hook that cast/decast the DTensor. Torch 2.5.1 is required for sharding support. * Put back the attention impl. * Revert the flashinfer (this will fails). * Building AOT. * Using 2.5 kernels. * Remove the archlist, it's defined in the docker anyway.	2025-01-23 18:07:30 +01:00
Funtowicz Morgan	0a89902663	[TRTLLM] Expose finish reason (#2841 ) * feat(trtllm): expose finish reason to Rust * misc(llamacpp): fix typo * misc(backend): update deps	2025-01-23 16:48:26 +01:00
Nikolai Kolodziej	4e172028aa	Add NVIDIA A40 to known cards (#2941) feat: add NVIDIA A40 to known cards	2025-01-23 14:19:21 +01:00
Alvaro Bartolome	6ab02931cf	Set `alias` for `max_completion_tokens` in `ChatRequest` (#2932)	2025-01-23 14:18:47 +01:00
Funtowicz Morgan	cc212154e0	Bump TensorRT-LLM backend dependency to v0.16.0 (#2931 ) * backend(trtllm): update to 0.16.0 * backend(trtllm): do not use shallow clone * backend(trtllm): use tag instead * backend(trtllm): move to nvidia remote instead of hf * backend(trtllm): reenable shallow clone * backend(trtllm): attempt to use ADD instead of RUN for openmpi * backend(trtllm): make sure we are using correct path for openmpi ADD in dockerfile * backend(trtllm): add correctly untar it	2025-01-23 13:54:40 +01:00
Daniël de Kok	1dd346666a	Clarify FP8-Marlin use on capability 8.9 (#2940 ) The log message stated that the GPU does not support FP8 on capability 8.9. However we use FP8-Marlin on that capability because it is faster.	2025-01-22 18:18:11 +01:00
Wang, Yi	1d3c9beba8	fix moe in quantization path (#2935 ) update ipex xpu to support moe for mixtral Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-01-22 14:36:15 +01:00
Nicolas Patry	2dfe3b3ee6	Upgrading the deps to have transformers==4.48.0 necessary (#2937)	2025-01-22 12:20:15 +01:00
Alvaro Bartolome	64a33c1f05	Run `pre-commit run --all-files` to fix CI (#2933)	2025-01-21 17:33:33 +01:00
Nicolas Patry	bdb3e488e4	Trying to avoid the random timeout. (#2929 ) * Trying to avoid the random timeout. * More read timeout ? * Longer timeout ? * Remove legacy ENV directive. * Remove the dummy test, only increase the read timeout. * Wat?	2025-01-21 11:06:10 +01:00
Funtowicz Morgan	17367438f3	Give TensorRT-LLMa proper CI/CD 😍 (#2886 ) * test(ctest) enable address sanitizer * feat(trtllm): expose finish reason to Rust * feat(trtllm): fix logits retrieval * misc(ci): enabe building tensorrt-llm * misc(ci): update Rust action toolchain * misc(ci): let's try to build the Dockerfile for trtllm # Conflicts: # Dockerfile_trtllm * misc(ci): provide mecanism to cache inside container * misc(ci): export aws creds as output of step * misc(ci): let's try this way * misc(ci): again * misc(ci): again * misc(ci): add debug profile * misc(ci): add debug profile * misc(ci): lets actually use sccache ... * misc(ci): do not build with ssl enabled * misc(ci): WAT * misc(ci): WAT * misc(ci): WAT * misc(ci): WAT * misc(ci): WAT * misc(backend): test with TGI S3 conf * misc(backend): test with TGI S3 conf * misc(backend): once more? * misc(backend): let's try with GHA * misc(backend): missing env directive * misc(backend): make sure to correctly set IS_GHA_BUILD=true in wf * misc(backend): ok let's debug smtg * misc(backend): WWWWWWWWWWWWWAAAAAAAA * misc(backend): kthxbye retry s3 * misc(backend): use session token * misc(backend): add more info * misc(backend): lets try 1h30 * misc(backend): lets try 1h30 * misc(backend): increase to 2h * misc(backend): lets try... * misc(backend): lets try... 
* misc(backend): let's build for ci-runtime * misc(backend): let's add some more tooling * misc(backend): add some tags * misc(backend): disable Werror for now * misc(backend): added automatic gha detection * misc(backend): remove leak sanitizer which is included in asan * misc(backend): forward env * misc(backend): forward env * misc(backend): let's try * misc(backend): let's try * misc(backend): again * misc(backend): again * misc(backend): again * misc(backend): again * misc(backend): again * misc(backend): fix sscache -> sccache * misc(backend): fix sscache -> sccache * misc(backend): fix sscache -> sccache * misc(backend): let's actually cache things now * misc(backend): let's actually cache things now * misc(backend): attempt to run the testS? * misc(backend): attempt to run the tests? * misc(backend): attempt to run the tests? * change runner size * fix: Correctly tag docker images (#2878) * fix: Correctly tag docker images * fix: Correctly tag docker images * misc(llamacpp): maybe? * misc(llamacpp): maybe? * misc(llamacpp): maybe? * misc(ci): gogogo * misc(ci): gogogo * misc(ci): gogogo * misc(ci): gogogo * misc(ci): gogogo * misc(ci): gogogo * misc(ci): go * misc(ci): go * misc(ci): go * misc(ci): use bin folder * misc(ci): make the wf callable for reuse * misc(ci): make the wf callable for reuse (bis) * misc(ci): make the wf callable for reuse (bis) * misc(ci): give the wf a name * Create test-trtllm.yml * Update test-trtllm.yml * Create build-trtllm2 * Rename build-trtllm2 to 1-build-trtllm2 * Rename test-trtllm.yml to 1-test-trtllm2.yml * misc(ci): fw secrets * Update 1-test-trtllm2.yml * Rename 1-build-trtllm2 to 1-build-trtllm2.yml * Update 1-test-trtllm2.yml * misc(ci): use ci-build.yaml as main dispatcher * Delete .github/workflows/1-test-trtllm2.yml * Delete .github/workflows/1-build-trtllm2.yml * misc(ci): rights? * misc(ci): rights? * misc(ci): once more? * misc(ci): once more? * misc(ci): baby more time? * misc(ci): baby more time? 
* misc(ci): try the permission above again? * misc(ci): try the permission above again? * misc(ci): try the permission scoped again? * misc(ci): install tensorrt_llm_executor_static * misc(ci): attempt to rebuild with sccache? * misc(ci):run the tests on GPU instance * misc(ci): let's actually setup sccache in the build.rs * misc(ci): reintroduce variables * misc(ci): enforce sccache * misc(ci): correct right job name dependency * misc(ci): detect dev profile for debug * misc(ci): detect gha build * misc(ci): detect gha build * misc(ci): ok debug * misc(ci): wtf * misc(ci): wtf2 * misc(ci): wtf3 * misc(ci): use commit HEAD instead of merge commit for image id * misc(ci): wtfinfini * misc(ci): wtfinfini * misc(ci): KAMEHAMEHA * Merge TRTLLM in standard CI * misc(ci): remove input machine * misc(ci): missing id-token for AWS auth * misc(ci): missing id-token for AWS auth * misc(ci): missing id-token for AWS auth * misc(ci): again... * misc(ci): again... * misc(ci): again... * misc(ci): again... * misc(ci): missing benchmark * misc(ci): missing backends * misc(ci): missing launcher * misc(ci): give everything aws needs * misc(ci): give everything aws needs * misc(ci): fix warnings * misc(ci): attempt to fix sccache not building trtllm * misc(ci): attempt to fix sccache not building trtllm again --------- Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com> Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> Co-authored-by: Pauline Bailly-Masson <155966238+paulinebm@users.noreply.github.com>	2025-01-21 10:19:16 +01:00
Cyril Vallez	b980848abf	Flash Transformers modeling backend support (#2913 ) * add transformers_flash * inits * switch version to make it work * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * Update Makefile-flash-att-v2 * runnable version * working * push change * fix high dim * init * default * latest transformers changes * revert * simplify check * remove flag * improve type hints + required args * Update based on transformers PR * small fix * Remove Warpers for Processor * fix compatibility version issue * raise error if needed * Simplify with monkey patch * revert + style + minor improvements * update comment * device check * move the import to avoid device issue * Update __init__.py * check for non-native models * oupsi --------- Co-authored-by: System administrator <root@ip-10-90-0-159.ec2.internal>	2025-01-21 10:01:51 +01:00
Nicolas Patry	447a5b2f87	Fixing TRTLLM dockerfile. (#2922 ) * Fixing TRTLLM dockerfile. * Fixed. * Creating a dummy modification to chekc CI runs. * Removing the cache directive. * Modifying this should cache hit. * Revert "Modifying this should cache hit." This reverts commit `46a2bde108`. * Modifying this should cache hit. * Unwanted files.	2025-01-20 11:13:46 +01:00
Daniël de Kok	630f198624	flashinfer: switch to plan API (#2904 ) This change doesn't switch `forward` to `run` yet, since it requires that we have access to the softmax scale and the logit softcap outside the model.	2025-01-17 18:18:02 +01:00
drbh	8f6146f11a	Revert "feat: improve qwen2-vl startup " (#2924 ) Revert "feat: improve qwen2-vl startup (#2802)" This reverts commit `eecca27113`.	2025-01-17 12:09:05 -05:00
drbh	eecca27113	feat: improve qwen2-vl startup (#2802 ) * feat: tokenize each request individually and increase warmup image size * feat: adjust rotary embed and avoid cuda graphs of size 2 and smaller * fix: address image resize and rebase changes * feat: update to run qwen2-vl tests * fix: tweak param types	2025-01-17 11:50:41 -05:00
Wang, Yi	6e982f43a1	fix the crash of meta-llama/Llama-3.2-1B (#2918 ) * fix the crash of meta-llama/Llama-3.2-1B Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review Simpler fix (which doesn't break vlms). --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-01-17 15:50:58 +01:00
Mohit Sharma	c20025dbf7	Add fp8 kv cache for ROCm (#2856 ) * add fp8 kv cache for rocm * improvements * update log statement * remove bookkeeping field	2025-01-17 18:43:29 +05:30
Nicolas Patry	de19e7e844	Moving to `uv` instead of `poetry`. (#2919 ) * Moving to `uv` instead of `poetry`. More in the standard, faster, seemingly better lockfile. * Creating venv if not created. * Create the venv. * Fix ? * Fixing the test by activating the environment ? * Install system ? * Add the cli entry point. * docker install on system * Monkeying this... * `--system` is redundant. * Trying to force-include this pb folder. * TRying to check that pb is imported correctly. * Editable install necessary ? * Non editable? * Editable it is.	2025-01-17 12:32:00 +01:00
Daniël de Kok	d61f14f271	nix: update to PyTorch 2.5.1 (#2921)	2025-01-17 12:12:11 +01:00
Wang, Yi	885144166f	Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu (#2815 ) * flash decoding Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable xpu flashdecoding Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * set flashdecoding blocksize as 64 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable flashdecoding, prefill chunking and prefix caching Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * add flashdecoding-ipex Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-01-17 12:04:57 +01:00
drbh	82f6ea1b71	feat: improve star coder to support multi lora layers (#2883 ) * feat: improve star coder to support multi lora layers * feat: improve weight that support adapters and add tests for starcoder with lora * fix: bump snapshot for added tests * fix: rerun pre commit lints * fix: bump adapter test for added later names	2025-01-16 16:23:55 -05:00
Daniël de Kok	5f78ec32a5	Do not convert weight scale to e4m3fnuz on CUDA (#2917)	2025-01-16 13:44:32 +01:00
Nicolas Patry	922cc38fbc	Upgrading bitsandbytes. (#2910 ) * Upgrading bitsandbytes. Co-Authored-By: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com> * Tighter lock. --------- Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>	2025-01-15 20:07:21 +01:00
Nicolas Patry	120bd3e3bb	Removing the github runner. (#2912)	2025-01-15 19:20:44 +01:00
Baptiste Colle	1470aec9d9	Fix typo in TPU docs (#2911) docs(tpu): fix typo	2025-01-15 18:32:07 +01:00
Nicolas Patry	203cade244	Upgrading our rustc version. (#2908) * Upgrading our rustc version. * Fixing the rust tests to proper version. * Clippy everything.	2025-01-15 17:04:03 +01:00
Baptiste Colle	46994b34fb	📝 add guide on using TPU with TGI in the docs (#2907)	2025-01-15 16:26:11 +01:00
Alvaro Bartolome	dc9b8e9814	Fix `docker run` in `README.md` (#2861 ) * Fix `docker run` in `README.md` * Add line-break in `docker run` for readability Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com> * Add line-break in `docker run` for readability Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com> --------- Co-authored-by: Daniël de Kok <danieldk@users.noreply.github.com>	2025-01-15 16:07:10 +01:00
Guspan Tanadi	3c7ae48f7f	docs(conceptual/speculation): available links Train Medusa (#2863)	2025-01-15 16:05:54 +01:00
Wang, Yi	cc8b9650bd	Baichuan2-13B does not have max_position_embeddings in config (#2903 ) * Baichuan2-13B does not have max_position_embeddings in config see https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/config.json Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Update server/text_generation_server/models/flash_causal_lm.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * fmt Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2025-01-15 15:56:52 +01:00
Mohit Sharma	e07acc7f68	Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm (#2825 ) * (feat) convert tscales to tensorwise * (fix) fp8 scaling for cuda * (kernel) add marlin-kernels * add moe-kernels * fix moe kernel comit * fix scaling * nm changes	2025-01-15 11:38:58 +05:30
Mohit Sharma	880ab9c2f3	Add Flash decoding kernel ROCm (#2855 ) * (vllm) updated vllm rocm kernels * revert silu * update partition size * remove grouped_topk * (nit) remove log * add flash decoding	2025-01-13 11:12:35 +01:00
Wang, Yi	1660154ae6	fix crash in torch2.6 if TP=1 (#2885 ) error like "ValueError: Expecting a ProcessGroup, but got a <class 'text_generation_server.utils.dist.FakeGroup'>. rank=0" Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-01-13 11:11:31 +01:00
Nicholas Broad	2e22164d4a	Update using_guidance.md (#2901 ) deletes one copy of a sentence that repeated twice	2025-01-13 11:09:35 +01:00
lazariv	83624a07be	Add possible variants for A100 and H100 GPUs for auto-detecting flops (#2837 ) * Update main.rs with A100 and H100 variants * Add another variant "nvidia-h100-nvl" * Update main.rs Add nvidia-a100-sxm4-40gb	2025-01-10 16:12:02 +01:00
Dmitry Dygalo	01067f8ba8	chore: Update jsonschema to 0.28.0 (#2870 ) * chore: Update jsonschema to 0.28.0 Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev> * chore: Enable blocking feature for reqwest Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev> --------- Signed-off-by: Dmitry Dygalo <dmitry@dygalo.dev>	2025-01-10 15:01:54 +01:00
Daniël de Kok	4f7e00f4ce	Update to marlin-kernels 0.3.7 (#2882 ) This fixes a race condition. See: https://github.com/vllm-project/vllm/pull/11493	2025-01-10 12:43:44 +01:00
drbh	da5ab46705	Improve vlm support (add idefics3 support) (#2437 ) * feat: expand vlm support and add image token logic and tests * fix: avoid unused perceiver config * feat: integrate image tokens into inputs embeds * feat: add simple idefics3 test * feat: update docs, image token logic and weight names * fix: improve image processing * feat: improve prefix for idefics3 * fix: bump idefics3 tests and snapshots * fix: improve text model loading * feat: consolidate changes with existing vlms and add support and test for smolvlm * fix: create new idefic3 file, simplify logic and adjust llama weight loading * fix: lint with ruff * fix: clean up idefics 3 and improve prefix handling * fix: improve typing * fix: improve prompt_split_image with ref to original impl * fix: adjust ruff lints and small refactors * fix: adjust FlashLlamaModel prefix logic	2025-01-09 10:35:32 -05:00

1345 Commits