text-generation-inference

mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-06-09 19:02:09 +00:00

Author	SHA1	Message	Date
Nicolas Patry	39eeb34d77	Working around the github runner thing.	2025-02-20 18:48:03 +01:00
Nicolas Patry	b4fefdcc45	Fighting docker in docker.	2025-02-20 16:55:50 +01:00
Nicolas Patry	215f39fb9d	Pull before running the container. `python docker` doesn't handle zstd correctly it seems.	2025-02-20 16:41:15 +01:00
Corentin REGAL	21493042ef	re-enable recompressed	2025-02-20 12:36:06 +01:00
Corentin REGAL	65c3b3bc21	re-enable zstd	2025-02-20 12:36:06 +01:00
Corentin REGAL	3953b8aaf7	revert zstd test	2025-02-20 12:36:06 +01:00
Corentin REGAL	87c5e19072	test fix ci	2025-02-20 12:36:06 +01:00
Corentin REGAL	51e7d98b6b	Compress Docker layers with zstd instead of gzip. Image is smaller but, most importantly, way faster to decompress. L4 g6.2xlarge (base) in 1m53.837s (1m53.837s including waiting). Image size: 5650343354 bytes. L4 g6.2xlarge (zstd) in 1m25.92s (1m25.92s including waiting). Image size: 4581485004 bytes.	2025-02-20 12:36:04 +01:00
Daniël de Kok	ed96ba6503	flashinfer 0.2.0.post1 -> post2 (#3040 ) * flashinfer 0.2.0.post1 -> post2 * Fix ruff stuff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-02-20 12:34:20 +01:00
Wang, Yi	feaa2477b7	update ipex and torch to 2.6 for cpu (#3039 ) ipex cpu 2.6 support topk_group in moe fusion ops Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-02-20 09:12:28 +01:00
Hugo Larcher	230aa25641	feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable for telemetry (#3027 ) * feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable to add info about the environment running TGI. That is useful to track usage in case of collaborations for example. * fix: trufflehog	2025-02-19 21:09:12 +01:00
Nicolas Patry	9c89d0070e	Having less logs in case of failure for checking CI more easily. (#3037 ) * Having less logs in case of failure for checking CI more easily. * Cleaning up the versions to uv for the client. * Ignore entirely the API.	2025-02-19 17:01:33 +01:00
Nicolas Patry	fde3234cbc	Using public external registry (to use external runners for CI). (#3031 ) * Using public external registry (to use external runners for CI). * Fix build. * Fixing the external registry. * Fixing trtllm tests.	2025-02-19 14:53:14 +01:00
drbh	d6a0c67e2f	feat: add initial qwen2.5-vl model and test (#2971 ) * feat: support qwen2.5 vl model * fix: bump support models doc * feat: check before rope type adjustment and small refactors * fix: add transformer overlay for processor support * fix: vendor processor and config from transformers * fix: refactor/simplify conditionals	2025-02-19 12:38:20 +01:00
Cyril Vallez	a7448661f7	Improve Transformers support (#2970 ) * Much better support * add gpt neox * bump transformers version * bump version	2025-02-18 19:04:34 +01:00
Nicolas Patry	5543fdc765	It's find in some machine. using hf_hub::api::sync::Api to download c… (#3030 ) On some machines, using hf_hub::api::sync::Api to download the config does not succeed, which makes warmup fail because attributes like max_position_embeddings cannot be read. Updating hf-hub to the latest version fixes it. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2025-02-18 12:19:51 +01:00
Nicolas Patry	b8a4928d0e	Pinning trufflehog. (#3032 )	2025-02-18 12:03:41 +01:00
Alvaro Bartolome	8a1cfd6122	Add `loop_controls` feature to `minijinja` to handle `{% break %}` (#2998 ) * Add `loop_controls` feature to `minijinja` * Add `test_chat_template_loop_controls` to test `break`	2025-02-18 10:33:22 +01:00
celsowm	794ec58b75	Update README.md (#3024 ) only way to avoid: error: experimental Nix feature 'nix-command' is disabled; add '--extra-experimental-features nix-command' to enable it	2025-02-18 10:08:28 +01:00
Daniël de Kok	f0ed76583c	Use eetq kernel from the hub (#3029 ) * Use eetq kernel from the hub * Fixing the CI. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2025-02-18 10:03:53 +01:00
Adrien Gallouët	cfd4fbb479	[Backend] Add Llamacpp backend (#2975 ) * Add llamacpp backend Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Get rid of llama_batch_get_one() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use max_batch_total_tokens Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Handle max_batch_size Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add some input validation checks Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Handle ctx args & fix sampling Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add GPU args Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add --defrag-threshold Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add a stupid batch mechanism Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add --numa Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix args Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Enable flash attention by default Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add --offload-kqv Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix batch_pos Signed-off-by: Adrien Gallouët <angt@huggingface.co> * backend(llama): add CUDA Dockerfile_llamacpp for now * Only export the latest logits Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Output real logprobs Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix batching Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix seq iterations Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Auto-detect n_threads when not provided Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Clear request cache after completion Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove warmup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * backend(llama): add CUDA architectures build argument for Dockerfile * Add specific args for batch 
Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add --type-v & --type-k Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Bump llamacpp to b4623 Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Disable graceful shutdown in debug mode Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update Dockerfile_llamacpp Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Dockerfile Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update Cargo.lock Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update args Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Simplify batching logic Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Rename bindings Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove n_ctx Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Make max_batch_total_tokens optional Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Ensure all samplers are freed on error Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Initialize penalty_last_n with llamacpp default value Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Improve default settings Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add doc Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update docs Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Thanks clippy Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Thanks cargo fmt Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update docs Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Do not use HOSTNAME env Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Bump llama.cpp & cuda Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix requirements.txt Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix fmt Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Enable KQV 
offload by default Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove Ngrok tunneling Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove .cargo/config.toml Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix Dockerfile Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add missing cuda prefix Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Handle custom llama.cpp dir Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add README.md Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add HF transfer Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix bool args Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update doc Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update doc Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>	2025-02-14 13:40:57 +01:00
Daniël de Kok	6df0fc0b55	Support sigmoid scoring function in GPTQ-MoE (#3017 )	2025-02-14 11:33:49 +01:00
Nicolas Patry	d6881c37ab	Putting back the NCCL forced upgrade. (#2999 ) * Putting back the NCCL forced upgrade. * . * ... * Ignoring conda. * Dropping conda from the build system + torch 2.6 * Cache min. * Rolling back torch version. * Reverting the EETQ modification. * Fix flash attention ? * Actually stay on flash v1. * Patching flash v1. * Torch 2.6, fork of rotary, eetq updated. * Put back nccl latest (override torch). * Slightly more reproducible build and not as scary.	2025-02-14 11:31:59 +01:00
Nicolas Patry	8a211dc7fc	Preventing single user hugging the server to death by asking for way too many tokens. (#3016 )	2025-02-13 11:23:17 +01:00
Nicolas Patry	4cccce4b44	Update the flaky mllama test. (#3015 )	2025-02-12 12:26:52 +01:00
Wang, Yi	76bcb4948d	fix Qwen VL break in intel platform (#3002 ) * fix Qwen VL break in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * could use PositionRotaryEmbedding impl so rocm and ipex could all work Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-02-12 11:31:34 +01:00
Nicolas Patry	b86c3947ab	Revert "Update the flaky mllama test." This reverts commit `8a870b31b9`.	2025-02-11 17:13:06 +01:00
Nicolas Patry	8a870b31b9	Update the flaky mllama test.	2025-02-11 17:10:36 +01:00
Daniël de Kok	571ac9b507	Use kernels from the kernel hub (#2988 ) * Use Hub kernels for Marlin and cutlass quantization kernels * Use hub kernels for MoE/GPTQ-Marlin MoE * Use attention kernels from the Hub * Cache the kernels in the Docker image * Update moe kernels * Support loading local kernels for development * Support latest moe kernels * Update to moe 0.1.1 * CI: download locked kernels for server tests * Fixup some imports * CI: activate venv * Fix unused imports * Nix: add attention/moe/quantization kernels * Update hf-kernels to 0.1.5 * Update kernels * Update tgi-nix flake for hf-kernels * Fix EOF * Take `load_kernel` out of a frequently-called function * Hoist another case of kernel loading out of a somewhat hot function * marlin-kernels -> quantization * attention -> paged-attention * EOF fix * Update hf-kernels, fixup Docker * ipex fix * Remove outdated TODO	2025-02-10 19:19:25 +01:00
Nicolas Patry	4b8cda684b	Updating mllama after strftime. (#2993 ) * Updating mllama after strftime. * Town instead village. * Forgot the integration snapshot. * Attempt to fix intel CPU. * Intel extension fix. * Workaround intel. * Moving those deps directly into pyproject. * Revert "Moving those deps directly into pyproject." This reverts commit `98c1496ea6`. * Non system uv. * Fixing the docker environment hopefully. * Missed a step. * Move workdir up a bit. * Bailing out of reproducible python env. * Triton version.	2025-02-07 10:38:13 +01:00
Funtowicz Morgan	856709d5c3	[Backend] Bump TRTLLM to v.0.17.0 (#2991 ) * backend(trtllm): bump TRTLLM to v.0.17.0 * backend(trtllm): forget to bump dockerfile * backend(trtllm): use arg instead of env * backend(trtllm): use correct library reference decoder_attention_src * backend(trtllm): link against decoder_attention_{0|1} * backend(trtllm): build against gcc-14 with cuda12.8 * backend(trtllm): use return value optimization flag as an error if available * backend(trtllm): make sure we escalate all warnings as errors on the backend impl in debug mode * backend(trtllm): link against CUDA 12.8	2025-02-06 16:45:03 +01:00
Wang, Yi	36223f834e	Triton fix (#2995 ) fix triton to 3.1.0 to fix ipex import issue Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2025-02-06 12:28:41 +01:00
Nicolas Patry	0ef8c8a97a	Using the "lockfile". (#2992 ) * Using the "lockfile". * Revert dummy modifications. * Lock on python 3.11 * Another attempt. * .. * Bad cache hits. * The good old monkey. * How in the world... * We need the launcher still. * . * .. * Attempt #42 * Don't break all other builds. * Mode max. * Applying to other builds.	2025-02-06 12:28:24 +01:00
drbh	c1cf36c0dc	Improve qwen vl impl (#2943 ) * feat: refactor model, improve startup and re enable tests * fix: improve multimodal rotary embed caching * fix: limit vision flop calc to qwen2 vl models and update config typing * fix: include clippy lint * feat: refactor position ids in warmup and bump tests * fix: prefer default dtype * fix: enable all cuda graphs and bump snapshots * fix: adjust rotary init path * fix: simplify get position ids and remove unused vision config * fix: update position ids so first dim is batch, simplify rotary and bump vlm default token limit * fix: improve position id init during cuda warmup for mrope and simplify rotary forward * fix: check existence before accessing rope type in cuda warmup * fix: check key before access * fix: improve mrope check in cuda graph warmup * fix: remove check for default rope type * fix: add more test and improve model generation * fix: improve and simplify get_cos_sin, refactors and cleanup get_position_ids * fix: adjust signatures with types	2025-02-04 12:44:18 -05:00
Daniël de Kok	dd2bd5fdb3	impureWithCuda: fix gcc version (#2990 ) * impureWithCuda: fix gcc version * trufflehog: do not fail on unverified results	2025-02-04 17:01:59 +01:00
Alvaro Bartolome	88fd56f549	Add `strftime_now` callable function for `minijinja` chat templates (#2983 ) * Add `chrono` and `strftime_now` function callable * Fix `test_chat_template_valid_with_strftime_now` * Fix `test_chat_template_valid_with_strftime_now`	2025-02-03 15:30:48 +01:00
Hugo Larcher	e3f2018cb5	hotfix: fix trtllm CI build on release (#2981 ) * hotfix: fix trtllm CI build on release * fix: test release. * fix: test release. * fix: test release. env not recognized https://github.com/actions/runner/issues/1661 * fix: test release. Works.	2025-02-03 11:11:15 +01:00
Nicolas Patry	bb69c5b199	Back on nix main. (#2979 )	2025-01-31 14:39:52 +01:00
Nicolas Patry	c9d68945cc	Prepare for release 3.1.0 (#2972 ) * Prepare for release 3.1.0 * Back on main flake. * Fixing stuff. * Upgrade to moe-kernels 0.8.2 for Hip support. * Deactivating the flaky test.	2025-01-31 14:19:01 +01:00
Mohit Sharma	c07a2cc82b	Update moe-kernel to 0.8.2 for rocm (#2977 ) update moe-kernel for amd	2025-01-31 11:40:00 +01:00
Hugo Larcher	065aabb13d	doc: Update TRTLLM deployment doc. (#2960 ) * doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI. * doc: Update TRTLLM deployment doc. Update TRTLLM CI to allow release builds when tagging TGI. * fix: PR comments	2025-01-30 18:04:42 +01:00
Nicolas Patry	cb747b33da	Add deepseekv3 (#2968 ) * Add fp8 support moe models add deepseekv3 format code update dockerfile update doc * Small modifications. * Moe kernels 0.8.1 * Upgrade to 0.8.1 * Fixing moe import. * Black. * Apply suggestions from code review Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com> * Fixing Mixtral + Nits. * Put link to ref. * Fix other call locations. * Scoring func `softmax` is the only one that works. --------- Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>	2025-01-30 16:40:25 +01:00
Nicolas Patry	80e7d98f88	Hotfixing intel-cpu (not sure how it was working before). (#2967 ) * Hotfixing intel-cpu (not sure how it was working before). * Do not fail on missing moe-kernels (Intel-cpu).	2025-01-29 22:34:41 +01:00
Daniël de Kok	ee0dffcd14	Update to moe-kernels 0.8.0 (#2966 )	2025-01-29 18:19:55 +01:00
Mohit Sharma	4ef2e045c9	Add fp8 support moe models (#2928 ) * Add fp8 support moe models * flatten condition	2025-01-29 13:56:32 +01:00
Hugo Larcher	73b7cf83f6	Add backend name to telemetry (#2962 ) * feat: Add backend name to telemetry	2025-01-28 16:53:16 +01:00
Nicolas Patry	eb3df0f46f	Fixing the oom maybe with 2.5.1 change. (#2958 )	2025-01-28 10:30:28 +01:00
Hugo Larcher	c690da5973	fix: Telemetry (#2957 ) * fix: add telemetry regular pings and fix unhandled errors avoid not sending telemetry stop events. * fix: simplify error handling * fix: update ping delay and update doc. * fix: clippy * doc: Rephrase properly.	2025-01-28 10:29:18 +01:00
Daniël de Kok	db922eb77e	Update to attention-kernels 0.2.0 (#2950 ) This version removes our patches/custom API. Makes it simpler to get changes from upstream. One of which is that we can enable FP8 KV cache for paged attention as well.	2025-01-27 11:42:36 +01:00
Funtowicz Morgan	40b00275b2	Attempt to remove AWS S3 flaky cache for sccache (#2953 ) * backend(trtllm): attempt to remove AWS S3 flaky cache for sccache * backend(trtllm): what if we expose ENV instead of inline? * backend(trtllm): and with the right env var for gha sccache * backend(trtllm): relax the way to detect sccache * backend(trtllm): make sccache definition manually * backend(trtllm): ok let's try to define the launchers in build.rs when rustc_wrapper is present * backend(trtllm): export env variable in run mb? * backend(trtllm): Cache mode max to cache intermediate layers * backend(trtllm): inject ompi_version build arg in dependent step	2025-01-27 11:21:48 +01:00


1289 Commits
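
Commit 51e7d98b6b reports two measurements for the gzip ("base") and zstd images on an L4 g6.2xlarge. As a rough sketch of what those figures imply, the snippet below converts the reported durations to seconds and computes the relative savings. The helper names (`parse_duration`, `relative_saving`) and the dictionary layout are illustrative assumptions, not part of the repository, and the commit message does not say exactly which step was timed.

```python
import re


def parse_duration(text: str) -> float:
    """Convert a duration such as '1m53.837s' or '42.5s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def relative_saving(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Figures copied from the commit message of 51e7d98b6b.
GZIP = {"elapsed": "1m53.837s", "size_bytes": 5650343354}
ZSTD = {"elapsed": "1m25.92s", "size_bytes": 4581485004}

size_saving = relative_saving(GZIP["size_bytes"], ZSTD["size_bytes"])
time_saving = relative_saving(
    parse_duration(GZIP["elapsed"]), parse_duration(ZSTD["elapsed"])
)

print(f"image size reduced by {size_saving:.1%}")
print(f"elapsed time reduced by {time_saving:.1%}")
```

With the reported numbers this comes to an image about 18.9% smaller and an elapsed time about 24.5% shorter. Both are single measurements from one commit message, so they are indicative rather than a benchmark.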