From 7a4084473463d1f4c5dc27e00a27b32e70ec5af1 Mon Sep 17 00:00:00 2001
From: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
Date: Fri, 7 Mar 2025 13:15:41 +0100
Subject: [PATCH] Update `--max-batch-total-tokens` description

---
 docs/source/reference/launcher.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/reference/launcher.md b/docs/source/reference/launcher.md
index 159b22e7..2d9b58d9 100644
--- a/docs/source/reference/launcher.md
+++ b/docs/source/reference/launcher.md
@@ -198,7 +198,7 @@
 
   For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
 
-  Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters like if you're using quantization, flash attention or the model implementation, text-generation-inference cannot infer this number automatically.
+  Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters, such as whether you're using quantization, flash attention, or a particular model implementation, text-generation-inference infers this number automatically when it is not provided, ensuring that the value is as large as possible.
 
   [env: MAX_BATCH_TOTAL_TOKENS=]
 
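
As a usage note for the behavior described in the updated line, the sketch below shows one way to launch text-generation-inference with and without an explicit `--max-batch-total-tokens`. The Docker image tag, the model id, and the `32768` budget are illustrative assumptions, not values taken from this patch.

```shell
# Explicit budget: cap each batch at 32768 total tokens
# (the value and model id are assumptions for illustration).
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceH4/zephyr-7b-beta \
    --max-batch-total-tokens 32768

# Automatic budget: omit the flag and let the launcher infer the largest
# value that fits the memory remaining after the model is loaded.
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceH4/zephyr-7b-beta
```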