Update --max-batch-total-tokens description (#3083)

* Update `--max-batch-total-tokens` description

* Update docstring in `launcher/src/main.rs` instead
Alvaro Bartolome 2025-03-07 14:24:26 +01:00 committed by GitHub
parent 036d802b62
commit 55a6618434
2 changed files with 3 additions and 3 deletions


@@ -198,7 +198,7 @@ Options:
 For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
-Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters like if you're using quantization, flash attention or the model implementation, text-generation-inference cannot infer this number automatically.
+Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters like if you're using quantization, flash attention or the model implementation, text-generation-inference infers this number automatically if not provided ensuring that the value is as large as possible.
 [env: MAX_BATCH_TOTAL_TOKENS=]
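To make the budget arithmetic in the help text concrete, here is a tiny standalone Rust snippet (illustration only, not launcher code) that reproduces the `max_batch_total_tokens=1000` example above:

```rust
fn main() {
    // Budget from the help text above: max_batch_total_tokens = 1000.
    let max_batch_total_tokens: u32 = 1000;

    // Ten requests of 100 tokens each fit within the batch budget...
    let requests_of_100 = max_batch_total_tokens / 100;
    assert_eq!(requests_of_100, 10);

    // ...or a single request may use the whole 1000-token budget.
    let requests_of_1000 = max_batch_total_tokens / 1000;
    assert_eq!(requests_of_1000, 1);

    println!("{requests_of_100} x 100-token requests or {requests_of_1000} x 1000-token request");
}
```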


@@ -702,8 +702,8 @@ struct Args {
     /// Overall this number should be the largest possible amount that fits the
     /// remaining memory (after the model is loaded). Since the actual memory overhead
     /// depends on other parameters like if you're using quantization, flash attention
-    /// or the model implementation, text-generation-inference cannot infer this number
-    /// automatically.
+    /// or the model implementation, text-generation-inference infers this number automatically
+    /// if not provided ensuring that the value is as large as possible.
     #[clap(long, env)]
     max_batch_total_tokens: Option<u32>,
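
For context, here is a minimal sketch of how an optional `Option<u32>` argument like this is typically resolved: use the explicit flag or environment variable when given, and fall back to an automatically inferred value otherwise. This is not the launcher's real logic; `infer_max_batch_total_tokens` and its return value are hypothetical placeholders.

```rust
use clap::Parser;

/// Minimal sketch only; mirrors the field above but is not TGI's actual launcher.
#[derive(Parser, Debug)]
struct Args {
    /// Hard cap on the total number of tokens per batch; inferred when unset.
    #[clap(long, env)]
    max_batch_total_tokens: Option<u32>,
}

/// Hypothetical stand-in for the launcher's automatic inference, which in
/// reality depends on free memory after model load, quantization, etc.
fn infer_max_batch_total_tokens() -> u32 {
    32_768 // placeholder value for illustration only
}

fn main() {
    let args = Args::parse();
    // Prefer the value passed via --max-batch-total-tokens / MAX_BATCH_TOTAL_TOKENS,
    // otherwise fall back to the inferred budget.
    let budget = args
        .max_batch_total_tokens
        .unwrap_or_else(infer_max_batch_total_tokens);
    println!("max_batch_total_tokens = {budget}");
}
```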