Mirror of https://github.com/huggingface/text-generation-inference.git (synced 2025-04-23 16:02:10 +00:00)
Update `--max-batch-total-tokens` description
This commit is contained in:
parent 036d802b62
commit 7a40844734
```diff
@@ -198,7 +198,7 @@ Options:
 
 For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
 
-Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters like if you're using quantization, flash attention or the model implementation, text-generation-inference cannot infer this number automatically.
+Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters like if you're using quantization, flash attention or the model implementation, text-generation-inference infers this number automatically if not provided ensuring that the value is as large as possible.
 
 [env: MAX_BATCH_TOTAL_TOKENS=]
 
```
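For illustration only, here is a minimal sketch of the token-budget rule the description above refers to. The function name and structure are hypothetical (not part of text-generation-inference); only the `max_batch_total_tokens=1000` arithmetic comes from the documented example.

```python
# Hypothetical sketch of the batch token budget described in the diff above.
# Names are illustrative; only the 1000-token arithmetic is from the docs text.

def fits_in_batch(query_total_tokens: list[int], max_batch_total_tokens: int) -> bool:
    """Return True if the queries' combined token count fits within the batch budget."""
    return sum(query_total_tokens) <= max_batch_total_tokens

# For max_batch_total_tokens=1000:
assert fits_in_batch([100] * 10, 1000)      # ten queries of total_tokens=100
assert fits_in_batch([1000], 1000)          # a single query of 1000 tokens
assert not fits_in_batch([100] * 11, 1000)  # an eleventh 100-token query no longer fits
```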