From 16386b83e116c95ee99a52e87a0bf691e76c4133 Mon Sep 17 00:00:00 2001
From: Nicolas Patry
Date: Fri, 12 Apr 2024 10:28:49 +0000
Subject: [PATCH] Forgot the doc again.

---
 docs/source/basic_tutorials/launcher.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/basic_tutorials/launcher.md b/docs/source/basic_tutorials/launcher.md
index 627aff93..69e58b20 100644
--- a/docs/source/basic_tutorials/launcher.md
+++ b/docs/source/basic_tutorials/launcher.md
@@ -133,7 +133,7 @@ Options:
 ## MAX_INPUT_TOKENS
 ```shell
       --max-input-tokens
-          This is the maximum allowed input length (expressed in number of tokens) for users. The larger this value, the longer prompt users can send which can impact the overall memory required to handle the load. Please note that some models have a finite range of sequence they can handle. Default to min(max_position_embeddings - 1, 16383)
+          This is the maximum allowed input length (expressed in number of tokens) for users. The larger this value, the longer the prompts users can send, which can impact the overall memory required to handle the load. Please note that some models have a finite range of sequences they can handle. Defaults to min(max_position_embeddings - 1, 4095)
 
           [env: MAX_INPUT_TOKENS=]
 
@@ -149,7 +149,7 @@ Options:
 ## MAX_TOTAL_TOKENS
 ```shell
       --max-total-tokens
-          This is the most important value to set as it defines the "memory budget" of running clients requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. with a value of `1512` users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger amount each request will be in your RAM and the less effective batching can be. Default to min(max_position_embeddings, 16384)
+          This is the most important value to set as it defines the "memory budget" of running clients' requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. With a value of `1512`, users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger each request will be in your RAM and the less effective batching can be. Defaults to min(max_position_embeddings, 4096)
 
           [env: MAX_TOTAL_TOKENS=]
 
@@ -168,7 +168,7 @@ Options:
 ## MAX_BATCH_PREFILL_TOKENS
 ```shell
       --max-batch-prefill-tokens
-          Limits the number of tokens for the prefill operation. Since this operation take the most memory and is compute bound, it is interesting to limit the number of requests that can be sent. Default to min(max_input_length + 50, 16384) to give a bit of room
+          Limits the number of tokens for the prefill operation. Since this operation takes the most memory and is compute bound, it is useful to limit the number of requests that can be sent. Defaults to `max_input_length + 50` to give a bit of room
 
           [env: MAX_BATCH_PREFILL_TOKENS=]
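The three flags touched by this patch form a single budget: `max_input_tokens` must stay below `max_total_tokens` (since at least one token is generated), and `max_batch_prefill_tokens` caps how many prompt tokens one prefill batch may contain. Below is a minimal launch sketch that writes the new defaults out by hand; the `text-generation-launcher` binary is the launcher this doc describes, while the model id is purely illustrative.

```shell
# Sketch: the new 4096-based defaults made explicit on the command line.
# The model id is an illustrative placeholder; substitute your own.
text-generation-launcher \
    --model-id mistralai/Mistral-7B-v0.1 \
    --max-input-tokens 4095 \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 4145
# Budget check: a client may send a 4095-token prompt and request 1 new
# token, or a 1-token prompt and request up to 4095 new tokens
# (prompt + max_new_tokens <= max_total_tokens = 4096).
# 4145 = max_input_tokens + 50, mirroring the new prefill default.
```

As the `[env: ...=]` annotations indicate, each flag can equivalently be set through its environment variable (`MAX_INPUT_TOKENS`, `MAX_TOTAL_TOKENS`, `MAX_BATCH_PREFILL_TOKENS`).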