add docs

commit 69c3d79a1c
parent bce5e22444
@@ -7,10 +7,18 @@
     title: Installation
   - local: supported_models
     title: Supported Models and Hardware
+  - local: launch_parameters
+    title: Configuring TGI
+  - local: guides
+    title: Guides
   title: Getting started
 - sections:
   - local: basic_tutorials/consuming_tgi
     title: Consuming TGI
+  - local: basic_tutorials/customize_inference
+    title: Control/Customize Inference
+  - local: basic_tutorials/stream
+    title: Stream Responses
   - local: basic_tutorials/preparing_model
     title: Preparing Model for Serving
   - local: basic_tutorials/gated_model_access
docs/source/basic_tutorials/customize_inference.md (new file, 29 lines)

# Control/Customize Inference Generation with Text Generation Inference

Text Generation Inference supports different parameters to control the generation, defined in the `parameters` attribute of the payload.

```bash
curl https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud \
    -X POST \
    -d '{"inputs":"Once upon a time,", "parameters": {"max_new_tokens": 256}}' \
    -H "Authorization: Bearer <hf_token>" \
    -H "Content-Type: application/json"
```

As of today, the following parameters are supported:

- `temperature`: Controls randomness in the model. Lower values will make the model more deterministic and higher values will make the model more random. Default value is 1.0.
- `max_new_tokens`: The maximum number of tokens to generate. Default value is 20, max value is 512.
- `repetition_penalty`: Controls the likelihood of repetition. Default is `null`.
- `seed`: The seed to use for random generation. Default is `null`.
- `stop`: A list of tokens to stop the generation. The generation will stop when one of the tokens is generated.
- `top_k`: The number of highest-probability vocabulary tokens to keep for top-k filtering. Default value is `null`, which disables top-k filtering.
- `top_p`: The cumulative probability threshold of the highest-probability vocabulary tokens to keep for nucleus sampling. Default is `null`.
- `do_sample`: Whether or not to use sampling; greedy decoding is used otherwise. Default value is `false`.
- `best_of`: Generate `best_of` sequences and return the one with the highest token log probabilities. Default is `null`.
- `details`: Whether or not to return details about the generation. Default value is `false`.
- `return_full_text`: Whether or not to return the full text or only the generated part. Default value is `false`.
- `truncate`: Whether or not to truncate the input to the maximum length of the model. Default value is `true`.
- `typical_p`: The typical probability of a token. Default value is `null`.
- `watermark`: Whether to apply watermarking to the generation. Default value is `false`.
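
If you prefer Python over `curl`, the same payload can be sent with the `requests` library. The sketch below is illustrative only: the endpoint URL, token, and parameter values are placeholders, not defaults.

```python
import requests

# Placeholder values: replace with your own endpoint URL and token.
endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
headers = {
    "Authorization": "Bearer hf_YOUR_TOKEN",
    "Content-Type": "application/json",
}

# Generation controls go in the `parameters` attribute of the payload,
# exactly as in the curl example above (values here are illustrative).
payload = {
    "inputs": "Once upon a time,",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
        "stop": ["\n\n"],
    },
}

response = requests.post(endpoint_url, headers=headers, json=payload)
print(response.json())
```
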
docs/source/basic_tutorials/stream.md (new file, 94 lines)

# Stream responses in JavaScript and Python

Requesting and generating text with LLMs can be a time-consuming and iterative process. A great way to improve the user experience is streaming tokens to the user as they are generated. Below are two examples of how to stream tokens using Python and JavaScript. For Python, we are going to use the **[client from Text Generation Inference](https://github.com/huggingface/text-generation-inference/tree/main/clients/python)**, and for JavaScript, the **[HuggingFace.js library](https://huggingface.co/docs/huggingface.js/main/en/index)**.

## Streaming requests with Python

First, you need to install the `huggingface_hub` library:

`pip install -U huggingface_hub`

We can create an `InferenceClient`, providing our endpoint URL and credentials alongside the hyperparameters we want to use.

```python
from huggingface_hub import InferenceClient

# HF Inference Endpoints parameters
endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
hf_token = "hf_YOUR_TOKEN"

# Streaming Client
client = InferenceClient(endpoint_url, token=hf_token)

# generation parameters
gen_kwargs = dict(
    max_new_tokens=512,
    top_k=30,
    top_p=0.9,
    temperature=0.2,
    repetition_penalty=1.02,
    stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
)
# prompt
prompt = "What can you do in Nuremberg, Germany? Give me 3 Tips"

stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)

# yield each generated token
for r in stream:
    # skip special tokens
    if r.token.special:
        continue
    # stop if we encounter a stop sequence
    if r.token.text in gen_kwargs["stop_sequences"]:
        break
    # yield the generated token
    print(r.token.text, end="")
    # yield r.token.text
```

Replace the `print` call with `yield` or with any function you want to stream the tokens to; one way to do this is sketched below.
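
For example, here is a minimal sketch that wraps the loop above in a generator, reusing the `client` and `gen_kwargs` defined earlier. The `stream_tokens` name is purely illustrative and not part of any client API.

```python
def stream_tokens(prompt: str):
    """Yield generated tokens one by one, reusing `client` and `gen_kwargs` from above."""
    stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
    for r in stream:
        # skip special tokens and stop on stop sequences, as in the loop above
        if r.token.special:
            continue
        if r.token.text in gen_kwargs["stop_sequences"]:
            break
        yield r.token.text

# usage: consume the generator wherever the tokens are needed
for token in stream_tokens("What can you do in Nuremberg, Germany? Give me 3 Tips"):
    print(token, end="")
```
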

![Streaming tokens with the Python client](https://huggingface.co/blog/assets/155_inference_endpoints_llm/python-stream.gif)

## Streaming requests with JavaScript

First, you need to install the `@huggingface/inference` library:

`npm install @huggingface/inference`

We can create an `HfInferenceEndpoint`, providing our endpoint URL and credentials alongside the hyperparameters we want to use.

```js
import { HfInferenceEndpoint } from '@huggingface/inference'

const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.cloud', 'hf_YOUR_TOKEN')

// generation parameters
const gen_kwargs = {
  max_new_tokens: 512,
  top_k: 30,
  top_p: 0.9,
  temperature: 0.2,
  repetition_penalty: 1.02,
  stop_sequences: ['\nUser:', '<|endoftext|>', '</s>'],
}
// prompt
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'

const stream = hf.textGenerationStream({ inputs: prompt, parameters: gen_kwargs })
for await (const r of stream) {
  // skip special tokens
  if (r.token.special) {
    continue
  }
  // stop if we encounter a stop sequence
  if (gen_kwargs['stop_sequences'].includes(r.token.text)) {
    break
  }
  // yield the generated token
  process.stdout.write(r.token.text)
}
```

Replace the `process.stdout` call with `yield` or with any function you want to stream the tokens to.
docs/source/guides.md (new file, 6 lines)

# Text Generation Inference Guides

Text Generation Inference (TGI) is integrated into the Hugging Face ecosystem and can be used in many different ways. Below is a list of guides to help you get started with TGI.

* [Deploy LLMs with Hugging Face Inference Endpoints](https://huggingface.co/blog/inference-endpoints-llm)
* [Introducing the Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm)

@@ -25,3 +25,4 @@ Text Generation Inference is used in production by multiple projects, such as:
 - [Hugging Chat](https://github.com/huggingface/chat-ui), an open-source interface for open-access models, such as Open Assistant and Llama
 - [OpenAssistant](https://open-assistant.io/), an open-source community effort to train LLMs in the open
 - [nat.dev](http://nat.dev/), a playground to explore and compare LLMs.
+- [Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm), a purpose-built container for Amazon SageMaker to deploy LLMs in production

docs/source/launch_parameters.md (new file, 74 lines)

# Configuration parameters for Text Generation Inference

Text Generation Inference allows you to customize the way you serve your models. You can use the following parameters to configure your server. You can enable them by setting them as environment variables or by providing them as arguments when running `text-generation-launcher`. Environment variables are in `UPPER_CASE` and arguments are in `lower_case`.
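
As a rough illustration of the two styles, the sketch below launches the server from Python via `subprocess`. This is purely illustrative (any shell works the same way), and the exact flag spelling is whatever `text-generation-launcher --help` reports.

```python
import os
import subprocess

# Pick one style; both are shown here only for comparison.

# Style 1: environment variable (UPPER_CASE), picked up by text-generation-launcher
env = dict(os.environ, MODEL_ID="bigscience/bloom-560m")
subprocess.run(["text-generation-launcher"], env=env)

# Style 2: command-line argument (lower_case); check --help for the exact flag name
subprocess.run(["text-generation-launcher", "--model-id", "bigscience/bloom-560m"])
```
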

## Model

- `MODEL_ID` - The name of the model to load. Can be a model ID as listed on huggingface.co, like `gpt2` or `OpenAssistant/oasst-sft-1-pythia-12b`, or a local directory containing the necessary files as saved by the `save_pretrained(...)` method of transformers. Default: `bigscience/bloom-560m`.

- `REVISION` - The actual revision of the model if you're referring to a model on the Hugging Face Hub. You can use a specific commit id or a branch like `refs/pr/2`.

- `QUANTIZE` - Whether you want the model to be quantized. This will use `bitsandbytes` for quantization on the fly, or `gptq`. 4-bit quantization is available through `bitsandbytes` by providing the `bitsandbytes-fp4` or `bitsandbytes-nf4` options.

- `DTYPE` - The dtype to be forced upon the model. This option cannot be used with `--quantize`.

- `TRUST_REMOTE_CODE` - Whether you want to execute Hub modelling code. Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. Default: false

- `DISABLE_CUSTOM_KERNELS` - For some models (like bloom), text-generation-inference implemented custom CUDA kernels to speed up inference. Those kernels were only tested on A100. Use this flag to disable them if you're running on different hardware and encounter issues. Default: false

- `ROPE_SCALING` - Rope scaling will only be used for RoPE models and allows rescaling the rotary position embeddings to accommodate larger prompts. Goes together with `rope_factor`. Default: linear

- `ROPE_FACTOR` - Rope scaling factor. Default: 1.0

## Inference Settings

- `VALIDATION_WORKERS` - The number of tokenizer workers used for payload validation and truncation inside the router. Default: 2

- `SHARDED` - Whether to shard the model across multiple GPUs. By default text-generation-inference will use all available GPUs to run the model. Setting it to `false` deactivates `num_shard`.

- `NUM_SHARD` - The number of shards to use if you don't want to use all GPUs on a given machine. For instance, you can use `CUDA_VISIBLE_DEVICES=0,1 text-generation-launcher... --num_shard 2` and `CUDA_VISIBLE_DEVICES=2,3 text-generation-launcher... --num_shard 2` to launch 2 copies with 2 shards each on a machine with 4 GPUs.

- `MAX_CONCURRENT_REQUESTS` - The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse client requests instead of having them wait for too long and is usually good to handle backpressure correctly. Default: 128

- `MAX_BEST_OF` - This is the maximum allowed value for clients to set `best_of`. Best of makes `n` generations at the same time, and returns the best in terms of overall log probability over the entire generated sequence. Default: 2

- `MAX_STOP_SEQUENCES` - This is the maximum allowed value for clients to set `stop_sequences`. Stop sequences are used to allow the model to stop on more than just the EOS token, and enable more complex "prompting" where users can preprompt the model in a specific way and define their "own" stop token aligned with their prompt. Default: 4

- `MAX_INPUT_LENGTH` - This is the maximum allowed input length (expressed in number of tokens) for users. The larger this value, the longer the prompts users can send, which can impact the overall memory required to handle the load. Please note that some models have a finite range of sequence they can handle. Default: 1024

- `MAX_TOTAL_TOKENS` - This is the most important value to set as it defines the "memory budget" of running clients' requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. With a value of `1512`, users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` `max_new_tokens`. The larger this value, the larger amount each request will take in your RAM and the less effective batching can be. Default: 2048

- `WAITING_SERVED_RATIO` - This represents the ratio of waiting queries vs running queries where you want to start considering pausing the running queries to include the waiting ones into the same batch. `waiting_served_ratio=1.2` means that when 12 queries are waiting and there are only 10 queries left in the current batch, we check if we can fit those 12 waiting queries into the batching strategy, and if yes, then batching happens, delaying the 10 running queries by a `prefill` run. This setting is only applied if there is room in the batch as defined by `max_batch_total_tokens`. Default: 1.2

- `MAX_BATCH_PREFILL_TOKENS` - Limits the number of tokens for the prefill operation. Since this operation takes the most memory and is compute bound, it is useful to limit the number of requests that can be sent. Default: 4096

- `MAX_BATCH_TOTAL_TOKENS` - **IMPORTANT** This is one critical control to allow maximum usage of the available hardware. This represents the total amount of potential tokens within a batch. When using padding (not recommended) this would be equivalent to `batch_size` * `max_total_tokens`. However, in the non-padded (flash attention) version this can be much finer. For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens (see the sketch after this list). Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters, such as whether you're using quantization, flash attention, or the model implementation, text-generation-inference cannot infer this number automatically.

- `MAX_WAITING_TOKENS` - This setting defines how many tokens can be passed before forcing the waiting queries to be put on the batch (if the size of the batch allows for it). New queries require 1 `prefill` forward, which is different from `decode`, and therefore you need to pause the running batch in order to run `prefill` to create the correct values for the waiting queries to be able to join the batch. With a value too small, queries will always "steal" the compute to run `prefill` and running queries will be delayed by a lot. With a value too big, waiting queries could wait for a very long time before being allowed a slot in the running batch. Default: 20
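
To make the token-budget arithmetic above concrete, here is a small illustrative calculation reusing the numbers from the `max_total_tokens` and `max_batch_total_tokens` examples. It is plain arithmetic, not part of any TGI API.

```python
# MAX_TOTAL_TOKENS: per-request budget (prompt tokens + max_new_tokens), e.g. 1512
max_total_tokens = 1512
prompt_tokens = 1000
print(max_total_tokens - prompt_tokens)              # -> 512 new tokens allowed for this prompt

# MAX_BATCH_TOTAL_TOKENS: batch-wide budget shared by all running requests, e.g. 1000
max_batch_total_tokens = 1000
tokens_per_request = 100
print(max_batch_total_tokens // tokens_per_request)  # -> 10 such requests fit in one batch
```
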

## Server

- `HOSTNAME` - The IP address to listen on. Default: `0.0.0.0`

- `PORT` - The port to listen on. Default: 3000

- `SHARD_UDS_PATH` - The name of the socket for gRPC communication between the webserver and the shards. Default: `/tmp/text-generation-server`

- `MASTER_ADDR` - The address the master shard will listen on (setting used by torch distributed). Default: `localhost`

- `MASTER_PORT` - The port the master shard will listen on (setting used by torch distributed). Default: 29500

- `HUGGINGFACE_HUB_CACHE` - The location of the Hugging Face Hub cache. Used to override the location if you want to provide a mounted disk, for instance.

- `WEIGHTS_CACHE_OVERRIDE` - The location of the Hugging Face Hub cache. Used to override the location if you want to provide a mounted disk, for instance.

- `HUGGING_FACE_HUB_TOKEN` - The token to use to authenticate to the Hugging Face Hub. Used to download private models.

## Logging

- `JSON_OUTPUT` - Outputs the logs in JSON format (useful for telemetry). Default: false

- `OTLP_ENDPOINT` - Send metrics to the OpenTelemetry endpoint.

- `CORS_ALLOW_ORIGIN` - Allowed CORS origins.