Improve the Consuming TGI docs.

Vaibhav Srivastav 2024-08-13 15:47:25 +02:00
parent cd9b15d17f
commit 8de10acdcf
2 changed files with 21 additions and 20 deletions


@@ -819,13 +819,6 @@
         "example": "1.0",
         "nullable": true
       },
-      "guideline": {
-        "type": "string",
-        "description": "A guideline to be used in the chat_template",
-        "default": "null",
-        "example": "null",
-        "nullable": true
-      },
       "logit_bias": {
         "type": "array",
         "items": {
@@ -1824,8 +1817,7 @@
     "type": "object",
     "required": [
       "finish_reason",
-      "generated_tokens",
-      "input_length"
+      "generated_tokens"
     ],
     "properties": {
       "finish_reason": {
@@ -1837,12 +1829,6 @@
         "example": 1,
         "minimum": 0
       },
-      "input_length": {
-        "type": "integer",
-        "format": "int32",
-        "example": 1,
-        "minimum": 0
-      },
       "seed": {
         "type": "integer",
         "format": "int64",


@@ -1,18 +1,33 @@
 # Consuming Text Generation Inference
 
-There are many ways you can consume Text Generation Inference server in your applications. After launching, you can use the `/generate` route and make a `POST` request to get results from the server. You can also use the `/generate_stream` route if you want TGI to return a stream of tokens. You can make the requests using the tool of your preference, such as curl, Python or TypeScrpt. For a final end-to-end experience, we also open-sourced ChatUI, a chat interface for open-source models.
+There are many ways to consume the Text Generation Inference (TGI) server in your applications. After launching the server, you can use the [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) `/v1/chat/completions` route and make a `POST` request to get results from the server. You can also pass `"stream": true` to the call if you want TGI to return a stream of tokens. You can make the requests using the tool of your preference, such as curl, Python, or TypeScript. For a final end-to-end experience, we have also open-sourced ChatUI, a chat interface for open-source models.
 
 ## curl
 
-After the launch, you can query the model using either the `/generate` or `/generate_stream` routes:
+After a successful server launch, you can query the model using the `/v1/chat/completions` route to get responses compliant with the OpenAI Chat Completions API spec:
 
 ```bash
-curl 127.0.0.1:8080/generate \
+curl localhost:3000/v1/chat/completions \
     -X POST \
-    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -d '{
+  "model": "tgi",
+  "messages": [
+    {
+      "role": "system",
+      "content": "You are a helpful assistant."
+    },
+    {
+      "role": "user",
+      "content": "What is deep learning?"
+    }
+  ],
+  "stream": true,
+  "max_tokens": 20
+}' \
    -H 'Content-Type: application/json'
 ```
+
+You can set the `stream` parameter to `false` to get a non-streaming response instead.
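For illustration, the same streaming call can be made from Python. The sketch below is an assumption rather than part of the documented API surface: it uses the `requests` library (any HTTP client works) against the same `localhost:3000` server as the curl example, and parses the server-sent events the route emits. The dedicated clients covered in the following sections are the more ergonomic option.

```python
import json

import requests  # assumption: `pip install requests`

# Same request as the curl example above; assumes a TGI server on localhost:3000.
url = "http://localhost:3000/v1/chat/completions"
payload = {
    "model": "tgi",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    "stream": True,
    "max_tokens": 20,
}

# With "stream": true the route answers with server-sent events,
# one `data: {...}` line per generated chunk.
with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for raw_line in response.iter_lines():
        if not raw_line:
            continue  # skip the blank lines that separate SSE events
        event = raw_line.decode("utf-8")
        if not event.startswith("data:"):
            continue
        data = event[len("data:"):].strip()
        if data == "[DONE]":
            break  # OpenAI-style end-of-stream sentinel, if the server sends one
        chunk = json.loads(data)
        # Each chunk mirrors the OpenAI chat-completion-chunk shape.
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)
    print()
```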
 ## Inference Client