From 64cbf288e461ad64954c9c85339137cbccb33319 Mon Sep 17 00:00:00 2001
From: osanseviero <osanseviero@gmail.com>
Date: Wed, 16 Aug 2023 16:21:19 +0200
Subject: [PATCH] Add streaming guide

---
 docs/source/_toctree.yml            |   4 ++
 docs/source/conceptual/streaming.md | 100 ++++++++++++++++++++++++++++
 docs/source/quicktour.md            |   2 +-
 3 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/conceptual/streaming.md
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index a161dc28..cdb3127e 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -18,3 +18,7 @@
   - local: basic_tutorials/using_cli
     title: Using TGI CLI
   title: Tutorials
+- sections:
+  - local: conceptual/streaming
+    title: Streaming
+  title: Conceptual Guides
diff --git a/docs/source/conceptual/streaming.md b/docs/source/conceptual/streaming.md
new file mode 100644
index 00000000..dacd3791
--- /dev/null
+++ b/docs/source/conceptual/streaming.md
@@ -0,0 +1,100 @@
+# Streaming
+
+## What is Streaming?
+
+With streaming, the server returns tokens one by one as the LLM generates them. This lets you show the generation to the user progressively instead of making them wait for the full output. Streaming is essential to the end-user experience because it reduces perceived latency, one of the most critical aspects of a smooth experience.
+
+![A diff of streaming vs non streaming](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/streaming-generation-visual.gif)
+
+With token streaming, the server starts returning tokens as soon as they are generated instead of waiting for the whole sequence. Users see output long before the work is done. This has several positive effects:
+
+* For extremely long generations, users get the first results orders of magnitude earlier.
+* Seeing the generation in progress lets users stop it if it's not going in the direction they expect.
+* Perceived latency is lower when results appear early.
+
+For example, suppose a system generates 100 tokens per second. To produce 1000 tokens in a non-streaming setup, users wait 10 seconds before seeing any result. In a streaming setup, users get initial results immediately and, although end-to-end latency is the same, they have seen half of the generation after five seconds. We've built an interactive demo that shows non-streaming vs streaming side-by-side. Click **generate** below.
+
+<div class="block dark:hidden">
+	<iframe 
+        src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=light"
+        width="850"
+        height="1600"
+    ></iframe>
+</div>
+<div class="hidden dark:block">
+    <iframe 
+        src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=dark"
+        width="850"
+        height="1600"
+    ></iframe>
+</div>
+
+## How to use Streaming?
+
+### Streaming with Python
+
+To stream tokens with the `InferenceClient` from `huggingface_hub`, pass `stream=True`.
+
+```python
+from huggingface_hub import InferenceClient
+
+client = InferenceClient(model="http://127.0.0.1:8080")
+for token in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
+    print(token)
+
+# To
+# make
+# cheese
+#,
+# you
+# need
+# to
+# start
+# with
+# milk
+#.
+```
+
+If you want additional details, add `details=True`. Each item in the stream is then a `TextGenerationStreamResponse` that carries the token along with extra information such as its log probability. The final response in the stream also includes the full generated text.
+
+```python
+for details in client.text_generation("How do you make cheese?", max_new_tokens=12, details=True, stream=True):
+    print(details)
+
+#TextGenerationStreamResponse(token=Token(id=193, text='\n', logprob=-0.007358551, special=False), generated_text=None, details=None)
+#TextGenerationStreamResponse(token=Token(id=2044, text='To', logprob=-1.1357422, special=False), generated_text=None, details=None)
+#TextGenerationStreamResponse(token=Token(id=717, text=' make', logprob=-0.009841919, special=False), generated_text=None, details=None)
+#...
+#TextGenerationStreamResponse(token=Token(id=25, text='.', logprob=-1.3408203, special=False), generated_text='\nTo make cheese, you need to start with milk.', details=StreamDetails(finish_reason=<FinishReason.Length: 'length'>, generated_tokens=12, seed=None))
+```
+
+The `huggingface_hub` library also comes with an `AsyncInferenceClient` in case you need to handle the requests concurrently.
+
+```python
+from huggingface_hub import AsyncInferenceClient
+
+client = AsyncInferenceClient("http://127.0.0.1:8080")  # URL of the endpoint serving TGI
+await client.text_generation("How do you make cheese?")  # needs an async context (notebook or `async def`)
+# \nTo make cheese, you need to start with milk.
+```
+
+### Streaming with cURL
+
+To use the `generate_stream` endpoint with curl, add the `-N` flag, which disables curl's default buffering and shows data as it arrives from the server.
+
+```bash
+curl -N 127.0.0.1:8080/generate_stream \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
+## How does Streaming work under the hood?
+
+Under the hood, TGI uses Server-Sent Events (SSE). In an SSE setup, the client sends a request with its data, opening an HTTP connection and subscribing to updates. The server then pushes data to the client over that connection; no further requests are needed, and the server keeps sending data as it becomes available. SSE is unidirectional: after the initial request, the client does not send anything more to the server. SSE sends data over HTTP.
+
+SSE differs from:
+* Polling: the client keeps calling the server to ask for data. The server may return empty responses, which causes overhead.
+* WebSockets: the connection is bidirectional. The server can send information to the client, and the client can also send information to the server after the first request. WebSockets are more complex to operate because they don't use only HTTP.
+
+One limitation of Server-Sent Events is that they limit how many concurrent requests the server can handle. Instead of timing out when there are too many SSE connections, TGI returns an HTTP error with an `overloaded` error type (`huggingface_hub` raises `OverloadedError`). This lets the client deal with the overloaded server (e.g., it could display a busy error to the user or retry with a new request). To configure the maximum number of concurrent requests, set `--max-concurrent-requests`, which lets you handle backpressure.
\ No newline at end of file
diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md
index 4ba2be40..170f8dc0 100644
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@@ -85,7 +85,7 @@ curl 127.0.0.1:8080/generate \
 To see all possible deploy flags and options, you can use the `--help` flag. It's possible to configure the number of shards, quantization, generation parameters, and more.
 
 ```bash
-docker run ghcr.io/huggingface/text-generation-inference:1.0.0 --help
+docker run ghcr.io/huggingface/text-generation-inference:1.0.1 --help
 ```
 
 </Tip>