diff --git a/docs/source/conceptual/streaming.md b/docs/source/conceptual/streaming.md
index b456bbde..93c4d2bd 100644
--- a/docs/source/conceptual/streaming.md
+++ b/docs/source/conceptual/streaming.md
@@ -2,17 +2,17 @@
 
 ## What is Streaming?
 
-With streaming, the server returns the tokens as they are being generated by the LLM. This enables showing progressive generations to the user rather than having to wait for the full generation. Streaming is an essential aspect of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.
+With streaming, the server returns the tokens as the LLM generates them. This enables showing progressive generations to the user rather than waiting for the whole generation. Streaming is an essential aspect of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.
 
-With token streaming, the server can start returning the tokens as they are generated without having to wait for all of them to be generated. The users start to see something happening much earlier than before the work is done. This has different positive effects:
+With token streaming, the server can start returning tokens without waiting for the whole generation to finish. Users start to see something happening much earlier, well before the work is done. This has several positive effects:
 
-* For extremely long queries, users can get results orders of magnitude earlier.
+* Users can get results orders of magnitude earlier for extremely long queries.
 * Seeing something in progress allows users to stop the generation if it's not going in the direction they expect.
-* Perceived latency is lower when results are shown in early stages.
+* Perceived latency is lower when results are shown in the early stages.
 
-For example, think that a system can generate 100 tokens per second. If the system generates 1000 tokens, with the non-streaming setup, users need to wait 10 seconds to get results. On the other hand, with the streaming setup, users get initial results immediately, and, although end-to-end latency will be the same, they have seen half of the generation after five seconds. We've built an interactive demo that shows non-streaming vs streaming side-by-side. Click **generate** below.
+For example, consider a system that can generate 100 tokens per second. If the system generates 1000 tokens, with the non-streaming setup, users need to wait 10 seconds to get results. On the other hand, with the streaming setup, users get initial results immediately, and although end-to-end latency will be the same, they will have seen half of the generation after five seconds. We've built an interactive demo that shows non-streaming vs streaming side-by-side. Click **generate** below.
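+
+To make the contrast concrete, the snippet below is a minimal sketch of consuming a streamed generation with `huggingface_hub`'s `InferenceClient`; the endpoint URL and prompt are placeholders for your own TGI deployment:
+
+```python
+from huggingface_hub import InferenceClient
+
+# Placeholder URL: point this at your own running TGI server.
+client = InferenceClient("http://127.0.0.1:8080")
+
+# With stream=True, tokens are yielded as the server generates them,
+# so the first words appear long before the full generation is done.
+for token in client.text_generation(
+    "What is deep learning?",
+    max_new_tokens=100,
+    stream=True,
+):
+    print(token, end="", flush=True)
+```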