mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-09-18 07:44:53 +00:00

Added streaming for InferenceClient (#821)

Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

2023-08-11 18:05:19 +03:00

Consuming Text Generation Inference

There are many ways to consume a Text Generation Inference (TGI) server in your applications. After launching the server, you can send a POST request to the /generate route to get results. You can also use the /generate_stream route if you want TGI to return a stream of tokens. You can make the requests with the tool of your preference, such as curl, Python, or TypeScript. For a full end-to-end experience, we also open-sourced ChatUI, a chat interface for open-source models.

curl

After the launch, you can query the model using either the /generate or /generate_stream routes:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
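
The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_generate_request`, `iter_stream_events`, and `generate` are hypothetical, the server address is taken from the curl example above, and the stream parser assumes /generate_stream sends server-sent events whose lines start with `data:` followed by a JSON object.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=20, stream=False):
    """Build (but do not send) a POST request for /generate or /generate_stream."""
    route = "/generate_stream" if stream else "/generate"
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_events(lines):
    """Yield decoded JSON events from server-sent-event lines (bytes or str).

    Assumption: each event arrives as one line of the form `data:{...}`.
    Blank lines and any other lines are ignored.
    """
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):])


def generate(base_url, prompt):
    """Send a non-streaming request and return the generated text.

    Requires a running TGI server at base_url.
    """
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["generated_text"]
```

With a server running locally, `generate("http://127.0.0.1:8080", "What is Deep Learning?")` would mirror the curl call. For streaming, the response object returned by `urllib.request.urlopen` can be passed directly to `iter_stream_events`, since iterating over it yields lines.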

Inference Client

huggingface-hub is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a high-level class, InferenceClient, which makes it easy to make calls to a TGI endpoint. InferenceClient also takes care of parameter validation and provides a simple-to-use interface.

You can install the huggingface-hub package with pip.

pip install huggingface-hub

Once you start the TGI server, instantiate InferenceClient() with the URL to the endpoint serving the model. You can then call text_generation() to hit the endpoint through Python.

from huggingface_hub import InferenceClient

client = InferenceClient(model=URL_TO_ENDPOINT_SERVING_TGI)
# The client is already bound to the endpoint, so `model` does not need to be passed again.
client.text_generation(prompt="Write a code for snake game")

To stream tokens with InferenceClient, simply pass stream=True. Another parameter you can use with the TGI backend is details. Setting details=True returns more information about the generation (tokens, probabilities, etc.). By default, details is set to False, and text_generation returns a string. With details=True, it returns a TextGenerationResponse that holds the generated text and the details. If you pass both details=True and stream=True, text_generation returns an iterator of TextGenerationStreamResponse objects, each of which consists of the generated token, the generated text, and the details.

output = client.text_generation(prompt="Meaning of life is", model=URL_OF_ENDPOINT, details=True)
print(output)

# TextGenerationResponse(generated_text=' a complex concept that is not always clear to the individual. It is a concept that is not always', details=Details(finish_reason=<FinishReason.Length: 'length'>, generated_tokens=20, seed=None, prefill=[], tokens=[Token(id=267, text=' a', logprob=-2.0723474, special=False), Token(id=11235, text=' complex', logprob=-3.1272552, special=False), Token(id=17908, text=' concept', logprob=-1.3632495, special=False),..))
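
As an illustration of what the per-token details can be used for, the log-probabilities can be summed into a score for the whole generated sequence. The following is a minimal sketch: `sequence_logprob` is a hypothetical helper, and the tokens are stand-in objects built with `SimpleNamespace` that copy the first three tokens of the output above, so no server is needed to run it. With a real response you would pass `output.details.tokens` instead.

```python
from types import SimpleNamespace


def sequence_logprob(tokens):
    """Sum the log-probabilities of the given tokens, skipping tokens without one."""
    return sum(t.logprob for t in tokens if t.logprob is not None)


# Stand-ins for the first three Token objects shown in the output above.
sample_tokens = [
    SimpleNamespace(id=267, text=" a", logprob=-2.0723474, special=False),
    SimpleNamespace(id=11235, text=" complex", logprob=-3.1272552, special=False),
    SimpleNamespace(id=17908, text=" concept", logprob=-1.3632495, special=False),
]

total = sequence_logprob(sample_tokens)
average = total / len(sample_tokens)
print(f"total logprob: {total:.4f}, average per token: {average:.4f}")
```

A higher (less negative) average suggests the model was more confident in the tokens it produced.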

You can see how to stream below.

output = client.text_generation(prompt="Meaning of life is", model="http://localhost:3000/", stream=True, details=True)
print(next(iter(output)))

# TextGenerationStreamResponse(token=Token(id=267, text=' a', logprob=-2.0723474, special=False), generated_text=None, details=None)
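
In practice you would usually loop over the whole stream rather than read a single item. The sketch below shows one way to do that, assuming each streamed item exposes `token.text`, `token.special`, and `generated_text` as in the output above, and assuming the server sets `generated_text` only on the final item. `collect_stream` is a hypothetical helper, and the stream here is simulated with stand-in objects so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(responses):
    """Consume a stream of responses.

    Returns a tuple: the text rebuilt from non-special tokens, and the
    last non-None `generated_text` seen (None if the stream never set it).
    """
    pieces = []
    final_text = None
    for response in responses:
        if not response.token.special:
            pieces.append(response.token.text)
        if response.generated_text is not None:
            final_text = response.generated_text
    return "".join(pieces), final_text


# Simulated stream; with a real server, pass the iterator returned by
# client.text_generation(..., stream=True, details=True) instead.
fake_stream = iter([
    SimpleNamespace(token=SimpleNamespace(text=" a", special=False), generated_text=None),
    SimpleNamespace(token=SimpleNamespace(text=" complex", special=False), generated_text=None),
    SimpleNamespace(token=SimpleNamespace(text="</s>", special=True), generated_text=" a complex"),
])

streamed_text, final_text = collect_stream(fake_stream)
print(streamed_text)
```

To display tokens as they arrive, print each `response.token.text` inside the loop instead of collecting the pieces.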

You can check out the details of the text_generation function in the huggingface_hub reference documentation for InferenceClient.

ChatUI

ChatUI is an open-source interface built for LLM serving. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at Hugging Chat, or use the ChatUI Docker Space to deploy your own Hugging Chat to Spaces.

To serve both ChatUI and TGI in the same environment, simply add your own endpoints to the MODELS variable in the .env.local file inside the chat-ui repository. Provide the endpoints pointing to where TGI is served.

{
// rest of the model config here
"endpoints": [{"url": "https://HOST:PORT/generate_stream"}]
}
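
If you serve the same model from several TGI instances, the endpoint entries can be generated rather than written by hand. The sketch below builds a JSON value with the `endpoints` shape shown above. It rests on assumptions that should be checked against the chat-ui repository: that MODELS holds a JSON list of model configurations, and that each one carries a `name` key. `build_models_value` and the model name are hypothetical, and HOST and PORT remain placeholders.

```python
import json


def build_models_value(name, tgi_urls):
    """Return a JSON string describing one model served by several TGI endpoints.

    Each base URL gets the /generate_stream route appended, matching the
    endpoint format shown in the configuration above.
    """
    endpoints = [{"url": url.rstrip("/") + "/generate_stream"} for url in tgi_urls]
    # Assumption: the rest of the model config (omitted here) sits next to these keys.
    models = [{"name": name, "endpoints": endpoints}]
    return json.dumps(models, indent=2)


print(build_models_value("my-tgi-model", ["https://HOST:PORT", "https://HOST2:PORT/"]))
```

The printed JSON can then be merged with the rest of your model configuration in .env.local.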

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available here.
