# TGI-Gaudi example
This example provides a simple way to use tgi-gaudi with continuous batching. It uses the small DIBT/10k_prompts_ranked dataset and reports basic performance numbers.
## Get started
### Install

`pip install -r requirements.txt`
### Setup TGI server

More details on running the TGI server are available here.
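As a minimal sketch of what that setup can look like (the image tag and the Gaudi-specific Docker flags below are assumptions based on typical tgi-gaudi usage, so follow the linked instructions for the exact command), a single-card server could be launched along these lines:

```bash
# Assumed image name/tag and Habana runtime flags; adjust to the tgi-gaudi release you use.
docker run -p 8080:80 \
    --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e HUGGING_FACE_HUB_TOKEN=<token> \
    --cap-add=sys_nice \
    --ipc=host \
    -v $PWD/data:/data \
    ghcr.io/huggingface/tgi-gaudi:2.0.0 \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --max-input-length 1024 \
    --max-total-tokens 2048
```

The server-side `--max-input-length` and `--max-total-tokens` values should be consistent with the `MAX_INPUT_LENGTH` and `MAX_OUTPUT_LENGTH` values used by the benchmark client below.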
### Run benchmark

To run the benchmark, use the command below:

`python run_generation.py --model_id MODEL_ID`
where `MODEL_ID` should be set to the same value as in the TGI server instance.

For gated models such as Llama or StarCoder, you will have to set the environment variable `HUGGING_FACE_HUB_TOKEN=<token>` with a valid Hugging Face Hub read token.
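For example, the token can be exported in the shell that launches the benchmark script:

```bash
export HUGGING_FACE_HUB_TOKEN=<token>
python run_generation.py --model_id meta-llama/Llama-2-7b-chat-hf
```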
All possible parameters are described in the table below:
| Name | Default value | Description |
|------|---------------|-------------|
| SERVER_ADDRESS | http://localhost:8080 | The address and port at which the TGI server is available. |
| MODEL_ID | meta-llama/Llama-2-7b-chat-hf | Model ID used in the TGI server instance. |
| MAX_INPUT_LENGTH | 1024 | Maximum input length supported by the TGI server. |
| MAX_OUTPUT_LENGTH | 1024 | Maximum output length supported by the TGI server. |
| TOTAL_SAMPLE_COUNT | 2048 | Number of samples to run. |
| MAX_CONCURRENT_REQUESTS | 256 | Number of requests sent simultaneously to the TGI server. |
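Putting it together, and assuming each parameter in the table maps to a lower-case command-line flag of the same name (only `--model_id` is shown verbatim above, so the remaining flag names are assumptions to be checked against `python run_generation.py --help`), a full invocation might look like:

```bash
# Flag names other than --model_id are assumed from the parameter table above.
python run_generation.py \
    --server_address http://localhost:8080 \
    --model_id meta-llama/Llama-2-7b-chat-hf \
    --max_input_length 1024 \
    --max_output_length 1024 \
    --total_sample_count 2048 \
    --max_concurrent_requests 256
```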