mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-04-19 22:02:06 +00:00

doc: Add metrics documentation and add a 'Reference' section (#2230)

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

2024-08-16 19:43:30 +02:00


Metrics

TGI exposes multiple metrics that can be collected via the `/metrics` Prometheus endpoint. These metrics can be used to monitor the performance of TGI, to autoscale deployments, and to identify bottlenecks.
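As a minimal sketch of collecting these metrics without a Prometheus server, the snippet below fetches the endpoint with the Python standard library and parses the simplest case of the Prometheus text format: unlabelled `name value` lines. The helper names `fetch_metrics` and `parse_unlabelled_samples` are hypothetical, the payload in `EXAMPLE` is made up for illustration, and the sketch assumes the series of interest carry no labels; labelled series are skipped rather than parsed.

```python
import urllib.request


def fetch_metrics(base_url: str, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text served at <base_url>/metrics."""
    url = f"{base_url.rstrip('/')}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_unlabelled_samples(text: str) -> dict:
    """Parse 'name value' lines; skip comments, blank lines, and labelled series."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # '#' starts HELP/TYPE comments; '{' marks a labelled series (not handled here).
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples


# Made-up payload for illustration; real text would come from fetch_metrics(...).
EXAMPLE = """\
# TYPE tgi_queue_size gauge
tgi_queue_size 3
# TYPE tgi_batch_current_size gauge
tgi_batch_current_size 8
"""

print(parse_unlabelled_samples(EXAMPLE))
```

Against a running server, the two helpers would be combined as `parse_unlabelled_samples(fetch_metrics(base_url))`, where `base_url` is whatever address your TGI deployment listens on. For production monitoring, a Prometheus scrape job pointed at the same endpoint is the more usual choice.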

The following metrics are exposed:

| Metric Name | Description | Type | Unit |
|---|---|---|---|
| `tgi_batch_current_max_tokens` | Maximum tokens for the current batch | Gauge | Count |
| `tgi_batch_current_size` | Current batch size | Gauge | Count |
| `tgi_batch_decode_duration` | Time spent decoding a batch per method (prefill or decode) | Histogram | Seconds |
| `tgi_batch_filter_duration` | Time spent filtering batches and sending generated tokens per method (prefill or decode) | Histogram | Seconds |
| `tgi_batch_forward_duration` | Batch forward duration per method (prefill or decode) | Histogram | Seconds |
| `tgi_batch_inference_count` | Inference calls per method (prefill or decode) | Counter | Count |
| `tgi_batch_inference_duration` | Batch inference duration | Histogram | Seconds |
| `tgi_batch_inference_success` | Number of successful inference calls per method (prefill or decode) | Counter | Count |
| `tgi_batch_next_size` | Batch size of the next batch | Histogram | Count |
| `tgi_queue_size` | Current queue size | Gauge | Count |
| `tgi_request_count` | Total number of requests | Counter | Count |
| `tgi_request_duration` | Total time spent processing the request (e2e latency) | Histogram | Seconds |
| `tgi_request_generated_tokens` | Generated tokens per request | Histogram | Count |
| `tgi_request_inference_duration` | Request inference duration | Histogram | Seconds |
| `tgi_request_input_length` | Input token length per request | Histogram | Count |
| `tgi_request_max_new_tokens` | Maximum new tokens per request | Histogram | Count |
| `tgi_request_mean_time_per_token_duration` | Mean time per token per request (inter-token latency) | Histogram | Seconds |
| `tgi_request_queue_duration` | Time spent in the queue per request | Histogram | Seconds |
| `tgi_request_skipped_tokens` | Speculated tokens per request | Histogram | Count |
| `tgi_request_success` | Number of successful requests | Counter | Count |
| `tgi_request_validation_duration` | Time spent validating the request | Histogram | Seconds |
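To illustrate how the metric types in the table can be turned into monitoring signals, here is a small sketch that derives a mean end-to-end latency and a request success ratio. It relies on the standard Prometheus convention that a histogram named `X` is exposed as `X_sum` and `X_count` series (alongside its buckets). The helper names and the values in `SAMPLES` are made up for illustration, and the sketch assumes the series are unlabelled and have already been parsed into a plain name-to-value mapping.

```python
from typing import Mapping, Optional


def histogram_mean(samples: Mapping[str, float], name: str) -> Optional[float]:
    """Mean observation of a histogram, computed as <name>_sum / <name>_count."""
    count = samples.get(f"{name}_count", 0.0)
    if count == 0:
        return None  # no observations yet, so the mean is undefined
    return samples[f"{name}_sum"] / count


def success_ratio(samples: Mapping[str, float]) -> Optional[float]:
    """Share of requests that succeeded, from the two request counters."""
    total = samples.get("tgi_request_count", 0.0)
    if total == 0:
        return None
    return samples.get("tgi_request_success", 0.0) / total


# Made-up values for illustration only, not output from a real server.
SAMPLES = {
    "tgi_request_count": 200.0,
    "tgi_request_success": 190.0,
    "tgi_request_duration_sum": 475.0,
    "tgi_request_duration_count": 190.0,
}

mean_latency = histogram_mean(SAMPLES, "tgi_request_duration")
ratio = success_ratio(SAMPLES)
print(f"mean e2e latency: {mean_latency} s, success ratio: {ratio}")
```

Counters and histogram sums are cumulative since the process started, so these figures are lifetime averages. To follow recent behaviour you would instead compare two scrapes taken some interval apart, which is what a `rate()` query does in Prometheus.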
