text-generation-inference/docs/source/reference/metrics.md
Hugo Larcher 53729b74ac
doc: Add metrics documentation and add a 'Reference' section (#2230)
* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-16 19:43:30 +02:00

3.8 KiB

Metrics

TGI exposes multiple metrics that can be collected via the /metrics Prometheus endpoint. These metrics can be used to monitor the performance of TGI, autoscale deployment and to help identify bottlenecks.

The following metrics are exposed:

Metric Name Description Type Unit
tgi_batch_current_max_tokens Maximum tokens for the current batch Gauge Count
tgi_batch_current_size Current batch size Gauge Count
tgi_batch_decode_duration Time spent decoding a batch per method (prefill or decode) Histogram Seconds
tgi_batch_filter_duration Time spent filtering batches and sending generated tokens per method (prefill or decode) Histogram Seconds
tgi_batch_forward_duration Batch forward duration per method (prefill or decode) Histogram Seconds
tgi_batch_inference_count Inference calls per method (prefill or decode) Counter Count
tgi_batch_inference_duration Batch inference duration Histogram Seconds
tgi_batch_inference_success Number of successful inference calls per method (prefill or decode) Counter Count
tgi_batch_next_size Batch size of the next batch Histogram Count
tgi_queue_size Current queue size Gauge Count
tgi_request_count Total number of requests Counter Count
tgi_request_duration Total time spent processing the request (e2e latency) Histogram Seconds
tgi_request_generated_tokens Generated tokens per request Histogram Count
tgi_request_inference_duration Request inference duration Histogram Seconds
tgi_request_input_length Input token length per request Histogram Count
tgi_request_max_new_tokens Maximum new tokens per request Histogram Count
tgi_request_mean_time_per_token_duration Mean time per token per request (inter-token latency) Histogram Seconds
tgi_request_queue_duration Time spent in the queue per request Histogram Seconds
tgi_request_skipped_tokens Speculated tokens per request Histogram Count
tgi_request_success Number of successful requests Counter
tgi_request_validation_duration Time spent validating the request Histogram Seconds