We currently do not have a way for users to monitor and alert on events that happen in Longhorn, such as a volume running full, a backup failing, or high CPU and memory consumption.
This enhancement exports Prometheus metrics so that users can use Prometheus or other monitoring systems to monitor Longhorn.
### Related Issues
https://github.com/longhorn/longhorn/issues/1180
## Motivation
### Goals
We are planning to expose 22 metrics in this release:
We expose metrics that provide this information so that users can scrape them and set up an alerting system.
### User Experience In Detail
After this enhancement is merged, Longhorn exposes metrics at the endpoint `/metrics` in Prometheus' [text-based format](https://prometheus.io/docs/instrumenting/exposition_formats/).
Users can use Prometheus or other monitoring systems to collect those metrics by scraping the endpoint `/metrics` of the Longhorn manager.
Then, users can display the collected data using tools such as Grafana.
Users can also set up alerts using tools such as Prometheus Alertmanager.
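For illustration, a scrape of `/metrics` returns plain text in that format, roughly like the sample below; the label set (`node`, `volume`) and the help text are assumptions made for the example, not the final metric definitions:

```
# HELP longhorn_volume_capacity_bytes Configured size in bytes of this volume
# TYPE longhorn_volume_capacity_bytes gauge
longhorn_volume_capacity_bytes{node="worker-1",volume="testvol"} 2.147483648e+09
```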
This metric reports the CPU resources requested in Kubernetes by the Longhorn instance managers on the current node.
The unit of this metric is milliCPU. See more about the unit at https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units
This metric reports the CPU usage of the Longhorn instance managers on the current node.
The unit of this metric is milliCPU. See more about the unit at https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units
This metric reports the CPU usage of the Longhorn manager on the current node.
The unit of this metric is milliCPU. See more about the unit at https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units
We add a new endpoint `/metrics` that exposes all Longhorn Prometheus metrics.
## Design
### Implementation Overview
We follow the [Prometheus best practice](https://prometheus.io/docs/instrumenting/writing_exporters/#deployment): each Longhorn manager reports information about the components it manages.
We create a new collector for each type (volumeCollector, backupCollector, nodeCollector, etc.) and have a common baseCollector.
This structure is similar to the controller package: we have volumeController, nodeController, etc., which have a common baseController.
The end result is a structure like a tree:
```
a custom registry <- many custom collectors share the same base collector <- many metrics in each custom collector
```
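A minimal Go sketch of this tree, assuming the collectors are built on `prometheus/client_golang`; the type names mirror the description above, but the fields, labels, and help text are illustrative rather than the actual Longhorn implementation:

```go
package metricscollector

import "github.com/prometheus/client_golang/prometheus"

// baseCollector holds the fields every custom collector shares.
// Here it only carries the current node ID; the real implementation
// may also hold a logger and a datastore reference.
type baseCollector struct {
	currentNodeID string
}

// volumeCollector is one custom collector in the tree. It embeds the
// base collector and owns the descriptors for the metrics it reports.
type volumeCollector struct {
	baseCollector
	capacityDesc *prometheus.Desc
}

func newVolumeCollector(nodeID string) *volumeCollector {
	return &volumeCollector{
		baseCollector: baseCollector{currentNodeID: nodeID},
		capacityDesc: prometheus.NewDesc(
			"longhorn_volume_capacity_bytes",
			"Configured size in bytes of this volume", // help text is illustrative
			[]string{"node", "volume"}, nil,
		),
	}
}

// Describe and Collect implement prometheus.Collector, so the type can be
// registered in the custom registry.
func (vc *volumeCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- vc.capacityDesc
}

func (vc *volumeCollector) Collect(ch chan<- prometheus.Metric) {
	// The real implementation reads volume CRs from the datastore;
	// see the fleshed-out sketch after the metric list below.
	ch <- prometheus.MustNewConstMetric(vc.capacityDesc, prometheus.GaugeValue,
		0, vc.currentNodeID, "example-volume")
}
```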
When a scrape request is made to the endpoint `/metrics`, a handler gathers data from the Longhorn custom registry, which in turn gathers data from the custom collectors, which in turn collect the data for all of their metrics.
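A hedged sketch of that wiring, assuming `promhttp` from `prometheus/client_golang` serves the scrape requests (the constructor below is illustrative, not the manager's actual API):

```go
package metricscollector

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// NewMetricsHandler builds the Longhorn custom registry, registers the custom
// collectors, and returns an http.Handler the manager can mount at /metrics.
func NewMetricsHandler(currentNodeID string) http.Handler {
	registry := prometheus.NewRegistry()
	registry.MustRegister(newVolumeCollector(currentNodeID))
	// ... register nodeCollector, diskCollector, backupCollector, etc. the same way.

	// promhttp gathers from the registry on every scrape request.
	return promhttp.HandlerFor(registry, promhttp.HandlerOpts{})
}
```

The Longhorn manager's HTTP server would then route `/metrics` to this handler.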
Below is how we collect the data for each metric:
1. longhorn_volume_capacity_bytes
We get the information about a volume's capacity by reading the volume CRD from the datastore (see the datastore-backed collector sketch after this list).
When a volume moves to a different node, the current Longhorn manager stops reporting the volume.
The volume will then be reported by the Longhorn manager on the new node.
1. longhorn_volume_actual_size_bytes
We get the information about a volume's actual size by reading the volume CRD from the datastore.
When a volume moves to a different node, the current Longhorn manager stops reporting the volume.
The volume will then be reported by the Longhorn manager on the new node.
1. longhorn_volume_state
We get the information about a volume's state by reading the volume CRD from the datastore.
1. longhorn_volume_robustness
We get the information about a volume's robustness by reading the volume CRD from the datastore.
1. longhorn_node_status
We get the information about the node status by reading the node CRD from the datastore.
Nodes don't move like volumes do, so we don't have to decide which Longhorn manager reports which node.
1. longhorn_node_count_total
We get the information about the total number of nodes by reading from the datastore.
1. longhorn_node_cpu_capacity_millicpu
We get the information about the maximum allocatable CPU on this node by reading the Kubernetes node resource.
1. longhorn_node_cpu_usage_millicpu
We get the information about the CPU usage on this node from the Kubernetes metrics client (see the metrics-client sketch after this list).
1. longhorn_node_memory_capacity_bytes
We get the information about the maximum allocatable memory on this node by reading the Kubernetes node resource.
1. longhorn_node_memory_usage_bytes
We get the information about the memory usage on this node from the Kubernetes metrics client.
1. longhorn_node_storage_capacity_bytes
We get the information by reading the node CRD from the datastore.
1. longhorn_node_storage_usage_bytes
We get the information by reading the node CRD from the datastore.
1. longhorn_node_storage_reservation_bytes
We get the information by reading the node CRD from the datastore.
1. longhorn_disk_capacity_bytes
We get the information by reading the node CRD from the datastore.
1. longhorn_disk_usage_bytes
We get the information by reading the node CRD from the datastore.
1. longhorn_disk_reservation_bytes
We get the information by reading the node CRD from the datastore.
We get the information by reading the instance manager Pod objects from the datastore.
1. longhorn_manager_cpu_usage_millicpu
We get the information by using the Kubernetes metrics client.
1. longhorn_manager_memory_usage_bytes
We get the information by using the Kubernetes metrics client.
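As referenced in the volume items above, here is a minimal sketch of collecting a volume metric from the datastore. The helper name `collectVolumeCapacity`, the accessor `ListVolumesRO`, and the ownership check via `Status.CurrentNodeID` are assumptions about the manager's datastore API, not the confirmed implementation:

```go
package metricscollector

import (
	"github.com/prometheus/client_golang/prometheus"

	"github.com/longhorn/longhorn-manager/datastore"
)

// collectVolumeCapacity reads the volume CRD from the datastore and emits one
// longhorn_volume_capacity_bytes sample per volume this node is responsible
// for, so each volume is reported by exactly one Longhorn manager.
func collectVolumeCapacity(ds *datastore.DataStore, currentNodeID string,
	desc *prometheus.Desc, ch chan<- prometheus.Metric) {
	volumes, err := ds.ListVolumesRO() // assumed read-only datastore accessor
	if err != nil {
		return
	}
	for _, v := range volumes {
		// Skip volumes attached to another node; that node's manager reports them.
		if v.Status.CurrentNodeID != currentNodeID {
			continue
		}
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue,
			float64(v.Spec.Size), currentNodeID, v.Name)
	}
}
```

The volume collector's `Collect` would call a helper like this with its capacity descriptor; the actual-size, state, and robustness metrics follow the same pattern from the same volume CRs.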
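Similarly, a sketch of reading node CPU and memory usage through the Kubernetes metrics client (`k8s.io/metrics`), matching the units of longhorn_node_cpu_usage_millicpu and longhorn_node_memory_usage_bytes; the helper below is illustrative:

```go
package metricscollector

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	metricsclientset "k8s.io/metrics/pkg/client/clientset/versioned"
)

// nodeUsage queries the metrics-server through the Kubernetes metrics client
// and returns the node's CPU usage in milliCPU and memory usage in bytes.
func nodeUsage(client metricsclientset.Interface, nodeName string) (int64, int64, error) {
	nodeMetrics, err := client.MetricsV1beta1().NodeMetricses().Get(
		context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		return 0, 0, err
	}
	cpu := nodeMetrics.Usage[corev1.ResourceCPU]
	mem := nodeMetrics.Usage[corev1.ResourceMemory]
	return cpu.MilliValue(), mem.Value(), nil
}
```

The manager and instance manager usage metrics are gathered the same way, using the metrics client's pod metrics (`PodMetricses`) instead of node metrics.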
### Test plan
The manual test plan is detailed [here](https://github.com/longhorn/longhorn-tests/blob/master/docs/content/manual/release-specific/v1.1.0/prometheus_support.md).