Collecting metrics is great, but when things go south, or ideally BEFORE things go south, you want to get notified.
This is where Alertmanager by Prometheus comes into the picture.
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.
How Alertmanager works
- An alert is generated by the alerting rules on the Prometheus server and pushed to Alertmanager.
- Alertmanager then manages those alerts, including silencing, inhibition, grouping, and sending out notifications via email and other notification services.
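To make the "pushed to Alertmanager" step concrete, here is a minimal sketch of the kind of payload a client sends. Prometheus normally does this for you; the sketch only builds the JSON and does not send it. The helper name `build_alert`, the label values, and the `localhost:9093` address are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def build_alert(name, labels, summary, starts_at=None):
    """Build one alert in the general shape of Alertmanager's v2 alerts API."""
    starts_at = starts_at or datetime.now(timezone.utc)
    return {
        # Labels identify the alert; Alertmanager uses them to
        # deduplicate, group, and route.
        "labels": {"alertname": name, **labels},
        # Annotations carry human-readable detail only.
        "annotations": {"summary": summary},
        # RFC 3339 timestamp of when the alert started firing.
        "startsAt": starts_at.isoformat(),
    }


# Hypothetical example values.
alert = build_alert(
    "InstanceDown",
    {"instance": "node-1:9100", "severity": "critical"},
    "node-1 has been unreachable for 5 minutes",
    starts_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)

# The API takes a JSON array of alerts, e.g. via
# POST http://localhost:9093/api/v2/alerts (address is an assumption).
payload = json.dumps([alert], indent=2)
print(payload)
```

Because identity comes from the label set, sending the same labels again updates the existing alert rather than creating a second one, which is what makes deduplication possible.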
Core concepts implemented by Alertmanager
- Silences: A silence mutes alerts for a given time. Incoming alerts are checked against the active silences, and if an alert matches one, no notification is sent for it.
- Inhibition: Inhibition suppresses notifications for certain alerts if certain other alerts are already firing.
- Grouping: Grouping combines alerts of a similar nature into a single notification. This is very useful when many systems fail at once and thousands of alerts may be firing simultaneously.
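The three concepts can be sketched in a few lines of Python. This is a simplified model, not Alertmanager's actual code: matchers here are equality-only, silences have no expiry, and filtering happens before grouping for brevity. All alert names, labels, and the rule layout are made up for illustration.

```python
from collections import defaultdict


def matches(alert, matchers):
    """True if every matcher label equals the alert's label."""
    return all(alert.get(k) == v for k, v in matchers.items())


def is_silenced(alert, silences):
    """An alert is muted if it matches any active silence."""
    return any(matches(alert, s) for s in silences)


def is_inhibited(alert, firing, rule):
    """Suppress a target alert while a matching source alert is firing
    with the same values for the rule's `equal` labels."""
    if not matches(alert, rule["target"]):
        return False
    return any(
        matches(src, rule["source"])
        and all(src.get(l) == alert.get(l) for l in rule["equal"])
        for src in firing
    )


def group(alerts, group_by):
    """Bucket alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for a in alerts:
        groups[tuple(a.get(l) for l in group_by)].append(a)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "node-1", "severity": "critical"},
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "node-2", "severity": "critical"},
    {"alertname": "HighLatency", "cluster": "prod", "instance": "node-1", "severity": "warning"},
    {"alertname": "HighLatency", "cluster": "staging", "instance": "node-9", "severity": "warning"},
]
silences = [{"cluster": "staging"}]  # mute everything from staging
rule = {  # a critical alert hides warnings on the same cluster and instance
    "source": {"severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["cluster", "instance"],
}

to_notify = [
    a for a in alerts
    if not is_silenced(a, silences) and not is_inhibited(a, alerts, rule)
]
notifications = group(to_notify, ["alertname", "cluster"])
for key, members in notifications.items():
    print(key, [m["instance"] for m in members])
```

Of the four alerts, the staging one is silenced, the warning on node-1 is inhibited by the critical alert on the same instance, and the two remaining `InstanceDown` alerts are merged into a single notification.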
Why is Alerting necessary?
Automated alerts are essential to monitoring. They allow you to spot problems anywhere in your infrastructure so that you can rapidly identify their causes and minimize service degradation and disruption.
If metrics and other measurements facilitate observability, then alerts draw human attention to the particular systems that require observation, inspection, and intervention.
Alertmanager in DataVision
This is the Alertmanager dashboard. It shows all the alerts along with any nodes that are down, so you can monitor your cluster and its alerts in one place.
For each alert, you can see the following fields:
- Time: When the alert was generated
- Alertname: Name of the alert
- Device, namespace, instance, pod: Which resource has generated the alert
- Severity: The severity of the alert, where 1 – Info, 2 – Warning, 3 – Critical
- Description: The description of the alert
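As a rough illustration of how one such alert entry could be modeled, here is a small Python sketch. The `AlertRow` class and its field names are hypothetical and are not DataVision's actual schema; only the severity mapping (1, 2, 3) comes from the list above.

```python
from dataclasses import dataclass
from datetime import datetime

# Severity codes as listed above.
SEVERITY_NAMES = {1: "Info", 2: "Warning", 3: "Critical"}


@dataclass
class AlertRow:
    """Hypothetical model of one row in the alerts view."""
    time: datetime       # when the alert was generated
    alertname: str       # name of the alert
    resource: dict       # device, namespace, instance, pod
    severity: int        # 1, 2, or 3
    description: str     # description of the alert

    @property
    def severity_name(self):
        # Fall back to "Unknown" for codes outside the documented range.
        return SEVERITY_NAMES.get(self.severity, "Unknown")


# Made-up example values.
row = AlertRow(
    time=datetime(2024, 1, 1, 12, 0),
    alertname="InstanceDown",
    resource={"instance": "node-1", "namespace": "monitoring"},
    severity=3,
    description="node-1 is unreachable",
)
print(row.alertname, row.severity_name)
```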
We’ve not only incorporated Alertmanager into DataVision but also provided a dashboard for it, which makes it easy to manage your cluster.