Alerting Docs: Monitor your high availability setup (#92063)

* Alerting Docs: Monitor your high availability setup
* Update docs/sources/alerting/set-up/configure-high-availability/_index.md
* Update docs/sources/alerting/set-up/configure-high-availability/_index.md
* Update docs/sources/alerting/set-up/configure-high-availability/_index.md
* Update docs/sources/alerting/set-up/configure-high-availability/_index.md
* Shorten links
* Update/reorder a bit the description about alertmanager gossiping
* Update `alertmanager_peer_position` description

Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>

@@ -18,6 +18,17 @@ labels:
  - oss
title: Configure high availability
weight: 600
refs:
  state-history:
    - pattern: /docs/grafana/
      destination: /docs/grafana/<GRAFANA_VERSION>/alerting/manage-notifications/view-state-health/
    - pattern: /docs/grafana-cloud/
      destination: /docs/grafana-cloud/alerting-and-irm/alerting/manage-notifications/view-state-health/
  meta-monitoring:
    - pattern: /docs/grafana/
      destination: /docs/grafana/<GRAFANA_VERSION>/alerting/monitor/
    - pattern: /docs/grafana-cloud/
      destination: /docs/grafana-cloud/alerting-and-irm/alerting/monitor/
---

# Configure high availability

@@ -28,18 +39,13 @@ Grafana Alerting uses the Prometheus model of separating the evaluation of alert

When running multiple instances of Grafana, all alert rules are evaluated on all instances. You can think of the evaluation of alert rules as being duplicated by the number of running Grafana instances. This is how Grafana Alerting makes sure that as long as at least one Grafana instance is working, alert rules are still evaluated and notifications for alerts are still sent.

-You can find this duplication in state history and it is a good way to confirm if you are using high availability.
+You can find this duplication in state history and it is a good way to [verify your high availability setup](#verify-your-high-availability-setup).

-While the alert generator evaluates all alert rules on all instances, the alert receiver makes a best-effort attempt to avoid sending duplicate notifications. Alertmanager chooses availability over consistency, which may result in occasional duplicated or out-of-order notifications. It takes the opinion that duplicate or out-of-order notifications are better than no notifications.
+While the alert generator evaluates all alert rules on all instances, the alert receiver makes a best-effort attempt to avoid duplicate notifications. The Alertmanagers use a gossip protocol to share information between them and prevent sending duplicated notifications.

-The Alertmanager uses a gossip protocol to share information about notifications between Grafana instances. It also gossips silences, which means a silence created on one Grafana instance is replicated to all other Grafana instances. Both notifications and silences are persisted to the database periodically, and during graceful shut down.
+Alertmanager chooses availability over consistency, which may result in occasional duplicated or out-of-order notifications. It takes the opinion that duplicate or out-of-order notifications are better than no notifications.

-{{% admonition type="note" %}}
-
-If using a mix of `execute_alerts=false` and `execute_alerts=true` on the HA nodes, since the alert state is not shared amongst the Grafana instances, the instances with `execute_alerts=false` do not show any alert status.
-This is because the HA settings (`ha_peers`, etc) only apply to the alert notification delivery (i.e. de-duplication of alert notifications, and silences, as mentioned above).
-
-{{% /admonition %}}
+Alertmanagers also gossip silences, which means a silence created on one Grafana instance is replicated to all other Grafana instances. Both notifications and silences are persisted to the database periodically, and during graceful shutdown.

## Enable alerting high availability using Memberlist

@@ -56,6 +62,8 @@ Since gossiping of notifications and silences uses both TCP and UDP port `9094`,

   By default, it is set to listen to all interfaces (`0.0.0.0`).
1. Set `[ha_peer_timeout]` in the `[unified_alerting]` section of the custom.ini to specify the time to wait for an instance to send a notification via the Alertmanager. The default value is 15s, but it may increase if Grafana servers are located in different geographic regions or if the network latency between them is high.

For a demo, see this [example using Docker Compose](https://github.com/grafana/alerting-ha-docker-examples/tree/main/memberlist).

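As a minimal sketch of the settings described in these steps (the host names `grafana-1`, `grafana-2`, and `grafana-3` are illustrative placeholders, not values from the documentation), the `[unified_alerting]` section of each instance's custom.ini might look like this:

```ini
[unified_alerting]
# Hypothetical peer list: every Grafana instance in the cluster, reachable on the gossip port.
ha_peers = grafana-1:9094,grafana-2:9094,grafana-3:9094
# Listen for gossip traffic on all interfaces; 0.0.0.0:9094 is the default.
ha_listen_address = 0.0.0.0:9094
# Time to wait for another instance to send a notification before this one does.
ha_peer_timeout = 15s
```
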
## Enable alerting high availability using Redis

As an alternative to Memberlist, you can use Redis for high availability. This is useful if you want to have a central

@@ -68,19 +76,7 @@ database for HA and cannot support the meshing of all Grafana servers.

1. Optional: Set `ha_redis_prefix` to something unique if you plan to share the Redis server with multiple Grafana instances.
1. Optional: Set `ha_redis_tls_enabled` to `true` and configure the corresponding `ha_redis_tls_*` fields to secure communications between Grafana and Redis with Transport Layer Security (TLS).

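As a sketch of a Redis-based configuration (the Redis address and prefix shown here are illustrative assumptions), the corresponding custom.ini settings might look like this:

```ini
[unified_alerting]
# Hypothetical central Redis server used for HA coordination instead of Memberlist.
ha_redis_address = redis.example.com:6379
# Optional: a unique prefix so several Grafana setups can share one Redis server.
ha_redis_prefix = grafana-alerting-prod
# Optional: secure the Grafana-to-Redis connection with TLS (see the ha_redis_tls_* fields).
ha_redis_tls_enabled = true
```
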
-The following metrics can be used for meta monitoring, exposed by the `/metrics` endpoint in Grafana:
-
-| Metric | Description |
-| ------ | ----------- |
-| alertmanager_cluster_messages_received_total | Total number of cluster messages received. |
-| alertmanager_cluster_messages_received_size_total | Total size of cluster messages received. |
-| alertmanager_cluster_messages_sent_total | Total number of cluster messages sent. |
-| alertmanager_cluster_messages_sent_size_total | Total number of cluster messages received. |
-| alertmanager_cluster_messages_publish_failures_total | Total number of messages that failed to be published. |
-| alertmanager_cluster_members | Number indicating current number of members in cluster. |
-| alertmanager_peer_position | Position the Alertmanager instance believes it's in. The position determines a peer's behavior in the cluster. |
-| alertmanager_cluster_pings_seconds | Histogram of latencies for ping messages. |
-| alertmanager_cluster_pings_failures_total | Total number of failed pings. |

For a demo, see this [example using Docker Compose](https://github.com/grafana/alerting-ha-docker-examples/tree/main/redis).

## Enable alerting high availability using Kubernetes

@@ -149,3 +145,48 @@ The following metrics can be used for meta monitoring, exposed by the `/metrics`
ha_peer_timeout = 15s
ha_reconnect_timeout = 2m
```

## Verify your high availability setup

When running multiple Grafana instances, all alert rules are evaluated on every instance. This multiple evaluation of alert rules is visible in the [state history](ref:state-history) and provides a straightforward way to verify that your high availability configuration is working correctly.

{{% admonition type="note" %}}

If you use a mix of `execute_alerts=false` and `execute_alerts=true` on the HA nodes, the instances with `execute_alerts=false` do not show any alert status, because the alert state is not shared amongst the Grafana instances.

The HA settings (`ha_peers`, etc.) apply only to communication between Alertmanagers, synchronizing silences and attempting to avoid duplicate notifications, as described in the introduction.

{{% /admonition %}}

You can also confirm your high availability setup by monitoring Alertmanager metrics exposed by Grafana.

| Metric | Description |
| ------ | ----------- |
| alertmanager_cluster_members | Number indicating current number of members in cluster. |
| alertmanager_cluster_messages_received_total | Total number of cluster messages received. |
| alertmanager_cluster_messages_received_size_total | Total size of cluster messages received. |
| alertmanager_cluster_messages_sent_total | Total number of cluster messages sent. |
| alertmanager_cluster_messages_sent_size_total | Total size of cluster messages sent. |
| alertmanager_cluster_messages_publish_failures_total | Total number of messages that failed to be published. |
| alertmanager_cluster_pings_seconds | Histogram of latencies for ping messages. |
| alertmanager_cluster_pings_failures_total | Total number of failed pings. |
| alertmanager_peer_position | The position an Alertmanager instance believes it holds, which defines its role in the cluster. Peers should be numbered sequentially, starting from zero. |

You can confirm the number of Grafana instances in your alerting high availability setup by querying the `alertmanager_cluster_members` and `alertmanager_peer_position` metrics.

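For example, in a healthy setup every instance reports the same member count, and the peer positions are distinct. A sketch of the queries you might run in Explore (the expected values depend on how many Grafana instances you run; three is assumed in the comments):

```promql
# Every instance should report the full cluster size (3 in this hypothetical setup).
alertmanager_cluster_members

# Peer positions should be distinct, numbered sequentially from 0 to 2.
alertmanager_peer_position
```
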
Note that these alerting high availability metrics are exposed via the `/metrics` endpoint in Grafana, and are not automatically collected or displayed. If you have a Prometheus instance connected to Grafana, add a `scrape_config` to scrape Grafana metrics and then query these metrics in Explore.

```yaml
- job_name: grafana
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  static_configs:
    - targets:
        - grafana:3000
```

For more information on monitoring alerting metrics, refer to [Alerting meta-monitoring](ref:meta-monitoring). For a demo, see [alerting high availability examples using Docker Compose](https://github.com/grafana/alerting-ha-docker-examples/).

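As a further sketch, these metrics can also feed a meta-monitoring alert. The rule below is an illustrative example for a Prometheus rule file, not part of the Grafana documentation; the expected member count, rule names, and labels are assumptions:

```yaml
groups:
  - name: grafana-alerting-ha # hypothetical rule group
    rules:
      - alert: GrafanaAlertmanagerClusterDegraded
        # Fires when any instance sees fewer cluster members than expected.
        # Replace 3 with the number of Grafana instances in your HA setup.
        expr: min(alertmanager_cluster_members) < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Grafana Alertmanager HA cluster reports fewer members than expected.
```
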