74 Commits

Author SHA1 Message Date
bccc980b90 Alerting: Notifiication history (#107644)
* Add unified_alerting.notification_history to ini files

* Parse notification history settings

* Move Loki client to a separate package

* Loki client: add params for metrics and traces

* add NotificationHistorian

* rm writeDuration

* remove RangeQuery stuff

* wip

* wip

* wip

* wip

* pass notification historian in tests

* unify loki settings

* unify loki settings

* add test

* update grafana/alerting

* make update-workspace

* add feature toggle

* fix configureNotificationHistorian

* Revert "add feature toggle"

This reverts commit de7af8f7

* add feature toggle

* more tests

* RuleUID

* fix metrics test

* met.Info.Set(0)
2025-07-17 14:26:26 +01:00
a76057cedb Alerting: Use GRAFANA_ALERTS as the default state history metric name (#107050) 2025-06-20 18:14:36 +02:00
ad683f83ff Alerting: Add state history backend to write ALERTS metric (#104361)
**What is this feature?**

This PR implements a new Prometheus historian backend that allows Grafana alerting to write alert state history as Prometheus-compatible `ALERTS` metrics to remote Prometheus-compatible data sources.

The metric includes a few additional labels:

* `grafana_alertstate`: Grafana's full alert state, more granular than Prometheus.
* `grafana_rule_uid`: Grafana's alert rule UID.

Grafana states are included in the `grafana_alertstate` label also mapped to Prometheus-compatible `alertstate` values:

| Grafana alert state | `alertstate`          | `grafana_alertstate`  |
|---------------------|-----------------------|-----------------------|
| `Alerting`          | `firing`              | `alerting`            |
| `Recovering`        | `firing`              | `recovering`          |
| `Pending`           | `pending`             | `pending`             |
| `Error`             | `firing`              | `error`               |
| `NoData`            | `firing`              | `nodata`              |
| `Normal`            | _(no metric emitted)_ | _(no metric emitted)_ |
2025-06-18 07:17:57 +02:00
5137995830 Alerting: Add support for Redis Sentinel for Alerting HA (#106322)
* Alerting: Add support for Redis Sentinel

* docs

* docs

* Use minisentinel in test

* Apply suggestions from code review

Co-authored-by: Johnny Kartheiser <140559259+JohnnyK-Grafana@users.noreply.github.com>
Co-authored-by: Fayzal Ghantiwala <114010985+fayzal-g@users.noreply.github.com>

* "address(es)" -> "address or addresses"

* make update-workspace

* make lint-go-diff

---------

Co-authored-by: Johnny Kartheiser <140559259+JohnnyK-Grafana@users.noreply.github.com>
Co-authored-by: Fayzal Ghantiwala <114010985+fayzal-g@users.noreply.github.com>
2025-06-05 15:02:40 +01:00
e256f2d5e2 Alerting: Enable recording rules by default (#105603) 2025-06-02 10:56:05 +02:00
6c3d89f390 Remote Alertmanager: Add timeouts to the HTTP client (#105279)
* Remote Alertmanage: Add timeouts to the HTTP client

* code review suggestions
2025-05-13 13:25:56 +02:00
5a589bb51a Alerting: Enable the remote Alertmanager feature using only feature toggles (#101410)
* Alerting: Enable the remote Alertmanager feature using only feature toggles

* Trigger build
2025-04-30 12:18:47 +02:00
756b45402c Alerting: Add rule_query_offset setting for Prometheus rule conversion (#102500)
Adds a new configuration option to specify a time offset for rule evaluation, which gets applied and saved during the Prometheus -> Grafana conversion. For example:

[unified_alerting.prometheus_conversion]
rule_query_offset = 1m

Changing this option affects only the rules imported after the change. If query_offset is set at the group level, it takes precedence over this setting.

Default is set to 1m.
2025-03-20 15:31:21 +01:00
943b73a682 Alerting: Add scheduled clean-up of deleted rules (#101963)
* add scheduled clean up of deleted rules


---------

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2025-03-11 22:58:26 +02:00
bbab62ce39 Alerting: Select remote write path dependent on metrics backend type. (#101891)
The remote write path differs based on whether the data source is actually
Prometheus, Mimir, Cortex, or an older version of Cortex. We do not want
users to have to specify the path, so this change determines the path as
best it can.

It may be in the future we have to make this configurable per-datasource
to cater for setups where it's impossible to determine the correct path.
2025-03-11 13:45:16 +01:00
14ebec527c Alerting: Allow selection of recording rule write target on per-rule basis. (#101778)
* Alerting: Allow selection of recording rule write target on per-rule basis.

Introduces a new feature flag (`grafanaManagedRecordingRulesDatasources`),
disabled by default, to enable the ability to write recording rules data using
data source settings, and selecting the data source to use on a per-rule basis.

To cope with the scenario of users upgrading, a configuration file option
allows setting the default data source to use, if none is specified in the rule,
emulating the behaviour of recording rules without the flag enabled.

* Lint

* Update conf/sample.ini

Co-authored-by: Alexander Akhmetov <me@alx.cx>

---------

Co-authored-by: Alexander Akhmetov <me@alx.cx>
2025-03-07 14:30:40 +01:00
1f8f9a45d7 Alerting: Add state_periodic_save_batch_size config option (#98019)
* Alerting: Add state_periodic_save_batch_size config option

---------

Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
2024-12-16 15:30:38 +01:00
c440bd2bda Alerting: Change default for max_attempts to 3. (#97461)
Currently the default is 1, this means that by default users will see transient
query errors reflected as alert evaluation failures, when often an immediate
retry is sufficient to evaluate the rule successfully.

Enabling retries by default leads to a better experience out of the box.
2024-12-05 21:48:24 +01:00
1fdc48faba Alerting: Make context deadline on AlertNG service startup configurable (#96053)
* Make alerting context deadline configurable

* Remove debug logs

* Change default timeout

* Update tests
2024-11-07 18:23:55 +00:00
fbad76007d Alerting: Limit and clean up old alert rules versions (#89754) 2024-10-05 00:31:21 +03:00
ac5ebe6e4d Alerting: Add enablement flag for recording rules (#92032)
* Add enablement flag

* Disable if toggle not enabled
2024-08-19 12:01:00 -05:00
b79b38f02c Alertmanager: Support limits for silences (#90826)
* Alertmanager: support limits for silences

* update grafana/alerting to latest main
2024-07-24 14:22:29 +02:00
68691c9386 Alerting: Add setting for maximum allowed rule evaluation results (#89468)
* Alerting: Add setting for maximum allowed rule evaluation results

Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
2024-06-27 09:45:15 +02:00
4a5aab54a5 Alerting: Add max limit for Loki query size in state history API (#89646)
* add setting for query limit

* update BuildLogQuery to return error if limit is exceeded

* move tests for BuildLogQuery to separate suite
2024-06-25 09:20:38 -04:00
3228b64fe6 Alerting: Resend resolved notifications for ResolvedRetention duration (#88938)
* Simple replace of State.Resolved with State.ResolvedAt

* Retain ResolvedAt time between Normal->Normal transition

* Introduce ResolvedRetention to keep sending recently resolved alerts

* Make ResolvedRetention configurable with resolved_alert_retention

* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention

* Do not reset ResolvedAt during Normal->Pending transition

Initially this was done to be inline with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.

This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.

* Pointers for ResolvedAt & LastSentAt

To avoid awkward time.Time{}.Unix() defaults on persist
2024-06-20 16:33:03 -04:00
c62cc25513 Alerting: Configure recording rule writer from config.ini (#89056) 2024-06-12 16:04:46 -04:00
eb76ea47a0 Alerting: Add ha_reconnect_timeout configuration option (#88823)
* Docs: Update "Configure high availability" guide with ha_reconnect_timeout configuration

---------

Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
2024-06-11 13:25:48 -04:00
e15e40fbd3 Alerting: Skip setting up clustering in remote primary/only modes (#88968)
* Alerting: Skip setting up clustering in remote primary mode

* Update pkg/services/ngalert/notifier/multiorg_alertmanager.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

---------

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2024-06-10 13:51:11 +02:00
80f54778f3 Alerting: Add option to use Redis in cluster mode for Alerting HA (#88696)
* Add config option to use Redis in cluster mode

* Use UniversalOptions
2024-06-05 17:02:25 +01:00
7a2fbad0c8 Alerting: Add options to configure TLS for HA using Redis (#87567)
* Add Alerting HA Redis Client TLS configs

* Add test to ping miniredis with mTLS

* Update .ini files and docs

* Add tests for unified alerting ha redis TLS settings

* Fix malformed go.sum

* Add modowner

* Fix lint error

* Update docs and use dstls config
2024-05-14 14:21:42 +01:00
529f55cfe8 Alerting: Remove isDefault field from receivers (Alertmanager configuration) (#86605)
Alerting: Remove isDefault field from receivers in the Alertmanager configuration
2024-04-19 15:44:20 +02:00
c7573bb0f7 Alerting: Make retention period configurable for the notification log (#85605)
* Alerting: Make retention period configurable for the notification log

* update sample.ini

* fix outdated comment (on disk -> kvstore)

* skip checking cyclomatic complexity for ReadUnifiedAlertingSettings
2024-04-05 12:25:43 +02:00
97f37b2e6f Alerting: Clamp Loki ASH range query to configured max_query_length (#83986)
* Clamp range in loki http client to configured max_query_length

Defaults to 721h to match Loki default
2024-03-15 18:59:45 +02:00
8765c48389 Alerting: Remove legacy alerting (#83671)
Removes legacy alerting, so long and thanks for all the fish! 🐟

---------

Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
Co-authored-by: Sonia Aguilar <soniaAguilarPeiron@users.noreply.github.com>
Co-authored-by: Armand Grillet <armandgrillet@users.noreply.github.com>
Co-authored-by: William Wernert <rwwiv@users.noreply.github.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-03-14 15:36:35 +01:00
7147af6b8e Alerting: Disable legacy alerting for ever (#83651)
* hard disable for legacy alerting
* remove alerting section from configuration file 
* update documentation to not refer to deleted section
* remove AlertingEnabled from usage in UA setting parsing
2024-03-07 16:01:11 -05:00
99fa064576 Alerting: Emit warning when creating or updating unusually large groups (#82279)
* Add config for limit of rules per rule group

* Warn when editing big groups through normal API

* Warn on prov api writes for groups

* Wire up comp root, tests

* Also add warning to state manager warm

* Drop unnecessary conversion
2024-02-13 08:29:03 -06:00
5bbe9c6e61 Alerting: Enable group-level rule evaluation jittering by default, remove feature toggle (#82212)
* remove jitter feature flag

* Add an out so users can manually disable jitter

* Pass in cfg

* Add TODO to remove knob in future
2024-02-09 15:53:58 -06:00
aa25776f81 Alerting: Add a feature flag to periodically save states (#80987) 2024-01-23 17:03:30 +01:00
6768c6c059 Chore: Remove public vars in setting package (#81018)
Removes the public variable setting.SecretKey plus some other ones. 
Introduces some new functions for creating setting.Cfg.
2024-01-23 12:36:22 +01:00
a77ba40ed4 Alerting: Use the forked Alertmanager for remote secondary mode (#79646)
* (WIP) Alerting: Use the forked Alertmanager for remote secondary mode

* fall back to using internal AM in case of error

* remove TODOs, clean up .ini file, add orgId as part of remote AM config struct

* log warnings and errors, fall back to remoteSecondary, fall back to internal AM only

* extract logic to decide remote Alertmanager mode to a separate function, switch on mode

* tests

* make linter happy

* remove func to decide remote Alertmanager mode

* refactor factory function and options

* add default case to switch statement

* remove ineffectual assignment
2023-12-21 15:26:31 +01:00
c46da8ea9b Alerting: Update alerting package and imports from cluster and clusterpb (#79786)
* Alerting: Update alerting package

* update to latest commit

* alias for imports
2023-12-21 12:34:48 +01:00
0c9356a3c7 Unified Alerting: Set max_attempts to 1 by default (#79095)
* Unified Alerting: Set `max_attempts` to 1 by default

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-05 17:42:34 +00:00
5a80962de9 Alerting: Add clean_upgrade config and deprecate force_migration (#78324)
* Alerting: Add clean_upgrade config and deprecate force_migration

Upgrading to UA and rolling back will no longer delete any data by default. 
Instead, each set of tables will remain unchanged when switching between 
legacy and UA. As such, the force_migration config has been deprecated 
and no extra configuration is required to roll back to legacy anymore.

If clean_upgrade is set to true when upgrading from legacy alerting to Unified
Alerting, grafana will first delete all existing Unified Alerting resources,
thus re-upgrading all organizations from scratch. If false or unset,
organizations that have previously upgraded will not lose their existing Unified
 Alerting data when switching between legacy and Unified Alerting.

 Similar to force_migration, it should be kept false when not needed as it may
 cause unintended data-loss if left enabled.

---------

Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
2023-11-30 11:01:11 -05:00
7a34cdb3a2 Alerting: Add configuration options to migrate to an external Alertmanager (#71318)
* add configuration options to .ini file and parse them

* updates on config options, add external AM config to the main config struct

* separate external AM configs from general alerting configs, naming

* comments about usage of tenantID in basic auth & not using config options yet
2023-09-05 11:24:35 -03:00
dfba94e052 Alerting: Limit redis pool size to 5 and make configurable (#74057)
* Limit redis pool size to 5 and expose it in config ini

* Coerce negative pool sizes to the default
2023-08-29 14:59:12 -05:00
c7598cc6fb Alerting: Add ability to control scheduler tick interval via config (#71980)
* add ability to control scheduler interval via config
* add feature flag `configurableSchedulerTick`
2023-07-26 12:44:12 -04:00
7edbe72483 Alerting: Support concurrent queries for saving alert instances (#70525)
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
2023-06-23 11:36:07 +01:00
8bb62a8316 Alerting: Add option for memberlist label (#67982) 2023-05-09 10:32:23 +02:00
bc11a484ed Alerting: Add support for running HA using Redis (#65267)
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2023-04-19 17:05:26 +02:00
b2abb63286 Alerting: Introduce proper feature toggles for common state history backend combinations (#65497)
* define 3 feature toggles for rollout phases

* Pass feature toggles along

* Implement first feature toggle

* Try a different strategy with fall-throughs to specific configurations

* Apply toggle overrides once outside of backend composition

* Emit log messages when we coerce backends

* Run code generator for feature toggle files

* Improve wording in flag descs

* Re-run generator

* Use code-generated constants instead of plain strings

* Use converted enum values rather than strings for pre-parsing
2023-03-30 13:53:21 -05:00
a31672fa40 Alerting: Create new state history "fanout" backend that dispatches to multiple other backends at once (#64774)
* Rename RecordStatesAsync to Record

* Rename QueryStates to Query

* Implement fanout writes

* Implement primary queries

* Simplify error joining

* Add test for query path

* Add tests for writes and error propagation

* Allow fanout backend to be configured

* Touch up log messages and config validation

* Consistent documentation for all backend structs

* Parse and normalize backend names more consistently against an enum

* Touch-ups to documentation

* Improve clarity around multi-record blocking

* Keep primary and secondaries more distinct

* Rename fanout backend to multiple backend

* Simplify config keys for multi backend mode
2023-03-17 12:41:18 -05:00
e7ace4ed62 Alerting: Allow separate read and write path URLs for Loki state history (#62268)
Extract config parsing and add tests
2023-01-30 16:30:05 -06:00
b4682fe3cb Alerting: Configurable externalLabels for Loki state history (#62404)
* Add config option for external labels

* Remove redundant nilcheck
2023-01-30 14:24:45 -06:00
aebcecf538 Chore: Fix goimports grouping in other backend platform packages (#62422)
* fix goimports

* fix goimports order

* fix goimports order

* fix goimports order

* fix goimports order

* fix goimports order
2023-01-30 08:26:42 +00:00
44b11d3228 Alerting: support basic auth for the state history loki client (#61696) 2023-01-18 20:24:40 +01:00