**What is this feature?**
This PR implements a new Prometheus historian backend that allows Grafana alerting to write alert state history as Prometheus-style `ALERTS` metrics to remote Prometheus-compatible data sources.
The metric carries two additional labels:
* `grafana_alertstate`: Grafana's full alert state, which is more granular than Prometheus's `alertstate`.
* `grafana_rule_uid`: the UID of the Grafana alert rule.
Grafana states are carried as-is in the `grafana_alertstate` label and are also mapped to Prometheus-compatible `alertstate` values:
| Grafana alert state | `alertstate` | `grafana_alertstate` |
|---------------------|-----------------------|-----------------------|
| `Alerting` | `firing` | `alerting` |
| `Recovering` | `firing` | `recovering` |
| `Pending` | `pending` | `pending` |
| `Error` | `firing` | `error` |
| `NoData` | `firing` | `nodata` |
| `Normal` | _(no metric emitted)_ | _(no metric emitted)_ |
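For illustration, a series written for a firing rule might look like the following; the label values here are made up, and the rule's own labels are carried on the series as well:

```
ALERTS{alertname="HighCPU", alertstate="firing", grafana_alertstate="alerting", grafana_rule_uid="d9fae164", severity="critical"} 1
```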
* Alerting: Add support for Redis Sentinel
* docs
* docs
* Use minisentinel in test
* Apply suggestions from code review
Co-authored-by: Johnny Kartheiser <140559259+JohnnyK-Grafana@users.noreply.github.com>
Co-authored-by: Fayzal Ghantiwala <114010985+fayzal-g@users.noreply.github.com>
* "address(es)" -> "address or addresses"
* make update-workspace
* make lint-go-diff
---------
Co-authored-by: Johnny Kartheiser <140559259+JohnnyK-Grafana@users.noreply.github.com>
Co-authored-by: Fayzal Ghantiwala <114010985+fayzal-g@users.noreply.github.com>
Adds a new configuration option to specify a time offset for rule evaluation, which is applied and saved during the Prometheus -> Grafana conversion. For example:

```ini
[unified_alerting.prometheus_conversion]
rule_query_offset = 1m
```

Changing this option affects only rules imported after the change. If `query_offset` is set at the group level, it takes precedence over this setting. The default is `1m`.
The remote write path differs based on whether the data source is actually
Prometheus, Mimir, Cortex, or an older version of Cortex. We do not want
users to have to specify the path, so this change determines the path as
best it can.
In the future we may have to make this configurable per data source
to cater for setups where it is impossible to determine the correct path.
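For reference, the candidate write paths involved are, to the best of my knowledge, the following (verify against your data source's documentation):

```
Prometheus (remote-write receiver enabled)   /api/v1/write
Mimir, recent Cortex                         /api/v1/push
Older Cortex                                 /api/prom/push
```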
* Alerting: Allow selection of recording rule write target on a per-rule basis.
Introduces a new feature flag (`grafanaManagedRecordingRulesDatasources`),
disabled by default, that enables writing recording rule data using
data source settings and selecting the target data source on a per-rule basis.
To cope with the scenario of users upgrading, a configuration file option
allows setting the default data source to use when none is specified in the rule,
emulating the behaviour of recording rules without the flag enabled (an
illustrative config sketch follows this commit list).
* Lint
* Update conf/sample.ini
Co-authored-by: Alexander Akhmetov <me@alx.cx>
---------
Co-authored-by: Alexander Akhmetov <me@alx.cx>
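A minimal sketch of what that configuration could look like; the feature flag name comes from the PR itself, but the `[recording_rules]` key name below is illustrative, not the verbatim setting:

```ini
[feature_toggles]
grafanaManagedRecordingRulesDatasources = true

[recording_rules]
# Illustrative key name: data source UID used when a rule does not
# select a write target of its own.
default_datasource_uid = my-prometheus
```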
Currently the default is 1, which means that by default users see transient
query errors reflected as alert evaluation failures, when often an immediate
retry is enough to evaluate the rule successfully.
Enabling retries by default leads to a better experience out of the box.
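For anyone who prefers the old single-attempt behaviour, the setting can still be pinned explicitly (assuming the standard `[unified_alerting]` section):

```ini
[unified_alerting]
# 1 attempt = no retries, the previous default.
max_attempts = 1
```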
* Alerting: Add setting for maximum allowed rule evaluation results
Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
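In INI form the setting lives in the `[quota]` section; the value below is illustrative:

```ini
[quota]
# Maximum number of evaluation results a single rule may produce;
# evaluations beyond this limit error out.
alerting_rule_evaluation_results = 1000
```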
* Simple replace of State.Resolved with State.ResolvedAt
* Retain ResolvedAt time between Normal->Normal transition
* Introduce ResolvedRetention to keep sending recently resolved alerts
* Make ResolvedRetention configurable with resolved_alert_retention
* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention
* Do not reset ResolvedAt during Normal->Pending transition
Initially this was done to be in line with the Prometheus ruler. However, the
Prometheus ruler doesn't track Inactive->Pending/Alerting using the same alert
instance, so it's understandable that it chooses not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.
This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.
* Pointers for ResolvedAt & LastSentAt
To avoid awkward `time.Time{}.Unix()` defaults on persist (see the sketch below)
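A minimal self-contained sketch of the pointer change; the real `State` struct has many more fields:

```go
package main

import (
	"fmt"
	"time"
)

// Abridged sketch of the State fields discussed above.
type State struct {
	// Pointers so "never resolved" / "never sent" persist as NULL
	// rather than time.Time{}.Unix(), a bogus far-past timestamp.
	ResolvedAt *time.Time
	LastSentAt *time.Time
}

func main() {
	var s State
	fmt.Println(s.ResolvedAt == nil) // true: nothing awkward to persist
}
```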
* Docs: Update "Configure high availability" guide with ha_reconnect_timeout configuration
---------
Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
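For context on the setting that guide documents: it sits alongside the other `ha_*` keys; the value below is illustrative, the guide has the actual default:

```ini
[unified_alerting]
# How long to wait before attempting to reconnect to a lost HA peer.
ha_reconnect_timeout = 2m
```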
* Alerting: Make retention period configurable for the notification log
* update sample.ini
* fix outdated comment (on disk -> kvstore)
* skip checking cyclomatic complexity for ReadUnifiedAlertingSettings
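A sketch of how that might look in the config file; the key name here is inferred from the PR title, so treat it as illustrative rather than verbatim:

```ini
[unified_alerting]
# Illustrative key name: how long notification log entries are kept.
notification_log_retention = 5d
```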
Removes legacy alerting, so long and thanks for all the fish! 🐟
---------
Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
Co-authored-by: Sonia Aguilar <soniaAguilarPeiron@users.noreply.github.com>
Co-authored-by: Armand Grillet <armandgrillet@users.noreply.github.com>
Co-authored-by: William Wernert <rwwiv@users.noreply.github.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
* hard disable for legacy alerting
* remove alerting section from configuration file
* update documentation to not refer to deleted section
* remove AlertingEnabled from usage in UA setting parsing
* Add config for limit of rules per rule group
* Warn when editing big groups through normal API
* Warn on prov api writes for groups
* Wire up comp root, tests
* Also add warning to state manager warm
* Drop unnecessary conversion
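A hedged sketch of the new option; I believe the key is `rules_per_rule_group_limit` under `[unified_alerting]`, but treat the name and value as illustrative:

```ini
[unified_alerting]
# Maximum number of rules per rule group; larger groups trigger the
# warnings described above.
rules_per_rule_group_limit = 100
```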
* (WIP) Alerting: Use the forked Alertmanager for remote secondary mode
* fall back to using internal AM in case of error
* remove TODOs, clean up .ini file, add orgId as part of remote AM config struct
* log warnings and errors, fall back to remoteSecondary, fall back to internal AM only
* extract logic to decide remote Alertmanager mode to a separate function, switch on mode
* tests
* make linter happy
* remove func to decide remote Alertmanager mode
* refactor factory function and options
* add default case to switch statement
* remove ineffectual assignment
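A self-contained sketch of the fallback behaviour described in these commits; every type and constructor name below is a stand-in, not Grafana's actual API:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Hypothetical stand-in for Grafana's Alertmanager interface.
type Alertmanager interface{ Name() string }

type internalAM struct{}

func (internalAM) Name() string { return "internal" }

// newRemoteSecondary stands in for building the forked Alertmanager
// that runs the remote one in secondary mode; it can fail, e.g. on a
// bad remote configuration.
func newRemoteSecondary(internal Alertmanager) (Alertmanager, error) {
	return nil, errors.New("remote Alertmanager unreachable")
}

// pick mirrors the fallback in these commits: try the remote-secondary
// fork and fall back to the internal Alertmanager on error.
func pick(internal Alertmanager) Alertmanager {
	am, err := newRemoteSecondary(internal)
	if err != nil {
		log.Printf("falling back to internal Alertmanager: %v", err)
		return internal
	}
	return am
}

func main() { fmt.Println(pick(internalAM{}).Name()) }
```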
* Unified Alerting: Set `max_attempts` to 1 by default
The retry logic for unified alerting has been broken as far back as v9.4.x. Rather than fixing it in one go and causing a headache for our users by having rules put extra load on their data sources, I think a better approach is to simply set 1 as the default and let users change it.
I see two cons with this approach:
- Configuration for legacy to unified alerting cannot be ported over automatically; users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting retries out of the box (as with legacy alerting) will not have them and will have to edit the configuration manually.
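For migrating users, restoring legacy-style retries is a one-line change:

```ini
[unified_alerting]
# Match legacy alerting's behaviour of up to 3 evaluation attempts.
max_attempts = 3
```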
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Alerting: Add clean_upgrade config and deprecate force_migration
Upgrading to UA and rolling back will no longer delete any data by default.
Instead, each set of tables will remain unchanged when switching between
legacy and UA. As such, the force_migration config has been deprecated
and no extra configuration is required to roll back to legacy anymore.
If `clean_upgrade` is set to true when upgrading from legacy alerting to Unified
Alerting, Grafana will first delete all existing Unified Alerting resources,
thus re-upgrading all organizations from scratch. If false or unset,
organizations that have previously upgraded will not lose their existing Unified
Alerting data when switching between legacy and Unified Alerting.
As with `force_migration`, it should be kept false when not needed, as it may
cause unintended data loss if left enabled.
---------
Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
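A sketch of the option; I believe it lives under `[unified_alerting.upgrade]`, but verify against sample.ini:

```ini
[unified_alerting.upgrade]
# When true, delete all existing Unified Alerting data on upgrade and
# re-upgrade every organization from scratch. Keep false unless needed.
clean_upgrade = false
```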
* add configuration options to .ini file and parse them
* updates on config options, add external AM config to the main config struct
* separate external AM configs from general alerting configs, naming
* comments about usage of tenantID in basic auth & not using config options yet
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
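A self-contained sketch of the general pattern only, with hypothetical names throughout (this is not Grafana's actual code): batches of instances are written concurrently, cancelling outstanding work on the first error.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// alertInstance is a stand-in for Grafana's alert instance model.
type alertInstance struct{ RuleUID string }

// saveBatch is a hypothetical helper standing in for the real database
// write; it is not Grafana's API.
func saveBatch(ctx context.Context, batch []alertInstance) error {
	// ... INSERT/UPSERT the batch in one query ...
	return nil
}

// saveConcurrently issues one query per batch, concurrently, and
// cancels the remaining work on the first error.
func saveConcurrently(ctx context.Context, batches [][]alertInstance) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, batch := range batches {
		batch := batch // capture the loop variable for the goroutine
		g.Go(func() error { return saveBatch(ctx, batch) })
	}
	return g.Wait()
}

func main() {
	batches := [][]alertInstance{{{RuleUID: "a"}}, {{RuleUID: "b"}}}
	fmt.Println(saveConcurrently(context.Background(), batches))
}
```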
* define 3 feature toggles for rollout phases
* Pass feature toggles along
* Implement first feature toggle
* Try a different strategy with fall-throughs to specific configurations
* Apply toggle overrides once outside of backend composition
* Emit log messages when we coerce backends
* Run code generator for feature toggle files
* Improve wording in flag descs
* Re-run generator
* Use code-generated constants instead of plain strings
* Use converted enum values rather than strings for pre-parsing
* Rename RecordStatesAsync to Record
* Rename QueryStates to Query
* Implement fanout writes
* Implement primary queries
* Simplify error joining
* Add test for query path
* Add tests for writes and error propagation
* Allow fanout backend to be configured
* Touch up log messages and config validation
* Consistent documentation for all backend structs
* Parse and normalize backend names more consistently against an enum
* Touch-ups to documentation
* Improve clarity around multi-record blocking
* Keep primary and secondaries more distinct
* Rename fanout backend to multiple backend
* Simplify config keys for multi backend mode
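Pulling the backend work together, an illustrative `multiple` backend configuration; the key names follow the state history settings and the values are examples:

```ini
[unified_alerting.state_history]
# Write to every configured backend, serve queries from the primary.
backend = multiple
primary = annotations
secondaries = loki
```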