97 Commits

Author SHA1 Message Date
c54da8f955 Alerting: Make $value return the query value in case when a single datasource is used (#102301)
What is this feature?

This PR changes the behavior of the $value and .Value variables in alerting templating to be more compatible with Prometheus templating. When a single datasource is used in the alerting rule, these variables will now return the numeric value from the query instead of the evaluation string.

Why do we need this feature?

It makes Grafana templating more compatible with Prometheus templates. In Prometheus, $value returns the numeric value of the query, but in Grafana it's the evaluation string: [ var='A' labels={instance=instance1} value=81.234 ]. This is because in Grafana multiple datasources can be used in the alert rule, and it's not always possible to get a single value.

This change makes Grafana's behavior consistent with Prometheus when a single datasource is used, and in case when multiple datasources are used in the query, it keeps the old behaviour.

Both $value and .Value are not recommended to use (documentation), and it's better to use .Values instead.
2025-03-26 10:31:38 +01:00
695ac91290 Alerting: Add backend support for keep_firing_for (#100750)
What is this feature?

This PR introduces a new alert rule configuration option, keep_firing_for (Prometheus documentation).

keep_firing_for prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or after the next evaluation if it is bigger than keep_firing_for), the alert transitions to "Normal" if it doesn't start alerting again:

Before                                          

+----------+     +----------+                    
| Alerting |---->|  Normal  |                    
+----------+     +----------+                    

-----
After

+----------+      +------------+     +----------+
| Alerting |----->| Recovering |---->|  Normal  |
+----------+      +------------+     +----------+                                                 

Why do we need this feature?

This feature prevents flapping alerts by adding a recovery period. This helps avoid false resolutions caused by brief alert
2025-03-18 11:24:48 +01:00
bc4be187af Alerting: Fix evaluation of rules with no-op math expressions
When you use a math expression with out any operators, the dataFrame pointer is identical between the expression result and the input query/expression.

This was resulting in the values returned from an evaluation overshadowing each other, depending on the order of the processing of the result map.

For example:
```
A: some_metric
B: reduce of A
C: math expression -> "${B}"
D: Threshold evaluation of C -> "C > 0"
```
With a value of 1 for `some_metric`, might result in a evaluation result of one of the following (somewhat at random):
1. { B: 1, D: 1 }
2. { C: 1, D: 1}

While you would expect to see:
{ B: 1, C: 1, D: 1 }
2025-02-27 17:04:18 -05:00
bfc6c032c4 refactor(alerting): remove transformation that is now done by the querier (#93660) 2024-09-24 14:46:03 +03:00
10314585ec fix(alerting): extend instant vector check for non-nullable types (#93323) 2024-09-17 13:20:40 +02:00
eabf3b9f73 feat(alerting): add support for query service instant vectors (#92091) 2024-09-12 15:33:00 +02:00
4c71cadd5f Alerting: Detach condition validator from condition evaluator (#91150)
* Detach validator from evaluator

* Drop unnecessary interface and type
2024-07-30 10:55:37 -05:00
8323b688c6 Alerting: Improve logging in scheduler and states (#91003)
* handle metadata map nil

* remove double context

* clean up logging in scheduler

* do not reuse loggers from previous ticks

* log the dropped tick

* log tick instead of ticknum

* replace with processing tick logs

* log sending notifications

* update logging in persister to fetch context

* logs to historian

moved them upstream to be able to log when store is overridden
2024-07-29 16:01:48 -04:00
94dd4105e2 Loki: Allow alert headers to be forwarded (#90890)
* Loki: Allow alert headers to be forwarded

* Loki: fix tests

---------

Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-07-25 07:39:34 +02:00
88ed77e7e8 Alerting: More graceful handling of NoData in recording rules (#90312)
* Handle NoData as its own case

* Debug

* Scalars parseable by CollectionReader

* fix linter

* Orgit add pkg/*git add pkg/* not and
2024-07-17 15:24:03 -05:00
c3b9c9b239 Alerting: Send information about alert rule to data source in headers (#90344)
* add support of metadata to condition and adding it to request headers
* support for additional metadata when condition is built
* add additionall context to conditions: source and folder title
* add version
* use percent-encoding for header values
2024-07-17 22:55:12 +03:00
9c05b30489 Chore: Add more logs and tracing to hysteresis flows (#90369) 2024-07-15 13:38:20 -04:00
68691c9386 Alerting: Add setting for maximum allowed rule evaluation results (#89468)
* Alerting: Add setting for maximum allowed rule evaluation results

Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
2024-06-27 09:45:15 +02:00
99d8025829 Chore: Move identity and errutil to apimachinery module (#89116) 2024-06-13 07:11:35 +03:00
d004f8a98d Alerting: Recording rules understands errors embedded in dataframes (#88946)
* Make MakeDependencyError public for tests in another package

* Create tests for errors in eval results

* Extract logic to pull frame errors out into exported function

* Maybe we can drop cyclomatic complexity lint suppression now?

* extract frame errors and fail recording rules if frames contain error

* Fix up retry logic to actually work

* Do not retry non retryable errors
2024-06-11 10:37:10 -05:00
6c47968f6c Alerting: Do not retry rule evaluations with "input data must be a wide series but got type long" style errors (#87343)
add typed error for series must be wide, do not retry
2024-05-07 11:31:07 -05:00
734d0111cb Alerting: Export pure function to convert query results to alert results (#85393)
Exported pure function to convert query results to alert results
2024-04-05 08:57:31 -05:00
9dc4221508 Alerting: Log expression command types during evaluation (#84614) 2024-03-19 10:00:03 -04:00
48b5ac779b Alerting/Annotations: Add annotation backend for Loki alert state history (#78156)
* Move scope type vars to testutil package

* Expose parts of state historian for use in annotation backend

* Implement Loki ASH Annotation store

This store will only implement the `Get` method of a RepositoryImpl since alert state history
writes to Loki elsewhere.

* Use interface for Loki HTTP Client

* Add tests for Loki ASH Annotation store

* Add missing test

* Fix lint

* Organize tests

* Add filter tests

* Improve tests

* Move filter logic into outer function

* Fix lint

* Add comment

* Fix tests

* Fix lint

* Rename historian store + refactor

* Cleanup historian store

* Fix tests

* Minor cleanup

* Use new `ShouldRecordAnnotation` filter

* Fix logic and add tests for this check

* Fix typos, remove unused variables, `< 1` -> `== 0`

* More closely mimic RBAC filter from xorm to ensure correct logic

* Move off weaveworks client

* Address PR comments
2024-01-10 18:42:35 -05:00
f6a46744a6 Alerting: Support hysteresis command expression (#75189)
Backend: 

* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
   - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
  -  add rule_fingerprint column to alert_instance
   - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
   - add AlertingResultsFromRuleState that implements the new interface in eval package
   - update getExprRequest to patch the hysteresis command.

* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.


Frontend:

* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions

* Refactor isInvalid and add test for it
* Remove unnecesary React.memo
* Add tests for updateEvaluatorConditions

---------

Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
2024-01-04 11:47:13 -05:00
c631261681 Alerting: Attempt to retry retryable errors (#79161)
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-06 20:45:08 +00:00
07915703fe Revert "Alerting: Attempt to retry retryable errors" (#79158)
Revert "Alerting: Attempt to retry retryable errors (#79037)"

This reverts commit 3e51cf09491e19442cdceb8e84c5fb3b9ef17e2c.
2023-12-06 19:12:01 +00:00
3e51cf0949 Alerting: Attempt to retry retryable errors (#79037)
* Alerting: Attempt to retry retryable errors

Currently in a draft state, but this was the minimal diff I could put together to exemplify how could achieve this.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-06 16:35:22 +00:00
Jo
580477bf8e NGAlerting: Use identity.Requester interface instead of SignedInUser (#76360)
* unfurl SignedInUserAttrs services

* replace signedInUser with Requester

replace signedInUser with requester

* fix tests

* linting

---------

Co-authored-by: Ieva <ieva.vasiljeva@grafana.com>
2023-11-14 14:47:34 +00:00
35e488b22b SSE: Localize/Contain Errors within an Expression (#73163)
Changes SSE to not always fail all queries when one fails. Now only the query itself, and nodes that depend on it will error.
---------

Co-authored-by: Gilles De Mey <gilles.de.mey@gmail.com>
2023-09-13 13:58:16 -04:00
e855efb13d Plugins: Move store and plugin dto to pluginsintegration (#74655)
move store and plugin dto
2023-09-11 13:59:24 +02:00
58f6648505 Chore: capitalise messages for alerting (#74335) 2023-09-04 18:46:34 +02:00
0717ec11d6 Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal (#68142) 2023-08-15 10:27:15 -04:00
5ba164d92b Alerting: Exclude expression refIDs from NoData state (#72219) 2023-07-26 11:42:04 -04:00
8dd3eb856d Alerting: Improve performance of matching captures (#71828)
This commit updates eval.go to improve the performance of matching
captures in the general case. In some cases we have reduced the
runtime of the function from 10s of minutes to a couple 100ms.
In the case where no capture matches the exact labels, we revert to
the current subset/superset match, but with a reduced search space
due to grouping captures.
2023-07-20 09:07:00 +01:00
541bfe636d SSE: Support for ML query node (#69963)
* introduce a new node-type ML and implement a command outlier that uses ML plugin as a source of data.
* add feature flag mlExpressions that guards the feature
2023-07-13 20:37:50 +03:00
842f33580e SSE: Add functions that determine NodeType by UID and construct a data source struct from NodeType (#70106)
* add NodeTypeFromDatasourceUID and DataSourceModelFromNodeType()
* deprecate expr.DataSourceModel
* replace usages of IsDataSource to NodeTypeFromDatasourceUID 
* replace usages of DataSourceModel to DataSourceModelFromNodeType()
2023-06-16 13:05:06 -04:00
624777258b Plugins: Refactor creation of plugin context to dedicated service (#66451)
* first pass

* fix tests

* return errs

* change signature

* tidy

* delete unnecessary fields from test

* tidy

* fix tests

* simplify

* separate error check in API

* apply nits
2023-06-08 13:59:51 +02:00
35342a3c76 Alerting: Fix DatasourceUID and RefID missing for DatasourceNoData alerts (#66733)
This commit fixes a bug where DatasourceUID and RefID annotations are
missing for DatasourceNoData alerts in Grafana 9.5. This bug affects
datasource plugins that have moved to using the data plane contract.
2023-04-20 14:38:20 +01:00
2bbf0c9de4 Alerting: Allow Rules to Schedule to be filtered by Rule Group (#59990)
* Alerting: Allow Rules to Schedule to be filtered by Rule Group
2023-04-13 12:55:42 +01:00
1c3ce0735f Alerting: Tiny refactor on the eval and schedule packages (#66130)
* Alerting: Tiny refactor on the eval and schedule packages

two very small things:

- We had a constructor on something called a `Context` which is not a `context.Context` so let's just name that constructor `NewContext`
- The user that we use to run query evaluations is the same (with some variation) abstract it to a function so that it can be re-used when necessary.

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>

---------

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
2023-04-06 16:02:28 +01:00
f93a9c794d Alerting: Fix incorrect comment in eval.go (#63510)
This commit fixes an incorrect comment in the Result struct in eval.go
that I had written some time ago. The comment now documents the
actual behaviour and content of this field.
2023-02-21 15:42:04 +00:00
23c27cffb3 Chore: Rename Id to ID in alerting models (#62777)
* Chore: Rename Id to ID in alerting models

* Add xorm tags for datasource

* Add xorm tag for uid
2023-02-02 17:22:43 +01:00
d6d4097567 Chore: Fix goimports grouping in alerting (#62424)
* fix goimports

* fix goimports order
2023-01-30 09:55:35 +01:00
2c46f46d37 Alerting: Rule evaluator to get cached data source info (#61305)
do not skip cache when get data source info
2023-01-18 14:25:11 -05:00
b4e1e1871f Alerting: Fix evaluation timeout (#61303) 2023-01-11 10:52:54 -05:00
c35c689a96 Plugins: Automatically forward plugin request HTTP headers in outgoing HTTP requests (#60417)
Automatically forward core plugin request HTTP headers in outgoing HTTP requests. 
Core datasource plugin authors don't have to specifically handle forwarding of HTTP 
headers, e.g. do not have to "hardcode" the header-names in the datasource plugin, 
if not having custom needs.

Fixes #57065
2022-12-21 13:25:58 +01:00
c5ee4e4ae1 Alerting: Improve rule validation to check if rule uses backend datasources (#58986)
* validate if rule uses backend datasources

* add backend datasource to test

* fix tests

* another forgotten import

* remove unused var
2022-12-08 10:44:02 +01:00
b57689e07e Alerting: Add header X-Grafana-Org-Id to evaluation requests (#58972) 2022-11-21 10:13:44 +01:00
e3a4bde622 Alerting: Condition evaluator with cached pipeline (#57479)
* create rule evaluator
* load header from the context
* init one factory
* update scheduler
2022-11-02 10:13:39 -04:00
0a4121cef8 Alerting: Contextual log provider for rule key (#57476)
* create contextual log context provider
* use contextual provider in scheduler
* init logger in the package
* use context for log context
* use context in state manager
2022-10-26 19:16:02 -04:00
2d20c8db7b Chore: Expression engine to support relative time range (#57474)
* make TimeRange interface and add relative range
* make Execute methods support the current time
* update resample to support relative time range
* update DSNode to support relative time range
* update query service to create queries with absolute time
* make alerting evaluator create relative time ranges
2022-10-26 16:13:58 -04:00
4eb8e4ff66 Alerting: Add traceability headers for alert queries (#57127)
* Define EvaluationContext

* Refactor ConditionEval to use new context struct

* Refactor QueriesAndExpressionsEval to use EvaluationContext

* Remove dead field from AlertExecCtx

* Refactor Validate to use EvaluationContext

* Get rid of privately used AlertExecCtx

* Move EvaluationContext to new file and add helper

* Add builder pattern and bind rule info to context

* Extract header logic and add rule UID header

* Fix missing call
2022-10-19 14:19:43 -05:00
a49fcbdbbc Alerting: Add frames for all queries and expressions (#55609)
This commit is one of two commits to make the data frames for all queries and expressions in an alert rule available to the state package for rendering a graph. It renames Result to Condition, and creates an additional field called
Results that is a map of Ref ID to data.Frames.
2022-09-27 10:05:29 +01:00
2d38664fe6 Alerting: Improve validation of query and expressions on rule submit (#53258)
* Improve error messages of server-side expression 
* move validation of alert queries and a condition to eval package
2022-09-21 15:14:11 -04:00