55 Commits

Author SHA1 Message Date
894944dcb0 Alerting: Refactor remote alertmanager to use Crypto interface (#107228) 2025-06-26 12:39:06 +02:00
5520f48153 Alerting: Use new receiver models for encrypt/decrypt in remote AM (#107042)
Several niche bugs have surfaced as a result of the decrypt code Grafana uses in receivers API being different than what is used to decrypt secrets before sending to remote AM. Example:
- Dingding notifier not abiding by new Patching added to local AM, thus causing missing url errors.

* noop refactor to simplify decryptConfiguration

* Move compat function package

* Use new receiver models to encrypt/decrypt in remote AM
2025-06-20 11:45:01 -04:00
3fe73b8de9 Remote Alertmanager: Send SMTP config (#106337)
* (WIP) Remote Alertmanager: Send SMTP config

* send SMTP configs separately

* bring back deleted fields

* actually send stuff over

* remove redundant type, fix comments

* smtp -> smtpConfig

* also send SmtpFrom an StaticHeaders separately

* tests

* restore defaults.ini
2025-06-13 12:44:39 -03:00
180b67ca6c Remote Alertmanager: Make timeout configurable in alert senders (#105599) 2025-05-19 12:52:12 +02:00
6c3d89f390 Remote Alertmanager: Add timeouts to the HTTP client (#105279)
* Remote Alertmanage: Add timeouts to the HTTP client

* code review suggestions
2025-05-13 13:25:56 +02:00
29128f7ae4 Alerting: Copy alertmanager configuration before decrypting (#105271) 2025-05-13 11:46:17 +02:00
51d7aa2bef Remote Alertmanager: Configure SMTP From address (#104925)
* Remote Alertmanager: Configure SMTP From address

* include smtp from address in config comparison

* updte tests

* trigger build

* make linter happy

* trigger build

* fix test
2025-05-12 10:37:27 +02:00
7edace5e88 Remote Alertmanager: Remove comparison before sending the state (#104930)
* Remote Alertmanager: Remove comparison before sending the state

* fix test

* fix test
2025-05-12 10:34:06 +02:00
57640e40a2 Remote Alertmanager: Consider auto-gen routes when flagging a config as "default" (#105120)
* Remote Alertmanager: Consider auto-gen routes when flagging a config as 'default'

* remove always-nil error from isDefaultConfiguration

* remove unnecessary context.Background() in test

* pass orgID to autogenFn call during AM creation

* fix test

* make update-workspace
2025-05-08 21:04:30 +02:00
757be6365a CI: Bump golangci-lint to 2.0.2 (#103572) 2025-04-10 14:42:23 +02:00
371ea5cda7 Alerting: Fix loss of TimeInterval location on remote AM apply (#102510)
* Alerting: Fix loss of TimeInterval location on remote AM apply

deepcopy.Copy does not correctly copy PostableUserConfig because it ignores
unexported fields. As a result, TimeInterval locations default to UTC instead
of retaining their original values.

* make update-workspace
2025-03-20 09:54:33 +01:00
7d4895c3c9 Alerting: Use exponential backoff in the remote Alertmanager readiness check (#99756)
* Alerting: Use exponential backoff in the remote Alertmanager readiness check

* fix capitalized error

* remove unnecessary 'for'

* refactor, use time.After() instead of channel
2025-01-29 18:53:30 +01:00
361312bbd7 Alerting: Expect 406s from the remote Alertmanager during the readiness check (#99507)
* Alerting: Expect 406s from the remote Alertmanager during the readiness check

* make it clear in the warning logs that we'll attempt to send the confgiuration/state without comparing in case of error pulling the current state/config
2025-01-24 16:26:57 +01:00
9e408f842c Alerting: Skip sanitizing labels when sending alerts to the remote Alertmanager (#96251)
* Alerting: Skip sanitizing labels when sending alerts to the remote Alertmanager

* fix drone image name
2024-11-13 11:21:44 -03:00
80611b381c Alerting: Decrypt secure settings when testing receivers in the remote Alertmanager (#93864)
* Alerting: Decrypt secure settings when testing receivers in the remote Alertmanager

* go work sync

* make update-workspace

* point to latest main in grafana/alerting

* unit test

* import definitions only once
2024-09-30 13:28:30 -03:00
4a124469fa Check is config is default by comparing hashes (#92296) 2024-08-23 11:22:06 +02:00
e321dbb690 Alerting: Use remote Alertmanager to test templates and receivers when enabled (#91570)
* Initial impl

* Add code to test templates and receivers

* Fix linter

* Fix forked am tests

* Update mimir client

* Remove trailing whitespace

* re-trigger CI
2024-08-15 16:56:14 +01:00
25dbb32cea Alerting: Vendor in latest grafana/alerting package (#91786)
* temp

* vendor

* Remove dead code

* Vendoring
2024-08-12 15:37:15 +01:00
3bb861b9f0 Alerting: Remove empty/namespace labels when sending alerts to the remote Alertmanager (#90284)
* Alerting: Remove empty/namespace labels when sending alerts to the remote Alertmanager

* update tests

* fix typo in comment
2024-07-11 15:20:12 +02:00
fce03cd724 Alerting: Send static headers to the remote Alertmanager (#89846) 2024-07-01 17:48:40 +02:00
fe1309dd96 Alerting: Send external URL to the remote Alertmanager (#89701)
* Alerting: Send external URL to the remote Alertmanager

* test that the URL is sent to the remote Alertmanager

* AppURL -> ExternalURL
2024-06-26 14:02:02 +02:00
8eabef1f91 Alerting: Update remote alertmanager to use extended receivers API. (#89253)
* Alerting: Update remote alertmanager to use extended receivers API.

* Update integration test and Mimir image

* Update Mimir image in more places
2024-06-19 12:40:22 +03:00
8491e02caf Alerting: Instrument outbound requests for Loki Historian and Remote Alertmanager with tracing (#89185)
* Add TracedClient

* Handle errors and status codes

* Wire up tracing to normal ASH and loki annotation mapping

* Add tracing to remote alertmanager

* one more spot

* and not or

* More consistency with other grafana traces, lower cardinality name
2024-06-14 13:24:12 -05:00
6262c56132 chore(perf): Pre-allocate where possible (enable prealloc linter) (#88952)
* chore(perf): Pre-allocate where possible (enable prealloc linter)

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* fix TestAlertManagers_buildRedactedAMs

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* prealloc a slice that appeared after rebase

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

---------

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
2024-06-14 14:16:36 -04:00
dd3c3b5857 Alerting: Update grafana/alerting. (#88914) 2024-06-14 09:19:04 +02:00
12d5251c12 Alerting: Alertmanager configuration sync loop (#88822)
* make the config sync happen on each call to ApplyConfig(), fix tests

* send autogen config

* add fake autogen function for tests

* update stale comments, tidy things up, make linter happy

* add auto-gen routes only if the feature toggle is enabled

* remove unnecessary fake autogen function

* throttle configuration syncs

* restore pkg/services/store/entity/sqlstash/sql_storage_server.go

* test sync loop in ApplyConfig, skip invalid autogen routes

* restore conf/defaults.ini

* restore conf/defaults.ini

* avoid skipping invalid auto-gen routes in SaveAndApplyConfig

* test that autogenFn is called and its errors are returned

* add debug message about the sync interval not having elapsed

* collapse two log lines into one
2024-06-12 10:13:34 +02:00
5f4d07bb75 Alerting: Enable remote primary mode using feature toggles (#88976) 2024-06-10 17:07:13 +02:00
a738cb42d8 Prometheus: Update dependency to v0.52.0 (#87809)
* Prometheus: Update dependency to v0.52.0

* go work sync

* fix panics in tests

* go work sync

* prometheus v0.52.0

* handle errors

* Update pkg/services/ngalert/sender/sender_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/sender/sender_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

---------

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: ismail simsek <ismailsimsek09@gmail.com>
2024-05-28 15:22:20 +02:00
ed42119907 Alerting: Pass metrics Registerer into NewExternalAlertmanagerSender. (#88313)
* Alerting: Pass metrics Registerer into NewExternalAlertmanagerSender.

I will work on a separate change to export the metrics from Grafana, this
is a little more complicated.

* Typo
2024-05-24 23:03:34 +02:00
8bcf589301 Alerting: Pass logger into NewExternalAlertmanagerSender (#88256) 2024-05-24 20:11:26 +02:00
e41434c332 Alerting: Promote configuration in the remote Alertmanager (#87388) 2024-05-16 12:06:03 +02:00
b76a9e4d31 Alerting: Implement GetStatus in the remote Alertmanager struct (#84887)
* Alerting: Implement GetStatus in the remote Alertmanager struct

* update tests

* fix tests, extract AlertmanagerConfig from PostableConfig

* get the remote AM config instead of the Grafana one from the remote AM

* pass grafana AM config in test

* return error in GetStatus instead of logging it (internal AM)
2024-05-03 13:59:02 +02:00
c77ab53819 Alerting: implement SaveAndApplyConfig in the remote Alertmanager struct (#84642)
* implement SaveAndApplyConfig in the remote Alertmanager struct

* remove ID from CreateGrafanaAlertmanagerConfig call

* decrypt, test that we decrypt, refactor

* fix duplicated declaration in test

* rephrase comment, remove unnecessary conversion to slice of bytes

* fix test
2024-04-23 14:37:10 +02:00
309a7e7684 Alerting: Implement SaveAndApplyDefaultConfig in the remote Alertmanager struct (#85005)
* Alerting: Implement SaveAndApplyDefaultConfig in the remote Alertmanager struct

* send the hash of the encrypted configuration

* tests, default config hash in AM struct

* add missing default config to test

* restore build directory

* go work file...

* fix broken test

* remove unnecessary conversion to []byte

* go work again...

* make things work again with latest main branch changes

* update error messages in tests for decrypting config
2024-04-19 15:11:07 +02:00
a2ce8fefed Alerting: Use a struct when sending a Grafana AM configuration to the remote Alertmanager (#86451)
* Alerting: Use a struct when sending a Grafana AM configuration to the remote Alertmanager

* remove '-distroless' from mimir image name
2024-04-19 13:04:18 +02:00
f79dd7c7f9 Alerting: Persist silence state immediately on Create/Delete (#84705)
* Alerting: Persist silence state immediately on Create/Delete

Persists the silence state to the kvstore immediately instead of waiting for the
 next maintenance run. This is used after Create/Delete to prevent silences from
 being lost when a new Alertmanager is started before the state has persisted.
 This can happen, for example, in a rolling deployment scenario.

* Fix test that requires real data

* Don't error if silence state persist fails, maintenance will correct
2024-04-09 13:39:34 -04:00
2e7cc68394 Alerting: Remove CleanUp method from the Alertmanager (#85650)
Alerting: Remove Cleanup method from the Alertmanager
2024-04-09 12:13:27 +02:00
0c3c5c5607 Alerting: Stop persisting silences and nflog to disk (#84706)
With this change, we no longer need to persist silence/nflog states to disk in addition to the kvstore
2024-03-23 00:37:33 +02:00
4ad6d66479 Alerting: Remove ID from UserGrafanaConfig struct (#84602)
* Alerting: Remove ID from UserGrafanaConfig struct

* user custom mimir image withoud id in grafana config

* change mimir image name
2024-03-19 12:47:13 +01:00
c9bb18101c Alerting: Decrypt secrets before sending configuration to the remote Alertmanager (#83640)
* (WIP) Alerting: Decrypt secrets before sending configuration to the remote Alertmanager

* refactor, fix tests

* test decrypting secrets

* tidy up

* test SendConfiguration, quote keys, refactor tests

* make linter happy

* decrypt configuration before comparing

* copy configuration struct before decrypting

* reduce diff in TestCompareAndSendConfiguration

* clean up remote/alertmanager.go

* make linter happy

* avoid serializing into JSON to copy struct

* codeowners
2024-03-19 12:12:03 +01:00
3217a0dc05 Alerting: Fix state sync errors counter increment (#80702) 2024-01-17 11:04:27 +01:00
9e78faa7ba Alerting: Add metrics to the remote Alertmanager struct (#79835)
* Alerting: Add metrics to the remote Alertmanager struct

* rephrase http_requests_failed description

* make linter happy

* remove unnecessary metrics

* extract timed client to separate package

* use histogram collector from dskit

* remove weaveworks dependency

* capture metrics for all requests to the remote Alertmanager (both clients)

* use the timed client in the MimirAuthRoundTripper

* HTTPRequestsDuration -> HTTPRequestDuration, clean up mimir client factory function

* refactor

* less git diff

* gauge for last readiness check in seconds

* initialize LastReadinesCheck to 0, tweak metric names and descriptions

* add counters for sync attempts/errors

* last config sync and last state sync timestamps (gauges)

* change latency metric name

* metric for remote Alertmanager mode

* code review comments

* move label constants to metrics package
2024-01-10 11:18:24 +01:00
9945514baa Alerting: Validate configuration for the remote Alertmanager struct (#79691)
* Alerting: Validate configuration for the remote Alertmanager struct

* add TenantID to test

* add OrgID to config struct in tests
2023-12-19 18:41:48 +01:00
23b4568597 Alerting: Send configuration and state to the remote Alertmanager on shutdown (#78682)
* Alerting: Send configuration and state to the remote Alertmanager on shutdown

* Alerting: Add a sync interval for ApplyConfig in remote secondary mode

* add routine to sync states and configs

* pass a cancellable context to syncRoutine(), remove tests for ApplyConfig, cache last config in memory

* extract logic to update config and state in the remote Alertmanager

* get latest config from the database

* avoid using separate goroutine for updating state and config

* clean up PR

* refactor, comments, tests

* update tests

* remove canceled context from calls to StopAndWait()

* create context with timeout and send config and state to remote Alertmanager

* update tests

* address code review comments
2023-12-13 22:53:09 +01:00
91836e7832 Alerting: Add time-based convergence in remote secondary mode (#78809)
* Alerting: Add a sync interval for ApplyConfig in remote secondary mode

* add routine to sync states and configs

* pass a cancellable context to syncRoutine(), remove tests for ApplyConfig, cache last config in memory

* extract logic to update config and state in the remote Alertmanager

* get latest config from the database

* avoid using separate goroutine for updating state and config

* clean up PR

* refactor, comments, tests

* update tests

* add config struct for remote secondary forked Alertmanager

* use errgroups for sync operations

* use waitgroup instead of errgroup

* remove helper method to sync AMs

* check for errors instead of bool syncErr
2023-12-13 13:36:17 +01:00
cc3c0a2cc2 Alerting: Refactor readiness check (#78799)
* Alerting: Refactor readiness check

Moves the readiness check to the mimir client and removes the need to assert that we have senders - it already has a queue and can hold notifications until we're ready to send them.
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-12 15:34:54 +00:00
73776f37eb Alerting: Send state to the remote Alertmanager (#78538)
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager

Mimir client that understands the new APIs developed for mimir. Very much a WIP still.

* more wip

* appease the linter

* more linting

* add more code

* get state from kvstore, encode, send

* send state to the remote Alertmanager, extract fullstate logic into its own function

* pass kvstore to remote.NewAlertmanager()

* refactor

* add fake kvstore to tests

* tests

* use FileStore to get state

* always log 'completed state upload'

* refactor compareRemoteConfig

* base64-encode the state in the file store

* export silences and nflog filenames, refactor

* log 'completed state/config upload...' regardless of outcome

* add values to the state store in tests

* address code review comments

* log error from filestore

---------

Co-authored-by: gotjosh <josue.abreu@gmail.com>
2023-11-29 12:49:39 +01:00
01d274852c Alerting: Add GetFullState method to FileStore (#78701)
* Alerting: Add GetFullState method to FileStore

* make tests compile, create stateStore in NewAlertmanager

* return errors instead of logging, accept an arbitrary number of strings

* make NewAlertmanager() accept a stateStore
2023-11-28 15:34:45 +01:00
8120306fea Remote Alertmanager(refactor): Only parse the URL once (#78631)
* Remote Alertmanager(refactor): Only parse the URL once

Exactly what it says in the tin.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* use the existing tests

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-11-24 11:05:13 +00:00
23fe8f4e9c Alerting: Introduce a Mimir client as part of the Remote Alertmanager (#78357)
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager

This is our first attempt at making Grafana communicate use Mimir as a backend - it uses a new set of APIs that we've developed on the Mimir side to upload the grafana configuration and alertmanager state so that it can then be ported over.

Codewise, we've introduced a couple of things:

A client to isolate in its own package all the communication that happens with Mimir
A few changes to the remote/alertmanager to include uploading the configuration and state when it starts
A few refactors that align a bit better with the design approach that we're thinking
An integration tests again these newly developed APIs using a custom image

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
2023-11-23 16:59:36 +00:00