From 919a26f38904feb951dc32e45c06b9a0c78cdcd6 Mon Sep 17 00:00:00 2001
From: Marcus Efraimsson
Date: Thu, 22 Sep 2022 11:39:14 +0200
Subject: [PATCH] Instrumentation: Guidance/conventions for logs, metrics and traces (#55562)

A first rough draft of adding some guidance/conventions for instrumenting Grafana with logs, metrics and traces together with how to run things locally to query/visualize logs, metrics and traces.

Closes #55470

Co-authored-by: Emil Tullstedt
Co-authored-by: Kristin Laemmert
Co-authored-by: Carl Bergquist
---
 .../engineering/backend/instrumentation.md | 276 ++++++++++++++++++
 .../docker/blocks/jaeger/docker-compose.yaml | 2 +-
 2 files changed, 277 insertions(+), 1 deletion(-)
 create mode 100644 contribute/engineering/backend/instrumentation.md

diff --git a/contribute/engineering/backend/instrumentation.md b/contribute/engineering/backend/instrumentation.md
new file mode 100644
index 00000000000..b8b77b473dd
--- /dev/null
+++ b/contribute/engineering/backend/instrumentation.md
@@ -0,0 +1,276 @@
+# Instrumenting Grafana
+
+Guidance, conventions and best practices for instrumenting Grafana using logs, metrics and traces.
+
+## Logs
+
+Logs are files that record events, warnings and errors as they occur within a software environment. Most logs include contextual information, such as the time an event occurred and which user or endpoint was associated with it.
+
+### Usage
+
+Use the _pkg/infra/log_ package to create a named structured logger. Example:
+
+```go
+import (
+	"fmt"
+
+	"github.com/grafana/grafana/pkg/infra/log"
+)
+
+logger := log.New("my-logger")
+logger.Debug("Debug msg")
+logger.Info("Info msg")
+logger.Warning("Warning msg")
+logger.Error("Error msg", "error", fmt.Errorf("BOOM"))
+```
+
+### Naming conventions
+
+Name the logger using lowercase characters, e.g. `log.New("my-logger")` using snake_case or kebab-case styling.
+
+Prefix the logger name with an area name when using different loggers across a feature or related packages, e.g. `log.New("plugin.loader")` and `log.New("plugin.client")`.
+
+Start the log message with a capital letter, e.g. `logger.Info("Hello world")` instead of `logger.Info("hello world")`. The log message should be an identifier for the log entry; avoid parameterization in favor of key-value pairs for additional data.
+
+Prefer using camelCase style when naming log keys, e.g. _remoteAddr_, to be consistent with Go identifiers.
+
+Use the key _error_ when logging Go errors, e.g. `logger.Error("Something failed", "error", fmt.Errorf("BOOM"))`.
+
+### Validate and sanitize input coming from user input
+
+If log messages or key/value pairs originate from user input they **should** be validated and sanitized.
+
+Be **careful** not to expose any sensitive information in log messages, e.g. secrets, credentials etc. It's especially easy to do by mistake when including a struct as a value.
+
+### Log levels
+
+When to use which log level?
+
+- **Debug:** Informational messages of high frequency and/or less-important messages during normal operations.
+- **Info:** Informational messages of low frequency and/or important messages.
+- **Warning:** Should not be needed in normal cases. If used, it should be actionable.
+- **Error:** Error messages indicating some operation failed (with an error) and the program didn't have a way to handle the error.
+
+### Contextual logging
+
+Use a contextual logger to include additional key/value pairs attached to `context.Context`, e.g. `traceID`, to allow correlating logs with traces and/or correlating logs with a common identifier.
+
+Example:
+
+```go
+import (
+	"context"
+	"fmt"
+
+	"github.com/grafana/grafana/pkg/infra/log"
+)
+
+var logger = log.New("my-logger")
+
+func doSomething(ctx context.Context) {
+	ctxLogger := logger.FromContext(ctx)
+	ctxLogger.Debug("Debug msg")
+	ctxLogger.Info("Info msg")
+	ctxLogger.Warning("Warning msg")
+	ctxLogger.Error("Error msg", "error", fmt.Errorf("BOOM"))
+}
+```
+
+### Enable certain log levels for certain loggers
+
+During development it's convenient to enable a certain log level, e.g. debug, for certain loggers to minimize the generated log output and make it easier to find things. See [[log.filters]](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#filters) for information on how to configure this.
+
+It's also possible to configure multiple loggers:
+
+```ini
+[log]
+filters = rendering:debug \
+          ; alerting.notifier:debug \
+          oauth.generic_oauth:debug \
+          ; oauth.okta:debug \
+          ; tsdb.postgres:debug \
+          ; tsdb.mssql:debug \
+          ; provisioning.plugins:debug \
+          ; provisioning:debug \
+          ; provisioning.dashboard:debug \
+          ; provisioning.datasources:debug \
+          datasources:debug \
+          data-proxy-log:debug
+```
+
+## Metrics
+
+Metrics are quantifiable measurements that reflect the health and performance of applications or infrastructure.
+
+Consider using metrics to provide real-time insight into the state of resources. If you want to know how responsive your application is or identify anomalies that could be early signs of a performance issue, metrics are a key source of visibility.
+
+### Metric types
+
+See [Prometheus metric types](https://prometheus.io/docs/concepts/metric_types/) for a list and description of the different metric types you can use and when to use them.
+
+There are many possible types of metrics that can be tracked. One popular method for defining metrics is the [RED method](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/).
+
+### Naming conventions
+
+Use the namespace _grafana_, which prefixes every defined metric name with `grafana_`. This makes it clear to operators that any metric named `grafana_*` belongs to Grafana.
+
+Use snake_case style when naming metrics, e.g. _http_request_duration_seconds_ instead of _httpRequestDurationSeconds_.
+
+Use snake_case style when naming labels, e.g. _status_code_ instead of _statusCode_.
+
+If metric type is a _counter_, name it with a `_total` suffix, e.g. _http_requests_total_.
+
+If metric type is a _histogram_ and you're measuring duration, name it with a `_<unit>` suffix, e.g. _http_request_duration_seconds_.
+
+If metric type is a _gauge_, name it to denote it's a value that can increase and decrease, e.g. _http_request_in_flight_.
+
+### Label values and high cardinality
+
+Be careful with what label values you add/accept. Using/allowing too many label values could result in [high cardinality problems](https://grafana.com/blog/2022/02/15/what-are-cardinality-spikes-and-why-do-they-matter/).
+
+If label values originate from user input they **should** be validated. Use `metricutil.SanitizeLabelName(