How to Set Up Monitoring on Nebari
In Nebari, we've integrated Grafana, Prometheus, and Loki to provide robust monitoring capabilities for your data science platform. This integration allows you to visualize metrics, monitor system health, and analyze logs effectively. Below, we'll discuss each component and how they are deployed using Helm charts, along with instructions on how to override configuration values.
Monitoring is enabled by default in Nebari. To disable it, set the following in your nebari-config.yaml:
monitoring:
  enabled: false
Components Overview
Grafana
Grafana is a leading open-source platform for monitoring and observability. It provides rich visualization tools and dashboards for analyzing and monitoring metrics from various data sources.
Prometheus
Prometheus is a popular open-source monitoring and alerting toolkit. It collects metrics from configured targets, stores them efficiently, and allows querying them in real-time.
Loki
Loki is a horizontally-scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost-effective and easy to operate, as it does not index the contents of the logs, but rather a set of labels for each log stream.
See "How to access system logs (Loki) via Grafana" for more information on using Loki in Nebari.
Terraform Overrides
Nebari lets users customize the deployment of various components, and Loki is one of them. The Loki deployment is made up of three fundamental components:
- Loki: a set of components that, when composed, form a fully featured logging stack
- Promtail: an agent which ships the contents of local logs to a Loki instance
- MinIO: a Kubernetes-native high-performance object storage server which is designed for large-scale private cloud infrastructure and compatible with Amazon S3.
monitoring:
  enabled: true
  overrides:
    loki: <LOKI-HELM-CHART-VALUES-OVERRIDE>
    promtail: <PROMTAIL-HELM-CHART-VALUES-OVERRIDE>
    minio: <MINIO-HELM-CHART-VALUES-OVERRIDE>
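To build intuition for what an override does, the sketch below mimics the way Helm generally layers user-supplied values over a chart's defaults: nested maps are merged key by key, and any other value replaces the default. The `deep_merge` helper and the default values are hypothetical and only illustrate the idea; they are not taken from Nebari or from the actual charts.

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` applied key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are maps: merge them recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys replace the default outright.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults, for illustration only.
chart_defaults = {"persistence": {"enabled": True, "size": "50Gi"}}
# Mirrors what could sit under `monitoring.overrides.minio` in nebari-config.yaml.
user_overrides = {"persistence": {"size": "100Gi"}}

effective = deep_merge(chart_defaults, user_overrides)
print(effective)
```

The practical consequence is that an override only needs to list the keys that change; everything else keeps the chart's default.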
Below are some examples of customizing your Loki deployment via Terraform overrides:
Log Storage Config
By default, Nebari stores logs in MinIO, a high-performance, S3-compatible object store. You can use any S3-compatible object store instead. Below is an example configuration for AWS S3:
monitoring:
  enabled: true
  minio_enabled: false
  overrides:
    loki:
      loki:
        storage_config:
          aws:
            s3: s3://<access_key>:<uri-encoded-secret-access-key>@<region>
See https://grafana.com/docs/loki/latest/storage/#aws-deployment-s3-single-store for more configuration options.
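The `<uri-encoded-secret-access-key>` placeholder matters because secret keys often contain characters such as `/` and `+` that are not safe inside a URL. A minimal sketch of producing the string with Python's standard library is shown here; the `build_s3_url` helper and the credentials are made up for illustration.

```python
from urllib.parse import quote


def build_s3_url(access_key: str, secret_access_key: str, region: str) -> str:
    """Build the s3://<access_key>:<secret>@<region> string with the secret percent-encoded."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    encoded_secret = quote(secret_access_key, safe="")
    return f"s3://{access_key}:{encoded_secret}@{region}"


# Made-up credentials for illustration only; never commit real keys.
url = build_s3_url("AKIAEXAMPLE", "abc/def+ghi", "us-east-1")
print(url)
```

The resulting string is what would go into the `s3` field of the configuration above.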
Log Storage
You can customize the size of the persistent volume used for log storage by setting minio.persistence.size, as shown below:
monitoring:
  enabled: true
  overrides:
    minio:
      persistence:
        size: 100Gi
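When picking a value for `minio.persistence.size`, a rough estimate can help. The sketch below multiplies an assumed daily stored log volume by the number of days logs are kept (Nebari's default retention is 60 days) and adds some headroom. The `estimate_volume_gib` helper, the 1 GiB per day figure, and the 25% headroom are all assumptions for illustration; real usage depends on your workloads and on how well the logs compress.

```python
import math


def estimate_volume_gib(daily_logs_gib: float, retention_days: int, headroom: float = 0.25) -> int:
    """Estimate the volume size in GiB needed to hold `retention_days` of logs."""
    raw = daily_logs_gib * retention_days
    # Round up so the estimate never undershoots.
    return math.ceil(raw * (1 + headroom))


# Assumed inputs: 1 GiB of stored logs per day, kept for 60 days.
size = estimate_volume_gib(daily_logs_gib=1.0, retention_days=60)
print(f"{size}Gi")
```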
Log Retention Period
Unlike cloud object stores such as S3 or GCS, MinIO's storage is predefined and does not scale indefinitely out of the box; it grows only if you keep increasing the MinIO persistent volume size. It is therefore a good idea to clear out old logs to make space for new ones. Nebari does this by default with a 60-day retention period. You can customize the log retention period by overriding the default value as follows:
monitoring:
  enabled: true
  overrides:
    loki:
      loki:
        limits_config:
          # The minimum retention period is 24h.
          retention_period: 90d
Deleting old logs is not suitable for every case. For example, if your organization needs to keep all logs indefinitely, you may want to use a very large persistent volume and disable deletion of old logs. You can disable deletion via:
monitoring:
  enabled: true
  overrides:
    loki:
      loki:
        compactor:
          retention_enabled: false
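If you do keep retention enabled, it can be worth sanity-checking a `retention_period` value against the 24h minimum before it goes into the config. The sketch below is a simplified check that assumes a single number followed by one unit (`s`, `m`, `h`, `d`, or `w`); it does not cover every duration form Loki may accept, and the helper names are hypothetical.

```python
import re
from datetime import timedelta

# Map of duration suffixes to timedelta keyword arguments.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}
MIN_RETENTION = timedelta(hours=24)


def parse_duration(text: str) -> timedelta:
    """Parse a simple duration such as '90d' or '24h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


def is_valid_retention(text: str) -> bool:
    """Return True when the duration meets the 24h minimum."""
    return parse_duration(text) >= MIN_RETENTION


print(is_valid_retention("90d"))
print(is_valid_retention("12h"))
```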
Logging architecture
The architecture diagram below shows a simplified, high-level explanation of the logging components on Nebari.
Grafana is the dashboarding user interface, which allows us to use Loki as the data source for our logs. Loki, in turn, receives its logs from promtail.
The promtail component scrapes logs from various pods on the Kubernetes nodes. The kube api server provides the API endpoints that promtail uses for discovering and scraping its targeted resources.
End users viewing the logs in Grafana will create queries using Loki as the data source, typically querying based on labels. However, it's important to note that these labels differ from Kubernetes labels: their main goal is to act as an aggregation layer that groups logs from multiple matching resources into a single "stream," so users can access a collection of logs from various Kubernetes resources with a single logical label.
Loki's "labels" are used to filter collections of logs from the available kubernetes_sd API endpoints, similar to how Prometheus handles metrics. These labels are configured through Promtail, the agent responsible for collecting and shipping logs to Loki based on the defined targets and scraping configurations.
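To make label-based querying concrete, the sketch below renders a set of labels as a LogQL stream selector and places it in a URL for Loki's `query_range` HTTP endpoint. The label names (`namespace`, `app`), their values, and the base URL are assumptions for illustration; the labels actually available depend on your Promtail configuration. No request is sent; the code only builds strings.

```python
from urllib.parse import urlencode


def build_selector(labels: dict[str, str]) -> str:
    """Render labels as a LogQL stream selector, e.g. {app="x", namespace="y"}."""
    parts = []
    for name, value in sorted(labels.items()):
        # Escape backslashes and double quotes inside the label value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


def build_query_url(base_url: str, labels: dict[str, str], limit: int = 100) -> str:
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({"query": build_selector(labels), "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


# Hypothetical labels and Loki address.
selector = build_selector({"namespace": "dev", "app": "jupyterhub"})
url = build_query_url("http://loki.example:3100", {"namespace": "dev", "app": "jupyterhub"})
print(selector)
print(url)
```

The same selector string can be pasted into the query field of Grafana's Explore view with Loki selected as the data source.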
For details on how to view specific logs in Loki, see "How to access system logs (Loki) via Grafana".
References
- More information on promtail configurations
- Understanding labels in Loki