Monitoring and Alerts
Application metrics
All components of the backend are designed to output application metrics.
Metrics are prefixed with the name of each application. The metrics are used in Grafana for charts, monitoring and alarming.
They use the StatsD protocol.
Application metrics data flow:
Ellipses represent data; rectangles represent processes. Purple components belong to the backend.
Prometheus and Grafana provide historical charts for more than 90 days and are useful to investigate long-term trends.
Netdata provides a web UI with real-time metrics. See the dedicated subchapter for details.
StatsD
All backend components send StatsD metrics over UDP using localhost as destination.
This guarantees that applications never block on metric generation if the receiver slows down.
The StatsD messages are received by Netdata. It automatically tracks any new metric, generates averages and summaries as needed and exposes them to Prometheus for scraping.
In the codebase the statsd library is typically used through a client object named metrics, with calls such as metrics.gauge and metrics.timer.
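The fire-and-forget behaviour described above can be illustrated with a minimal, standard-library-only sketch. It is not the statsd library used by the backend: the class name MiniStatsd and the prefix "demo_app" are hypothetical, and port 8125 is assumed only because it is the conventional StatsD port. The sketch sends the plain-text StatsD datagrams (name:value|type) over a non-blocking UDP socket and silently drops a datapoint if the send fails.

```python
import functools
import socket
import time


class MiniStatsd:
    """Hypothetical stand-in for a StatsD client (not the real statsd library)."""

    def __init__(self, prefix, host="127.0.0.1", port=8125):
        self.prefix = prefix
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Non-blocking: a slow or missing receiver must never stall the caller.
        self.sock.setblocking(False)

    def _send(self, name, value, kind):
        payload = f"{self.prefix}.{name}:{value}|{kind}".encode()
        try:
            self.sock.sendto(payload, self.addr)
        except OSError:
            pass  # metrics are best-effort: drop the datapoint

    def gauge(self, name, value):
        """Send a gauge datapoint (type "g")."""
        self._send(name, value, "g")

    def timer(self, name):
        """Decorator that reports the wrapped call's duration in ms (type "ms")."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    elapsed_ms = (time.perf_counter() - start) * 1000
                    self._send(name, f"{elapsed_ms:.3f}", "ms")
            return wrapper
        return decorator


metrics = MiniStatsd(prefix="demo_app")  # "demo_app" is a made-up prefix


@metrics.timer("generate_test_list")
def generate_test_list():
    return ["https://example.org/"]


metrics.gauge("queue_size", 3)
generate_test_list()
```

Naming the client object metrics mirrors the call sites referenced in this section, so the same search patterns would match code written in this style.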
Because of this, a quick way to identify where metrics are being generated in the backend codebase is to search e.g.:
- https://github.com/search?q=repo%3Aooni%2Fbackend+metrics.gauge&type=code
- https://github.com/search?q=repo%3Aooni%2Fbackend+metrics.timer&type=code
Where possible, timers have the same name as the function being timed e.g. https://github.com/search?q=repo%3Aooni%2Fbackend+clickhouse_upsert_summary&type=code
See Conventions for patterns around component naming.
Metrics list
This subsection provides a list of the most important application metrics as they are shown in Grafana. The names are autogenerated by Netdata based on the metric name used in StatsD.
For example, a @metrics.timer("generate_test_list") Python decorator is used at:
https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/ooniapi/prio.py#L162
Such a timer is processed by Netdata and appears in Grafana under an autogenerated name.
The metrics always start with netdata_statsd and end with one of:
- _milliseconds_average
- _events_persec_average
- _value_average
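The naming rule above can be turned into a small filter, for instance to pick application metrics out of a longer list of Prometheus metric names. This is only a sketch: the function name is hypothetical and the sample names in the usage lines are made up for illustration, not taken from a real dashboard.

```python
PREFIX = "netdata_statsd"
SUFFIXES = (
    "_milliseconds_average",
    "_events_persec_average",
    "_value_average",
)


def is_statsd_app_metric(name: str) -> bool:
    """True if the name follows the netdata_statsd ... <suffix> pattern."""
    return name.startswith(PREFIX) and name.endswith(SUFFIXES)


# Made-up sample names, for illustration only.
names = [
    "netdata_statsd_example_timer_milliseconds_average",
    "netdata_statsd_example_gauge_value_average",
    "node_load1",
]
app_metrics = [n for n in names if is_statsd_app_metric(n)]
```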
Also see https://blog.netdata.cloud/introduction-to-statsd/
TIP: StatsD collectors (like Netdata or others) preprocess datapoints by calculating average/min/max values etc.
Run this to locate where in the backend codebase application metrics are being generated:
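One possible way to do this from Python is sketched below. It assumes a local checkout of the backend repository and that metrics are emitted through calls of the form metrics.<method>(...), as in the search links earlier in this section; the function name find_metric_calls is hypothetical.

```python
import re
from pathlib import Path

# Matches e.g. metrics.gauge( or @metrics.timer(
METRIC_CALL = re.compile(r"\bmetrics\.(\w+)\(")


def find_metric_calls(root):
    """Yield (path, line_number, method) for each metrics.<method>( call."""
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in METRIC_CALL.finditer(line):
                yield path, lineno, match.group(1)
```

Calling find_metric_calls(".") from the root of the checkout and printing each tuple gives a quick inventory of gauges and timers per file.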
Metrics for the ASN metadata updater. See the ASN metadata updater dashboard:
Metrics for the CitizenLab test list updater
Metrics for the Database backup tool. See the Database backup dashboard on Grafana:
Metrics for the social media blocking event detector:
Metrics for the Fastpath. Used in various dashboards, primarily the API and fastpath dashboard.
Metrics for the Fingerprint updater. See the Fingerprint updater dashboard on Grafana.
Metrics from Nginx caching of the aggregation API. See Aggregation cache monitoring
Metrics for the API.
Metrics for the GeoIP downloader.
Metrics for the test helper rotation.
Prometheus
Prometheus https://prometheus.io/ is a popular monitoring system and runs on monitoring.ooni.org.
It is deployed and configured by Ansible using the following playbook: https://github.com/ooni/sysadmin/blob/master/ansible/deploy-monitoring.yml
Most of the metrics are collected by scraping Prometheus endpoints, Netdata, and using node exporter. The web UI is accessible at https://prometheus.ooni.org
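Besides the web UI, Prometheus exposes an HTTP API whose instant-query endpoint is /api/v1/query. The sketch below only builds the request URL and decodes a response body; it performs no network access. Whether the API at prometheus.ooni.org is reachable without authentication is not stated here and is an assumption, and both function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    query = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(body):
    """Return [(labels, value)] from an instant-vector JSON response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "string"]}
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# "up" is the built-in metric reporting whether a scrape target is reachable.
url = instant_query_url("https://prometheus.ooni.org", "up")
```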
Blackbox exporter
Blackbox exporter is part of Prometheus. It's a daemon that performs HTTP probing against other hosts without relying on local agents (hence the name Blackbox) and feeds the generated datapoints into Prometheus.
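To make the idea of agentless probing concrete, here is a minimal sketch of a black-box HTTP probe. It is not the exporter's implementation: the function name probe_http is hypothetical, and the keys of the returned dictionary merely mirror metric names the exporter is known to expose (probe_success, probe_http_status_code, probe_duration_seconds).

```python
import time
import urllib.error
import urllib.request


def probe_http(url, timeout=5.0):
    """Probe a URL from the outside and return datapoints as a dict."""
    start = time.monotonic()
    status = 0
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = 0  # no HTTP answer at all (refused, timeout, DNS failure)
    duration = time.monotonic() - start
    return {
        "probe_success": 1 if 200 <= status < 300 else 0,
        "probe_http_status_code": status,
        "probe_duration_seconds": duration,
    }
```

The probed host needs nothing installed: everything is measured from the prober's side, which is what distinguishes this approach from agent-based collection such as Netdata.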
See https://github.com/prometheus/blackbox_exporter
It is deployed by Ansible on the monitoring.ooni.org host.
See Updating Blackbox Exporter runbook
Grafana dashboards
There are a number of dashboards on Grafana at https://grafana.ooni.org/
Grafana is deployed on the monitoring.ooni.org host. See Monitoring deployment runbook for deployment.
The dashboards are used for:
- Routinely reviewing the general health of the backend infrastructure
- Predicting long-term scaling requirements, i.e.
  - increasing disk space for the database
  - increasing CPU and memory requirements
- Investigating alerts and troubleshooting incidents
Alerting
Alerts from Grafana and Prometheus are sent to the #ooni-bots Slack channel by a bot.
Slack can be configured to provide desktop notifications from browsers and audible notifications on smartphones.
Alert flow:
The diagram does not explicitly include alertmanager. It is part of Prometheus and receives alerts and routes them to Slack.
More detailed diagram:
In the diagram Prometheus receives, stores and serves datapoints and has some alert rules to trigger alerts. Grafana acts as a UI for Prometheus and also triggers alerts based on alert rules configured in Grafana itself.
Alertmanager is simple: it receives alerts and sends notifications to Slack.
The alert rules are listed at https://grafana.ooni.org/alerting/list. The list also shows which alerts are firing at the moment, if any. There is also a handful of alerts configured in Prometheus using Ansible.
The silences list shows if any alert has been temporarily silenced: https://grafana.ooni.org/alerting/silences
See Grafana editing and Managing Grafana alert rules for details.
There are also many dashboards and alerts configured in Jupyter Notebook. These are meant for metrics that require more complex algorithms, predictions and SQL queries that cannot be implemented using Grafana, e.g. when using machine learning or Pandas. See Ooniutils microlibrary for details.
On many dashboards you can set the averaging timespan and the target hostname using fields on the top left.
Here is an overview of the most useful dashboards:
API and fastpath
This is the most important dashboard, showing metrics of the API and the Fastpath.
Test-list repository in the API
https://grafana.ooni.org/d/siWZslSVk/api-test-list-repo?orgId=1
This dashboard shows timings around the git repository checked out by the API that contains the test lists.
Measurement uploader dashboard
https://grafana.ooni.org/d/ma3Q6GzVz/api-uploader?orgId=1
This dashboard shows metrics, timing and amounts of data transferred by the Measurement uploader.
Fingerprint updater dashboard
https://grafana.ooni.org/d/JNlK8ox4z/fingerprints
This dashboard shows metrics and timing from the Fingerprint updater.
ClickHouse dashboard
https://grafana.ooni.org/d/thEkJB_Mz/clickhouse?orgId=1
This dashboard shows ClickHouse-specific performance metrics. It can be used for optimizations.
For investigating slow queries also see the ClickHouse queries notebook.
HaProxy dashboard
https://grafana.ooni.org/d/ba33e4df-d686-4459-b37d-3966af14ad00/haproxy
Basic metrics from HaProxy load balancers. Used for OONI bridges.
TLS certificate dashboard
https://grafana.ooni.org/d/-1mr7sWMk/ssl-certificates
Certificate expiration times. There are alerts configured in Grafana to alert on expiring certificates.
Test helpers dashboard
https://grafana.ooni.org/d/Dn1R7QEnz/test-helpers
Status, uptime and load metrics from the Test helpers.
Database backup dashboard
https://grafana.ooni.org/d/aQjQYhoGz/db-backup
Metrics, timing and data transferred by the Database backup tool.
Looking at the last 24 hours, you should be able to see the backup being run: https://grafana.ooni.org/d/aQjQYhoGz/db-backup?orgId=1&from=now-24h&to=now
The "Status" chart shows the running status. "Uploaded bytes in total" and "Backup time" should be self-explanatory.
TIP: If the backup time or size grows too much it could be worth alerting and considering implementing incremental backups.
Event detector dashboard
https://grafana.ooni.org/d/FH2TmwFVz/event-detection?orgId=1&refresh=1m
Basic metrics from the social media blocking event detector.
GeoIP MMDB database dashboard
https://grafana.ooni.org/d/0e6eROj7z/geoip?orgId=1&from=now-7d&to=now
Age and size of the GeoIP MMDB database. Also, a chart showing discrepancies between the lookup performed by the probes and the one performed in the API, used to gauge the benefits of using a centralized solution.
Also see Geolocation script
See GeoIP downloader
Host clock offset dashboard
https://grafana.ooni.org/d/9dLa-RSnk/host-clock-offset?orgId=1
Measures NTP clock sync and alarms on big offsets
Netdata-specific dashboard
https://grafana.ooni.org/d/M1rOa7CWz/netdata?orgId=1&var-instance=backend-fsn.ooni.org:19999
Shows all the metrics captured by Netdata. Useful for in-depth performance investigation.
ASN metadata updater dashboard
https://grafana.ooni.org/d/XRihZL-Vk/ansmeta-update?orgId=1&from=now-7d
Progress, runtime and table size of the ASN metadata updater.
See Metrics list
Netdata
Netdata https://www.netdata.cloud/ is a monitoring agent that runs locally on the backend servers. It exports host and Application metrics to Prometheus.
It also provides a web UI that can be accessed on port 19999. It can be useful during development, performance optimization and debugging as it provides metrics with higher time granularity (1 second) and almost no delay.
Netdata is not exposed on the Internet for security reasons and can be accessed only when needed by setting up port forwarding using SSH. For example:
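A forwarding command of this kind can be sketched as an argument list for the OpenSSH client. The helper name is hypothetical, and the hostname backend-fsn.ooni.org is borrowed from the Netdata dashboard URL in this document as an assumed example; -N (no remote command) and -L (local port forwarding) are standard ssh options.

```python
import shlex


def netdata_tunnel_command(host, local_port=19999, remote_port=19999):
    """Build an ssh command forwarding a local port to Netdata on the host."""
    forward = f"{local_port}:127.0.0.1:{remote_port}"
    return ["ssh", "-N", "-L", forward, host]


# Hostname is an assumption used for illustration.
cmd = netdata_tunnel_command("backend-fsn.ooni.org")
print(shlex.join(cmd))
```

While such a tunnel is open, the Netdata UI would be reachable in a browser at http://127.0.0.1:19999/ on the local machine.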
Netdata can also be run on a development desktop and be accessed locally in order to explore application metrics without having to deploy Prometheus and Grafana.
See Netdata-specific dashboard for an example of native Netdata metrics.
Log management
All components of the backend are designed to output logs to Systemd's journald. They usually log using the component name as Systemd unit name.
Sometimes you might have to use --identifier <name> instead for scripts that are not run as Systemd units.
Journald automatically indexes logs by time, unit name and other items. This makes it possible to quickly filter logs during troubleshooting, for example:
Or follow live logs using e.g.:
Sometimes it is useful to show milliseconds in the timestamps:
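The journalctl invocations described above can be sketched as a small command builder. The helper name and the unit name "ooni-api" are hypothetical; the options used (--unit, --identifier, --since, --follow, --output=short-precise) are standard journalctl options. Note that short-precise prints timestamps with microsecond precision, which covers the millisecond case.

```python
import shlex


def journalctl_command(unit=None, identifier=None, since=None,
                       follow=False, precise=False):
    """Build a journalctl argument list for common troubleshooting filters."""
    cmd = ["journalctl"]
    if unit:
        cmd.append(f"--unit={unit}")
    if identifier:
        # For scripts that are not run as Systemd units.
        cmd.append(f"--identifier={identifier}")
    if since:
        cmd.append(f"--since={since}")
    if follow:
        cmd.append("--follow")
    if precise:
        cmd.append("--output=short-precise")
    return cmd


# "ooni-api" is a made-up unit name used for illustration.
recent = journalctl_command(unit="ooni-api", since="-1h")
live = journalctl_command(unit="ooni-api", follow=True, precise=True)
print(shlex.join(recent))
print(shlex.join(live))
```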
The logger used in Python components also sets additional fields, notably CODE_FUNC and CODE_LINE.
Available fields can be listed using:
It is possible to filter by those fields. This comes in very handy for debugging, e.g.:
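As a sketch of field-based filtering: journalctl accepts FIELD=VALUE matches as positional arguments, and --output=json prints one JSON object per entry. The helpers below build such matches and extract code locations from JSON output; the function names and the sample entries are hypothetical.

```python
import json


def field_match_args(**fields):
    """Build FIELD=VALUE match arguments, e.g. CODE_FUNC=generate_test_list."""
    return [f"{name}={value}" for name, value in sorted(fields.items())]


def code_locations(json_lines):
    """Extract (CODE_FUNC, CODE_LINE, MESSAGE) from --output=json lines."""
    locations = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if "CODE_FUNC" not in entry:
            continue  # entry was not logged with code location fields
        line_no = entry.get("CODE_LINE")
        locations.append((
            entry["CODE_FUNC"],
            int(line_no) if line_no is not None else None,
            entry.get("MESSAGE"),
        ))
    return locations


# Made-up unit name; matches on different fields are combined with AND.
cmd = ["journalctl", "--unit=ooni-api", "--output=json"]
cmd += field_match_args(CODE_FUNC="generate_test_list")
```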
Every host running backend services also sends its logs to monitoring.ooni.org using Vector.
There is a dedicated ClickHouse instance on monitoring.ooni.org used to collect logs. See the ClickHouse instance for logs. This is done to avoid adding unnecessary load to the production database on FSN that contains measurements and also to keep a copy of FSN's logs on a different host.
The receiving Vector instance and ClickHouse are deployed and configured by Ansible using the following playbook: https://github.com/ooni/sysadmin/blob/master/ansible/deploy-monitoring.yml
See Logs from FSN notebook and Logs investigation notebook
Slack
Slack is used for team messaging and automated alerts at the following instance: https://openobservatory.slack.com/
#ooni-bots
#ooni-bots is a Slack channel used for automated alerts: https://app.slack.com/client/T37Q8EGUU/C38EJ0CET