Runbooks
Below you will find runbooks for common tasks and operations to manage our infra.
Monitoring deployment runbook
The monitoring stack is deployed and configured by Ansible on the monitoring.ooni.org host using the following playbook: https://github.com/ooni/sysadmin/blob/master/ansible/deploy-monitoring-config.yml
It includes:
- Grafana at https://grafana.ooni.org
- Jupyter Notebook at https://jupyter.ooni.org
- Vector (see Log management)
- local Netdata, Blackbox exporter, etc.
- Prometheus at https://prometheus.ooni.org
It also configures the FQDNs:
- loghost.ooni.org
- monitoring.ooni.org
- netdata.ooni.org
This also includes the credentials to access the Web UIs. They are deployed as /etc/nginx/monitoring.htpasswd from ansible/roles/monitoring/files/htpasswd.
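To add or update a Web UI user, the committed htpasswd file can be regenerated locally before redeploying; a minimal sketch, assuming the `htpasswd` utility from the apache2-utils package and a hypothetical username:

```
# add or update the "monitoring" user in the committed htpasswd file
htpasswd -B ansible/roles/monitoring/files/htpasswd monitoring
# commit the change, then redeploy the monitoring stack so Nginx picks it up
```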
Warning: the following steps are dangerously broken. Applying the changes will either not work or, worse, break production.
If you must do something of this sort, you will unfortunately have to resort to specifying the particular substeps you want to run using the -t tag filter (e.g. -t prometheus-conf to update the Prometheus configuration).
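For example, to apply only the Prometheus configuration substep (the tag name must match a tag defined in the playbook; `-C` performs a dry run first, as in the steps below):

```
./play deploy-monitoring.yml -l monitoring.ooni.org -t prometheus-conf --diff -C
# review the diff, then run again without -C to apply
```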
Steps:
- Review Ansible playbooks summary, Deploying a new host and Grafana dashboards.
- Run ./play deploy-monitoring.yml -l monitoring.ooni.org --diff -C and review the output
- Run ./play deploy-monitoring.yml -l monitoring.ooni.org --diff and review the output
Updating Blackbox Exporter runbook
This runbook describes updating Blackbox exporter.
The blackbox_exporter role in Ansible is pulled in by the deploy-monitoring.yml runbook.
The configuration file is at roles/blackbox_exporter/templates/blackbox.yml.j2 together with host_vars/monitoring.ooni.org/vars.yml.
To add a simple HTTP[S] check, for example, you can copy the βooni websiteβ block.
Edit it and run the deployment of the monitoring stack as described in the previous subchapter.
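As a reference, a hypothetical HTTP check module, modeled on the upstream Blackbox exporter configuration format (the module name and settings are illustrative, not the actual "ooni website" block):

```yaml
modules:
  ooni_website:            # illustrative module name
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: ip4
      fail_if_not_ssl: true
```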
Deploying a new host
To deploy a new host:
- Choose a FQDN like $name.ooni.org based on the DNS naming policy
- Deploy the physical host or VM using Debian Stable
- Create A and AAAA records for the FQDN in the Namecheap web UI
- Follow Updating DNS diagrams
- Review the inventory file and git-commit it
- Deploy the required stack. Run Ansible in test mode first. For example this would deploy a backend host:
- Update Prometheus by following the Monitoring deployment runbook
- git-push the commits
Also see the Monitoring deployment runbook for an example of deployment.
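The deployment step above might look like this (the playbook name is an assumption; `-C` runs Ansible in check mode first):

```
./play deploy-backend.yml -l newhost.ooni.org --diff -C   # dry run, review output
./play deploy-backend.yml -l newhost.ooni.org --diff      # apply
```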
Deleting a host
- Remove it from the inventory file
- Update the monitoring deployment using:
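This is the same playbook used in the Monitoring deployment runbook, e.g.:

```
./play deploy-monitoring.yml -l monitoring.ooni.org --diff
```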
Weekly measurements review runbook
On a daily or weekly basis the following dashboards and Jupyter notebooks can be reviewed to detect unexpected patterns in measurements focusing on measurement drops, slowdowns or any potential issue affecting the backend infrastructure.
When browsing the dashboards expand the time range to one year in order to spot long term trends. Also zoom in to the last month to spot small glitches that could otherwise go unnoticed.
Review the API and fastpath dashboard for the production backend host[s] for measurement flow, CPU and memory load, timings of various API calls, disk usage.
Review the Incoming measurements notebook for unexpected trends.
Quickly review the following dashboards for unexpected changes:
- Long term measurements prediction notebook
- Test helpers dashboard
- Test helper failure rate notebook
- Database backup dashboard
- GeoIP MMDB database dashboard
- GeoIP dashboard
- Fingerprint updater dashboard
- ASN metadata updater dashboard
Also check https://jupyter.ooni.org/view/notebooks/jupycron/summary.html for glitches like notebooks not being run etc.
Grafana backup runbook
This runbook describes how to back up dashboards and alarms in Grafana. It does not include backing up datapoints stored in Prometheus.
The Grafana SQLite database can be dumped by running:
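A sketch, assuming the default Grafana database location on the monitoring host:

```
ssh monitoring.ooni.org
sudo sqlite3 /var/lib/grafana/grafana.db .dump > grafana-dump.sql
```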
Future implementation is tracked in: Implement Grafana dashboard and alarms backup
Grafana editing
This runbook describes adding new dashboards, panels and alerts in Grafana.
To add a new dashboard use https://grafana.ooni.org/dashboard/new?orgId=1
To add a new panel to an existing dashboard load the dashboard and then click the "Add" button on the top.
Many dashboards use variables. For example, on https://grafana.ooni.org/d/l-MQSGonk/api-and-fastpath-multihost?orgId=1 the variables $host and $avgspan are set on the top left and used in metrics like:
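A hypothetical query using those variables (the metric name is illustrative, not the actual one used on the dashboard):

```
avg(rate(http_requests_total{instance="$host"}[$avgspan]))
```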
Managing Grafana alert rules
Alert rules can be listed at https://grafana.ooni.org/alerting/list
note The list also shows which alerts are currently alarming, if any.
Click the arrow on the left to expand each alerting rule.
The list shows:
note When creating alerts it can be useful to add full URLs linking to dashboards, runbooks etc.
To stop notifications create a "silence" either:
- by further expanding an alert rule (see below) and clicking the "Silence" button
- by inputting it in https://grafana.ooni.org/alerting/silences
Screenshot:
Additionally, the βShow state historyβ button is useful especially with flapping alerts.
Adding new fingerprints
This is performed on https://github.com/ooni/blocking-fingerprints
Updates are fetched automatically by the Fingerprint updater.
Also see the Fingerprint updater dashboard.
Backend code changes
This runbook describes making changes to backend components and deploying them.
Summary of the steps:
- Check out the backend repository.
- Create a dedicated branch.
- Update debian/changelog in the component you want to modify. See Package versioning for details.
- Run unit/functional/integ tests as needed.
- Create a pull request.
- Ensure the CI workflows are successful.
- Deploy the package on the testbed ams-pg-test.ooni.org and verify the change works as intended.
- Add a comment to the PR with the deployed version and stage.
- Wait for the PR to be approved.
- Deploy the package to production on backend-fsn.ooni.org. Ensure it is the same version that was used on the testbed. See API runbook for deployment steps.
- Add a comment to the PR with the deployed version and stage, then merge the PR.
When introducing new metrics:
- Create Grafana dashboards, alerts and Jupyter Notebooks and link them in the PR.
- Collect and analyze metrics and logs from the testbed stages before deploying to production.
- Test alarming by simulating incidents.
Backend component deployment
This runbook provides general steps to deploy backend components on production hosts.
Review the package changelog and the related pull request.
The amount of testing and monitoring required depends on:
- the impact of possible bugs in terms of number of users affected and consequences
- the level of risk involved in rolling back the change, if needed
- the complexity of the change and the risk of unforeseen impact
Monitor the API and fastpath dashboard and any dedicated dashboards. Review past weeks for any anomaly before starting a deployment.
Ensure that either the database schema is consistent with the new deployment by creating tables and columns manually, or that the new codebase is automatically updating the database.
Quickly check past logs.
Follow logs with:
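For example, on the backend host (the unit names are assumptions based on the component names used in this document):

```
sudo journalctl -f --no-hostname -u ooni-api -u fastpath
```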
While monitoring the logs, deploy the package using the deployer tool. (Details in The deployer tool subchapter.)
API runbook
This runbook describes making changes to the API and deploying it.
Follow Backend code changes and Backend component deployment.
In addition, monitor logs from Nginx and the API, focusing on HTTP errors and failing SQL queries.
Manually check Explorer and other Public and private web UIs as needed.
Managing feature flags
To change feature flags in the API a simple pull request like https://github.com/ooni/backend/pull/776 is enough.
Follow Backend code changes and deploy it after basic testing on ams-pg-test.ooni.org.
Running database queries
This subsection describes how to run queries against ClickHouse. You can run queries from a Jupyter Notebook or from the CLI:
Prefer using the default user when possible. To log in as admin:
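A sketch of the two invocations (the flag spellings follow the standard clickhouse-client CLI):

```
clickhouse-client                      # log in as the default user
clickhouse-client -u admin --ask-password   # log in as admin, prompting for the password
```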
note Heavy queries can impact the production database. When in doubt run them on the CLI interface in order to terminate them using CTRL-C if needed.
warning ClickHouse is not transactional! Always test queries that mutate schemas or data on testbeds like ams-pg-test.ooni.org
For long-running queries see the use of timeouts in Fastpath deduplication
Also see Dropping tables, Investigating table sizes
Modifying the fastpath table
This runbook shows an example of changing the contents of the fastpath table by running a "mutation" query.
warning This method creates changes that cannot be reproduced by external researchers by Reprocessing measurements. See Reproducibility
In this example Signal test measurements are being flagged as failed due to https://github.com/ooni/probe/issues/2627
Summarize affected measurements with:
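A sketch of such a summary (the column names are based on the fastpath table schema; the date threshold is illustrative):

```sql
SELECT count(), msm_failure
FROM fastpath
WHERE test_name = 'signal'
  AND measurement_start_time > '2023-10-01'
GROUP BY msm_failure
```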
important ALTER TABLE … UPDATE starts a mutation that runs in the background.
Check for any running or stuck mutation:
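Mutations can be inspected through the system tables, for example:

```sql
SELECT * FROM system.mutations WHERE is_done = 0
```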
Start the mutation:
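A sketch of the mutation; the WHERE clause must match the SELECT used to summarize the affected measurements (column names and values are illustrative):

```sql
ALTER TABLE fastpath
UPDATE msm_failure = 't'
WHERE test_name = 'signal'
  AND measurement_start_time > '2023-10-01'
```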
Run the previous SELECT queries to monitor the mutation and its outcome.
Updating tor targets
See Tor targets for a general description.
Review the Ansible chapter. Check out the repository and update the file ansible/roles/ooni-backend/templates/tor_targets.json
Commit the changes and deploy as usual:
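A sketch of the usual deployment to the testbed (the playbook name is an assumption):

```
./play deploy-backend.yml -l ams-pg-test.ooni.org --diff -C   # dry run
./play deploy-backend.yml -l ams-pg-test.ooni.org --diff      # apply
```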
Test the updated configuration, then:
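The production deployment follows the same pattern (the playbook name is an assumption):

```
./play deploy-backend.yml -l backend-fsn.ooni.org --diff
```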
git-push the changes.
Implements Document Tor targets
Creating admin API accounts
See Auth for a description of the API entry points related to account management.
The API provides entry points to:
The latter is implemented here.
important The default role for API accounts is user. For such accounts there is no need for a record in the accounts table.
To change roles it is required to be authenticated and have a role as admin.
It is also possible to create or update roles by running SQL queries directly on ClickHouse. This can be necessary to create the initial admin account on a new deployment stage.
A quick way to identify the account ID of a user is to extract logs from the API, either from the backend host or using the Logs from FSN notebook
Example output:
Then on the database test host:
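For example, open a ClickHouse shell as described in Running database queries:

```
clickhouse-client -u admin --ask-password
```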
Then in the ClickHouse shell insert a record to give the admin role to the user. See Running database queries:
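A hypothetical insert, using the account ID extracted from the logs (the column names are assumptions based on the schema description in this section):

```sql
INSERT INTO accounts (account_id, role) VALUES ('<account_id>', 'admin')
```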
accounts is an EmbeddedRocksDB table with account_id as primary key. No record deduplication is necessary.
To access the new role the user has to log out from the web UIs and log in again.
important Account IDs are not the same across test and production instances.
This is due to the use of a configuration variable ACCOUNT_ID_HASHING_KEY in the hashing of the email address. The parameter is read from the API configuration file. The values are different across deployment stages as a security feature.
Fastpath runbook
Fastpath code changes and deployment
Review Backend code changes and Backend component deployment for changes and deployment of the backend stack in general.
Also see Modifying the fastpath table
In addition, monitor logs and Grafana dashboards focusing on changes in incoming measurements.
You can use the deployer tool to perform deployments and rollbacks of the Fastpath.
important the fastpath is configured not to restart automatically during deployment.
Always monitor logs and restart it as needed:
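For example (the unit name is an assumption based on the component name):

```
sudo journalctl -f --no-hostname -u fastpath
sudo systemctl restart fastpath
```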
Fastpath manual deployment
Sometimes it can be useful to run APT directly:
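A sketch, with a placeholder version string:

```
ssh backend-fsn.ooni.org
sudo apt-get update
sudo apt-get install fastpath=<version>
```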
Reprocessing measurements
Reprocess old measurements by running the fastpath manually. This can be done without shutting down the fastpath instance running on live measurements.
You can run the fastpath as root or using the fastpath user. Both users are able to read the configuration file under /etc/ooni. The fastpath will download Postcans in the local directory.
Run fastpath -h for the full list of options.
To run the fastpath manually use:
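A sketch using the flags named in this section; the time-range flag names and values are assumptions, check fastpath -h for the actual spellings:

```
fastpath --no-write-to-db --ccs US,IT --testnames web_connectivity \
  --start-day 2023-01-01 --end-day 2023-01-07
```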
The --no-write-to-db option can be useful for testing. The --ccs and --testnames flags are useful to selectively reprocess measurements.
After reprocessing measurements it's recommended to manually deduplicate the contents of the fastpath table. See Fastpath deduplication
note it is possible to run multiple fastpath processes using https://www.gnu.org/software/parallel/ with different time ranges. Running the reprocessing under byobu is recommended.
The fastpath will pull Postcans from S3. See Feed fastpath from JSONL for possible speedup.
Fastpath monitoring
The fastpath pipeline can be monitored using the Fastpath dashboard and the API and fastpath dashboard.
Also follow real-time process using:
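For example (the unit name is an assumption based on the component name):

```
sudo journalctl -f --no-hostname -u fastpath
```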
Android probe release runbook
This runbook is meant to help coordinate Android probe releases between the probe and backend developers and public announcements. It does not contain detailed instructions for individual components.
Also see the Measurement drop runbook.
Roles: @probe, @backend, @media
Android pre-release
@probe: drive the process involving the other teams as needed. Create calendar events to track the next steps. Run the probe checklist https://docs.google.com/document/d/1S6X5DqVd8YzlBLRvMFa4RR6aGQs8HSXfz8oGkKoKwnA/edit
@backend: review https://jupyter.ooni.org/view/notebooks/jupycron/autorun_android_probe_release.html and https://grafana.ooni.org/d/l-MQSGonk/api-and-fastpath-multihost?orgId=1&refresh=5s&var-avgspan=8h&var-host=backend-fsn.ooni.org&from=now-30d&to=now for long-term trends
Android release
@probe: release the probe for early adopters
@backend: monitor https://jupyter.ooni.org/view/notebooks/jupycron/autorun_android_probe_release.html frequently during the first 24h and report any drop on Slack
@probe: wait at least 24h then release the probe for all users
@backend: monitor https://jupyter.ooni.org/view/notebooks/jupycron/autorun_android_probe_release.html daily for 14 days and report any drop on Slack
@probe: wait at least 24h then poke @media to announce the release
(Source: https://github.com/ooni/backend/wiki/Runbooks:-Android-Probe-Release)
CLI probe release runbook
This runbook is meant to help coordinate CLI probe releases between the probe and backend developers and public announcements. It does not contain detailed instructions for individual components.
Roles: @probe, @backend, @media
CLI pre-release
@probe: drive the process involving the other teams as needed. Create calendar events to track the next steps. Run the probe checklist and review the CI.
@backend: review [jupyter](https://jupyter.ooni.org/view/notebooks/jupycron/autorun_cli_probe_release.html) and [grafana](https://grafana.ooni.org/d/l-MQSGonk/api-and-fastpath-multihost?orgId=1&refresh=5s&var-avgspan=8h&var-host=backend-fsn.ooni.org&from=now-30d&to=now) for long-term trends
CLI release
@probe: release the probe for early adopters
@backend: monitor [jupyter](https://jupyter.ooni.org/view/notebooks/jupycron/autorun_cli_probe_release.html) frequently during the first 24h and report any drop on Slack
@probe: wait at least 24h then release the probe for all users
@backend: monitor [jupyter](https://jupyter.ooni.org/view/notebooks/jupycron/autorun_cli_probe_release.html) daily for 14 days and report any drop on Slack
@probe: wait at least 24h then poke @media to announce the release
Investigating heavy aggregation queries runbook
In the following scenario the Aggregation and MAT API is experiencing query timeouts impacting users.
Reproduce the issue by setting a large enough time span on the MAT, e.g.: https://explorer.ooni.org/chart/mat?test_name=web_connectivity&axis_x=measurement_start_day&since=2023-10-15&until=2023-11-15&time_grain=day
Click on the link to JSON, e.g. https://api.ooni.io/api/v1/aggregation?test_name=web_connectivity&axis_x=measurement_start_day&since=2023-01-01&until=2023-11-15&time_grain=day
Review the backend-fsn.ooni.org metrics on https://grafana.ooni.org/d/M1rOa7CWz/netdata?orgId=1&var-instance=backend-fsn.ooni.org:19999 (see Netdata-specific dashboard for details)
Also review the API and fastpath dashboard, looking at CPU load, disk I/O, query time, measurement flow.
Also see Aggregation cache monitoring
Refresh and review the charts on the ClickHouse queries notebook.
In this instance frequent calls to the aggregation API are found.
Review the summary of the API quotas. See Calling the API manually for details:
Log on backend-fsn.ooni.org and review the logs:
Summarize the subnets calling the API:
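A sketch of such a pipeline, assuming a combined-style access log with the client address in the first field; demonstrated here on inline sample data:

```shell
# Summarize calling /24 subnets by request count from an Nginx-style
# access log (client address in the first field). The heredoc is sample
# data; on the host, read e.g. /var/log/nginx/access.log instead.
awk '{ split($1, o, "."); print o[1] "." o[2] "." o[3] ".0/24" }' <<'EOF' | sort | uniq -c | sort -rn
203.0.113.5 - - [01/Nov/2023] "GET /api/v1/aggregation HTTP/1.1" 200
203.0.113.9 - - [01/Nov/2023] "GET /api/v1/aggregation HTTP/1.1" 200
198.51.100.7 - - [01/Nov/2023] "GET /api/v1/measurements HTTP/1.1" 200
EOF
```

The busiest subnets are printed first.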
To block IP addresses or subnets see Nginx or HaProxy, then configure the required file in Ansible and deploy.
Also see Limiting scraping.
Aggregation cache monitoring
To monitor cache hit/miss ratio using StatsD metrics the following script can be run as needed.
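The original StatsD-based script is not reproduced here. As a substitute sketch, assuming Nginx logs the $upstream_cache_status variable as the last field of each access log line, the hit/miss ratio can be estimated from the log (demonstrated on inline sample data):

```shell
# Count cache HIT vs MISS where the cache status is the last log field.
# The heredoc is sample data; on the host, tail the real access log.
awk '{ n[$NF]++ } END { for (s in n) print s, n[s] }' <<'EOF'
GET /api/v1/aggregation HIT
GET /api/v1/aggregation MISS
GET /api/v1/aggregation HIT
EOF
```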
See Metrics list.
Limiting scraping
Aggressive bots and scrapers can be limited using a combination of methods, listed below starting from the most user-friendly:
- Reduce the impact on the API (CPU, disk I/O, memory usage) by caching the results.
- Rate limiting and quotas already built into the API. It might need lowering of the quotas.
- Adding API entry points to Robots.txt
- Adding specific User-Agent entries to Robots.txt
- Blocking IP addresses or subnets in the Nginx or HaProxy configuration files
To add caching to the API or increase the expiration times:
- Identify API calls that cause significant load. Nginx is configured to log timing information for each HTTP request. See Logs investigation notebook for examples. Also see Logs from FSN notebook and ClickHouse instance for logs. Additionally, Aggregation cache monitoring can be tweaked for the present use-case.
- Implement caching or increase expiration times across the API codebase. See API cache and Purging Nginx cache.
- Monitor the improvement in terms of cache hit VS cache miss ratio.
important Caching can be applied selectively for API requests that return rapidly changing data VS old, stable data. See Aggregation and MAT for an example.
To update the quotas edit the API here https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/ooniapi/app.py#L187 and deploy as usual.
To update the robots.txt entry point see Robots.txt and edit the API here https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/ooniapi/pages/init.py#L124 and deploy as usual.
To block IP addresses or subnets see Nginx or HaProxy, then configure the required file in Ansible and deploy.
Calling the API manually
To make HTTP calls to the API manually you'll need to extract a JWT from the browser, sometimes with admin rights.
In Firefox, authenticate against https://test-lists.ooni.org/ , then open Inspect >> Storage >> Local Storage and find {"token": "<mytoken>"}.
Extract the token, an ASCII-encoded string, without braces or quotes.
Call the API using httpie with:
E.g.:
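A sketch (the entry point is illustrative and the token is a placeholder):

```
http https://api.ooni.io/api/_/account_metadata Authorization:'Bearer <mytoken>'
```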
note Do not leave whitespace after "Authorization:"
Build, deploy, rollback
Host deployments are done with the sysadmin repo
For component updates a deployment pipeline is used:
Look at the [Status dashboard](https://github.com/ooni/backend/wiki/Backend) - be aware of badge image caching
The deployer tool
Deployments can be performed with a tool that acts as a frontend for APT. It implements a simple Continuous Delivery workflow from CLI. It does not require running a centralized CD pipeline server (e.g. like https://www.gocd.org/)
The tool is hosted on the backend repository together with its configuration file for simplicity: https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/deployer
At start time it traverses the path from the current working directory back to root until it finds a configuration file named deployer.ini This allows using different deployment pipelines stored in configuration files across different repositories and subdirectories.
The tool connects to the hosts to perform deployments and requires sudo rights. It installs Debian packages from repositories already configured on the hosts.
It runs apt-get update and then apt-get install … to update or rollback packages. By design, it does not interfere with manual execution of apt-get or with tools like Ansible. This means operators can log on a host to do manual upgrade or rollback of packages without breaking the deployer tool.
The tool depends only on the python3-apt package.
Here is a configuration file example, with comments:
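The original example is not reproduced here; the following is a hypothetical sketch of what such a file could look like (section and key names are assumptions, not the actual format; consult the deployer source for the real one):

```ini
; deployer.ini -- hypothetical example
[environment]
; packages managed by this pipeline
deb_packages = fastpath ooni-api

[stages]
; ordered from least to most critical
1 = ams-pg-test.ooni.org
2 = backend-fsn.ooni.org
```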
When run without any argument, the tool connects to the hosts from the configuration file and prints a summary of the installed packages, for example:
A green arrow between two package versions indicates that the version on the left side is higher than the one on the right side. This means that a rollout is pending. In the example the fastpath package on the "prod" stage can be updated.
A red warning sign indicates that the version on the right side is higher than the one on the left side. During a typical continuous deployment workflow version numbers should always increment. The rollout should go from left to right, i.e. from the least critical stage to the most critical stage.
Deploy/rollback a given version on the "test" stage:
Deploy latest build on the first stage:
Deploy latest build on a given stage. This usage is not recommended as it deploys the latest build regardless of what is currently running on previous stages.
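Hypothetical invocations matching the three usages above (the exact CLI syntax and version string are assumptions):

```
./deployer deploy fastpath test 0.81    # given version on the "test" stage
./deployer deploy fastpath              # latest build on the first stage
./deployer deploy fastpath prod         # latest build on a given stage (not recommended)
```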
The deployer tool can also generate SVG badges that can then be served by Nginx or copied elsewhere to create a status dashboard.
Example:
Update all badges with:
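Hypothetically (the subcommand name is an assumption):

```
./deployer refresh_badges
```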
Adding new tests
This runbook describes how to add support for a new test in the Fastpath.
Review Backend code changes, then update fastpath core to add a scoring function.
See for example def score_torsf(msm: dict) -> dict:
Also add an if block to the def score_measurement(msm: dict) -> dict: function to call the newly created function.
Finish by adding a new test to the score_measurement function and adding relevant integration tests.
Run the integration tests locally.
Update the API if needed.
Deploy on ams-pg-test.ooni.org and run end-to-end tests using real probes.
Adding support for a new test key
This runbook describes how to modify the Fastpath and the API to extract, process, store and publish a new measurement field.
Start with adding a new column to the fastpath table by following Adding a new column to the fastpath.
Add the column to the local ClickHouse instance used for tests and to ams-pg-test.ooni.org.
Update https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/tests/integ/clickhouse_1_schema.sql as described in Continuous Deployment: Database schema changes
Add support for the new field in the fastpath core.py and db.py modules and related tests.
See https://github.com/ooni/backend/pull/682 for a comprehensive example.
Run tests locally, then open a draft pull request and ensure the CI tests are running successfully.
If needed, the current pull request can be reviewed and deployed without modifying the API to expose the new column. This allows processing data sooner while the API is still being worked on.
Add support for the new column in the API. The change depends on where and how the new value is to be published. See https://github.com/ooni/backend/commit/ae2097498ec4d6a271d8cdca9d68bd277a7ac19d#diff-4a1608b389874f2c35c64297e9c676dffafd49b9ba80e495a703ba51d2ebd2bbL359 for a generic example of updating an SQL query in the API and updating related tests.
Deploy the changes on test and pre-production stages after creating the new column in the database. See The deployer tool for details.
Perform end-to-end tests with real probes and Public and private web UIs as needed.
Complete the pull request and deploy to production.
Increasing the disk size on a dedicated host
Below are some notes on how to resize the disks when a new drive is added to our dedicated hosts:
Replicating MergeTree tables
Notes on how to go about converting a MergeTree family table to a replicated table, while minimizing downtime.
See the following links for more information:
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-converting-mergetree-to-replicated/
- https://clickhouse.com/docs/en/operations/system-tables/replicas
- https://clickhouse.com/docs/en/architecture/replication#verify-that-clickhouse-keeper-is-running
- https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings
Workflow
You should first create the replicated database cluster following the instructions at the clickhouse docs.
The ooni-devops repo has a role called oonidata_clickhouse that does that by using the idealista.clickhouse_role.
Once the cluster is created you can proceed with creating a DATABASE on the cluster by running:
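For example (the database and cluster names are assumptions matching the role described above):

```sql
CREATE DATABASE ooni ON CLUSTER oonidata_cluster
```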
There are now a few options to go about doing this:
- You just create the new replicated tables and perform a copy into the destination database by running on the source database the following:
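A sketch of the copy, with placeholder database names (run on the host holding the source database):

```sql
INSERT INTO dst_db.obs_web SELECT * FROM src_db.obs_web
```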
This will require duplicating the data and might not be feasible.
- If you already have all the data set up on one host and you just want to convert the database into a replicated one, you can do the following:
We assume there are 2 tables: obs_web_bak (the source table) and obs_web (the destination table). We also assume a single shard and multiple replicas.
First create the destination replicated table. To retrieve the table create query you can run:
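For example:

```sql
SHOW CREATE TABLE obs_web_bak
```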
You should then modify the table definition to make use of the ReplicatedReplacingMergeTree engine:
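A sketch of the modified engine clause; the macro-based zookeeper path follows the ClickHouse replication docs, and the exact path layout is a deployment choice:

```sql
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/{database}/obs_web',
    '{replica}'
)
```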
Check all the partitions that exist for the source table and produce ALTER queries to map them from the source to the destination:
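The partitions can be listed from the system tables, for example:

```sql
SELECT DISTINCT partition
FROM system.parts
WHERE table = 'obs_web_bak' AND active
```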
While you are running the following, you should stop all merges by running:
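For example:

```sql
SYSTEM STOP MERGES
```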
This can then be scripted like so:
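A sketch of such a script, using ATTACH PARTITION ID to move each partition from the source to the destination table (using the partition ID avoids quoting issues with string partition keys):

```shell
# Generate one ATTACH ... FROM query per active partition of the
# source table and run it against the destination table.
clickhouse-client -q "SELECT DISTINCT partition_id FROM system.parts WHERE table = 'obs_web_bak' AND active" |
while read -r pid; do
    clickhouse-client -q "ALTER TABLE obs_web ATTACH PARTITION ID '$pid' FROM obs_web_bak"
done
```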
You will now have a replicated table existing on one of the replicas.
Then, for each other replica in the set, you shall manually create the table, this time passing in the zookeeper path explicitly.
You can get the zookeeper path by running the following on the first replica you have set up:
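For example:

```sql
SELECT zookeeper_path FROM system.replicas WHERE table = 'obs_web'
```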
For each replica you will then have to create the tables like so:
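A sketch, passing the zookeeper path retrieved above explicitly instead of via macros (the path and column list are placeholders):

```sql
CREATE TABLE obs_web (
    -- same columns as on the first replica
)
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/1/default/obs_web',  -- zookeeper path from system.replicas
    '{replica}'
)
-- plus the same ORDER BY / PARTITION BY clauses as the original table
```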
You will then have to manually copy the data over to the destination replica from the source.
The data lives inside /var/lib/clickhouse/data/{database_name}/{table_name}
Once the data has been copied over you should now have replicated the data and you can resume merges on all databases by running:
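For example, on each replica:

```sql
SYSTEM START MERGES
```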