Operations
This section contains howtos and runbooks on how to manage and update the backend.
Build, deploy, rollback
Host deployments are done with the sysadmin repo
For component updates a deployment pipeline is used:
Look at the [Status dashboard](https://github.com/ooni/backend/wiki/Backend) - be aware of badge image caching
The deployer tool
Deployments can be performed with a tool that acts as a frontend for APT. It implements a simple Continuous Delivery workflow from the CLI and does not require running a centralized CD pipeline server (e.g. https://www.gocd.org/).
The tool is hosted on the backend repository together with its configuration file for simplicity: https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/deployer
At start time it traverses the path from the current working directory back to the root until it finds a configuration file named deployer.ini. This allows using different deployment pipelines stored in configuration files across different repositories and subdirectories.
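The lookup described above can be sketched as follows. This is a minimal illustration, not the deployer's actual code: the function name `find_config` is hypothetical, and only the file name `deployer.ini` comes from the text.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "deployer.ini"  # file name taken from the text above


def find_config(start: Path, name: str = CONFIG_NAME) -> Optional[Path]:
    """Return the closest config file at or above `start`, or None."""
    start = start.resolve()
    # Path.parents lists every ancestor directory up to the filesystem root.
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    print(find_config(Path.cwd()))
```

Because the closest file wins, a subdirectory can carry its own pipeline definition that shadows one stored higher up in the same repository.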
The tool connects to the hosts to perform deployments and requires sudo rights. It installs Debian packages from repositories already configured on the hosts.
It runs `apt-get update` and then `apt-get install …` to update or roll back packages. By design, it does not interfere with manual execution of apt-get or with tools like Ansible. This means operators can log on to a host to manually upgrade or roll back packages without breaking the deployer tool.
The tool depends only on the `python3-apt` package.
Here is a configuration file example, with comments:
By running the tool without any argument it will connect to the hosts from the configuration file and print a summary of the installed packages, for example:
A green arrow between two package versions indicates that the version on the left side is higher than the one on the right side. This means that a rollout is pending. In the example the fastpath package on the "prod" stage can be updated.
A red warning sign indicates that the version on the right side is higher than the one on the left side. During a typical continuous deployment workflow version numbers should always increment. The rollout should go from left to right, that is, from the least critical stage to the most critical stage.
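The arrow and warning logic can be illustrated with a small sketch. It is a simplification under stated assumptions: versions are treated as dotted integers, whereas the real tool depends on python3-apt and would presumably use full Debian version ordering. The function names and the status labels are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a simplified dotted version such as '0.81' into integers."""
    return tuple(int(part) for part in version.split("."))


def rollout_status(left: str, right: str) -> str:
    """Compare the versions on two adjacent stages (left is less critical)."""
    lv, rv = parse_version(left), parse_version(right)
    if lv > rv:
        return "pending"  # green arrow: the right stage can be updated
    if lv < rv:
        return "warning"  # red sign: the more critical stage is ahead
    return "in-sync"
```

For example, `rollout_status("0.81", "0.77")` reports a pending rollout, while swapping the arguments reports a warning.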
Deploy/rollback a given version on the "test" stage:
Deploy latest build on the first stage:
Deploy latest build on a given stage. This usage is not recommended as it deploys the latest build regardless of what is currently running on previous stages.
The deployer tool can also generate SVG badges that can then be served by Nginx or copied elsewhere to create a status dashboard.
Example:
Update all badges with:
Adding new tests
This runbook describes how to add support for a new test in the Fastpath.
Review Backend code changes, then update fastpath core to add a scoring function.
See for example `def score_torsf(msm: dict) -> dict:`
Also add an `if` block to the `def score_measurement(msm: dict) -> dict:` function to call the newly created function.
Finish by adding a new test to the `score_measurement` function and adding relevant integration tests.
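The shape of such a change can be sketched as below. Only the function signatures come from the text. The test name `exampletest`, the `success` test key and the `blocking_general` score field are assumptions made for illustration, and the real `score_measurement` contains branches for the existing tests.

```python
def score_exampletest(msm: dict) -> dict:
    """Hypothetical scoring function for a made-up test named 'exampletest'."""
    scores = {"blocking_general": 0.0}
    test_keys = msm.get("test_keys") or {}
    # Assumption: the test reports a boolean 'success' test key.
    if test_keys.get("success") is False:
        scores["blocking_general"] = 1.0
    return scores


def score_measurement(msm: dict) -> dict:
    """Dispatch a measurement to the scoring function for its test name."""
    test_name = msm.get("test_name")
    if test_name == "exampletest":
        return score_exampletest(msm)
    # ... branches for the existing tests would go here ...
    return {}
```

Keeping one small function per test makes it straightforward to cover each new test with its own unit test before running the integration tests.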
Run the integration tests locally.
Update the API if needed.
Deploy on ams-pg-test.ooni.org and run end-to-end tests using real probes.
Adding support for a new test key
This runbook describes how to modify the Fastpath and the API to extract, process, store and publish a new measurement field.
Start with adding a new column to the fastpath table by following Adding a new column to the fastpath.
Add the column to the local ClickHouse instance used for tests and ams-pg-test.ooni.org.
Update https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/tests/integ/clickhouse_1_schema.sql as described in Continuous Deployment: Database schema changes.
Add support for the new field in the fastpath `core.py` and `db.py` modules and related tests.
See https://github.com/ooni/backend/pull/682 for a comprehensive example.
Run tests locally, then open a draft pull request and ensure the CI tests are running successfully.
If needed, the current pull request can be reviewed and deployed without modifying the API to expose the new column. This allows processing data sooner while the API is still being worked on.
Add support for the new column in the API. The change depends on where and how the new value is to be published. See https://github.com/ooni/backend/commit/ae2097498ec4d6a271d8cdca9d68bd277a7ac19d#diff-4a1608b389874f2c35c64297e9c676dffafd49b9ba80e495a703ba51d2ebd2bbL359 for a generic example of updating an SQL query in the API and updating related tests.
Deploy the changes on test and pre-production stages after creating the new column in the database. See The deployer tool for details.
Perform end-to-end tests with real probes and Public and private web UIs as needed.
Complete the pull request and deploy to production.
Adding new fingerprints
This is performed on https://github.com/ooni/blocking-fingerprints
Updates are fetched automatically by Fingerprint updater.
Also see Fingerprint updater dashboard.
Backend code changes
This runbook describes making changes to backend components and deploying them.
Summary of the steps:
- Check out the backend repository.
- Create a dedicated branch.
- Update `debian/changelog` in the component you want to modify. See Package versioning for details.
- Run unit/functional/integ tests as needed.
- Create a pull request.
- Ensure the CI workflows are successful.
- Deploy the package on the testbed ams-pg-test.ooni.org and verify the change works as intended.
- Add a comment to the PR with the deployed version and stage.
- Wait for the PR to be approved.
- Deploy the package to production on backend-fsn.ooni.org. Ensure it is the same version that has been used on the testbed. See API runbook for deployment steps.
- Add a comment to the PR with the deployed version and stage, then merge the PR.
When introducing new metrics:
- Create Grafana dashboards, alerts and Jupyter Notebook and link them in the PR.
- Collect and analyze metrics and logs from the testbed stages before deploying to production.
- Test alarming by simulating incidents.
Backend component deployment
This runbook provides general steps to deploy backend components on production hosts.
Review the package changelog and the related pull request.
The amount of testing and monitoring required depends on:
- the impact of possible bugs in terms of number of users affected and consequences
- the level of risk involved in rolling back the change, if needed
- the complexity of the change and the risk of unforeseen impact
Monitor the API and fastpath dashboard and any dedicated dashboards. Review the past weeks for any anomaly before starting a deployment.
Ensure that either the database schema is consistent with the new deployment by creating tables and columns manually, or that the new codebase is automatically updating the database.
Quickly check past logs.
Follow logs with:
While monitoring the logs, deploy the package using The deployer tool (see that subsection for details).
API runbook
This runbook describes making changes to the API and deploying it.
Follow Backend code changes and Backend component deployment.
In addition, monitor logs from Nginx and API focusing on HTTP errors and failing SQL queries.
Manually check Explorer and other Public and private web UIs as needed.
Managing feature flags
To change feature flags in the API a simple pull request like https://github.com/ooni/backend/pull/776 is enough.
Follow Backend code changes and deploy it after basic testing on ams-pg-test.ooni.org.
Running database queries
This subsection describes how to run queries against ClickHouse. You can run queries from Jupyter Notebook or from the CLI:
Prefer using the default user when possible. To log in as admin:
Note: Heavy queries can impact the production database. When in doubt run them from the CLI so that they can be terminated with CTRL-C if needed.
Warning: ClickHouse is not transactional! Always test queries that mutate schemas or data on testbeds like ams-pg-test.ooni.org
For long running queries see the use of timeouts in Fastpath deduplication.
Also see Dropping tables, Investigating table sizes.
Modifying the fastpath table
This runbook shows an example of changing the contents of the fastpath table by running a "mutation" query.
Warning: This method creates changes that cannot be reproduced by external researchers by Reprocessing measurements. See Reproducibility.
In this example Signal test measurements are being flagged as failed due to https://github.com/ooni/probe/issues/2627
Summarize affected measurements with:
Important: `ALTER TABLE … UPDATE` starts a mutation that runs in the background.
Check for any running or stuck mutation:
Start the mutation:
Run the previous `SELECT` queries to monitor the mutation and its outcome.
Updating tor targets
See Tor targets for a general description.
Review the Ansible chapter. Check out the repository and update the file ansible/roles/ooni-backend/templates/tor_targets.json
Commit the changes and deploy as usual:
Test the updated configuration, then:
git-push the changes.
Implements Document Tor targets
Creating admin API accounts
See Auth for a description of the API entry points related to account management.
The API provides entry points to:
The latter is implemented here.
Important: The default role for API accounts is `user`. For such accounts there is no need for a record in the `accounts` table.
To change roles it is required to be authenticated and have the `admin` role.
It is also possible to create or update roles by running SQL queries directly on ClickHouse. This can be necessary to create the initial `admin` account on a new deployment stage.
A quick way to identify the account ID of a user is to extract logs from the API, either from the backend host or using Logs from FSN notebook.
Example output:
Then on the database test host:
Then in the ClickHouse shell insert a record to give the `admin` role to the user. See Running database queries:
`accounts` is an EmbeddedRocksDB table with `account_id` as primary key. No record deduplication is necessary.
To access the new role the user has to log out from web UIs and log in again.
Important: Account IDs are not the same across test and production instances. This is due to the use of a configuration variable `ACCOUNT_ID_HASHING_KEY` in the hashing of the email address. The parameter is read from the API configuration file. The values are different across deployment stages as a security feature.
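The effect of a per-stage hashing key can be illustrated with a short sketch. The hash function, digest size and email normalization used by the API are not stated here, so the use of keyed BLAKE2b below is an assumption, and `hash_email` is a hypothetical name.

```python
import hashlib


def hash_email(email: str, hashing_key: bytes) -> str:
    """Derive a stable account ID from an email address and a secret key."""
    # Assumption: the address is normalized before hashing.
    normalized = email.strip().lower().encode()
    # blake2b supports keyed hashing natively; the key can be up to 64 bytes.
    return hashlib.blake2b(normalized, key=hashing_key, digest_size=16).hexdigest()
```

The same address hashed with two different keys yields unrelated IDs, which is why an account ID found in the logs of one stage cannot be reused on another stage.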
Fastpath runbook
Fastpath code changes and deployment
Review Backend code changes and Backend component deployment for changes and deployment of the backend stack in general.
Also see Modifying the fastpath table.
In addition, monitor logs and Grafana dashboards focusing on changes in incoming measurements.
You can use The deployer tool to perform deployments and rollbacks of the Fastpath.
Important: the fastpath is configured not to restart automatically during deployment.
Always monitor logs and restart it as needed:
Fastpath manual deployment
Sometimes it can be useful to run APT directly:
Reprocessing measurements
Reprocess old measurements by running the fastpath manually. This can be done without shutting down the fastpath instance running on live measurements.
You can run the fastpath as root or using the fastpath user. Both users are able to read the configuration file under `/etc/ooni`. The fastpath will download Postcans in the local directory.
`fastpath -h` generates:
To run the fastpath manually use:
The `--no-write-to-db` option can be useful for testing.
The `--ccs` and `--testnames` flags are useful to selectively reprocess measurements.
After reprocessing measurements it's recommended to manually deduplicate the contents of the `fastpath` table. See Fastpath deduplication.
Note: it is possible to run multiple `fastpath` processes using https://www.gnu.org/software/parallel/ with different time ranges. Running the reprocessing under `byobu` is recommended.
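When splitting a reprocessing job across several processes, the time ranges have to cover the whole period without overlapping. A minimal sketch for generating such ranges follows. The helper name `split_range` is hypothetical, and treating the end of each window as exclusive is an assumption that should be checked against the fastpath options before use.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def split_range(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if days < 1:
        raise ValueError("days must be positive")
    cursor = start
    while cursor < end:
        # The last window is shortened so it never goes past `end`.
        window_end = min(cursor + timedelta(days=days), end)
        yield cursor, window_end
        cursor = window_end
```

Each pair can then be turned into one command line and handed to GNU parallel.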
The fastpath will pull Postcans from S3. See Feed fastpath from JSONL for possible speedup.
Fastpath monitoring
The fastpath pipeline can be monitored using the Fastpath dashboard and the API and fastpath dashboard.
Also follow real-time process using:
Android probe release runbook
This runbook is meant to help coordinate Android probe releases between the probe and backend developers and public announcements. It does not contain detailed instructions for individual components.
Also see the Measurement drop runbook.
Roles: @probe, @backend, @media
Android pre-release
@probe: drive the process involving the other teams as needed. Create calendar events to track the next steps. Run the probe checklist https://docs.google.com/document/d/1S6X5DqVd8YzlBLRvMFa4RR6aGQs8HSXfz8oGkKoKwnA/edit
@backend: review https://jupyter.ooni.org/view/notebooks/jupycron/autorun_android_probe_release.html and https://grafana.ooni.org/d/l-MQSGonk/api-and-fastpath-multihost?orgId=1&refresh=5s&var-avgspan=8h&var-host=backend-fsn.ooni.org&from=now-30d&to=now for long-term trends
Android release
@probe: release the probe for early adopters
@backend: monitor https://jupyter.ooni.org/view/notebooks/jupycron/autorun_android_probe_release.html frequently during the first 24h and report any drop on Slack
@probe: wait at least 24h then release the probe for all users
@backend: monitor https://jupyter.ooni.org/view/notebooks/jupycron/autorun_android_probe_release.html daily for 14 days and report any drop on Slack
@probe: wait at least 24h then poke @media to announce the release
(https://github.com/ooni/backend/wiki/Runbooks:-Android-Probe-Release)
CLI probe release runbook
This runbook is meant to help coordinate CLI probe releases between the probe and backend developers and public announcements. It does not contain detailed instructions for individual components.
Roles: @probe, @backend, @media
CLI pre-release
@probe: drive the process involving the other teams as needed. Create calendar events to track the next steps. Run the probe checklist and review the CI.
@backend: review [jupyter](https://jupyter.ooni.org/view/notebooks/jupycron/autorun_cli_probe_release.html) and [grafana](https://grafana.ooni.org/d/l-MQSGonk/api-and-fastpath-multihost?orgId=1&refresh=5s&var-avgspan=8h&var-host=backend-fsn.ooni.org&from=now-30d&to=now) for long-term trends
CLI release
@probe: release the probe for early adopters
@backend: monitor [jupyter](https://jupyter.ooni.org/view/notebooks/jupycron/autorun_cli_probe_release.html) frequently during the first 24h and report any drop on Slack
@probe: wait at least 24h then release the probe for all users
@backend: monitor [jupyter](https://jupyter.ooni.org/view/notebooks/jupycron/autorun_cli_probe_release.html) daily for 14 days and report any drop on Slack
@probe: wait at least 24h then poke @media to announce the release
Investigating heavy aggregation queries runbook
In the following scenario the Aggregation and MAT API is experiencing query timeouts impacting users.
Reproduce the issue by setting a large enough time span on the MAT, e.g.: https://explorer.ooni.org/chart/mat?test_name=web_connectivity&axis_x=measurement_start_day&since=2023-10-15&until=2023-11-15&time_grain=day
Click on the link to JSON, e.g. https://api.ooni.io/api/v1/aggregation?test_name=web_connectivity&axis_x=measurement_start_day&since=2023-01-01&until=2023-11-15&time_grain=day
Review the backend-fsn.ooni.org metrics on https://grafana.ooni.org/d/M1rOa7CWz/netdata?orgId=1&var-instance=backend-fsn.ooni.org:19999 (see Netdata-specific dashboard for details)
Also review the API and fastpath dashboard, looking at CPU load, disk I/O, query time, measurement flow.
Also see Aggregation cache monitoring.
Refresh and review the charts on the ClickHouse queries notebook.
In this instance frequent calls to the aggregation API are found.
Review the summary of the API quotas. See Calling the API manually for details:
Log on backend-fsn.ooni.org and review the logs:
Summarize the subnets calling the API:
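As an illustration of this kind of summary, the sketch below counts requests per IPv4 subnet. It assumes a hypothetical log format where the client address is the first whitespace-separated field, and `summarize_subnets` is a made-up helper, not a script shipped with the backend.

```python
import ipaddress
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_subnets(
    lines: Iterable[str], prefix: int = 24, top: int = 10
) -> List[Tuple[str, int]]:
    """Count requests per IPv4 subnet, assuming the client IP is the first field."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not an IP address: skip the line
        if addr.version != 4:
            continue  # keep the sketch IPv4-only
        network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts.most_common(top)
```

Grouping by subnet rather than by single address helps spot scrapers that rotate addresses within the same range.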
To block IP addresses or subnets see Nginx or HaProxy, then configure the required file in Ansible and deploy.
Also see Limiting scraping.
Aggregation cache monitoring
To monitor cache hit/miss ratio using StatsD metrics the following script can be run as needed.
See Metrics list.
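Computing a hit/miss ratio from StatsD counters can be sketched as follows. The metric names are hypothetical placeholders to be replaced with the ones from the metrics list, and the sketch ignores sample rates.

```python
from typing import Iterable, Optional

# Hypothetical metric names; see the metrics list for the real ones.
HIT_METRIC = "example.cache.hit"
MISS_METRIC = "example.cache.miss"


def hit_ratio(lines: Iterable[str]) -> Optional[float]:
    """Return hits / (hits + misses) from StatsD counter lines, or None."""
    totals = {HIT_METRIC: 0.0, MISS_METRIC: 0.0}
    for line in lines:
        # StatsD wire format: <name>:<value>|<type>[|@<sample rate>]
        name, _, rest = line.strip().partition(":")
        parts = rest.split("|")
        if name not in totals or len(parts) < 2 or parts[1] != "c":
            continue  # unrelated metric or not a counter
        try:
            totals[name] += float(parts[0])
        except ValueError:
            continue  # malformed value
    total = totals[HIT_METRIC] + totals[MISS_METRIC]
    return totals[HIT_METRIC] / total if total else None
```

A ratio that stays low after a caching change suggests that the cache keys vary too much or that the expiration time is too short.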
Limiting scraping
Aggressive bots and scrapers can be limited using a combination of methods, listed below starting from the most user-friendly:
- Reduce the impact on the API (CPU, disk I/O, memory usage) by caching the results.
- Rate limiting and quotas already built into the API. The quotas might need lowering.
- Adding API entry points to Robots.txt
- Adding specific `User-Agent` entries to Robots.txt
- Blocking IP addresses or subnets in the Nginx or HaProxy configuration files
To add caching to the API or increase the expiration times:
- Identify API calls that cause significant load. Nginx is configured to log timing information for each HTTP request. See Logs investigation notebook for examples. Also see Logs from FSN notebook and ClickHouse instance for logs. Additionally, Aggregation cache monitoring can be tweaked for the present use-case.
- Implement caching or increase expiration times across the API codebase. See API cache and Purging Nginx cache.
- Monitor the improvement in terms of cache hit vs cache miss ratio.
Important: Caching can be applied selectively for API requests that return rapidly changing data vs old, stable data. See Aggregation and MAT for an example.
To update the quotas edit the API here https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/ooniapi/app.py#L187 and deploy as usual.
To update the `robots.txt` entry point see Robots.txt and edit the API here https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/ooniapi/pages/init.py#L124 and deploy as usual.
To block IP addresses or subnets see Nginx or HaProxy, then configure the required file in Ansible and deploy.
Calling the API manually
To make HTTP calls to the API manually you'll need to extract a JWT from the browser, sometimes with admin rights.
In Firefox, authenticate against https://test-lists.ooni.org/ , then open Inspect >> Storage >> Local Storage and find `{"token": "<mytoken>"}`
Extract the token, an ASCII-encoded string, without braces or quotes.
Call the API using httpie with:
E.g.:
Note: Do not leave whitespace after "Authorization:"
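For scripted calls, the same request can be prepared with the Python standard library instead of httpie. This is a sketch under assumptions: the `Bearer` scheme in the Authorization header is assumed rather than confirmed by the text, and `build_request` is a hypothetical helper.

```python
import urllib.request


def build_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying the JWT in the Authorization header."""
    # Assumption: the API expects the "Bearer <token>" scheme.
    headers = {"Authorization": f"Bearer {token.strip()}"}
    return urllib.request.Request(url, headers=headers, method="GET")
```

The returned object can be passed to `urllib.request.urlopen()` to perform the call. Stripping the token guards against whitespace picked up while copying it from the browser.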
Debian packages
This section lists the Debian packages used to deploy backend components. They are built by GitHub CI workflows and deployed using The deployer tool. See Debian package build and publish.
ooni-api package
Debian package for the API
fastpath package
Debian package for the Fastpath
detector package
Debian package for the Social media blocking event detector
analysis package
The `analysis` Debian package contains various tools and runs various systemd timers, see Systemd timers.
Analysis deployment
See Backend component deployment