# OONI Data Pipeline
This is the fifth major iteration of the OONI Data Pipeline.
For historical context, these are the major revisions:
- v0: The “pipeline” is basically just writing the RAW json files into a public www directory. Used until ~2013.
- v1: OONI Pipeline based on custom CLI scripts using mongodb as a backend. Used until ~2015.
- v2: OONI Pipeline based on luigi. Used until ~2017.
- v3: OONI Pipeline based on airflow. Used until ~2020.
- v4: OONI Pipeline based on custom scripts and systemd units (aka fastpath). Currently in use in production.
- v5: Next generation OONI Pipeline. What this readme is relevant to. Expected to be in production by Q4 2024.
## Setup
In order to run the pipeline you should set up the following dependencies:

- Temporal (the `temporal` CLI for the dev server, or a production Temporal cluster)
- Clickhouse
## Quick start
Start temporal dev server:
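For example, using the Temporal CLI:

```sh
# Start a local Temporal dev server; its web UI listens on http://localhost:8233 by default.
temporal server start-dev
```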
Start clickhouse server:
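For example, assuming a local clickhouse installation (running it in a Docker container works too):

```sh
# Run a standalone clickhouse server in the foreground,
# listening on the default ports (8123 for HTTP, 9000 for the native protocol).
clickhouse server
```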
Workflows are started by first scheduling them and then triggering a backfill operation on them. When they are scheduled they will also run on a daily basis.
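As a sketch of what scheduling might look like (the exact subcommand and flag names are assumptions; check the CLI's `--help` output for the real interface):

```sh
# Create a daily schedule, optionally limited to a given country and test name.
oonipipeline schedule --probe-cc US --test-name signal
```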
You can then trigger the backfill operation like so:
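For example (again treating the flag names as assumptions rather than the definitive interface):

```sh
# Reprocess data for a given date range against the schedule created above.
oonipipeline backfill --workflow-name observations \
  --start-at 2024-01-01 --end-at 2024-02-01
```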
You will then need some workers to actually perform the task you backfilled; these can be started like so:
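A sketch, assuming the CLI exposes a worker-starting subcommand along these lines:

```sh
# Start worker processes that poll the Temporal task queues and execute the workflow activities.
oonipipeline startworkers
```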
Monitor the workflow execution by accessing: http://localhost:8233/
## Production usage
By default we use thread based parallelism, but in production you really want multiple worker processes, each running multiple threads.
You should also be using the production temporal server with an elasticsearch backend as opposed to the dev server.
To start all the server side components, we have a handy docker-compose.yml that sets everything up.
It can be started by running the following from this directory:
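For example, running it in the background:

```sh
# Bring up all the services defined in docker-compose.yml, detached.
docker compose up -d
```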
The important services you can access are the following:
- Temporal UI: http://localhost:8080
- Superset UI: http://localhost:8083 (u: `admin`, p: `oonity`)
- OpenTelemetry UI: http://localhost:8088
By design, we don't include a clickhouse instance inside of the docker-compose file: it's recommended that you set it up separately, outside of docker.
It's possible to change the behaviour of the pipeline by passing an optional `CONFIG_FILE` environment variable that points to a TOML config file.
For all the supported options check `src/oonipipeline/settings.py`.
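As an illustrative sketch (the setting name below is hypothetical; the real option names are defined in `src/oonipipeline/settings.py`):

```sh
# Hypothetical config file: the keys must match the options defined in settings.py.
cat > pipeline-config.toml <<'EOF'
clickhouse_url = "clickhouse://localhost:9000"
EOF

# Point the pipeline at it via the CONFIG_FILE environment variable.
CONFIG_FILE=./pipeline-config.toml oonipipeline startworkers
```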
To start the worker processes:
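For example (the subcommand name is an assumption; in production you would typically run several of these processes under a supervisor such as systemd or in containers):

```sh
# Start a worker process; each worker uses thread based parallelism internally.
oonipipeline startworkers
```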
You should then schedule the observation generation, so that it will run every day.
This can be accomplished by running:
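A sketch of what this could look like (the way the workflow type is selected is an assumption):

```sh
# Schedule the daily observation generation workflow.
oonipipeline schedule observations
```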
If you would like to also schedule the analysis, you should do:
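Along the same hypothetical lines:

```sh
# Additionally schedule the daily analysis workflow.
oonipipeline schedule analysis
```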
These schedules can be further refined with the `--probe-cc` and `--test-name` options to limit the schedule to only certain countries or test names respectively.
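For example, to limit observation generation to Italian web_connectivity measurements (with the same caveats as above about the exact invocation):

```sh
oonipipeline schedule observations --probe-cc IT --test-name web_connectivity
```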
By navigating to the temporal web UI you will see these schedules being created in the schedules section.
You are then able to trigger a backfill (basically reprocessing the data) by running the following command:
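A sketch, with the same caveats as above about the exact flag names:

```sh
# Reprocess observations for the given date range.
oonipipeline backfill --workflow-name observations \
  --start-at 2024-01-01 --end-at 2024-02-01
```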
## Superset
Superset is a neat data viz platform.
In order to set it up to speak to your clickhouse instance, assuming it's listening on localhost on the docker host, you should:
- Click Settings -> Data -> Database Connections
- Click + Database
- In the Supported Databases drop down pick “Clickhouse Connect”
- Enter `host.docker.internal` as Host and `8123` as port
Note: `host.docker.internal` only works reliably on Windows, macOS and very
recent Linux+docker versions. On Linux the needed configuration is a bit more
complex and requires discovering the gateway IP through which containers reach
the host, adjusting the clickhouse setup to bind to that IP and setting up the
correct nft or similar firewall rules.
- Click connect
- Go to datasets and click + Dataset
- Add all the tables from the clickhouse database in the `default` schema. Recommended tables to add are `obs_web` and `measurement_experiment_result`.
- You are now able to start building dashboards
For more information on Superset usage and setup refer to their documentation.