OONI Data Pipeline v5
This is the fifth major iteration of the OONI Data Pipeline.
For historical context, these are the major revisions:
v0
- The “pipeline” is basically just writing the raw JSON files into a public www directory. Used until ~2013.
v1
- OONI Pipeline based on custom CLI scripts using MongoDB as a backend. Used until ~2015.
v2
- OONI Pipeline based on luigi. Used until ~2017.
v3
- OONI Pipeline based on airflow. Used until ~2020.
v4
- OONI Pipeline based on custom scripts and systemd units (aka fastpath). Currently in use in production.
v5
- Next generation OONI Pipeline. This is the version this README covers. Expected to be in production by Q4 2024.
Setup
To run the pipeline, you should set up the following dependencies:
Quick start
Start clickhouse server:
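Once the server is started, it can be handy to confirm it is reachable before scheduling anything. The following is a minimal sketch, assuming ClickHouse is listening on its default HTTP port (8123) on localhost; the helper name `clickhouse_is_up` is hypothetical and not part of this repo.

```python
import urllib.request


def clickhouse_is_up(url: str = "http://localhost:8123/ping", timeout: float = 2.0) -> bool:
    """Return True if the ClickHouse HTTP interface answers its ping endpoint.

    Assumption: the server uses the default HTTP port 8123 on localhost.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # A healthy server replies with HTTP 200 and the body "Ok."
            return resp.status == 200 and resp.read().strip() == b"Ok."
    except OSError:
        # Connection refused, timeout, or HTTP error: treat as not ready.
        return False


if __name__ == "__main__":
    print("clickhouse reachable:", clickhouse_is_up())
```

If your server listens on a different host or port, pass the full ping URL as the first argument.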
Workflows are started by first scheduling them and then triggering a backfill operation on them. When they are scheduled they will also run on a daily basis.
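To illustrate what a backfill over a date range amounts to, here is a small conceptual sketch. It is not the pipeline's actual API: the function name `backfill_days` is hypothetical, and the half-open interval (start included, end excluded) is an assumption made for the example.

```python
from datetime import date, timedelta
from typing import Iterator


def backfill_days(start: date, end: date) -> Iterator[date]:
    """Yield each day in [start, end), one per daily workflow run.

    Assumption: the end date is exclusive.
    """
    day = start
    while day < end:
        yield day
        day += timedelta(days=1)


if __name__ == "__main__":
    # Each yielded day stands for one daily run that a backfill would trigger.
    for day in backfill_days(date(2024, 1, 1), date(2024, 1, 4)):
        print("would process measurements for", day.isoformat())
```

A scheduled workflow then continues the same pattern going forward, with one run per day.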
You can then trigger the run operation like so:
Production usage
In production it’s recommended that you trigger the workflows using airflow.
You can find the airflow DAGs in the dags folder at the root of this repo.