System Architecture
The backend infrastructure performs multiple functions:
- Provide APIs for data consumers
- Instruct probes on what measurements to perform
- Receive measurements from probes, process them and store them in the database
- Upload new measurements to the S3 data bucket 💡
- Fetch data from external sources, e.g. fingerprints from a GitHub repository
Main data flows
This diagram represents the main flow of measurement data.
The rectangles represent processes. The ellipses represent data at rest: files on disk, files on S3, or records in database tables. Click on the image and then click on each shape to see the related documentation.
Probes submit measurements to the API with a POST at the following path: https://api.ooni.io/apidocs/#/default/post_report__report_id_ The measurement is optionally decompressed if zstd compression is detected. It is then parsed, assigned a unique ID and saved to disk. Very little validation is done at this time in order to ensure that all incoming measurements are accepted.
Measurements are enqueued on disk using one file per measurement. On hourly intervals they are batched together, compressed and uploaded to S3 by the Measurement uploader ⚙. The batching is performed to allow efficient compression. See the dedicated subchapter ⚙ for details.
The measurement is also sent to the Fastpath ⚙. The Fastpath runs as a dedicated daemon with a pool of workers. It calculates scoring for the measurement and writes a record in the fastpath table. Each measurement is processed individually in real time. See the dedicated subchapter ⚙ below.
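A toy version of per-measurement scoring might look like the following. The real fastpath logic is far richer; the field names and the `blocking` heuristic here are assumptions made for illustration only.

```python
def score_measurement(msmt: dict) -> dict:
    """Compute a fastpath-style scoring record for one measurement.
    Each measurement is scored independently, in real time."""
    record = {
        "measurement_uid": msmt.get("measurement_uid"),
        "anomaly": False,
        "confirmed": False,
        "msm_failure": False,
    }
    blocking = (msmt.get("test_keys") or {}).get("blocking")
    if blocking is None:
        # The probe could not complete the test.
        record["msm_failure"] = True
    elif blocking is not False:
        # The probe reported some form of blocking.
        record["anomaly"] = True
    return record
```

Because each record is computed from a single measurement, a pool of workers can score incoming measurements in parallel with no coordination.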
The disk queue is also used by the API to access recent measurements that have not been uploaded to S3 yet. See the measurement API 🐝 for details.
Reproducibility
The measurement processing pipeline is meant to generate outputs that can be equally generated by 3rd parties like external researchers and other organizations.
This is meant to keep OONI accountable and as a proof that we do not arbitrarily delete or alter measurements and that we score them as accessible/anomaly/confirmed/failure in a predictable and transparent way.
important The only exceptions were due to privacy breaches that required removal of the affected measurements from the S3 data bucket 💡.
As such, the backend infrastructure is FOSS and can be deployed by 3rd parties. We encourage researchers to replicate our findings.
Incoming measurements are minimally altered by the Measurement uploader ⚙ and uploaded to S3.
Supply chain management
The backend implements software supply chain management by installing components and libraries from the Debian Stable archive.
This provides the following benefits:
- Receiving targeted security updates from the OS without introducing breaking changes.
- Protection against supply chain attacks.
- Reproducible deployment for internal use (e.g. testbeds and new backend hosts) and for external researchers.
warning ClickHouse ⚙ is installed from a 3rd party archive and receives limited security support. As a mitigation, an LTS version is used, although its support window is only one year.
The backend received a security assessment from Cure53 https://cure53.de/ in 2022.
Low-criticality hosts e.g. the monitoring stack also have components installed from 3rd party archives.