Skip to content

OONI Measurement Uploader

edit file

This component uploads fresh measurements from backend-fsn.ooni.org to S3 data bucketβ€‰πŸ’‘ after compressing them into Postcansβ€‰πŸ’‘ and .jsonl files.

It inserts records in the jsonl table ⛁ using the api database user.

The uploader runs hourly. The measurement batching process is designed to avoid data loss in case of interruption or crash:

  • Scan for raw measurements from the spool directory, typically /var/lib/ooniapi/measurements/incoming/

  • Generate one Postcansβ€‰πŸ’‘ and JSONL filesβ€‰πŸ’‘ in a different directory

  • Delete the raw measurements

  • Upload the postcan and jsonl files to S3 data bucketβ€‰πŸ’‘

  • Insert new records in jsonl table ⛁ with fields report_id, input, s3path, linenum, measurement_uid

The jsonl table is used by the API to look up measurement bodies. There is one line per measurement. The s3path column identifies the key on S3 data bucketβ€‰πŸ’‘ containing the compressed JSONL file with the measurement data. The linenum column contains the line number in such file where the measurement is found. See Measurementsβ€‰πŸ

Reads the /etc/ooni/api.conf file. The file itself is deployed by Ansibleβ€‰πŸ”§.

Also see the Measurement uploader dashboardβ€‰πŸ“Š, uploader timer and Main data flowsβ€‰πŸ’‘

Sources

Postcans

A β€œpostcan” is tarball containing measurements as they are uploaded by the probes, optionally compressed. Postcans are meant for internal use.

S3 data bucket

The ooni-data-eu-fra Amazon S3 bucket contains the whole OONI dataset. It is accessible with the S3 protocol and also over HTTPS: https://ooni-data-eu-fra.s3.eu-central-1.amazonaws.com/

It uses a dedicated Open Data account providing free hosting for public data. Details on the OONI account used for this are in the Team credential repositoryβ€‰πŸ’‘.

note All data on the bucket has to be kept publicly accessible to comply with the Open Data requirements. Do not run other AWS services using the Open Data account.

S3 measurement files layout

Probes usually upload multiple measurements on each execution. Measurements are stored temporarily and then batched together, compressed and uploaded to the S3 bucket once every hour. To ensure transparency, incoming measurements go through basic content validation and the API returns success or error; once a measurement is accepted it will be published on S3.

Specifications of the raw measurement data can be found inside of the ooni/spec repository.

JSONL files

File paths in the S3 bucket in JSONL format.

Contains a JSON document for each measurement, separated by newline and compressed, for faster processing. The JSONL format is natively supported by various data science tools and libraries.

The path structure allows to easily select, identify and download data based on the researcher’s needs.

In the path template:

  • cc is an uppercase 2 letter country code

  • testname is a test name where underscores are removed

  • timestamp is a YYYYMMDD timestamp

  • name is a unique filename

Compressed JSONL from measurements before 20201021

The path structure is: s3://ooni-data-eu-fra/jsonl/<testname>/<cc>/<timestamp>/00/<name>.jsonl.gz

Example:

s3://ooni-data-eu-fra/jsonl/webconnectivity/IT/20200921/00/20200921_IT_webconnectivity.l.0.jsonl.gz

You can list JSONL files with:

s3cmd ls s3://ooni-data-eu-fra/jsonl/
s3cmd ls s3://ooni-data-eu-fra/jsonl/webconnectivity/US/20201021/00/

Compressed JSONL from measurements starting from 20201020

The path structure is:

s3://ooni-data-eu-fra/raw/<timestamp>/<hour>/<cc>/<testname>/<ts2>_<cc>_<testname>.<host_id>.<counter>.jsonl.gz

Example:

s3://ooni-data-eu-fra/raw/20210817/15/US/webconnectivity/2021081715_US_webconnectivity.n0.0.jsonl.gz

Note: The path will be updated in the future to live under /jsonl/

You can list JSONL files with:

s3cmd ls s3://ooni-data-eu-fra/raw/20210817/15/US/webconnectivity/

Raw β€œpostcans” from measurements starting from 20201020

Each HTTP POST is stored in the tarball as <timestamp>_<cc>_<testname>/<timestamp>_<cc>_<testname>_<hash>.post

Example:

s3://ooni-data-eu-fra/raw/20210817/11/GB/webconnectivity/2021081711_GB_webconnectivity.n0.0.tar.gz

Listing postcan files:

s3cmd ls s3://ooni-data-eu-fra/raw/20210817/
s3cmd ls s3://ooni-data-eu-fra/raw/20210817/11/GB/webconnectivity/