OONI Measurement Uploader
This component uploads fresh measurements from backend-fsn.ooni.org to the S3 data bucket after compressing them into Postcans and .jsonl files.
It inserts records in the jsonl table using the api database user.
The uploader runs hourly. The measurement batching process is designed to avoid data loss in case of interruption or crash:
- Scan for raw measurements from the spool directory, typically /var/lib/ooniapi/measurements/incoming/
- Generate the postcan and JSONL files in a different directory
- Delete the raw measurements
- Upload the postcan and JSONL files to the S3 data bucket
- Insert new records in the jsonl table with fields report_id, input, s3path, linenum, measurement_uid
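The ordering of the steps above can be illustrated with a minimal sketch. This is not the uploader's actual code. It assumes that each file in the spool directory holds exactly one JSON measurement, that one postcan and one JSONL file are produced per batch, and that line numbers start at 0. The function name `batch_spool`, the output file names and the record layout are hypothetical, and the S3 upload and database insert are left out.

```python
import gzip
import json
import tarfile
from pathlib import Path


def batch_spool(spool_dir: Path, out_dir: Path, batch_name: str) -> list[dict]:
    """Batch raw measurements into one postcan and one JSONL file.

    Returns one record per measurement, including the line it occupies
    in the JSONL file (0-based here, which is an assumption).
    """
    # Snapshot the spool: files arriving later are left for the next run.
    raw_files = sorted(p for p in spool_dir.iterdir() if p.is_file())
    if not raw_files:
        return []

    out_dir.mkdir(parents=True, exist_ok=True)
    postcan = out_dir / f"{batch_name}.tar.gz"
    jsonl = out_dir / f"{batch_name}.jsonl.gz"
    records = []

    with tarfile.open(postcan, "w:gz") as tar, \
            gzip.open(jsonl, "wt", encoding="utf-8") as out:
        for linenum, path in enumerate(raw_files):
            # The postcan keeps the measurement exactly as it was received.
            tar.add(path, arcname=path.name)
            msmt = json.loads(path.read_text(encoding="utf-8"))
            # json.dumps never emits a raw newline, so one call is one line.
            out.write(json.dumps(msmt) + "\n")
            records.append({
                "report_id": msmt.get("report_id"),
                "input": msmt.get("input"),
                "linenum": linenum,
            })

    # Both output files are closed at this point. Deleting the raw files
    # only now means an earlier crash leaves them in the spool for a retry.
    for path in raw_files:
        path.unlink()
    return records
```

If the process is interrupted before the deletion loop, the raw measurements are still in the spool directory and the next run can batch them again.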
The jsonl table is used by the API to look up measurement bodies. There
is one row per measurement. The s3path column identifies the key on the
S3 data bucket of the compressed JSONL file containing the measurement
data. The linenum column contains the line number in that file where the
measurement is found. See Measurements.
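As an illustration of this lookup, the following sketch turns an s3path value into an HTTPS URL on the bucket and extracts one measurement from the compressed JSONL content. It is not the API's implementation. The function names are hypothetical, and the sketch assumes that linenum is 0-based; if the table counts lines from 1, subtract one before calling it.

```python
import gzip
import io
import json

BUCKET_URL = "https://ooni-data-eu-fra.s3.eu-central-1.amazonaws.com/"


def jsonl_url(s3path: str) -> str:
    """Build the HTTPS URL for the S3 key stored in the s3path column."""
    return BUCKET_URL + s3path.lstrip("/")


def extract_measurement(jsonl_gz: bytes, linenum: int) -> dict:
    """Return the measurement found on line `linenum` (assumed 0-based)."""
    with gzip.open(io.BytesIO(jsonl_gz), "rt", encoding="utf-8") as f:
        for n, line in enumerate(f):
            if n == linenum:
                return json.loads(line)
    raise IndexError(f"line {linenum} not found in JSONL file")
```

The compressed bytes could be fetched with, for example, `urllib.request.urlopen(jsonl_url(s3path)).read()` before being passed to `extract_measurement`.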
The uploader reads the /etc/ooni/api.conf file. The file itself is deployed by Ansible.
Also see the Measurement uploader dashboard, uploader timer and Main data flows.
Postcans
A "postcan" is a tarball containing measurements as they are uploaded by the probes, optionally compressed. Postcans are meant for internal use.
S3 data bucket
The ooni-data-eu-fra Amazon S3 bucket contains the whole OONI dataset.
It is accessible with the S3 protocol and also over HTTPS:
https://ooni-data-eu-fra.s3.eu-central-1.amazonaws.com/
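As a rough illustration of HTTPS access, the sketch below builds a listing URL using the standard S3 ListObjectsV2 query parameters (`list-type=2`, `prefix`, `max-keys`) and extracts the object keys from the XML response. It assumes that anonymous listing is enabled on the bucket, which this document does not state. The helper names are hypothetical.

```python
import urllib.parse
import xml.etree.ElementTree as ET

BUCKET_URL = "https://ooni-data-eu-fra.s3.eu-central-1.amazonaws.com/"
# XML namespace used by S3 listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(prefix: str, max_keys: int = 100) -> str:
    """Build a ListObjectsV2 URL for keys starting with `prefix`."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"{BUCKET_URL}?{query}"


def parse_keys(listing_xml: bytes) -> list[str]:
    """Extract the object keys from a ListObjectsV2 XML response."""
    root = ET.fromstring(listing_xml)
    return [el.text for el in root.iter(f"{S3_NS}Key")]
```

The response body of `urllib.request.urlopen(list_url(prefix))` can be passed to `parse_keys`. Listings longer than `max_keys` are paginated by S3, and this sketch does not follow continuation tokens.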
It uses a dedicated Open Data account providing free hosting for public data. Details on the OONI account used for this are in the Team credential repository.
Note: All data on the bucket has to be kept publicly accessible to comply with the Open Data requirements. Do not run other AWS services using the Open Data account.
S3 measurement files layout
Probes usually upload multiple measurements on each execution. Measurements are stored temporarily and then batched together, compressed and uploaded to the S3 bucket once every hour. To ensure transparency, incoming measurements go through basic content validation and the API returns success or error; once a measurement is accepted it will be published on S3.
Specifications of the raw measurement data can be found in the ooni/spec repository.
JSONL files
This section describes the paths of the JSONL files in the S3 bucket.
Each file contains one JSON document per measurement, separated by newlines and compressed, for faster processing. The JSONL format is natively supported by various data science tools and libraries.
The path structure makes it easy to select, identify and download data based on the researcher's needs.
In the path template:
- cc is an uppercase 2-letter country code
- testname is a test name where underscores are removed
- timestamp is a YYYYMMDD timestamp
- name is a unique filename
Compressed JSONL from measurements before 20201021
The path structure is:
s3://ooni-data-eu-fra/jsonl/<testname>/<cc>/<timestamp>/00/<name>.jsonl.gz
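To show how the template fields combine, here is a minimal sketch that builds such a path from its components, following the field rules listed above. The function name `legacy_jsonl_path` is hypothetical, and the `name` value must be the file's real unique name, which this sketch cannot derive.

```python
from datetime import date


def legacy_jsonl_path(testname: str, cc: str, day: date, name: str) -> str:
    """Build a JSONL path following the template above."""
    cc = cc.upper()
    if len(cc) != 2 or not cc.isalpha():
        raise ValueError(f"expected a 2-letter country code, got {cc!r}")
    testname = testname.replace("_", "")  # underscores are removed
    timestamp = day.strftime("%Y%m%d")    # YYYYMMDD
    return (
        f"s3://ooni-data-eu-fra/jsonl/{testname}/{cc}/{timestamp}/00/"
        f"{name}.jsonl.gz"
    )
```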
Example:
You can list JSONL files with:
Compressed JSONL from measurements starting from 20201020
The path structure is:
Example:
Note: The path will be updated in the future to live under /jsonl/
You can list JSONL files with:
Raw "postcans" from measurements starting from 20201020
Each HTTP POST is stored in the tarball as
<timestamp>_<cc>_<testname>/<timestamp>_<cc>_<testname>_<hash>.post
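The entry naming above can be split back into its fields with a small parser. This sketch assumes that the timestamp, country code and hash contain no underscores; any underscores left in the middle are treated as part of the test name. `PostEntry` and `parse_post_name` are hypothetical names.

```python
from typing import NamedTuple


class PostEntry(NamedTuple):
    timestamp: str
    cc: str
    testname: str
    post_hash: str


def parse_post_name(member_name: str) -> PostEntry:
    """Split a postcan member name into its fields."""
    # Only the file name is parsed; the directory part repeats the same fields.
    filename = member_name.rsplit("/", 1)[-1]
    if not filename.endswith(".post"):
        raise ValueError(f"not a .post entry: {member_name!r}")
    parts = filename[: -len(".post")].split("_")
    if len(parts) < 4:
        raise ValueError(f"unexpected entry name: {member_name!r}")
    timestamp, cc, post_hash = parts[0], parts[1], parts[-1]
    testname = "_".join(parts[2:-1])
    return PostEntry(timestamp, cc, testname, post_hash)
```

Member names can be obtained from a downloaded postcan with `tarfile.open(path).getnames()`.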
Example:
Listing postcan files: