Migration to the v2 Data Pipeline

If you typically use the measurement-lab.ndt.unified_uploads or measurement-lab.ndt.unified_downloads views, then nothing will change. We are updating the ndt5, switch, and tcpinfo schemas, removing obsolete views, and renaming some views in preparation for improving ease of use and documentation.

Migration to the v2 Data Pipeline

Since completing the tcpinfo platform migration in 2020, the first generation v1 data pipeline has continued to process datatypes collected from the web100 platform and many datatypes from the tcpinfo platform. Datatypes processed by the v1 data pipeline have “v1 schemas”. These schemas typically use ad-hoc naming conventions that may be inconsistent between datatypes making use and documentation onerous. As well, these datatypes have had long standing data quality issues, such as accurate server annotations.

Now we are migrating all datatypes collected from the tcpinfo platform to the v2 data pipeline. The v2 data pipeline provides uniform naming conventions, standard BigQuery column names, general data quality improvements (e.g. fixing server annotations), and allows us to easily generate reference schema documentation for the website (e.g. like ndt7). All data types processed by the v2 data pipeline will adopt the “v2 schema” naming conventions.

Some datatypes are already processed by the v2 data pipeline, such as ndt7, annotation, scamper1 (i.e. traceroute), and a pcap index.

This announcement heralds a set of changes coming in the next few weeks and months that will update datatypes to the v2 data pipeline (ndt5, traceroute, tcpinfo, switch) or clearly label them as “legacy” schemas from the v1 data pipeline (paris-traceroute, ndt web100, sidestream).

Users of the measurement-lab.ndt.unified_uploads and measurement-lab.ndt.unified_downloads views should experience no changes.

Coming Changes

The views in the measurement-lab.library dataset are now obsolete. Each can be replaced by the equivalent view with the same datatype and direction suffix from the ndt_intermediate dataset.

To be removed Use instead
library.ndt_unified_ndt5_downloads ndt_intermediate.extended_ndt5_downloads
library.ndt_unified_ndt5_uploads ndt_intermediate.extended_ndt5_uploads
library.ndt_unified_ndt7_downloads ndt_intermediate.extended_ndt7_downloads
library.ndt_unified_ndt7_uploads ndt_intermediate.extended_ndt7_uploads
library.ndt_unified_web100_downloads ndt_intermediate.extended_web100_downloads
library.ndt_unified_web100_uploads ndt_intermediate.extended_web100_uploads

Views in the measurement-lab.ndt dataset that use v1 schemas must be removed before we replace them with views using v2 schemas. This should be minimally disruptive because these v1 views are redundant to identical views that are already present in the measurement-lab.ndt_raw dataset.

To be removed Use instead
ndt.traceroute (v1) ndt_raw.paris1_legacy
ndt.ndt5 (v1) ndt_raw.ndt5_legacy
ndt.tcpinfo (v1) ndt_raw.tcpinfo_legacy
ndt.web100 (v1) ndt_raw.web100_legacy

The following views that use v1 schemas will be renamed with a “_legacy” suffix to be consistent with other v1 schema views. As well, this allows a view using a v2 schema to be deployed later without causing a conflict.

To be renamed New name
sidestream.web100 (v1) sidestream.web100_legacy
utilization.switch (v1) utilization.switch_legacy

The traceroute datatypes have historically been combined into a single table. Going forward they will be separated and named by their underlying format and tool of origin. So, the following changes are planned.

To be removed To be replaced by
aggregate.traceroute traceroute.scamper1

Once the above changes are complete, they will be followed by deployment of the updated v2 data pipeline parsers for the switch, ndt5 and tcpinfo datatypes.

Once these changes are complete, we will be updating the website with up-to-date schema documentation for these raw datatypes using the v2, standard column naming conventions, including:

  • ndt7
  • annotation
  • ndt5
  • switch
  • scamper1
  • tcpinfo
  • pcap

UPDATE: 2022-02-28 - we changed the alternate name for the legacy ndt.traceroute view to ndt_raw.paris1_legacy (instead of ndt_raw.traceroute_legacy) to be consistent with the underlying datatype name and further move away from uses of “traceroute” as a single datatype name, when instead “traceroute” is a dataset consisting of many sources.

Back to Top