In December 2017, M-Lab was notified of oddities in the Paris Traceroute data. Upon investigation, a bug in the Paris Traceroute code was identified. The bug caused bad measurement data in 2.7% of the traceroutes since July 2016.
The M-Lab team is doing development to work around the bug in several ways:
- By changing how we call Paris Traceroute to reduce the probability that the bug condition occurs.
- By adding more sophisticated bad data detection and elimination code to our parsing of Paris Traceroute data.
- By adding monitoring to our parser so that we can know how much of our raw data is affected at any given time.
- By reprocessing our historical raw data archives in order to eliminate all bad traceroute data caused by this bug from our BigQuery database.
Through disclosures and analyses like these, M-Lab re-confirms its commitment to open data and open science. Our raw data, warts, bugs, and all, is and will be always available https://console.developers.google.com/storage/browser/m-lab/, while our processed data https://bigquery.cloud.google.com/dataset/measurement-lab:public?pli=1 continues to reflect our best understanding of the state of the Internet represented by that raw data.
Read More: The bug and its fixes
The root cause of the reported problem is a design flaw in Paris Traceroute, exacerbated by Rollins, the wrapper script that M-Lab uses to run it. The Rollins wrapper script went into production in first half of 2016 (replacing an older script which had its own flaws).
Every connection made to M-Lab systems triggers a Paris Traceroute to run from the M-Lab server back to the computer that initiated the connection. As the number of simultaneous traceroutes has increased, this has exposed a bug in Paris Traceroute where two independent traceroutes that overlap in time can become intermixed in the resulting Paris Traceroute output. For example, if we were to simultaneously run a traceroute from server S to clients A and B, the S-A path reported by the tool might contain hops that are exclusive to the S-B path, or it could switch in the middle of Paris Traceroute’s output from being the S-A path into being the S-B path.
In the following example, the output from Paris Traceroute switches without warning from being the output for:
$ paris-traceroute 188.8.131.52
into containing the data for the contemporaneously-run command:
$ paris-traceroute 184.108.40.206
The resulting output from the first command (shown below) shows a link that doesn’t exist from 220.127.116.11 to 18.104.22.168, on lines 8 and 9 and (worse!) switches without warning into the output from the second command (lines 9-13). In particular, note the suspiciously large RTT at the end of line 9 (
4802.776/4803.621/4807.622/1.790 ms) which is also the point of transition from the S-A path to the S-B path.
traceroute [(22.214.171.124:33458) -> (126.96.36.199:40784)], protocol icmp, algo exhaustive, duration 14 s 1 P(6, 6) 188.8.131.52 (184.108.40.206) 0.138/5.405/31.541/11.688 ms 2 P(6, 6) xe-1-0-6.cr2-sjc1.ip4.gtt.net (220.127.116.11) 19.090/21.052/24.168/1.898 ms 3 P(6, 6) as15169.sjc10.ip4.gtt.net (18.104.22.168) 19.105/19.611/21.314/0.796 ms 4 P(6, 6) 22.214.171.124 (126.96.36.199) 19.872/20.275/20.931/0.446 ms 5 P(6, 6) 188.8.131.52 (184.108.40.206) 20.092/20.545/21.096/0.331 ms MPLS Label 697177 TTL=1 6 P(6, 6) 220.127.116.11 (18.104.22.168) 53.493/54.490/57.796/1.493 ms MPLS Label 638493 TTL=1 7 P(6, 6) 22.214.171.124 (126.96.36.199) 52.755/56.170/67.922/5.386 ms MPLS Label 402431 TTL=1 8 P(6, 6) 188.8.131.52 (184.108.40.206) 52.455/52.652/52.981/0.228 ms 9 P(6, 6) csd180.gsc.webair.net (220.127.116.11) 4802.776/4803.621/4807.622/1.790 ms 10 P(6, 6) 18.104.22.168 (22.214.171.124) 66.509/66.524/66.561/0.018 ms 11 P(6, 6) 126.96.36.199 (188.8.131.52) 66.634/69.047/72.354/2.442 ms 12 P(6, 6) 184.108.40.206 (220.127.116.11) 67.066/70.367/72.034/2.327 ms 13 P(6, 6) 18.104.22.168 (22.214.171.124) 62.542/64.001/66.941/2.020 ms
This is obviously quite worrying, so the M-Lab team set about investigating why this is happening, how to prevent (or at least reduce the frequency of) its occurrence, and how to filter or annotate bad data in our database (while, of course, preserving the original raw data in our historical archives).
Paris Traceroute relies on a single 16 bit “tag” field to disambiguate returning ICMP messages, and that tag is essentially the output of a hash function. This single field is used both to identify which request packet (what initial TTL) and which session it belongs to. When there are two overlapping tests, it is possible to have a collision of tag values, exactly analogous to a hashtable collision. When one test starts first, that test runs for a bit, and then after the second (colliding) test starts, spurious data begins to arrive to the first test. The symptoms of this are that, in the middle of the first test, the results of the second test begin to pollute the results of the first. Just like with hash tables, the incidence rate of this bug scales up with usage. Sites with minimal numbers of connections saw no collisions, and sites with large numbers of connections saw an increased percentage of tests affected.
To prevent the collection of bad data, we have created a workaround in our Paris Traceroute wrapper to bring down the expected number of incidents and that workaround will be deployed to M-Lab servers soon.
To prevent the inclusion of any bad data in our BigQuery tables, we are porting the detection code from a standalone tool into our parser. We are also adding monitoring to measure the error rates going forwards. We will soon be reparsing our historical archives, and over the next quarter as all our historical archives get reprocessed, all bad data caused by this bug should be eliminated from or tagged within our BigQuery database. M-Lab’s raw data will remain untouched, as our raw data reflects the data we actually got from M-Lab servers, not the data we wish we had gotten from M-Lab servers.
Once remediation is completed, we will provide a longer write up going in to more depth about the issue and the remediation.
Thanks to Amogh Dhamdhere for flagging the Paris Traceroute data issue!