We deployed the new M-Lab platform to one third of the M-Lab fleet, and now we need to assess whether the new platform is a performance regression relative to the old platform. If we can be confident that the new platform's performance does not constitute a regression, we can roll it out to the rest of the M-Lab fleet around the world.

More concretely, for every site, we would like to be certain, at the 95% level or above, that the performance data gathered on the new platform is not worse (lower bandwidth) than the old platform at that same site.

We are in a situation that is an almost textbook case for statistical analysis:

1. We have a lot of data
2. We have a question of interest (is the new platform worse than the old platform?)
3. We have a default action (roll out the new platform)

In developing an analysis, we want to be cognizant of a few important facts:

1. The variability of Internet performance differs from region to region around the world.
2. Not all machines at a site are used equally. In particular, we know of some NDT clients that will ONLY run a test to ndt.iupui.mlab1.lax03 and will never run a test to ndt.iupui.mlab2.lax03 or ndt.iupui.mlab3.lax03.
3. The Kubernetes platform is not rolled out to any mlab1 machines.

Our needs and our constraints lead to the following per-site analysis method to decide whether the new platform is not a regression at a given site:

Let us assume that for this site, the mlab2 machine uses PlanetLab, vservers, and web100 (a PL machine) and the mlab3 machine uses Kubernetes, Docker, and tcp-info (a k8s machine).

1. Calculate the mean (both arithmetic mean and log mean) download and upload performance experienced by all NDT clients that connected to mlab2 or mlab3 during the month of September. If none of the new platform's means are worse than the corresponding old-platform means, then we are done: the performance of the new platform at that site does not constitute a regression.
2. If a mean has regressed, then we calculate a "noise floor" for the site by performing the same analysis for each of January through June of 2019. During those months, all mlab3 machines were controlled by the old platform, so the month-to-month variation between sample means establishes a "noise floor". We then compare the new difference (between the new platform and the old) to the established noise floor (between the old platform and itself) to see whether it is within historical norms. If we are 95% confident that the new performance differential is within historical norms, then we conclude that the new performance at the site does not constitute a regression.
3. If the regression in performance is outside of historical norms, we conduct an investigation to discover why.

(Spoiler alert: every analysis we performed ended after step 1, because no means got worse)
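Had a mean regressed, the step-2 noise-floor check would compare the new-vs-old difference against the spread of old-vs-old monthly differences. A minimal sketch, with made-up numbers and a hypothetical `within_noise_floor` helper (not part of the analysis code below):

```python
# Sketch of the step-2 "noise floor" check. We treat the old-vs-old
# monthly mean differences (Jan-Jun 2019, in Mbps) as samples of a site's
# natural month-to-month variation, and ask whether the new-vs-old
# difference falls inside the central 95% of that empirical distribution.
# All numbers here are made up for illustration.

def within_noise_floor(historical_diffs, new_diff, confidence=0.95):
    """Return True if new_diff is within the site's historical norms."""
    diffs = sorted(historical_diffs)
    # Trim (1 - confidence)/2 of the mass from each tail. With only a
    # few samples, this reduces to the min-max range of the history.
    lo_idx = int(((1 - confidence) / 2) * len(diffs))
    hi_idx = len(diffs) - 1 - lo_idx
    return diffs[lo_idx] <= new_diff <= diffs[hi_idx]

# Six months of old-vs-old variation (made-up numbers):
historical = [-0.8, 0.3, -0.2, 0.5, -0.4, 0.1]
print(within_noise_floor(historical, 0.2))   # inside historical norms
print(within_noise_floor(historical, -5.0))  # outside historical norms
```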

Armed with this analysis method, we set about investigating M-Lab's BigQuery NDT data. When a single client endpoint has tested against a server multiple times, we choose one successful result at random to prevent any individual network endpoint from contributing too much to the final analysis. We then compare the means and percentiles of these monthly per-endpoint samples between the PL and k8s platforms.
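The per-endpoint sampling idea can be sketched in plain Python; the actual analysis does this in SQL with ROW_NUMBER() and FARM_FINGERPRINT. The records and the `pick_one_per_endpoint` helper here are made up for illustration:

```python
# For each (site, client) pair, keep exactly one test, chosen
# deterministically-at-random by hashing the test ID. This mirrors the
# ROW_NUMBER() OVER (... ORDER BY FARM_FINGERPRINT(test_id)) approach
# in the query below. The records are made up for illustration.
import hashlib

def pick_one_per_endpoint(tests):
    """tests: list of dicts with 'site', 'client', 'test_id', 'bandwidth'."""
    chosen = {}
    for t in tests:
        key = (t['site'], t['client'])
        # A stable hash of the test ID acts as the "random" ordering.
        rank = hashlib.sha256(t['test_id'].encode()).hexdigest()
        if key not in chosen or rank < chosen[key][0]:
            chosen[key] = (rank, t)
    return [t for _, t in chosen.values()]

tests = [
    {'site': 'lax03', 'client': '1.2.3.4', 'test_id': 'a', 'bandwidth': 10},
    {'site': 'lax03', 'client': '1.2.3.4', 'test_id': 'b', 'bandwidth': 90},
    {'site': 'lax03', 'client': '5.6.7.8', 'test_id': 'c', 'bandwidth': 25},
]
print(len(pick_one_per_endpoint(tests)))  # one result per endpoint: 2
```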

We begin by constructing a basic query that filters out suspect data and then creates two joinable tables containing per-site performance at the minimum, 1%, 2%, ..., 98%, 99%, and maximum levels, as well as per-site average performance. With both of these in hand, we conduct two analyses: a visual one and a quantitative one.

In [1]:
from google.cloud import bigquery
client = bigquery.Client(project='measurement-lab')

download_data = '''
-- First, bring the old data and the new data into a common schema.
-- Also throw out null data.
WITH
-- First the data from the old platform
web100_performance_multiples AS (
SELECT test_id,
'web100' AS type,
partition_date AS date,
REGEXP_EXTRACT(
connection_spec.server_hostname,
r"(mlab[0-9]\.[^.]*)\.measurement-lab.org")
AS host,
web100_log_entry.connection_spec.remote_ip AS client,
8 * (web100_log_entry.snap.HCThruOctetsAcked
/ (web100_log_entry.snap.SndLimTimeRwin +
web100_log_entry.snap.SndLimTimeCwnd +
web100_log_entry.snap.SndLimTimeSnd))
AS bandwidth,
(web100_log_entry.snap.SndLimTimeRwin +
web100_log_entry.snap.SndLimTimeCwnd +
web100_log_entry.snap.SndLimTimeSnd) / 1000000.0
AS duration
FROM `measurement-lab.ndt.web100`
WHERE web100_log_entry.connection_spec.local_ip IS NOT NULL
AND web100_log_entry.connection_spec.remote_ip IS NOT NULL
AND connection_spec.data_direction IS NOT NULL
AND connection_spec.data_direction = 1
AND web100_log_entry.snap.HCThruOctetsAcked IS NOT NULL
AND web100_log_entry.snap.HCThruOctetsAcked > 0
-- Sum of SndLimTime* have real values.
AND web100_log_entry.snap.SndLimTimeRwin IS NOT NULL
AND web100_log_entry.snap.SndLimTimeCwnd IS NOT NULL
AND web100_log_entry.snap.SndLimTimeSnd IS NOT NULL
),
-- Next the data from the new platform, making sure to use socket-level
-- statistics instead of NDT's application-level statistics.
-- (This is to work around a bug in ndt-server -
--    https://github.com/m-lab/ndt-server/issues/207 )
tcpinfo_performance_multiples AS (
SELECT test_id,
'tcpinfo' AS type,
ndt5.partition_date AS date,
REPLACE(
REGEXP_EXTRACT(
-- NOTE: the source column was missing in the original; we assume the
-- machine name is extracted from the tcpinfo row's ParseInfo.TaskFileName.
tcpinfo.ParseInfo.TaskFileName,
r"-(mlab[1-4]-[a-z]{3}[0-9]{2})-"),
"-",
".")
AS host,
result.ClientIP AS client,
(FinalSnapshot.TCPInfo.BytesAcked*8/1000000)
/(TIMESTAMP_DIFF(result.S2C.EndTime,
result.S2C.StartTime,
MILLISECOND)/1000)
AS bandwidth,
TIMESTAMP_DIFF(result.S2C.EndTime,
result.S2C.StartTime,
MILLISECOND)/1000
AS duration
FROM `measurement-lab.ndt.ndt5` AS ndt5
INNER JOIN `measurement-lab.ndt.tcpinfo` AS tcpinfo
ON (ndt5.result.S2C.UUID = tcpinfo.UUID)
WHERE result.S2C IS NOT NULL
AND result.S2C.ClientIP IS NOT NULL
AND result.S2C.MeanThroughputMbps > 0
),
'''
normalization = '''
-- Now union the tables
performance_multiples_unfiltered AS (
SELECT * FROM web100_performance_multiples
UNION ALL
SELECT * FROM tcpinfo_performance_multiples
),
-- Now that the data has a unified schema, filter it:
--  1) only the month of September 2019
--  2) A 10 second transfer should take between 9 and 20 seconds, even
--     accounting for client/server clock skew
--  3) Only those streams that averaged greater than 10Kbps, as below
--     that is poorly measured by many popular clients
--  4) Eliminate the results generated by M-Lab's OA&M infrastructure
--  5) No mlab1 machines, because their client pool is known to be
--     different
--  6) No yul02 machines, because mlab3.yul02 was in maintenance mode
--     for the month of September.
performance_multiples AS (
SELECT type, regexp_extract(host, r"mlab[0-9]\.([^.]*)") AS site,
client, test_id, bandwidth
FROM performance_multiples_unfiltered
WHERE
-- The month of September 2019
date BETWEEN DATE("2019-09-01") AND DATE("2019-09-30")
-- A 10 second download should take between 9 and 20 seconds
AND duration BETWEEN 9 AND 20
-- Over 10kbps
AND bandwidth > .01
-- These IPs are used by M-Lab's OA&M infrastructure
AND client NOT IN ("45.56.98.222", "35.192.37.249", "35.225.75.192",
"2600:3c03::f03c:91ff:fe33:819", "23.228.128.99",
"2605:a601:f1ff:fffe::99")
-- No mlab1 machines
AND host NOT LIKE '%mlab1%'
-- No yul02 machines
AND host NOT LIKE '%yul02%'
),
-- Next, for ip addresses with multiple tests, choose a random result.
performance AS (
SELECT * EXCEPT(row_number) FROM (
SELECT *, ROW_NUMBER()
OVER (PARTITION BY CONCAT(site, client, type)
ORDER BY FARM_FINGERPRINT(test_id)) AS row_number
FROM performance_multiples)
WHERE row_number = 1
)
-- Select the type of server, the site, and the descriptive statistics.
SELECT
type,
site,
AVG(bandwidth) AS mean_mbps,
EXP(AVG(LN(bandwidth))) AS log_mean_mbps,
-- Despite the name, these are the 101 percentile boundaries
-- (min, 1%, 2%, ..., 99%, max).
APPROX_QUANTILES(bandwidth, 100) AS deciles
FROM performance
GROUP BY site, type
ORDER BY site, type
'''
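The query reports two averages per site: the arithmetic mean and the log mean, which EXP(AVG(LN(bandwidth))) makes a geometric mean. A quick illustration (with made-up numbers) of why we track both on heavy-tailed bandwidth data:

```python
# The arithmetic mean is dominated by a handful of very fast clients,
# while the log mean (geometric mean) better reflects the typical client.
# Bandwidth values are made up for illustration.
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def log_mean(xs):
    # Same as the query's EXP(AVG(LN(bandwidth))): the geometric mean.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Bandwidths in Mbps: one fast client dominates the arithmetic mean.
bandwidths = [1, 1, 1, 1, 1000]
print(round(arithmetic_mean(bandwidths), 1))  # 200.8
print(round(log_mean(bandwidths), 1))         # 4.0
```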

In [2]:
def deciles_and_means_from_query(query):
    from collections import defaultdict as dd
    decile = dd(lambda: dd(lambda: [0 for _ in range(101)]))
    average = dd(lambda: dd(float))
    log_average = dd(lambda: dd(float))

    for row in client.query(query):
        decile[row['site']][row['type']] = row['deciles']
        average[row['site']][row['type']] = float(row['mean_mbps'])
        log_average[row['site']][row['type']] = float(row['log_mean_mbps'])

    # Drop sites that have no data for one platform or the other.
    for site in [x for x in decile]:
        if sum(decile[site]['web100']) == 0 or sum(decile[site]['tcpinfo']) == 0:
            del decile[site]
            if site in average:
                del average[site]
            if site in log_average:
                del log_average[site]
    return decile, average, log_average

In [3]:
decile, average, log_average = deciles_and_means_from_query(download_data + normalization)

In [4]:
%matplotlib inline

def plot_small_multiples(decile, average, log_average):
    import math
    import matplotlib.pyplot as plt
    columns = 5
    rows = math.ceil(len(decile) / float(columns))
    plt.rcParams['figure.figsize'] = [20, 4 * rows]
    for i, site in enumerate(sorted(decile)):
        current = plt.subplot(rows, columns, i + 1)
        top = 3000
        # The y=x reference line: points above it are faster on the new platform.
        plt.loglog(range(int(top + 2)), range(int(top + 2)))
        plt.text(top**.5, top**.5 / 10, site, ha='center', fontsize=48,
                 color='lightgray')
        plt.loglog(decile[site]['web100'], decile[site]['tcpinfo'],
                   label=site, linewidth=1, marker=".", markersize=10)
        plt.loglog([average[site]['web100']], [average[site]['tcpinfo']],
                   color='green', marker=".", markersize=30)
        plt.loglog([log_average[site]['web100']], [log_average[site]['tcpinfo']],
                   color='blue', marker=".", markersize=30)
        plt.xlim(left=.1, right=3000)
        plt.ylim(bottom=.1, top=3000)
        plt.setp(current.get_xticklabels(), visible=(i + columns >= len(decile)))
        plt.setp(current.get_yticklabels(), visible=((i % columns) == 0))

    plt.show()

In [5]:
plot_small_multiples(decile, average, log_average)