Ookla

15.5. Ookla#

Ookla Speedtests are a way for users to find out the download speed, upload speed and latency of their internet connection.

15.5.1. Ookla for Good#

Ookla for Good™ is an initiative to provide data, analysis, and content to organizations that are seeking to improve people’s lives through internet accessibility. Ookla partners with organizations whose goals align with ours to provide unbiased information about the state of networks worldwide.

Ookla has an open dataset which can be downloaded from AWS S3. They also provide an interactive map to explore the data, here. This data is available quarterly at quadkey zoom level 16. It has Global Fixed Broadband & Mobile Network Maps.

Through the Development Data Partnership, we have access to a more spatio-temporal granular data that helps us understand the impact of and recovery after a crisis event.

In this class we are going to work with the flood from May 2024 in Rio Grande do Sul, Brazil. The class is organized in two parts:

Review the open access dataset.
Show the methodology that the Data Lab had followed in the analysis of the Ookla dataset in the past.

15.5.1.1. Ookla Open Dataset#

Observing changes due to a crisis might be harder with this type of dataset because of its granularity. However, by working with it one can understand the potential that the private dataset can have.

15.5.1.1.1. Download the data#

For downloading the data follow these steps:

Go to this link.
Install AWS CLI (if necessary) on your notebook.

!pip install awscli
Run this command to explore the datasets.

!aws s3 ls --no-sign-request s3://ookla-open-data/
Download the necessary file.

import geopandas as gpd
import pandas as pd
import mercantile
import folium

admin2 = gpd.read_file('../../data/mapping-monitoring-floods/gadm41_BRA_2.json')

# Filter boundaries in Rio Grande do Sul
rgds = admin2[admin2['NAME_1']=='RioGrandedoSul'].copy()

bounds = rgds.bounds
minx = bounds.minx.min()
miny = bounds.miny.min()
maxx = bounds.maxx.max()
maxy = bounds.maxy.max()

zoom_level = 16
tiles = mercantile.tiles(minx, miny, maxx, maxy, zooms=zoom_level)
quadkeys = [mercantile.quadkey(tile).rjust(16, '0') for tile in tiles]

path = '/home/sol/gitrepo/alternative-data-for-crisis/data/internet-connectivity/'

15.5.1.1.2. Mobile Network#

q1 = gpd.read_file(path + 'quarter=1/2024-01-01_performance_mobile_tiles.zip')
q2 = gpd.read_file(path + 'quarter=2/2024-04-01_performance_mobile_tiles.zip')
q3 = gpd.read_file(path + 'quarter=3/2024-07-01_performance_mobile_tiles.zip')

q1.head()

	quadkey	avg_d_kbps	avg_u_kbps	avg_lat_ms	tests	devices	geometry
0	0022133222312323	60189	18677	69	1	1	POLYGON ((-160.02136 70.64359, -160.01587 70.6...
1	0022133222330102	7878	15619	103	1	1	POLYGON ((-160.02686 70.63995, -160.02136 70.6...
2	0022332203013333	15985	4315	60	2	1	POLYGON ((-162.60315 66.89775, -162.59766 66.8...
3	0022332203102213	257763	36399	49	1	1	POLYGON ((-162.58118 66.90206, -162.57568 66.9...
4	0022332203102221	226464	52992	53	1	1	POLYGON ((-162.59216 66.89991, -162.58667 66.8...

q1_rgds = q1[q1['quadkey'].isin(quadkeys)].copy()
q2_rgds = q2[q2['quadkey'].isin(quadkeys)].copy()
q3_rgds = q3[q3['quadkey'].isin(quadkeys)].copy()

idx = list(set(set(q1_rgds.quadkey).union(set(q2_rgds.quadkey))).union(q3_rgds.quadkey))

df = pd.DataFrame(index = idx)
df['q1_count'] = q1_rgds.set_index('quadkey')['devices']
df['q2_count'] = q2_rgds.set_index('quadkey')['devices']
df['q3_count'] = q3_rgds.set_index('quadkey')['devices']
df.fillna(0, inplace = True)
df['change_3_2'] = df['q3_count'] - df['q2_count']

geoms = pd.concat([q1_rgds, q2_rgds, q3_rgds]).drop_duplicates('geometry').set_index('quadkey')
df['geometry'] = geoms['geometry']
gdf = gpd.GeoDataFrame(data = df, crs = 'epsg:4326', geometry = df['geometry'])

15.5.1.1.2.1. Change in number of observed devices between Quarter 2 (flood) and Quarter 3 - Mobile Network#

gdf.explore(column = 'change_3_2')

Make this Notebook Trusted to load map: File -> Trust Notebook

15.5.1.1.3. Fixed Broadband#

q1 = gpd.read_file(path + 'quarter=1/2024-01-01_performance_fixed_tiles.zip')
q2 = gpd.read_file(path + 'quarter=2/2024-04-01_performance_fixed_tiles.zip')
q3 = gpd.read_file(path + 'quarter=3/2024-07-01_performance_fixed_tiles.zip')

q1_rgds = q1[q1['quadkey'].isin(quadkeys)].copy()
q2_rgds = q2[q2['quadkey'].isin(quadkeys)].copy()
q3_rgds = q3[q3['quadkey'].isin(quadkeys)].copy()

idx = list(set(set(q1_rgds.quadkey).union(set(q2_rgds.quadkey))).union(q3_rgds.quadkey))
df = pd.DataFrame(index = idx)
df['q1_count'] = q1_rgds.set_index('quadkey')['devices']
df['q2_count'] = q2_rgds.set_index('quadkey')['devices']
df['q3_count'] = q3_rgds.set_index('quadkey')['devices']
df.fillna(0, inplace = True)
df['change_3_2'] = df['q3_count'] - df['q2_count']
df['change_2_1'] = df['q2_count'] - df['q1_count']

geoms = pd.concat([q1_rgds, q2_rgds, q3_rgds]).drop_duplicates('geometry').set_index('quadkey')
df['geometry'] = geoms['geometry']
gdf = gpd.GeoDataFrame(data = df, crs = 'epsg:4326', geometry = df['geometry'])

15.5.1.1.3.1. Change in number of observed devices between Quarter 2 (flood) and Quarter 3 - Fixed Network#

This map cannot be deployed due to its size. We invite you to plot it locally.

# gdf.explore(column = 'change_3_2')

15.5.1.2. Ookla Private Dataset (Illustrative)#

In this part of the class, we are showing the methodology followed by the Data Lab team when trying to understand the physical impact on the internet connection of a crisis event.

import dask.dataframe as dd
import pandas as pd
import os
import dask_geopandas as dg
import geopandas as gpd
from dask.distributed import Client, LocalCluster
import matplotlib.pyplot as plt
import datetime
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', 500)

15.5.1.2.1. Fixed Network#

path = '../../data/internet-connectivity/proprietary/ookla/'
ddf = dd.read_csv(path + 'FixedNetworkPerformance*.csv', 
                   usecols = ["id_result",
                            "ts_result",
                            "attr_location_latitude",
                            "attr_location_longitude",
                             "val_download_mbps", 
                             "val_upload_mbps"],
                   low_memory = False,
                   dtype = {'id_result': 'str', 
                            'ts_result': 'str', 
                            'attr_location_latitude': 'float', 
                            'attr_location_longitude': 'float', })

ddf = ddf.dropna(subset=['ts_result'])
ddf['date'] = ddf['ts_result'].str[:10]

gddf = dg.from_dask_dataframe(ddf,
                                geometry=dg.points_from_xy(ddf, "attr_location_longitude", "attr_location_latitude"),
                            ).set_crs("EPSG:4326")

admin2 = gpd.read_file('../../data/mapping-monitoring-floods/gadm41_BRA_2.json')
rgds = admin2[admin2['NAME_1']=='RioGrandedoSul'].copy()

gddf = gddf.sjoin(rgds)

15.5.1.2.1.1. Number of Tests#

In this part of the analysis, we are exploring the number of tests we observe at each administrative level 2 boundary.

tests = gddf.groupby(['GID_2', 'NAME_2', 'date']).size().compute().reset_index()

tests.rename(columns = {0: 'ntests'}, inplace = True)

tests = tests[(tests['date']>'2024-04-01')&(tests['date']<='2024-06-30')].copy()
tests["date"] = pd.to_datetime(tests["date"])
tests.sort_values('date', inplace = True)

From our previous work, we selected administrative boundaries that had more than 300 affected pixels. From these 42 boundaries, I selected 5 of them in which I can see that the flood affected an urban area.

aff = ['Agudo', 'Alegrete', 'Arambaré', 'BarradoRibeiro', 'Cacequi', 'CachoeiradoSul', 'Camaquã', 'Candelária', 'Canoas', 'CapivaridoSul', 
 'DonaFrancisca', 'Formigueiro', 'Garruchos', 'GeneralCâmara', 'Itaqui', 'ManoelViana', 'MinasdoLeão', 'Mostardas', 'NovaSantaRita', 
 'PalmaresdoSul', 'PantanoGrande', 'ParaísodoSul', 'PassodoSobrado', 'PortoAlegre', 'RestingaSêca', 'RioGrande', 'RioPardo', 
 'SantaCruzdoSul', 'SantaMaria', 'SãoBorja', 'SãoJosédoNorte', 'SãoSepé', 'SãoVicentedoSul', 'Tapes', 'Taquari', 'Triunfo', 'ValedoSol', 
 'ValeVerde', 'VenâncioAires', 'VeraCruz']
aff_urban = ['PortoAlegre', 'Canoas', 'Charqueadas', 'NovaSantaRita', 'CachoeiradoSul']

import bokeh
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import Span
from bokeh.layouts import column
from bokeh.models import Legend, TabPanel, Tabs, HoverTool
from bokeh.core.validation.warnings import EMPTY_LAYOUT, MISSING_RENDERERS
from bokeh.io import reset_output
reset_output()

bokeh.core.validation.silence(EMPTY_LAYOUT, True)
bokeh.core.validation.silence(MISSING_RENDERERS, True)

color_palette = [
    "#4E79A7",  # Blue
    "#F28E2B",  # Orange
    "#E15759",  # Red
    "#76B7B2",  # Teal
    "#59A14F",  # Green
    "#EDC948",  # Yellow
    "#B07AA1",  # Purple
    "#FF9DA7",  # Pink
    "#9C755F",  # Brown
    "#BAB0AC",  # Gray
    "#7C7C7C",  # Dark gray
    "#6B4C9A",  # Violet
    "#D55E00",  # Orange-red
    "#CC61B0",  # Magenta
    "#0072B2",  # Bright blue
    "#329262",  # Peacock green
    "#9E5B5A",  # Brick red
    "#636363",  # Medium gray
    "#CD9C00",  # Gold
    "#5D69B1",  # Medium blue
]

output_notebook()
p = figure(x_axis_type="datetime", width=800, height=400, toolbar_location="above", 
          x_axis_label= 'Date', y_axis_label = 'Number of Tests')
p.add_layout(Legend(), "right")
measure = 'ntests'
for id, adm2 in enumerate(aff_urban):
    df = tests[tests["NAME_2"] == adm2][["date", measure]].reset_index(drop=True)
    p.line(df["date"], df[measure], line_width=2, line_color=color_palette[id], legend_label=adm2, )
p.legend.click_policy = "hide"

title_fig = figure(title='Number of Tests by Administrative level 2',
    toolbar_location=None, width=800, height=40)
title_fig.title.align = "left"
title_fig.title.text_font_size = "20pt"
title_fig.border_fill_alpha = 0
title_fig.outline_line_width = 0
layout = column(title_fig, p)
p.renderers.extend(
    [Span(location=datetime.datetime(2024, 4, 29), dimension="height", line_color="#7C7C7C", line_width=2, line_dash=(4, 4))]
)
hover = HoverTool(tooltips=[
    ("Date", "@x{%F}"),  # Format the date as YYYY-MM-DD
    ("Value", "@y")
], formatters={
    '@x': 'datetime'  # Use datetime formatter for x values
})
p.add_tools(hover)
show(layout)

Loading BokehJS ...

In the following plot, we are going to calculate the percentage change in number of test with respect to a baseline. In this case, the baseline is from 2024-04-02 to 2024-04-28.

baseline = tests[tests['date']<'2024-04-29']

baseline = baseline.groupby(["NAME_2"]).agg({"ntests": ["mean", "std"]})
baseline.columns = baseline.columns.map(".".join)

tests_change = tests.merge(baseline, on=["NAME_2"], how="left")

tests_change["n_baseline"] = tests_change["ntests.mean"]
tests_change["n_difference"] = (tests_change["ntests"] - tests_change["n_baseline"])
tests_change["percent_change"] = 100 * (tests_change["ntests"] / (tests_change["n_baseline"]) - 1)

output_notebook()
p = figure(x_axis_type="datetime", width=800, height=400, toolbar_location="above",
          x_axis_label= 'Date', y_axis_label = 'Percentage change [%]')
p.add_layout(Legend(), "right")
measure = 'percent_change'
for id, adm2 in enumerate(aff_urban):
    df = tests_change[(tests_change["NAME_2"] == adm2)&(tests_change['date']>'2024-04-29')][["date", measure]].reset_index(drop=True)
    p.line(df["date"], df[measure], line_width=2, line_color=color_palette[id], legend_label=adm2, )
p.legend.click_policy = "hide"

title_fig = figure(title='Percentage change of tests compared to baseline (2024-04-01 to 2024-04-28)',
    toolbar_location=None, width=800, height=40)
title_fig.title.align = "left"
title_fig.title.text_font_size = "15pt"
title_fig.border_fill_alpha = 0
title_fig.outline_line_width = 0
layout = column(title_fig, p)

hover = HoverTool(tooltips=[
    ("Date", "@x{%F}"),  # Format the date as YYYY-MM-DD
    ("Value", "@y")
], formatters={
    '@x': 'datetime'  # Use datetime formatter for x values
})
p.add_tools(hover)
show(layout)

Loading BokehJS ...

15.5.1.2.2. Mobile Network#

path = '../../data/internet-connectivity/proprietary/ookla/'
ddf = dd.read_csv(path + 'MobileNetworkPerformance*.csv', 
                   usecols = ["guid_result",
                            "ts_result",
                            "attr_location_latitude",
                            "attr_location_longitude",
                             "val_download_kbps", 
                             "val_upload_kbps"],
                   low_memory = False,
                   dtype = {'guid_result': 'str', 
                            'ts_result': 'str', 
                            'attr_location_latitude': 'float', 
                            'attr_location_longitude': 'float', })

ddf = ddf.dropna(subset=['ts_result'])
ddf['date'] = ddf['ts_result'].str[:10]

gddf = dg.from_dask_dataframe(ddf,
                                geometry=dg.points_from_xy(ddf, "attr_location_longitude", "attr_location_latitude"),
                            ).set_crs("EPSG:4326")

admin2 = gpd.read_file('../../data/mapping-monitoring-floods/gadm41_BRA_2.json')
rgds = admin2[admin2['NAME_1']=='RioGrandedoSul'].copy()

gddf = gddf.sjoin(rgds)

15.5.1.2.2.1. Number of Tests#

In this part of the analysis, we are exploring the number of tests we observe at each administrative level 2 boundary.

tests = gddf.groupby(['GID_2', 'NAME_2', 'date']).size().compute().reset_index()

tests.rename(columns = {0: 'ntests'}, inplace = True)

tests = tests[(tests['date']>'2024-04-01')&(tests['date']<='2024-06-30')].copy()
tests["date"] = pd.to_datetime(tests["date"])
tests.sort_values('date', inplace = True)

From our previous work, we selected administrative boundaries that had more than 300 affected pixels. From these 42 boundaries, I selected 5 of them in which I can see that the flood affected an urban area.

aff = ['Agudo', 'Alegrete', 'Arambaré', 'BarradoRibeiro', 'Cacequi', 'CachoeiradoSul', 'Camaquã', 'Candelária', 'Canoas', 'CapivaridoSul', 
 'DonaFrancisca', 'Formigueiro', 'Garruchos', 'GeneralCâmara', 'Itaqui', 'ManoelViana', 'MinasdoLeão', 'Mostardas', 'NovaSantaRita', 
 'PalmaresdoSul', 'PantanoGrande', 'ParaísodoSul', 'PassodoSobrado', 'PortoAlegre', 'RestingaSêca', 'RioGrande', 'RioPardo', 
 'SantaCruzdoSul', 'SantaMaria', 'SãoBorja', 'SãoJosédoNorte', 'SãoSepé', 'SãoVicentedoSul', 'Tapes', 'Taquari', 'Triunfo', 'ValedoSol', 
 'ValeVerde', 'VenâncioAires', 'VeraCruz']
aff_urban = ['PortoAlegre', 'Canoas', 'Charqueadas', 'NovaSantaRita', 'CachoeiradoSul']

import bokeh
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import Span
from bokeh.layouts import column
from bokeh.models import Legend, TabPanel, Tabs, HoverTool
from bokeh.core.validation.warnings import EMPTY_LAYOUT, MISSING_RENDERERS
from bokeh.io import reset_output
reset_output()

bokeh.core.validation.silence(EMPTY_LAYOUT, True)
bokeh.core.validation.silence(MISSING_RENDERERS, True)

color_palette = [
    "#4E79A7",  # Blue
    "#F28E2B",  # Orange
    "#E15759",  # Red
    "#76B7B2",  # Teal
    "#59A14F",  # Green
    "#EDC948",  # Yellow
    "#B07AA1",  # Purple
    "#FF9DA7",  # Pink
    "#9C755F",  # Brown
    "#BAB0AC",  # Gray
    "#7C7C7C",  # Dark gray
    "#6B4C9A",  # Violet
    "#D55E00",  # Orange-red
    "#CC61B0",  # Magenta
    "#0072B2",  # Bright blue
    "#329262",  # Peacock green
    "#9E5B5A",  # Brick red
    "#636363",  # Medium gray
    "#CD9C00",  # Gold
    "#5D69B1",  # Medium blue
]

output_notebook()
p = figure(x_axis_type="datetime", width=800, height=400, toolbar_location="above",
          x_axis_label= 'Date', y_axis_label = 'Number of Tests')
p.add_layout(Legend(), "right")
measure = 'ntests'
for id, adm2 in enumerate(aff_urban):
    df = tests[tests["NAME_2"] == adm2][["date", measure]].reset_index(drop=True)
    p.line(df["date"], df[measure], line_width=2, line_color=color_palette[id], legend_label=adm2, )
p.legend.click_policy = "hide"

title_fig = figure(title='Number of tests by Administrative level 2',
    toolbar_location=None, width=800, height=40)
title_fig.title.align = "left"
title_fig.title.text_font_size = "20pt"
title_fig.border_fill_alpha = 0
title_fig.outline_line_width = 0
layout = column(title_fig, p)
p.renderers.extend(
    [Span(location=datetime.datetime(2024, 4, 29), dimension="height", line_color="#7C7C7C", line_width=2, line_dash=(4, 4))]
)
hover = HoverTool(tooltips=[
    ("Date", "@x{%F}"),  # Format the date as YYYY-MM-DD
    ("Value", "@y")
], formatters={
    '@x': 'datetime'  # Use datetime formatter for x values
})
p.add_tools(hover)
show(layout)

Loading BokehJS ...

In the following plot, we are going to calculate the percentage change in number of test with respect to a baseline. In this case, the baseline is from 2024-04-02 to 2024-04-28.

baseline = tests[tests['date']<'2024-04-29']

baseline = baseline.groupby(["NAME_2"]).agg({"ntests": ["mean", "std"]})
baseline.columns = baseline.columns.map(".".join)

tests_change = tests.merge(baseline, on=["NAME_2"], how="left")

tests_change["n_baseline"] = tests_change["ntests.mean"]
tests_change["n_difference"] = (tests_change["ntests"] - tests_change["n_baseline"])
tests_change["percent_change"] = 100 * (tests_change["ntests"] / (tests_change["n_baseline"]) - 1)

output_notebook()
p = figure(x_axis_type="datetime", width=800, height=400, toolbar_location="above",
          x_axis_label= 'Date', y_axis_label = 'Percentage Change [%]')
p.add_layout(Legend(), "right")
measure = 'percent_change'
for id, adm2 in enumerate(aff_urban):
    df = tests_change[(tests_change["NAME_2"] == adm2)&(tests_change['date']>'2024-04-29')][["date", measure]].reset_index(drop=True)
    p.line(df["date"], df[measure], line_width=2, line_color=color_palette[id], legend_label=adm2, )
p.legend.click_policy = "hide"

title_fig = figure(title='Percentage change of tests compared to baseline (2024-04-01 to 2024-04-28)',
    toolbar_location=None, width=800, height=40)
title_fig.title.align = "left"
title_fig.title.text_font_size = "15pt"
title_fig.border_fill_alpha = 0
title_fig.outline_line_width = 0
layout = column(title_fig, p)

hover = HoverTool(tooltips=[
    ("Date", "@x{%F}"),  # Format the date as YYYY-MM-DD
    ("Value", "@y")
], formatters={
    '@x': 'datetime'  # Use datetime formatter for x values
})
p.add_tools(hover)
show(layout)

Loading BokehJS ...

Ookla

Contents

15.5. Ookla#

15.5.1. Ookla for Good#

15.5.1.1. Ookla Open Dataset#

15.5.1.1.1. Download the data#

15.5.1.1.2. Mobile Network#

15.5.1.1.2.1. Change in number of observed devices between Quarter 2 (flood) and Quarter 3 - Mobile Network#

15.5.1.1.3. Fixed Broadband#

15.5.1.1.3.1. Change in number of observed devices between Quarter 2 (flood) and Quarter 3 - Fixed Network#

15.5.1.2. Ookla Private Dataset (Illustrative)#

15.5.1.2.1. Fixed Network#

15.5.1.2.1.1. Number of Tests#

15.5.1.2.2. Mobile Network#

15.5.1.2.2.1. Number of Tests#