15.5. Ookla#

Ookla Speedtests are a way for users to find out the download speed, upload speed and latency of their internet connection.

15.5.1. Ookla for Good#

../../_images/ookla_for_good.png

Ookla for Good™ is an initiative to provide data, analysis, and content to organizations that are seeking to improve people’s lives through internet accessibility. Ookla partners with organizations whose goals align with ours to provide unbiased information about the state of networks worldwide.

Ookla has an open dataset which can be downloaded from AWS S3. They also provide an interactive map to explore the data, here. This data is available quarterly at quadkey zoom level 16. It has Global Fixed Broadband & Mobile Network Maps.

Through the Development Data Partnership, we have access to a more spatio-temporal granular data that helps us understand the impact of and recovery after a crisis event.

In this class we are going to work with the flood from May 2024 in Rio Grande do Sul, Brazil. The class is organized in two parts:

  1. Review the open access dataset.

  2. Show the methodology that the Data Lab had followed in the analysis of the Ookla dataset in the past.

15.5.1.1. Ookla Open Dataset#

Observing changes due to a crisis might be harder with this type of dataset because of its granularity. However, by working with it one can understand the potential that the private dataset can have.

15.5.1.1.1. Download the data#

For downloading the data follow these steps:

  1. Go to this link.

  2. Install AWS CLI (if necessary) on your notebook.

    !pip install awscli

  3. Run this command to explore the datasets.

    !aws s3 ls --no-sign-request s3://ookla-open-data/

  4. Download the necessary file.

import geopandas as gpd
import pandas as pd
import mercantile
import folium
admin2 = gpd.read_file('../../data/mapping-monitoring-floods/gadm41_BRA_2.json')
# Filter boundaries in Rio Grande do Sul
rgds = admin2[admin2['NAME_1']=='RioGrandedoSul'].copy()
bounds = rgds.bounds
minx = bounds.minx.min()
miny = bounds.miny.min()
maxx = bounds.maxx.max()
maxy = bounds.maxy.max()
zoom_level = 16
tiles = mercantile.tiles(minx, miny, maxx, maxy, zooms=zoom_level)
quadkeys = [mercantile.quadkey(tile).rjust(16, '0') for tile in tiles]
path = '/home/sol/gitrepo/alternative-data-for-crisis/data/internet-connectivity/'

15.5.1.1.2. Mobile Network#

q1 = gpd.read_file(path + 'quarter=1/2024-01-01_performance_mobile_tiles.zip')
q2 = gpd.read_file(path + 'quarter=2/2024-04-01_performance_mobile_tiles.zip')
q3 = gpd.read_file(path + 'quarter=3/2024-07-01_performance_mobile_tiles.zip')
q1.head()
quadkey avg_d_kbps avg_u_kbps avg_lat_ms tests devices geometry
0 0022133222312323 60189 18677 69 1 1 POLYGON ((-160.02136 70.64359, -160.01587 70.6...
1 0022133222330102 7878 15619 103 1 1 POLYGON ((-160.02686 70.63995, -160.02136 70.6...
2 0022332203013333 15985 4315 60 2 1 POLYGON ((-162.60315 66.89775, -162.59766 66.8...
3 0022332203102213 257763 36399 49 1 1 POLYGON ((-162.58118 66.90206, -162.57568 66.9...
4 0022332203102221 226464 52992 53 1 1 POLYGON ((-162.59216 66.89991, -162.58667 66.8...
q1_rgds = q1[q1['quadkey'].isin(quadkeys)].copy()
q2_rgds = q2[q2['quadkey'].isin(quadkeys)].copy()
q3_rgds = q3[q3['quadkey'].isin(quadkeys)].copy()
idx = list(set(set(q1_rgds.quadkey).union(set(q2_rgds.quadkey))).union(q3_rgds.quadkey))
df = pd.DataFrame(index = idx)
df['q1_count'] = q1_rgds.set_index('quadkey')['devices']
df['q2_count'] = q2_rgds.set_index('quadkey')['devices']
df['q3_count'] = q3_rgds.set_index('quadkey')['devices']
df.fillna(0, inplace = True)
df['change_3_2'] = df['q3_count'] - df['q2_count']
geoms = pd.concat([q1_rgds, q2_rgds, q3_rgds]).drop_duplicates('geometry').set_index('quadkey')
df['geometry'] = geoms['geometry']
gdf = gpd.GeoDataFrame(data = df, crs = 'epsg:4326', geometry = df['geometry'])
15.5.1.1.2.1. Change in number of observed devices between Quarter 2 (flood) and Quarter 3 - Mobile Network#
gdf.explore(column = 'change_3_2')
Make this Notebook Trusted to load map: File -> Trust Notebook

15.5.1.1.3. Fixed Broadband#

q1 = gpd.read_file(path + 'quarter=1/2024-01-01_performance_fixed_tiles.zip')
q2 = gpd.read_file(path + 'quarter=2/2024-04-01_performance_fixed_tiles.zip')
q3 = gpd.read_file(path + 'quarter=3/2024-07-01_performance_fixed_tiles.zip')
q1_rgds = q1[q1['quadkey'].isin(quadkeys)].copy()
q2_rgds = q2[q2['quadkey'].isin(quadkeys)].copy()
q3_rgds = q3[q3['quadkey'].isin(quadkeys)].copy()
idx = list(set(set(q1_rgds.quadkey).union(set(q2_rgds.quadkey))).union(q3_rgds.quadkey))
df = pd.DataFrame(index = idx)
df['q1_count'] = q1_rgds.set_index('quadkey')['devices']
df['q2_count'] = q2_rgds.set_index('quadkey')['devices']
df['q3_count'] = q3_rgds.set_index('quadkey')['devices']
df.fillna(0, inplace = True)
df['change_3_2'] = df['q3_count'] - df['q2_count']
df['change_2_1'] = df['q2_count'] - df['q1_count']
geoms = pd.concat([q1_rgds, q2_rgds, q3_rgds]).drop_duplicates('geometry').set_index('quadkey')
df['geometry'] = geoms['geometry']
gdf = gpd.GeoDataFrame(data = df, crs = 'epsg:4326', geometry = df['geometry'])
15.5.1.1.3.1. Change in number of observed devices between Quarter 2 (flood) and Quarter 3 - Fixed Network#

This map cannot be deployed due to its size. We invite you to plot it locally.

# gdf.explore(column = 'change_3_2')

15.5.1.2. Ookla Private Dataset (Illustrative)#

In this part of the class, we are showing the methodology followed by the Data Lab team when trying to understand the physical impact on the internet connection of a crisis event.

import dask.dataframe as dd
import pandas as pd
import os
import dask_geopandas as dg
import geopandas as gpd
from dask.distributed import Client, LocalCluster
import matplotlib.pyplot as plt
import datetime
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', 500)

15.5.1.2.1. Fixed Network#

path = '../../data/internet-connectivity/proprietary/ookla/'
ddf = dd.read_csv(path + 'FixedNetworkPerformance*.csv', 
                   usecols = ["id_result",
                            "ts_result",
                            "attr_location_latitude",
                            "attr_location_longitude",
                             "val_download_mbps", 
                             "val_upload_mbps"],
                   low_memory = False,
                   dtype = {'id_result': 'str', 
                            'ts_result': 'str', 
                            'attr_location_latitude': 'float', 
                            'attr_location_longitude': 'float', })
ddf = ddf.dropna(subset=['ts_result'])
ddf['date'] = ddf['ts_result'].str[:10]
gddf = dg.from_dask_dataframe(ddf,
                                geometry=dg.points_from_xy(ddf, "attr_location_longitude", "attr_location_latitude"),
                            ).set_crs("EPSG:4326")
admin2 = gpd.read_file('../../data/mapping-monitoring-floods/gadm41_BRA_2.json')
rgds = admin2[admin2['NAME_1']=='RioGrandedoSul'].copy()
gddf = gddf.sjoin(rgds)
15.5.1.2.1.1. Number of Tests#

In this part of the analysis, we are exploring the number of tests we observe at each administrative level 2 boundary.

tests = gddf.groupby(['GID_2', 'NAME_2', 'date']).size().compute().reset_index()
tests.rename(columns = {0: 'ntests'}, inplace = True)
tests = tests[(tests['date']>'2024-04-01')&(tests['date']<='2024-06-30')].copy()
tests["date"] = pd.to_datetime(tests["date"])
tests.sort_values('date', inplace = True)

From our previous work, we selected administrative boundaries that had more than 300 affected pixels. From these 42 boundaries, I selected 5 of them in which I can see that the flood affected an urban area.

aff = ['Agudo', 'Alegrete', 'Arambaré', 'BarradoRibeiro', 'Cacequi', 'CachoeiradoSul', 'Camaquã', 'Candelária', 'Canoas', 'CapivaridoSul', 
 'DonaFrancisca', 'Formigueiro', 'Garruchos', 'GeneralCâmara', 'Itaqui', 'ManoelViana', 'MinasdoLeão', 'Mostardas', 'NovaSantaRita', 
 'PalmaresdoSul', 'PantanoGrande', 'ParaísodoSul', 'PassodoSobrado', 'PortoAlegre', 'RestingaSêca', 'RioGrande', 'RioPardo', 
 'SantaCruzdoSul', 'SantaMaria', 'SãoBorja', 'SãoJosédoNorte', 'SãoSepé', 'SãoVicentedoSul', 'Tapes', 'Taquari', 'Triunfo', 'ValedoSol', 
 'ValeVerde', 'VenâncioAires', 'VeraCruz']
aff_urban = ['PortoAlegre', 'Canoas', 'Charqueadas', 'NovaSantaRita', 'CachoeiradoSul']
import bokeh
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import Span
from bokeh.layouts import column
from bokeh.models import Legend, TabPanel, Tabs, HoverTool
from bokeh.core.validation.warnings import EMPTY_LAYOUT, MISSING_RENDERERS
from bokeh.io import reset_output
reset_output()

bokeh.core.validation.silence(EMPTY_LAYOUT, True)
bokeh.core.validation.silence(MISSING_RENDERERS, True)

color_palette = [
    "#4E79A7",  # Blue
    "#F28E2B",  # Orange
    "#E15759",  # Red
    "#76B7B2",  # Teal
    "#59A14F",  # Green
    "#EDC948",  # Yellow
    "#B07AA1",  # Purple
    "#FF9DA7",  # Pink
    "#9C755F",  # Brown
    "#BAB0AC",  # Gray
    "#7C7C7C",  # Dark gray
    "#6B4C9A",  # Violet
    "#D55E00",  # Orange-red
    "#CC61B0",  # Magenta
    "#0072B2",  # Bright blue
    "#329262",  # Peacock green
    "#9E5B5A",  # Brick red
    "#636363",  # Medium gray
    "#CD9C00",  # Gold
    "#5D69B1",  # Medium blue
]
output_notebook()
p = figure(x_axis_type="datetime", width=800, height=400, toolbar_location="above", 
          x_axis_label= 'Date', y_axis_label = 'Number of Tests')
p.add_layout(Legend(), "right")
measure = 'ntests'
for id, adm2 in enumerate(aff_urban):
    df = tests[tests["NAME_2"] == adm2][["date", measure]].reset_index(drop=True)
    p.line(df["date"], df[measure], line_width=2, line_color=color_palette[id], legend_label=adm2, )
p.legend.click_policy = "hide"

title_fig = figure(title='Number of Tests by Administrative level 2',
    toolbar_location=None, width=800, height=40)
title_fig.title.align = "left"
title_fig.title.text_font_size = "20pt"
title_fig.border_fill_alpha = 0
title_fig.outline_line_width = 0
layout = column(title_fig, p)
p.renderers.extend(
    [Span(location=datetime.datetime(2024, 4, 29), dimension="height", line_color="#7C7C7C", line_width=2, line_dash=(4, 4))]
)
hover = HoverTool(tooltips=[
    ("Date", "@x{%F}"),  # Format the date as YYYY-MM-DD
    ("Value", "@y")
], formatters={
    '@x': 'datetime'  # Use datetime formatter for x values
})
p.add_tools(hover)
show(layout)
Loading BokehJS ...

In the following plot, we are going to calculate the percentage change in number of test with respect to a baseline. In this case, the baseline is from 2024-04-02 to 2024-04-28.

baseline = tests[tests['date']<'2024-04-29']
baseline = baseline.groupby(["NAME_2"]).agg({"ntests": ["mean", "std"]})
baseline.columns = baseline.columns.map(".".join)
tests_change = tests.merge(baseline, on=["NAME_2"], how="left")
tests_change["n_baseline"] = tests_change["ntests.mean"]
tests_change["n_difference"] = (tests_change["ntests"] - tests_change["n_baseline"])
tests_change["percent_change"] = 100 * (tests_change["ntests"] / (tests_change["n_baseline"]) - 1)
output_notebook()
p = figure(x_axis_type="datetime", width=800, height=400, toolbar_location="above",
          x_axis_label= 'Date', y_axis_label = 'Percentage change [%]')
p.add_layout(Legend(), "right")
measure = 'percent_change'
for id, adm2 in enumerate(aff_urban):
    df = tests_change[(tests_change["NAME_2"] == adm2)&(tests_change['date']>'2024-04-29')][["date", measure]].reset_index(drop=True)
    p.line(df["date"], df[measure], line_width=2, line_color=color_palette[id], legend_label=adm2, )
p.legend.click_policy = "hide"

title_fig = figure(title='Percentage change of tests compared to baseline (2024-04-01 to 2024-04-28)',
    toolbar_location=None, width=800, height=40)
title_fig.title.align = "left"
title_fig.title.text_font_size = "15pt"
title_fig.border_fill_alpha = 0
title_fig.outline_line_width = 0
layout = column(title_fig, p)

hover = HoverTool(tooltips=[
    ("Date", "@x{%F}"),  # Format the date as YYYY-MM-DD
    ("Value", "@y")
], formatters={
    '@x': 'datetime'  # Use datetime formatter for x values
})
p.add_tools(hover)
show(layout)
Loading BokehJS ...

15.5.1.2.2. Mobile Network#

path = '../../data/internet-connectivity/proprietary/ookla/'
ddf = dd.read_csv(path + 'MobileNetworkPerformance*.csv', 
                   usecols = ["guid_result",
                            "ts_result",
                            "attr_location_latitude",
                            "attr_location_longitude",
                             "val_download_kbps", 
                             "val_upload_kbps"],
                   low_memory = False,
                   dtype = {'guid_result': 'str', 
                            'ts_result': 'str', 
                            'attr_location_latitude': 'float', 
                            'attr_location_longitude': 'float', })
ddf = ddf.dropna(subset=['ts_result'])
ddf['date'] = ddf['ts_result'].str[:10]
gddf = dg.from_dask_dataframe(ddf,
                                geometry=dg.points_from_xy(ddf, "attr_location_longitude", "attr_location_latitude"),
                            ).set_crs("EPSG:4326")
admin2 = gpd.read_file('../../data/mapping-monitoring-floods/gadm41_BRA_2.json')
rgds = admin2[admin2['NAME_1']=='RioGrandedoSul'].copy()
gddf = gddf.sjoin(rgds)
15.5.1.2.2.1. Number of Tests#

In this part of the analysis, we are exploring the number of tests we observe at each administrative level 2 boundary.

tests = gddf.groupby(['GID_2', 'NAME_2', 'date']).size().compute().reset_index()
tests.rename(columns = {0: 'ntests'}, inplace = True)
tests = tests[(tests['date']>'2024-04-01')&(tests['date']<='2024-06-30')].copy()
tests["date"] = pd.to_datetime(tests["date"])
tests.sort_values('date', inplace = True)

From our previous work, we selected administrative boundaries that had more than 300 affected pixels. From these 42 boundaries, I selected 5 of them in which I can see that the flood affected an urban area.

aff = ['Agudo', 'Alegrete', 'Arambaré', 'BarradoRibeiro', 'Cacequi', 'CachoeiradoSul', 'Camaquã', 'Candelária', 'Canoas', 'CapivaridoSul', 
 'DonaFrancisca', 'Formigueiro', 'Garruchos', 'GeneralCâmara', 'Itaqui', 'ManoelViana', 'MinasdoLeão', 'Mostardas', 'NovaSantaRita', 
 'PalmaresdoSul', 'PantanoGrande', 'ParaísodoSul', 'PassodoSobrado', 'PortoAlegre', 'RestingaSêca', 'RioGrande', 'RioPardo', 
 'SantaCruzdoSul', 'SantaMaria', 'SãoBorja', 'SãoJosédoNorte', 'SãoSepé', 'SãoVicentedoSul', 'Tapes', 'Taquari', 'Triunfo', 'ValedoSol', 
 'ValeVerde', 'VenâncioAires', 'VeraCruz']
aff_urban = ['PortoAlegre', 'Canoas', 'Charqueadas', 'NovaSantaRita', 'CachoeiradoSul']
import bokeh
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import Span
from bokeh.layouts import column
from bokeh.models import Legend, TabPanel, Tabs, HoverTool
from bokeh.core.validation.warnings import EMPTY_LAYOUT, MISSING_RENDERERS
from bokeh.io import reset_output
reset_output()

bokeh.core.validation.silence(EMPTY_LAYOUT, True)
bokeh.core.validation.silence(MISSING_RENDERERS, True)

color_palette = [
    "#4E79A7",  # Blue
    "#F28E2B",  # Orange
    "#E15759",  # Red
    "#76B7B2",  # Teal
    "#59A14F",  # Green
    "#EDC948",  # Yellow
    "#B07AA1",  # Purple
    "#FF9DA7",  # Pink
    "#9C755F",  # Brown
    "#BAB0AC",  # Gray
    "#7C7C7C",  # Dark gray
    "#6B4C9A",  # Violet
    "#D55E00",  # Orange-red
    "#CC61B0",  # Magenta
    "#0072B2",  # Bright blue
    "#329262",  # Peacock green
    "#9E5B5A",  # Brick red
    "#636363",  # Medium gray
    "#CD9C00",  # Gold
    "#5D69B1",  # Medium blue
]
output_notebook()
p = figure(x_axis_type="datetime", width=800, height=400, toolbar_location="above",
          x_axis_label= 'Date', y_axis_label = 'Number of Tests')
p.add_layout(Legend(), "right")
measure = 'ntests'
for id, adm2 in enumerate(aff_urban):
    df = tests[tests["NAME_2"] == adm2][["date", measure]].reset_index(drop=True)
    p.line(df["date"], df[measure], line_width=2, line_color=color_palette[id], legend_label=adm2, )
p.legend.click_policy = "hide"

title_fig = figure(title='Number of tests by Administrative level 2',
    toolbar_location=None, width=800, height=40)
title_fig.title.align = "left"
title_fig.title.text_font_size = "20pt"
title_fig.border_fill_alpha = 0
title_fig.outline_line_width = 0
layout = column(title_fig, p)
p.renderers.extend(
    [Span(location=datetime.datetime(2024, 4, 29), dimension="height", line_color="#7C7C7C", line_width=2, line_dash=(4, 4))]
)
hover = HoverTool(tooltips=[
    ("Date", "@x{%F}"),  # Format the date as YYYY-MM-DD
    ("Value", "@y")
], formatters={
    '@x': 'datetime'  # Use datetime formatter for x values
})
p.add_tools(hover)
show(layout)
Loading BokehJS ...

In the following plot, we are going to calculate the percentage change in number of test with respect to a baseline. In this case, the baseline is from 2024-04-02 to 2024-04-28.

baseline = tests[tests['date']<'2024-04-29']
baseline = baseline.groupby(["NAME_2"]).agg({"ntests": ["mean", "std"]})
baseline.columns = baseline.columns.map(".".join)
tests_change = tests.merge(baseline, on=["NAME_2"], how="left")
tests_change["n_baseline"] = tests_change["ntests.mean"]
tests_change["n_difference"] = (tests_change["ntests"] - tests_change["n_baseline"])
tests_change["percent_change"] = 100 * (tests_change["ntests"] / (tests_change["n_baseline"]) - 1)
output_notebook()
p = figure(x_axis_type="datetime", width=800, height=400, toolbar_location="above",
          x_axis_label= 'Date', y_axis_label = 'Percentage Change [%]')
p.add_layout(Legend(), "right")
measure = 'percent_change'
for id, adm2 in enumerate(aff_urban):
    df = tests_change[(tests_change["NAME_2"] == adm2)&(tests_change['date']>'2024-04-29')][["date", measure]].reset_index(drop=True)
    p.line(df["date"], df[measure], line_width=2, line_color=color_palette[id], legend_label=adm2, )
p.legend.click_policy = "hide"

title_fig = figure(title='Percentage change of tests compared to baseline (2024-04-01 to 2024-04-28)',
    toolbar_location=None, width=800, height=40)
title_fig.title.align = "left"
title_fig.title.text_font_size = "15pt"
title_fig.border_fill_alpha = 0
title_fig.outline_line_width = 0
layout = column(title_fig, p)

hover = HoverTool(tooltips=[
    ("Date", "@x{%F}"),  # Format the date as YYYY-MM-DD
    ("Value", "@y")
], formatters={
    '@x': 'datetime'  # Use datetime formatter for x values
})
p.add_tools(hover)
show(layout)
Loading BokehJS ...