Accessibility Metrics#
Author: Dunstan Matekenya
Affiliation: DECAT, The World Bank Group
Date: August 1, 2023
Background#
Given public GPS coordinates for a survey, which come from the anonymization process described in the
Spatial-Anonymization
package, this notebook generates metrics measuring survey households' accessibility to various amenities using geospatial data. This is for the National Panel Survey (NPS 2014-2015) in Tanzania.
The following accessibility metrics/variables are generated.
Major towns and/or cities
Border posts
District headquarters
Major agricultural markets
Main road
For each of the categories above, we produce two metrics:
Euclidean distance
Travel time
Please see Data-Cover-Page for details about definition of the metrics and other important information.
To Do#
For further development, more variables such as access to schools, health facilities and more can be added.
Table of Contents#
The following steps are followed to generate the data.
Python setup
Define input data
Define global variables
Prepare roads data
Generate euclidean distance metrics
Perform quality checks for euclidean distance metrics
Generate travel time metrics
Perform quality checks for travel time metrics
Save output files
Python setup#
Back to Table of Contents
The following packages are required for the processing in this notebook.
import warnings
warnings.simplefilter(action='ignore')
import os, sys
from datetime import datetime
from pathlib import Path
import shutil
import json
import skimage.graph as graph
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import numpy as np
import geopandas as gpd
from shapely.geometry import shape, Point
from rasterstats import point_query
import rasterio
from rasterio.mask import mask
from rasterio import features
from pyproj.aoi import AreaOfInterest
from pyproj.database import query_utm_crs_info
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement
Setup input files#
Back to Table of Contents
Define paths to input data files as global variables using Path
package.
# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')
DIR_GOST_DATA = Path.cwd().parents[2].joinpath('GOST-UTILS', 'data')
# =============================
# SURVEY COORDINATES
# ============================
# Public/displaced GPS coordinates files
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')
# =============================
# ADMIN BOUNDARIES
# =============================
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
'geoBoundaries-TZA-ADM0.shp')
# =============================
# DISTANCE MEASUREMENT DATA
# ============================
FILE_TOWNS = DIR_SPATIAL_DATA.joinpath("towns.csv")
FILE_BORDER_POSTS = DIR_SPATIAL_DATA.joinpath("border-posts.csv")
FILE_DIST_CAPITALS = DIR_SPATIAL_DATA.joinpath("district-capitals.csv")
FILE_MARKETS = DIR_SPATIAL_DATA.joinpath("market-centers.csv")
FILE_MAIN_ROADS = DIR_SPATIAL_DATA.joinpath("tza_trunkrd/tanzania_trunkrd.shp")
FILE_ROADS = None
# =============================
# TRAVEL TIME LAYERS
# ============================
FILE_FRICTION_SURFACE = DIR_GOST_DATA.joinpath('2020_motorized_friction_surface.geotiff')
FILE_GLOBAL_POPULATION = DIR_GOST_DATA.joinpath("ppp_2020_1km_Aggregated.tif")
Setup processing global variables#
Back to Table of Contents
For variables we will use often such as column name containing latitude and longitude, set them as global variables.
# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'
# ==============
# ROADS
# ==============
# This assumes we are working with OSM roads file
# Otherwise, these can be changed
MAIN_ROADS_TAGS = ['motorway', 'motorway_link', 'trunk', 'trunk_link', 'primary',
'primary_link']
ROAD_CLASS_COL = 'fclass'
INCLUDE_TRAVEL_TIME = False
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5'
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'dist_market':'Household distance to nearest major market (FEWSNET key market centers)' ,
'dist_border':'Household distance to nearest border post on main road',
'dist_popcenter': 'Household distance to nearest town of > 20,000 pop',
'dist_admhq':'Household distance to the headquarters of the district of residence',
'dist_road': 'Household distance to nearest main road',
'travel_time_market': 'Travel time (min) from household to nearest major market',
'travel_time_border':'Travel time (min) from household to nearest border post',
'travel_time_popcenter': 'Travel time (min) from household to nearest town of > 20,000 pop',
'travel_time_admhq': 'Travel time (min) from household to headquarters of the district of residence' ,
'travel_time_road': 'Travel time (min) from household to nearest main road'}
# Keep labels under 80 characters (Stata's variable-label limit)
for name, label in VAR_LABELS.items():
assert len(label) < 80, 'LABEL TOO LONG: {}'.format(name)
Prepare Roads Data#
Back to Table of Contents
For roads, we are interested in distance and travel time to the country’s main roads. We use the definition of main roads based on OSM tags provided by Mapzen. However, main roads can also be defined in a country-specific fashion. The purpose of this section is to prepare the roads data so that we end up with just the required main roads.
def prepare_roads_data(fpath_roads=None, fpath_main_roads=None):
"""
Filter roads data to keep only main roads by doing the following:
1. Load main roads shapefile if given
2. Filter roads file to keep main roads using conditions specified in MAIN_ROADS_TAGS variable
3. If no roads data is available, or the latest data is needed, download the latest data from OSM and filter it.
"""
if fpath_main_roads:
print('No need to process anything, main roads already exist')
gdf_main_roads = gpd.read_file(fpath_main_roads)
return gdf_main_roads
elif fpath_roads:
# Filter main roads from file
gdf_roads = gpd.read_file(fpath_roads)
gdf_main_roads = gdf_roads.query('{} in @MAIN_ROADS_TAGS'.format(ROAD_CLASS_COL))
return gdf_main_roads
else:
# download latest shapefile from Geofabrik
osm_data_dir = utils.download_osm_shapefiles(country=COUNTRY_NAME, region=GEOFABRICK_REGION,
outdir=DIR_SPATIAL_DATA)
# Re-run the function to filter for main roads
roads_fpath = osm_data_dir.joinpath('gis_osm_roads_free_1.shp')
return prepare_roads_data(fpath_roads=roads_fpath)
gdf_main_roads = prepare_roads_data(fpath_main_roads=FILE_MAIN_ROADS)
No need to process anything, main roads already exist
Euclidean Distance Metrics#
Back to Table of Contents
Calculate the Euclidean distance (in km) from household (EA centroid) coordinates to places of interest.
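The per-point computation below is delegated to `utils.distance_to_points`. As a hedged sketch of what such a helper typically does (an assumption: it computes great-circle distances between (lat, lon) pairs and, for `output='nearest'`, returns the minimum), a haversine implementation looks like:

```python
import math

def haversine_km(pt_a, pt_b):
    # Great-circle distance in km between two (lat, lon) pairs on a
    # spherical Earth (mean radius 6371 km)
    lat1, lon1 = map(math.radians, pt_a)
    lat2, lon2 = map(math.radians, pt_b)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def nearest_distance_km(points, origin):
    # Mirrors the output='nearest' mode: distance to the closest point
    return min(haversine_km(origin, p) for p in points)

# Dar es Salaam to Dodoma: roughly 390 km as the crow flies
d = nearest_distance_km([(-6.163, 35.752)], (-6.792, 39.208))
```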
def generate_distance_variables(df_coords, dist2pts_inputs_dict, gdf_roads):
"""
Generates distance accessibility metrics based on Euclidean distance to places and roads.
Parameters
----------
df_coords(pd.DataFrame) - Public coordinates to use to measure distance
dist2pts_inputs_dict(dict) - Dictionary containing distance variable as key and full path to CSV of locations.
For example, {'dist_market': FILE_MARKETS} has distance to market and FILE_MARKETS is path
to CSV file with lat/lon of markets.
gdf_roads (gpd.GeoDataframe) - GeoPandas Dataframe of main roads.
Returns
-------
Saves a CSV file with the following columns: household_id, public coordinates(lat/lon),
'dist_market', 'dist_border', 'dist_popcenter', 'dist_admhq' and 'dist_road'
"""
# ============================
# DISTANCE TO PLACES
# ============================
for var,file in dist2pts_inputs_dict.items():
df_var = pd.read_csv(file)
df_var.dropna(subset=['lat','lon'], inplace=True)
df_var[['lat','lon']] = df_var[['lat','lon']].astype(float)
lst_pts = [(i[0],i[1]) for i in df_var[['lat', 'lon']].values]
df_coords[var] = df_coords.apply(lambda x: utils.distance_to_points(lst_pts, (x[LAT],x[LON]), output='nearest'), axis=1)
# ============================
# DISTANCE TO ROADS
# ============================
gdf_coords = gpd.GeoDataFrame(df_coords, geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]), crs=4326)
utm_zones = utils.utm_zones_from_geodataframe(gdf=gdf_roads)
# Use the first UTM zone returned; for nearby zones the
# resulting distance differences are negligible
proj_crs = utm_zones[0]
try:
gdf_roads_proj = gdf_roads.to_crs(int(proj_crs))
except Exception as err:
raise ValueError('Please ensure the roads layer has a CRS defined') from err
gdf_coords_proj = gdf_coords.to_crs(int(proj_crs))
gdf_coords_proj['dist_road'] = gdf_coords_proj.geometry.apply(lambda x: min(list(gdf_roads_proj.distance(x)/1000)))
df_roads = gdf_coords_proj[[TEMP_UNIQ_ID, 'dist_road']]
# ======================================
# COMBINE THE VARS AND SAVE TO CSV FILE
# =====================================
# add dist_road to the main dataframe
df = df_coords.merge(df_roads, on=TEMP_UNIQ_ID, how='inner', indicator=True)
assert df._merge.value_counts()['both'] == len(df), 'Dataframe length should match'
df.drop(columns = ['_merge'], inplace=True)
# output variables
out_vars = ['dist_market', 'dist_border', 'dist_popcenter', 'dist_admhq', 'dist_road']
return df, out_vars
# Load survey file with public coordinates and keep only
# unique public coordinates
df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
df_pub_coords = df_pub_coords.query('make_public == 1')
df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
df_pub_coords2 = df_pub_coords.drop_duplicates(subset=[TEMP_UNIQ_ID])
df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON]]
# dict object with distance variable and corresponding file with lat/lon
dist_to_pts_vars = {'dist_market': FILE_MARKETS, 'dist_border': FILE_BORDER_POSTS,
'dist_popcenter': FILE_TOWNS, 'dist_admhq': FILE_DIST_CAPITALS}
df_dist, out_vars = generate_distance_variables(df_coords=df_coords, dist2pts_inputs_dict=dist_to_pts_vars,
gdf_roads=gdf_main_roads)
Quality checks for euclidean distance#
Back to Table of Contents
Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.
# Only use describe for continuous variables
print('-'*50)
print(' SUMMARY STATISTICS FOR DISTANCE VARS')
print('-'*50)
stats = df_dist[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
SUMMARY STATISTICS FOR DISTANCE VARS
--------------------------------------------------
mean min 50% 75% max
dist_market 64.622 0.485 63.545 95.919 223.039
dist_border 187.005 0.918 186.684 254.558 431.687
dist_popcenter 33.729 0.139 21.679 51.962 179.360
dist_admhq 21.197 0.158 13.191 31.805 114.477
dist_road 13.861 0.006 3.872 18.583 125.164
----------------------------------------------------------------------
Travel Time Metrics#
Back to Table of Contents
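The heavy lifting below is delegated to `utils.generate_travel_time_raster`; presumably it accumulates travel cost over the clipped friction surface with a least-cost-path algorithm such as the one in `skimage.graph` (imported at the top of this notebook). A toy sketch of that core step, on a purely illustrative 3x3 friction grid:

```python
import numpy as np
import skimage.graph as graph

# Hypothetical friction surface: minutes required to cross each cell
# (illustrative values, not the real Malaria Atlas surface)
friction = np.array([
    [1.0, 1.0, 5.0],
    [1.0, 5.0, 5.0],
    [1.0, 1.0, 1.0],
])

# Destination cell (e.g. a market) as (row, col)
destinations = [(0, 2)]

# MCP_Geometric accumulates cost along 8-connected paths, pricing each
# step at the mean of the two cell costs times the step length
# (sqrt(2) for diagonal moves). find_costs returns, for every cell,
# the cumulative cost to the nearest destination.
mcp = graph.MCP_Geometric(friction)
cumulative_cost, _ = mcp.find_costs(starts=destinations)
```

Sampling this cumulative-cost raster at household locations (as `point_query` does below) yields the travel-time variables.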
def generate_travel_time_vars(travel_time_vars, df_coords, df_road):
"""
Helper function to generate travel time to places defined in places_dict and main roads.
"""
# ================
# SET UP
# ================
# output directory
out_folder = DIR_DATA.joinpath('tmp', "travel_time")
out_folder.mkdir(parents=True, exist_ok=True)
if 'geometry' in df_coords.columns:
gdf_coords = gpd.GeoDataFrame(df_coords, crs=4326)
else:
gdf_coords = gpd.GeoDataFrame(df_coords,
geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]), crs=4326)
# Admin-0 (country bounds)
if FILE_ADM0:
gdf_adm0 = gpd.read_file(FILE_ADM0)
else:
gdf_adm0 = utils.admin_bounds_from_geoboundaries(co_iso='TZA', admin_level=0)
# global input friction surface and clip it to ISO
global_fric_surface = rasterio.open(str(FILE_FRICTION_SURFACE))
iso_travel_surface = Path(out_folder).joinpath("TRAVEL_SURFACE.tif")
if not iso_travel_surface.exists():
utils.clip_raster(global_fric_surface, gdf_adm0, iso_travel_surface)
# =========================================
# GENERATE VARIABLES EXCEPT FOR MAIN ROADS
# =========================================
for var, file in travel_time_vars.items():
# Load CSV as GeoDataframe
gdf = utils.load_csv_into_geopandas(csv_with_coords=file, lon='lon', lat='lat')
out_fname = 'travel_raster_{}.tif'.format(var)
travel_raster = utils.generate_travel_time_raster(friction_raster=iso_travel_surface,dest=gdf, outdir=out_folder,
out_raster=out_fname)
gdf_coords[var] = point_query(gdf_coords, travel_raster)
# ===========================================
# GENERATE TRAVEL TIME TO NEAREST MAIN ROAD
# ==========================================
# First, using distance to nearest road, create
# points on those road perpendicular to origin
# coordinates
df_road['tmp_pt'] = df_road.apply(lambda x: utils.terminal_coordinates((x[LON],x[LAT]),
distance=x['dist_road'],
bearing=90), axis=1)
df_road['geometry'] = df_road.tmp_pt.apply(lambda x: Point(x[0], x[1]))
gdf_road_dest = gpd.GeoDataFrame(df_road,crs=4326)[['geometry']]
# Now generate travel time raster
out_fname = 'travel_raster_roads.tif'
roads_travel_raster = utils.generate_travel_time_raster(friction_raster=iso_travel_surface, dest=gdf_road_dest,
outdir=out_folder,out_raster=out_fname)
gdf_coords['travel_time_road'] = point_query(gdf_coords, roads_travel_raster)
# output vars
out_vars = ['travel_time_market', 'travel_time_border', 'travel_time_popcenter',
'travel_time_admhq', 'travel_time_road']
# ============
# CLEAN-UP
# ============
# Delete temp folder
shutil.rmtree(out_folder)
return gdf_coords, out_vars
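The road destinations above are synthesised with `utils.terminal_coordinates`, which solves the direct geodesic problem: given an origin, a distance and a bearing, find the end point. A spherical-Earth sketch of that calculation (the helper's exact convention and ellipsoid model are assumptions here):

```python
import math

def terminal_coordinates(origin, distance, bearing):
    # Destination point given origin (lon, lat), distance in km and
    # bearing in degrees, on a spherical Earth. Illustrative sketch:
    # the real utils.terminal_coordinates may use another convention
    # or an ellipsoidal (geodesic) solution.
    R = 6371.0
    lon1, lat1 = map(math.radians, origin)
    brg = math.radians(bearing)
    ang = distance / R  # angular distance travelled
    lat2 = math.asin(math.sin(lat1) * math.cos(ang)
                     + math.cos(lat1) * math.sin(ang) * math.cos(brg))
    lon2 = lon1 + math.atan2(math.sin(brg) * math.sin(ang) * math.cos(lat1),
                             math.cos(ang) - math.sin(lat1) * math.sin(lat2))
    return (math.degrees(lon2), math.degrees(lat2))

# Roughly one degree due east of the origin
dest = terminal_coordinates((0.0, 0.0), 111.195, 90.0)
```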
# travel time variables except for 'main road'
trav2pts_vars = {'travel_time_market': FILE_MARKETS, 'travel_time_border': FILE_BORDER_POSTS,
'travel_time_popcenter': FILE_TOWNS, 'travel_time_admhq': FILE_DIST_CAPITALS}
# distance to nearest road, used to compute travel time to nearest main road
# NB: df_dist is keyed by the temporary unique-location id, not the household id
df_road = df_dist[[TEMP_UNIQ_ID, LON, LAT, 'dist_road']]
df_travel, out_vars = generate_travel_time_vars(trav2pts_vars, df_coords, df_road)
Quality checks for travel time#
Back to Table of Contents
Manually inspect the results.
# Only use describe for continuous variables
print('-'*50)
print(' SUMMARY STATISTICS FOR TRAVEL TIME VARS')
print('-'*50)
stats = df_travel[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
Save output files#
Back to Table of Contents
This notebook saves two files: a CSV file with the generated variables and a CSV file with their labels. The following columns are generated:
‘dist_market’, ‘dist_border’, ‘dist_popcenter’, ‘dist_admhq’, ‘dist_road’, ‘travel_time_market’, ‘travel_time_border’, ‘travel_time_popcenter’, ‘travel_time_admhq’, ‘travel_time_road’
These files have been saved to this OneDrive folder.
def save_outputs(df_dist, df_coords, outdir, var_labels, df_travel=None):
"""
Save the following output files:
1. CSV file with the generated variables
2. CSV file with the variable labels
"""
# Merge the two dataframes
if INCLUDE_TRAVEL_TIME:
df_travel = df_travel[[HH_ID, 'travel_time_market',
'travel_time_border', 'travel_time_popcenter', 'travel_time_admhq',
'travel_time_road']]
df = df_dist.merge(df_travel, on=HH_ID, how='inner')
assert len(df) == len(df_coords)
else:
df = df_dist[[TEMP_UNIQ_ID] + list(df_dist.columns.difference([TEMP_UNIQ_ID]))]
df.set_index(TEMP_UNIQ_ID, inplace=True)
# save to CSV file
csv_fname = 'access-distance-{}.csv'.format(FILENAME_SUFFIX)
out_csv_data = Path(outdir).joinpath(csv_fname)
df_csv = df.drop(columns=['geometry', LAT, LON])
df_csv.to_csv(out_csv_data)
# Save variable labels
csv_fname = 'access-distance-{}-labels.csv'.format(FILENAME_SUFFIX)
out_labels = Path(outdir).joinpath(csv_fname)
if not INCLUDE_TRAVEL_TIME:
var_labels = {k:v for k, v in var_labels.items() if 'travel' not in k}
df_var_labels = pd.DataFrame(list(var_labels.values()), index=list(var_labels.keys()))
df_var_labels.index.name = 'var_name'
df_var_labels.rename(columns={0:'label'}, inplace=True)
df_var_labels.to_csv(out_labels)
print('-'*30)
print('VARIABLES SAVED TO FILE BELOW')
print('-'*30)
print("/".join(out_csv_data.parts[6:]))
return out_csv_data, out_labels
out_csv, out_labels = save_outputs(df_dist, df_coords, outdir=DIR_OUTPUTS, var_labels=VAR_LABELS)
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w5/geovars/puf/access-distance-npsw5.csv
df = pd.read_csv(out_csv)
# Only use describe for continuous variables
print('-'*50)
print('  SUMMARY STATISTICS FOR SAVED VARS')
print('-'*50)
out_vars = df.columns.difference([TEMP_UNIQ_ID])
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
  SUMMARY STATISTICS FOR SAVED VARS
--------------------------------------------------
mean min 50% 75% max
dist_market 64.622 0.485 63.545 95.919 223.039
dist_border 187.005 0.918 186.684 254.558 431.687
dist_popcenter 33.729 0.139 21.679 51.962 179.360
dist_admhq 21.197 0.158 13.191 31.805 114.477
dist_road 13.861 0.006 3.872 18.583 125.164
----------------------------------------------------------------------