Accessibility Metrics#

Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: August 7, 2023

Background#

Given public GPS coordinates for a survey, produced by the anonymization process described in the Spatial-Anonymization package, this notebook generates metrics measuring household accessibility to various amenities using geospatial data. The survey is the National Panel Survey (NPS 2014-2015) in Tanzania. The following accessibility metrics/variables are generated.

  1. Major towns and/or cities

  2. Border posts

  3. District headquarters

  4. Major agricultural markets

  5. Main road

For each of the categories above, we produce two metrics:

  1. Euclidean distance

  2. Travel time

Please see the Data-Cover-Page for details about the definition of the metrics and other important information.

To Do#

For further development, more variables such as access to schools, health facilities, and others can be added.

Table of Contents#

The following steps are followed to generate the data.

  1. Python setup

  2. Define input data

  3. Define global variables

  4. Prepare roads data

  5. Generate Euclidean distance metrics

  6. Perform quality checks for Euclidean distance metrics

  7. Generate travel time metrics

  8. Perform quality checks for travel time metrics

  9. Save output files

Python setup#

  • Back to Table of Contents

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import os, sys
from datetime import datetime
from pathlib import Path
import shutil
import json

import skimage.graph as graph
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import numpy as np

import geopandas as gpd
from shapely.geometry import shape, Point
from rasterstats import point_query
import rasterio
from rasterio.mask import mask
from rasterio import features
from pyproj.aoi import AreaOfInterest
from pyproj.database import query_utm_crs_info

spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

  • Back to Table of Contents

Define paths to input data files as global variables using pathlib.Path.

# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_CONS_EXP = DIR_DATA.joinpath('surveys', 'consumption_experiment')
DIR_OUTPUTS = DIR_CONS_EXP.joinpath('geovars', 'puf')
DIR_GOST_DATA = Path.cwd().parents[2].joinpath('GOST-UTILS', 'data')

# =============================
# SURVEY COORDINATES 
# ============================
# Public/displaced GPS coordinates files
FILE_HH_PUB_COORDS = DIR_OUTPUTS.joinpath('cons-exp-pub-coords.csv')

# =============================
# ADMIN BOUNDARIES
# =============================
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')

# =============================
# DISTANCE MEASUREMENT DATA
# ============================
FILE_TOWNS = DIR_SPATIAL_DATA.joinpath("towns.csv")
FILE_BORDER_POSTS = DIR_SPATIAL_DATA.joinpath("border-posts.csv")
FILE_DIST_CAPITALS = DIR_SPATIAL_DATA.joinpath("district-capitals.csv")
FILE_MARKETS = DIR_SPATIAL_DATA.joinpath("market-centers.csv")
FILE_MAIN_ROADS = DIR_SPATIAL_DATA.joinpath("tza_trunkrd/tanzania_trunkrd.shp")
FILE_ROADS = None

# =============================
# TRAVEL TIME LAYERS
# ============================
FILE_FRICTION_SURFACE = DIR_GOST_DATA.joinpath('2020_motorized_friction_surface.geotiff')
FILE_GLOBAL_POPULATION = DIR_GOST_DATA.joinpath("ppp_2020_1km_Aggregated.tif")

Setup processing global variables#

  • Back to Table of Contents

Set values we will use often, such as the names of the columns containing latitude and longitude, as global variables.

# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'cons-exp2022'

# ==============
# ROADS
# ==============
# This assumes we are working with OSM roads file
# Otherwise, these can be changed 
MAIN_ROADS_TAGS = ['motorway', 'motorway_link', 'trunk', 'trunk_link', 'primary',
                       'primary_link']
ROAD_CLASS_COL = 'fclass'
    
INCLUDE_TRAVEL_TIME = False
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub' 
LON = 'lon_pub'
HH_ID = 'hhid'
CLUST_ID = 'clusterid'
TEMP_UNIQ_ID = 'clusterid'

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'dist_market':'Household distance to nearest major market (FEWSNET key market centers)' ,
       'dist_border':'Household distance to nearest border post on main road',
        'dist_popcenter': 'Household distance to nearest town of > 20,000 pop',
        'dist_admhq':'Household distance to the headquarters of the district of residence',
        'dist_road': 'Household distance to nearest main road',
       'travel_time_market': 'Travel time (min) from household to nearest major market',
    'travel_time_border':'Travel time (min) from household to nearest border post', 
    'travel_time_popcenter': 'Travel time (min) from household to nearest town of > 20,000 pop',
       'travel_time_admhq': 'Travel time (min) from household to headquarters of the district of residence' ,
        'travel_time_road': 'Travel time (min)  from household to nearest main road'}

for name, label in VAR_LABELS.items():
    assert len(label) < 80, 'LABEL TOO LONG'

Prepare Roads Data#

  • Back to Table of Contents

For roads, we are interested in the distance and travel time to the country’s main roads. We use a definition of main roads based on the OSM tags provided by Mapzen. However, main roads can also be defined in a country-specific fashion. The purpose of this section is to prepare the roads data so that we end up with just the required main roads.

def prepare_roads_data(fpath_roads=None, fpath_main_roads=None):
    """
    Filter roads data to keep only main roads by doing the following:
    1. Load main roads shapefile if given
    2. Filter roads file to keep main roads using conditions specified in MAIN_ROADS_TAGS variable
    3. If no roads data is available or just need latest data, download latest data from OSM and filter it.
    """
    if fpath_main_roads:
        print('No need to process anything, main roads already exist')
        gdf_main_roads = gpd.read_file(fpath_main_roads)
        return gdf_main_roads
    elif fpath_roads:
        # Filter main roads from file
        gdf_roads = gpd.read_file(fpath_roads)
        gdf_main_roads = gdf_roads.query('{} in @MAIN_ROADS_TAGS'.format(ROAD_CLASS_COL))
        return gdf_main_roads
    else:
        # download latest shapefile from Geofabrik
        osm_data_dir = utils.download_osm_shapefiles(country=COUNTRY_NAME, region=GEOFABRICK_REGION, 
                                                     outdir=DIR_SPATIAL_DATA)
        # Re-run the function to filter for main roads
        roads_fpath = osm_data_dir.joinpath('gis_osm_roads_free_1.shp')
        return prepare_roads_data(fpath_roads=roads_fpath)
gdf_main_roads = prepare_roads_data(fpath_main_roads=FILE_MAIN_ROADS)
No need to process anything, main roads already exist

Euclidean Distance Metrics#

  • Back to Table of Contents

Calculate the Euclidean distance (in km) from household (EA centroid) coordinates to places of interest.
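The helper utils.distance_to_points is not shown in this notebook; assuming it computes great-circle distances, the nearest-point calculation can be sketched with a plain haversine implementation (the function names here are illustrative, not the actual helper):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(pt_a, pt_b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(radians, pt_a)
    lat2, lon2 = map(radians, pt_b)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def distance_to_nearest(points, origin):
    """Distance in km from origin to the nearest of a list of (lat, lon) points."""
    return min(haversine_km(origin, pt) for pt in points)
```

For each household row, df_coords[var] would then hold the equivalent of distance_to_nearest(lst_pts, (row[LAT], row[LON])).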

def generate_distance_variables(df_coords, dist2pts_inputs_dict, gdf_roads):
    """
    Generates distance accessibility metrics based on Euclidean distance to places and roads.
    Parameters
    ----------
    df_coords(pd.DataFrame) - Public coordinates to use to measure distance
    dist2pts_inputs_dict(dict) - Dictionary containing distance variable as key and full path to CSV of locations. 
    For example, {'dist_market': FILE_MARKETS} has distance to market and FILE_MARKETS is path 
    to CSV file with lat/lon of markets.
    gdf_roads (gpd.GeoDataframe) - GeoPandas Dataframe of main roads.
    Returns
    -------
    Saves a CSV file with the following columns: household_id, public coordinates(lat/lon),
    'dist_market', 'dist_border', 'dist_popcenter', 'dist_admhq' and 'dist_road'

    """
    # ============================
    # DISTANCE TO PLACES
    # ============================
    for var,file in dist2pts_inputs_dict.items():
        df_var = pd.read_csv(file)
        df_var.dropna(subset=['lat','lon'], inplace=True)
        df_var[['lat','lon']] = df_var[['lat','lon']].astype(float)
        lst_pts = [(i[0],i[1]) for i in df_var[['lat', 'lon']].values]
        df_coords[var] = df_coords.apply(lambda x: utils.distance_to_points(lst_pts, (x[LAT],x[LON]), output='nearest'), axis=1)
    
    # ============================
    # DISTANCE TO ROADS
    # ============================
    gdf_coords = gpd.GeoDataFrame(df_coords, geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]), crs=4326)
    
    utm_zones = utils.utm_zones_from_geodataframe(gdf=gdf_roads)
    # Use the first zone; results are effectively the same for any of them
    proj_crs = utm_zones[0]
    try:
        gdf_roads_proj = gdf_roads.to_crs(int(proj_crs))
    except Exception as err:
        # Re-raise instead of continuing with an undefined projection
        raise ValueError('Please ensure the roads layer has a CRS') from err

    gdf_coords_proj = gdf_coords.to_crs(int(proj_crs))
    
    gdf_coords_proj['dist_road'] = gdf_coords_proj.geometry.apply(lambda x: min(list(gdf_roads_proj.distance(x)/1000)))
    df_roads = gdf_coords_proj[[TEMP_UNIQ_ID, 'dist_road']]
    
    # ======================================
    # COMBINE THE VARS AND SAVE TO CSV FILE
    # =====================================
    # add dist_road to the main dataframe
    df = df_coords.merge(df_roads, on=TEMP_UNIQ_ID, how='inner', indicator=True)
    assert df._merge.value_counts()['both'] == len(df), 'Dataframe length should match'
    df.drop(columns = ['_merge'], inplace=True)
    
    # output variables
    out_vars = ['dist_market', 'dist_border', 'dist_popcenter', 'dist_admhq', 'dist_road']

    return df, out_vars
# Load survey file with public coordinates and keep only 
# unique public coordinates
df_pub_coords = pd.read_csv(FILE_HH_PUB_COORDS)
df_pub_coords  = df_pub_coords.query('make_public == 1')
df_pub_coords2 = df_pub_coords.drop_duplicates(subset=[TEMP_UNIQ_ID])
df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON]]


# dict object with distance variable and corresponding file with lat/lon
dist_to_pts_vars = {'dist_market': FILE_MARKETS, 'dist_border': FILE_BORDER_POSTS,
       'dist_popcenter': FILE_TOWNS, 'dist_admhq': FILE_DIST_CAPITALS}

df_dist, out_vars = generate_distance_variables(df_coords=df_coords, dist2pts_inputs_dict=dist_to_pts_vars,
                            gdf_roads=gdf_main_roads)

Quality checks for Euclidean distance#

  • Back to Table of Contents

Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.

# Only use describe for continuous variables
print('-'*50)
print('       SUMMARY STATISTICS FOR DISTANCE VARS')
print('-'*50)
stats = df_dist[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR DISTANCE VARS
--------------------------------------------------
                  mean   min     50%     75%     max
dist_market     68.609 0.951  68.510 100.048 215.075
dist_border    174.173 7.907 162.815 253.774 401.870
dist_popcenter  44.312 0.462  37.895  68.215 183.349
dist_admhq      27.221 0.463  23.011  37.713 102.986
dist_road       17.329 0.115   8.108  27.566  89.500
----------------------------------------------------------------------
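Beyond manual inspection, simple automated range checks can flag implausible values. A sketch with illustrative thresholds (Tanzania spans roughly 1,400 km at its widest, so no within-country distance should exceed that; the bound is an assumption, not from the source data):

```python
import pandas as pd

# Illustrative bound in km: distances must be non-negative and within a
# generous country-scale maximum. The threshold is an assumption.
MAX_PLAUSIBLE_KM = 1400

def check_distance_ranges(df, cols, max_km=MAX_PLAUSIBLE_KM):
    """Return the names of columns whose values fall outside [0, max_km]."""
    bad = []
    for col in cols:
        if (df[col] < 0).any() or (df[col] > max_km).any():
            bad.append(col)
    return bad
```

check_distance_ranges(df_dist, out_vars) returning an empty list would mean all values are within the plausible range.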

Travel Time Metrics#

  • Back to Table of Contents
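The helper utils.generate_travel_time_raster used below is assumed to accumulate travel cost outward from the destination points over the friction surface (the skimage.graph import above suggests a least-cost-path approach). The core idea can be sketched with a small Dijkstra over a 4-connected grid, where each cell holds a per-cell traversal cost (e.g. minutes per cell):

```python
import heapq

def accumulated_travel_time(friction, destinations):
    """Minimum cumulative cost from every cell to its nearest destination.

    friction     -- 2-D list of per-cell traversal costs
    destinations -- list of (row, col) destination cells, which cost 0 to reach
    """
    rows, cols = len(friction), len(friction[0])
    cost = [[float('inf')] * cols for _ in range(rows)]
    heap = []
    for r, c in destinations:
        cost[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > cost[r][c]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Average the two cells' costs for the step, as geometric
                # least-cost implementations typically do
                step = (friction[r][c] + friction[nr][nc]) / 2
                if d + step < cost[nr][nc]:
                    cost[nr][nc] = d + step
                    heapq.heappush(heap, (d + step, nr, nc))
    return cost
```

point_query then simply reads this accumulated-cost raster at each household location.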

def generate_travel_time_vars(travel_time_vars, df_coords, df_road):
    """
    Helper function to generate travel time to places defined in places_dict and main roads.
    """
    # ================
    # SET UP
    # ================
    # output directory
    out_folder = DIR_DATA.joinpath('tmp', 'travel_time')
    out_folder.mkdir(parents=True, exist_ok=True)
    if 'geometry' in df_coords.columns:
        gdf_coords = gpd.GeoDataFrame(df_coords, crs=4326)
    else:
        gdf_coords = gpd.GeoDataFrame(df_coords, 
                                      geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]), crs=4326)
        
    # Admin-0 (country bounds)
    if FILE_ADM0:
        gdf_adm0 = gpd.read_file(FILE_ADM0)
    else:
        gdf_adm0 = utils.admin_bounds_from_geoboundaries(co_iso='TZA', admin_level=0)
    

    # global input friction surface and clip it to ISO
    global_fric_surface = rasterio.open(str(FILE_FRICTION_SURFACE))
    iso_travel_surface = Path(out_folder).joinpath("TRAVEL_SURFACE.tif")
    if not iso_travel_surface.exists():
        utils.clip_raster(global_fric_surface, gdf_adm0, iso_travel_surface)

    # =========================================
    # GENERATE VARIABLES EXCEPT FOR MAIN ROADS
    # =========================================
    for var, file in travel_time_vars.items():
        # Load CSV as GeoDataframe
        gdf = utils.load_csv_into_geopandas(csv_with_coords=file, lon='lon', lat='lat')
        out_fname = 'travel_raster_{}.tif'.format(var)
        travel_raster = utils.generate_travel_time_raster(friction_raster=iso_travel_surface,dest=gdf, outdir=out_folder, 
                              out_raster=out_fname)
        
        gdf_coords[var] = point_query(gdf_coords, travel_raster)
    
    # ===========================================
    # GENERATE TRAVEL TIME TO NEAREST MAIN ROAD
    # ==========================================
    # First, using the distance to the nearest road, create
    # points on those roads perpendicular to the origin
    # coordinates
    df_road['tmp_pt'] = df_road.apply(lambda x: utils.terminal_coordinates((x[LON],x[LAT]),
                                                                                         distance=x['dist_road'], 
                                                                                     bearing=90), axis=1)
    df_road['geometry'] = df_road.tmp_pt.apply(lambda x: Point(x[0], x[1]))
    gdf_road_dest = gpd.GeoDataFrame(df_road,crs=4326)[['geometry']]
    
    # Now generate travel time raster
    out_fname = 'travel_raster_roads.tif'
    roads_travel_raster = utils.generate_travel_time_raster(friction_raster=iso_travel_surface, dest=gdf_road_dest, 
                                                            outdir=out_folder,out_raster=out_fname)
    gdf_coords['travel_time_road'] = point_query(gdf_coords, roads_travel_raster)
    
    # output vars
    out_vars = ['travel_time_market', 'travel_time_border', 'travel_time_popcenter',
       'travel_time_admhq', 'travel_time_road']
    
    # ============
    # CLEAN-UP
    # ============
    # Delete temp folder
    shutil.rmtree(out_folder)
    
    return gdf_coords, out_vars
# travel time variables except for 'main road'
trav2pts_vars = {'travel_time_market': FILE_MARKETS, 'travel_time_border': FILE_BORDER_POSTS,
       'travel_time_popcenter': FILE_TOWNS, 'travel_time_admhq': FILE_DIST_CAPITALS}


# distance to nearest road, used to compute travel time to nearest main road
df_road = df_dist[[HH_ID, LON, LAT, 'dist_road']]

df_travel, out_vars = generate_travel_time_vars(trav2pts_vars, df_coords, df_road)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 7
      2 trav2pts_vars = {'travel_time_market': FILE_MARKETS, 'travel_time_border': FILE_BORDER_POSTS,
      3        'travel_time_popcenter': FILE_TOWNS, 'travel_time_admhq': FILE_DIST_CAPITALS}
      6 # distance to nearest road to used to compute travel time to nearest main road
----> 7 df_road = df_dist[[HH_ID, LON, LAT, 'dist_road']]
      9 df_travel, out_vars = generate_travel_time_vars(trav2pts_vars, df_coords, df_road)

File ~/anaconda3/envs/spanon/lib/python3.11/site-packages/pandas/core/frame.py:3813, in DataFrame.__getitem__(self, key)
   3811     if is_iterator(key):
   3812         key = list(key)
-> 3813     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3815 # take() does not accept boolean indexers
   3816 if getattr(indexer, "dtype", None) == bool:

File ~/anaconda3/envs/spanon/lib/python3.11/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
   6067 else:
   6068     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
   6072 keyarr = self.take(indexer)
   6073 if isinstance(key, Index):
   6074     # GH 42790 - Preserve name from an Index

File ~/anaconda3/envs/spanon/lib/python3.11/site-packages/pandas/core/indexes/base.py:6133, in Index._raise_if_missing(self, key, indexer, axis_name)
   6130     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6133 raise KeyError(f"{not_found} not in index")

KeyError: "['hhid'] not in index"

Quality checks for travel time#

  • Back to Table of Contents

Manually inspect the results.

# Only use describe for continuous variables
print('-'*50)
print('       SUMMARY STATISTICS FOR TRAVEL TIME  VARS')
print('-'*50)
stats = df_travel[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR TRAVEL TIME  VARS
--------------------------------------------------
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[11], line 5
      3 print('       SUMMARY STATISTICS FOR TRAVEL TIME  VARS')
      4 print('-'*50)
----> 5 stats = df_travel[out_vars].describe().T
      6 print(stats[['mean', 'min', '50%', '75%', 'max']])
      7 print('-'*70)

NameError: name 'df_travel' is not defined

Save output files#

  • Back to Table of Contents

This notebook saves the generated variables to a CSV file, along with a companion CSV file of variable labels. The variables file contains the following columns:

‘dist_market’, ‘dist_border’, ‘dist_popcenter’, ‘dist_admhq’, ‘dist_road’, ‘travel_time_market’, ‘travel_time_border’, ‘travel_time_popcenter’, ‘travel_time_admhq’, ‘travel_time_road’

These files have been saved to this OneDrive folder.

def save_outputs(df_dist, df_coords, outdir, var_labels, df_travel=None):
    """
    Save the following output files:
    1. CSV
    2. Shapefile 
    """
    # Merge the two dataframes
    
    if INCLUDE_TRAVEL_TIME:
        df_travel = df_travel[[HH_ID, 'travel_time_market',
       'travel_time_border', 'travel_time_popcenter', 'travel_time_admhq',
       'travel_time_road']]
        df = df_dist.merge(df_travel, on=HH_ID, how='inner')
        assert len(df) == len(df_coords)
    else:
        df = df_dist[[TEMP_UNIQ_ID] + list(df_dist.columns.difference([TEMP_UNIQ_ID]))]
        df.set_index(TEMP_UNIQ_ID, inplace=True)
    
    
    # save to CSV file
    csv_fname = 'access-distance-{}.csv'.format(FILENAME_SUFFIX)
    out_csv_data = Path(outdir).joinpath(csv_fname)
    df_csv = df.drop(columns=['geometry', LAT, LON])
    df_csv.to_csv(out_csv_data)
    
    # Save variable labels
    csv_fname = 'access-distance-{}-labels.csv'.format(FILENAME_SUFFIX)
    out_labels = Path(outdir).joinpath(csv_fname)
    if not INCLUDE_TRAVEL_TIME:
        var_labels = {k:v for k, v in var_labels.items() if 'travel' not in k}
    
    df_var_labels = pd.DataFrame(list(var_labels.values()), index=list(var_labels.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv_data.parts[6:]))
    
    return out_csv_data, out_labels
out_csv, out_labels = save_outputs(df_dist, df_coords, outdir=DIR_OUTPUTS, var_labels=VAR_LABELS)
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/consumption_experiment/geovars/puf/access-distance-cons-exp2022.csv
df = pd.read_csv(out_csv)

# Only use describe for continuous variables
print('-'*50)
print('       SUMMARY STATISTICS FOR SAVED VARS')
print('-'*50)
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR SAVED VARS
--------------------------------------------------
                  mean   min     50%     75%     max
dist_market     68.609 0.951  68.510 100.048 215.075
dist_border    174.173 7.907 162.815 253.774 401.870
dist_popcenter  44.312 0.462  37.895  68.215 183.349
dist_admhq      27.221 0.463  23.011  37.713 102.986
dist_road       17.329 0.115   8.108  27.566  89.500
----------------------------------------------------------------------