Accessibility Metrics#

Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: May 23, 2023

Background#

Given the public GPS coordinates for a survey, which come from the anonymization process described in the Spatial-Anonymization package, this notebook generates metrics that measure survey households' accessibility to various amenities using geospatial data. The survey is the National Panel Survey (NPS 2014-2015) for Tanzania. Accessibility metrics/variables are generated for the following categories:

  1. Major towns and/or cities

  2. Border posts

  3. District headquarters

  4. Major agricultural markets

  5. Main road

For each of the categories above, we produce two metrics:

  1. Euclidean distance

  2. Travel time

Please see the Data-Cover-Page for details about the definitions of the metrics and other important information.

To Do#

For further development, more variables can be added, such as access to schools and health facilities.

Table of Contents#

The following steps are followed to generate the data.

  1. Python setup

  2. Define input data

  3. Define global variables

  4. Prepare roads data

  5. Generate Euclidean distance metrics

  6. Perform quality checks for Euclidean distance metrics

  7. Generate travel time metrics

  8. Perform quality checks for travel time metrics

  9. Save output files

Python setup#

  • Back to Table of Contents

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import os, sys
from datetime import datetime
from pathlib import Path
import shutil
import json

import skimage.graph as graph
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import numpy as np

import geopandas as gpd
from shapely.geometry import shape, Point
from rasterstats import point_query
import rasterio
from rasterio.mask import mask
from rasterio import features
from pyproj.aoi import AreaOfInterest
from pyproj.database import query_utm_crs_info

spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

  • Back to Table of Contents

Define paths to input data files as global variables using the pathlib Path class.

# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W4 = DIR_DATA.joinpath('surveys', 'NPS_w4')
DIR_OUTPUTS = DIR_NPS_W4.joinpath('geovars', 'puf')
DIR_GOST_DATA = Path.cwd().parents[2].joinpath('GOST-UTILS', 'data')

# =============================
# SURVEY COORDINATES 
# ============================
# Public/displaced GPS coordinates file
FILE_HH_PUB_GEOVARS = DIR_NPS_W4.joinpath('geovars', 'npsy4.ea.offset.dta')


# =============================
# ADMIN BOUNDARIES
# =============================
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')


# =============================
# DISTANCE MEASUREMENT DATA
# ============================
FILE_TOWNS = DIR_SPATIAL_DATA.joinpath("towns.csv")
# FILE_TOWNS =  DIR_DATA.parents[1].joinpath("Spatial-Anonymization/data/ETH/w4_geovars/sptial_data/town20k.xlsx")
FILE_BORDER_POSTS = DIR_SPATIAL_DATA.joinpath("border-posts.csv")
FILE_DIST_CAPITALS = DIR_SPATIAL_DATA.joinpath("district-capitals.csv")
FILE_MARKETS = DIR_SPATIAL_DATA.joinpath("market-centers.csv")
FILE_MAIN_ROADS = DIR_SPATIAL_DATA.joinpath("tza_trunkrd/tanzania_trunkrd.shp")
FILE_ROADS = None

# =============================
# TRAVEL TIME LAYERS
# ============================
FILE_FRICTION_SURFACE = DIR_GOST_DATA.joinpath('2020_motorized_friction_surface.geotiff')
FILE_GLOBAL_POPULATION = DIR_GOST_DATA.joinpath("ppp_2020_1km_Aggregated.tif")

Setup processing global variables#

  • Back to Table of Contents

For variables we will use often, such as the column names containing latitude and longitude, set them as global variables.

# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw4'

# ==============
# ROADS
# ==============
# This assumes we are working with OSM roads file
# Otherwise, these can be changed 
MAIN_ROADS_TAGS = ['motorway', 'motorway_link', 'trunk', 'trunk_link', 'primary',
                       'primary_link']
ROAD_CLASS_COL = 'fclass'
    
INCLUDE_TRAVEL_TIME = False
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_modified' 
LON = 'lon_modified'
HH_ID = 'clusterid'

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'dist_market':'Household distance to nearest major market (FEWSNET key market centers)' ,
       'dist_border':'Household distance to nearest border post on main road',
        'dist_popcenter': 'Household distance to nearest major town',
        'dist_admhq':'Household distance to the headquarters of the district of residence',
        'dist_road': 'Household distance to nearest main road',
       'travel_time_market': 'Travel time (min) from household to nearest major market',
    'travel_time_border':'Travel time (min) from household to nearest border post', 
    'travel_time_popcenter': 'Travel time (min) from household to nearest major town',
       'travel_time_admhq': 'Travel time (min) from household to headquarters of the district of residence' ,
        'travel_time_road': 'Travel time (min)  from household to nearest main road'}

for name, label in VAR_LABELS.items():
    assert len(label) < 80, 'LABEL TOO LONG'
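
The 80-character check above reflects Stata's limit on variable label length. If these variables are later exported to Stata, the labels can be attached directly; a minimal sketch, assuming a dataframe df whose columns match the keys of VAR_LABELS:

# Hypothetical export: attach the labels when writing a Stata file
# (pandas enforces Stata's 80-character limit on variable labels)
df.to_stata('access-distance-npsw4.dta', variable_labels=VAR_LABELS, write_index=False)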

Prepare Roads Data#

  • Back to Table of Contents

For roads, we are interested in the distance and travel time to the country’s main roads. We use a definition of main roads based on OSM highway tags, as provided by Mapzen. However, main roads can also be defined in a country-specific fashion. The purpose of this section is to prepare the roads data so that we end up with just the required main roads.

def prepare_roads_data(fpath_roads=None, fpath_main_roads=None):
    """
    Filter roads data to keep only main roads by doing the following:
    1. Load main roads shapefile if given
    2. Filter roads file to keep main roads using conditions specified in MAIN_ROADS_TAGS variable
    3. If no roads data is available or just need latest data, download latest data from OSM and filter it.
    """
    if fpath_main_roads:
        print('No need to process anything, main roads already exist')
        gdf_main_roads = gpd.read_file(fpath_main_roads)
        return gdf_main_roads
    elif fpath_roads:
        # Filter main roads from file
        gdf_roads = gpd.read_file(fpath_roads)
        gdf_main_roads = gdf_roads.query('{} in @MAIN_ROADS_TAGS'.format(ROAD_CLASS_COL))
        return gdf_main_roads
    else:
        # download latest shapefile from Geofabrik
        osm_data_dir = utils.download_osm_shapefiles(country=COUNTRY_NAME, region=GEOFABRICK_REGION, 
                                                     outdir=DIR_SPATIAL_DATA)
        # Re-run the function to filter for main roads
        roads_fpath = osm_data_dir.joinpath('gis_osm_roads_free_1.shp')
        return prepare_roads_data(fpath_roads=roads_fpath)
gdf_main_roads = prepare_roads_data(fpath_main_roads=FILE_MAIN_ROADS)
No need to process anything, main roads already exist
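
When neither a main-roads file nor a raw roads file is available, the fallback branch above calls utils.download_osm_shapefiles to fetch the latest OSM extract. A minimal sketch of such a downloader, assuming the standard Geofabrik naming convention (country-latest-free.shp.zip); the actual utility may differ:

import zipfile
from pathlib import Path
from urllib.request import urlretrieve

def download_osm_shapefiles_sketch(country, region, outdir):
    """Hypothetical stand-in for utils.download_osm_shapefiles.

    Downloads the Geofabrik shapefile extract for a country, unzips it and
    returns the directory with the extracted layers (e.g. gis_osm_roads_free_1.shp).
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # Geofabrik uses lower-case country names in its download URLs
    url = 'https://download.geofabrik.de/{}/{}-latest-free.shp.zip'.format(region, country.lower())
    zip_path = outdir.joinpath('{}-latest-free.shp.zip'.format(country.lower()))
    if not zip_path.exists():
        urlretrieve(url, zip_path)

    extract_dir = outdir.joinpath('{}_osm'.format(country.lower()))
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)
    return extract_dir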

Euclidean Distance Metrics#

  • Back to Table of Contents

Calculate Euclidean distance (in km) from household (EA centroid) coordinates to places of interest.
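
The distance-to-places loop in the function below relies on utils.distance_to_points, which is assumed to return the distance (in km) from an origin (lat, lon) to the nearest of a list of destination points. A minimal haversine-based sketch of that behaviour (the helper name and 'nearest' output mode mirror the call in the function; the real implementation may differ):

import math

def distance_to_points_sketch(dest_points, origin, output='nearest'):
    """Distance (km) from origin (lat, lon) to a list of (lat, lon) points.

    With output='nearest', return only the distance to the closest point.
    """
    def haversine_km(p1, p2):
        lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(a))  # mean Earth radius in km

    distances = [haversine_km(origin, pt) for pt in dest_points]
    return min(distances) if output == 'nearest' else distances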

def generate_distance_variables(df_coords, dist2pts_inputs_dict, gdf_roads):
    """
    Generates distance accessibility metrics based on Euclidean distance to places and roads.
    Parameters
    ----------
    df_coords(pd.DataFrame) - Public coordinates to use to measure distance
    dist2pts_inputs_dict(dict) - Dictionary with a distance variable as key and the full path to a CSV of locations as value.
    For example, in {'dist_market': FILE_MARKETS}, 'dist_market' is the distance to the nearest market and FILE_MARKETS is the path
    to a CSV file with lat/lon of markets.
    gdf_roads (gpd.GeoDataframe) - GeoPandas Dataframe of main roads.
    Returns
    -------
    A dataframe with the following columns: household id, public coordinates (lat/lon),
    'dist_market', 'dist_border', 'dist_popcenter', 'dist_admhq' and 'dist_road',
    together with the list of generated variable names.

    """
    # ============================
    # DISTANCE TO PLACES
    # ============================
    for var,file in dist2pts_inputs_dict.items():
        df_var = pd.read_csv(file)
        df_var.dropna(subset=['lat','lon'], inplace=True)
        df_var[['lat','lon']] = df_var[['lat','lon']].astype(float)
        lst_pts = [(i[0],i[1]) for i in df_var[['lat', 'lon']].values]
        df_coords[var] = df_coords.apply(lambda x: utils.distance_to_points(lst_pts, (x[LAT],x[LON]), output='nearest'), axis=1)
    
    # ============================
    # DISTANCE TO ROADS
    # ============================
    gdf_coords = gpd.GeoDataFrame(df_coords, geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]), crs=4326)
    
    utm_zones = utils.utm_zones_from_geodataframe(gdf=gdf_roads)
    # Use first, as results are the same for any proj
    proj_crs = utm_zones[0]
    try:
        gdf_roads_proj = gdf_roads.to_crs(int(proj_crs))
    except Exception as err:
        # Without a valid CRS on the roads layer, dist_road cannot be computed
        raise ValueError('Please ensure the roads layer has a defined CRS') from err
    
    gdf_coords_proj = gdf_coords.to_crs(int(proj_crs))
    
    gdf_coords_proj['dist_road'] = gdf_coords_proj.geometry.apply(lambda x: min(list(gdf_roads_proj.distance(x)/1000)))
    df_roads = gdf_coords_proj[[HH_ID, 'dist_road']]
    
    # ======================================
    # COMBINE THE VARS AND SAVE TO CSV FILE
    # =====================================
    # add dist_road to the main dataframe
    df = df_coords.merge(df_roads, on=HH_ID, how='inner', indicator=True)
    assert df._merge.value_counts()['both'] == len(df), 'Dataframe length should match'
    df.drop(columns = ['_merge'], inplace=True)
    
    # output variables
    out_vars = ['dist_market', 'dist_border', 'dist_popcenter', 'dist_admhq', 'dist_road']

    return df, out_vars
# Load survey file with public coordinates 
df_ground_truth = pd.read_stata(FILE_HH_PUB_GEOVARS, convert_categoricals=False)
df_coords = df_ground_truth[[HH_ID, LAT, LON]]


# dict object with distance variable and corresponding file with lat/lon
dist_to_pts_vars = {'dist_market': FILE_MARKETS, 'dist_border': FILE_BORDER_POSTS,
       'dist_popcenter': FILE_TOWNS, 'dist_admhq': FILE_DIST_CAPITALS}

df_dist, out_vars = generate_distance_variables(df_coords=df_coords, dist2pts_inputs_dict=dist_to_pts_vars,
                            gdf_roads=gdf_main_roads)

Quality checks for Euclidean distance#

  • Back to Table of Contents

Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.

# Only use describe for continuous variables 
print('-'*50)
print('       SUMMARY STATISTICS FOR DISTANCE VARS')
print('-'*50)
stats = df_dist[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR DISTANCE VARS
--------------------------------------------------
                  mean   min     50%     75%     max
dist_market     66.731 0.485  64.506 101.082 223.039
dist_border    180.218 1.125 175.998 250.116 431.687
dist_popcenter  35.975 0.158  26.389  52.276 179.360
dist_admhq      21.930 0.158  14.218  34.574 114.477
dist_road       14.550 0.035   3.726  20.748 117.838
----------------------------------------------------------------------

Travel Time Metrics#

  • Back to Table of Contents
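
Travel times are derived from a motorized friction surface: the destinations are converted into a travel-time raster via a least-cost accumulation over the (clipped) friction surface, and the raster is then sampled at the household coordinates with point_query. A minimal sketch of the accumulation step using skimage.graph, assuming a friction raster expressed in minutes per metre and a metre-based cell size; the real utils.generate_travel_time_raster may differ:

import numpy as np
import rasterio
import skimage.graph as graph

def travel_time_surface_sketch(friction_tif, dest_lonlats):
    """Hypothetical sketch of a least-cost travel-time surface.

    friction_tif : friction raster (assumed: minutes of travel per metre per cell)
    dest_lonlats : iterable of (lon, lat) destination coordinates

    Returns an array of cumulative travel time (minutes) to the nearest destination.
    """
    with rasterio.open(friction_tif) as src:
        friction = src.read(1, masked=True).filled(np.inf)
        # Cost of crossing one cell = friction (min/m) x cell width (m).
        # Assumes a metre-based CRS; a geographic raster would need the
        # cell width converted to metres first.
        cell_width = src.transform.a
        costs = friction * cell_width
        # Destination coordinates -> (row, col) indices on the raster grid
        starts = [src.index(lon, lat) for lon, lat in dest_lonlats]

    mcp = graph.MCP_Geometric(costs)
    travel_minutes, _ = mcp.find_costs(starts=starts)
    return travel_minutes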

def generate_travel_time_vars(travel_time_vars, df_coords, df_road):
    """
    Helper function to generate travel time to the places defined in travel_time_vars and to the main roads.
    """
    # ================
    # SET UP
    # ================
    # temporary output directory
    out_folder = DIR_DATA.joinpath('tmp', "travel_time")
    out_folder.mkdir(parents=True, exist_ok=True)
    gdf_coords = gpd.GeoDataFrame(df_coords, geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]), crs=4326)
    
    # Admin-0 (country bounds)
    if FILE_ADM0:
        gdf_adm0 = gpd.read_file(FILE_ADM0)
    else:
        gdf_adm0 = utils.admin_bounds_from_geoboundaries(co_iso='TZA', admin_level=0)
    

    # global input friction surface and clip it to ISO
    global_fric_surface = rasterio.open(str(FILE_FRICTION_SURFACE))
    iso_travel_surface = Path(out_folder).joinpath("TRAVEL_SURFACE.tif")
    if not iso_travel_surface.exists():
        utils.clip_raster(global_fric_surface, gdf_adm0, iso_travel_surface)

    # =========================================
    # GENERATE VARIABLES EXCEPT FOR MAIN ROADS
    # =========================================
    for var, file in travel_time_vars.items():
        # Load CSV as GeoDataframe
        gdf = utils.load_csv_into_geopandas(csv_with_coords=file, lon='lon', lat='lat')
        out_fname = 'travel_raster_{}.tif'.format(var)
        travel_raster = utils.generate_travel_time_raster(friction_raster=iso_travel_surface,dest=gdf, outdir=out_folder, 
                              out_raster=out_fname)
        
        gdf_coords[var] = point_query(gdf_coords, travel_raster)
    
    # ===========================================
    # GENERATE TRAVEL TIME TO NEAREST MAIN ROAD
    # ==========================================
    # First, using the distance to the nearest road, create approximate
    # destination points on the roads at that distance from the origin
    # coordinates (fixed bearing of 90 degrees)
    df_road['tmp_pt'] = df_road.apply(lambda x: utils.terminal_coordinates((x[LON],x[LAT]),
                                                                                         distance=x['dist_road'], 
                                                                                     bearing=90), axis=1)
    df_road['geometry'] = df_road.tmp_pt.apply(lambda x: Point(x[0], x[1]))
    gdf_road_dest = gpd.GeoDataFrame(df_road,crs=4326)[['geometry']]
    
    # Now generate travel time raster
    out_fname = 'travel_raster_roads.tif'
    roads_travel_raster = utils.generate_travel_time_raster(friction_raster=iso_travel_surface, dest=gdf_road_dest, 
                                                            outdir=out_folder,out_raster=out_fname)
    gdf_coords['travel_time_road'] = point_query(gdf_coords, roads_travel_raster)
    
    # output vars
    out_vars = ['travel_time_market', 'travel_time_border', 'travel_time_popcenter',
       'travel_time_admhq', 'travel_time_road']
    
    # ============
    # CLEAN-UP
    # ============
    # Delete temp folder
    shutil.rmtree(out_folder)
    
    return gdf_coords, out_vars
# travel time variables except for 'main road'
trav2pts_vars = {'travel_time_market': FILE_MARKETS, 'travel_time_border': FILE_BORDER_POSTS,
       'travel_time_popcenter': FILE_TOWNS, 'travel_time_admhq': FILE_DIST_CAPITALS}


# distance to nearest road is used to compute travel time to the nearest main road
df_road = df_dist[[HH_ID, LON, LAT, 'dist_road']]

df_travel, out_vars = generate_travel_time_vars(trav2pts_vars, df_coords, df_road)

Quality checks for travel time#

  • Back to Table of Contents

Manually inspect the results.

# Only use describe for continuous variables 
print('-'*50)
print('       SUMMARY STATISTICS FOR TRAVEL TIME  VARS')
print('-'*50)
stats = df_travel[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR TRAVEL TIME  VARS
--------------------------------------------------
                               mean   min     50%     75%            max
travel_time_market    558173950.037 0.147 149.466 266.763 5715261766.701
travel_time_border    558174313.190 2.220 569.896 729.368 5715262095.665
travel_time_popcenter   1798341.738 0.147  62.421 121.391  200000074.659
travel_time_admhq       1783733.977 0.147  35.044  77.601  200000074.659
travel_time_road         525328.722 0.010   0.964   4.514   24426968.657
----------------------------------------------------------------------
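
The extremely large maximum values above (hundreds of millions of minutes) most likely correspond to households that fall on unreachable or nodata cells of the friction surface. A quick automated check, as a sketch with an illustrative 48-hour threshold, can flag them for follow-up:

# Flag implausible travel times (the 48-hour threshold is an illustrative choice)
MAX_PLAUSIBLE_MINUTES = 48 * 60

for var in out_vars:
    n_bad = (df_travel[var] > MAX_PLAUSIBLE_MINUTES).sum()
    if n_bad > 0:
        print('{}: {} household(s) above {} minutes - check for unreachable/nodata cells'.format(
            var, n_bad, MAX_PLAUSIBLE_MINUTES))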

Save output files#

  • Back to Table of Contents

This notebook saves two files: a CSV file with the generated variables and a companion CSV file with the variable labels. The following variables are included:

‘dist_market’, ‘dist_border’, ‘dist_popcenter’, ‘dist_admhq’, ‘dist_road’, ‘travel_time_market’, ‘travel_time_border’, ‘travel_time_popcenter’, ‘travel_time_admhq’, ‘travel_time_road’

These files have been saved to this OneDrive folder.

def save_outputs(df_dist, df_travel, df_coords, outdir, var_labels):
    """
    Save the following output files:
    1. CSV with the generated variables
    2. CSV with the variable labels
    """
    # Merge the two dataframes
    df_travel = df_travel[[HH_ID, 'travel_time_market',
       'travel_time_border', 'travel_time_popcenter', 'travel_time_admhq',
       'travel_time_road']]
    if INCLUDE_TRAVEL_TIME:
        df = df_dist.merge(df_travel, on=HH_ID, how='inner')
        assert len(df) == len(df_coords)
    else:
        df = df_dist[[HH_ID] + list(df_dist.columns.difference([HH_ID]))]
    df[HH_ID] = df[HH_ID].apply(lambda x: str(int(x)))
    
    # save to CSV file
    csv_fname = 'access-distance-{}.csv'.format(FILENAME_SUFFIX)
    out_csv_data = Path(outdir).joinpath(csv_fname)
    df_csv = df.drop(columns=['geometry', LAT, LON])
    df_csv.to_csv(out_csv_data, index=False)
    
    # Save variable labels
    csv_fname = 'access-distance-{}-labels.csv'.format(FILENAME_SUFFIX)
    out_labels = Path(outdir).joinpath(csv_fname)
    if not INCLUDE_TRAVEL_TIME:
        var_labels = {k:v for k, v in var_labels.items() if 'travel' not in k}
    
    df_var_labels = pd.DataFrame(list(var_labels.values()), index=list(var_labels.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv_data.parts[6:]))
    
    return out_csv_data, out_labels
out_csv, out_labels = save_outputs(df_dist, df_travel, df_coords, outdir=DIR_OUTPUTS, var_labels=VAR_LABELS)
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w4/geovars/puf/access-distance-npsw4.csv
df = pd.read_csv(out_csv)

# Only use describe for continuous variables 
print('-'*50)
print('       SUMMARY STATISTICS FOR SAVED VARS')
print('-'*50)
vars = df.columns.difference([HH_ID])
stats = df[vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR SAVED VARS
--------------------------------------------------
                  mean   min     50%     75%     max
dist_admhq      21.930 0.158  14.218  34.574 114.477
dist_border    180.218 1.125 175.998 250.116 431.687
dist_market     66.731 0.485  64.506 101.082 223.039
dist_popcenter  35.975 0.158  26.389  52.276 179.360
dist_road       14.550 0.035   3.726  20.748 117.838
----------------------------------------------------------------------