Landscape Typology#
Author: Dunstan Matekenya
Affiliation: DECAT, The World Bank Group
Date: August 1, 2023
Background#
In this notebook, we generate geovariables in the category of climate
for the National Panel Survey (NPS 2012-2013) in Tanzania. The following bioclimatic geovariables are computed.
Majority landcover class
Percent under agriculture
Agricultural ecological zones
Population density range (number of people per km sq.)
Input datasets#
The following datasets have been used for all the variables.
Variable |
Data sourcee |
Data files URL |
---|---|---|
Majority landcover class |
impact-observatory-s3-bucket |
|
Percent under agriculture |
impact-observatory-s3-bucket |
|
Agricultural ecological zones (AEZ) |
||
Population density |
^ Only WBG staff can access this OneDrive directory.
Python setup#
The following packages are required for the processing in this notebook.
import warnings
warnings.simplefilter(action='ignore')
import sys
from pathlib import Path
import geopandas as gpd
import pandas as pd
import rasterio
from rasterio.merge import merge
import numpy as np
from rasterstats import gen_zonal_stats
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement
Setup input files#
Define paths to input data files as global variables using Path
package.
# =============================
# BASE WORKING DIRS
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('TZA', 'spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('TZA', 'surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')
DIR_POP= DIR_DATA.joinpath('WORLDPOP')
# =============================
# SURVEY COORDINATES
# ============================
# Public/displaced GPS oordinates files
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')
# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
'geoBoundaries-TZA-ADM0.shp')
# =============================
# GEOVARS DATA FILES
# =============================
POP_YR = 2015
FILE_POP = DIR_POP.joinpath('tza', 'tza_pd_{}_1km_UNadj.tif'.format(POP_YR))
FILE_LAND_COVER = DIR_DATA.joinpath('TZA', 'spatial', 'land-cover',
'E020N00_PROBAV_LC100_global_v3.0.1_2015.tif')
FILE_AEZ = DIR_DATA.joinpath('AEZ_IFPRI/afr-aez09.tif')
FILE_URBAN = DIR_SPATIAL_DATA.joinpath('GHS_SMOD.tif')
Setup processing global variables#
For variables we will use often such as column name containing latitude and longitude, set them as global variables.
# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'
LULC_YR = 2015
# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = True
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5'
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'
# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'land_cover': 'Majority landcover class within approximately 1km buffer',
'cropshare': 'Percent under agriculture within approx 1 km buffer',
'ssa_aez09': 'Agricultural ecological zones',
'pop_density': 'Population density (people per km2)',
'pct_urban_clust': 'Percent urban cluster',
'pct_urban_cntr': 'Percent urban center'}
for name, label in VAR_LABELS.items():
try:
assert len(label) < 80, 'LABEL TOO LONG'
except AssertionError:
print(name)
Define helper functions#
def extract_single_var(df_coords, input_tif, colname, extraction_type='majority',
buff_zone=False, int_outputs=True, pct_code=None, rural_code=1):
"""
Helper to extract variables
"""
# ======================================
# GENERATE THE DATA
# ======================================
if buff_zone:
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: zonal statistic within buffer zone.')
print(' 2. Statistic: {}'.format(extraction_type))
print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
print("-"*60)
gdf_coords_urb = df_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
df_urb = utils.point_query_raster(in_raster=input_tif, buffer=buff_zone, point_shp=gdf_coords_urb, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_URB, class_pct_code=pct_code,
xcol=LON, ycol=LAT)
gdf_coords_rur = df_coords.query('{} == "{}"'.format(URB_RURAL, rural_code))
df_rur = utils.point_query_raster(in_raster=input_tif, buffer=buff_zone, point_shp=gdf_coords_rur, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_RUR, class_pct_code=pct_code,
xcol=LON, ycol=LAT)
df = df_urb.append(df_rur)
else:
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: point extraction.')
print(' 2. Statistic: N/A')
print(' 3. Urban buffer zone: N/A')
print(' 4. Rural buffer zone: N/A')
print("-"*60)
df = utils.point_query_raster(in_raster=input_tif, buffer=False, point_shp=df_coords, statcol_name=colname,
id_col=HH_ID, xcol=LON, ycol=LAT)
# ======================================
# DEAL WITH MISSING VALUES
# ======================================
if df[colname].isnull().sum() > 1:
num_missing = df[colname].isnull().sum()
print('Replacing {} missing values for {}'.format(num_missing, colname))
df2 = df.merge(df_coords[[TEMP_UNIQ_ID, LAT, LON]], on=TEMP_UNIQ_ID)
df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=TEMP_UNIQ_ID, lon=LON, lat=LAT)
df = df3[[TEMP_UNIQ_ID, colname]]
assert df[colname].isnull().sum() == 0
# ======================================
# ROUND UP TO INTEGERS
# ======================================
if int_outputs:
df[colname] = df[colname].apply(lambda x: int(round(x, 0)) if isinstance(x, float) else x)
df.set_index(TEMP_UNIQ_ID, inplace=True)
# Add to variables list
return df
def generate_vars():
"""
Wraps data generation for all variables into this one function.
"""
# ===========================
# LOAD COORDINATES
# ===========================
df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
df_pub_coords = df_pub_coords.query('make_public == 1')
df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
df_pub_coords2 = df_pub_coords.drop_duplicates(subset=['uniq_loc'])
df_coords = df_pub_coords2[[TEMP_UNIQ_ID, URB_RURAL, LAT, LON]]
gdf_coords = gpd.GeoDataFrame(df_coords,
geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)
# Get code for rural
rural = list(set(list(df_pub_coords[URB_RURAL].value_counts().index)) - set([URBAN_CODE]))
assert len(rural) == 1, 'THEY SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
rural = rural[0]
# Keep all variables in this dict
var_list = {}
# ====================================
# POPULATION DENSITY
# ====================================
# Extract population density at point
var = 'pop_density'
df_pop = extract_single_var(df_coords=gdf_coords, input_tif=FILE_POP,
colname=var, buff_zone=True, int_outputs=True, rural_code=rural)
var_list[var] = df_pop
# ====================================
# LAND COVER
# ====================================
var = 'land_cover'
df_lc = extract_single_var(df_coords=gdf_coords, input_tif=FILE_LAND_COVER,
colname=var, extraction_type='majority',rural_code=rural,
buff_zone=True, int_outputs=True)
var_list[var] = df_lc
# =====================================
# PERCENT LAND-CPVER UNDER AGRICULTURE
# =====================================
# Percent Agriculture within buffer
agr_code = 40
var = 'cropshare'
df_agr = extract_single_var(df_coords=gdf_coords, input_tif=FILE_LAND_COVER,
colname=var, extraction_type='class-pct',pct_code=[agr_code],rural_code=rural,
buff_zone=True, int_outputs=True)
# Add to variables list
var_list[var] = df_agr
# =====================================
# AGR ECOLOGICAL ZONES
# =====================================
# Agricultural ecological zones
var = 'ssa_aez09'
df_aez = extract_single_var(df_coords=gdf_coords, input_tif=FILE_AEZ,
colname=var, extraction_type='majority',rural_code=rural,
buff_zone=True, int_outputs=True)
# Add to variables list
var_list[var] = df_aez
# =====================================
# URBANIZATION
# =====================================
# 1. percent urban center (category 30)
var = 'pct_urban_cntr'
urban_center = [30]
df_urb_cnt = extract_single_var(df_coords=gdf_coords, input_tif=FILE_URBAN,
colname=var, extraction_type='class-pct',pct_code=urban_center,
buff_zone=True, int_outputs=True, rural_code=rural)
# Add to variables list
var_list[var] = df_urb_cnt
# 2. percent urban cluster (21,22,23)
var = 'pct_urban_clust'
urban_cluster = [21,22,23]
df_urb_clst = extract_single_var(df_coords=gdf_coords, input_tif=FILE_URBAN,
colname=var, extraction_type='class-pct', pct_code=urban_cluster,
buff_zone=True, int_outputs=True, rural_code=rural)
# Add to variables list
var_list[var] = df_urb_clst
# =====================================
# MERGE VARS
# =====================================
df_out = pd.concat(list(var_list.values()), axis=1)
out_csv = DIR_OUTPUTS.joinpath('landscape-typology-{}.csv'.format(FILENAME_SUFFIX))
df_out = df_out[list(var_list.keys())]
df_out.to_csv(out_csv)
# Save variable labels
out_csv_labels = DIR_OUTPUTS.joinpath('landscape-typology-{}-labels.csv'.format(FILENAME_SUFFIX))
df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
df_var_labels.index.name = 'var_name'
df_var_labels.rename(columns={0:'label'}, inplace=True)
df_var_labels.to_csv(out_csv_labels)
print('-'*30)
print('VARIABLES SAVED TO FILE BELOW')
print('-'*30)
print("/".join(out_csv.parts[6:]))
return out_csv, list(var_list.keys())
Generate variables#
out_csv, var_list = generate_vars()
-----------------------------------------------------------------
Extracting variable -pop_density- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: majority
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -land_cover- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: majority
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -cropshare- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: class-pct
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -ssa_aez09- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: majority
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -pct_urban_cntr- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: class-pct
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -pct_urban_clust- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: class-pct
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w5/geovars/puf/landscape-typology-npsw5.csv
Quality checks#
Run summary statistic to generare minimum, maximum values etc and inspect them manually.
df = pd.read_csv(out_csv)
# Only use describe for continuos variables
print('-'*50)
print(' SUMMARY STATISTICS FOR VARS')
print('-'*50)
stats = df[var_list].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*60)
--------------------------------------------------
SUMMARY STATISTICS FOR VARS
--------------------------------------------------
mean min 50% 75% max
pop_density 785.501167 0.0 83.0 375.0 25951.0
land_cover 70.698833 20.0 50.0 126.0 200.0
cropshare 30.320623 0.0 15.0 56.0 100.0
ssa_aez09 316.161868 312.0 313.0 323.0 324.0
pct_urban_cntr 26.406226 0.0 0.0 60.0 100.0
pct_urban_clust 13.611673 0.0 5.0 23.0 100.0
------------------------------------------------------------
Summary#
The following variables below were generated.
ID |
Theme |
Variable Description |
Variable ID |
Units |
Resolution |
Data sources |
Required |
---|---|---|---|---|---|---|---|
1 |
Landscape Typology |
Majority landcover class within approximately 1km buffer |
land01 |
N/A |
0.002778 dd |
Yes |
|
2 |
Landscape Typology |
Percent under agriculture within approx 1 km buffer |
land02 |
N/A |
0.002778 dd |
Yes |
|
3 |
Landscape Typology |
Agricultural ecological zones |
land03 |
N/A |
0.008333 dd |
Yes |
|
4 |
Climatology |
Population density range (people per km2) |
land04 |
mm |
0.008333 dd |
Yes |