Landscape Typology#


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: August 1, 2023

Background#

In this notebook, we generate geovariables in the category of climate for the National Panel Survey (NPS 2012-2013) in Tanzania. The following bioclimatic geovariables are computed.

  1. Majority landcover class

  2. Percent under agriculture

  3. Agricultural ecological zones

  4. Population density range (number of people per km sq.)

Input datasets#

The following datasets have been used for all the variables.

Variable

Data sourcee

Data files URL

Majority landcover class

impact-observatory

impact-observatory-s3-bucket

Percent under agriculture

impact-observatory

impact-observatory-s3-bucket

Agricultural ecological zones (AEZ)

IFPRI standardized AEZ

OneDrive folder[1]

Population density

World Pop

UN adjusted pop density estimates

  1. ^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import sys
from pathlib import Path

import geopandas as gpd
import pandas as pd
import rasterio
from rasterio.merge import merge
import numpy as np
from rasterstats import gen_zonal_stats

spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to input data files as global variables using Path package.

# =============================
# BASE WORKING DIRS
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('TZA', 'spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('TZA', 'surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')
DIR_POP= DIR_DATA.joinpath('WORLDPOP')

# =============================
# SURVEY COORDINATES 
# ============================
# Public/displaced GPS oordinates files
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')

# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')

# =============================
# GEOVARS DATA FILES
# =============================
POP_YR = 2015
FILE_POP = DIR_POP.joinpath('tza', 'tza_pd_{}_1km_UNadj.tif'.format(POP_YR))
FILE_LAND_COVER = DIR_DATA.joinpath('TZA', 'spatial', 'land-cover', 
                                    'E020N00_PROBAV_LC100_global_v3.0.1_2015.tif')
FILE_AEZ = DIR_DATA.joinpath('AEZ_IFPRI/afr-aez09.tif')
FILE_URBAN = DIR_SPATIAL_DATA.joinpath('GHS_SMOD.tif')

Setup processing global variables#

For variables we will use often such as column name containing latitude and longitude, set them as global variables.

# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'
LULC_YR = 2015

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = True
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000


    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5' 
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'land_cover': 'Majority landcover class within approximately 1km buffer',
 'cropshare': 'Percent under agriculture within approx 1 km buffer',
 'ssa_aez09': 'Agricultural ecological zones',
 'pop_density': 'Population density (people per km2)',
'pct_urban_clust': 'Percent urban cluster',
'pct_urban_cntr': 'Percent urban center'}


for name, label in VAR_LABELS.items():
    try:
        assert len(label) < 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)

Define helper functions#

def extract_single_var(df_coords, input_tif, colname, extraction_type='majority', 
                buff_zone=False, int_outputs=True, pct_code=None, rural_code=1):
    """
    Helper to extract variables 
    """
    
    # ======================================
    # GENERATE THE DATA
    # ======================================
    if buff_zone:
        print()
        print("-"*65)
        print(' Extracting variable -{}- with the following parameters'.format(colname))
        print("-"*65)
        print(' 1. Extraction type: zonal statistic within buffer zone.')
        print(' 2. Statistic: {}'.format(extraction_type))
        print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
        print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
        print("-"*60)
        gdf_coords_urb = df_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
        df_urb = utils.point_query_raster(in_raster=input_tif, buffer=buff_zone, point_shp=gdf_coords_urb, statcol_name=colname,
                                              id_col=TEMP_UNIQ_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_URB, class_pct_code=pct_code,
                                             xcol=LON, ycol=LAT)
        gdf_coords_rur = df_coords.query('{} == "{}"'.format(URB_RURAL, rural_code))
        df_rur = utils.point_query_raster(in_raster=input_tif, buffer=buff_zone, point_shp=gdf_coords_rur, statcol_name=colname,
                                              id_col=TEMP_UNIQ_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_RUR, class_pct_code=pct_code,
                                             xcol=LON, ycol=LAT)
        df = df_urb.append(df_rur)
    else:
        print()
        print("-"*65)
        print(' Extracting variable -{}- with the following parameters'.format(colname))
        print("-"*65)
        print(' 1. Extraction type: point extraction.')
        print(' 2. Statistic: N/A')
        print(' 3. Urban buffer zone: N/A')
        print(' 4. Rural buffer zone: N/A')
        print("-"*60)
        df = utils.point_query_raster(in_raster=input_tif, buffer=False, point_shp=df_coords, statcol_name=colname,
                                        id_col=HH_ID, xcol=LON, ycol=LAT)

    # ======================================
    # DEAL WITH MISSING VALUES
    # ======================================
    if df[colname].isnull().sum() > 1:
        num_missing = df[colname].isnull().sum()
        print('Replacing {} missing values for {}'.format(num_missing, colname))
        df2 = df.merge(df_coords[[TEMP_UNIQ_ID, LAT, LON]], on=TEMP_UNIQ_ID)
        df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=TEMP_UNIQ_ID, lon=LON, lat=LAT)
        df = df3[[TEMP_UNIQ_ID, colname]]
        assert df[colname].isnull().sum() == 0
    
    # ======================================
    # ROUND UP TO INTEGERS
    # ======================================
    if int_outputs:
        df[colname] = df[colname].apply(lambda x: int(round(x, 0)) if isinstance(x, float) else x)
    
    
    df.set_index(TEMP_UNIQ_ID, inplace=True)


    # Add to variables list
    return df
def generate_vars():
    """
    Wraps data generation for all variables into this one function.
    """
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
    df_pub_coords  = df_pub_coords.query('make_public == 1')
    df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
    df_pub_coords2 = df_pub_coords.drop_duplicates(subset=['uniq_loc'])
    df_coords = df_pub_coords2[[TEMP_UNIQ_ID, URB_RURAL, LAT, LON]]
    
    gdf_coords = gpd.GeoDataFrame(df_coords, 
                                          geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)
    # Get code for rural
    rural = list(set(list(df_pub_coords[URB_RURAL].value_counts().index)) - set([URBAN_CODE]))
    assert len(rural) == 1, 'THEY SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
    rural = rural[0]
    
    # Keep all variables in this dict
    var_list = {}

    # ====================================
    # POPULATION DENSITY
    # ====================================
    # Extract population density at point
    var = 'pop_density'
    df_pop = extract_single_var(df_coords=gdf_coords, input_tif=FILE_POP, 
                        colname=var, buff_zone=True, int_outputs=True, rural_code=rural)
    var_list[var] = df_pop

    # ====================================
    # LAND COVER
    # ====================================
    var = 'land_cover'
    df_lc = extract_single_var(df_coords=gdf_coords, input_tif=FILE_LAND_COVER, 
                        colname=var, extraction_type='majority',rural_code=rural,
                        buff_zone=True, int_outputs=True)
    var_list[var] = df_lc


    # =====================================
    # PERCENT LAND-CPVER UNDER AGRICULTURE
    # =====================================
    # Percent Agriculture within buffer
    agr_code = 40
    var = 'cropshare'
    df_agr = extract_single_var(df_coords=gdf_coords, input_tif=FILE_LAND_COVER, 
                        colname=var, extraction_type='class-pct',pct_code=[agr_code],rural_code=rural,
                        buff_zone=True, int_outputs=True)
    # Add to variables list
    var_list[var] = df_agr

    # =====================================
    # AGR ECOLOGICAL ZONES
    # =====================================
    # Agricultural ecological zones 
    var = 'ssa_aez09'
    df_aez = extract_single_var(df_coords=gdf_coords, input_tif=FILE_AEZ, 
                        colname=var, extraction_type='majority',rural_code=rural,
                        buff_zone=True, int_outputs=True)

    # Add to variables list
    var_list[var] = df_aez

    # =====================================
    # URBANIZATION
    # =====================================
    # 1. percent urban center (category 30)
    var = 'pct_urban_cntr'
    urban_center = [30]
    df_urb_cnt = extract_single_var(df_coords=gdf_coords, input_tif=FILE_URBAN, 
                        colname=var, extraction_type='class-pct',pct_code=urban_center,
                        buff_zone=True, int_outputs=True, rural_code=rural)

    # Add to variables list
    var_list[var] = df_urb_cnt

    # 2. percent urban cluster (21,22,23)
    var = 'pct_urban_clust'
    urban_cluster = [21,22,23]
    df_urb_clst = extract_single_var(df_coords=gdf_coords, input_tif=FILE_URBAN, 
                        colname=var, extraction_type='class-pct', pct_code=urban_cluster,
                        buff_zone=True, int_outputs=True, rural_code=rural)

    # Add to variables list
    var_list[var] = df_urb_clst
    
    # =====================================
    # MERGE VARS
    # =====================================
    df_out = pd.concat(list(var_list.values()), axis=1)

    out_csv = DIR_OUTPUTS.joinpath('landscape-typology-{}.csv'.format(FILENAME_SUFFIX))
    df_out = df_out[list(var_list.keys())]
    df_out.to_csv(out_csv)

    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('landscape-typology-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))
    
    return out_csv, list(var_list.keys())

Generate variables#

out_csv, var_list = generate_vars()
-----------------------------------------------------------------
 Extracting variable -pop_density- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: majority
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -land_cover- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: majority
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -cropshare- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: class-pct
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -ssa_aez09- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: majority
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -pct_urban_cntr- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: class-pct
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -pct_urban_clust- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: class-pct
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w5/geovars/puf/landscape-typology-npsw5.csv

Quality checks#

Run summary statistic to generare minimum, maximum values etc and inspect them manually.

df = pd.read_csv(out_csv)
# Only use describe for continuos variables 
print('-'*50)
print('       SUMMARY STATISTICS FOR VARS')
print('-'*50)
stats = df[var_list].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*60)
--------------------------------------------------
       SUMMARY STATISTICS FOR VARS
--------------------------------------------------
                       mean    min    50%    75%      max
pop_density      785.501167    0.0   83.0  375.0  25951.0
land_cover        70.698833   20.0   50.0  126.0    200.0
cropshare         30.320623    0.0   15.0   56.0    100.0
ssa_aez09        316.161868  312.0  313.0  323.0    324.0
pct_urban_cntr    26.406226    0.0    0.0   60.0    100.0
pct_urban_clust   13.611673    0.0    5.0   23.0    100.0
------------------------------------------------------------

Summary#

The following variables below were generated.

ID

Theme

Variable Description

Variable ID

Units

Resolution

Data sources

Required

1

Landscape Typology

Majority landcover class within approximately 1km buffer

land01

N/A

0.002778 dd

GlobCover v2.3

Yes

2

Landscape Typology

Percent under agriculture within approx 1 km buffer

land02

N/A

0.002778 dd

GlobCover v2.3

Yes

3

Landscape Typology

Agricultural ecological zones

land03

N/A

0.008333 dd

IFIPRI

Yes

4

Climatology

Population density range (people per km2)

land04

mm

0.008333 dd

WorldPop

Yes