Soil and Terrain#
Author: Dunstan Matekenya
Affiliation: DECAT, The World Bank Group
Date: August 1, 2023
Background#
In this notebook, we generate geovariables in the category of soil and terrain
for the National Panel Survey (NPS 2014-2015) in Tanzania. The following geovariables are computed.
Average elevation (m) within 1 km block
Slope
Topographic wetness index
Terrain Roughness
Nutrient availability
Nutrient retention capacity
Rooting conditions
Oxygen availability to roots
Excess salts
Toxicity
Workability (constraining field management)
For full variables metadata, see Data-Cover-Page for details about definition of the variables and other important information.
Input datasets#
The following datasets have been used for all the variables.
Data source
Soil. The source for this data is the Harmonized World Soil Database(HWSD)
Elevation. The source for this data is the SRTM-30 dataset
Input data files
Soil. The data files used in this notebook were downloaded from the site mentioned above and kept in this OneDrive folder[1]
Elevation. The data files used in this notebook were downloaded from the site mentioned above and kept in this OneDrive folder[1]
Python setup#
The following packages are required for the processing in this notebook.
import warnings
warnings.simplefilter(action='ignore')
from pathlib import Path
import sys
import geopandas as gpd
import pandas as pd
import numpy as np
import rasterio
from rasterstats import point_query, zonal_stats
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement
Setup input files#
Define paths to input data files as global variables using Path
package.
# =============================
# BASE WORKING DIRS
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('TZA', 'surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')
# =============================
# SURVEY COORDINATES
# ============================
# Public/displaced GPS oordinates files
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')
# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
'geoBoundaries-TZA-ADM0.shp')
# =============================
# GEOVARS DATA FILES
# =============================
FILE_ELEV_1K = DIR_DATA.joinpath('SRTM30', 'srtm_1k', 'srtm_1k.tif')
FILE_ELEV_PCT = DIR_DATA.joinpath('SRTM30', 'afmnslp_pct', 'afmnslp_pct.tif')
FILE_TOP_WET_INDEX = DIR_DATA.joinpath('AfSIS/AfSIS_TWI_se/AfSIS_TWI_se.tif')
# Harmonized World Soil Database Files
FILE_LIST_HWSD = [DIR_DATA.joinpath('HWSD/sq{}/sq{}.tif'.format(i,i)) for i in range(1, 8)]
Setup processing global variables#
For variables we will use often such as column name containing latitude and longitude, set them as global variables.
# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'
# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = False
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5'
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'
# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'twi':'Topographic wetness index',
'slopepct':'Slope derived from 90m SRTM',
'srtm1k':'Average elevation (m) within 1 km block',
'sq1':'Nutrient availability', 'sq2':'Nutrient retention capacity',
'sq3':'Rooting conditions', 'sq4':'Oxygen availability to roots',
'sq5':'Excess salts', 'sq6':'Toxicity', 'sq7':'Workability (constraining field management'}
for name, label in VAR_LABELS.items():
try:
assert len(label) < 80, 'LABEL TOO LONG'
except AssertionError:
print(name)
Compute variables#
def generate_variables():
"""
Helper function to generate the variables.
It has to be manually edited.
Returns:
Saves the data.
"""
# ===================================
# SETUP INPUT FILES AND COLUMN NAMES
# ==================================
sq_colnames = {i.parts[-2]:i for i in FILE_LIST_HWSD}
fpath_colname_dict = {'twi': FILE_TOP_WET_INDEX, 'slopepct':FILE_ELEV_PCT, 'srtm1k':FILE_ELEV_1K}
fpath_colname_dict.update(sq_colnames)
# ===========================
# LOAD COORDINATES
# ===========================
df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
df_pub_coords = df_pub_coords.query('make_public == 1')
df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
df_pub_coords2 = df_pub_coords.drop_duplicates(subset=['uniq_loc'])
df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON, URB_RURAL]]
gdf_coords = gpd.GeoDataFrame(df_coords,
geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)
# Get code for rural
rural = list(set(list(df_pub_coords[URB_RURAL].value_counts().index)) - set([URBAN_CODE]))
assert len(rural) == 1, 'THEY SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
rural_code = rural[0]
# ===========================
# COMPUTE
# ===========================
df_list = []
for colname, file in fpath_colname_dict.items():
if colname in ['slopepct', 'srtm1k']:
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: zonal statistic within buffer zone.')
print(' 2. Statistic: mean')
print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
print("-"*60)
gdf_coords_urb = gdf_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
df_urb = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_urb, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat='mean', buff_dist=BUFF_ZONE_DIST_URB, xcol=LON, ycol=LAT)
gdf_coords_rur = gdf_coords.query('{} == "{}"'.format(URB_RURAL, rural_code))
df_rur = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_rur, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat='mean', buff_dist=BUFF_ZONE_DIST_RUR, xcol=LON, ycol=LAT)
df = df_urb.append(df_rur)
elif colname == 'twi':
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: zonal statistic within buffer zone.')
print(' 2. Statistic: majority')
print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
print("-"*60)
gdf_coords_urb = gdf_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
df_urb = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_urb, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat='majority', buff_dist=BUFF_ZONE_DIST_URB, xcol=LON, ycol=LAT)
gdf_coords_rur = gdf_coords.query('{} == "{}"'.format(URB_RURAL, rural_code))
df_rur = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_rur, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat='majority', buff_dist=BUFF_ZONE_DIST_RUR, xcol=LON, ycol=LAT)
df = df_urb.append(df_rur)
else:
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: point extraction.')
print(' 2. Statistic: N/A')
print(' 3. Urban buffer zone: N/A')
print(' 4. Rural buffer zone: N/A')
print("-"*60)
df = utils.point_query_raster(in_raster=file, buffer=False, point_shp=gdf_coords, statcol_name=colname,
id_col=TEMP_UNIQ_ID, xcol=LON, ycol=LAT)
if df[colname].isnull().sum() > 1:
num_missing = df[colname].isnull().sum()
print('Replacing {} missing values for {}'.format(num_missing, colname))
df2 = df.merge(gdf_coords[[TEMP_UNIQ_ID, LAT, LON]], on=TEMP_UNIQ_ID)
df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=TEMP_UNIQ_ID, lon=LON, lat=LAT)
df = df3[[TEMP_UNIQ_ID, colname]]
assert df[colname].isnull().sum() == 0
df.set_index(TEMP_UNIQ_ID, inplace=True)
df[colname] = df[colname].apply(lambda x: int(round(x, 0)) if isinstance(x, float) else x)
df_list.append(df)
df_out = pd.concat(df_list, axis=1)
# ===========================
# SAVE OUTPUTS
# ===========================
out_csv = DIR_OUTPUTS.joinpath('soil-terrain-{}.csv'.format(FILENAME_SUFFIX))
df_out.to_csv(out_csv)
# Save variable labels
out_csv_labels = DIR_OUTPUTS.joinpath('soil-terrain-{}-labels.csv'.format(FILENAME_SUFFIX))
df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
df_var_labels.index.name = 'var_name'
df_var_labels.rename(columns={0:'label'}, inplace=True)
df_var_labels.to_csv(out_csv_labels)
print('-'*30)
print('VARIABLES SAVED TO FILE BELOW')
print('-'*30)
print("/".join(out_csv.parts[6:]))
return out_csv, list(fpath_colname_dict.keys())
# Generate the variables
outcsv_file, out_vars = generate_variables()
-----------------------------------------------------------------
Extracting variable -twi- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: majority
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -slopepct- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -srtm1k- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -sq1- with the following parameters
-----------------------------------------------------------------
1. Extraction type: point extraction.
2. Statistic: N/A
3. Urban buffer zone: N/A
4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -sq2- with the following parameters
-----------------------------------------------------------------
1. Extraction type: point extraction.
2. Statistic: N/A
3. Urban buffer zone: N/A
4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -sq3- with the following parameters
-----------------------------------------------------------------
1. Extraction type: point extraction.
2. Statistic: N/A
3. Urban buffer zone: N/A
4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -sq4- with the following parameters
-----------------------------------------------------------------
1. Extraction type: point extraction.
2. Statistic: N/A
3. Urban buffer zone: N/A
4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -sq5- with the following parameters
-----------------------------------------------------------------
1. Extraction type: point extraction.
2. Statistic: N/A
3. Urban buffer zone: N/A
4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -sq6- with the following parameters
-----------------------------------------------------------------
1. Extraction type: point extraction.
2. Statistic: N/A
3. Urban buffer zone: N/A
4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -sq7- with the following parameters
-----------------------------------------------------------------
1. Extraction type: point extraction.
2. Statistic: N/A
3. Urban buffer zone: N/A
4. Rural buffer zone: N/A
------------------------------------------------------------
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w5/geovars/puf/soil-terrain-npsw5.csv
Quality checks#
Run summary statistic to generare minimum, maximum values etc and inspect them manually.
df = pd.read_csv(outcsv_file)
# Only use describe for continuos variables
print('-'*50)
print(' SUMMARY STATISTICS FOR ELEVATION VARS')
print('-'*50)
stats = df[['slopepct', 'srtm1k']].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
SUMMARY STATISTICS FOR ELEVATION VARS
--------------------------------------------------
mean min 50% 75% max
slopepct 4.747082 0.0 4.0 6.0 39.0
srtm1k 765.325292 6.0 1045.0 1233.0 2261.0
----------------------------------------------------------------------
# Only use describe for continuos variables
cat_vars = [c for c in out_vars if 'sq' in c] + ['twi']
print('='*60)
print(' FREQUENCY DISTRIBUTION (%) FOR CATEGORICAL VARS')
print('='*60)
for c in cat_vars:
print()
print('Variable name: {}'.format(c))
print('-'*50)
print(df[c].value_counts(dropna=False, normalize=True))
print('-'*70)
============================================================
FREQUENCY DISTRIBUTION (%) FOR CATEGORICAL VARS
============================================================
Variable name: sq1
--------------------------------------------------
2 0.593774
1 0.260700
3 0.066148
0 0.056809
7 0.017121
4 0.005447
Name: sq1, dtype: float64
Variable name: sq2
--------------------------------------------------
1 0.512062
2 0.392218
0 0.056809
7 0.017121
3 0.016342
4 0.005447
Name: sq2, dtype: float64
Variable name: sq3
--------------------------------------------------
1 0.759533
2 0.115175
0 0.056809
3 0.028794
7 0.017121
5 0.011673
4 0.010895
Name: sq3, dtype: float64
Variable name: sq4
--------------------------------------------------
1 0.799222
2 0.092607
0 0.056809
3 0.031128
7 0.017121
4 0.003113
Name: sq4, dtype: float64
Variable name: sq5
--------------------------------------------------
1 0.866926
0 0.056809
2 0.046693
7 0.017121
3 0.012451
Name: sq5, dtype: float64
Variable name: sq6
--------------------------------------------------
1 0.902724
0 0.056809
2 0.017899
7 0.017121
3 0.005447
Name: sq6, dtype: float64
Variable name: sq7
--------------------------------------------------
1 0.566537
2 0.250584
3 0.077821
0 0.056809
7 0.017121
4 0.017121
5 0.014008
Name: sq7, dtype: float64
Variable name: twi
--------------------------------------------------
13 0.533074
12 0.205447
14 0.124514
0 0.087938
11 0.031907
36 0.012451
15 0.003891
10 0.000778
Name: twi, dtype: float64
----------------------------------------------------------------------
Summary#
The following variables below were generated.
ID |
Theme |
Variable Description |
Variable ID |
Units |
Resolution |
Data sources |
Required |
---|---|---|---|---|---|---|---|
1 |
Soil and Terrain |
Average elevation (m) within 1 km block |
soil01 |
m |
0.008333 dd |
Yes |
|
2 |
Soil and Terrain |
Slope derived from 90m SRTM |
soil02 |
Percent |
0.008333 dd |
Yes |
|
3 |
Soil and Terrain |
Topographic wetness index |
soil03 |
N/A |
0.008333 dd |
Yes |
|
4 |
Soil and Terrain |
Terrain Roughness |
soil04 |
N/A |
0.008333 dd |
Yes |
|
5 |
Soil and Terrain |
Nutrient availability |
soil05 |
N/A |
0.008333 dd |
Yes |
|
6 |
Soil and Terrain |
Nutrient retention capacity |
soil06 |
N/A |
0.008333 dd |
Yes |
|
7 |
Soil and Terrain |
Rooting conditions |
soil07 |
N/A |
0.008333 dd |
Yes |
|
8 |
Soil and Terrain |
Oxygen availability to roots |
soil08 |
N/A |
0.008333 dd |
Yes |
|
9 |
Soil and Terrain |
Excess salts |
soil09 |
N/A |
0.008333 dd |
Yes |
|
10 |
Soil and Terrain |
Toxicity |
soil08 |
N/A |
0.008333 dd |
Yes |
|
11 |
Soil and Terrain |
Workability (constraining field management) |
soil08 |
N/A |
0.008333 dd |
Yes |