Soil and Terrain

Soil and Terrain#

Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: August 1, 2023

Background#

In this notebook, we generate geovariables in the category of soil and terrain for the National Panel Survey (NPS 2014-2015) in Tanzania. The following geovariables are computed.

Average elevation (m) within 1 km block
Slope
Topographic wetness index
Terrain Roughness
Nutrient availability
Nutrient retention capacity
Rooting conditions
Oxygen availability to roots
Excess salts
Toxicity
Workability (constraining field management)

For full variables metadata, see Data-Cover-Page for details about definition of the variables and other important information.

Input datasets#

The following datasets have been used for all the variables.

Data source
1. Soil. The source for this data is the Harmonized World Soil Database(HWSD)
2. Elevation. The source for this data is the SRTM-30 dataset
Input data files
1. Soil. The data files used in this notebook were downloaded from the site mentioned above and kept in this OneDrive folder ^[1]
2. Elevation. The data files used in this notebook were downloaded from the site mentioned above and kept in this OneDrive folder ^[1]

1. ^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
from pathlib import Path
import sys

import geopandas as gpd
import pandas as pd
import numpy as np
import rasterio
from rasterstats import point_query, zonal_stats


spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to input data files as global variables using Path package.

# =============================
# BASE WORKING DIRS
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('TZA', 'surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')

# =============================
# SURVEY COORDINATES 
# ============================
# Public/displaced GPS oordinates files
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')

# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')

# =============================
# GEOVARS DATA FILES
# =============================
FILE_ELEV_1K = DIR_DATA.joinpath('SRTM30', 'srtm_1k', 'srtm_1k.tif')
FILE_ELEV_PCT = DIR_DATA.joinpath('SRTM30', 'afmnslp_pct', 'afmnslp_pct.tif')
FILE_TOP_WET_INDEX = DIR_DATA.joinpath('AfSIS/AfSIS_TWI_se/AfSIS_TWI_se.tif')

# Harmonized World Soil Database Files
FILE_LIST_HWSD = [DIR_DATA.joinpath('HWSD/sq{}/sq{}.tif'.format(i,i)) for i in range(1, 8)]

Setup processing global variables#

For variables we will use often such as column name containing latitude and longitude, set them as global variables.

# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = False
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5' 
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'twi':'Topographic wetness index', 
 'slopepct':'Slope derived from 90m SRTM',
 'srtm1k':'Average elevation (m) within 1 km block',
 'sq1':'Nutrient availability', 'sq2':'Nutrient retention capacity',
 'sq3':'Rooting conditions', 'sq4':'Oxygen availability to roots',
       'sq5':'Excess salts', 'sq6':'Toxicity', 'sq7':'Workability (constraining field management'}
for name, label in VAR_LABELS.items():
    try:
        assert len(label) < 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)

Compute variables#

def generate_variables():
    """
    Helper function to generate the variables. 
    It has to be manually edited.
    
    Returns:
    Saves the data.
    """
    # ===================================
    # SETUP INPUT FILES AND COLUMN NAMES
    # ==================================
    sq_colnames = {i.parts[-2]:i for i in FILE_LIST_HWSD}
    fpath_colname_dict = {'twi': FILE_TOP_WET_INDEX, 'slopepct':FILE_ELEV_PCT, 'srtm1k':FILE_ELEV_1K}
    fpath_colname_dict.update(sq_colnames)
    
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
    df_pub_coords  = df_pub_coords.query('make_public == 1')
    df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
    df_pub_coords2 = df_pub_coords.drop_duplicates(subset=['uniq_loc'])
    df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON, URB_RURAL]]
    gdf_coords = gpd.GeoDataFrame(df_coords, 
                                      geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)
    # Get code for rural
    rural = list(set(list(df_pub_coords[URB_RURAL].value_counts().index)) - set([URBAN_CODE]))
    assert len(rural) == 1, 'THEY SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
    rural_code = rural[0]
    
    # ===========================
    # COMPUTE
    # ===========================
    df_list = []
    for colname, file in fpath_colname_dict.items():
        if colname in ['slopepct', 'srtm1k']:
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: zonal statistic within buffer zone.')
            print(' 2. Statistic: mean')
            print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
            print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
            print("-"*60)
            gdf_coords_urb = gdf_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
            df_urb = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_urb, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, stat='mean', buff_dist=BUFF_ZONE_DIST_URB, xcol=LON, ycol=LAT)
            gdf_coords_rur = gdf_coords.query('{} == "{}"'.format(URB_RURAL, rural_code))
            df_rur = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_rur, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, stat='mean', buff_dist=BUFF_ZONE_DIST_RUR, xcol=LON, ycol=LAT)
            df = df_urb.append(df_rur)
        elif colname == 'twi':
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: zonal statistic within buffer zone.')
            print(' 2. Statistic: majority')
            print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
            print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
            print("-"*60)
            gdf_coords_urb = gdf_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
            df_urb = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_urb, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, stat='majority', buff_dist=BUFF_ZONE_DIST_URB, xcol=LON, ycol=LAT)
            gdf_coords_rur = gdf_coords.query('{} == "{}"'.format(URB_RURAL, rural_code))
            df_rur = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_rur, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, stat='majority', buff_dist=BUFF_ZONE_DIST_RUR, xcol=LON, ycol=LAT)
            df = df_urb.append(df_rur)
            
        else:
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: point extraction.')
            print(' 2. Statistic: N/A')
            print(' 3. Urban buffer zone: N/A')
            print(' 4. Rural buffer zone: N/A')
            print("-"*60)
            df = utils.point_query_raster(in_raster=file, buffer=False, point_shp=gdf_coords, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, xcol=LON, ycol=LAT)
        if df[colname].isnull().sum() > 1:
            num_missing = df[colname].isnull().sum()
            print('Replacing {} missing values for {}'.format(num_missing, colname))
            df2 = df.merge(gdf_coords[[TEMP_UNIQ_ID, LAT, LON]], on=TEMP_UNIQ_ID)
            df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=TEMP_UNIQ_ID, lon=LON, lat=LAT)
            df = df3[[TEMP_UNIQ_ID, colname]]
            assert df[colname].isnull().sum() == 0
        
        df.set_index(TEMP_UNIQ_ID, inplace=True)
        df[colname] = df[colname].apply(lambda x: int(round(x, 0)) if isinstance(x, float) else x)
        df_list.append(df)
    
    
    df_out = pd.concat(df_list, axis=1)
    
    # ===========================
    # SAVE OUTPUTS
    # ===========================
    out_csv = DIR_OUTPUTS.joinpath('soil-terrain-{}.csv'.format(FILENAME_SUFFIX))
    df_out.to_csv(out_csv)
    
    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('soil-terrain-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))
    
    return out_csv, list(fpath_colname_dict.keys())

# Generate the variables
outcsv_file, out_vars = generate_variables()

-----------------------------------------------------------------
 Extracting variable -twi- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: majority
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -slopepct- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -srtm1k- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -sq1- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -sq2- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -sq3- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -sq4- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -sq5- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -sq6- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------

-----------------------------------------------------------------
 Extracting variable -sq7- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------

------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w5/geovars/puf/soil-terrain-npsw5.csv

Quality checks#

Run summary statistic to generare minimum, maximum values etc and inspect them manually.

df = pd.read_csv(outcsv_file)

# Only use describe for continuos variables
print('-'*50)
print('       SUMMARY STATISTICS FOR ELEVATION VARS')
print('-'*50)
stats = df[['slopepct', 'srtm1k']].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)

--------------------------------------------------
       SUMMARY STATISTICS FOR ELEVATION VARS
--------------------------------------------------
                mean  min     50%     75%     max
slopepct    4.747082  0.0     4.0     6.0    39.0
srtm1k    765.325292  6.0  1045.0  1233.0  2261.0
----------------------------------------------------------------------

# Only use describe for continuos variables 
cat_vars = [c for c in out_vars if 'sq' in c] + ['twi']
print('='*60)
print('       FREQUENCY DISTRIBUTION (%) FOR CATEGORICAL VARS')
print('='*60)
for c in cat_vars:
    print()
    print('Variable name: {}'.format(c))
    print('-'*50)
    print(df[c].value_counts(dropna=False, normalize=True))
print('-'*70)

============================================================
       FREQUENCY DISTRIBUTION (%) FOR CATEGORICAL VARS
============================================================

Variable name: sq1
--------------------------------------------------
2    0.593774
1    0.260700
3    0.066148
0    0.056809
7    0.017121
4    0.005447
Name: sq1, dtype: float64

Variable name: sq2
--------------------------------------------------
1    0.512062
2    0.392218
0    0.056809
7    0.017121
3    0.016342
4    0.005447
Name: sq2, dtype: float64

Variable name: sq3
--------------------------------------------------
1    0.759533
2    0.115175
0    0.056809
3    0.028794
7    0.017121
5    0.011673
4    0.010895
Name: sq3, dtype: float64

Variable name: sq4
--------------------------------------------------
1    0.799222
2    0.092607
0    0.056809
3    0.031128
7    0.017121
4    0.003113
Name: sq4, dtype: float64

Variable name: sq5
--------------------------------------------------
1    0.866926
0    0.056809
2    0.046693
7    0.017121
3    0.012451
Name: sq5, dtype: float64

Variable name: sq6
--------------------------------------------------
1    0.902724
0    0.056809
2    0.017899
7    0.017121
3    0.005447
Name: sq6, dtype: float64

Variable name: sq7
--------------------------------------------------
1    0.566537
2    0.250584
3    0.077821
0    0.056809
7    0.017121
4    0.017121
5    0.014008
Name: sq7, dtype: float64

Variable name: twi
--------------------------------------------------
13    0.533074
12    0.205447
14    0.124514
0     0.087938
11    0.031907
36    0.012451
15    0.003891
10    0.000778
Name: twi, dtype: float64
----------------------------------------------------------------------

Summary#

The following variables below were generated.

ID	Theme	Variable Description	Variable ID	Units	Resolution	Data sources	Required
1	Soil and Terrain	Average elevation (m) within 1 km block	soil01	m	0.008333 dd	SRTM-90m	Yes
2	Soil and Terrain	Slope derived from 90m SRTM	soil02	Percent	0.008333 dd	SRTM-90m	Yes
3	Soil and Terrain	Topographic wetness index	soil03	N/A	0.008333 dd	SRTM-90m	Yes
4	Soil and Terrain	Terrain Roughness	soil04	N/A	0.008333 dd	SRTM-90m	Yes
5	Soil and Terrain	Nutrient availability	soil05	N/A	0.008333 dd	Harmonized World Soil Database	Yes
6	Soil and Terrain	Nutrient retention capacity	soil06	N/A	0.008333 dd	Harmonized World Soil Database	Yes
7	Soil and Terrain	Rooting conditions	soil07	N/A	0.008333 dd	Harmonized World Soil Database	Yes
8	Soil and Terrain	Oxygen availability to roots	soil08	N/A	0.008333 dd	Harmonized World Soil Database	Yes
9	Soil and Terrain	Excess salts	soil09	N/A	0.008333 dd	Harmonized World Soil Database	Yes
10	Soil and Terrain	Toxicity	soil08	N/A	0.008333 dd	Harmonized World Soil Database	Yes
11	Soil and Terrain	Workability (constraining field management)	soil08	N/A	0.008333 dd	Harmonized World Soil Database	Yes