Soil and Terrain#


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: August 1, 2023

Background#

In this notebook, we generate geovariables in the category of soil and terrain for the National Panel Survey (NPS 2014-2015) in Tanzania. The following geovariables are computed.

  1. Average elevation (m) within 1 km block

  2. Slope

  3. Topographic wetness index

  4. Terrain Roughness

  5. Nutrient availability

  6. Nutrient retention capacity

  7. Rooting conditions

  8. Oxygen availability to roots

  9. Excess salts

  10. Toxicity

  11. Workability (constraining field management)

For full variables metadata, see Data-Cover-Page for details about definition of the variables and other important information.

Input datasets#

The following datasets have been used for all the variables.

  • Data source

    1. Soil. The source for this data is the Harmonized World Soil Database(HWSD)

    2. Elevation. The source for this data is the SRTM-30 dataset

  • Input data files

    1. Soil. The data files used in this notebook were downloaded from the site mentioned above and kept in this OneDrive folder[1]

    2. Elevation. The data files used in this notebook were downloaded from the site mentioned above and kept in this OneDrive folder[1]

1. ^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
from pathlib import Path
import sys

import geopandas as gpd
import pandas as pd
import numpy as np
import rasterio
from rasterstats import point_query, zonal_stats


spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to input data files as global variables using Path package.

# =============================
# BASE WORKING DIRS
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('TZA', 'surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')

# =============================
# SURVEY COORDINATES 
# ============================
# Public/displaced GPS oordinates files
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')

# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')

# =============================
# GEOVARS DATA FILES
# =============================
FILE_ELEV_1K = DIR_DATA.joinpath('SRTM30', 'srtm_1k', 'srtm_1k.tif')
FILE_ELEV_PCT = DIR_DATA.joinpath('SRTM30', 'afmnslp_pct', 'afmnslp_pct.tif')
FILE_TOP_WET_INDEX = DIR_DATA.joinpath('AfSIS/AfSIS_TWI_se/AfSIS_TWI_se.tif')

# Harmonized World Soil Database Files
FILE_LIST_HWSD = [DIR_DATA.joinpath('HWSD/sq{}/sq{}.tif'.format(i,i)) for i in range(1, 8)]

Setup processing global variables#

For variables we will use often such as column name containing latitude and longitude, set them as global variables.

# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = False
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5' 
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'twi':'Topographic wetness index', 
 'slopepct':'Slope derived from 90m SRTM',
 'srtm1k':'Average elevation (m) within 1 km block',
 'sq1':'Nutrient availability', 'sq2':'Nutrient retention capacity',
 'sq3':'Rooting conditions', 'sq4':'Oxygen availability to roots',
       'sq5':'Excess salts', 'sq6':'Toxicity', 'sq7':'Workability (constraining field management'}
for name, label in VAR_LABELS.items():
    try:
        assert len(label) < 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)

Compute variables#

def generate_variables():
    """
    Helper function to generate the variables. 
    It has to be manually edited.
    
    Returns:
    Saves the data.
    """
    # ===================================
    # SETUP INPUT FILES AND COLUMN NAMES
    # ==================================
    sq_colnames = {i.parts[-2]:i for i in FILE_LIST_HWSD}
    fpath_colname_dict = {'twi': FILE_TOP_WET_INDEX, 'slopepct':FILE_ELEV_PCT, 'srtm1k':FILE_ELEV_1K}
    fpath_colname_dict.update(sq_colnames)
    
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
    df_pub_coords  = df_pub_coords.query('make_public == 1')
    df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
    df_pub_coords2 = df_pub_coords.drop_duplicates(subset=['uniq_loc'])
    df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON, URB_RURAL]]
    gdf_coords = gpd.GeoDataFrame(df_coords, 
                                      geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)
    # Get code for rural
    rural = list(set(list(df_pub_coords[URB_RURAL].value_counts().index)) - set([URBAN_CODE]))
    assert len(rural) == 1, 'THEY SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
    rural_code = rural[0]
    
    # ===========================
    # COMPUTE
    # ===========================
    df_list = []
    for colname, file in fpath_colname_dict.items():
        if colname in ['slopepct', 'srtm1k']:
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: zonal statistic within buffer zone.')
            print(' 2. Statistic: mean')
            print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
            print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
            print("-"*60)
            gdf_coords_urb = gdf_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
            df_urb = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_urb, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, stat='mean', buff_dist=BUFF_ZONE_DIST_URB, xcol=LON, ycol=LAT)
            gdf_coords_rur = gdf_coords.query('{} == "{}"'.format(URB_RURAL, rural_code))
            df_rur = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_rur, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, stat='mean', buff_dist=BUFF_ZONE_DIST_RUR, xcol=LON, ycol=LAT)
            df = df_urb.append(df_rur)
        elif colname == 'twi':
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: zonal statistic within buffer zone.')
            print(' 2. Statistic: majority')
            print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
            print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
            print("-"*60)
            gdf_coords_urb = gdf_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
            df_urb = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_urb, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, stat='majority', buff_dist=BUFF_ZONE_DIST_URB, xcol=LON, ycol=LAT)
            gdf_coords_rur = gdf_coords.query('{} == "{}"'.format(URB_RURAL, rural_code))
            df_rur = utils.point_query_raster(in_raster=file, buffer=True, point_shp=gdf_coords_rur, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, stat='majority', buff_dist=BUFF_ZONE_DIST_RUR, xcol=LON, ycol=LAT)
            df = df_urb.append(df_rur)
            
        else:
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: point extraction.')
            print(' 2. Statistic: N/A')
            print(' 3. Urban buffer zone: N/A')
            print(' 4. Rural buffer zone: N/A')
            print("-"*60)
            df = utils.point_query_raster(in_raster=file, buffer=False, point_shp=gdf_coords, statcol_name=colname,
                                                  id_col=TEMP_UNIQ_ID, xcol=LON, ycol=LAT)
        if df[colname].isnull().sum() > 1:
            num_missing = df[colname].isnull().sum()
            print('Replacing {} missing values for {}'.format(num_missing, colname))
            df2 = df.merge(gdf_coords[[TEMP_UNIQ_ID, LAT, LON]], on=TEMP_UNIQ_ID)
            df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=TEMP_UNIQ_ID, lon=LON, lat=LAT)
            df = df3[[TEMP_UNIQ_ID, colname]]
            assert df[colname].isnull().sum() == 0
        
        df.set_index(TEMP_UNIQ_ID, inplace=True)
        df[colname] = df[colname].apply(lambda x: int(round(x, 0)) if isinstance(x, float) else x)
        df_list.append(df)
    
    
    df_out = pd.concat(df_list, axis=1)
    
    # ===========================
    # SAVE OUTPUTS
    # ===========================
    out_csv = DIR_OUTPUTS.joinpath('soil-terrain-{}.csv'.format(FILENAME_SUFFIX))
    df_out.to_csv(out_csv)
    
    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('soil-terrain-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))
    
    return out_csv, list(fpath_colname_dict.keys())
# Generate the variables
outcsv_file, out_vars = generate_variables()
-----------------------------------------------------------------
 Extracting variable -twi- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: majority
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -slopepct- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -srtm1k- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -sq1- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -sq2- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -sq3- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -sq4- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -sq5- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -sq6- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -sq7- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: point extraction.
 2. Statistic: N/A
 3. Urban buffer zone: N/A
 4. Rural buffer zone: N/A
------------------------------------------------------------
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w5/geovars/puf/soil-terrain-npsw5.csv

Quality checks#

Run summary statistic to generare minimum, maximum values etc and inspect them manually.

df = pd.read_csv(outcsv_file)
# Only use describe for continuos variables
print('-'*50)
print('       SUMMARY STATISTICS FOR ELEVATION VARS')
print('-'*50)
stats = df[['slopepct', 'srtm1k']].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR ELEVATION VARS
--------------------------------------------------
                mean  min     50%     75%     max
slopepct    4.747082  0.0     4.0     6.0    39.0
srtm1k    765.325292  6.0  1045.0  1233.0  2261.0
----------------------------------------------------------------------
# Only use describe for continuos variables 
cat_vars = [c for c in out_vars if 'sq' in c] + ['twi']
print('='*60)
print('       FREQUENCY DISTRIBUTION (%) FOR CATEGORICAL VARS')
print('='*60)
for c in cat_vars:
    print()
    print('Variable name: {}'.format(c))
    print('-'*50)
    print(df[c].value_counts(dropna=False, normalize=True))
print('-'*70)
============================================================
       FREQUENCY DISTRIBUTION (%) FOR CATEGORICAL VARS
============================================================

Variable name: sq1
--------------------------------------------------
2    0.593774
1    0.260700
3    0.066148
0    0.056809
7    0.017121
4    0.005447
Name: sq1, dtype: float64

Variable name: sq2
--------------------------------------------------
1    0.512062
2    0.392218
0    0.056809
7    0.017121
3    0.016342
4    0.005447
Name: sq2, dtype: float64

Variable name: sq3
--------------------------------------------------
1    0.759533
2    0.115175
0    0.056809
3    0.028794
7    0.017121
5    0.011673
4    0.010895
Name: sq3, dtype: float64

Variable name: sq4
--------------------------------------------------
1    0.799222
2    0.092607
0    0.056809
3    0.031128
7    0.017121
4    0.003113
Name: sq4, dtype: float64

Variable name: sq5
--------------------------------------------------
1    0.866926
0    0.056809
2    0.046693
7    0.017121
3    0.012451
Name: sq5, dtype: float64

Variable name: sq6
--------------------------------------------------
1    0.902724
0    0.056809
2    0.017899
7    0.017121
3    0.005447
Name: sq6, dtype: float64

Variable name: sq7
--------------------------------------------------
1    0.566537
2    0.250584
3    0.077821
0    0.056809
7    0.017121
4    0.017121
5    0.014008
Name: sq7, dtype: float64

Variable name: twi
--------------------------------------------------
13    0.533074
12    0.205447
14    0.124514
0     0.087938
11    0.031907
36    0.012451
15    0.003891
10    0.000778
Name: twi, dtype: float64
----------------------------------------------------------------------

Summary#

The following variables below were generated.

ID

Theme

Variable Description

Variable ID

Units

Resolution

Data sources

Required

1

Soil and Terrain

Average elevation (m) within 1 km block

soil01

m

0.008333 dd

SRTM-90m

Yes

2

Soil and Terrain

Slope derived from 90m SRTM

soil02

Percent

0.008333 dd

SRTM-90m

Yes

3

Soil and Terrain

Topographic wetness index

soil03

N/A

0.008333 dd

SRTM-90m

Yes

4

Soil and Terrain

Terrain Roughness

soil04

N/A

0.008333 dd

SRTM-90m

Yes

5

Soil and Terrain

Nutrient availability

soil05

N/A

0.008333 dd

Harmonized World Soil Database

Yes

6

Soil and Terrain

Nutrient retention capacity

soil06

N/A

0.008333 dd

Harmonized World Soil Database

Yes

7

Soil and Terrain

Rooting conditions

soil07

N/A

0.008333 dd

Harmonized World Soil Database

Yes

8

Soil and Terrain

Oxygen availability to roots

soil08

N/A

0.008333 dd

Harmonized World Soil Database

Yes

9

Soil and Terrain

Excess salts

soil09

N/A

0.008333 dd

Harmonized World Soil Database

Yes

10

Soil and Terrain

Toxicity

soil08

N/A

0.008333 dd

Harmonized World Soil Database

Yes

11

Soil and Terrain

Workability (constraining field management)

soil08

N/A

0.008333 dd

Harmonized World Soil Database

Yes