Climatology


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: April 18, 2023

Background

In this notebook, we generate geovariables in the category of climate for the National Panel Survey (NPS 2014-2015) in Tanzania. The following bioclimatic geovariables are computed.

  1. Annual mean temperature (af_bio_1)

  2. Mean temperature of wettest quarter (af_bio_8)

  3. Annual precipitation (af_bio_12)

  4. Precipitation of wettest month (af_bio_13)

  5. Precipitation of wettest quarter (af_bio_16)

The variable names and definitions follow WorldClim. See the Data-Cover-Page for the full variable metadata, including the definition of each metric.
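To make the definitions concrete, here is a minimal sketch of how the five variables can be derived from twelve monthly values at a single location. It assumes that a quarter is any run of three consecutive months and that the windows wrap from December into January. The function name `bioclim_from_monthly` and the sample inputs are hypothetical; this notebook does not compute the variables this way but reads precomputed WorldClim rasters.

```python
def bioclim_from_monthly(tavg, prec):
    """Derive five bioclimatic summaries from 12 monthly values.

    tavg: monthly mean temperatures (°C), January to December.
    prec: monthly precipitation totals (mm), January to December.
    """
    if len(tavg) != 12 or len(prec) != 12:
        raise ValueError("expected 12 monthly values for each input")

    # Rolling 3-month windows. The modulo makes the last two windows wrap
    # around the year end (an assumption of this sketch).
    windows = [[(start + k) % 12 for k in range(3)] for start in range(12)]
    quarter_prec = [sum(prec[m] for m in w) for w in windows]

    # Index of the wettest quarter (first one in case of ties)
    wettest = max(range(12), key=lambda i: quarter_prec[i])

    return {
        "af_bio_1": sum(tavg) / 12,                               # annual mean temperature
        "af_bio_8": sum(tavg[m] for m in windows[wettest]) / 3,   # mean temp, wettest quarter
        "af_bio_12": sum(prec),                                   # annual precipitation
        "af_bio_13": max(prec),                                   # wettest month
        "af_bio_16": quarter_prec[wettest],                       # wettest quarter
    }


# Hypothetical monthly values for one location
sample_tavg = [26.0, 26.5, 26.0, 25.0, 24.0, 22.5, 22.0, 22.5, 23.5, 24.5, 25.5, 26.0]
sample_prec = [80, 70, 140, 260, 180, 40, 25, 25, 30, 60, 120, 110]
print(bioclim_from_monthly(sample_tavg, sample_prec))
```

Temperature variables are averages and precipitation variables are totals, which is why af_bio_8 divides by three while af_bio_16 does not.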

Input datasets

The following datasets have been used for all the variables.

Note: Only WBG staff can access the OneDrive directory.

Python setup

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import sys
from pathlib import Path

import geopandas as gpd
import pandas as pd

spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files

Define paths to the input data files as global variables using pathlib.Path.

# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W4 = DIR_DATA.joinpath('surveys', 'NPS_w4')
DIR_OUTPUTS = DIR_NPS_W4.joinpath('geovars', 'puf')

# =================
# GEOVARS DATA
# =================
# DIR_CLIM_DATA = Path.cwd().parents[1].joinpath('data', 'WORLDCLIM')
DIR_CLIM_DATA = Path.cwd().parents[1].joinpath('data', 'WORLDCLIM_v2_1')

# =============================
# SURVEY COORDINATES 
# ============================
# Public/displaced GPS coordinates files
FILE_HH_PUB_GEOVARS = DIR_NPS_W4.joinpath('geovars', 'npsy4.ea.offset.dta')


# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')
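Because every path above is built relative to the current working directory, a wrong starting folder only shows up later as a read error. A small check such as the sketch below can surface the problem early. The helper name `missing_inputs` is hypothetical and is not part of this notebook or of the spatial_anonymization package.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


# In this notebook the call might look like the commented line below
# (left commented out so that the sketch runs on its own):
# missing = missing_inputs([DIR_CLIM_DATA, FILE_HH_PUB_GEOVARS, FILE_ADM0])

# Stand-alone demonstration: the current directory exists, the second path
# is a made-up name that should not exist.
print(missing_inputs([".", "no_such_dir_xyz/none.tif"]))
```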

Setup processing global variables

Values that are used often, such as the names of the columns containing latitude and longitude, are set as global variables.

# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw4'

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = True
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_modified' 
LON = 'lon_modified'
HH_ID = 'clusterid'
URB_RURAL = 'clustertype'
URBAN_CODE = 2


# =============================
# VARIABLE LABELS
# ============================
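# NOTE: the summary statistics later in this notebook report temperatures in
# plain °C (mean af_bio_1 is about 23), so 'multiplied by 10' may not apply
# to the WorldClim v2.1 rasters used here.
VAR_LABELS = {
    'af_bio_1': 'Average annual temperature multiplied by 10 (°C)',
    'af_bio_8': 'Average temperature of the wettest quarter multiplied by 10 (°C)',
    'af_bio_12': 'Total annual precipitation (mm)',
    'af_bio_13': 'Precipitation of wettest month (mm)',
    'af_bio_16': 'Precipitation of wettest quarter (mm)',
}
# Print the name of any variable whose label is 80 characters or longer
for name, label in VAR_LABELS.items():
    if len(label) >= 80:
        print(name)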
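The buffer settings above encode a simple rule: urban clusters get the smaller buffer and every other cluster gets the larger one. As a minimal sketch of that rule, assuming the same codes and distances as the globals above, the hypothetical helper below returns the buffer distance for a given cluster type. The default arguments mirror URBAN_CODE, BUFF_ZONE_DIST_URB and BUFF_ZONE_DIST_RUR, and the distances are taken to be in metres, as the kilometre conversion in the log output suggests.

```python
def buffer_distance_m(cluster_type, urban_code=2, urban_m=2000, rural_m=5000):
    """Return the buffer distance in metres for a cluster type code.

    Any code other than `urban_code` is treated as rural, which matches
    the notebook's expectation of exactly two codes.
    """
    return urban_m if cluster_type == urban_code else rural_m


# Hypothetical cluster type codes
for code in (1, 2):
    print(code, buffer_distance_m(code))
```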

Compute bioclimatic variables
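The function in this section delegates raster extraction to utils.point_query_raster, whose internals are not shown in this notebook. As a rough illustration of the two modes it is called with (the value at a point versus a statistic within a buffer), the plain-Python sketch below works on a tiny in-memory grid. The helper names, the grid layout (row 0 at the top, planar coordinates) and the rule of including a cell when its centre falls inside the buffer are assumptions made for illustration, not a description of the actual utility.

```python
import math


def point_value(grid, x0, y0, cell, px, py):
    """Value of the cell containing (px, py), or None if outside the grid.

    (x0, y0) is the top-left corner of the grid and `cell` the cell size.
    """
    col = int((px - x0) // cell)
    row = int((y0 - py) // cell)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


def buffer_mean(grid, x0, y0, cell, px, py, radius):
    """Mean of all cells whose centre lies within `radius` of (px, py).

    Cells holding None are treated as missing and skipped.
    """
    values = []
    for r, row in enumerate(grid):
        cy = y0 - (r + 0.5) * cell           # y of the cell centre
        for c, value in enumerate(row):
            cx = x0 + (c + 0.5) * cell       # x of the cell centre
            if value is not None and math.hypot(cx - px, cy - py) <= radius:
                values.append(value)
    return sum(values) / len(values) if values else None


# Hypothetical 3 x 3 grid with unit cells and top-left corner at (0, 3)
grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(point_value(grid, 0, 3, 1, 0.5, 2.5))
print(buffer_mean(grid, 0, 3, 1, 0.5, 2.5, 1.0))
```

A buffer mean smooths over neighbouring cells, which is one reason to prefer it when the coordinates have been displaced for confidentiality and the exact cell under the point is not meaningful.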

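def generate_variables(buff_zone=False, extraction_type='mean'):
    """
    Extract the bioclimatic variables for each survey cluster and save
    them to CSV. It relies on the global variables defined above and
    has to be edited manually.

    Parameters:
    buff_zone: if True, compute a zonal statistic within the urban/rural
        buffer zones; otherwise extract the raster value at each point.
    extraction_type: statistic to compute within the buffer (e.g. 'mean').

    Returns:
    Path to the output CSV file and the list of output variable names.
    """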
    # ===========================
    # SETUP INPUT FILES 
    # ===========================
    # Using folders with suffix x because that's what Siobhan used
    bioc_vars_nums = {'af_bio_1':1, 'af_bio_8':8 , 'af_bio_12':12, 'af_bio_13':13, 'af_bio_16':16}
    fpath_lst = {k: DIR_CLIM_DATA.joinpath('wc2.1_30s_bio_{}_af.tif'.format(v)) for k,v in bioc_vars_nums.items()}

    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_stata(FILE_HH_PUB_GEOVARS, convert_categoricals=False)
    df_coords = df_pub_coords[[HH_ID, LAT, LON, URB_RURAL]]
    gdf_coords = gpd.GeoDataFrame(df_coords, 
                                      geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)
    # Get code for rural
    rural = list(set(list(df_pub_coords[URB_RURAL].value_counts().index)) - set([URBAN_CODE]))
    assert len(rural) == 1, 'THERE SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
    rural = rural[0]

    # ===========================
    # COMPUTE
    # ===========================
    out_vars  = []
    df_list = []
    for colname, file in fpath_lst.items():
        # Set column name
        # colname = 'clim0{}'.format(bioc_varname[num])
        out_vars.append(colname)
        if buff_zone:
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: zonal statistic within buffer zone.')
            print(' 2. Statistic: {}'.format(extraction_type))
            print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
            print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
            print("-"*60)
            gdf_coords_urb = gdf_coords.query('{} == {}'.format(URB_RURAL, URBAN_CODE))
            df_urb = utils.point_query_raster(in_raster=file, buffer=buff_zone, point_shp=gdf_coords_urb, statcol_name=colname,
                                                  id_col=HH_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_URB,
                                                 xcol=LON, ycol=LAT)
            gdf_coords_rur = gdf_coords.query('{} == {}'.format(URB_RURAL, rural))
            df_rur = utils.point_query_raster(in_raster=file, buffer=buff_zone, point_shp=gdf_coords_rur, statcol_name=colname,
                                                  id_col=HH_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_RUR,
                                                 xcol=LON, ycol=LAT)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            df = pd.concat([df_urb, df_rur])
        else:
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: point extraction.')
            print(' 2. Statistic: N/A')
            print(' 3. Urban buffer zone: N/A')
            print(' 4. Rural buffer zone: N/A')
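            # BUFF_ZONE_DIST is not defined anywhere in this notebook; pass the
            # rural distance instead (assumed to be ignored when buffer is False)
            df = utils.point_query_raster(in_raster=file, buffer=buff_zone,
                                          point_shp=gdf_coords, statcol_name=colname,
                                          id_col=HH_ID, stat='mean', buff_dist=BUFF_ZONE_DIST_RUR,
                                          xcol=LON, ycol=LAT)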
        
        
        # Fill in any missing value (a single missing value was skipped with '> 1')
        if df[colname].isnull().sum() > 0:
            num_missing = df[colname].isnull().sum()
            print('Replacing {} missing values for {}'.format(num_missing, colname))
            df2 = df_coords.merge(df, on=HH_ID)
            df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=HH_ID, lon=LON, lat=LAT)
            df = df3[[HH_ID, colname]]
            assert df[colname].isnull().sum() == 0
        
        df.set_index(HH_ID, inplace=True)
        df_list.append(df)
    
    # Round to 0 decimal places
    df_out = pd.concat(df_list, axis=1)
    df_out = df_out.applymap(lambda x: int(round(x, 0)) if isinstance(x, (int, float)) else x)
    
    # ===========================
    # SAVE OUTPUTS
    # ===========================
    out_csv = DIR_OUTPUTS.joinpath('bioclimatic-{}.csv'.format(FILENAME_SUFFIX))
    df_out.to_csv(out_csv)
    
    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('bioclimatic-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))
    
    return out_csv, out_vars
# Generate the variables
outcsv_file, out_vars = generate_variables(buff_zone=BUFFER)
-----------------------------------------------------------------
 Extracting variable -af_bio_1- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -af_bio_8- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -af_bio_12- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -af_bio_13- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
 Extracting variable -af_bio_16- with the following parameters
-----------------------------------------------------------------
 1. Extraction type: zonal statistic within buffer zone.
 2. Statistic: mean
 3. Urban buffer zone: 2.0km
 4. Rural buffer zone: 5.0km
------------------------------------------------------------
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w4/geovars/puf/bioclimatic-npsw4.csv
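Besides the extraction reported in the log above, generate_variables fills clusters whose extraction returned no value by calling utils.replace_missing_with_nearest, whose implementation is not shown in this notebook. The sketch below illustrates the general idea under the assumption that "nearest" means the closest cluster with a valid value by great-circle distance. The function names and the sample records are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def fill_with_nearest(records, value_key, lat_key="lat", lon_key="lon"):
    """Return copies of `records` with None values taken from the nearest donor."""
    donors = [r for r in records if r[value_key] is not None]
    if not donors:
        raise ValueError("no record has a valid value to borrow from")
    filled = []
    for rec in records:
        if rec[value_key] is None:
            nearest = min(
                donors,
                key=lambda d: haversine_km(rec[lat_key], rec[lon_key], d[lat_key], d[lon_key]),
            )
            rec = {**rec, value_key: nearest[value_key]}  # copy, do not mutate the input
        filled.append(rec)
    return filled


# Hypothetical clusters; the third one has no extracted value
clusters = [
    {"clusterid": 1, "lat": -6.8, "lon": 39.3, "af_bio_1": 26},
    {"clusterid": 2, "lat": -3.4, "lon": 36.7, "af_bio_1": 19},
    {"clusterid": 3, "lat": -6.7, "lon": 39.2, "af_bio_1": None},
]
print(fill_with_nearest(clusters, "af_bio_1"))
```

Borrowing from the nearest cluster is a pragmatic choice for points that fall just outside the raster's valid area, such as coastal locations, but it can hide genuine data gaps, so the number of replaced values is worth reporting.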

Quality checks

Run summary statistics to generate the minimum, maximum and other values, then inspect them manually.

df = pd.read_csv(outcsv_file)
# Only use describe for continuous variables
cont_vars = [c for c in out_vars if 'sq' not in c]
print('-'*50)
print('       SUMMARY STATISTICS FOR BIOCLIMATIC VARS')
print('-'*50)
stats = df[cont_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR BIOCLIMATIC VARS
--------------------------------------------------
                  mean    min     50%     75%     max
af_bio_1     23.346062   15.0    23.0    26.0    28.0
af_bio_8     23.964200   16.0    24.0    27.0    28.0
af_bio_12  1104.849642  522.0  1067.0  1218.5  1957.0
af_bio_13   235.575179  117.0   219.0   275.0   454.0
af_bio_16   560.789976  276.0   532.0   653.0  1054.0
----------------------------------------------------------------------
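Manual inspection can be complemented with a simple automated range check. The sketch below flags any variable whose minimum or maximum falls outside a plausible range. The minimum and maximum values are copied from the summary table above, while the bounds are illustrative assumptions chosen for this example rather than validated thresholds, and `out_of_range` is a hypothetical helper.

```python
def out_of_range(summary, bounds):
    """Return the variables whose (min, max) is not inside their bounds.

    summary: {variable: (observed_min, observed_max)}
    bounds:  {variable: (lowest_plausible, highest_plausible)}
    """
    flagged = []
    for var, (low, high) in bounds.items():
        observed_min, observed_max = summary[var]
        if observed_min < low or observed_max > high:
            flagged.append(var)
    return flagged


# Minimum and maximum values from the summary statistics above
summary = {
    "af_bio_1": (15.0, 28.0),
    "af_bio_8": (16.0, 28.0),
    "af_bio_12": (522.0, 1957.0),
    "af_bio_13": (117.0, 454.0),
    "af_bio_16": (276.0, 1054.0),
}

# Assumed plausible ranges (illustrative only)
bounds = {
    "af_bio_1": (0, 40),
    "af_bio_8": (0, 40),
    "af_bio_12": (0, 5000),
    "af_bio_13": (0, 1500),
    "af_bio_16": (0, 3000),
}

print(out_of_range(summary, bounds))
```

An empty list means that every variable stays within its assumed range. A temperature range of 0 to 40 would also catch values accidentally stored as tenths of a degree.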

Summary

The following variables were generated.

| ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required |
|---|---|---|---|---|---|---|---|
| 1 | Climatology | Average annual temperature calculated from monthly climatology, multiplied by 10 (°C) | clim01 | Degrees Celsius | 0.008333 dd | WorldClim | Yes |
| 2 | Climatology | Average temperature of the wettest quarter, from monthly climatology, multiplied by 10 (°C) | clim02 | Degrees Celsius | 0.008333 dd | WorldClim | Yes |
| 3 | Climatology | Total annual precipitation | clim03 | mm | 0.008333 dd | WorldClim | Yes |
| 4 | Climatology | Precipitation of wettest month | clim04 | mm | 0.008333 dd | WorldClim | Yes |
| 5 | Climatology | Precipitation of wettest quarter | clim05 | mm | 0.008333 dd | WorldClim | Yes |