Climatology#
Author: Dunstan Matekenya
Affiliation: DECAT, The World Bank Group
Date: August 7, 2023
Background#
In this notebook, we generate climate geovariables for the National Panel Survey (NPS 2014-2015) in Tanzania. The following bioclimatic geovariables are computed:
- Annual mean temperature (af_bio_1)
- Mean temperature of wettest quarter (af_bio_8)
- Annual precipitation (af_bio_12)
- Precipitation of wettest month (af_bio_13)
- Precipitation of wettest quarter (af_bio_16)
The naming and definitions of the variables are based on WorldClim. For full variable metadata, including definitions of the metrics and other important information, see the Data-Cover-Page.
Input datasets#
The following datasets are used for all the variables.

- Data source: the WorldClim-Bioclimatic variables-30secs.
- Input data files: the files used in this notebook were downloaded from the site above and are kept in this OneDrive folder[1]. A quick way to inspect one of these rasters before extraction is sketched below.
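As a quick sanity check on the downloaded rasters, something like the following can be used to confirm the resolution and nodata value before extraction. This is an illustrative sketch only; the file path shown is an assumption based on the folder layout used later in this notebook.

# Illustrative sketch: inspect one WorldClim GeoTIFF before extraction.
# The path below is an assumed example based on the folder layout used later.
import rasterio

with rasterio.open("data/WORLDCLIM/af_bio_1_x/afbio1x.tif") as src:
    print(src.crs)     # coordinate reference system (WGS84 for WorldClim)
    print(src.res)     # pixel size, roughly 0.00833 degrees at 30 arc-seconds
    print(src.bounds)  # spatial extent of the raster
    print(src.nodata)  # nodata value to watch for during extraction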
Python setup#
The following packages are required for the processing in this notebook.
import warnings
warnings.simplefilter(action='ignore')
import sys
from pathlib import Path
import geopandas as gpd
import pandas as pd
# Add the Spatial-Anonymization package source directory to the Python path
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement
Setup input files#
Define paths to input data files as global variables using the pathlib Path class.
# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_CONS_EXP = DIR_DATA.joinpath('surveys', 'consumption_experiment')
DIR_OUTPUTS = DIR_CONS_EXP.joinpath('geovars', 'puf')
# =================
# GEOVARS DATA
# =================
DIR_CLIM_DATA = Path.cwd().parents[1].joinpath('data', 'WORLDCLIM')
# =============================
# SURVEY COORDINATES
# ============================
# Public/displaced GPS coordinates file
FILE_HH_PUB_COORDS = DIR_OUTPUTS.joinpath('cons-exp-pub-coords.csv')
# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
'geoBoundaries-TZA-ADM0.shp')
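The comment above notes that the geoBoundaries API can be used for country bounds instead of the local shapefile. A minimal, hedged sketch follows; the endpoint path and the 'gjDownloadURL' key are assumptions based on the current geoBoundaries API and may change.

# Illustrative sketch: fetch the TZA ADM0 boundary from the geoBoundaries API
# instead of the local shapefile. Endpoint and JSON key names are assumptions.
import requests

gb_url = 'https://www.geoboundaries.org/api/current/gbOpen/TZA/ADM0/'
gb_meta = requests.get(gb_url, timeout=60).json()
gdf_adm0 = gpd.read_file(gb_meta['gjDownloadURL'])  # key name assumed
print(gdf_adm0.total_bounds)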
Setup processing global variables#
Set values we will use often, such as the column names containing the latitude and longitude, as global variables.
# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'cons-exp2022'
# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = True
# Buffer distances in metres
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub'
LON = 'lon_pub'
HH_ID = 'hhid'
CLUST_ID = 'clusterid'
TEMP_UNIQ_ID = 'clusterid'
URB_RURAL = 'clustertype'
URBAN_CODE = 'urban'
# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'af_bio_1': 'Average annual temperature multiplied by 10 (°C)',
              'af_bio_8': 'Average temperature of the wettest quarter multiplied by 10 (°C)',
              'af_bio_12': 'Total annual precipitation (mm)',
              'af_bio_13': 'Precipitation of wettest month (mm)',
              'af_bio_16': 'Precipitation of wettest quarter (mm)'
              }
# Flag any variable label that is 80 characters or longer
for name, label in VAR_LABELS.items():
    try:
        assert len(label) < 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)
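The extraction below relies on `utils.point_query_raster` from the Spatial-Anonymization package. Purely as an illustration of the buffer-and-zonal-statistics idea, and not that package's implementation, a minimal sketch using `rasterstats` might look like the following; the metric CRS (EPSG:32736, UTM zone 36S) and the use of `all_touched` are assumptions.

# Illustrative sketch only; NOT the implementation of
# spatial_anonymization.utils.point_query_raster.
from rasterstats import zonal_stats

def mean_within_buffer(gdf_points, raster_file, buff_dist_m, colname):
    """Buffer each point by buff_dist_m metres and average the raster cells inside."""
    # Buffer in a projected CRS so distances are in metres
    # (EPSG:32736, UTM zone 36S, is an assumption for mainland Tanzania),
    # then reproject back to WGS84 to match the raster.
    buffered = gdf_points.to_crs(32736).buffer(buff_dist_m).to_crs(4326)
    stats = zonal_stats(list(buffered), str(raster_file), stats=['mean'], all_touched=True)
    out = gdf_points[[TEMP_UNIQ_ID]].copy()
    out[colname] = [s['mean'] for s in stats]
    return out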
Compute bioclimatic variables#
def generate_variables(buff_zone=False, extraction_type='mean'):
    """
    Helper function to generate the bioclimatic variables.
    Edit this function manually to change which variables are extracted.

    Returns:
        Path to the saved output CSV and the list of variable names.
        The data are also saved to disk.
    """
# ===========================
# SETUP INPUT FILES
# ===========================
    # Using folders with suffix x because that's what Siobhan used
bioc_vars_nums = [1, 8, 12, 13, 16]
fpath_lst = {i: DIR_CLIM_DATA.joinpath('af_bio_{}_x'.format(i),
'afbio{}x.tif'.format(i)) for i in bioc_vars_nums}
# ===========================
# LOAD COORDINATES
# ===========================
df_pub_coords = pd.read_csv(FILE_HH_PUB_COORDS)
df_pub_coords = df_pub_coords.query('make_public == 1')
df_pub_coords2 = df_pub_coords.drop_duplicates(subset=[TEMP_UNIQ_ID])
df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON, URB_RURAL]]
gdf_coords = gpd.GeoDataFrame(df_coords,
geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)
    # Get the code used for rural clusters (any code that is not the urban code)
    rural = list(set(df_pub_coords[URB_RURAL].unique()) - {URBAN_CODE})
    assert len(rural) == 1, 'THERE SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
    rural = rural[0]
# ===========================
# COMPUTE
# ===========================
bioc_varname = {1:1, 8:2, 12:3, 13:4, 16:5}
out_vars = []
df_list = []
for num, file in fpath_lst.items():
        # Set column name from the raster folder name, e.g. 'af_bio_1_x' -> 'af_bio_1'
        # colname = 'clim0{}'.format(bioc_varname[num])
        colname = file.parts[-2][:-2]
out_vars.append(colname)
if buff_zone:
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: zonal statistic within buffer zone.')
print(' 2. Statistic: {}'.format(extraction_type))
print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
print("-"*60)
gdf_coords_urb = gdf_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
df_urb = utils.point_query_raster(in_raster=file, buffer=buff_zone, point_shp=gdf_coords_urb, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_URB,
xcol=LON, ycol=LAT)
gdf_coords_rur = gdf_coords.query('{} == "{}"'.format(URB_RURAL, rural))
df_rur = utils.point_query_raster(in_raster=file, buffer=buff_zone, point_shp=gdf_coords_rur, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_RUR,
xcol=LON, ycol=LAT)
            df = pd.concat([df_urb, df_rur])
else:
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: point extraction.')
print(' 2. Statistic: N/A')
print(' 3. Urban buffer zone: N/A')
print(' 4. Rural buffer zone: N/A')
            df = utils.point_query_raster(in_raster=file, buffer=buff_zone,
                                          point_shp=gdf_coords, statcol_name=colname,
                                          id_col=TEMP_UNIQ_ID, stat=extraction_type,
                                          buff_dist=None,  # not used when buffer=False
                                          xcol=LON, ycol=LAT)
        if df[colname].isnull().sum() > 0:
num_missing = df[colname].isnull().sum()
print('Replacing {} missing values for {}'.format(num_missing, colname))
df2 = df_coords.merge(df, on=TEMP_UNIQ_ID)
df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=TEMP_UNIQ_ID, lon=LON, lat=LAT)
df = df3[[TEMP_UNIQ_ID, colname]]
assert df[colname].isnull().sum() == 0
df.set_index(TEMP_UNIQ_ID, inplace=True)
df_list.append(df)
    # Combine all variables and round values to whole numbers
    df_out = pd.concat(df_list, axis=1)
df_out = df_out.applymap(lambda x: int(round(x, 0)) if isinstance(x, (int, float)) else x)
# ===========================
# SAVE OUTPUTS
# ===========================
out_csv = DIR_OUTPUTS.joinpath('bioclimatic-{}.csv'.format(FILENAME_SUFFIX))
df_out.to_csv(out_csv)
# Save variable labels
out_csv_labels = DIR_OUTPUTS.joinpath('bioclimatic-{}-labels.csv'.format(FILENAME_SUFFIX))
df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
df_var_labels.index.name = 'var_name'
df_var_labels.rename(columns={0:'label'}, inplace=True)
df_var_labels.to_csv(out_csv_labels)
print('-'*30)
print('VARIABLES SAVED TO FILE BELOW')
print('-'*30)
print("/".join(out_csv.parts[6:]))
return out_csv, out_vars
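Missing extractions (for example, clusters whose buffer falls entirely on nodata cells) are filled by `utils.replace_missing_with_nearest`. As a hedged illustration of that idea only, and not the package's implementation, a nearest-neighbour fill could look like the sketch below; it uses plain lon/lat Euclidean distance, which is a rough approximation over short distances.

# Illustrative sketch only; NOT the implementation of
# spatial_anonymization.utils.replace_missing_with_nearest.
from scipy.spatial import cKDTree

def fill_with_nearest(df, target_col, lon_col, lat_col):
    """Fill NaNs in target_col with the value of the nearest non-missing point."""
    known = df[df[target_col].notna()]
    missing = df[df[target_col].isna()]
    if missing.empty or known.empty:
        return df
    # Build a KD-tree on the non-missing points and query the nearest one
    # for each missing point (Euclidean distance on lon/lat).
    tree = cKDTree(known[[lon_col, lat_col]].to_numpy())
    _, idx = tree.query(missing[[lon_col, lat_col]].to_numpy(), k=1)
    out = df.copy()
    out.loc[missing.index, target_col] = known[target_col].to_numpy()[idx]
    return out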
# Generate the variables
# df = generate_variables(buff_zone=BUFFER)
outcsv_file, out_vars = generate_variables(buff_zone=BUFFER)
-----------------------------------------------------------------
Extracting variable -af_bio_1- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_8- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_12- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_13- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_16- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/consumption_experiment/geovars/puf/bioclimatic-cons-exp2022.csv
Quality checks#
Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.
df = pd.read_csv(outcsv_file)
# Only use describe for continuous variables
cont_vars = [c for c in out_vars if 'sq' not in c]
print('-'*50)
print(' SUMMARY STATISTICS FOR BIOCLIMATIC VARS')
print('-'*50)
stats = df[cont_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
SUMMARY STATISTICS FOR BIOCLIMATIC VARS
--------------------------------------------------
mean min 50% 75% max
af_bio_1 223.359155 161.0 224.5 238.75 274.0
af_bio_8 228.485915 168.0 227.0 247.75 281.0
af_bio_12 1045.366197 466.0 1023.5 1159.25 2227.0
af_bio_13 213.126761 114.0 188.5 253.00 664.0
af_bio_16 519.147887 236.0 477.5 579.75 1416.0
----------------------------------------------------------------------
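Beyond eyeballing the table above, a few automated plausibility checks can catch unit or nodata problems early. The sketch below uses loose, illustrative thresholds (recall that the temperature variables are stored as °C multiplied by 10); the precipitation inequalities follow directly from the variable definitions.

# Loose plausibility checks; the numeric thresholds are illustrative assumptions.
assert (df['af_bio_12'] >= 0).all(), 'Negative annual precipitation'
assert (df['af_bio_16'] <= df['af_bio_12']).all(), 'Wettest quarter exceeds annual total'
assert (df['af_bio_13'] <= df['af_bio_16']).all(), 'Wettest month exceeds wettest quarter'
# Temperatures are stored multiplied by 10, so rescale before range-checking
assert df['af_bio_1'].div(10).between(5, 40).all(), 'Implausible annual mean temperature'
assert df['af_bio_8'].div(10).between(5, 45).all(), 'Implausible wettest-quarter temperature'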
Summary#
The following variables were generated.
ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required
---|---|---|---|---|---|---|---
1 | Climatology | Average annual temperature calculated from monthly climatology, multiplied by 10 (°C) | clim01 | Degrees Celsius | 0.008333 dd |  | Yes
2 | Climatology | Average temperature of the wettest quarter, from monthly climatology, multiplied by 10 (°C) | clim02 | Degrees Celsius | 0.008333 dd |  | Yes
3 | Climatology | Total annual precipitation | clim03 | mm | 0.008333 dd |  | Yes
4 | Climatology | Precipitation of wettest month | clim04 | mm | 0.008333 dd |  | Yes
5 | Climatology | Precipitation of wettest quarter | clim05 | mm | 0.008333 dd |  | Yes