Climatology#
Author: Dunstan Matekenya
Affiliation: DECAT, The World Bank Group
Date: April 18, 2023
Background#
In this notebook, we generate climate geovariables
for the National Panel Survey (NPS 2014-2015) in Tanzania. The following bioclimatic geovariables are computed:
Annual mean temperature (af_bio_1)
Mean temperature of wettest quarter (af_bio_8)
Annual precipitation (af_bio_12)
Precipitation of wettest month (af_bio_13)
Precipitation of wettest quarter (af_bio_16)
The names and definitions of the variables follow WorldClim. For full variable metadata, including the definitions of the metrics and other important information, see Data-Cover-Page.
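To make the five definitions concrete, here is a minimal sketch of how such summaries can be derived from twelve monthly means for a single location. It assumes the commonly used bioclimatic convention in which a quarter is any three consecutive months, wrapping around the year end. The function name `bioclim_from_monthly` and the monthly values are made up for illustration; the notebook itself reads precomputed WorldClim rasters and does not perform this calculation.

```python
def bioclim_from_monthly(tavg, prec):
    """Derive five bioclimatic summaries from 12 monthly means (Jan to Dec)."""
    if len(tavg) != 12 or len(prec) != 12:
        raise ValueError("expected 12 monthly values for each input")
    # Quarter i covers months i, i+1, i+2, wrapping around the year end
    # (assumed convention).
    quarters = [[(i + k) % 12 for k in range(3)] for i in range(12)]
    q_prec = [sum(prec[m] for m in q) for q in quarters]
    wettest = max(range(12), key=lambda i: q_prec[i])
    return {
        "af_bio_1": sum(tavg) / 12,                               # annual mean temperature
        "af_bio_8": sum(tavg[m] for m in quarters[wettest]) / 3,  # mean temp, wettest quarter
        "af_bio_12": sum(prec),                                   # annual precipitation
        "af_bio_13": max(prec),                                   # wettest month
        "af_bio_16": q_prec[wettest],                             # wettest quarter
    }


# Made-up monthly values, not real data
tavg = [25, 25, 24, 23, 22, 20, 19, 20, 22, 24, 25, 25]   # degrees Celsius
prec = [100, 120, 150, 200, 80, 10, 5, 5, 10, 30, 60, 90]  # mm
bio = bioclim_from_monthly(tavg, prec)
```

By construction, the wettest-month value can never exceed the wettest-quarter value, which in turn can never exceed the annual total.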
Input datasets#
The same input dataset is used for all the variables.
Data source: WorldClim bioclimatic variables at 30 arc-second resolution.
Input data files: The data files used in this notebook were downloaded from the WorldClim site and are kept in this OneDrive folder[1].
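As a rough guide to what the 30 arc-second resolution means on the ground, the arithmetic below converts it to decimal degrees and to metres at the equator, then estimates how many pixels span the 2 km and 5 km buffer diameters used in this notebook. It is a back-of-the-envelope sketch: it assumes the WGS84 equatorial radius, and the helper `pixels_across` is a hypothetical name. East-west pixel width shrinks slightly away from the equator.

```python
import math

ARCSEC_PER_DEGREE = 3600
WGS84_EQUATORIAL_RADIUS_M = 6378137.0

# 30 arc-seconds expressed in decimal degrees
res_deg = 30 / ARCSEC_PER_DEGREE

# Length of one degree along the equator, in metres
m_per_degree = 2 * math.pi * WGS84_EQUATORIAL_RADIUS_M / 360

# Approximate pixel size at the equator, in metres
res_m_equator = res_deg * m_per_degree


def pixels_across(buffer_radius_m, pixel_m=res_m_equator):
    """Approximate number of pixels spanning the buffer diameter."""
    return 2 * buffer_radius_m / pixel_m


urban_pixels = pixels_across(2000)
rural_pixels = pixels_across(5000)
```

The result is a pixel a little under 1 km wide, so an urban buffer spans only a handful of pixels while a rural buffer spans roughly ten.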
Python setup#
The following packages are required for the processing in this notebook.
import warnings
warnings.simplefilter(action='ignore')
import sys
from pathlib import Path
import geopandas as gpd
import pandas as pd
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement
Setup input files#
Define paths to input data files as global variables using the Path
class from pathlib.
# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W4 = DIR_DATA.joinpath('surveys', 'NPS_w4')
DIR_OUTPUTS = DIR_NPS_W4.joinpath('geovars', 'puf')
# =================
# GEOVARS DATA
# =================
# DIR_CLIM_DATA = Path.cwd().parents[1].joinpath('data', 'WORLDCLIM')
DIR_CLIM_DATA = Path.cwd().parents[1].joinpath('data', 'WORLDCLIM_v2_1')
# =============================
# SURVEY COORDINATES
# ============================
# Public/displaced GPS coordinates files
FILE_HH_PUB_GEOVARS = DIR_NPS_W4.joinpath('geovars', 'npsy4.ea.offset.dta')
# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                      'geoBoundaries-TZA-ADM0.shp')
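Since every later step depends on these paths, it can help to confirm that they exist before running the extraction. The helper below is an illustrative sketch; `missing_paths` is a hypothetical name and is not part of the notebook or of the `spatial_anonymization` package.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the names in `paths` (name -> path) whose target does not exist."""
    return [name for name, p in paths.items() if not Path(p).exists()]


# Toy example: the current directory exists, the second path is assumed not to
example = {
    "here": Path("."),
    "gone": Path("definitely-missing-dir-1234567"),
}
not_found = missing_paths(example)
```

In this notebook it could be called with a dictionary such as `{'FILE_ADM0': FILE_ADM0, 'DIR_CLIM_DATA': DIR_CLIM_DATA}`, stopping early if the returned list is not empty.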
Setup processing global variables#
Set values we will use often, such as the names of the columns containing latitude and longitude, as global variables.
# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw4'
# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = True
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_modified'
LON = 'lon_modified'
HH_ID = 'clusterid'
URB_RURAL = 'clustertype'
URBAN_CODE = 2
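The buffer settings above imply two extraction modes: reading the raster value at the point itself, or averaging all values within a radius that depends on whether the cluster is urban or rural. The sketch below illustrates the buffered mean on a toy grid using great-circle distances. It is not the implementation of `utils.point_query_raster`, whose internals are not shown here; the function names, the spherical Earth radius, and the toy grid are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def buffer_radius_m(cluster_type, urban_code=2, urban_m=2000, rural_m=5000):
    """Pick the buffer radius from the urban/rural code."""
    return urban_m if cluster_type == urban_code else rural_m


def buffer_mean(cells, lat, lon, radius_m):
    """Mean of cell values whose centres lie within radius_m of (lat, lon).

    cells: iterable of (cell_lat, cell_lon, value) tuples.
    Returns None when no cell centre falls inside the buffer.
    """
    inside = [v for clat, clon, v in cells
              if haversine_m(lat, lon, clat, clon) <= radius_m]
    return sum(inside) / len(inside) if inside else None


# Toy grid of cell centres spaced 0.01 degrees apart around (-6.0, 35.0);
# the values form a plane that equals 100 at the centre cell.
cells = [(-6.0 + 0.01 * di, 35.0 + 0.01 * dj, 100 + 10 * di + dj)
         for di in range(-5, 6) for dj in range(-5, 6)]

urban_mean = buffer_mean(cells, -6.0, 35.0, buffer_radius_m(2))
rural_mean = buffer_mean(cells, -6.0, 35.0, buffer_radius_m(1))
```

Averaging over a buffer rather than sampling a single pixel makes the result less sensitive to the exact position of the displaced public coordinates.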
# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {
    'af_bio_1': 'Average annual temperature multiplied by 10 (°C)',
    'af_bio_8': 'Average temperature of the wettest quarter multiplied by 10(°C)',
    'af_bio_12': 'Total annual precipitation (mm)',
    'af_bio_13': 'Precipitation of wettest month (mm)',
    'af_bio_16': 'Precipitation of wettest quarter (mm)',
}

# Print the name of any variable whose label has 80 characters or more
for name, label in VAR_LABELS.items():
    if len(label) >= 80:
        print(name)
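The length check above looks aimed at Stata's limit of 80 characters for variable labels; that motive is an assumption, as the notebook does not state it. A small reusable version of the same check might look like the sketch below, where `labels_too_long` is a hypothetical helper name and the threshold mirrors the `< 80` test used above.

```python
def labels_too_long(labels, max_len=80):
    """Return {name: length} for labels with max_len characters or more."""
    return {name: len(label)
            for name, label in labels.items()
            if len(label) >= max_len}


# Toy labels: one short, one deliberately padded to 80 characters
sample_labels = {
    "short_var": "Total annual precipitation (mm)",
    "long_var": "x" * 80,
}
flagged = labels_too_long(sample_labels)
```

Returning the offending names and lengths, rather than printing them, makes it easy to stop the run when the result is not empty.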
Compute bioclimatic variables#
def generate_variables(buff_zone=False, extraction_type='mean'):
    """
    Helper function to generate the variables.
    It has to be manually edited.

    Returns:
        Path of the saved CSV file and the list of output variable names.
    """
    # ===========================
    # SETUP INPUT FILES
    # ===========================
    # Using folders with suffix x because that's what Siobhan used
    bioc_vars_nums = {'af_bio_1': 1, 'af_bio_8': 8, 'af_bio_12': 12, 'af_bio_13': 13, 'af_bio_16': 16}
    fpath_lst = {k: DIR_CLIM_DATA.joinpath('wc2.1_30s_bio_{}_af.tif'.format(v))
                 for k, v in bioc_vars_nums.items()}

    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_stata(FILE_HH_PUB_GEOVARS, convert_categoricals=False)
    df_coords = df_pub_coords[[HH_ID, LAT, LON, URB_RURAL]]
    gdf_coords = gpd.GeoDataFrame(df_coords,
                                  geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]), crs=4326)

    # Get code for rural
    rural = list(set(list(df_pub_coords[URB_RURAL].value_counts().index)) - set([URBAN_CODE]))
    assert len(rural) == 1, 'THERE SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
    rural = rural[0]

    # ===========================
    # COMPUTE
    # ===========================
    out_vars = []
    df_list = []
    for colname, file in fpath_lst.items():
        out_vars.append(colname)
        if buff_zone:
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: zonal statistic within buffer zone.')
            print(' 2. Statistic: {}'.format(extraction_type))
            print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
            print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
            print("-"*60)

            # Urban and rural clusters use different buffer distances
            gdf_coords_urb = gdf_coords.query('{} == {}'.format(URB_RURAL, URBAN_CODE))
            df_urb = utils.point_query_raster(in_raster=file, buffer=buff_zone, point_shp=gdf_coords_urb,
                                              statcol_name=colname, id_col=HH_ID, stat=extraction_type,
                                              buff_dist=BUFF_ZONE_DIST_URB, xcol=LON, ycol=LAT)

            gdf_coords_rur = gdf_coords.query('{} == {}'.format(URB_RURAL, rural))
            df_rur = utils.point_query_raster(in_raster=file, buffer=buff_zone, point_shp=gdf_coords_rur,
                                              statcol_name=colname, id_col=HH_ID, stat=extraction_type,
                                              buff_dist=BUFF_ZONE_DIST_RUR, xcol=LON, ycol=LAT)

            # DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
            df = pd.concat([df_urb, df_rur])
        else:
            print()
            print("-"*65)
            print(' Extracting variable -{}- with the following parameters'.format(colname))
            print("-"*65)
            print(' 1. Extraction type: point extraction.')
            print(' 2. Statistic: N/A')
            print(' 3. Urban buffer zone: N/A')
            print(' 4. Rural buffer zone: N/A')
            # Assumption: buff_dist is ignored when buffer is False,
            # so any defined distance can be passed here
            df = utils.point_query_raster(in_raster=file, buffer=buff_zone,
                                          point_shp=gdf_coords, statcol_name=colname,
                                          id_col=HH_ID, stat='mean', buff_dist=BUFF_ZONE_DIST_RUR,
                                          xcol=LON, ycol=LAT)

        # Fill any missing values with the value of the nearest cluster
        num_missing = df[colname].isnull().sum()
        if num_missing > 0:
            print('Replacing {} missing values for {}'.format(num_missing, colname))
            df2 = df_coords.merge(df, on=HH_ID)
            df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=HH_ID, lon=LON, lat=LAT)
            df = df3[[HH_ID, colname]]
        assert df[colname].isnull().sum() == 0

        df = df.set_index(HH_ID)
        df_list.append(df)

    # Combine the variables and round to 0 decimal places
    df_out = pd.concat(df_list, axis=1)
    df_out = df_out.round(0).astype(int)

    # ===========================
    # SAVE OUTPUTS
    # ===========================
    out_csv = DIR_OUTPUTS.joinpath('bioclimatic-{}.csv'.format(FILENAME_SUFFIX))
    df_out.to_csv(out_csv)

    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('bioclimatic-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0: 'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)

    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))

    return out_csv, out_vars
# Generate the variables
outcsv_file, out_vars = generate_variables(buff_zone=BUFFER)
-----------------------------------------------------------------
Extracting variable -af_bio_1- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_8- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_12- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_13- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_16- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w4/geovars/puf/bioclimatic-npsw4.csv
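When an extraction returns no value for a cluster, for example because the point falls on a no-data pixel, the function above fills the gap using `utils.replace_missing_with_nearest`. The source of that helper is not shown in this notebook, so the sketch below is only one plausible reading of the idea: copy the value from the closest cluster that has one. The function name `fill_missing_with_nearest`, the record layout, and the sample coordinates are all hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres (spherical approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def fill_missing_with_nearest(rows, value_key, lat_key="lat", lon_key="lon"):
    """Return copies of rows where a missing value (None) is replaced by
    the value of the nearest row that has one."""
    donors = [r for r in rows if r[value_key] is not None]
    if not donors:
        raise ValueError("no non-missing values to borrow from")
    filled = []
    for r in rows:
        if r[value_key] is None:
            nearest = min(
                donors,
                key=lambda d: haversine_m(r[lat_key], r[lon_key],
                                          d[lat_key], d[lon_key]),
            )
            r = {**r, value_key: nearest[value_key]}
        filled.append(r)
    return filled


# Made-up clusters: cluster 2 has no value and sits next to cluster 1
rows = [
    {"clusterid": 1, "lat": -6.00, "lon": 35.0, "af_bio_1": 24.0},
    {"clusterid": 2, "lat": -6.01, "lon": 35.0, "af_bio_1": None},
    {"clusterid": 3, "lat": -8.00, "lon": 36.0, "af_bio_1": 20.0},
]
filled = fill_missing_with_nearest(rows, "af_bio_1")
```

A brute-force search like this is adequate for a few hundred clusters; a spatial index would be the natural choice for much larger inputs.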
Quality checks#
Run summary statistics to generate minimum and maximum values, among others, and inspect them manually.
df = pd.read_csv(outcsv_file)
# Only use describe for continuous variables
cont_vars = [c for c in out_vars if 'sq' not in c]
print('-'*50)
print(' SUMMARY STATISTICS FOR BIOCLIMATIC VARS')
print('-'*50)
stats = df[cont_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
SUMMARY STATISTICS FOR BIOCLIMATIC VARS
--------------------------------------------------
mean min 50% 75% max
af_bio_1 23.346062 15.0 23.0 26.0 28.0
af_bio_8 23.964200 16.0 24.0 27.0 28.0
af_bio_12 1104.849642 522.0 1067.0 1218.5 1957.0
af_bio_13 235.575179 117.0 219.0 275.0 454.0
af_bio_16 560.789976 276.0 532.0 653.0 1054.0
----------------------------------------------------------------------
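Manual inspection can be complemented with a few automated consistency checks. By definition, precipitation of the wettest month cannot exceed that of the wettest quarter, which cannot exceed the annual total, and none of them can be negative. The sketch below applies these rules record by record; `check_bioclim_row` is a hypothetical helper and the sample records are made up, not taken from the survey output.

```python
def check_bioclim_row(row):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    if not (row["af_bio_13"] <= row["af_bio_16"] <= row["af_bio_12"]):
        problems.append("precipitation ordering violated")
    if min(row["af_bio_12"], row["af_bio_13"], row["af_bio_16"]) < 0:
        problems.append("negative precipitation")
    return problems


# Made-up records: the second one breaks the ordering rule
records = [
    {"clusterid": 1, "af_bio_12": 1067, "af_bio_13": 219, "af_bio_16": 532},
    {"clusterid": 2, "af_bio_12": 500, "af_bio_13": 300, "af_bio_16": 250},
]
report = {r["clusterid"]: check_bioclim_row(r) for r in records}
```

With pandas, the records could be obtained from the output file via `df.to_dict('records')`.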
Summary#
The following variables were generated.
ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required |
---|---|---|---|---|---|---|---|
1 | Climatology | Average annual temperature calculated from monthly climatology, multiplied by 10 (°C) | clim01 | Degrees Celsius | 0.008333 dd | WorldClim | Yes |
2 | Climatology | Average temperature of the wettest quarter, from monthly climatology, multiplied by 10 (°C) | clim02 | Degrees Celsius | 0.008333 dd | WorldClim | Yes |
3 | Climatology | Total annual precipitation | clim03 | mm | 0.008333 dd | WorldClim | Yes |
4 | Climatology | Precipitation of wettest month | clim04 | mm | 0.008333 dd | WorldClim | Yes |
5 | Climatology | Precipitation of wettest quarter | clim05 | mm | 0.008333 dd | WorldClim | Yes |