Climatology#
Author: Dunstan Matekenya
Affiliation: DECAT, The World Bank Group
Date: August 7, 2023
Background#
In this notebook, we generate climate geovariables for the National Panel Survey (NPS 2014-2015) in Tanzania. The following bioclimatic geovariables are computed:
- Annual mean temperature (af_bio_1)
- Mean temperature of wettest quarter (af_bio_8)
- Annual precipitation (af_bio_12)
- Precipitation of wettest month (af_bio_13)
- Precipitation of wettest quarter (af_bio_16)
The naming and definitions of the variables are based on WorldClim. For full variable metadata, including definitions of the metrics and other important information, see the Data-Cover-Page.
Input datasets#
The following datasets are used for all the variables.

- Data source: the WorldClim-Bioclimatic variables-30secs.
- Input data files: the files used in this notebook were downloaded from the site above and are kept in this OneDrive folder[1]. A quick way to inspect one of these rasters before extraction is sketched below.
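As a quick sanity check on the downloaded rasters, something like the following can be used to confirm the resolution and nodata value before extraction. This is an illustrative sketch only; the file path shown is an assumption based on the folder layout used later in this notebook.

# Illustrative sketch: inspect one WorldClim GeoTIFF before extraction.
# The path below is an assumed example based on the folder layout used later.
import rasterio

with rasterio.open("data/WORLDCLIM/af_bio_1_x/afbio1x.tif") as src:
    print(src.crs)     # coordinate reference system (WGS84 for WorldClim)
    print(src.res)     # pixel size, roughly 0.00833 degrees at 30 arc-seconds
    print(src.bounds)  # spatial extent of the raster
    print(src.nodata)  # nodata value to watch for during extraction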
Python setup#
The following packages are required for the processing in this notebook.
import warnings
warnings.simplefilter(action='ignore')
import sys
from pathlib import Path
import geopandas as gpd
import pandas as pd
# Add the Spatial-Anonymization package source directory to the Python path
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement
Setup input files#
Define paths to input data files as global variables using the pathlib Path class.
# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_CONS_EXP = DIR_DATA.joinpath('surveys', 'consumption_experiment')
DIR_OUTPUTS = DIR_CONS_EXP.joinpath('geovars', 'puf')
# =================
# GEOVARS DATA
# =================
DIR_CLIM_DATA = Path.cwd().parents[1].joinpath('data', 'WORLDCLIM')
# =============================
# SURVEY COORDINATES
# ============================
# Public/displaced GPS coordinates file
FILE_HH_PUB_COORDS = DIR_OUTPUTS.joinpath('cons-exp-pub-coords.csv')
# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
'geoBoundaries-TZA-ADM0.shp')
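The comment above notes that the geoBoundaries API can be used for country bounds instead of the local shapefile. A minimal, hedged sketch follows; the endpoint path and the 'gjDownloadURL' key are assumptions based on the current geoBoundaries API and may change.

# Illustrative sketch: fetch the TZA ADM0 boundary from the geoBoundaries API
# instead of the local shapefile. Endpoint and JSON key names are assumptions.
import requests

gb_url = 'https://www.geoboundaries.org/api/current/gbOpen/TZA/ADM0/'
gb_meta = requests.get(gb_url, timeout=60).json()
gdf_adm0 = gpd.read_file(gb_meta['gjDownloadURL'])  # key name assumed
print(gdf_adm0.total_bounds)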
Setup processing global variables#
Set values we will use often, such as the column names containing the latitude and longitude, as global variables.
# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'cons-exp2022'
# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = True
# Buffer distances in metres
BUFF_ZONE_DIST_URB = 2000
BUFF_ZONE_DIST_RUR = 5000
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub'
LON = 'lon_pub'
HH_ID = 'hhid'
CLUST_ID = 'clusterid'
TEMP_UNIQ_ID = 'clusterid'
URB_RURAL = 'clustertype'
URBAN_CODE = 'urban'
# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'af_bio_1': 'Average annual temperature multiplied by 10 (°C)',
              'af_bio_8': 'Average temperature of the wettest quarter multiplied by 10 (°C)',
              'af_bio_12': 'Total annual precipitation (mm)',
              'af_bio_13': 'Precipitation of wettest month (mm)',
              'af_bio_16': 'Precipitation of wettest quarter (mm)'
              }
# Flag any variable label that is 80 characters or longer
for name, label in VAR_LABELS.items():
    try:
        assert len(label) < 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)
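The extraction below relies on `utils.point_query_raster` from the Spatial-Anonymization package. Purely as an illustration of the buffer-and-zonal-statistics idea, and not that package's implementation, a minimal sketch using `rasterstats` might look like the following; the metric CRS (EPSG:32736, UTM zone 36S) and the use of `all_touched` are assumptions.

# Illustrative sketch only; NOT the implementation of
# spatial_anonymization.utils.point_query_raster.
from rasterstats import zonal_stats

def mean_within_buffer(gdf_points, raster_file, buff_dist_m, colname):
    """Buffer each point by buff_dist_m metres and average the raster cells inside."""
    # Buffer in a projected CRS so distances are in metres
    # (EPSG:32736, UTM zone 36S, is an assumption for mainland Tanzania),
    # then reproject back to WGS84 to match the raster.
    buffered = gdf_points.to_crs(32736).buffer(buff_dist_m).to_crs(4326)
    stats = zonal_stats(list(buffered), str(raster_file), stats=['mean'], all_touched=True)
    out = gdf_points[[TEMP_UNIQ_ID]].copy()
    out[colname] = [s['mean'] for s in stats]
    return out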
Compute bioclimatic variables#
def generate_variables(buff_zone=False, extraction_type='mean'):
    """
    Helper function to generate the bioclimatic variables.
    Edit this function manually to change which variables are extracted.

    Returns:
        Path to the saved output CSV and the list of variable names.
        The data are also saved to disk.
    """
# ===========================
# SETUP INPUT FILES
# ===========================
    # Using folders with suffix x because that's what Siobhan used
bioc_vars_nums = [1, 8, 12, 13, 16]
fpath_lst = {i: DIR_CLIM_DATA.joinpath('af_bio_{}_x'.format(i),
'afbio{}x.tif'.format(i)) for i in bioc_vars_nums}
# ===========================
# LOAD COORDINATES
# ===========================
df_pub_coords = pd.read_csv(FILE_HH_PUB_COORDS)
df_pub_coords = df_pub_coords.query('make_public == 1')
df_pub_coords2 = df_pub_coords.drop_duplicates(subset=[TEMP_UNIQ_ID])
df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON, URB_RURAL]]
gdf_coords = gpd.GeoDataFrame(df_coords,
geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)
    # Get the code used for rural clusters (any code that is not the urban code)
    rural = list(set(df_pub_coords[URB_RURAL].unique()) - {URBAN_CODE})
    assert len(rural) == 1, 'THERE SHOULD BE ONLY 2 CODES FOR URBAN/RURAL'
    rural = rural[0]
# ===========================
# COMPUTE
# ===========================
bioc_varname = {1:1, 8:2, 12:3, 13:4, 16:5}
out_vars = []
df_list = []
for num, file in fpath_lst.items():
        # Set column name from the raster folder name, e.g. 'af_bio_1_x' -> 'af_bio_1'
        # colname = 'clim0{}'.format(bioc_varname[num])
        colname = file.parts[-2][:-2]
out_vars.append(colname)
if buff_zone:
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: zonal statistic within buffer zone.')
print(' 2. Statistic: {}'.format(extraction_type))
print(' 3. Urban buffer zone: {}km'.format(BUFF_ZONE_DIST_URB/1000))
print(' 4. Rural buffer zone: {}km'.format(BUFF_ZONE_DIST_RUR/1000))
print("-"*60)
gdf_coords_urb = gdf_coords.query('{} == "{}"'.format(URB_RURAL, URBAN_CODE))
df_urb = utils.point_query_raster(in_raster=file, buffer=buff_zone, point_shp=gdf_coords_urb, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_URB,
xcol=LON, ycol=LAT)
gdf_coords_rur = gdf_coords.query('{} == "{}"'.format(URB_RURAL, rural))
df_rur = utils.point_query_raster(in_raster=file, buffer=buff_zone, point_shp=gdf_coords_rur, statcol_name=colname,
id_col=TEMP_UNIQ_ID, stat=extraction_type, buff_dist=BUFF_ZONE_DIST_RUR,
xcol=LON, ycol=LAT)
            df = pd.concat([df_urb, df_rur])
else:
print()
print("-"*65)
print(' Extracting variable -{}- with the following parameters'.format(colname))
print("-"*65)
print(' 1. Extraction type: point extraction.')
print(' 2. Statistic: N/A')
print(' 3. Urban buffer zone: N/A')
print(' 4. Rural buffer zone: N/A')
            df = utils.point_query_raster(in_raster=file, buffer=buff_zone,
                                          point_shp=gdf_coords, statcol_name=colname,
                                          id_col=TEMP_UNIQ_ID, stat=extraction_type,
                                          buff_dist=None,  # not used when buffer=False
                                          xcol=LON, ycol=LAT)
        if df[colname].isnull().sum() > 0:
num_missing = df[colname].isnull().sum()
print('Replacing {} missing values for {}'.format(num_missing, colname))
df2 = df_coords.merge(df, on=TEMP_UNIQ_ID)
df3 = utils.replace_missing_with_nearest(df=df2, target_col=colname, idcol=TEMP_UNIQ_ID, lon=LON, lat=LAT)
df = df3[[TEMP_UNIQ_ID, colname]]
assert df[colname].isnull().sum() == 0
df.set_index(TEMP_UNIQ_ID, inplace=True)
df_list.append(df)
    # Combine all variables and round values to whole numbers
    df_out = pd.concat(df_list, axis=1)
df_out = df_out.applymap(lambda x: int(round(x, 0)) if isinstance(x, (int, float)) else x)
# ===========================
# SAVE OUTPUTS
# ===========================
out_csv = DIR_OUTPUTS.joinpath('bioclimatic-{}.csv'.format(FILENAME_SUFFIX))
df_out.to_csv(out_csv)
# Save variable labels
out_csv_labels = DIR_OUTPUTS.joinpath('bioclimatic-{}-labels.csv'.format(FILENAME_SUFFIX))
df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
df_var_labels.index.name = 'var_name'
df_var_labels.rename(columns={0:'label'}, inplace=True)
df_var_labels.to_csv(out_csv_labels)
print('-'*30)
print('VARIABLES SAVED TO FILE BELOW')
print('-'*30)
print("/".join(out_csv.parts[6:]))
return out_csv, out_vars
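Missing extractions (for example, clusters whose buffer falls entirely on nodata cells) are filled by `utils.replace_missing_with_nearest`. As a hedged illustration of that idea only, and not the package's implementation, a nearest-neighbour fill could look like the sketch below; it uses plain lon/lat Euclidean distance, which is a rough approximation over short distances.

# Illustrative sketch only; NOT the implementation of
# spatial_anonymization.utils.replace_missing_with_nearest.
from scipy.spatial import cKDTree

def fill_with_nearest(df, target_col, lon_col, lat_col):
    """Fill NaNs in target_col with the value of the nearest non-missing point."""
    known = df[df[target_col].notna()]
    missing = df[df[target_col].isna()]
    if missing.empty or known.empty:
        return df
    # Build a KD-tree on the non-missing points and query the nearest one
    # for each missing point (Euclidean distance on lon/lat).
    tree = cKDTree(known[[lon_col, lat_col]].to_numpy())
    _, idx = tree.query(missing[[lon_col, lat_col]].to_numpy(), k=1)
    out = df.copy()
    out.loc[missing.index, target_col] = known[target_col].to_numpy()[idx]
    return out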
# Generate the variables
# df = generate_variables(buff_zone=BUFFER)
outcsv_file, out_vars = generate_variables(buff_zone=BUFFER)
-----------------------------------------------------------------
Extracting variable -af_bio_1- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_8- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_12- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_13- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
-----------------------------------------------------------------
Extracting variable -af_bio_16- with the following parameters
-----------------------------------------------------------------
1. Extraction type: zonal statistic within buffer zone.
2. Statistic: mean
3. Urban buffer zone: 2.0km
4. Rural buffer zone: 5.0km
------------------------------------------------------------
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/consumption_experiment/geovars/puf/bioclimatic-cons-exp2022.csv
Quality checks#
Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.
df = pd.read_csv(outcsv_file)
# Only use describe for continuous variables
cont_vars = [c for c in out_vars if 'sq' not in c]
print('-'*50)
print(' SUMMARY STATISTICS FOR BIOCLIMATIC VARS')
print('-'*50)
stats = df[cont_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
SUMMARY STATISTICS FOR BIOCLIMATIC VARS
--------------------------------------------------
mean min 50% 75% max
af_bio_1 223.359155 161.0 224.5 238.75 274.0
af_bio_8 228.485915 168.0 227.0 247.75 281.0
af_bio_12 1045.366197 466.0 1023.5 1159.25 2227.0
af_bio_13 213.126761 114.0 188.5 253.00 664.0
af_bio_16 519.147887 236.0 477.5 579.75 1416.0
----------------------------------------------------------------------
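Beyond eyeballing the table above, a few automated plausibility checks can catch unit or nodata problems early. The sketch below uses loose, illustrative thresholds (recall that the temperature variables are stored as °C multiplied by 10); the precipitation inequalities follow directly from the variable definitions.

# Loose plausibility checks; the numeric thresholds are illustrative assumptions.
assert (df['af_bio_12'] >= 0).all(), 'Negative annual precipitation'
assert (df['af_bio_16'] <= df['af_bio_12']).all(), 'Wettest quarter exceeds annual total'
assert (df['af_bio_13'] <= df['af_bio_16']).all(), 'Wettest month exceeds wettest quarter'
# Temperatures are stored multiplied by 10, so rescale before range-checking
assert df['af_bio_1'].div(10).between(5, 40).all(), 'Implausible annual mean temperature'
assert df['af_bio_8'].div(10).between(5, 45).all(), 'Implausible wettest-quarter temperature'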
Summary#
The following variables were generated.
ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required
---|---|---|---|---|---|---|---
1 | Climatology | Average annual temperature calculated from monthly climatology, multiplied by 10 (°C) | clim01 | Degrees Celsius | 0.008333 dd |  | Yes
2 | Climatology | Average temperature of the wettest quarter, from monthly climatology, multiplied by 10 (°C) | clim02 | Degrees Celsius | 0.008333 dd |  | Yes
3 | Climatology | Total annual precipitation | clim03 | mm | 0.008333 dd |  | Yes
4 | Climatology | Precipitation of wettest month | clim04 | mm | 0.008333 dd |  | Yes
5 | Climatology | Precipitation of wettest quarter | clim05 | mm | 0.008333 dd |  | Yes