Crop Season#
Author: Dunstan Matekenya
Affiliation: DECAT, The World Bank Group
Date: August 1, 2023
Background#
In this notebook, we generate geovariables in the crop-season category for the National Panel Survey (NPS 2020-2021) in Tanzania. The following geovariables are computed:

- Average 12-month total rainfall (mm) for July-June
- Average total rainfall in wettest quarter (mm) within 12-month periods from July-June
- Average start of wettest quarter in dekads 1-36, where first dekad of July = 1
- 12-month total rainfall (mm) in July-June, starting July 2019
- Total rainfall in wettest quarter (mm) within 12-month periods starting July 2019
- Start of wettest quarter in dekads 1-36, where first dekad of July 2019 = 1
- 12-month total rainfall (mm) in July-June, starting July 2020
- Total rainfall in wettest quarter (mm) within 12-month periods starting July 2020
- Start of wettest quarter in dekads 1-36, where first dekad of July 2020 = 1

For full variable metadata, including definitions of the metrics and other important information, see the Data-Cover-Page.
Input datasets#
The following datasets were used for all the variables.

- Data source: USGS FEWS NET Data Portal (dekadal rainfall estimates, RFE).
- Input data files: the files used in this notebook were downloaded from the portal above and kept in this OneDrive folder[1].
Python setup#
The following packages are required for the processing in this notebook.
import warnings
warnings.simplefilter(action='ignore')
import sys
import itertools
from pathlib import Path
from datetime import datetime
import geopandas as gpd
import pandas as pd
import numpy as np
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement
Setup input files#
Define paths to input data files as global variables using pathlib's Path.
# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')
# =================
# GEOVARS DATA
# =================
DIR_RFE = Path.cwd().parents[1].joinpath('data', 'RFE', 'dekad')
# =============================
# SURVEY COORDINATES
# ============================
# Public/displaced GPS coordinates file
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')
# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
'geoBoundaries-TZA-ADM0.shp')
Setup processing global variables#
Set frequently used values, such as the column names containing latitude and longitude, as global variables.
# ====================
# GENERAL
# =====================
DEBUG = False
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'
# When survey was conducted
SURVEY_CALENDAR_YR = 2021
DATA_EXTRACT_YEARS = ['2019-2020', '2020-2021']
# The crop year starts on July 1 (calendar dekad 19)
# instead of Jan 1 (calendar dekad 1)
DEKAD_YEAR_START = 19
DEKADS = ['d{}'.format(str(i).zfill(2)) for i in range(DEKAD_YEAR_START, 37)] + \
    ['d{}'.format(str(i).zfill(2)) for i in range(1, DEKAD_YEAR_START)]
if DEBUG:
    # Preview the 28 candidate quarters (each 9 dekads long)
    for i in range(28):
        q = DEKADS[i:i + 9]
        assert len(q) == 9
        print('Quarter-{}: {}'.format(i + 1, q))
# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = False
BUFF_ZONE_DIST = 1000
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5'
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'
# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'anntot_avg': 'Average 12-month total rainfall for July-June',
              'wetQ_avg': 'Average total rainfall in wettest quarter within 12-month periods from July-June',
              'wetQ_avgstart': 'Average start of wettest quarter in dekads 1-36, where first dekad of July =1',
              'anntot_2019_2020': '12-month total rainfall in July-June, starting July 2019',
              'wetQ_2019_2020': 'Total rainfall in wettest quarter within 12-month periods starting July 2019',
              'wetQstart_2019_2020': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1',
              'anntot_2020_2021': '12-month total rainfall in July-June, starting July 2020',
              'wetQ_2020_2021': 'Total rainfall in wettest quarter within 12-month periods starting July 2020',
              'wetQstart_2020_2021': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1'
              }
for name, label in VAR_LABELS.items():
    try:
        assert len(label) <= 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)
DEKADS
['d19', 'd20', 'd21', 'd22', 'd23', 'd24', 'd25', 'd26', 'd27', 'd28', 'd29',
 'd30', 'd31', 'd32', 'd33', 'd34', 'd35', 'd36', 'd01', 'd02', 'd03', 'd04',
 'd05', 'd06', 'd07', 'd08', 'd09', 'd10', 'd11', 'd12', 'd13', 'd14', 'd15',
 'd16', 'd17', 'd18']
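The ordering above simply rotates the calendar so that the crop year starts in July. As an aid for reading July-based dekad indices, the sketch below (a hypothetical helper, not part of the pipeline) maps such an index back to a calendar month, assuming the standard convention of three dekads per month:

def dekad_to_month(dekad, year_start_month=7):
    """Map a July-start dekad index (1-36) to (calendar_month, dekad_of_month).

    Hypothetical helper for interpretation only; assumes three dekads
    per month (days 1-10, 11-20, 21-end).
    """
    assert 1 <= dekad <= 36
    month_offset, dekad_in_month = divmod(dekad - 1, 3)
    month = (year_start_month - 1 + month_offset) % 12 + 1
    return month, dekad_in_month + 1

assert dekad_to_month(1) == (7, 1)    # first dekad of July
assert dekad_to_month(19) == (1, 1)   # first dekad of January
assert dekad_to_month(36) == (6, 3)   # last dekad of June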
Define helper functions for computations#
def get_year_and_tif_list(stop_year, year_start=1):
    """
    Retrieve the list of RFE GeoTIFF filepaths, grouped by year.
    """
    # ===================================
    # RETRIEVE FILES BY CALENDAR YEAR
    # ===================================
    rfe_tifs = [i for i in DIR_RFE.iterdir() if i.suffix == ".tif"]
    yrs = {}
    for tif in rfe_tifs:
        yr = tif.parts[-1][4:8]
        if yrs.get(yr, None):
            items = yrs[yr]
            items.append(tif)
            yrs[yr] = items
        else:
            yrs[yr] = [tif]
    # Keep only the years we need, based on the survey-year
    # global variable
    sorted_yrs = sorted(list(yrs.keys()))
    stop_idx = sorted_yrs.index(str(stop_year))
    calend_yrs = dict(sorted(yrs.items())[:stop_idx + 1])
    # ===================================
    # DEAL WITH CROSS-CALENDAR YEARS
    # ===================================
    if year_start != 1:
        cross_calend_yrs = {}
        files_lst = sorted(list(itertools.chain.from_iterable(list(calend_yrs.values()))))
        yr_start_id = int(list(calend_yrs.keys())[0])
        while year_start < len(files_lst):
            yr_end_id = yr_start_id + 1
            yr_id = '{}-{}'.format(yr_start_id, yr_end_id)
            yr_start_id += 1
            # 36 consecutive dekads starting at the chosen dekad
            files = files_lst[year_start - 1:year_start + 35]
            cross_calend_yrs[yr_id] = files
            year_start += 36
        return cross_calend_yrs
    else:
        return calend_yrs
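The year (and, in generate_yr_rfe_stats() below, the dekad) is parsed positionally from the filename. A hypothetical filename chosen only to illustrate that slicing, assuming characters 4-7 encode the year and 8-9 the dekad:

# Assumed pattern: <4-char prefix><year><dekad>.tif; the real files
# follow whatever naming the FEWS NET download uses.
name = 'tza_201905.tif'
print(name[4:8], name[8:10])   # -> 2019 05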
def get_wettest_quarter(df_yr, dekads=None):
    """
    Given one year of dekadal data, find the quarter (9 consecutive
    dekads) with maximum rainfall and the start dekad of that quarter.
    """
    dfq = pd.DataFrame(index=df_yr.index)
    quarter_start = {}
    if dekads:
        for i in range(28):
            q = dekads[i:i + 9]
            assert len(q) == 9
            colname = 'q{}'.format(i + 1)
            quarter_start[colname] = int(q[0][1:])
            dfq[colname] = df_yr[q].sum(axis=1)
    else:
        for i in range(1, 29):
            q = ['d{}'.format(str(j).zfill(2)) for j in range(i, i + 9)]
            assert len(q) == 9
            colname = 'q{}'.format(i)
            quarter_start[colname] = int(i)
            dfq[colname] = df_yr[q].sum(axis=1)
    dfq['qrt_max_val'] = dfq[list(quarter_start)].max(axis=1)
    dfq['qrt_max_id'] = dfq[list(quarter_start)].idxmax(axis=1)
    dfq['qrt_max_start'] = dfq.qrt_max_id.apply(lambda x: quarter_start[x])
    return dfq[['qrt_max_val', 'qrt_max_id', 'qrt_max_start']]
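As a quick sanity check, a synthetic one-row frame with rain concentrated in dekads 10-18 should yield a wettest quarter starting at dekad 10 (a sketch only). Note that the search considers only quarters starting at positions 1-28, so it never wraps past the last dekad; this is why the year is anchored at July via DEKAD_YEAR_START, keeping the wet season inside one 36-dekad window.

toy = pd.DataFrame(
    {'d{}'.format(str(d).zfill(2)): [100.0 if 10 <= d <= 18 else 1.0]
     for d in range(1, 37)},
    index=['loc-1'])
print(get_wettest_quarter(toy))
# Expected: qrt_max_val = 900.0, qrt_max_id = 'q10', qrt_max_start = 10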
def generate_yr_rfe_stats(tif_list, yr, gdf_coords, uniq_id):
    """
    Generate dekadal RFE stats for a single year: annual total,
    maximum dekadal value, and wettest-quarter statistics.
    """
    df_lst = []
    for file in tif_list:
        # Dekad number is in filename characters 8-9
        colname = 'd{}'.format(file.parts[-1][8:10])
        df = utils.point_query_raster(in_raster=file, buffer=BUFFER,
                                      point_shp=gdf_coords, statcol_name=colname,
                                      id_col=uniq_id, xcol=LON, ycol=LAT)
        df.set_index(uniq_id, inplace=True)
        df_lst.append(df)
    df = pd.concat(df_lst, axis=1)
    dekad_cols = sorted(df.columns)
    df = df[dekad_cols]
    # Annual total and max dekadal value
    df.loc[:, 'total'] = df.sum(axis=1)
    df.loc[:, 'max'] = df[dekad_cols].max(axis=1)
    # Wettest quarter
    dfq = get_wettest_quarter(df_yr=df, dekads=DEKADS)
    # Merge
    df2 = pd.concat([dfq, df[['total', 'max']]], axis=1)
    df2['yr'] = yr
    return df2
def generate_rfe_stats(yr_tif_dict, pts_shp, id_col):
    """
    Helper to loop through all years, run generate_yr_rfe_stats(),
    and add long-run averages across years.
    """
    # ========================
    # CALCULATE YEARLY STATS
    # ========================
    df_yr_list = []
    # Loop over years and calculate stats for each year
    cnt_yrs = 0
    for yr, files in yr_tif_dict.items():
        if len(files) != 36:
            print("SKIPPING {} BECAUSE WE DON'T HAVE ALL DEKADS".format(yr))
            continue
        df_yr = generate_yr_rfe_stats(files, yr, pts_shp, uniq_id=id_col)
        df_yr_list.append(df_yr)
        cnt_yrs += 1
        if DEBUG:
            if cnt_yrs == 5:
                break
    # Concatenate into a single dataframe
    df = pd.concat(df_yr_list, axis=0)
    df = df[['yr', 'total', 'max', 'qrt_max_val', 'qrt_max_start']]
    assert len(df) == len(pts_shp) * cnt_yrs
    # ==================================
    # CALCULATE AVERAGES ACROSS YEARS
    # ==================================
    df_avg = pd.DataFrame(index=df.index.drop_duplicates())
    stat_vars = ['total', 'max', 'qrt_max_val', 'qrt_max_start']
    print('Long average years: {}'.format(df.yr.unique()))
    for idx, row in df_avg.iterrows():
        data = df.loc[idx]
        for var in stat_vars:
            new_colname = 'avg_{}'.format(var)
            avg = data[var].mean()
            df_avg.loc[idx, new_colname] = int(avg)
    # Add the averages to the original dataframe
    df2 = df.merge(df_avg, left_index=True, right_index=True)
    assert len(df2) == len(df)
    return df2
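The per-row averaging loop above is easy to read but slow for many locations. A vectorized equivalent, shown below as a sketch on toy data (assuming the frame is indexed by the unique-location id, as in the function), is a groupby-mean with an avg_ prefix:

import pandas as pd

stat_cols = ['total', 'max', 'qrt_max_val', 'qrt_max_start']
# Toy frame: one location observed in two years
toy = pd.DataFrame(
    {'total': [1000, 1200], 'max': [90, 110],
     'qrt_max_val': [500, 560], 'qrt_max_start': [19, 21]},
    index=['loc-1', 'loc-1'])
# Mean across years per location, truncated to int like the loop above
df_avg = toy.groupby(level=0)[stat_cols].mean().astype(int).add_prefix('avg_')
print(df_avg)
# avg_total=1100, avg_max=100, avg_qrt_max_val=530, avg_qrt_max_start=20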
Generate stats#
def generate_stats(target_yrs):
    """
    Helper function to put everything together and generate stats.
    """
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
    df_pub_coords = df_pub_coords.query('make_public == 1')
    df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(
        lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
    df_pub_coords2 = df_pub_coords.drop_duplicates(subset=[TEMP_UNIQ_ID])
    df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON]]
    gdf_coords = gpd.GeoDataFrame(
        df_coords,
        geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]), crs=4326)
    # ===========================
    # GENERATE ALL YEARS STATS
    # ===========================
    start = datetime.now()
    files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR,
                                      year_start=DEKAD_YEAR_START)
    df = generate_rfe_stats(files_yrs, pts_shp=gdf_coords, id_col=TEMP_UNIQ_ID)
    end = datetime.now()
    time_taken = (end - start).total_seconds() / 60
    print('Time taken: {} minutes'.format(int(time_taken)))
    # ================================
    # EXTRACT DATA FOR TARGET YEARS
    # ================================
    df_target_yrs = df.query('yr in @target_yrs')
    required_vars = {'qrt_max_start': 'wetQstart', 'avg_total': 'anntot_avg',
                     'avg_qrt_max_start': 'wetQ_avgstart', 'total': 'anntot',
                     'avg_qrt_max_val': 'wetQ_avg', 'qrt_max_val': 'wetQ'}
    df_target_yrs.rename(columns=required_vars, inplace=True)
    df_target_yrs = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart',
                                   'anntot_avg', 'wetQ_avg', 'wetQ_avgstart']]
    # ===========================================
    # GENERATE YEAR-BASED VARIABLES (E.G.,
    # ANNTOT_2011-2012, ANNTOT_2012-2013)
    # ===========================================
    df_yr_based = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart']]
    df_tmp = pd.DataFrame(index=df_yr_based.index.drop_duplicates())
    yr_based_vars = df_yr_based.columns.difference(['yr'])
    # There is probably a shorter pivot approach (see the sketch after
    # this function), but the explicit loop below is easy to follow
    for idx, row in df_tmp.iterrows():
        df_idx = df_yr_based.loc[idx]
        for yr in target_yrs:
            for var in yr_based_vars:
                val = df_idx.query('yr == @yr').iloc[0][var]
                new_col_name = '{}_{}'.format(var, yr)
                df_tmp.loc[idx, new_col_name] = val
    # ===========================================
    # PUT EVERYTHING BACK TOGETHER
    # ===========================================
    # We can drop everything else and keep only unique locations
    df_target_yrs2 = df_target_yrs[~df_target_yrs.index.duplicated(keep='first')]
    df = df_target_yrs2.merge(df_tmp, left_index=True, right_index=True,
                              how='inner')
    assert len(df) == len(gdf_coords)
    # Drop columns we don't need; the 2020-2021 columns are dropped
    # because we don't have them in the pheno file
    drop_cols2match_pheno = ['anntot_2020-2021', 'wetQ_2020-2021',
                             'wetQstart_2020-2021']
    drop_cols_not_needed = ['yr', 'anntot', 'wetQ', 'wetQstart']
    drop_cols = drop_cols2match_pheno + drop_cols_not_needed
    df.drop(columns=drop_cols, inplace=True)
    # Remove "-" from column names
    df.columns = [c.replace("-", "_") for c in df.columns]
    # ===========================================
    # SAVE THE OUTPUTS
    # ===========================================
    out_vars = list(df.columns.difference([TEMP_UNIQ_ID, LON, LAT]))
    out_csv = DIR_OUTPUTS.joinpath('crop-season-{}.csv'.format(FILENAME_SUFFIX))
    df.to_csv(out_csv)
    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath(
        'crop-season-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()),
                                 index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0: 'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    print('-' * 30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-' * 30)
    print("/".join(out_csv.parts[6:]))
    return out_csv, out_vars
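For reference, the shorter pivot alluded to in the comment above could look like this (a sketch on toy data; the column flattening mirrors the loop's {var}_{yr} naming):

import pandas as pd

toy = pd.DataFrame(
    {'yr': ['2019-2020', '2020-2021'],
     'anntot': [1285.0, 1100.0],
     'wetQ': [590.0, 540.0],
     'wetQstart': [29, 20]},
    index=pd.Index(['loc-1', 'loc-1'], name='uniq_loc'))
# One column per (variable, year) pair, matching the loop's output
wide = toy.pivot(columns='yr', values=['anntot', 'wetQ', 'wetQstart'])
wide.columns = ['{}_{}'.format(var, yr) for var, yr in wide.columns]
print(wide.columns.tolist())
# ['anntot_2019-2020', 'anntot_2020-2021', 'wetQ_2019-2020', ...]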
# Generate the variables
out_csv, out_vars = generate_stats(target_yrs=DATA_EXTRACT_YEARS)
Quality checks#
Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.
out_csv = DIR_OUTPUTS.joinpath('crop-season-{}.csv'.format(FILENAME_SUFFIX))
df = pd.read_csv(out_csv)
df
|      | uniq_loc | anntot_avg | wetQ_avg | wetQ_avgstart | anntot_2019_2020 | wetQ_2019_2020 | wetQstart_2019_2020 |
|------|----------|------------|----------|---------------|------------------|----------------|---------------------|
| 0    | 29.602591737997177--4.912058073441596 | 1012.0 | 538.0 | 20.0 | 1285.0 | 590.0 | 29.0 |
| 1    | 29.6035629093--4.80935986272 | 1117.0 | 554.0 | 15.0 | 1403.0 | 648.0 | 4.0 |
| 2    | 29.61972179903359--4.842254030081743 | 1099.0 | 550.0 | 16.0 | 1285.0 | 590.0 | 29.0 |
| 3    | 29.635780889348787--4.866383736927592 | 1012.0 | 538.0 | 20.0 | 1285.0 | 590.0 | 29.0 |
| 4    | 29.637410106--4.89393190766 | 1012.0 | 538.0 | 20.0 | 1285.0 | 590.0 | 29.0 |
| ...  | ... | ... | ... | ... | ... | ... | ... |
| 1280 | 40.19491998234548--10.288644436052715 | 969.0 | 664.0 | 13.0 | 1045.0 | 790.0 | 36.0 |
| 1281 | 40.2248285482--10.2230680537 | 945.0 | 648.0 | 12.0 | 1084.0 | 808.0 | 36.0 |
| 1282 | 40.25452621839283--10.5583863385372 | 907.0 | 639.0 | 10.0 | 1041.0 | 787.0 | 36.0 |
| 1283 | 40.2787369628--10.5425273253 | 925.0 | 657.0 | 10.0 | 1041.0 | 787.0 | 36.0 |
| 1284 | 40.29176355850665--10.52910817196854 | 925.0 | 657.0 | 10.0 | 1041.0 | 787.0 | 36.0 |

1285 rows × 7 columns
# Only use describe for continuous variables
print('-'*50)
print(' SUMMARY STATISTICS FOR CROP SEASON VARS')
print('-'*50)
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
Summary#
The following variables were generated.

| ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required | Reference period |
|----|-------|----------------------|-------------|-------|------------|--------------|----------|------------------|
| 1 | Crop Season | Average 12-month total rainfall for July-June | anntot_avg | mm | 0.008333 dd | USGS FEWS NET | Yes | 2001-2021 |
| 2 | Crop Season | Average total rainfall in wettest quarter within 12-month periods from July-June | wetQ_avg | mm | 0.008333 dd | USGS FEWS NET | Yes | 2001-2021 |
| 3 | Crop Season | Average start of wettest quarter in dekads 1-36, where first dekad of July =1 | wetQ_avgstart | dekads | 0.008333 dd | USGS FEWS NET | Yes | 2001-2021 |
| 4 | Crop Season | 12-month total rainfall in July-June, starting July 2019 | anntot_2019_2020 | mm | 0.008333 dd | USGS FEWS NET | Yes | 2019-2020 |
| 5 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2019 | wetQ_2019_2020 | mm | 0.008333 dd | USGS FEWS NET | Yes | 2019-2020 |
| 6 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1 | wetQstart_2019_2020 | dekads | 0.008333 dd | USGS FEWS NET | Yes | 2019-2020 |
| 7 | Crop Season | 12-month total rainfall in July-June, starting July 2020 | anntot_2020_2021 | mm | 0.008333 dd | USGS FEWS NET | Yes | 2020-2021 |
| 8 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2020 | wetQ_2020_2021 | mm | 0.008333 dd | USGS FEWS NET | Yes | 2020-2021 |
| 9 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1 | wetQstart_2020_2021 | dekads | 0.008333 dd | USGS FEWS NET | Yes | 2020-2021 |