Crop Season#


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: August 7, 2023

Background#

In this notebook, we generate crop-season geovariables for the National Panel Survey (NPS 2020-2021) in Tanzania. The following geovariables are computed:

  1. Average 12-month total rainfall (mm) for July-June

  2. Average total rainfall in wettest quarter (mm) within 12-month periods from July-June

  3. Average start of wettest quarter in dekads 1-36, where the first dekad of July = 1

  4. 12-month total rainfall (mm) in July-June, starting July 2019

  5. Total rainfall in wettest quarter (mm) within 12-month periods starting July 2019

  6. Start of wettest quarter in dekads 1-36, where the first dekad of July 2019 = 1

  7. 12-month total rainfall (mm) in July-June, starting July 2020

  8. Total rainfall in wettest quarter (mm) within 12-month periods starting July 2020

  9. Start of wettest quarter in dekads 1-36, where the first dekad of July 2020 = 1

For full variable metadata, including definitions of the metrics and other important information, see the Data-Cover-Page.
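Note that the dekad numbering above follows the crop year (July start) rather than the calendar year. The small helper below is purely illustrative and not part of the pipeline; it shows how a calendar dekad (1-36, with the first dekad of January = 1) maps to its crop-year position, assuming 36 dekads per year, three per month.

# Illustrative only: map a calendar dekad to the July-start crop-year dekad,
# where dekad 19 is the first dekad of July (6 months * 3 dekads + 1).
def crop_year_dekad(calendar_dekad, year_start=19):
    return (calendar_dekad - year_start) % 36 + 1

assert crop_year_dekad(19) == 1   # first dekad of July
assert crop_year_dekad(36) == 18  # last dekad of December
assert crop_year_dekad(1) == 19   # first dekad of January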

Input datasets#

The following datasets were used for all the variables.

  • Data source: USGS FEWS NET Data Portal

  • Input data files: the data files used in this notebook were downloaded from the source above and are kept in this OneDrive folder[1]

1. ^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import sys
import itertools
from pathlib import Path
from datetime import datetime

import geopandas as gpd
import pandas as pd
import numpy as np

# Make the local Spatial-Anonymization package importable
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to input data files as global variables using the pathlib Path class.

# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_CONS_EXP = DIR_DATA.joinpath('surveys', 'consumption_experiment')
DIR_OUTPUTS = DIR_CONS_EXP.joinpath('geovars', 'puf')

# =================
# GEOVARS DATA
# =================
DIR_RFE = Path.cwd().parents[1].joinpath('data', 'RFE', 'dekad')

# =============================
# SURVEY COORDINATES 
# ============================
# Public (displaced) GPS coordinates file
FILE_HH_PUB_COORDS = DIR_OUTPUTS.joinpath('cons-exp-pub-coords.csv')

# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')

Setup processing global variables#

For variables we will use often, such as the names of the latitude and longitude columns, set them as global variables.

# ====================
# GENERAL
# =====================
DEBUG = False
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'cons-exp2022'
# When survey was conducted
SURVEY_CALENDAR_YR = 2022
DATA_EXTRACT_YEARS = ['2021-2022']
# Year starts on July 1 (dekad-19) instead of
# Jan-1 (dekad01)
DEKAD_YEAR_START = 19
# Dekad labels in crop-year order: d19..d36 (Jul-Dec), then d01..d18 (Jan-Jun)
DEKADS = ['d{}'.format(str(i).zfill(2)) for i in range(DEKAD_YEAR_START, 37)] + \
         ['d{}'.format(str(i).zfill(2)) for i in range(1, DEKAD_YEAR_START)]
if DEBUG:
    for i in range(28):
        q = [DEKADS[j] for j in range(i, i+9)]
        assert len(q) == 9
        print('Quarter-{}: {}'.format(i+1, q))

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = False
BUFF_ZONE_DIST = 1000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub' 
LON = 'lon_pub'
HH_ID = 'hhid'
CLUST_ID = 'clusterid'
TEMP_UNIQ_ID = 'clusterid'
URB_RURAL = 'clustertype'
URBAN_CODE = 'urban'

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'anntot_avg': 'Average 12-month total rainfall for July-June',
    'wetQ_avg': 'Average total rainfall in wettest quarter within 12-month periods from July-June',
    'wetQ_avgstart': 'Average start of wettest quarter in dekads 1-36, where first dekad of July =1',
    'anntot_2019_2020': '12-month total rainfall in July-June, starting July 2019',
    'wetQ_2019_2020': 'Total rainfall in wettest quarter within 12-month periods starting July 2019',
    'wetQstart_2019_2020': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1',
    'anntot_2020_2021': '12-month total rainfall in July-June, starting July 2020',
    'wetQ_2020_2021': 'Total rainfall in wettest quarter within 12-month periods starting July 2020',
    'wetQstart_2020_2021': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1'
    }

# Variable labels should be at most 80 characters
for name, label in VAR_LABELS.items():
    if len(label) > 80:
        print('LABEL TOO LONG: {}'.format(name))

Define helper functions for computations#

def get_year_and_tif_list(stop_year, year_start=1):
    """
    Group the dekadal RFE GeoTIFF filepaths by year, up to stop_year.
    When year_start != 1, regroup the files into cross-calendar
    12-month spans (e.g., July-June) of 36 dekads each.
    """
    # ===================================
    # RETRIEVE FILES BY CALENDAR YEAR
    # ===================================
    rfe_tifs = [i for i in DIR_RFE.iterdir() if i.suffix == ".tif"]
    yrs = {}
    for tif in rfe_tifs:
        # Characters 4-7 of the filename encode the calendar year
        yr = tif.parts[-1][4:8]
        yrs.setdefault(yr, []).append(tif)

    # Keep only the years we need based on the survey-year
    # global variable (SURVEY_CALENDAR_YR)
    sorted_yrs = sorted(list(yrs.keys()))
    stop_idx = sorted_yrs.index(str(stop_year))
    calend_yrs = dict(sorted(yrs.items())[:stop_idx+1])

    # ===================================
    # DEAL WITH CROSS-CALENDAR YEARS
    # ===================================
    if year_start != 1:
        cross_calend_yrs = {}
        files_lst = sorted(list(itertools.chain.from_iterable(list(calend_yrs.values()))))

        yr_start_id = int(list(calend_yrs.keys())[0])
        while year_start < len(files_lst):
            yr_end_id = yr_start_id + 1
            yr_id = '{}-{}'.format(yr_start_id, yr_end_id)
            yr_start_id += 1
            # Each cross-calendar year spans 36 consecutive dekad files
            files = files_lst[year_start-1:year_start+35]
            cross_calend_yrs[yr_id] = files
            year_start += 36
        return cross_calend_yrs
    else:
        return calend_yrs
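To sanity-check what get_year_and_tif_list() returns, an illustrative snippet (not part of the original pipeline) can print the span keys and file counts:

files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR, year_start=DEKAD_YEAR_START)
# Keys are 12-month spans such as '2001-2002'; complete spans hold 36 dekad files
for yr_id, files in list(files_yrs.items())[:3]:
    print(yr_id, len(files))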
def get_wettest_quarter(df_yr, dekads=None):
    """
    Given one year of dekadal rainfall data, find the quarter
    (9 consecutive dekads) with maximum rainfall and the start
    dekad of that quarter.
    """
    dfq = pd.DataFrame(index=df_yr.index)
    quarter_start = {}
    if dekads:
        # Slide a 9-dekad (3-month) window over the crop-year dekads
        for i in range(28):
            q = [dekads[j] for j in range(i, i+9)]
            assert len(q) == 9
            colname = 'q{}'.format(i+1)
            quarter_start[colname] = int(q[0][1:])
            dfq[colname] = df_yr[q].sum(axis=1)
    else:
        # Calendar-year dekad columns d01..d36
        for i in range(1, 29):
            q = ['d{}'.format(str(j).zfill(2)) for j in range(i, i+9)]
            assert len(q) == 9
            colname = 'q{}'.format(i)
            quarter_start[colname] = int(i)
            dfq[colname] = df_yr[q].sum(axis=1)

    # Restrict max/idxmax to the quarter columns only
    qcols = list(quarter_start.keys())
    dfq['qrt_max_val'] = dfq[qcols].max(axis=1)
    dfq['qrt_max_id'] = dfq[qcols].idxmax(axis=1)
    dfq['qrt_max_start'] = dfq.qrt_max_id.apply(lambda x: quarter_start[x])

    return dfq[['qrt_max_val', 'qrt_max_id', 'qrt_max_start']]
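A minimal illustration of what get_wettest_quarter() returns, using synthetic data (a toy example, not part of the pipeline):

# Two fake locations with 36 calendar-year dekad columns d01..d36
cols = ['d{}'.format(str(i).zfill(2)) for i in range(1, 37)]
toy = pd.DataFrame(np.random.default_rng(1).integers(0, 50, size=(2, 36)),
                   columns=cols, index=['loc_a', 'loc_b'])
# One row per location: quarter total, quarter id, and start dekad
print(get_wettest_quarter(toy))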
def generate_yr_rfe_stats(tif_list, yr, gdf_coords, uniq_id):
    """
    Generate RFE stats for a single year: extract the dekadal rainfall
    value at each point, then compute totals and wettest-quarter stats.
    """
    df_lst = []
    for file in tif_list:
        # Characters 8-9 of the filename encode the dekad number
        colname = 'd{}'.format(file.parts[-1][8:10])
        df = utils.point_query_raster(in_raster=file, buffer=BUFFER,
                                      point_shp=gdf_coords, statcol_name=colname,
                                      id_col=uniq_id, xcol=LON, ycol=LAT)
        df.set_index(uniq_id, inplace=True)
        df_lst.append(df)

    df = pd.concat(df_lst, axis=1)
    dekad_cols = sorted(df.columns)
    df = df[dekad_cols]

    # Annual total and max dekadal value
    df.loc[:,'total'] = df.sum(axis=1)
    df.loc[:,'max'] = df[dekad_cols].max(axis=1)

    # Wettest quarter
    dfq = get_wettest_quarter(df_yr=df, dekads=DEKADS)

    # merge 
    df2 = pd.concat([dfq, df[['total', 'max']]], axis=1)
    df2['yr'] = yr

    return df2
def generate_rfe_stats(yr_tif_dict, pts_shp, id_col):
    """
    Helper to loop through all years and run generate_yr_rfe_stats()
    """
    # ========================
    # CALCULATE YEARLY STATS
    # ========================
    df_yr_list = []
    # Loop years and calculate stats for that year
    cnt_yrs = 0
    for yr, files in yr_tif_dict.items():
        if len(files) != 36:
            print('SKIPPING {} BECAUSE NOT ALL DEKADS ARE AVAILABLE'.format(yr))
            continue
        df_yr = generate_yr_rfe_stats(files, yr, pts_shp, uniq_id=id_col)
        df_yr_list.append(df_yr)
        cnt_yrs += 1
        if DEBUG:
            if cnt_yrs == 5:
                break
    
    # Concatenate into a single dataframe
    df = pd.concat(df_yr_list, axis=0)
    df = df[['yr', 'total', 'max', 'qrt_max_val', 'qrt_max_start']]
    assert len(df) == len(pts_shp)*cnt_yrs
    
    # ==================================
    # CALCULATE AVERAGES ACROSS YEARS
    # ==================================
    # For each location, average each stat across all available years
    df_avg = pd.DataFrame(index=df.index.drop_duplicates())
    stat_vars = ['total', 'max', 'qrt_max_val', 'qrt_max_start']
    print('Long average years: {}'.format(df.yr.unique()))
    for idx, row in df_avg.iterrows():
        data = df.loc[idx]
        for var in stat_vars:
            new_colname = 'avg_{}'.format(var)
            avg = data[var].mean()
            df_avg.loc[idx, new_colname] = int(avg)
    
    # Add the averages to the original dataframe
    df2 = df.merge(df_avg, left_index=True, right_index=True)
    assert len(df2) == len(df)
    
    return df2
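The row-by-row averaging loop above can be slow when there are many locations. If it becomes a bottleneck, a vectorized groupby should yield the same avg_* columns; a sketch, assuming df is indexed by location id as in the function:

# Sketch: long-run averages per location in a single pass
stat_vars = ['total', 'max', 'qrt_max_val', 'qrt_max_start']
df_avg = df.groupby(level=0)[stat_vars].mean().astype(int).add_prefix('avg_')
df2 = df.merge(df_avg, left_index=True, right_index=True)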

Generate stats#

def generate_stats(target_yrs):
    """
    Helper function to put everything together and generate stats
    """
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_csv(FILE_HH_PUB_COORDS)
    df_pub_coords = df_pub_coords.query('make_public == 1')
    df_pub_coords2 = df_pub_coords.drop_duplicates(subset=[TEMP_UNIQ_ID])
    df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON]]
    gdf_coords = gpd.GeoDataFrame(df_coords,
                                  geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]),
                                  crs=4326)
    
    # ===========================
    # GENERATE ALL YEARS STATS
    # ===========================
    start = datetime.now()
    files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR, year_start=DEKAD_YEAR_START)
    df = generate_rfe_stats(files_yrs, pts_shp=gdf_coords, id_col=TEMP_UNIQ_ID)
    end = datetime.now()
    time_taken = (end - start).total_seconds()/60
    print('Time taken: {} minutes'.format(int(time_taken)))
    # ================================
    # EXTRACT DATA FOR TARGET YEARS
    # ================================
    # .copy() avoids pandas SettingWithCopyWarning on the renames below
    df_target_yrs = df.query('yr in @target_yrs').copy()
    required_vars = {'qrt_max_start': 'wetQstart', 'avg_total': 'anntot_avg',
                     'avg_qrt_max_start': 'wetQ_avgstart', 'total': 'anntot',
                     'avg_qrt_max_val': 'wetQ_avg', 'qrt_max_val': 'wetQ'}
    df_target_yrs.rename(columns=required_vars, inplace=True)
    df_target_yrs = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart', 'anntot_avg',
                                   'wetQ_avg', 'wetQ_avgstart']]

    # ===========================================
    # GENERATE YEAR-BASED VARIABLES (E.G.,
    # ANNTOT_2011-2012, ANNTOT_2012-2013)
    # ===========================================
    df_yr_based = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart']]

    df_tmp = pd.DataFrame(index=df_yr_based.index.drop_duplicates())
    yr_based_vars = df_yr_based.columns.difference(['yr'])
    
    # There is probably a shorter pivot approach (see the sketch
    # after this cell), so for now loop over locations explicitly
    for idx, row in df_tmp.iterrows():
        df_idx = df_yr_based.loc[idx]
        for yr in target_yrs:
            for var in yr_based_vars:
                val = df_idx.query('yr == @yr').iloc[0][var]
                new_col_name = '{}_{}'.format(var, yr)
                df_tmp.loc[idx, new_col_name] = val

    # ===========================================
    # PUT EVERYTHING BACK TOGETHER
    # ===========================================
    # We can drop everything else and remain with unique-locs
    df_target_yrs2 = df_target_yrs[~df_target_yrs.index.duplicated(keep='first')]
    df = df_target_yrs2.merge(df_tmp, left_index=True, right_index=True, how='inner')
    assert len(df) == len(gdf_coords)
    
    # Drop columns we don't need: the 2020-2021 columns are dropped
    # because they are not present in the pheno file
    drop_cols2match_pheno = ['anntot_2020-2021', 'wetQ_2020-2021', 'wetQstart_2020-2021']
    drop_cols_not_needed = ['yr', 'anntot', 'wetQ', 'wetQstart']
    drop_cols = drop_cols2match_pheno+drop_cols_not_needed
    df.drop(columns=drop_cols, inplace=True)
    
    # Remove "-" from col names
    df.columns = [c.replace("-", "_") for c in df.columns]
    
    # ===========================================
    # SAVE THE OUTPUTS
    # =========================================== 
    out_vars = list(df.columns.difference([TEMP_UNIQ_ID, LON, LAT]))
    out_csv = DIR_OUTPUTS.joinpath('crop-season-{}.csv'.format(FILENAME_SUFFIX))
    df.to_csv(out_csv)
    
    
    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('crop-season-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))
    
    return out_csv, out_vars
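As flagged in the comment inside generate_stats(), the per-row loop that builds the year-based columns could likely be replaced by a pivot. A sketch of that alternative, using the same df_yr_based frame (untested against the real data):

# Sketch: pivot year-based variables to wide columns such as anntot_2019-2020
df_wide = df_yr_based.pivot_table(index=df_yr_based.index, columns='yr',
                                  values=['anntot', 'wetQ', 'wetQstart'])
df_wide.columns = ['{}_{}'.format(var, yr) for var, yr in df_wide.columns]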
# Generate the variables
out_csv, out_vars = generate_stats(target_yrs=DATA_EXTRACT_YEARS)

Quality checks#

Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.

df = pd.read_csv(out_csv)
# Only use describe for continuous variables
print('-'*50)
print('       SUMMARY STATISTICS FOR CROP SEASON VARS')
print('-'*50)
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
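Beyond describe(), a cheap range check on the dekad-start variables can catch extraction problems early; a minimal sketch, assuming df and out_vars as defined above:

# Sketch: wettest-quarter start dekads must fall within 1..36
start_cols = [c for c in out_vars if 'start' in c]
for c in start_cols:
    assert df[c].between(1, 36).all(), '{} outside dekad range 1-36'.format(c)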

Summary#

The table below summarizes the variables that were generated.

| ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required | Reference period |
|----|-------|----------------------|-------------|-------|------------|--------------|----------|------------------|
| 1 | Crop Season | Average 12-month total rainfall for July-June | anntot_avg | mm | 0.008333 dd | RFE | Yes | 2001-2015 |
| 2 | Crop Season | Average total rainfall in wettest quarter within 12-month periods from July-June | wetQ_avg | mm | 0.008333 dd | RFE | Yes | 2001-2015 |
| 3 | Crop Season | Average start of wettest quarter in dekads 1-36, where first dekad of July = 1 | wetQ_avgstart | dekads | 0.008333 dd | RFE | Yes | 2001-2015 |
| 4 | Crop Season | 12-month total rainfall in July-June, starting July 2013 | anntot_2013-2014 | mm | 0.008333 dd | RFE | Yes | 2013-2014 |
| 5 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2013 | wetQ_2013-2014 | mm | 0.008333 dd | RFE | Yes | 2013-2014 |
| 6 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2013 = 1 | wetQstart_2013-2014 | dekads | 0.008333 dd | RFE | Yes | 2013-2014 |
| 7 | Crop Season | 12-month total rainfall in July-June, starting July 2014 | anntot_2014-2015 | mm | 0.008333 dd | RFE | Yes | 2014-2015 |
| 8 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2014 | wetQ_2014-2015 | mm | 0.008333 dd | RFE | Yes | 2014-2015 |
| 9 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2014 = 1 | wetQstart_2014-2015 | dekads | 0.008333 dd | RFE | Yes | 2014-2015 |