Crop Season#


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: May 18, 2023

Background#

In this notebook, we generate geovariables in the category of crop season for the National Panel Survey (NPS 2014-2015) in Tanzania. The following geovariables are computed.

  1. Average 12-month total rainfall (mm) for July-June

  2. Average total rainfall in wettest quarter (mm) within 12-month periods from July-June

  3. Average start of wettest quarter in dekads 1-36, where first dekad of July =1

  4. 12-month total rainfall (mm) in July-June, starting July 2013

  5. Total rainfall in wettest quarter (mm) within 12-month periods starting July 2013

  6. Start of wettest quarter in dekads 1-36, where first dekad of July 2013 =1

  7. 12-month total rainfall (mm) in July-June, starting July 2014

  8. Total rainfall in wettest quarter (mm) within 12-month periods starting July 2014

  9. Start of wettest quarter in dekads 1-36, where first dekad of July 2014 =1

For the full variable metadata, including definitions of the metrics and other important information, see the Data-Cover-Page.
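To make the dekad convention in these definitions concrete, the sketch below (a hypothetical helper, not part of the processing pipeline) maps a July-based dekad index (1-36, where the first dekad of July = 1) back to a calendar month and within-month dekad.

def july_dekad_to_calendar(dekad_july):
    """Map a July-based dekad index (1-36, first dekad of July = 1)
    to (calendar month, dekad within that month)."""
    assert 1 <= dekad_july <= 36
    month_offset = (dekad_july - 1) // 3        # whole months counted from July
    month = (7 + month_offset - 1) % 12 + 1     # calendar month number
    dekad_in_month = (dekad_july - 1) % 3 + 1   # 1, 2 or 3
    return month, dekad_in_month

print(july_dekad_to_calendar(1))    # (7, 1)  -> first dekad of July
print(july_dekad_to_calendar(19))   # (1, 1)  -> first dekad of January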

Input datasets#

The following datasets have been used for all the variables.

1. ^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import sys
import itertools
from pathlib import Path
from datetime import datetime

import geopandas as gpd
import pandas as pd
import numpy as np

spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to the input data files as global variables using pathlib Path objects.

# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W4 = DIR_DATA.joinpath('surveys', 'NPS_w4')
DIR_OUTPUTS = DIR_NPS_W4.joinpath('geovars', 'puf')

# =================
# GEOVARS DATA
# =================
DIR_RFE = Path.cwd().parents[1].joinpath('data', 'RFE', 'dekad')

# =============================
# SURVEY COORDINATES 
# ============================
# Public/displaced GPS coordinates file
FILE_HH_PUB_GEOVARS = DIR_NPS_W4.joinpath('geovars', 'npsy4.ea.offset.dta')


# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')

Setup processing global variables#

Set values we will use often, such as the names of the columns containing latitude and longitude, as global variables.

# ====================
# GENERAL
# =====================
DEBUG = False
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw4'
# When survey was conducted
SURVEY_CALENDAR_YR = 2015
DATA_EXTRACT_YEARS = ['2013-2014', '2014-2015']
# The crop year starts on July 1 (dekad 19) instead of
# Jan 1 (dekad 01)
DEKAD_YEAR_START = 19
DEKADS = ['d{}'.format(str(i).zfill(2)) for i in range(DEKAD_YEAR_START, 37)] + \
    ['d{}'.format(str(i).zfill(2)) for i in range(1, DEKAD_YEAR_START)]
if DEBUG:
    # Print the 28 candidate quarters (9 consecutive dekads each)
    for i in range(28):
        q = DEKADS[i:i + 9]
        assert len(q) == 9
        print('Quarter-{}: {}'.format(i + 1, q))

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = False
BUFF_ZONE_DIST = 1000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_modified' 
LON = 'lon_modified'
HH_ID = 'clusterid'

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'anntot_avg': 'Average 12-month total rainfall for July-June',
    'wetQ_avg': 'Average total rainfall in wettest quarter within 12-month periods from July-June',
    'wetQ_avgstart': 'Average start of wettest quarter in dekads 1-36, where first dekad of July =1',
    'anntot_2013_2014': '12-month total rainfall in July-June, starting July 2013',
    'wetQ_2013_2014': 'Total rainfall in wettest quarter within 12-month periods starting July 2013',
    'wetQstart_2013_2014': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2013 =1',
    'anntot_2014_2015': '12-month total rainfall in July-June, starting July 2014',
    'wetQ_2014_2015': 'Total rainfall in wettest quarter within 12-month periods starting July 2014',
    'wetQstart_2014_2015': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2014 =1'
    }

for name, label in VAR_LABELS.items():
    try:
        assert len(label) <= 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)
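
As a quick sanity check on the July-first ordering (an illustrative addition, not in the original processing), the first entry of DEKADS should be 'd19' (the first dekad of July) and the last should be 'd18' (the last dekad of June).

if DEBUG:
    print(DEKADS[0], DEKADS[-1])   # expect: d19 d18
    assert DEKADS[0] == 'd19' and DEKADS[-1] == 'd18'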

Define helper functions for computations#

def get_year_and_tif_list(stop_year, year_start=1):
    """
    Retrieve the dekadal RFE GeoTIFFs grouped by year. When year_start
    is not 1, the files are regrouped into cross-calendar (July-June)
    12-month windows of 36 dekads each.
    """
    # ===================================
    # RETRIEVE FILES BY CALENDAR YEAR
    # ===================================
    rfe_tifs = [i for i in DIR_RFE.iterdir() if i.suffix == ".tif"]
    yrs = {}
    for tif in rfe_tifs:
        # Year parsed from the file name
        yr = tif.parts[-1][4:8]
        yrs.setdefault(yr, []).append(tif)

    # Keep only the years we need based on the survey-year
    # cut-off passed in as stop_year
    sorted_yrs = sorted(list(yrs.keys()))
    stop_idx = sorted_yrs.index(str(stop_year))
    calend_yrs = dict(sorted(yrs.items())[:stop_idx + 1])

    # ===================================
    # DEAL WITH CROSS CALENDAR YEARS
    # ===================================
    if year_start != 1:
        cross_calend_yrs = {}
        files_lst = sorted(list(itertools.chain.from_iterable(list(calend_yrs.values()))))

        yr_start_id = int(list(calend_yrs.keys())[0])
        while year_start < len(files_lst):
            yr_end_id = yr_start_id + 1
            yr_id = '{}-{}'.format(yr_start_id, yr_end_id)
            yr_start_id += 1
            # 36 consecutive dekads starting at the July dekad
            files = files_lst[year_start - 1:year_start + 35]
            cross_calend_yrs[yr_id] = files
            year_start += 36
        return cross_calend_yrs
    else:
        return calend_yrs
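
# Illustrative usage sketch (DEBUG only, not part of the pipeline): with
# year_start=DEKAD_YEAR_START the function returns a dict keyed by
# cross-calendar labels such as '2001-2002', each mapping to the (typically
# 36) dekadal GeoTIFF paths of that July-June window.
if DEBUG:
    _files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR,
                                       year_start=DEKAD_YEAR_START)
    for _yr, _files in list(_files_yrs.items())[:2]:
        print(_yr, len(_files))
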
def get_wettest_quarter(df_yr, dekads=None):
    """
    Given one year of data, find the quarter (9 consecutive dekads) with
    the maximum rainfall and the start dekad of that quarter.
    """
    dfq = pd.DataFrame(index=df_yr.index)
    quarter_start = {}
    if dekads:
        # Use the supplied dekad ordering (e.g. July first)
        for i in range(28):
            q = dekads[i:i + 9]
            assert len(q) == 9
            colname = 'q{}'.format(i + 1)
            quarter_start[colname] = int(q[0][1:])
            dfq[colname] = df_yr[q].sum(axis=1)
    else:
        # Default calendar ordering: d01, d02, ..., d36
        for i in range(1, 29):
            q = ['d{}'.format(str(j).zfill(2)) for j in range(i, i + 9)]
            assert len(q) == 9
            colname = 'q{}'.format(i)
            quarter_start[colname] = int(i)
            dfq[colname] = df_yr[q].sum(axis=1)

    dfq['qrt_max_val'] = dfq.max(axis=1)
    dfq['qrt_max_id'] = dfq.idxmax(axis=1)
    dfq['qrt_max_start'] = dfq.qrt_max_id.apply(lambda x: quarter_start[x])

    return dfq[['qrt_max_val', 'qrt_max_id', 'qrt_max_start']]
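
# Tiny smoke test (illustrative only, not in the original notebook): one
# synthetic location whose dekadal rainfall increases linearly, so the wettest
# quarter is unambiguously the last one (d28-d36) and qrt_max_start should be 28.
if DEBUG:
    _cols = ['d{}'.format(str(i).zfill(2)) for i in range(1, 37)]
    _toy = pd.DataFrame([list(range(1, 37))], index=['loc1'], columns=_cols)
    print(get_wettest_quarter(_toy))
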
def generate_yr_rfe_stats(tif_list, yr, gdf_coords, uniq_id):
    """
    Generate RFE stats for a single 12-month window: dekadal rainfall at
    each point, the window total, the maximum dekadal value and the
    wettest-quarter statistics.
    """
    df_lst = []
    for file in tif_list:
        # Dekad number parsed from the file name
        colname = 'd{}'.format(file.parts[-1][8:10])
        df = utils.point_query_raster(in_raster=file, buffer=BUFFER,
                                      point_shp=gdf_coords, statcol_name=colname,
                                      id_col=uniq_id, xcol=LON, ycol=LAT)
        df.set_index(uniq_id, inplace=True)
        df_lst.append(df)

    df = pd.concat(df_lst, axis=1)
    dekad_cols = sorted(df.columns)
    df = df[dekad_cols]

    # Annual total and max dekadal value
    df.loc[:,'total'] = df.sum(axis=1)
    df.loc[:,'max'] = df[dekad_cols].max(axis=1)

    # Wettest quarter
    dfq = get_wettest_quarter(df_yr=df, dekads=DEKADS)

    # merge 
    df2 = pd.concat([dfq, df[['total', 'max']]], axis=1)
    df2['yr'] = yr

    return df2
def generate_rfe_stats(yr_tif_dict, pts_shp, id_col):
    """
    Helper to loop through all years and run generate_yr_rfe_stats()
    """
    # ========================
    # CALCULATE YEARLY STATS
    # ========================
    df_yr_list = []
    # Loop years and calculate stats for that year
    cnt_yrs = 0
    for yr, files in yr_tif_dict.items():
        if len(files) != 36:
            print('SKIPPING {} BECAUSE WE DONT HAVE ALL DEKADS'.format(yr))
            continue
        df_yr = generate_yr_rfe_stats(files, yr, pts_shp, uniq_id=id_col)
        df_yr_list.append( df_yr)
        cnt_yrs += 1
        if DEBUG:
            if cnt_yrs == 5:
                break
    
    # concat into a single dataframe
    df = pd.concat(df_yr_list, axis=0)
    df = df[['yr',  'total', 'max', 'qrt_max_val', 'qrt_max_start']]
    assert len(df) == len(pts_shp)*cnt_yrs
    
    # ==================================
    # CALCULATE AVERAGES ACROSS YEARS
    # ==================================
    # Calculate the averages
    df_avg = pd.DataFrame(index=df.index.drop_duplicates())
    vars = ['total', 'max', 'qrt_max_val', 'qrt_max_start']
    print('Long average years: {}'.format(df.yr.unique()))
    for idx, row in df_avg.iterrows():
        data = df.loc[idx]
        for var in vars:
            new_colname = 'avg_{}'.format(var)
            avg = data[var].mean()
            df_avg.loc[idx, new_colname] = int(avg)
    
    # Add the averages to the original dataframe
    df2 = df.merge(df_avg, left_index=True, right_index=True)
    assert len(df2) == len(df)
    
    return df2

Generate stats#

def generate_stats(target_yrs):
    """
    Helper function to put everything together and generate stats
    """
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_stata(FILE_HH_PUB_GEOVARS, convert_categoricals=False)
    df_coords = df_pub_coords[[HH_ID, LAT, LON]]
    gdf_coords = gpd.GeoDataFrame(
        df_coords,
        geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]),
        crs=4326)
    
    # ===========================
    # GENERATE ALL YEARS STATS
    # ===========================
    start = datetime.now()
    files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR, year_start=DEKAD_YEAR_START)
    df = generate_rfe_stats(files_yrs, pts_shp=gdf_coords, id_col=HH_ID)
    end = datetime.now()
    time_taken = (end - start).total_seconds()/60
    print('Time taken: {}'.format(int(time_taken)))
    # ================================
    # EXTRACT DATA FOR TARGET YEARS
    # ================================
    df_target_yrs = df.query('yr in @target_yrs')
    required_vars = {'qrt_max_start': 'wetQstart', 'avg_total': 'anntot_avg', 
                     'avg_qrt_max_start': 'wetQ_avgstart', 'total': 'anntot',
                     'avg_qrt_max_val': 'wetQ_avg', 'qrt_max_val': 'wetQ'}
    df_target_yrs.rename(columns=required_vars, inplace=True)
    df_target_yrs = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart', 'anntot_avg','wetQ_avg', 'wetQ_avgstart']]

    # ===========================================
    # GENERATE YEAR-BASED VARIABLES
    # (E.G., ANNTOT_2013_2014, ANNTOT_2014_2015)
    # ===========================================
    df_yr_based = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart']]

    df_tmp = pd.DataFrame(index=df_yr_based.index.drop_duplicates())
    yr_based_vars = df_yr_based.columns.difference(['yr'])
    
    # There is probably a shorter pivot approach, but we could not
    # figure it out, so we loop explicitly
    for idx, row in df_tmp.iterrows():
        df_idx = df_yr_based.loc[idx]
        for yr in target_yrs:
            for var in yr_based_vars:
                val = df_idx.query('yr == @yr').iloc[0][var]
                new_col_name = '{}_{}'.format(var, yr.replace("-","_"))
                df_tmp.loc[idx, new_col_name] = val

    # ===========================================
    # PUT EVERYTHING BACK TOGETHER
    # ===========================================
    # We can drop everything else and keep only the unique locations
    df_target_yrs2 = df_target_yrs[~df_target_yrs.index.duplicated(keep='first')]
    df = df_target_yrs2.merge(df_tmp, left_index=True, right_index=True, how='inner')
    assert len(df) == len(gdf_coords)
    # Drop columns we don't need
    df.drop(columns=['yr', 'anntot', 'wetQ', 'wetQstart'], inplace=True)

    # ===========================================
    # SAVE THE OUTPUTS
    # =========================================== 
    out_vars = list(df.columns.difference([HH_ID, LON, LAT]))
    out_csv = DIR_OUTPUTS.joinpath('crop-season-{}.csv'.format(FILENAME_SUFFIX))
    df.to_csv(out_csv)
    
    
    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('crop-season-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))
    
    return out_csv, out_vars
# Generate the variables
out_csv, out_vars = generate_stats(target_yrs=DATA_EXTRACT_YEARS)
SKIPPING 2015-2016 BECAUSE WE DONT HAVE ALL DEKADS
Long average years: ['2001-2002' '2002-2003' '2003-2004' '2004-2005' '2005-2006' '2006-2007'
 '2007-2008' '2008-2009' '2009-2010' '2010-2011' '2011-2012' '2012-2013'
 '2013-2014' '2014-2015']
Time taken: 2
------------------------------
VARIABLES SAVED TO FILE BELOW
------------------------------
DECAT_HH_Geovariables/data/TZA/surveys/NPS_w4/geovars/puf/crop-season-npsw4.csv

Quality checks#

Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.

# Only use describe for continuous variables
# Read the saved outputs back in for inspection
df = pd.read_csv(out_csv)
print('-'*50)
print('       SUMMARY STATISTICS FOR CROP SEASON VARS')
print('-'*50)
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR CROP SEASON VARS
--------------------------------------------------
                           mean    min    50%     75%     max
anntot_2013_2014     887.973747  249.0  898.0  1038.5  2111.0
anntot_2014_2015     878.959427  278.0  846.0  1036.0  1967.0
anntot_avg           829.522673  322.0  850.0   967.0  1817.0
wetQ_2013_2014       570.584726  180.0  539.0   718.5  1008.0
wetQ_2014_2015       517.694511  165.0  538.0   628.0   937.0
wetQ_avg             495.923628  211.0  500.0   563.0   821.0
wetQ_avgstart         13.852029    5.0   11.0    19.0    29.0
wetQstart_2013_2014   11.630072    1.0    4.0    30.0    36.0
wetQstart_2014_2015   13.892601    1.0    7.0    30.0    36.0
----------------------------------------------------------------------
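As an additional hedged check (a sketch only, not in the original notebook), the wettest-quarter start dekads should fall within the valid 1-36 range and the 12-month totals should be non-negative.

# Illustrative range checks on the generated variables
start_cols = [c for c in out_vars if 'start' in c]
total_cols = [c for c in out_vars if c.startswith('anntot')]
assert df[start_cols].min().min() >= 1 and df[start_cols].max().max() <= 36
assert (df[total_cols] >= 0).all().all()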

Summary#

The following variables were generated.

| ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required | Reference period |
|----|-------|----------------------|-------------|-------|------------|--------------|----------|------------------|
| 1 | Crop Season | Average 12-month total rainfall for July-June | anntot_avg | mm | 0.008333 dd | RFE | Yes | 2001-2015 |
| 2 | Crop Season | Average total rainfall in wettest quarter within 12-month periods from July-June | wetQ_avg | mm | 0.008333 dd | RFE | Yes | 2001-2015 |
| 3 | Crop Season | Average start of wettest quarter in dekads 1-36, where first dekad of July =1 | wetQ_avgstart | dekads | 0.008333 dd | RFE | Yes | 2001-2015 |
| 4 | Crop Season | 12-month total rainfall in July-June, starting July 2013 | anntot_2013_2014 | mm | 0.008333 dd | RFE | Yes | 2013-2014 |
| 5 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2013 | wetQ_2013_2014 | mm | 0.008333 dd | RFE | Yes | 2013-2014 |
| 6 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2013 =1 | wetQstart_2013_2014 | dekads | 0.008333 dd | RFE | Yes | 2013-2014 |
| 7 | Crop Season | 12-month total rainfall in July-June, starting July 2014 | anntot_2014_2015 | mm | 0.008333 dd | RFE | Yes | 2014-2015 |
| 8 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2014 | wetQ_2014_2015 | mm | 0.008333 dd | RFE | Yes | 2014-2015 |
| 9 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2014 =1 | wetQstart_2014_2015 | dekads | 0.008333 dd | RFE | Yes | 2014-2015 |