Crop Season#


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: August 1, 2023

Background#

In this notebook, we generate geovariables in the category of crop season for the National Panel Survey (NPS 2020-2021) in Tanzania. The following geovariables are computed.

  1. Average 12-month total rainfall (mm) for July-June

  2. Average total rainfall in wettest quarter (mm) within 12-month periods from July-June

  3. Average start of wettest quarter in dekads 1-36, where first dekad of July =1

  4. 12-month total rainfall (mm) in July-June, starting July 2019

  5. Total rainfall in wettest quarter (mm) within 12-month periods starting July 2019

  6. Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1

  7. 12-month total rainfall (mm) in July-June, starting July 2020

  8. Total rainfall in wettest quarter (mm) within 12-month periods starting July 2020

  9. Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1

For full variable metadata, see the Data-Cover-Page, which documents the definitions of these metrics and other important information.
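Throughout the notebook, dekads (10-day periods, 36 per calendar year) are re-indexed so that the agricultural year starts in July: calendar dekad 19 (July 1-10) becomes dekad 1. A minimal sketch of that mapping (the helper name is hypothetical, for illustration only):

def july_based_dekad(calendar_dekad):
    """Map a calendar dekad (1-36, where Jan 1-10 = 1) to the July-June index."""
    return (calendar_dekad - 19) % 36 + 1

assert july_based_dekad(19) == 1   # first dekad of July
assert july_based_dekad(36) == 18  # last dekad of December
assert july_based_dekad(18) == 36  # last dekad of June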

Input datasets#

The following datasets were used for all the variables.

  • Data source: The USGS FEWS NET Data Portal.

  • Input data files: The data files used in this notebook were downloaded from the site above and are kept in this OneDrive folder[1].

1. ^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import sys
import itertools
from pathlib import Path
from datetime import datetime

import geopandas as gpd
import pandas as pd
import numpy as np

# Add the local Spatial-Anonymization package to the module search path
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to the input data files as global variables using pathlib's Path.

# =============================
# BASE WORKING DIR
# =============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')

# =================
# GEOVARS DATA
# =================
DIR_RFE = Path.cwd().parents[1].joinpath('data', 'RFE', 'dekad')

# =============================
# SURVEY COORDINATES
# =============================
# Public/displaced GPS coordinates file
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')

# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')

Setup processing global variables#

Set values that are used often, such as the column names containing latitude and longitude, as global variables.

# ====================
# GENERAL
# =====================
DEBUG = False
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'
# When survey was conducted
SURVEY_CALENDAR_YR = 2021
DATA_EXTRACT_YEARS = ['2019-2020', '2020-2021']
# Year starts on July 1 (dekad-19) instead of
# Jan-1 (dekad01)
DEKAD_YEAR_START = 19
DEKADS = ['d{}'.format(str(i).zfill(2)) for i in range(DEKAD_YEAR_START, 37)] + \
         ['d{}'.format(str(i).zfill(2)) for i in range(1, DEKAD_YEAR_START)]
if DEBUG:
    for i in range(28):
        q = [DEKADS[j] for j in range(i, i + 9)]
        assert len(q) == 9
        print('Quarter-{}: {}'.format(i + 1, q))

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = False
BUFF_ZONE_DIST = 1000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5' 
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'


# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'anntot_avg': 'Average 12-month total rainfall for July-June',
    'wetQ_avg': 'Average total rainfall in wettest quarter within 12-month periods from July-June',
    'wetQ_avgstart': 'Average start of wettest quarter in dekads 1-36, where first dekad of July =1',
    'anntot_2019_2020': '12-month total rainfall in July-June, starting July 2019',
    'wetQ_2019_2020': 'Total rainfall in wettest quarter within 12-month periods starting July 2019',
    'wetQstart_2019_2020': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1',
    'anntot_2020_2021': '12-month total rainfall in July-June, starting July 2020',
    'wetQ_2020_2021': 'Total rainfall in wettest quarter within 12-month periods starting July 2020',
    'wetQstart_2020_2021': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1'
    }

# Variable labels must fit in 80 characters (e.g., Stata's label limit)
for name, label in VAR_LABELS.items():
    try:
        assert len(label) <= 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)
DEKADS
['d19', 'd20', 'd21', 'd22', 'd23', 'd24', 'd25', 'd26', 'd27', 'd28',
 'd29', 'd30', 'd31', 'd32', 'd33', 'd34', 'd35', 'd36', 'd01', 'd02',
 'd03', 'd04', 'd05', 'd06', 'd07', 'd08', 'd09', 'd10', 'd11', 'd12',
 'd13', 'd14', 'd15', 'd16', 'd17', 'd18']

Define helper functions for computations#

def get_year_and_tif_list(stop_year, year_start=1):
    """
    Group RFE raster file paths by year, up to and including stop_year.
    When year_start != 1, re-chunk the files into cross-calendar
    (e.g., July-June) years of 36 dekads each.
    """
    # ===================================
    # RETRIEVE FILES BY CALENDAR YEAR
    # ===================================
    rfe_tifs = [i for i in DIR_RFE.iterdir() if i.suffix == ".tif"]
    yrs = {}
    for tif in rfe_tifs:
        # Year is sliced positionally from the file name (chars 4-8)
        yr = tif.parts[-1][4:8]
        yrs.setdefault(yr, []).append(tif)

    # Keep only the years we need based on the survey year
    # global variable (SURVEY_CALENDAR_YR)
    sorted_yrs = sorted(list(yrs.keys()))
    stop_idx = sorted_yrs.index(str(stop_year))
    calend_yrs = dict(sorted(yrs.items())[:stop_idx+1])
    
    # ===================================
    # DEAL WITH CROSS CALENDAR YEARS
    # ===================================
    if year_start != 1:
        cross_calend_yrs = {}
        files_lst = sorted(list(itertools.chain.from_iterable(list(calend_yrs.values()))))

        yr_start_id = int(list(calend_yrs.keys())[0])
        while year_start < len(files_lst):
            yr_end_id = yr_start_id + 1
            yr_id = '{}-{}'.format(yr_start_id, yr_end_id)
            yr_start_id += 1
            files = files_lst[year_start-1:year_start+35]
            cross_calend_yrs[yr_id] = files
            year_start += 36
        return cross_calend_yrs
    else:
        return calend_yrs
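The year is sliced positionally from each file name ([4:8] above, and [8:10] for the dekad later), so this assumes names such as rfe_201901.tif: a 4-character prefix, then YYYY and a 2-digit dekad (the exact prefix is an assumption; only the character positions matter). A usage sketch:

if DEBUG:
    # Each cross-calendar (July-June) "year" should span 36 dekad rasters
    files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR,
                                      year_start=DEKAD_YEAR_START)
    for yr_id, files in list(files_yrs.items())[:2]:
        print(yr_id, len(files))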
def get_wettest_quarter(df_yr, dekads=None):
    """
    Given one year's data, get the quarter (9 consecutive dekads)
    with maximum rainfall and the start dekad of that quarter.
    """
    dfq = pd.DataFrame(index=df_yr.index)
    quarter_start = {}
    if dekads:
        for i in range(28):
            q = [dekads[j] for j in range(i, i + 9)]
            assert len(q) == 9
            colname = 'q{}'.format(i + 1)
            quarter_start[colname] = int(q[0][1:])
            dfq[colname] = df_yr[q].sum(axis=1)
    else:
        for i in range(1, 29):
            q = ['d{}'.format(str(j).zfill(2)) for j in range(i, i + 9)]
            assert len(q) == 9
            colname = 'q{}'.format(i)
            quarter_start[colname] = int(i)
            dfq[colname] = df_yr[q].sum(axis=1)

    dfq['qrt_max_val'] = dfq.max(axis=1)
    dfq['qrt_max_id'] = dfq.idxmax(axis=1)
    dfq['qrt_max_start'] = dfq.qrt_max_id.apply(lambda x: quarter_start[x])

    return dfq[['qrt_max_val', 'qrt_max_id', 'qrt_max_start']]
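A quick synthetic check of the sliding 9-dekad (3-month) window: with rainfall concentrated in dekads d05-d13 of a hypothetical location, the wettest quarter should start at dekad 5 and total 90 mm.

if DEBUG:
    # Hypothetical one-location year: 10 mm in each of d05-d13, dry elsewhere
    cols = ['d{}'.format(str(i).zfill(2)) for i in range(1, 37)]
    toy = pd.DataFrame([[10.0 if 5 <= i <= 13 else 0.0 for i in range(1, 37)]],
                       index=['loc-1'], columns=cols)
    check = get_wettest_quarter(df_yr=toy)
    assert check.loc['loc-1', 'qrt_max_val'] == 90.0
    assert check.loc['loc-1', 'qrt_max_start'] == 5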
def generate_yr_rfe_stats(tif_list, yr, gdf_coords, uniq_id):
    """
    Generate RFE stats (annual total, max dekad, wettest quarter)
    for a single year.
    """
    df_lst = []
    for file in tif_list:
        # Dekad number is sliced positionally from the file name (chars 8-10)
        colname = 'd{}'.format(file.parts[-1][8:10])
        df = utils.point_query_raster(in_raster=file, buffer=BUFFER,
                                      point_shp=gdf_coords, statcol_name=colname,
                                      id_col=uniq_id, xcol=LON, ycol=LAT)
        df.set_index(uniq_id, inplace=True)
        df_lst.append(df)

    df = pd.concat(df_lst, axis=1)
    dekad_cols = sorted(df.columns)
    df = df[dekad_cols]

    # Annual total and max dekadal value
    df.loc[:,'total'] = df.sum(axis=1)
    df.loc[:,'max'] = df[dekad_cols].max(axis=1)

    # Wettest quarter
    dfq = get_wettest_quarter(df_yr=df, dekads=DEKADS)

    # merge 
    df2 = pd.concat([dfq, df[['total', 'max']]], axis=1)
    df2['yr'] = yr

    return df2
def generate_rfe_stats(yr_tif_dict, pts_shp, id_col):
    """
    Helper to loop through all years and run generate_yr_rfe_stats()
    """
    # ========================
    # CALCULATE YEARLY STATS
    # ========================
    df_yr_list = []
    # Loop years and calculate stats for that year
    cnt_yrs = 0
    for yr, files in yr_tif_dict.items():
        if len(files) != 36:
            print('SKIPPING {} BECAUSE WE DONT HAVE ALL DEKADS'.format(yr))
            continue
        df_yr = generate_yr_rfe_stats(files, yr, pts_shp, uniq_id=id_col)
        df_yr_list.append(df_yr)
        cnt_yrs += 1
        if DEBUG:
            if cnt_yrs == 5:
                break
    
    # concat into a single dataframe
    df = pd.concat(df_yr_list, axis=0)
    df = df[['yr', 'total', 'max', 'qrt_max_val', 'qrt_max_start']]
    assert len(df) == len(pts_shp)*cnt_yrs
    
    # ==================================
    # CALCULATE AVERAGES ACROSS YEARS
    # ==================================
    # Calculate the averages
    df_avg = pd.DataFrame(index=df.index.drop_duplicates())
    stat_vars = ['total', 'max', 'qrt_max_val', 'qrt_max_start']
    print('Long average years: {}'.format(df.yr.unique()))
    for idx, row in df_avg.iterrows():
        data = df.loc[idx]
        for var in stat_vars:
            new_colname = 'avg_{}'.format(var)
            avg = data[var].mean()
            # Truncate to whole units, matching the saved outputs
            df_avg.loc[idx, new_colname] = int(avg)
    
    # Add the averages to the original dataframe
    df2 = df.merge(df_avg, left_index=True, right_index=True)
    assert len(df2) == len(df)
    
    return df2
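The iterrows loop above computes the per-location long-term averages one cell at a time; an equivalent vectorized sketch (not what the notebook ran), assuming df as stacked inside generate_rfe_stats():

# Per-location averages in one shot: group the stacked yearly rows by the
# location index, average each variable, and truncate to int as above
df_avg = (df.groupby(level=0)[['total', 'max', 'qrt_max_val', 'qrt_max_start']]
            .mean()
            .astype(int)
            .add_prefix('avg_'))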

Generate stats#

def generate_stats(target_yrs):
    """
    Helper function to put everything together and generate stats
    """
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
    df_pub_coords = df_pub_coords.query('make_public == 1')
    df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
    df_pub_coords2 = df_pub_coords.drop_duplicates(subset=[TEMP_UNIQ_ID])

    df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON]]
    gdf_coords = gpd.GeoDataFrame(df_coords,
                                  geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]),
                                  crs=4326)
    
    # ===========================
    # GENERATE ALL YEARS STATS
    # ===========================
    start = datetime.now()
    files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR, year_start=DEKAD_YEAR_START)
    df = generate_rfe_stats(files_yrs, pts_shp=gdf_coords, id_col=TEMP_UNIQ_ID)
    end = datetime.now()
    time_taken = (end - start).total_seconds()/60
    print('Time taken: {} minutes'.format(int(time_taken)))
    # ================================
    # EXTRACT DATA FOR TARGET YEARS
    # ================================
    df_target_yrs = df.query('yr in @target_yrs')
    required_vars = {'qrt_max_start': 'wetQstart', 'avg_total': 'anntot_avg', 
                     'avg_qrt_max_start': 'wetQ_avgstart', 'total': 'anntot',
                     'avg_qrt_max_val': 'wetQ_avg', 'qrt_max_val': 'wetQ'}
    df_target_yrs.rename(columns=required_vars, inplace=True)
    df_target_yrs = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart', 'anntot_avg','wetQ_avg', 'wetQ_avgstart']]

    # ===========================================
    # GENERATE YEAR-BASED VARIABLES (E.G.,
    # ANNTOT_2019-2020, ANNTOT_2020-2021)
    # ===========================================
    df_yr_based = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart']]

    df_tmp = pd.DataFrame(index=df_yr_based.index.drop_duplicates())
    yr_based_vars = df_yr_based.columns.difference(['yr'])
    
    # A single pivot would be shorter; see the sketch after this
    # function. The explicit loop below is kept for readability.
    for idx, row in df_tmp.iterrows():
        df_idx = df_yr_based.loc[idx]
        for yr in target_yrs:
            for var in yr_based_vars:
                val = df_idx.query('yr == @yr').iloc[0][var]
                new_col_name = '{}_{}'.format(var, yr)
                df_tmp.loc[idx, new_col_name] = val

    # ===========================================
    # PUT EVERYTHING BACK TOGETHER
    # ===========================================
    # We can drop everything else and keep only unique locations
    df_target_yrs2 = df_target_yrs[~df_target_yrs.index.duplicated(keep='first')]
    df = df_target_yrs2.merge(df_tmp, left_index=True, right_index=True, how='inner')
    assert len(df) == len(gdf_coords)
    
    # Drop columns we don't need.
    # The 2020-2021 columns are dropped because they are not in the pheno file
    drop_cols2match_pheno = ['anntot_2020-2021', 'wetQ_2020-2021', 'wetQstart_2020-2021']
    drop_cols_not_needed = ['yr', 'anntot', 'wetQ', 'wetQstart']
    drop_cols = drop_cols2match_pheno+drop_cols_not_needed
    df.drop(columns=drop_cols, inplace=True)
    
    # Remove "-" from col names
    df.columns = [c.replace("-", "_") for c in df.columns]
    
    # ===========================================
    # SAVE THE OUTPUTS
    # =========================================== 
    out_vars = list(df.columns.difference([TEMP_UNIQ_ID, LON, LAT]))
    out_csv = DIR_OUTPUTS.joinpath('crop-season-{}.csv'.format(FILENAME_SUFFIX))
    df.to_csv(out_csv)
    
    
    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('crop-season-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))
    
    return out_csv, out_vars
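As noted in the loop comment, the year-based reshaping can be done with a single pivot; a sketch, assuming df_yr_based as built inside generate_stats():

# One-shot alternative to the iterrows loop: pivot years into columns,
# then flatten ('anntot', '2019-2020') -> 'anntot_2019-2020'
wide = (df_yr_based.reset_index()
          .pivot(index=TEMP_UNIQ_ID, columns='yr',
                 values=['anntot', 'wetQ', 'wetQstart']))
wide.columns = ['{}_{}'.format(var, yr) for var, yr in wide.columns]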
# Generate the variables
out_csv, out_vars = generate_stats(target_yrs=DATA_EXTRACT_YEARS)

Quality checks#

Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.

out_csv = DIR_OUTPUTS.joinpath('crop-season-{}.csv'.format(FILENAME_SUFFIX))
df = pd.read_csv(out_csv)
df
      uniq_loc                                anntot_avg  wetQ_avg  wetQ_avgstart  anntot_2019_2020  wetQ_2019_2020  wetQstart_2019_2020
0     29.602591737997177--4.912058073441596       1012.0     538.0           20.0            1285.0           590.0                 29.0
1     29.6035629093--4.80935986272                1117.0     554.0           15.0            1403.0           648.0                  4.0
2     29.61972179903359--4.842254030081743        1099.0     550.0           16.0            1285.0           590.0                 29.0
3     29.635780889348787--4.866383736927592       1012.0     538.0           20.0            1285.0           590.0                 29.0
4     29.637410106--4.89393190766                 1012.0     538.0           20.0            1285.0           590.0                 29.0
...                                       ...        ...       ...            ...               ...             ...                  ...
1280  40.19491998234548--10.288644436052715        969.0     664.0           13.0            1045.0           790.0                 36.0
1281  40.2248285482--10.2230680537                 945.0     648.0           12.0            1084.0           808.0                 36.0
1282  40.25452621839283--10.5583863385372          907.0     639.0           10.0            1041.0           787.0                 36.0
1283  40.2787369628--10.5425273253                 925.0     657.0           10.0            1041.0           787.0                 36.0
1284  40.29176355850665--10.52910817196854         925.0     657.0           10.0            1041.0           787.0                 36.0

1285 rows × 7 columns


# Only use describe for continuous variables
print('-'*50)
print('       SUMMARY STATISTICS FOR CROP SEASON VARS')
print('-'*50)
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
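Beyond eyeballing the summary statistics, a few cheap consistency checks follow directly from the variable definitions (a sketch): the wettest-quarter total cannot exceed the 12-month total, and start dekads must lie in 1-36.

# Bounds implied by the variable definitions
assert (df['wetQ_2019_2020'] <= df['anntot_2019_2020']).all()
assert df['wetQstart_2019_2020'].between(1, 36).all()
assert df['wetQ_avgstart'].between(1, 36).all()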

Summary#

The following variables were generated.

| ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required | Reference period |
|----|-------|----------------------|-------------|-------|------------|--------------|----------|------------------|
| 1 | Crop Season | Average 12-month total rainfall for July-June | anntot_avg | mm | 0.008333 dd | RFE | Yes | 2001-2021 |
| 2 | Crop Season | Average total rainfall in wettest quarter within 12-month periods from July-June | wetQ_avg | mm | 0.008333 dd | RFE | Yes | 2001-2021 |
| 3 | Crop Season | Average start of wettest quarter in dekads 1-36, where first dekad of July =1 | wetQ_avgstart | dekads | 0.008333 dd | RFE | Yes | 2001-2021 |
| 4 | Crop Season | 12-month total rainfall in July-June, starting July 2019 | anntot_2019_2020 | mm | 0.008333 dd | RFE | Yes | 2019-2020 |
| 5 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2019 | wetQ_2019_2020 | mm | 0.008333 dd | RFE | Yes | 2019-2020 |
| 6 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1 | wetQstart_2019_2020 | dekads | 0.008333 dd | RFE | Yes | 2019-2020 |
| 7 | Crop Season | 12-month total rainfall in July-June, starting July 2020 | anntot_2020_2021 | mm | 0.008333 dd | RFE | Yes | 2020-2021 |
| 8 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2020 | wetQ_2020_2021 | mm | 0.008333 dd | RFE | Yes | 2020-2021 |
| 9 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1 | wetQstart_2020_2021 | dekads | 0.008333 dd | RFE | Yes | 2020-2021 |