Crop Season#


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: August 1, 2023

Background#

In this notebook, we generate geovariables in the category of crop season for the National Panel Survey (NPS 2020-2021) in Tanzania. The following geovariables are computed.

  1. Average 12-month total rainfall (mm) for July-June

  2. Average total rainfall in wettest quarter (mm) within 12-month periods from July-June

  3. Average start of wettest quarter in dekads 1-36, where first dekad of July =1

  4. 12-month total rainfall (mm) in July-June, starting July 2019

  5. Total rainfall in wettest quarter (mm) within 12-month periods starting July 2019

  6. Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1

  7. 12-month total rainfall (mm) in July-June, starting July 2020

  8. Total rainfall in wettest quarter (mm) within 12-month periods starting July 2020

  9. Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1

For full variable metadata, see the Data-Cover-Page, which documents the definitions of these metrics and other important information.
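Throughout the notebook, dekads (10-day periods, 36 per calendar year) are re-indexed so that the agricultural year starts in July: calendar dekad 19 (July 1-10) becomes dekad 1. A minimal sketch of that mapping (the helper name is hypothetical, for illustration only):

def july_based_dekad(calendar_dekad):
    """Map a calendar dekad (1-36, where Jan 1-10 = 1) to the July-June index."""
    return (calendar_dekad - 19) % 36 + 1

assert july_based_dekad(19) == 1   # first dekad of July
assert july_based_dekad(36) == 18  # last dekad of December
assert july_based_dekad(18) == 36  # last dekad of June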

Input datasets#

The following datasets were used for all the variables.

  • Data source: The USGS FEWS NET Data Portal.

  • Input data files: The data files used in this notebook were downloaded from the site above and are kept in this OneDrive folder[1].

1. ^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import sys
import itertools
from pathlib import Path
from datetime import datetime

import geopandas as gpd
import pandas as pd
import numpy as np

# Add the local Spatial-Anonymization package to the module search path
spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to the input data files as global variables using pathlib's Path.

# =============================
# BASE WORKING DIR
# =============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')

# =================
# GEOVARS DATA
# =================
DIR_RFE = Path.cwd().parents[1].joinpath('data', 'RFE', 'dekad')

# =============================
# SURVEY COORDINATES
# =============================
# Public/displaced GPS coordinates file
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')

# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all',
                                     'geoBoundaries-TZA-ADM0.shp')

Setup processing global variables#

Set values that are used often, such as the column names containing latitude and longitude, as global variables.

# ====================
# GENERAL
# =====================
DEBUG = False
# in case we need to download OSM data
GEOFABRICK_REGION = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw5'
# When survey was conducted
SURVEY_CALENDAR_YR = 2021
DATA_EXTRACT_YEARS = ['2019-2020', '2020-2021']
# Year starts on July 1 (dekad-19) instead of
# Jan-1 (dekad01)
DEKAD_YEAR_START = 19
DEKADS = ['d{}'.format(str(i).zfill(2)) for i in range(DEKAD_YEAR_START, 37)] + \
         ['d{}'.format(str(i).zfill(2)) for i in range(1, DEKAD_YEAR_START)]
if DEBUG:
    for i in range(28):
        q = [DEKADS[j] for j in range(i, i + 9)]
        assert len(q) == 9
        print('Quarter-{}: {}'.format(i + 1, q))

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = False
BUFF_ZONE_DIST = 1000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5' 
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'


# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'anntot_avg': 'Average 12-month total rainfall for July-June',
    'wetQ_avg': 'Average total rainfall in wettest quarter within 12-month periods from July-June',
    'wetQ_avgstart': 'Average start of wettest quarter in dekads 1-36, where first dekad of July =1',
    'anntot_2019_2020': '12-month total rainfall in July-June, starting July 2019',
    'wetQ_2019_2020': 'Total rainfall in wettest quarter within 12-month periods starting July 2019',
    'wetQstart_2019_2020': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1',
    'anntot_2020_2021': '12-month total rainfall in July-June, starting July 2020',
    'wetQ_2020_2021': 'Total rainfall in wettest quarter within 12-month periods starting July 2020',
    'wetQstart_2020_2021': 'Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1'
    }

# Variable labels must fit in 80 characters (e.g., Stata's label limit)
for name, label in VAR_LABELS.items():
    try:
        assert len(label) <= 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)
DEKADS
['d19', 'd20', 'd21', 'd22', 'd23', 'd24', 'd25', 'd26', 'd27', 'd28',
 'd29', 'd30', 'd31', 'd32', 'd33', 'd34', 'd35', 'd36', 'd01', 'd02',
 'd03', 'd04', 'd05', 'd06', 'd07', 'd08', 'd09', 'd10', 'd11', 'd12',
 'd13', 'd14', 'd15', 'd16', 'd17', 'd18']

Define helper functions for computations#

def get_year_and_tif_list(stop_year, year_start=1):
    """
    Group RFE raster file paths by year, up to and including stop_year.
    When year_start != 1, re-chunk the files into cross-calendar
    (e.g., July-June) years of 36 dekads each.
    """
    # ===================================
    # RETRIEVE FILES BY CALENDAR YEAR
    # ===================================
    rfe_tifs = [i for i in DIR_RFE.iterdir() if i.suffix == ".tif"]
    yrs = {}
    for tif in rfe_tifs:
        # Year is sliced positionally from the file name (chars 4-8)
        yr = tif.parts[-1][4:8]
        yrs.setdefault(yr, []).append(tif)

    # Keep only the years we need based on the survey year
    # global variable (SURVEY_CALENDAR_YR)
    sorted_yrs = sorted(list(yrs.keys()))
    stop_idx = sorted_yrs.index(str(stop_year))
    calend_yrs = dict(sorted(yrs.items())[:stop_idx+1])
    
    # ===================================
    # DEAL WITH CROSS CALENDAR YEARS
    # ===================================
    if year_start != 1:
        cross_calend_yrs = {}
        files_lst = sorted(list(itertools.chain.from_iterable(list(calend_yrs.values()))))

        yr_start_id = int(list(calend_yrs.keys())[0])
        while year_start < len(files_lst):
            yr_end_id = yr_start_id + 1
            yr_id = '{}-{}'.format(yr_start_id, yr_end_id)
            yr_start_id += 1
            files = files_lst[year_start-1:year_start+35]
            cross_calend_yrs[yr_id] = files
            year_start += 36
        return cross_calend_yrs
    else:
        return calend_yrs
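The year is sliced positionally from each file name ([4:8] above, and [8:10] for the dekad later), so this assumes names such as rfe_201901.tif: a 4-character prefix, then YYYY and a 2-digit dekad (the exact prefix is an assumption; only the character positions matter). A usage sketch:

if DEBUG:
    # Each cross-calendar (July-June) "year" should span 36 dekad rasters
    files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR,
                                      year_start=DEKAD_YEAR_START)
    for yr_id, files in list(files_yrs.items())[:2]:
        print(yr_id, len(files))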
def get_wettest_quarter(df_yr, dekads=None):
    """
    Given one year's data, get the quarter (9 consecutive dekads)
    with maximum rainfall and the start dekad of that quarter.
    """
    dfq = pd.DataFrame(index=df_yr.index)
    quarter_start = {}
    if dekads:
        for i in range(28):
            q = [dekads[j] for j in range(i, i + 9)]
            assert len(q) == 9
            colname = 'q{}'.format(i + 1)
            quarter_start[colname] = int(q[0][1:])
            dfq[colname] = df_yr[q].sum(axis=1)
    else:
        for i in range(1, 29):
            q = ['d{}'.format(str(j).zfill(2)) for j in range(i, i + 9)]
            assert len(q) == 9
            colname = 'q{}'.format(i)
            quarter_start[colname] = int(i)
            dfq[colname] = df_yr[q].sum(axis=1)

    dfq['qrt_max_val'] = dfq.max(axis=1)
    dfq['qrt_max_id'] = dfq.idxmax(axis=1)
    dfq['qrt_max_start'] = dfq.qrt_max_id.apply(lambda x: quarter_start[x])

    return dfq[['qrt_max_val', 'qrt_max_id', 'qrt_max_start']]
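A quick synthetic check of the sliding 9-dekad (3-month) window: with rainfall concentrated in dekads d05-d13 of a hypothetical location, the wettest quarter should start at dekad 5 and total 90 mm.

if DEBUG:
    # Hypothetical one-location year: 10 mm in each of d05-d13, dry elsewhere
    cols = ['d{}'.format(str(i).zfill(2)) for i in range(1, 37)]
    toy = pd.DataFrame([[10.0 if 5 <= i <= 13 else 0.0 for i in range(1, 37)]],
                       index=['loc-1'], columns=cols)
    check = get_wettest_quarter(df_yr=toy)
    assert check.loc['loc-1', 'qrt_max_val'] == 90.0
    assert check.loc['loc-1', 'qrt_max_start'] == 5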
def generate_yr_rfe_stats(tif_list, yr, gdf_coords, uniq_id):
    """
    Generate RFE stats (annual total, max dekad, wettest quarter)
    for a single year.
    """
    df_lst = []
    for file in tif_list:
        # Dekad number is sliced positionally from the file name (chars 8-10)
        colname = 'd{}'.format(file.parts[-1][8:10])
        df = utils.point_query_raster(in_raster=file, buffer=BUFFER,
                                      point_shp=gdf_coords, statcol_name=colname,
                                      id_col=uniq_id, xcol=LON, ycol=LAT)
        df.set_index(uniq_id, inplace=True)
        df_lst.append(df)

    df = pd.concat(df_lst, axis=1)
    dekad_cols = sorted(df.columns)
    df = df[dekad_cols]

    # Annual total and max dekadal value
    df.loc[:,'total'] = df.sum(axis=1)
    df.loc[:,'max'] = df[dekad_cols].max(axis=1)

    # Wettest quarter
    dfq = get_wettest_quarter(df_yr=df, dekads=DEKADS)

    # merge 
    df2 = pd.concat([dfq, df[['total', 'max']]], axis=1)
    df2['yr'] = yr

    return df2
def generate_rfe_stats(yr_tif_dict, pts_shp, id_col):
    """
    Helper to loop through all years and run generate_yr_rfe_stats()
    """
    # ========================
    # CALCULATE YEARLY STATS
    # ========================
    df_yr_list = []
    # Loop years and calculate stats for that year
    cnt_yrs = 0
    for yr, files in yr_tif_dict.items():
        if len(files) != 36:
            print('SKIPPING {} BECAUSE WE DONT HAVE ALL DEKADS'.format(yr))
            continue
        df_yr = generate_yr_rfe_stats(files, yr, pts_shp, uniq_id=id_col)
        df_yr_list.append(df_yr)
        cnt_yrs += 1
        if DEBUG:
            if cnt_yrs == 5:
                break
    
    # concat into a single dataframe
    df = pd.concat(df_yr_list, axis=0)
    df = df[['yr', 'total', 'max', 'qrt_max_val', 'qrt_max_start']]
    assert len(df) == len(pts_shp)*cnt_yrs
    
    # ==================================
    # CALCULATE AVERAGES ACROSS YEARS
    # ==================================
    # Calculate the averages
    df_avg = pd.DataFrame(index=df.index.drop_duplicates())
    stat_vars = ['total', 'max', 'qrt_max_val', 'qrt_max_start']
    print('Long average years: {}'.format(df.yr.unique()))
    for idx, row in df_avg.iterrows():
        data = df.loc[idx]
        for var in stat_vars:
            new_colname = 'avg_{}'.format(var)
            avg = data[var].mean()
            # Truncate to whole units, matching the saved outputs
            df_avg.loc[idx, new_colname] = int(avg)
    
    # Add the averages to the original dataframe
    df2 = df.merge(df_avg, left_index=True, right_index=True)
    assert len(df2) == len(df)
    
    return df2
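The iterrows loop above computes the per-location long-term averages one cell at a time; an equivalent vectorized sketch (not what the notebook ran), assuming df as stacked inside generate_rfe_stats():

# Per-location averages in one shot: group the stacked yearly rows by the
# location index, average each variable, and truncate to int as above
df_avg = (df.groupby(level=0)[['total', 'max', 'qrt_max_val', 'qrt_max_start']]
            .mean()
            .astype(int)
            .add_prefix('avg_'))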

Generate stats#

def generate_stats(target_yrs):
    """
    Helper function to put everything together and generate stats
    """
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
    df_pub_coords = df_pub_coords.query('make_public == 1')
    df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
    df_pub_coords2 = df_pub_coords.drop_duplicates(subset=[TEMP_UNIQ_ID])

    df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON]]
    gdf_coords = gpd.GeoDataFrame(df_coords,
                                  geometry=gpd.points_from_xy(df_coords[LON], df_coords[LAT]),
                                  crs=4326)
    
    # ===========================
    # GENERATE ALL YEARS STATS
    # ===========================
    start = datetime.now()
    files_yrs = get_year_and_tif_list(stop_year=SURVEY_CALENDAR_YR, year_start=DEKAD_YEAR_START)
    df = generate_rfe_stats(files_yrs, pts_shp=gdf_coords, id_col=TEMP_UNIQ_ID)
    end = datetime.now()
    time_taken = (end - start).total_seconds()/60
    print('Time taken: {} minutes'.format(int(time_taken)))
    # ================================
    # EXTRACT DATA FOR TARGET YEARS
    # ================================
    df_target_yrs = df.query('yr in @target_yrs')
    required_vars = {'qrt_max_start': 'wetQstart', 'avg_total': 'anntot_avg', 
                     'avg_qrt_max_start': 'wetQ_avgstart', 'total': 'anntot',
                     'avg_qrt_max_val': 'wetQ_avg', 'qrt_max_val': 'wetQ'}
    df_target_yrs.rename(columns=required_vars, inplace=True)
    df_target_yrs = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart', 'anntot_avg','wetQ_avg', 'wetQ_avgstart']]

    # ===========================================
    # GENERATE YEAR-BASED VARIABLES (E.G.,
    # ANNTOT_2019-2020, ANNTOT_2020-2021)
    # ===========================================
    df_yr_based = df_target_yrs[['yr', 'anntot', 'wetQ', 'wetQstart']]

    df_tmp = pd.DataFrame(index=df_yr_based.index.drop_duplicates())
    yr_based_vars = df_yr_based.columns.difference(['yr'])
    
    # A single pivot would be shorter; see the sketch after this
    # function. The explicit loop below is kept for readability.
    for idx, row in df_tmp.iterrows():
        df_idx = df_yr_based.loc[idx]
        for yr in target_yrs:
            for var in yr_based_vars:
                val = df_idx.query('yr == @yr').iloc[0][var]
                new_col_name = '{}_{}'.format(var, yr)
                df_tmp.loc[idx, new_col_name] = val

    # ===========================================
    # PUT EVERYTHING BACK TOGETHER
    # ===========================================
    # We can drop everything else and keep only unique locations
    df_target_yrs2 = df_target_yrs[~df_target_yrs.index.duplicated(keep='first')]
    df = df_target_yrs2.merge(df_tmp, left_index=True, right_index=True, how='inner')
    assert len(df) == len(gdf_coords)
    
    # Drop columns we don't need.
    # The 2020-2021 columns are dropped because they are not in the pheno file
    drop_cols2match_pheno = ['anntot_2020-2021', 'wetQ_2020-2021', 'wetQstart_2020-2021']
    drop_cols_not_needed = ['yr', 'anntot', 'wetQ', 'wetQstart']
    drop_cols = drop_cols2match_pheno+drop_cols_not_needed
    df.drop(columns=drop_cols, inplace=True)
    
    # Remove "-" from col names
    df.columns = [c.replace("-", "_") for c in df.columns]
    
    # ===========================================
    # SAVE THE OUTPUTS
    # =========================================== 
    out_vars = list(df.columns.difference([TEMP_UNIQ_ID, LON, LAT]))
    out_csv = DIR_OUTPUTS.joinpath('crop-season-{}.csv'.format(FILENAME_SUFFIX))
    df.to_csv(out_csv)
    
    
    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('crop-season-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    print('-'*30)
    print('VARIABLES SAVED TO FILE BELOW')
    print('-'*30)
    print("/".join(out_csv.parts[6:]))
    
    return out_csv, out_vars
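As noted in the loop comment, the year-based reshaping can be done with a single pivot; a sketch, assuming df_yr_based as built inside generate_stats():

# One-shot alternative to the iterrows loop: pivot years into columns,
# then flatten ('anntot', '2019-2020') -> 'anntot_2019-2020'
wide = (df_yr_based.reset_index()
          .pivot(index=TEMP_UNIQ_ID, columns='yr',
                 values=['anntot', 'wetQ', 'wetQstart']))
wide.columns = ['{}_{}'.format(var, yr) for var, yr in wide.columns]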
# Generate the variables
out_csv, out_vars = generate_stats(target_yrs=DATA_EXTRACT_YEARS)

Quality checks#

Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.

out_csv = DIR_OUTPUTS.joinpath('crop-season-{}.csv'.format(FILENAME_SUFFIX))
df = pd.read_csv(out_csv)
df
      uniq_loc                                anntot_avg  wetQ_avg  wetQ_avgstart  anntot_2019_2020  wetQ_2019_2020  wetQstart_2019_2020
0     29.602591737997177--4.912058073441596       1012.0     538.0           20.0            1285.0           590.0                 29.0
1     29.6035629093--4.80935986272                1117.0     554.0           15.0            1403.0           648.0                  4.0
2     29.61972179903359--4.842254030081743        1099.0     550.0           16.0            1285.0           590.0                 29.0
3     29.635780889348787--4.866383736927592       1012.0     538.0           20.0            1285.0           590.0                 29.0
4     29.637410106--4.89393190766                 1012.0     538.0           20.0            1285.0           590.0                 29.0
...                                       ...        ...       ...            ...               ...             ...                  ...
1280  40.19491998234548--10.288644436052715        969.0     664.0           13.0            1045.0           790.0                 36.0
1281  40.2248285482--10.2230680537                 945.0     648.0           12.0            1084.0           808.0                 36.0
1282  40.25452621839283--10.5583863385372          907.0     639.0           10.0            1041.0           787.0                 36.0
1283  40.2787369628--10.5425273253                 925.0     657.0           10.0            1041.0           787.0                 36.0
1284  40.29176355850665--10.52910817196854         925.0     657.0           10.0            1041.0           787.0                 36.0

1285 rows × 7 columns


# Only use describe for continuous variables
print('-'*50)
print('       SUMMARY STATISTICS FOR CROP SEASON VARS')
print('-'*50)
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
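Beyond eyeballing the summary statistics, a few cheap consistency checks follow directly from the variable definitions (a sketch): the wettest-quarter total cannot exceed the 12-month total, and start dekads must lie in 1-36.

# Bounds implied by the variable definitions
assert (df['wetQ_2019_2020'] <= df['anntot_2019_2020']).all()
assert df['wetQstart_2019_2020'].between(1, 36).all()
assert df['wetQ_avgstart'].between(1, 36).all()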

Summary#

The following variables were generated.

| ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required | Reference period |
|----|-------|----------------------|-------------|-------|------------|--------------|----------|------------------|
| 1 | Crop Season | Average 12-month total rainfall for July-June | anntot_avg | mm | 0.008333 dd | RFE | Yes | 2001-2021 |
| 2 | Crop Season | Average total rainfall in wettest quarter within 12-month periods from July-June | wetQ_avg | mm | 0.008333 dd | RFE | Yes | 2001-2021 |
| 3 | Crop Season | Average start of wettest quarter in dekads 1-36, where first dekad of July =1 | wetQ_avgstart | dekads | 0.008333 dd | RFE | Yes | 2001-2021 |
| 4 | Crop Season | 12-month total rainfall in July-June, starting July 2019 | anntot_2019_2020 | mm | 0.008333 dd | RFE | Yes | 2019-2020 |
| 5 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2019 | wetQ_2019_2020 | mm | 0.008333 dd | RFE | Yes | 2019-2020 |
| 6 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2019 =1 | wetQstart_2019_2020 | dekads | 0.008333 dd | RFE | Yes | 2019-2020 |
| 7 | Crop Season | 12-month total rainfall in July-June, starting July 2020 | anntot_2020_2021 | mm | 0.008333 dd | RFE | Yes | 2020-2021 |
| 8 | Crop Season | Total rainfall in wettest quarter within 12-month periods starting July 2020 | wetQ_2020_2021 | mm | 0.008333 dd | RFE | Yes | 2020-2021 |
| 9 | Crop Season | Start of wettest quarter in dekads 1-36, where first dekad of July 2020 =1 | wetQstart_2020_2021 | dekads | 0.008333 dd | RFE | Yes | 2020-2021 |