Crop Season-Phenology#


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: April 16, 2023

Background#

In this notebook, we generate climate-category geovariables for the National Panel Survey (NPS 2012-2013) in Tanzania. The following bioclimatic geovariables are computed:

  1. Average total change in greenness (integral of daily EVI values) within primary growing season

  2. Average timing of onset of greenness increase in day of year 1-365, where Jul 1 = 1

  3. Average timing of onset of greenness decrease in day of year 1-365, where Jul 1 = 1, averaged by district

  4. Total change in greenness (integral of daily EVI values) within primary growing season for July 2011 - Jun 2012, averaged by district

  5. Onset of greenness increase in day of year 1-365, starting July 1 2011, averaged by district

  6. Onset of greenness decrease in day of year 1-365, starting July 1 2011, averaged by district

  7. Total change in greenness (integral of daily EVI values) within primary growing season for July 2012 - Jun 2013

  8. Onset of greenness increase in day of year 1-365, starting July 1 2012, averaged by district

  9. Onset of greenness decrease in day of year 1-365, starting July 1 2012, averaged by district

  10. Average EVI value at peak of greenness within growing season, starting July 1

  11. EVI value at peak of greenness within growing season, starting July 1 2011, averaged by district

  12. EVI value at peak of greenness within growing season, starting July 1 2012, averaged by district

The naming and definitions of the variables are based on WorldClim. For full variable metadata, including the definitions of the metrics and other important information, see the Data-Cover-Page.

Input datasets#

The following datasets have been used for all the variables.

1. ^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import sys
import itertools
from pathlib import Path
from datetime import datetime, timedelta

import geopandas as gpd
import rasterio
from rasterio.enums import Resampling
from rasterstats import gen_zonal_stats
import pandas as pd
import numpy as np
from numpy import ma

spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to input data files as global variables using the `pathlib.Path` class.

# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W5 = DIR_DATA.joinpath('surveys', 'NPS_w5')
DIR_OUTPUTS = DIR_NPS_W5.joinpath('geovars', 'puf')

# =================
# GEOVARS DATA
# =================
DIR_MCD = Path.cwd().parents[1].joinpath('data', 'MCD12Q2', 'MCD12Q2.061_2000336_to_2021365')
FILE_AG_MASK = DIR_SPATIAL_DATA.joinpath('ESACCI_2015_agmask.tif')
FILE_PHENO_FROM_R_SUM = DIR_OUTPUTS.joinpath('pufdistrict_sumstats_phenology_s1.csv')
FILE_PHENO_FROM_R_TS = DIR_OUTPUTS.joinpath('pufdistrict_ts_phenology_s1.csv')

# =============================
# SURVEY COORDINATES 
# ============================
# Public/displaced GPS coordinates files
FILE_HH_PUB_GEOVARS = DIR_OUTPUTS.joinpath('y5-hhs-pub-coords.csv')


# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all','geoBoundaries-TZA-ADM0.shp')
FILE_ADM3 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM3-all','geoBoundaries-TZA-ADM3.shp')
FILE_ADM2 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM2-all','geoBoundaries-TZA-ADM2.shp')

Setup processing global variables#

Set frequently used values, such as the names of the columns containing latitude and longitude, as global variables.

# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw4'
# When survey was conducted
SURVEY_CALENDAR_YR = 2021
AVG_YEARS = [i for i in range(2001, SURVEY_CALENDAR_YR+1)]
CYCLE = 0

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = True
BUFF_ZONE_DIST = 1000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_pub_w5' 
LON = 'lon_pub_w5'
HH_ID = 'y5_hhid'
CLUST_ID = 'y5_clusterid'
TEMP_UNIQ_ID = 'uniq_loc'
URB_RURAL = 'y5_clustertype'
URBAN_CODE = 'urban'

# In cases where HH_ID isn't unique to lat,lon pairs, create a temporary
# location ID based on lat,lon pairs
UNIQ_LOC_ID = 'loc' 

# If the variables were already generated in R,
# just load them from the CSV files
LOAD_FROM_FILE = True

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'h2019_eviarea':'Total change in greenness within primary growing season',
 'avg_eviarea': 'Average total change in greenness within primary growing season',
'h2019_evimax':'EVI value at peak of greenness within growing season',
 'avg_evimax':'Average EVI value at peak of greenness within growing season',
 'avg_grn':'Average timing of onset of greenness increase',
 'h2019_grn':'Onset of greenness increase',
 'h2019_sen':'Onset of greenness decrease',
 'avg_sen': 'Average timing of onset of greenness decrease'}

for name, label in VAR_LABELS.items():
    try:
        assert len(label) < 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)

Define helper functions for computation#

def update_ag_mask(ag_mask_file):
    """
    Update ag mask TIF so that it matches MCD12Q2 layers dimensions
    """
    
    sample_MCD12Q2_tif = DIR_MCD.joinpath('MCD12Q2.061_EVI_Area_0_doy2001001_aid0001.tif')
    ds_MCD12Q2 = rasterio.open(sample_MCD12Q2_tif)
    MCD12Q2_shape = ds_MCD12Q2.meta['height'], ds_MCD12Q2.meta['width']
    with rasterio.open(ag_mask_file) as dataset:
        # resample data to target shape
        data = dataset.read(
            out_shape=MCD12Q2_shape,
            resampling=Resampling.nearest
        )
        
        # Set NaN to zero for the sake of creating numpy mask
        data[np.isnan(data)] = 0
    
    # Create numpy mask which sets all zeros to False
    # Because we actually want the zeros to be True (in the np.ma mask style)
    # we invert the mask
    ag_mask = ~ma.make_mask(data)
    
    return ag_mask
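The mask inversion at the end of `update_ag_mask` is easy to misread: `ma.make_mask` flags nonzero pixels as `True`, but in the `numpy.ma` convention `True` means *masked out*. Inverting therefore masks the zero (non-agricultural) pixels. A minimal sketch with a toy array:

```python
import numpy as np
from numpy import ma

# Toy "ag mask" raster: 1 = agriculture, 0/NaN = non-agriculture
data = np.array([[1.0, 0.0],
                 [np.nan, 1.0]])
data[np.isnan(data)] = 0          # NaN treated as non-agriculture

# make_mask flags nonzero cells True; invert so zeros become True,
# because in numpy.ma True means "masked out"
ag_mask = ~ma.make_mask(data)

evi = np.array([[0.5, 0.9],
                [0.7, 0.3]])
masked_evi = ma.array(evi, mask=ag_mask)
print(masked_evi.mean())          # mean over agricultural pixels only: 0.4
```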
def select_files(cycle, category='EVI_Area', target_yrs=None):
    """
    Extract relevant files from MCD12Q2 directory
    """
    
    file_dict = {}
    category_with_cycle = '{}_{}'.format(category, cycle)
    for file in DIR_MCD.iterdir():
        if file.suffix != ".tif":
            continue
        
        yr = file.parts[-1].split("_")[-2][3:7]
        if target_yrs:
            if int(yr) not in target_yrs:
                continue
        fname = file.parts[-1][12:]
        fname_parts = fname.split("_")
        this_category = sorted(fname_parts[:-2])
        target_category = sorted(category_with_cycle.split("_"))
        if this_category == target_category:
            file_dict[yr] = file
    
    
    
    return file_dict
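The string slicing in `select_files` leans on the MCD12Q2 file-naming pattern (e.g. the sample file `MCD12Q2.061_EVI_Area_0_doy2001001_aid0001.tif` used in `update_ag_mask`): the second-to-last underscore token carries `doy<year><doy>`, and characters 12 onward carry `<category>_<cycle>_...`. A quick check of that logic on the sample name:

```python
name = 'MCD12Q2.061_EVI_Area_0_doy2001001_aid0001.tif'

# Year: second-to-last "_" token is 'doy2001001'; chars 3:7 are the year
yr = name.split("_")[-2][3:7]

# Category + cycle: drop the 12-char 'MCD12Q2.061_' prefix, then drop the
# trailing 'doy...' and 'aid...' tokens
fname_parts = name[12:].split("_")
this_category = sorted(fname_parts[:-2])

print(yr, this_category)   # 2001 ['0', 'Area', 'EVI']
```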
def days_since_epoch_to_doy(days):
    """
    Converts days since epoch to day of the year where 1 is Jan-1 and 365 is Dec-31.
    """
    
    # Subtract 182 to shift to June-July
    if days == 32767:
        return 32767
    else:
        days = int(days - 182)
        new_date = datetime(1970, 1, 1) + timedelta(days - 1)
        doy = new_date.timetuple().tm_yday
        
        return doy
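To sanity-check the shift, the helper (reproduced below) maps MCD12Q2 dates, stored as days since 1970-01-01 with fill value 32767, into a season calendar that starts around July 1: an early-July date lands on day 1 of the shifted year, and the fill value passes through untouched.

```python
from datetime import datetime, timedelta

def days_since_epoch_to_doy(days):
    # Same logic as the helper above: fill value passes through,
    # valid dates shift back ~182 days so the year starts near Jul 1
    if days == 32767:
        return 32767
    days = int(days - 182)
    new_date = datetime(1970, 1, 1) + timedelta(days - 1)
    return new_date.timetuple().tm_yday

raw = (datetime(2019, 7, 3) - datetime(1970, 1, 1)).days  # days since epoch
print(days_since_epoch_to_doy(raw))     # 1  -> early July maps to day 1
print(days_since_epoch_to_doy(32767))   # 32767 (fill value untouched)
```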
def prepare_raster_for_zonal_stats(raster1, raster2=None, ag_mask=None):
        """
        Prepare array before calculating zonal stats
        """
        ds = rasterio.open(raster1)
        no_data = ds.meta['nodata']
        transform = ds.meta['transform']
        arr = ds.read(1)
        
        if raster2:
            ds2 = rasterio.open(raster2)
            arr2 = ds2.read(1)
            if ag_mask is not None:
                arr2 = np.where(arr2!=no_data, arr2, np.NaN)
                arr = np.where(arr!=no_data, arr, np.NaN)
                arr_out = arr + arr2
                
                # Apply the ag mask; fill masked pixels and NaNs with no_data
                arr_out_mask = ma.array(arr_out, mask=ag_mask)
                arr_out = arr_out_mask.filled(np.NaN)
                arr_out[np.isnan(arr_out)] = no_data
                return arr_out, transform, no_data
            else:
                arr2 = np.where(arr2!=no_data, arr2, np.NaN)
                arr = np.where(arr!=no_data, arr, np.NaN)
                arr_out = arr + arr2
                arr_out[np.isnan(arr_out)] = no_data
                return arr_out, transform, no_data
        else:
            if ag_mask is not None:
                # Apply the ag mask and fill masked pixels with no_data
                arr_main = ma.array(arr, mask=ag_mask)
                return arr_main.filled(no_data), transform, no_data
            else:
                return arr, transform, no_data
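Worth noting when flattening a masked array for zonal stats: `MaskedArray.filled()` writes a fill value into the masked cells, whereas `.data` returns the underlying buffer with masked cells left unchanged (so the mask is silently ignored). A toy illustration:

```python
import numpy as np
from numpy import ma

no_data = -999.0
arr = np.array([0.8, 0.5, 0.2])
mask = np.array([False, True, False])   # middle pixel outside the ag mask

masked = ma.array(arr, mask=mask)

# .data keeps the original value in masked cells (0.5 survives),
# so it is not nodata-aware on its own
print(masked.data[1])             # 0.5
# .filled() substitutes the nodata value into masked cells
print(masked.filled(no_data)[1])  # -999.0
```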
def generate_evi_stats(adm_hhs, coords_notin_geo, evi_category, 
                       colname, use_ag_mask=False, avg_yrs=None, 
                       survey_yr=None):
    
    # Extract files for total change in greenness
    if evi_category == 'EVI_max':
        files = select_files(cycle=CYCLE, category='EVI_Amplitude', target_yrs=avg_yrs)
        files_min = select_files(cycle=CYCLE, category='EVI_Minimum', target_yrs=avg_yrs)
    else:
        files = select_files(cycle=CYCLE, category=evi_category, target_yrs=avg_yrs)
    
    # Load ag_mask in case we need it
    mask = update_ag_mask(ag_mask_file=FILE_AG_MASK)
    
    # Lambda function to convert days since 1970-1-1 to doy
    convert2doy = np.vectorize(days_since_epoch_to_doy)
    
    
    # =================================
    # LOOP THROUGH THE YEARS
    # =================================
    for yr, file in files.items():
        col = file.parts[-1].split("_")[-2]
        if evi_category=='EVI_max':
            second_raster = files_min[yr]
            if use_ag_mask:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file, raster2=second_raster, 
                                                                    ag_mask=mask)
            else:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file, raster2=second_raster)
            
            zstats = [i['mean']/10000 for i in list(gen_zonal_stats(adm_hhs , input_arr, affine=transform, 
                                      stats=['mean'], nodata=no_data))]
        elif evi_category=='Greenup' or evi_category=='Senescence':
            if use_ag_mask:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file, ag_mask=mask)
                input_arr = convert2doy(input_arr)
            else:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file)
                input_arr = convert2doy(input_arr)
        
            zstats = [i['mean'] for i in list(gen_zonal_stats(adm_hhs , input_arr, affine=transform, 
                                      stats=['mean'], nodata=no_data))]
        else:
            if use_ag_mask:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file, ag_mask=mask)
            else:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file)
        
            zstats = [i['mean']/10 for i in list(gen_zonal_stats(adm_hhs , input_arr, affine=transform, 
                                      stats=['mean'], nodata=no_data))]
        
        adm_hhs[col] = zstats 
    
    # ===========================================
    # EXTRACT 1 YEAR DATA OR AVERAGE OVER YEARS
    # ============================================
    if survey_yr:
        print('Extracting value for {} only'.format(survey_yr))
        yr_colname = 'doy{}001'.format(survey_yr)
        adm_hhs[colname] =  adm_hhs[yr_colname]
        df = adm_hhs[[UNIQ_LOC_ID,LAT, LON, colname]]
    else:
        avg_cols = [c for c in adm_hhs.columns if c[:3] == 'doy']
        avg_yrs = sorted(files.keys())
        print('-'*50)
        print('     AVERAGING OVER THE FOLLOWING YEARS')
        print('-'*50)
        print(avg_yrs)
        adm_hhs[colname] =  adm_hhs[avg_cols].mean(axis=1)
        df = adm_hhs[[UNIQ_LOC_ID,LAT, LON, colname]]
        
    
    # =================================
    # DEAL WITH MISSING VALUES
    # =================================
    # pandas >= 2.0 removed DataFrame.append; use concat instead
    df = pd.concat([df, coords_notin_geo])
    null_cnt = df[colname].isnull().sum()
    if  null_cnt > 0:
        print()
        print('REPLACING {} MISSING VALUES ......'.format(null_cnt))
        df2 = utils.replace_missing_with_nearest(df=df, target_col=colname, 
                                                 idcol=UNIQ_LOC_ID, lon=LON, lat=LAT)
        df = df2[[UNIQ_LOC_ID, colname]]
    assert df[colname].isnull().sum() == 0
    df.set_index(UNIQ_LOC_ID, inplace=True)
    
    return df, colname
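Conceptually, each zonal statistic computed above is just the mean of the valid, in-polygon pixels, rescaled by the layer's scale factor (the `/10000` for EVI values and `/10` for EVI area mirror the MCD12Q2 product's scale factors). A pure-NumPy sketch of that reduction, using a hypothetical nodata value and pixel values:

```python
import numpy as np

no_data = 32767
# Hypothetical scaled EVI pixels falling inside one district polygon
pixels = np.array([4200, 5100, no_data, 4800], dtype=float)

# Drop nodata pixels, average, then apply the EVI scale factor
valid = pixels[pixels != no_data]
zonal_mean = valid.mean() / 10000
print(round(zonal_mean, 3))   # 0.47
```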

Compute variables#

def compute_vars():
    """
    Putting everything together. 
    """
    # ===========================
    # LOAD COORDINATES
    # ===========================
    # FILE_HH_PUB_GEOVARS points to a CSV, so read it with read_csv
    df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
    global UNIQ_LOC_ID
    if df_pub_coords[HH_ID].nunique() != len(df_pub_coords.drop_duplicates(subset=[LAT, LON])):
        df_pub_coords[UNIQ_LOC_ID] = df_pub_coords.apply(lambda x: '{}-{}'.format(x[LON], x[LAT]), axis=1)
        df_coords = df_pub_coords[[UNIQ_LOC_ID, LAT, LON]]
        df_coords.drop_duplicates(subset=UNIQ_LOC_ID, inplace=True)
    else:
        UNIQ_LOC_ID = HH_ID
        df_coords = df_pub_coords[[UNIQ_LOC_ID, LAT, LON]]
        df_coords[UNIQ_LOC_ID] = df_coords[UNIQ_LOC_ID].astype(str)

    gdf_coords = gpd.GeoDataFrame(df_coords, 
                                          geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)

    gdf_adm2 = gpd.read_file(FILE_ADM2)

    gdf_adm2_hhs = gpd.sjoin(gdf_adm2, gdf_coords, predicate='contains')
    gdf_adm2_hhs = gdf_adm2_hhs[['geometry', 'shapeID', UNIQ_LOC_ID, LAT, LON]]
    coords_not_in_adm2 = list(set(gdf_coords[UNIQ_LOC_ID].unique()) - set(gdf_adm2_hhs[UNIQ_LOC_ID].unique()))
    df_coords_not_in_adm2 = df_coords.query('{} in {}'.format(UNIQ_LOC_ID, coords_not_in_adm2))[[UNIQ_LOC_ID, LAT, LON]]
    
    # Keep variables here 
    df_list = {}

    # =========================
    # EVI Area
    # =========================
    # Survey-year EVI_Area
    colname = 'h{}_eviarea'.format(SURVEY_CALENDAR_YR)
    df_evia_sy, eviarea_sy = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='EVI_Area',
                            colname=colname, avg_yrs=AVG_YEARS, use_ag_mask=True, survey_yr=SURVEY_CALENDAR_YR)
    df_list[eviarea_sy] = df_evia_sy

    # Average EVI_Area
    df_avg_evia, avg_eviarea = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='EVI_Area',
                            colname='avg_eviarea', avg_yrs=AVG_YEARS, use_ag_mask=True)

    df_list[avg_eviarea] = df_avg_evia


    # =========================
    # EVI Maximum
    # =========================
    # Survey year EVI maximum
    colname = 'h{}_evimax'.format(SURVEY_CALENDAR_YR)
    df_evimax_sy, evimax_sy = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='EVI_max',
                            colname=colname, avg_yrs=AVG_YEARS, use_ag_mask=True, survey_yr=SURVEY_CALENDAR_YR)
    df_list[evimax_sy] = df_evimax_sy

    # Average EVI maximum
    df_avg_evimax, avg_evimax = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='EVI_max',
                            colname='avg_evimax', avg_yrs=AVG_YEARS, use_ag_mask=True)

    df_list[avg_evimax] = df_avg_evimax


    # =========================
    # TIMING OF GREENUP
    # =========================
    # Survey year Greenup
    colname = 'h{}_grn'.format(SURVEY_CALENDAR_YR)
    df_grn_sy, grn_sy = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='Greenup',
                            colname=colname, avg_yrs=AVG_YEARS, use_ag_mask=True, survey_yr=SURVEY_CALENDAR_YR)
    df_list[grn_sy] = df_grn_sy

    # Average Greenup
    df_avg_grn, avg_grn = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='Greenup',
                            colname='avg_grn', avg_yrs=AVG_YEARS, use_ag_mask=False)

    df_list[avg_grn] = df_avg_grn


    # =========================
    # TIMING OF SENESCENCE
    # =========================
    # Survey year senescence
    colname = 'h{}_sen'.format(SURVEY_CALENDAR_YR)
    df_sen_sy, sen_sy = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category="Senescence",
                            colname=colname, avg_yrs=AVG_YEARS, use_ag_mask=True, survey_yr=SURVEY_CALENDAR_YR)
    df_list[sen_sy] = df_sen_sy

    # Average senescence
    df_avg_sen, avg_sen = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category="Senescence",
                            colname='avg_sen', avg_yrs=AVG_YEARS, use_ag_mask=False)

    df_list[avg_sen] = df_avg_sen
    
    # =====================
    # COMBINE VARS
    # =====================
    df = pd.concat(list(df_list.values()), axis=1)
    col_sen = 'h{}_sen'.format(SURVEY_CALENDAR_YR)
    col_grn = 'h{}_grn'.format(SURVEY_CALENDAR_YR)
    col_evimax = 'h{}_evimax'.format(SURVEY_CALENDAR_YR)
    col_eviarea = 'h{}_eviarea'.format(SURVEY_CALENDAR_YR)
    
    df[[col_evimax,'avg_evimax']] = df[[col_evimax,'avg_evimax']].applymap(lambda x: round(x, 3) if isinstance(x, float) else x)

    cols_to_int = [col_eviarea, 'avg_eviarea', col_grn, 'avg_grn', col_sen, 'avg_sen']
    df[cols_to_int] = df[cols_to_int].applymap(lambda x: int(round(x, 0)) if isinstance(x, float) else x)
    
    # ===========================
    # SAVE OUTPUTS
    # ===========================
    out_csv = DIR_OUTPUTS.joinpath('phenology{}.csv'.format(FILENAME_SUFFIX))
    df.to_csv(out_csv)

    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('phenology-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    return out_csv, list(df_list.keys())
def process_pheno_vars_from_rscript(yrs=None):
    """
    Load phenology variables precomputed in R and attach them to the
    public survey coordinates via district (ADM2) boundaries.
    """
    
    # ===================================
    # LOAD HH PUB COORDINATES
    # ===================================
    df_pub_coords = pd.read_csv(FILE_HH_PUB_GEOVARS)
    df_pub_coords  = df_pub_coords.query('make_public == 1')
    df_pub_coords[TEMP_UNIQ_ID] = df_pub_coords.apply(lambda x: "{}-{}".format(x[LON], x[LAT]), axis=1)
    df_pub_coords2 = df_pub_coords.drop_duplicates(subset=['uniq_loc'])
    
    df_coords = df_pub_coords2[[TEMP_UNIQ_ID, LAT, LON]]
    gdf_coords = gpd.GeoDataFrame(df_coords, 
                                      geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]),
                                  crs=4326)
    
    # ===================================
    # LOAD TIME SERIES (TS) AND SUMMARY 
    # VARS FROM SIOBHAN'S R-SCRIPT
    # ===================================
    df_sum = pd.read_csv(FILE_PHENO_FROM_R_SUM)
    df_ts = pd.read_csv(FILE_PHENO_FROM_R_TS)
    
    # For time series, keep only the cols we need
    # based on years
    yrs_keep = ['shapeID']
    for c in df_ts.columns:
        try:
            yr = c.split("_")[1]
            if yrs and yr in yrs:
                yrs_keep.append(c)
        except IndexError:
            # column name has no "_<year>" suffix (e.g. shapeID)
            pass
    df_ts = df_ts[yrs_keep]
    df_sum.drop(columns=['shapeName'], inplace=True)
    df = df_sum.merge(df_ts, on='shapeID')
    assert len(df) == len(df_sum)
    
    # =========================================
    # MERGE WITH ADMIN REGIONS AND THEN SURVEY
    # PUB COORDINATES 
    # ========================================
    # Merge admin regions with geovars
    df_adm = gpd.read_file(FILE_ADM2)
    df_adm2 = df_adm.merge(df, on='shapeID')
    
    # Add the geovars to pub coordinates
    gdf_coords2 = gpd.sjoin(left_df=gdf_coords, right_df=df_adm2, predicate='within', how='left')
    
    # =========================================
    # RESOLVE MISSING VALUES
    # ========================================
    # To Do
    # df_pub_coords_geovars = df_pub_coords[[TEMP_UNIQ_ID, 'y5_hhid', 'y5_clusterid']].merge(gdf_coords2, on=TEMP_UNIQ_ID)
    
    # ===========================
    # SAVE OUTPUTS
    # ===========================
    out_csv = DIR_OUTPUTS.joinpath('phenology-{}.csv'.format(FILENAME_SUFFIX))
    # change column names to match with wave4
    new_cols = {'eviarea_avg': 'avg_eviarea', 'evimax_avg':'avg_evimax', 
                'eviarea_2019': 'h2019_eviarea', 'evimax_2019': 'h2019_evimax',
                'gr_2019': 'h2019_grn', 'se_2019':'h2019_sen','grn_avg': 'avg_grn',
                'sen_avg':'avg_sen'}
                
    gdf_coords2.rename(columns=new_cols, inplace=True)
    
    # keep only the columns we need
    df = gdf_coords2[list(new_cols.values()) + [TEMP_UNIQ_ID]]
    df.set_index(TEMP_UNIQ_ID, inplace=True)
    df.to_csv(out_csv)

    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('phenology-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    return out_csv, list(new_cols.values())
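The year filter inside `process_pheno_vars_from_rscript` keeps `shapeID` plus any time-series column whose token after the first underscore is a requested year (e.g. `gr_2019`). A small pandas sketch of that selection, with hypothetical column names:

```python
import pandas as pd

df_ts = pd.DataFrame(columns=['shapeID', 'gr_2018', 'gr_2019', 'se_2020'])
yrs = ['2019', '2020']

yrs_keep = ['shapeID']
for c in df_ts.columns:
    try:
        if c.split("_")[1] in yrs:   # token after the first "_" is the year
            yrs_keep.append(c)
    except IndexError:               # columns without "_<year>" are skipped
        pass

print(yrs_keep)   # ['shapeID', 'gr_2019', 'se_2020']
```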
if LOAD_FROM_FILE:
    out_csv, out_vars = process_pheno_vars_from_rscript(yrs=['2019', '2020', '2021'])
else:
    out_csv, out_vars = compute_vars()

Quality checks#

Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.

df = pd.read_csv(out_csv)
# Only use describe for continuous variables
print('-'*50)
print('       SUMMARY STATISTICS FOR CONTINUOUS VARS')
print('-'*50)
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR CONTINUOUS VARS
--------------------------------------------------
                     mean      min         50%         75%         max
avg_eviarea      9.819046    5.619    9.117000   11.300000   18.785000
avg_evimax       0.458581    0.224    0.456000    0.512000    0.659000
h2019_eviarea   11.451269    4.000   10.777027   13.902597   26.350000
h2019_evimax     0.474663    0.242    0.476833    0.521810    0.620618
h2019_grn      268.451120  168.000  270.000000  273.000000  313.000000
h2019_sen       54.743381    1.000   15.000000   30.000000  364.000000
avg_grn        266.310788  182.000  266.000000  276.000000  302.000000
avg_sen        146.190068   13.000  134.000000  212.250000  366.000000
----------------------------------------------------------------------

Summary#

The following variables were generated.

:name: geovariables_crop_season

| ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required |
|----|-------|----------------------|-------------|-------|------------|--------------|----------|
| 1 | Crop Season Parameters | Average total change in greenness within primary growing season | crops07 | N/A | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 2 | Crop Season Parameters | Average timing of onset of greenness increase[^10] | crops07 | Day of the year | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 3 | Crop Season Parameters | Average timing of onset of greenness decrease[^11] | crops07 | Day of the year | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 4 | Crop Season Parameters | Total change in greenness within primary growing season[^12] | crops07 | N/A | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 5 | Crop Season Parameters | Onset of greenness increase[^13] | crops07 | Day of the year | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 6 | Crop Season Parameters | Onset of greenness decrease[^14] | crops07 | Day of the year | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 7 | Crop Season Parameters | Average EVI value at peak of greenness within growing season[^15] | crops07 | N/A | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 8 | Crop Season Parameters | EVI value at peak of greenness within growing season[^16] | crops07 | N/A | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |