Crop Season-Phenology#


Author: Dunstan Matekenya

Affiliation: DECAT, The World Bank Group

Date: April 16, 2023

Background#

In this notebook, we generate geovariables in the crop season (phenology) category for the National Panel Survey (NPS 2012-2013) in Tanzania. The following phenology geovariables are computed.

  1. Average total change in greenness (integral of daily EVI values) within primary growing season

  2. Average timing of onset of greenness increase in day of year 1-365, where Jul 1 = 1 (see the example below)

  3. Average timing of onset of greenness decrease in day of year 1-365, where Jul 1 = 1, averaged by district

  4. Total change in greenness (integral of daily EVI values) within primary growing season for July 2011 - Jun 2012, averaged by district

  5. Onset of greenness increase in day of year 1-365, starting July 1 2011, averaged by district

  6. Onset of greenness decrease in day of year 1-365, starting July 1 2011, averaged by district

  7. Total change in greenness (integral of daily EVI values) within primary growing season for July 2012 - Jun 2013

  8. Onset of greenness increase in day of year 1-365, starting July 1 2012, averaged by district

  9. Onset of greenness decrease in day of year 1-365, starting July 1 2012, averaged by district

  10. Average EVI value at peak of greenness within growing season, starting July 1

  11. EVI value at peak of greenness within growing season, starting July 1 2011, averaged by district

  12. EVI value at peak of greenness within growing season, starting July 1 2012, averaged by district

The naming and definitions of the variables are based on the MCD12Q2 (MODIS Land Cover Dynamics) product. For full variable metadata, including definitions of the metrics and other important information, see the Data-Cover-Page.
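As a quick illustration of the July-based day-of-year convention used above, the sketch below maps a calendar date to the shifted day of year. The shifted_doy helper is hypothetical and for illustration only; the conversion actually used in this notebook is days_since_epoch_to_doy, defined under the helper functions.

from datetime import date

def shifted_doy(d):
    # Hypothetical helper: day of year counted from July 1, so that Jul 1 = 1
    start_year = d.year if (d.month, d.day) >= (7, 1) else d.year - 1
    return (d - date(start_year, 7, 1)).days + 1

print(shifted_doy(date(2012, 7, 1)))  # 1   (season start)
print(shifted_doy(date(2013, 1, 1)))  # 185 (mid-season)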

Input datasets#

The following datasets have been used for all the variables.

  1. MCD12Q2 v061 (MODIS Land Cover Dynamics) phenology layers, stored in a OneDrive directory.^

  2. ESA CCI 2015 agricultural land-cover mask (ESACCI_2015_agmask.tif).

  3. geoBoundaries administrative boundaries for Tanzania (ADM0, ADM2 and ADM3).

  4. NPS wave 4 public (displaced) household GPS coordinates (npsy4.ea.offset.dta).

^ Only WBG staff can access this OneDrive directory.

Python setup#

The following packages are required for the processing in this notebook.

import warnings
warnings.simplefilter(action='ignore')
import sys
import itertools
from pathlib import Path
from datetime import datetime, timedelta

import geopandas as gpd
import rasterio
from rasterio.enums import Resampling
from rasterstats import gen_zonal_stats
import pandas as pd
import numpy as np
from numpy import ma

spanon = Path.cwd().parents[2].joinpath("Spatial-Anonymization/src/")
sys.path.append(str(spanon))
from spatial_anonymization import utils
from spatial_anonymization import point_displacement

Setup input files#

Define paths to input data files as global variables using the pathlib Path class.

# =============================
# BASE WORKING DIR
# ============================
DIR_DATA = Path.cwd().parents[1].joinpath('data', 'TZA')
DIR_SPATIAL_DATA = DIR_DATA.joinpath('spatial')
DIR_NPS_W4 = DIR_DATA.joinpath('surveys', 'NPS_w4')
DIR_OUTPUTS = DIR_NPS_W4.joinpath('geovars', 'puf')

# =================
# GEOVARS DATA
# =================
DIR_MCD = Path.cwd().parents[1].joinpath('data', 'MCD12Q2', 'MCD12Q2.061_2000336_to_2021365')
FILE_AG_MASK = DIR_SPATIAL_DATA.joinpath('ESACCI_2015_agmask.tif')

# =============================
# SURVEY COORDINATES 
# ============================
# Public (displaced) GPS coordinates file
FILE_HH_PUB_GEOVARS = DIR_NPS_W4.joinpath('geovars', 'npsy4.ea.offset.dta')


# =============================
# ADMIN BOUNDARIES
# =============================
# use geoboundaries API for country bounds
FILE_ADM0 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM0-all','geoBoundaries-TZA-ADM0.shp')
FILE_ADM3 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM3-all','geoBoundaries-TZA-ADM3.shp')
FILE_ADM2 = DIR_SPATIAL_DATA.joinpath('geoBoundaries-TZA-ADM2-all','geoBoundaries-TZA-ADM2.shp')

Setup processing global variables#

Values that are used frequently, such as the names of the columns containing latitude and longitude, are set as global variables.

# ====================
# GENERAL
# =====================
# in case we need to download OSM data
GEOFABRICK_REGION  = 'africa'
COUNTRY_NAME = 'Tanzania'
COUNTRY_ISO = 'TZA'
FILENAME_SUFFIX = 'npsw4'
# When survey was conducted
SURVEY_CALENDAR_YR = 2015
AVG_YEARS = [i for i in range(2001, SURVEY_CALENDAR_YR+1)]
CYCLE = 0

# ==============
# BUFFER
# ==============
# whether to buffer points or
# extract values at point
BUFFER = True
BUFF_ZONE_DIST = 1000
    
# ====================
# INPUT DATA SPECS
# =====================
# Cols for input coordinates file
LAT = 'lat_modified' 
LON = 'lon_modified'
HH_ID = 'clusterid'

# In case HH_ID isn't unique per (lat, lon) pair, create a temporary
# location ID based on the (lat, lon) pair
UNIQ_LOC_ID = 'loc' 

# =============================
# VARIABLE LABELS
# ============================
VAR_LABELS = {'h{}_eviarea'.format(SURVEY_CALENDAR_YR): 'Total change in greenness within primary growing season',
 'avg_eviarea': 'Average total change in greenness within primary growing season',
 'h{}_evimax'.format(SURVEY_CALENDAR_YR): 'EVI value at peak of greenness within growing season',
 'avg_evimax': 'Average EVI value at peak of greenness within growing season',
 'avg_grn': 'Average timing of onset of greenness increase',
 'h{}_grn'.format(SURVEY_CALENDAR_YR): 'Onset of greenness increase',
 'h{}_sen'.format(SURVEY_CALENDAR_YR): 'Onset of greenness decrease',
 'avg_sen': 'Average timing of onset of greenness decrease'}
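# Sanity-check label lengths (Stata variable labels are limited to 80 characters)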
for name, label in VAR_LABELS.items():
    try:
        assert len(label) < 80, 'LABEL TOO LONG'
    except AssertionError:
        print(name)

Define helper functions for computation#

def update_ag_mask(ag_mask_file):
    """
    Update ag mask TIF so that it matches MCD12Q2 layers dimensions
    """
    
    sample_MCD12Q2_tif = DIR_MCD.joinpath('MCD12Q2.061_EVI_Area_0_doy2001001_aid0001.tif')
    ds_MCD12Q2 = rasterio.open(sample_MCD12Q2_tif)
    MCD12Q2_shape = ds_MCD12Q2.meta['height'], ds_MCD12Q2.meta['width']
    with rasterio.open(ag_mask_file) as dataset:
        # resample data to target shape
        data = dataset.read(
            out_shape=MCD12Q2_shape,
            resampling=Resampling.nearest
        )
        
        # Set NaN to zero for the sake of creating numpy mask
        data[np.isnan(data)] = 0
    
    # Create numpy mask which sets all zeros to False
    # Because we actually want the zeros to be True (in the np.ma mask style)
    # we invert the mask
    ag_mask = ~ma.make_mask(data)
    
    return ag_mask
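A quick sanity check of the resampled mask (a sketch; it assumes the sample MCD12Q2 GeoTIFF referenced inside update_ag_mask is present on disk):

ag_mask = update_ag_mask(ag_mask_file=FILE_AG_MASK)
# Band-first boolean array matching the MCD12Q2 grid; True marks pixels
# to EXCLUDE (non-agricultural land) under numpy.ma conventions
print(ag_mask.shape, ag_mask.dtype)
print('share of agricultural pixels:', 1 - ag_mask.mean())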
def select_files(cycle, category='EVI_Area', target_yrs=None):
    """
    Extract relevant files from MCD12Q2 directory
    """
    
    file_dict = {}
    category_with_cycle = '{}_{}'.format(category, cycle)
    for file in DIR_MCD.iterdir():
        if file.suffix != ".tif":
            continue
        
        yr = file.parts[-1].split("_")[-2][3:7]
        if target_yrs:
            if int(yr) not in target_yrs:
                continue
        fname = file.parts[-1][12:]
        fname_parts = fname.split("_")
        this_category = sorted(fname_parts[:-2])
        target_category = sorted(category_with_cycle.split("_"))
        if this_category == target_category:
            file_dict[yr] = file
    
    
    
    return file_dict
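The string slicing in select_files relies on the AppEEARS-style naming of the MCD12Q2 layers; the steps can be traced on a sample filename (illustrative only):

# How select_files() parses a layer name
name = 'MCD12Q2.061_EVI_Area_0_doy2001001_aid0001.tif'
yr = name.split('_')[-2][3:7]        # 'doy2001001' -> '2001'
fname_parts = name[12:].split('_')   # drop the 'MCD12Q2.061_' prefix
category = sorted(fname_parts[:-2])  # ['0', 'Area', 'EVI'] == sorted 'EVI_Area_0'
print(yr, category)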
def days_since_epoch_to_doy(days):
    """
    Converts MCD12Q2 dates given as days since 1970-01-01 into day of year
    in a July-based year (day 1 falls on approximately July 1, matching the
    variable definitions above). The fill value 32767 is passed through.
    """
    
    if days == 32767:
        return 32767
    else:
        # Shift back 182 days so that the year starts around July 1
        days = int(days - 182)
        new_date = datetime(1970, 1, 1) + timedelta(days - 1)
        doy = new_date.timetuple().tm_yday
        
        return doy
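Two spot checks on the conversion (the day-count below is an illustrative value, worked out by hand):

# The MCD12Q2 fill value is passed through unchanged
print(days_since_epoch_to_doy(32767))  # 32767
# 16620 days since 1970-01-01 is 2015-07-04, i.e. the start of the
# July-based year
print(days_since_epoch_to_doy(16620))  # 2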
def prepare_raster_for_zonal_stats(raster1, raster2=None, ag_mask=None):
    """
    Prepare an array for zonal stats. If a second raster is given (e.g.
    EVI_Minimum to add to EVI_Amplitude), the two layers are summed; if an
    agricultural mask is given, non-agricultural pixels are set to nodata.
    """
    with rasterio.open(raster1) as ds:
        no_data = ds.meta['nodata']
        transform = ds.meta['transform']
        arr = ds.read(1)
    
    if raster2:
        with rasterio.open(raster2) as ds2:
            arr2 = ds2.read(1)
        # Sum the two layers, propagating nodata as NaN
        arr = np.where(arr != no_data, arr, np.NaN)
        arr2 = np.where(arr2 != no_data, arr2, np.NaN)
        arr_out = arr + arr2
        if ag_mask is not None:
            # Drop non-agricultural pixels (mask is True where excluded)
            arr_out = ma.array(arr_out, mask=ag_mask).filled(np.NaN)
        arr_out[np.isnan(arr_out)] = no_data
        return arr_out, transform, no_data
    else:
        if ag_mask is not None:
            # Fill non-agricultural pixels with the nodata value
            arr = ma.array(arr, mask=ag_mask).filled(no_data)
        return arr, transform, no_data
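The masking step follows numpy.ma conventions: True in the mask marks a pixel to drop, and .filled() materializes the nodata value into a plain array. A minimal sketch with made-up values:

arr = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[False, True], [True, False]])  # True = excluded pixel
no_data = -9999.0
out = ma.array(arr, mask=mask).filled(no_data)
print(out)  # masked cells replaced by -9999.0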
def generate_evi_stats(adm_hhs, coords_notin_geo, evi_category, 
                       colname, use_ag_mask=False, avg_yrs=None, 
                       survey_yr=None):
    
    # Extract files for total change in greenness
    if evi_category == 'EVI_max':
        files = select_files(cycle=CYCLE, category='EVI_Amplitude', target_yrs=avg_yrs)
        files_min = select_files(cycle=CYCLE, category='EVI_Minimum', target_yrs=avg_yrs)
    else:
        files = select_files(cycle=CYCLE, category=evi_category, target_yrs=avg_yrs)
    
    # Load ag_mask in case we need it
    mask = update_ag_mask(ag_mask_file=FILE_AG_MASK)
    
    # Lambda function to convert days since 1970-1-1 to doy
    convert2doy = np.vectorize(days_since_epoch_to_doy)
    
    
    # =================================
    # LOOP THROUGH THE YEARS
    # =================================
    for yr, file in files.items():
        col = file.parts[-1].split("_")[-2]
        if evi_category=='EVI_max':
            second_raster = files_min[yr]
            if use_ag_mask:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file, raster2=second_raster, 
                                                                    ag_mask=mask)
            else:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file, raster2=second_raster)
            
            zstats = [i['mean']/10000 for i in list(gen_zonal_stats(adm_hhs , input_arr, affine=transform, 
                                      stats=['mean'], nodata=no_data))]
        elif evi_category=='Greenup' or evi_category=='Senescence':
            if use_ag_mask:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file, ag_mask=mask)
                input_arr = convert2doy(input_arr)
            else:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file)
                input_arr = convert2doy(input_arr)
        
            zstats = [i['mean'] for i in list(gen_zonal_stats(adm_hhs , input_arr, affine=transform, 
                                      stats=['mean'], nodata=no_data))]
        else:
            if use_ag_mask:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file, ag_mask=mask)
            else:
                input_arr, transform, no_data = prepare_raster_for_zonal_stats(raster1=file)
        
            zstats = [i['mean']/10 for i in list(gen_zonal_stats(adm_hhs , input_arr, affine=transform, 
                                      stats=['mean'], nodata=no_data))]
        
        adm_hhs[col] = zstats 
    
    # ===========================================
    # EXTRACT 1 YEAR DATA OR AVERAGE OVER YEARS
    # ============================================
    if survey_yr:
        print('Extracting value for {} only'.format(survey_yr))
        yr_colname = 'doy{}001'.format(survey_yr)
        adm_hhs[colname] =  adm_hhs[yr_colname]
        df = adm_hhs[[UNIQ_LOC_ID,LAT, LON, colname]]
    else:
        avg_cols = [c for c in adm_hhs.columns if c[:3] == 'doy']
        avg_yrs = sorted(files.keys())
        print('-'*50)
        print('     AVERAGING OVER THE FOLLOWING YEARS')
        print('-'*50)
        print(avg_yrs)
        adm_hhs[colname] =  adm_hhs[avg_cols].mean(axis=1)
        df = adm_hhs[[UNIQ_LOC_ID,LAT, LON, colname]]
        
    
    # =================================
    # DEAL WITH MISSING VALUES
    # =================================
    # Add back locations that fell outside the admin boundaries
    df = pd.concat([df, coords_notin_geo])
    null_cnt = df[colname].isnull().sum()
    if  null_cnt > 0:
        print()
        print('REPLACING {} MISSING VALUES ......'.format(null_cnt))
        df2 = utils.replace_missing_with_nearest(df=df, target_col=colname, 
                                                 idcol=UNIQ_LOC_ID, lon=LON, lat=LAT)
        df = df2[[UNIQ_LOC_ID, colname]]
    assert df[colname].isnull().sum() == 0
    df.set_index(UNIQ_LOC_ID, inplace=True)
    
    return df, colname

Compute variables#

def compute_vars():
    """
    Putting everything together. 
    """
    # ===========================
    # LOAD COORDINATES
    # ===========================
    df_pub_coords = pd.read_stata(FILE_HH_PUB_GEOVARS, convert_categoricals=False)
    global UNIQ_LOC_ID
    if df_pub_coords[HH_ID].nunique() != len(df_pub_coords.drop_duplicates(subset=[LAT, LON])):
        df_pub_coords[UNIQ_LOC_ID] = df_pub_coords.apply(lambda x: '{}-{}'.format(x[LON], x[LAT]), axis=1)
        df_coords = df_pub_coords[[UNIQ_LOC_ID, LAT, LON]]
        df_coords.drop_duplicates(subset=UNIQ_LOC_ID, inplace=True)
    else:
        UNIQ_LOC_ID = HH_ID
        df_coords = df_pub_coords[[UNIQ_LOC_ID, LAT, LON]]
        df_coords[UNIQ_LOC_ID] = df_coords[UNIQ_LOC_ID].astype(str)

    gdf_coords = gpd.GeoDataFrame(df_coords, 
                                          geometry=gpd.points_from_xy(df_coords[LON],df_coords[LAT]), crs=4326)

    gdf_adm2 = gpd.read_file(FILE_ADM2)

    gdf_adm2_hhs = gpd.sjoin(gdf_adm2, gdf_coords, predicate='contains')
    gdf_adm2_hhs = gdf_adm2_hhs[['geometry', 'shapeID', UNIQ_LOC_ID, LAT, LON]]
    coords_not_in_adm2 = list(set(gdf_coords[UNIQ_LOC_ID].unique()) - set(gdf_adm2_hhs[UNIQ_LOC_ID].unique()))
    df_coords_not_in_adm2 = df_coords.query('{} in {}'.format(UNIQ_LOC_ID, coords_not_in_adm2))[[UNIQ_LOC_ID, LAT, LON]]
    
    # Keep variables here 
    df_list = {}

    # =========================
    # EVI Area
    # =========================
    # Survey-year EVI_Area
    colname = 'h{}_eviarea'.format(SURVEY_CALENDAR_YR)
    df_evia_sy, eviarea_sy = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='EVI_Area',
                            colname=colname, avg_yrs=AVG_YEARS, use_ag_mask=True, survey_yr=SURVEY_CALENDAR_YR)
    df_list[eviarea_sy] = df_evia_sy

    # Average EVI_Area
    df_avg_evia, avg_eviarea = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='EVI_Area',
                            colname='avg_eviarea', avg_yrs=AVG_YEARS, use_ag_mask=True)

    df_list[avg_eviarea] = df_avg_evia


    # =========================
    # EVI Maximum
    # =========================
    # Survey year EVI maximum
    colname = 'h{}_evimax'.format(SURVEY_CALENDAR_YR)
    df_evimax_sy, evimax_sy = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='EVI_max',
                            colname=colname, avg_yrs=AVG_YEARS, use_ag_mask=True, survey_yr=SURVEY_CALENDAR_YR)
    df_list[evimax_sy] = df_evimax_sy

    # Average EVI maximum
    df_avg_evimax, avg_evimax = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='EVI_max',
                            colname='avg_evimax', avg_yrs=AVG_YEARS, use_ag_mask=True)

    df_list[avg_evimax] = df_avg_evimax


    # =========================
    # TIMING OF GREENUP
    # =========================
    # Survey year Greenup
    colname = 'h{}_grn'.format(SURVEY_CALENDAR_YR)
    df_grn_sy, grn_sy = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='Greenup',
                            colname=colname, avg_yrs=AVG_YEARS, use_ag_mask=True, survey_yr=SURVEY_CALENDAR_YR)
    df_list[grn_sy] = df_grn_sy

    # Average Greenup
    df_avg_grn, avg_grn = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category='Greenup',
                            colname='avg_grn', avg_yrs=AVG_YEARS, use_ag_mask=False)

    df_list[avg_grn] = df_avg_grn


    # =========================
    # TIMING OF SENESCENCE
    # =========================
    # Survey year senescence
    colname = 'h{}_sen'.format(SURVEY_CALENDAR_YR)
    df_sen_sy, sen_sy = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category="Senescence",
                            colname=colname, avg_yrs=AVG_YEARS, use_ag_mask=True, survey_yr=SURVEY_CALENDAR_YR)
    df_list[sen_sy] = df_sen_sy

    # Average senescence
    df_avg_sen, avg_sen = generate_evi_stats(gdf_adm2_hhs, df_coords_not_in_adm2, evi_category="Senescence",
                            colname='avg_sen', avg_yrs=AVG_YEARS, use_ag_mask=False)

    df_list[avg_sen] = df_avg_sen
    
    # =====================
    # COMBINE VARS
    # =====================
    df = pd.concat(list(df_list.values()), axis=1)
    col_sen = 'h{}_sen'.format(SURVEY_CALENDAR_YR)
    col_grn = 'h{}_grn'.format(SURVEY_CALENDAR_YR)
    col_evimax = 'h{}_evimax'.format(SURVEY_CALENDAR_YR)
    col_eviarea = 'h{}_eviarea'.format(SURVEY_CALENDAR_YR)
    
    df[[col_evimax,'avg_evimax']] = df[[col_evimax,'avg_evimax']].applymap(lambda x: round(x, 3) if isinstance(x, float) else x)

    cols_to_int = [col_eviarea, 'avg_eviarea', col_grn, 'avg_grn', col_sen, 'avg_sen']
    df[cols_to_int] = df[cols_to_int].applymap(lambda x: int(round(x, 0)) if isinstance(x, float) else x)
    
    # ===========================
    # SAVE OUTPUTS
    # ===========================
    out_csv = DIR_OUTPUTS.joinpath('phenology{}.csv'.format(FILENAME_SUFFIX))
    df.to_csv(out_csv)

    # Save variable labels
    out_csv_labels = DIR_OUTPUTS.joinpath('phenology-{}-labels.csv'.format(FILENAME_SUFFIX))
    df_var_labels = pd.DataFrame(list(VAR_LABELS.values()), index=list(VAR_LABELS.keys()))
    df_var_labels.index.name = 'var_name'
    df_var_labels.rename(columns={0:'label'}, inplace=True)
    df_var_labels.to_csv(out_csv_labels)
    
    return out_csv, list(df_list.keys())
out_csv, out_vars = compute_vars()

Quality checks#

Run summary statistics to generate minimum and maximum values, etc., and inspect them manually.

df = pd.read_csv(out_csv)
# Only use describe for continuous variables 
print('-'*50)
print('       SUMMARY STATISTICS FOR CONTINUOUS VARS')
print('-'*50)
stats = df[out_vars].describe().T
print(stats[['mean', 'min', '50%', '75%', 'max']])
print('-'*70)
--------------------------------------------------
       SUMMARY STATISTICS FOR CONTINUOUS VARS
--------------------------------------------------
                     mean      min      50%      75%      max
h2015_eviarea   13.768496    6.000   11.000   18.000   31.000
avg_eviarea     17.305489    8.000   17.000   20.000   32.000
h2015_evimax     0.463456    0.315    0.455    0.508    0.632
avg_evimax       0.476568    0.341    0.473    0.517    0.646
h2015_grn      181.202864   78.000  174.000  230.000  269.000
avg_grn        157.840095   88.000  155.000  183.000  247.000
h2015_sen      285.171838  152.000  288.000  319.000  344.000
avg_sen        280.959427  212.000  282.000  294.000  321.000
----------------------------------------------------------------------

Summary#

The following variables were generated.

:name: geovariables_crop_season

| ID | Theme | Variable Description | Variable ID | Units | Resolution | Data sources | Required |
|----|-------|----------------------|-------------|-------|------------|--------------|----------|
| 1 | Crop Season Parameters | Average total change in greenness within primary growing season | crops07 | N/A | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 2 | Crop Season Parameters | Average timing of onset of greenness increase[^10] | crops07 | Day of the year | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 3 | Crop Season Parameters | Average timing of onset of greenness decrease[^11] | crops07 | Day of the year | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 4 | Crop Season Parameters | Total change in greenness within primary growing season[^12] | crops07 | N/A | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 5 | Crop Season Parameters | Onset of greenness increase[^13] | crops07 | Day of the year | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 6 | Crop Season Parameters | Onset of greenness decrease[^14] | crops07 | Day of the year | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 7 | Crop Season Parameters | Average EVI value at peak of greenness within growing season[^15] | crops07 | N/A | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |
| 8 | Crop Season Parameters | EVI value at peak of greenness within growing season[^16] | crops07 | N/A | 0.004176 dd | MODIS-MOD12Q2(DOY185) Land Cover Dynamics | Yes |