{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/worldbank/OpenNightLights/blob/master/onl/tutorials/mod6_6_RF_classifier.ipynb)\n", "\n", "# Random Forest Classifier\n", "\n", "Now that we have processed and explored our data, we will try to classify built-up areas with a Random Forest ensemble of decision trees.\n", "\n", "Decision tree models like Random Forest are among the most powerful, easiest to use, and easiest to understand models in the machine learning portfolio. The resources noted in {doc}`mod6_2_supervised_learning_img_classification` are a good place to start.\n", "\n", "More in-depth context on methods like Random Forest is out of scope for this tutorial, but they are worth understanding well if you use them in your analysis, even for exploration.\n", "\n", "## Training data\n", "\n", "Let's recreate the training data \"image\" for 2015 that fuses Sentinel-2, VIIRS-DNB and GHSL for the Bagmati province." ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# reminder: if you are installing libraries in a Google Colab instance you will be prompted to restart your kernel\n", "\n", "try:\n", "    import geemap, ee\n", "    import seaborn as sns\n", "    import matplotlib.pyplot as plt\n", "except ModuleNotFoundError:\n", "    if 'google.colab' in str(get_ipython()):\n", "        print(\"package not found, installing w/ pip in Google Colab...\")\n", "        !pip install geemap seaborn matplotlib\n", "    else:\n", "        print(\"package not found, installing w/ conda...\")\n", "        !conda install mamba -c conda-forge -y\n", "        !mamba install geemap -c conda-forge -y\n", "        !conda install seaborn matplotlib -y\n", "    import geemap, ee\n", "    import seaborn as sns\n", "    import matplotlib.pyplot as plt" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "try:\n", "    ee.Initialize()\n", "except Exception as e:\n", "    ee.Authenticate()\n", "    ee.Initialize()\n", "\n", "# define some functions and variables\n", "def se2mask(image):\n", "    quality_band = image.select('QA60')\n", "    # bits 10 and 11 of the QA60 band flag clouds and cirrus, respectively\n", "    cloudmask = 1 << 10\n", "    cirrusmask = 1 << 11\n", "    # both bits must be zero (clear conditions); note the server-side .And(),\n", "    # since Python's `and` does not combine Earth Engine images element-wise\n", "    mask = quality_band.bitwiseAnd(cloudmask).eq(0).And(quality_band.bitwiseAnd(cirrusmask).eq(0))\n", "    return image.updateMask(mask).divide(10000)\n", "\n", "\n", "se2bands = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B8A']\n", "trainingbands = se2bands + ['avg_rad']\n", "label = 'smod_code'\n", "scaleFactor = 1000\n", "\n", "# create training data\n", "roi = ee.FeatureCollection(\"FAO/GAUL/2015/level2\").filter(ee.Filter.eq('ADM2_NAME', 'Bagmati')).geometry()\n", "\n", "se2 = ee.ImageCollection('COPERNICUS/S2').filterDate(\n", "    \"2015-07-01\", \"2015-12-31\").filterBounds(roi).filter(\n", "    ee.Filter.lt(\"CLOUDY_PIXEL_PERCENTAGE\", 20)).map(se2mask).median().select(se2bands).clip(roi)\n", "\n", "viirs = ee.Image(ee.ImageCollection(\"NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG\").filterDate(\n", "    \"2015-07-01\", \"2015-12-31\").filterBounds(roi).median().select('avg_rad').clip(roi))\n", "\n", "fused = se2.addBands(viirs)\n", "\n", "# create labels and overlay them on the training data\n", "ghsl = ee.ImageCollection('JRC/GHSL/P2016/SMOD_POP_GLOBE_V1').filter(ee.Filter.date(\n", "    '2015-01-01', '2015-12-31')).select(label).median().gte(2).clip(roi)\n", "\n", "points = ghsl.sample(region=roi, scale=scaleFactor, seed=0, geometries=True)\n", "\n", "data = fused.select(trainingbands).sampleRegions(collection=points,\n", "                                                 properties=[label],\n", "                                                 scale=scaleFactor)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As a gut check, let's look at the stats:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max': 1,\n", " 'mean': 0.18796029458853666,\n", " 'min': 0,\n", " 'sample_sd': 0.39071173874702697,\n", " 'sample_var': 0.15265566279472506,\n", " 'sum': 1174,\n", " 'sum_sq': 1174,\n", " 
'total_count': 6246,\n", " 'total_sd': 0.39068046053869543,\n", " 'total_var': 0.15263122224672718,\n", " 'valid_count': 6246,\n", " 'weight_sum': 6246,\n", " 'weighted_sum': 1174}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.aggregate_stats(label).getInfo()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Our training dataset has 6,246 observations, and the label distribution shows that about 19% (the mean value above) of the observations are classified as built-up (1), with the rest not (0).\n", "\n", "## Cross-validation\n", "\n", "Cross-validation is one of the most important aspects of machine learning development. In the last section we discussed several attributes of the training data that may impact classification:\n", "- varying spatial resolution and choices for the re-sample rate\n", "- choices about data cleaning\n", "- decisions about which bands to include\n", "\n", "We may also decide to create new features (known as feature engineering) by transforming our data sources mathematically (taking derivatives, extracting information about neighboring pixels, etc.) or even by fusing in additional data.\n", "\n", "The classification algorithm itself will have hyperparameters that may be adjusted (known as hyperparameter tuning).\n", "\n", "How do we decide these things?\n", "\n", "Often we will experiment empirically and see what works; advances in computing resources and machine learning packages make such experimentation much easier. However, if we just tweak our data until we get the best performance on our training data and leave it at that, we risk over-fitting our model.\n", "\n", "Over-fitting a model means making it too specific to the data on hand, in a way that will fail us on unseen data. This could impact the final analysis and, ultimately, stakeholder trust and their ability to make informed decisions, so we want to think about strategies to avoid it.\n", "\n", "That is our situation here: since we don't have any \"ground truth\" for our data after 2015 in terms of settlements, we will want to validate our classifier as best we can with the labeled data we have before \"releasing it to the wild.\"\n", "\n", "### Train/test split\n", "A key way to do that is to split our labeled data into two components: training and testing sets (or even train, validation, and test sets). There are many strategies for this, including K-fold sampling (or stratified K-fold sampling, which addresses the class imbalance issue we noted earlier). Things can get quite complex with time series or other sequential data: since observations in time depend on the observations before them, we cannot fairly split them at random.\n", "\n", "For our purposes, a simple 80/20 train/test split, taken randomly among the pixels in our 2015 training image, will be fine, but this is another great topic to learn more about.\n", "\n", "
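Such a split can be sketched with Earth Engine's `randomColumn` method (a sketch, assuming the sampled `data` FeatureCollection from above; the column name and seed are illustrative choices):\n", "\n", "```python\n", "# add a column of uniform-random values in [0, 1), seeded for reproducibility\n", "data = data.randomColumn(columnName='random', seed=0)\n", "# ~80% of points fall below 0.8 -> training set; the rest -> test set\n", "train = data.filter(ee.Filter.lt('random', 0.8))\n", "test = data.filter(ee.Filter.gte('random', 0.8))\n", "```\n", "\n", "Because the random column is seeded, the same split is reproduced on every run.\n", "\n", "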