
6. Data fusion: Sentinel-2, VIIRS-DNB, GHSL

Note that until this point we have been looking at the entire country of Nepal, but for simplicity's sake and due to computational constraints with the Google Earth Engine platform (and local internet speeds), from this point on we will focus our analysis on the Province of Bagmati, which contains the capital city, Kathmandu.

Now that we are familiar with the underlying data, we will integrate the feature sources (Sentinel-2 and VIIRS-DNB) and add the GHSL data as layers.

Let’s again orient ourselves to the underlying data layers, including the masks we had created earlier.

# reminder that if you are installing libraries in a Google Colab instance you will be prompted to restart your kernel

try:
    import geemap, ee
except ModuleNotFoundError:
    if 'google.colab' in str(get_ipython()):
        print("package not found, installing w/ pip in Google Colab...")
        !pip install geemap
    else:
        print("package not found, installing w/ conda...")
        !conda install mamba -c conda-forge -y
        !mamba install geemap -c conda-forge -y
    import geemap, ee
try:
    ee.Initialize()
except Exception as e:
    ee.Authenticate()
    ee.Initialize()

# our Region Of Interest is the Province of Bagmati
roi = ee.FeatureCollection("FAO/GAUL/2015/level2").filter(ee.Filter.eq('ADM2_NAME','Bagmati')).geometry()

def se2mask(image):
    # select the QA60 band, which encodes cloud information
    quality_band = image.select('QA60')
    # bits 10 and 11 flag opaque clouds and cirrus clouds, respectively
    cloudmask = 1 << 10
    cirrusmask = 1 << 11
    # keep pixels only where both flags are clear; note that we must use
    # .And() rather than Python's `and` for Earth Engine objects
    mask = quality_band.bitwiseAnd(cloudmask).eq(0).And(quality_band.bitwiseAnd(cirrusmask).eq(0))
    # apply the mask and rescale the reflectance values
    return image.updateMask(mask).divide(10000)
    
se2 = ee.ImageCollection('COPERNICUS/S2').filterDate(
    "2019-01-01","2019-12-31").filterBounds(roi).filter(
    ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE",20)).map(se2mask).median().clip(roi)

viirs = ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG").filterDate(
    "2019-01-01","2019-12-31").filterBounds(roi).median().select('avg_rad').clip(roi)

ghsl = ee.ImageCollection('JRC/GHSL/P2016/SMOD_POP_GLOBE_V1').filter(
    ee.Filter.date('2015-01-01', '2015-12-31')).select('smod_code').median().clip(roi)

# binarize the "degree of urbanization": low- and high-density clusters
# (smod_code >= 2) become our built-up class (1), everything else 0
ghsl = ghsl.gte(2)

ghslVis = {"palette": ['000000','ffffff']}
se2Vis = {"min": 0.0, "max": 0.3, "bands": ['B4','B3','B2']}

# initialize our map
map1 = geemap.Map()
map1.centerObject(roi, 9)
map1.addLayer(se2, se2Vis, "S2")
map1.addLayer(viirs, {}, "VIIRS-DNB", opacity=0.5)
map1.addLayer(ghsl, ghslVis, "GHSL", opacity=0.25)
map1.addLayerControl()
map1

6.1. Data exploration

Data exploration is a key part of any analysis. We won’t dedicate too much time here, but you would typically want to look at changes in these underlying datasets to spot possible biases or inconsistencies (such as unexpected spikes or drops in the data over time, or interesting spatial distributions). For now, we’ll just take a sneak peek at the VIIRS data to compare late 2015 with 2019.

Note that we only look at the latter half of the year, because we do not have Sentinel-2 Level-1C data prior to July 2015 and we want to compare the same months of the year in 2015 as in 2019 for both data sources.

6.1.1. VIIRS-DNB

Warning: this is based on `ipyleaflet`, a Python library that does not play well with Google Colab, so the split-panel display will not work in the Google Colab environment but should on your local machine.

viirs2015 = ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG").filterDate(
    "2015-07-01","2015-12-31").filterBounds(roi).median().select('avg_rad').clip(roi)
viirs2019 = ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG").filterDate(
    "2019-07-01","2019-12-31").filterBounds(roi).median().select('avg_rad').clip(roi)

viirs_15_tile = geemap.ee_tile_layer(viirs2015, {}, 'Jul-Dec 2015', opacity=0.75)
viirs_19_tile = geemap.ee_tile_layer(viirs2019, {}, 'Jul-Dec 2019', opacity=0.75)

# initialize our map
map2 = geemap.Map()
map2.centerObject(roi, 9)
map2.split_map(left_layer=viirs_15_tile, right_layer=viirs_19_tile)
map2.addLayerControl()
map2

There may be some cleaning issues to be aware of in terms of background noise, but it is clear that there are structural changes in nighttime lights that appear consistent with human settlement growth, particularly spreading out from Kathmandu along the road network.

Let’s try cleaning the data a bit (as we did in Cell statistics and basic band math (10 min)) to get a clearer signal-to-noise ratio. We’ll subtract the mean and divide by the standard deviation (also known as “standardizing” or “scaling” the data).

Warning: this is based on `ipyleaflet`, a Python library that does not play well with Google Colab, so the split-panel display will not work in the Google Colab environment but should on your local machine.

# compute each year's mean and standard deviation
# (the clipped image footprint serves as the default region)
mu15 = viirs2015.reduceRegion(reducer=ee.Reducer.mean(),scale=500)
std15 = viirs2015.reduceRegion(reducer=ee.Reducer.stdDev(),scale=500)
mu19 = viirs2019.reduceRegion(reducer=ee.Reducer.mean(),scale=500)
std19 = viirs2019.reduceRegion(reducer=ee.Reducer.stdDev(),scale=500)

# we'll cast these to native ee Numbers using the ee.Number constructor
mu15 = ee.Number(mu15.get('avg_rad'))
std15 = ee.Number(std15.get('avg_rad'))
mu19 = ee.Number(mu19.get('avg_rad'))
std19 = ee.Number(std19.get('avg_rad'))

# standardize each year's image by its own mean and standard deviation
viirs2015 = viirs2015.subtract(mu15).divide(std15)
viirs2019 = viirs2019.subtract(mu19).divide(std19)

viirs_15_tile = geemap.ee_tile_layer(viirs2015, {}, 'Jul-Dec 2015', opacity=0.75)
viirs_19_tile = geemap.ee_tile_layer(viirs2019, {}, 'Jul-Dec 2019', opacity=0.75)

# initialize our map
map3 = geemap.Map()
map3.centerObject(roi, 9)
map3.split_map(left_layer=viirs_15_tile, right_layer=viirs_19_tile)
map3.addLayerControl()
map3

This looks like a much clearer signal in terms of human activity and settlement areas. But we lost some information, which is potentially important if we want to pick up dimly lit areas that are transitioning from rural to low density.

We also see a curious patch of lights in 2015 to the east of Kathmandu that does not appear in 2019…worth investigating? Perhaps on your own, see if you can identify what’s happening there and whether it will affect your analysis.

This is a very important aspect of data exploration: finding ways we may need to adjust the data. If our classifier fails to perform well, we may need to experiment with cleaning the data like this prior to training.

Sometimes machine learning algorithms that are too sensitive will “learn” noise too well and then perform badly on novel data, so cleaning data, which is by definition removing information, may help your classifier be more robust.

6.1.1.1. Data bias

Data bias is a big topic for exploration on its own. Here we will just flag that it is critical to understand and a massively underestimated factor for folks new to machine learning (and veterans!).

For example, you may have read in the GHSL methodology that some of the built-up information for that layer is derived from remote sensing, including Sentinel-2 itself. This presents a possible bias: if the “ground truth” of human settlements comes from Sentinel-2 data, it is very likely that a classifier also using Sentinel-2 data will find the same patterns. If these patterns are in error relative to the real ground truth (perhaps GHSL says something is high or low density when it is not, or vice versa), then any classifier using Sentinel-2 data is likely to make the same mistake.

In other words, your classifier might “perform well” on your labeled data…but what if the labels themselves are off in the same way your input data is?

There is a lot more that goes into GHSL, so we can move forward with our exploratory analysis for learning purposes and ignore this possible bias, but always be thinking about the origins of your data.

This is also a good reminder of why it is helpful to integrate multiple independent data sources (such as Sentinel-2 MSI daytime imagery) into your analysis. Sentinel-2 may share a bias with GHSL and be less sensitive to time dynamics, so we balance it with nighttime lights. But it also has a higher spatial signal-to-noise ratio and more spectral information, so it counters the noise we see in nighttime lights alone.

6.2. Merging different data sources

We are ultimately looking to classify land cover using both VIIRS-DNB and Sentinel-2, so we should merge them into a single file to make it easier to pass to our classifier.

6.2.1. Spatial resolution

These are two different sources with two different sets of attributes. In particular, VIIRS-DNB has a spatial resolution of about 750 meters, whereas the Sentinel-2 MSI Level-1C data product we’re using has a resolution that varies across bands, from as fine as 10 meters (including the visible light bands B2, B3, B4) up to 60 meters. And our GHSL label has a resolution of 1 km.

Our best approach with Google Earth Engine is to re-sample our data. We can:

  • downsample our higher-resolution data

  • disaggregate our lower-resolution data to a smaller pixel size. Note! This will not create a truly higher-resolution image. We cannot create visual information that is not present, of course (unless we use some fancy deep learning like Generative Adversarial Networks to do it artificially, but that’s another topic…). This just disaggregates our “larger” pixels into proportionally-valued smaller pixels.

  • re-sample all images to some separate resolution (say…500 meters)

You may want to err on the side of lower resolution if you’re concerned with creating a robust classifier that can handle new data with versatility (i.e. not over-fit). Just as with the data cleaning above, you’ll give it less complex information to train on in order to make better predictions later.

We have decided today to re-sample our data to 1000 meter resolution.
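
To make this concrete, here is a minimal sketch (our own, not this notebook's exact method) of explicit re-sampling in Earth Engine, assuming the `se2` image from above. `reduceResolution` aggregates fine pixels into coarse ones and `reproject` forces a nominal output scale; in the rest of this exercise we get the same effect implicitly by passing `scale=1000` to the sampling calls.

# a minimal sketch: aggregate the 10m Sentinel-2 red band
# up to a nominal 1000m grid by averaging the input pixels
se2_red_1km = (se2.select('B4')
               .reduceResolution(reducer=ee.Reducer.mean(), bestEffort=True)
               .reproject(crs='EPSG:4326', scale=1000))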

6.2.2. Temporal resolution

We also have to think about time. Our GHSL data is a rather static dataset generally time-stamped 2015 (although the underlying data is somewhat varied).

We will create a dataset that is a single representation (in time) of all our data in 2015, i.e. a single composite image. Since we do have time-dynamic data from Sentinel-2 and VIIRS-DNB, we will reduce these data over the months available to us in 2015: July through December. Recall that we do not have Sentinel-2 Level-1C data prior to July 2015.

For our inference dataset (2016 to 2019) we will predict built-up land for each year, but we could do this per month if we wanted, given the VIIRS-DNB and Sentinel-2 frequency. That would give us a denser time series, but it is more work to calculate.
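
As a preview of that workflow, here is a hedged sketch of how one annual inference composite might be assembled, reusing the `se2mask` function and `roi` from above (the helper name `annual_features` and the inline band list are our own for illustration; we build this properly in the next exercise).

# hypothetical helper: one fused feature image per inference year
def annual_features(year):
    start, end = f"{year}-01-01", f"{year}-12-31"
    se2_yr = (ee.ImageCollection('COPERNICUS/S2')
              .filterDate(start, end).filterBounds(roi)
              .filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", 20))
              .map(se2mask).median().clip(roi)
              .select(['B2','B3','B4','B5','B6','B7','B8','B8A']))
    viirs_yr = (ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG")
                .filterDate(start, end).filterBounds(roi)
                .median().select('avg_rad').clip(roi))
    return se2_yr.addBands(viirs_yr)

# one composite per year, 2016 through 2019
inference_images = {year: annual_features(year) for year in range(2016, 2020)}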

6.3. Creating our “fused” training dataset

Let’s first combine our VIIRS-DNB radiance band with our selected Sentinel-2 bands into a single Image object. For the Sentinel-2 bands, we’ll choose the visible and near-infrared bands.

se2bands = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7','B8','B8A']

se2_2015 = ee.ImageCollection('COPERNICUS/S2').filterDate(
    "2015-07-01","2015-12-31").filterBounds(roi).filter(
    ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE",20)).map(se2mask).median().clip(roi).select(se2bands)

viirs2015 = ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG").filterDate(
    "2015-07-01","2015-12-31").filterBounds(roi).median().select('avg_rad').clip(roi)

fused = se2_2015.addBands(viirs2015)
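
As a quick sanity check (our own addition), we can list the band names of the fused image; this should return the eight Sentinel-2 bands plus `avg_rad`.

fused.bandNames().getInfo()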

For our GHSL, we’ll want to extract points corresponding to our “built-up” class that we can overlay on our feature data.

We’re choosing 1000 meters as our overall scaling factor for all data, a sample rate that matches the GHSL’s native resolution.

ghslpoints = ghsl.sample(**{"region":roi, "scale":1000,"seed":0,'geometries':True})
ghslpoints.size().getInfo()
10749
ghslpoints.first().getInfo()
{'type': 'Feature',
 'geometry': {'geodesic': False,
  'type': 'Point',
  'coordinates': [85.67682022289937, 27.331242519336442]},
 'id': '0',
 'properties': {'smod_code': 0}}

Note the “property” field name from GHSL: “smod_code”. We’ll use this to assign our training data labels (1 for built-up, 0 for not built-up, after we binarized the “degree of urbanization” values).

Now we overlay these labels on our fused training image:

# get list of all bands we're using in Sentinel-2 and VIIRS-DNB
trainingbands = se2bands + ['avg_rad']

training = fused.select(trainingbands).sampleRegions(collection=ghslpoints,
                                                     properties=['smod_code'],
                                                     scale=1000)

Let’s glance at the underlying data…a view of our first record or observation (i.e. pixel):

training.first().getInfo()
{'type': 'Feature',
 'geometry': None,
 'id': '0_0',
 'properties': {'B2': 0.08789999783039093,
  'B3': 0.0714000016450882,
  'B4': 0.052299998700618744,
  'B5': 0.0754999965429306,
  'B6': 0.147599995136261,
  'B7': 0.17309999465942383,
  'B8': 0.17339999973773956,
  'B8A': 0.19089999794960022,
  'avg_rad': 0.1543983817100525,
  'smod_code': 0}}

Stats for `smod_code`, our binary label for built-up (1) or not (0):

training.aggregate_stats('smod_code').getInfo()
{'max': 1,
 'mean': 0.18796029458853666,
 'min': 0,
 'sample_sd': 0.39071173874702697,
 'sample_var': 0.15265566279472506,
 'sum': 1174,
 'sum_sq': 1174,
 'total_count': 6246,
 'total_sd': 0.39068046053869543,
 'total_var': 0.15263122224672718,
 'valid_count': 6246,
 'weight_sum': 6246,
 'weighted_sum': 1174}

Imbalanced classes: we notice the mean value of our label is roughly 0.19, which indicates a fairly high ratio of non-built-up land (0) relative to built-up land (1). While this is expected in such a real-world dataset, class imbalance can be a challenge for classifiers.

There are strategies to address this if our classifier performs poorly, but we’re just flagging this for now.
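
For instance, one common strategy is to undersample the majority class. Here is a minimal sketch, assuming the `training` collection above; the 0.23 threshold comes from the roughly 1,174 : 5,072 class ratio in the stats, and the column name `rnd` is our own.

# hypothetical sketch: rebalance by undersampling the majority (non-built-up) class
built = training.filter(ee.Filter.eq('smod_code', 1))
notbuilt = training.filter(ee.Filter.eq('smod_code', 0)).randomColumn('rnd', 0)

# keep ~23% of the non-built-up points so both classes have roughly 1,174 each
balanced = built.merge(notbuilt.filter(ee.Filter.lt('rnd', 0.23)))
balanced.aggregate_stats('smod_code').getInfo()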

If you see a theme here in how taking time to investigate your data can give you clues to improve your machine learning model later...you're catching on to a key aspect of data science in any data domain!

Our classifier data can now be described as follows.

  • Training:

    • Feature 1: VIIRS-DNB (“avg_rad” band, monthly median, July to Dec 2015)

    • Feature 2: Sentinel-2 (select optical bands, monthly median, July to Dec 2015)

    • Label: GHSL Settlement Grid (“low” and “high” density clusters classified as built-up, 2015)

And our inference data, which we will create in the next exercise, will consist of annual composites from which we will predict “built-up” areas for each year from 2016 to 2019.

  • Inference:

    • Feature 1: VIIRS-DNB (“avg_rad”, monthly median for each year 2016-2019)

    • Feature 2: Sentinel-2 (select optical bands, monthly median for each year 2016-2019)

    • Labels: unknown!

All our processed data is re-sampled to 1000 meter resolution.

While we are running inference on annual composites, you can certainly do this per month.
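
For reference, here is a hedged sketch of what a monthly VIIRS series could look like for a single year, assuming the `roi` defined earlier (the helper name `monthly_viirs` is our own).

# hypothetical sketch: monthly VIIRS composites for one year (2016)
months = ee.List.sequence(1, 12)

def monthly_viirs(m):
    start = ee.Date.fromYMD(2016, ee.Number(m), 1)
    return (ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG")
            .filterDate(start, start.advance(1, 'month'))
            .filterBounds(roi)
            .median().select('avg_rad').clip(roi)
            .set('month', m))

viirs_monthly_2016 = ee.ImageCollection(months.map(monthly_viirs))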