This vignette gives an introduction to outlier detection with the pddcs
package.
Users should note that the methods described below are not meant to provide any statistical proof of what constitutes real outliers or incorrect values. They are only meant to give empirical guidance on which observations might warrant further inspection.
`detect_outliers()` can be used to detect outlier observations within each country in a single dataset. It does so by calculating the z-score and p-value for each observation. The function takes two main arguments: `df`, a data frame in `pddcs` format, and `alpha`, the significance level for a two-tailed test.
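Conceptually, the per-country test can be sketched as follows. Note that `detect_outliers_sketch()` below is a hypothetical base-R illustration of a two-tailed z-test, not the package's actual implementation, and the exact p-value convention is an assumption.

```r
# Minimal sketch of a per-country two-tailed z-score test.
# NOTE: this is a hypothetical illustration, not the internals of
# pddcs::detect_outliers().
detect_outliers_sketch <- function(values, alpha = 0.05) {
  z <- (values - mean(values, na.rm = TRUE)) / sd(values, na.rm = TRUE)
  p <- pnorm(abs(z))            # cumulative probability of the absolute z-score
  data.frame(
    value   = values,
    p_value = p,
    outlier = p > 1 - alpha / 2 # flag values in the two-tailed rejection region
  )
}

# A value far from the rest of its series is flagged
detect_outliers_sketch(c(rep(1, 10), 10))
```

With a short series like this, only the last value falls in the rejection region; the flat run of 1s stays well below the threshold.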
First we need to fetch the data.
# Fetch data
df <- fetch_indicator('SH.MED.PHYS.ZS', source = 'who')
head(df)
#> # A tibble: 6 × 6
#> iso3c year indicator value note source
#> <chr> <dbl> <chr> <dbl> <chr> <chr>
#> 1 AFG 2001 SH.MED.PHYS.ZS 0.190 Plausible bound is 1.899. who
#> 2 AFG 2006 SH.MED.PHYS.ZS 0.160 Plausible bound is 1.596. who
#> 3 AFG 2007 SH.MED.PHYS.ZS 0.174 Plausible bound is 1.743. who
#> 4 AFG 2008 SH.MED.PHYS.ZS 0.174 Plausible bound is 1.744. who
#> 5 AFG 2009 SH.MED.PHYS.ZS 0.213 Plausible bound is 2.126. who
#> 6 AFG 2010 SH.MED.PHYS.ZS 0.237 Plausible bound is 2.365. who
We can then run `detect_outliers()` with its default arguments.
# Detect outliers (alpha = 0.05)
df2 <- detect_outliers(df)
head(df2)
#> # A tibble: 6 × 8
#> iso3c year indicator value note source outlier p_value
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <lgl> <dbl>
#> 1 AFG 2001 SH.MED.PHYS.ZS 0.190 Plausible bound is 1.… who FALSE 0.848
#> 2 AFG 2006 SH.MED.PHYS.ZS 0.160 Plausible bound is 1.… who FALSE 0.955
#> 3 AFG 2007 SH.MED.PHYS.ZS 0.174 Plausible bound is 1.… who FALSE 0.914
#> 4 AFG 2008 SH.MED.PHYS.ZS 0.174 Plausible bound is 1.… who FALSE 0.914
#> 5 AFG 2009 SH.MED.PHYS.ZS 0.213 Plausible bound is 2.… who FALSE 0.702
#> 6 AFG 2010 SH.MED.PHYS.ZS 0.237 Plausible bound is 2.… who FALSE 0.503
The output of `detect_outliers()` is the same data frame we used as input, but with two additional columns: `outlier` is a boolean that indicates whether the z-score and corresponding p-value crossed the given significance threshold, while `p_value` is the p-value of the calculation.
To get an overview of the potential outliers we can simply filter by the `outlier` column.
df2[df2$outlier, ]
#> # A tibble: 84 × 8
#> iso3c year indicator value note source outlier p_value
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <lgl> <dbl>
#> 1 ALB 2018 SH.MED.PHYS.ZS 2.16 Plausible bound is 2… who TRUE 1.00
#> 2 ALB 2020 SH.MED.PHYS.ZS 1.88 Plausible bound is 1… who TRUE 0.983
#> 3 ARM 2017 SH.MED.PHYS.ZS 4.40 Plausible bound is 4… who TRUE 1.00
#> 4 AUS 2020 SH.MED.PHYS.ZS 4.13 Plausible bound is 4… who TRUE 0.987
#> 5 AZE 2019 SH.MED.PHYS.ZS 3.17 Plausible bound is 3… who TRUE 0.989
#> 6 BEL 2020 SH.MED.PHYS.ZS 6.08 Plausible bound is 6… who TRUE 1.00
#> 7 BGD 2020 SH.MED.PHYS.ZS 0.667 Plausible bound is 6… who TRUE 0.977
#> 8 BGR 2018 SH.MED.PHYS.ZS 4.21 Plausible bound is 4… who TRUE 0.984
#> 9 BRA 2019 SH.MED.PHYS.ZS 2.31 Plausible bound is 2… who TRUE 0.976
#> 10 CHN 2018 SH.MED.PHYS.ZS 2.10 Plausible bound is 2… who TRUE 0.980
#> # … with 74 more rows
In this case we have 84 outlier observations. If this seems like too many, we can adjust the significance level.
# Detect outliers (alpha = 0.01)
df3 <- detect_outliers(df, alpha = 0.01)
df3[df3$outlier, ]
#> # A tibble: 16 × 8
#> iso3c year indicator value note source outlier p_value
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <lgl> <dbl>
#> 1 ALB 2018 SH.MED.PHYS.ZS 2.16 Plausible bound is 2… who TRUE 1.00
#> 2 ARM 2017 SH.MED.PHYS.ZS 4.40 Plausible bound is 4… who TRUE 1.00
#> 3 BEL 2020 SH.MED.PHYS.ZS 6.08 Plausible bound is 6… who TRUE 1.00
#> 4 CYP 2019 SH.MED.PHYS.ZS 3.14 Plausible bound is 3… who TRUE 0.996
#> 5 ESP 1990 SH.MED.PHYS.ZS 2.03 Plausible bound is 2… who TRUE 0.999
#> 6 HUN 2020 SH.MED.PHYS.ZS 6.06 Plausible bound is 6… who TRUE 1.00
#> 7 ITA 1993 SH.MED.PHYS.ZS 5.48 Plausible bound is 5… who TRUE 0.997
#> 8 ITA 1994 SH.MED.PHYS.ZS 5.58 Plausible bound is 5… who TRUE 0.998
#> 9 JOR 1998 SH.MED.PHYS.ZS 0.989 Plausible bound is 9… who TRUE 0.999
#> 10 LTU 2020 SH.MED.PHYS.ZS 5.08 Plausible bound is 5… who TRUE 0.999
#> 11 NIC 2018 SH.MED.PHYS.ZS 1.66 Plausible bound is 1… who TRUE 0.998
#> 12 POL 2020 SH.MED.PHYS.ZS 3.77 Plausible bound is 3… who TRUE 1.00
#> 13 SRB 2016 SH.MED.PHYS.ZS 3.11 Plausible bound is 3… who TRUE 0.999
#> 14 SWE 2019 SH.MED.PHYS.ZS 7.09 Plausible bound is 7… who TRUE 1.00
#> 15 TKM 2002 SH.MED.PHYS.ZS 7.09 Plausible bound is 7… who TRUE 1.00
#> 16 TTO 2018 SH.MED.PHYS.ZS 5.41 Plausible bound is 5… who TRUE 0.996
With a lower significance level we only identify 16 outliers.
Which significance level you should use will depend on the indicator and data in question, as well as individual judgement on the risk of Type I and Type II errors.
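For intuition, assuming the underlying test is a two-tailed z-test against the standard normal, each `alpha` corresponds to a critical z-score an observation must exceed to be flagged:

```r
# Two-tailed critical z-values implied by each significance level
qnorm(1 - 0.05 / 2)  # ~1.96: alpha = 0.05 flags |z| > 1.96
qnorm(1 - 0.01 / 2)  # ~2.58: alpha = 0.01 flags |z| > 2.58
```

Lowering `alpha` therefore pushes the threshold further into the tails, which is why the stricter run above flags 16 observations instead of 84.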
Another option for outlier detection is to compare the new dataset from the source with the existing data in WDI. To do so, we first need to fetch data from WDI as well. We can do this with `compare_with_wdi()`.
# Fetch source data (SH.MED.NUMW.P3)
df <- fetch_indicator("SH.MED.NUMW.P3", "who")
# Compare with WDI
dl <- compare_with_wdi(df)
After fetching the data from both the source and WDI, we can compare the two with `compare_datasets()`. This function takes three main arguments: `new`, a `pddcs` formatted data frame from the source, `current`, a `pddcs` formatted data frame from DCS or WDI, and `alpha`, the significance level for a two-tailed test.
Here we use a value of 0.05 for `alpha` (the default), but which level you should use will be context-dependent.
# Compare new (source) and current (WDI) datasets
res <- compare_datasets(new = dl$source, current = dl$wdi, alpha = 0.05)
head(res)
#> # A tibble: 6 × 13
#> iso3c year indicator value note source current_value current_source diff
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl>
#> 1 AFG 2005 SH.MED.NUMW… 0.582 Plau… who 0.582 wdi 0
#> 2 AFG 2006 SH.MED.NUMW… 0.44 Plau… who 0.44 wdi 0
#> 3 AFG 2007 SH.MED.NUMW… 0.496 Plau… who 0.496 wdi 0
#> 4 AFG 2008 SH.MED.NUMW… 0.497 Plau… who 0.497 wdi 0
#> 5 AFG 2009 SH.MED.NUMW… 0.608 Plau… who 0.608 wdi 0
#> 6 AFG 2013 SH.MED.NUMW… 0.250 Plau… who 0.250 wdi 0
#> # … with 4 more variables: outlier <lgl>, p_value <dbl>, n_diff <int>,
#> # n_outlier <int>
The output of `compare_datasets()` adds seven additional columns to the source dataset. See the documentation (`?compare_datasets`) for details on each column.
Comparison is done by merging the two datasets (left join on `new`), calculating the absolute difference between the two `value` columns, and then running outlier detection on the `diff` column. You should look for both large differences in values (`diff`) and large p-values (`p_value`) to identify outliers or other possible unwanted changes in the data.
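The merge-and-diff step can be sketched in base R as follows. `compare_sketch()` is a hypothetical illustration of the logic just described, not the package's actual implementation.

```r
# Hypothetical sketch of the comparison logic: left join on the new data,
# then the absolute difference between the two value columns.
compare_sketch <- function(new, current) {
  m <- merge(
    new,
    setNames(current[, c("iso3c", "year", "value")],
             c("iso3c", "year", "current_value")),
    by = c("iso3c", "year"), all.x = TRUE
  )
  m$diff <- abs(m$value - m$current_value)
  m  # outlier detection would then be run on the `diff` column
}

# Toy data: one year is missing from `current`, so its diff is NA
new     <- data.frame(iso3c = "AUS", year = 1996:1998, value = c(0.6, 10.2, 10.3))
current <- data.frame(iso3c = "AUS", year = 1996:1997, value = c(10.4, 10.2))
compare_sketch(new, current)
```

Because the join keeps all rows from `new`, observations with no counterpart in the current dataset get an `NA` difference rather than being dropped.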
In the case where a few values for a specific country are substantially different from the current dataset in WDI, they will be identified as outliers with large p-values. On the other hand, it might be the case that most or all values for a specific country have changed. In that case there are unlikely to be any outliers, but the changes can be found by inspecting the `diff` and `n_diff` columns.
A few examples are given below.
For Australia four values are different from the observations in WDI, but only two of the differences are identified as outliers.
cols <- c("iso3c", "year", "value", "current_value", "diff", "outlier", "n_diff", "n_outlier")
res[res$iso3c == "AUS", ][cols]
#> # A tibble: 21 × 8
#> iso3c year value current_value diff outlier n_diff n_outlier
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 AUS 1996 0.599 10.4 9.76 TRUE 4 2
#> 2 AUS 2000 10.1 NA NA FALSE 4 2
#> 3 AUS 2001 10.6 9.75 0.847 FALSE 4 2
#> 4 AUS 2002 9.99 NA NA FALSE 4 2
#> 5 AUS 2003 9.99 NA NA FALSE 4 2
#> 6 AUS 2004 10.2 NA NA FALSE 4 2
#> 7 AUS 2005 9.76 9.76 0 FALSE 4 2
#> 8 AUS 2006 0.596 10.8 10.2 TRUE 4 2
#> 9 AUS 2007 10.2 10.2 0 FALSE 4 2
#> 10 AUS 2008 10.3 10.3 0 FALSE 4 2
#> # … with 11 more rows
res[res$iso3c == "AUS" & res$diff > 0 & !is.na(res$diff), ][cols]
#> # A tibble: 4 × 8
#> iso3c year value current_value diff outlier n_diff n_outlier
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 AUS 1996 0.599 10.4 9.76 TRUE 4 2
#> 2 AUS 2001 10.6 9.75 0.847 FALSE 4 2
#> 3 AUS 2006 0.596 10.8 10.2 TRUE 4 2
#> 4 AUS 2017 14.5 12.6 1.99 FALSE 4 2
For Belize there are no outlier differences, but 5 out of 9 observations are completely different from those in WDI.
res[res$iso3c == "BLZ", ][cols]
#> # A tibble: 9 × 8
#> iso3c year value current_value diff outlier n_diff n_outlier
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 BLZ 2000 1.23 1.23 0 FALSE 5 0
#> 2 BLZ 2009 1.81 1.81 0 FALSE 5 0
#> 3 BLZ 2012 1.10 1.10 0 FALSE 5 0
#> 4 BLZ 2013 1.96 157. 155. FALSE 5 0
#> 5 BLZ 2014 2.02 177. 175. FALSE 5 0
#> 6 BLZ 2015 2.09 174. 172. FALSE 5 0
#> 7 BLZ 2016 2.14 183. 181. FALSE 5 0
#> 8 BLZ 2017 2.39 185. 183. FALSE 5 0
#> 9 BLZ 2018 2.34 2.34 0 FALSE 5 0
Another interesting case is Israel. There are no outliers, but there are minor differences between WDI and WHO for 18 of 25 observations. Since all the differences are quite small, everything might be okay. Yet it could still be worth investigating why the values for so many years have changed.
res[res$iso3c == "ISR", ][cols]
#> # A tibble: 25 × 8
#> iso3c year value current_value diff outlier n_diff n_outlier
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 ISR 1995 8.23 8.23 0 FALSE 18 0
#> 2 ISR 1996 5.56 5.56 0 FALSE 18 0
#> 3 ISR 1997 5.50 5.50 0 FALSE 18 0
#> 4 ISR 1998 5.60 5.60 0 FALSE 18 0
#> 5 ISR 1999 5.71 5.71 0 FALSE 18 0
#> 6 ISR 2000 7.76 6.02 1.73 FALSE 18 0
#> 7 ISR 2001 7.79 6.12 1.68 FALSE 18 0
#> 8 ISR 2002 7.68 6.05 1.63 FALSE 18 0
#> 9 ISR 2003 7.34 5.92 1.42 FALSE 18 0
#> 10 ISR 2004 7.10 5.78 1.32 FALSE 18 0
#> # … with 15 more rows