This vignette gives an introduction to outlier detection with the pddcs
package.
Users should note that the methods described below are not meant to provide any statistical proof of what constitutes real outliers or incorrect values. They are only meant to give empirical guidance on which observations might warrant further inspection.
`detect_outliers()` can be used to detect outlier observations within each country in a single dataset. It does so by calculating the z-score and p-value for each observation. The function takes two main arguments: `df`, a data frame in `pddcs` format, and `alpha`, the significance level for a two-tailed test.
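Conceptually, the per-country test can be sketched as follows. Note that `detect_outliers_sketch()` below is a hypothetical base-R illustration of a two-tailed z-test, not the package's actual implementation, and the exact p-value convention is an assumption.

```r
# Minimal sketch of a per-country two-tailed z-score test.
# NOTE: this is a hypothetical illustration, not the internals of
# pddcs::detect_outliers().
detect_outliers_sketch <- function(values, alpha = 0.05) {
  z <- (values - mean(values, na.rm = TRUE)) / sd(values, na.rm = TRUE)
  p <- pnorm(abs(z))            # cumulative probability of the absolute z-score
  data.frame(
    value   = values,
    p_value = p,
    outlier = p > 1 - alpha / 2 # flag values in the two-tailed rejection region
  )
}

# A value far from the rest of its series is flagged
detect_outliers_sketch(c(rep(1, 10), 10))
```

With a short series like this, only the last value falls in the rejection region; the flat run of 1s stays well below the threshold.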
First we need to fetch the data.
# Fetch data
df <- fetch_indicator('SH.MED.PHYS.ZS', source = 'who')
head(df)
#> # A tibble: 6 × 6
#> iso3c year indicator value note source
#> <chr> <dbl> <chr> <dbl> <chr> <chr>
#> 1 AFG 2001 SH.MED.PHYS.ZS 0.190 Plausible bound is 1.899. who
#> 2 AFG 2006 SH.MED.PHYS.ZS 0.160 Plausible bound is 1.596. who
#> 3 AFG 2007 SH.MED.PHYS.ZS 0.174 Plausible bound is 1.743. who
#> 4 AFG 2008 SH.MED.PHYS.ZS 0.174 Plausible bound is 1.744. who
#> 5 AFG 2009 SH.MED.PHYS.ZS 0.213 Plausible bound is 2.126. who
#> 6 AFG 2010 SH.MED.PHYS.ZS 0.237 Plausible bound is 2.365. who
We can then run `detect_outliers()` with its default arguments.
# Detect outliers (alpha = 0.05)
df2 <- detect_outliers(df)
head(df2)
#> # A tibble: 6 × 8
#> iso3c year indicator value note source outlier p_value
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <lgl> <dbl>
#> 1 AFG 2001 SH.MED.PHYS.ZS 0.190 Plausible bound is 1.… who FALSE 0.848
#> 2 AFG 2006 SH.MED.PHYS.ZS 0.160 Plausible bound is 1.… who FALSE 0.955
#> 3 AFG 2007 SH.MED.PHYS.ZS 0.174 Plausible bound is 1.… who FALSE 0.914
#> 4 AFG 2008 SH.MED.PHYS.ZS 0.174 Plausible bound is 1.… who FALSE 0.914
#> 5 AFG 2009 SH.MED.PHYS.ZS 0.213 Plausible bound is 2.… who FALSE 0.702
#> 6 AFG 2010 SH.MED.PHYS.ZS 0.237 Plausible bound is 2.… who FALSE 0.503
The output of `detect_outliers()` is the same data frame we used as input, but with two additional columns: `outlier` is a boolean that indicates whether the z-score and corresponding p-value crossed the given significance threshold, while `p_value` is the p-value of the calculation.
To get an overview of the potential outliers we can simply filter by the `outlier` column.
df2[df2$outlier, ]
#> # A tibble: 84 × 8
#> iso3c year indicator value note source outlier p_value
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <lgl> <dbl>
#> 1 ALB 2018 SH.MED.PHYS.ZS 2.16 Plausible bound is 2… who TRUE 1.00
#> 2 ALB 2020 SH.MED.PHYS.ZS 1.88 Plausible bound is 1… who TRUE 0.983
#> 3 ARM 2017 SH.MED.PHYS.ZS 4.40 Plausible bound is 4… who TRUE 1.00
#> 4 AUS 2020 SH.MED.PHYS.ZS 4.13 Plausible bound is 4… who TRUE 0.987
#> 5 AZE 2019 SH.MED.PHYS.ZS 3.17 Plausible bound is 3… who TRUE 0.989
#> 6 BEL 2020 SH.MED.PHYS.ZS 6.08 Plausible bound is 6… who TRUE 1.00
#> 7 BGD 2020 SH.MED.PHYS.ZS 0.667 Plausible bound is 6… who TRUE 0.977
#> 8 BGR 2018 SH.MED.PHYS.ZS 4.21 Plausible bound is 4… who TRUE 0.984
#> 9 BRA 2019 SH.MED.PHYS.ZS 2.31 Plausible bound is 2… who TRUE 0.976
#> 10 CHN 2018 SH.MED.PHYS.ZS 2.10 Plausible bound is 2… who TRUE 0.980
#> # … with 74 more rows
In this case we have 84 outlier observations. If this seems like too many, we can adjust the significance level.
# Detect outliers (alpha = 0.01)
df3 <- detect_outliers(df, alpha = 0.01)
df3[df3$outlier, ]
#> # A tibble: 16 × 8
#> iso3c year indicator value note source outlier p_value
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <lgl> <dbl>
#> 1 ALB 2018 SH.MED.PHYS.ZS 2.16 Plausible bound is 2… who TRUE 1.00
#> 2 ARM 2017 SH.MED.PHYS.ZS 4.40 Plausible bound is 4… who TRUE 1.00
#> 3 BEL 2020 SH.MED.PHYS.ZS 6.08 Plausible bound is 6… who TRUE 1.00
#> 4 CYP 2019 SH.MED.PHYS.ZS 3.14 Plausible bound is 3… who TRUE 0.996
#> 5 ESP 1990 SH.MED.PHYS.ZS 2.03 Plausible bound is 2… who TRUE 0.999
#> 6 HUN 2020 SH.MED.PHYS.ZS 6.06 Plausible bound is 6… who TRUE 1.00
#> 7 ITA 1993 SH.MED.PHYS.ZS 5.48 Plausible bound is 5… who TRUE 0.997
#> 8 ITA 1994 SH.MED.PHYS.ZS 5.58 Plausible bound is 5… who TRUE 0.998
#> 9 JOR 1998 SH.MED.PHYS.ZS 0.989 Plausible bound is 9… who TRUE 0.999
#> 10 LTU 2020 SH.MED.PHYS.ZS 5.08 Plausible bound is 5… who TRUE 0.999
#> 11 NIC 2018 SH.MED.PHYS.ZS 1.66 Plausible bound is 1… who TRUE 0.998
#> 12 POL 2020 SH.MED.PHYS.ZS 3.77 Plausible bound is 3… who TRUE 1.00
#> 13 SRB 2016 SH.MED.PHYS.ZS 3.11 Plausible bound is 3… who TRUE 0.999
#> 14 SWE 2019 SH.MED.PHYS.ZS 7.09 Plausible bound is 7… who TRUE 1.00
#> 15 TKM 2002 SH.MED.PHYS.ZS 7.09 Plausible bound is 7… who TRUE 1.00
#> 16 TTO 2018 SH.MED.PHYS.ZS 5.41 Plausible bound is 5… who TRUE 0.996
With a lower significance level we only identify 16 outliers.
Which significance level you should use will depend on the indicator and data in question, as well as individual judgement on the risk of Type I and Type II errors.
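For intuition, assuming the underlying test is a two-tailed z-test against the standard normal, each `alpha` corresponds to a critical z-score an observation must exceed to be flagged:

```r
# Two-tailed critical z-values implied by each significance level
qnorm(1 - 0.05 / 2)  # ~1.96: alpha = 0.05 flags |z| > 1.96
qnorm(1 - 0.01 / 2)  # ~2.58: alpha = 0.01 flags |z| > 2.58
```

Lowering `alpha` therefore pushes the threshold further into the tails, which is why the stricter run above flags 16 observations instead of 84.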
Another option for outlier detection is to compare the new dataset from the source with the existing data in WDI. To do so, we first need to fetch data from WDI as well. We can do this with `compare_with_wdi()`.
# Fetch source data (SH.MED.NUMW.P3)
df <- fetch_indicator("SH.MED.NUMW.P3", "who")
# Compare with WDI
dl <- compare_with_wdi(df)
After fetching the data from both the source and WDI, we can compare the two with `compare_datasets()`. This function takes three main arguments: `new`, a `pddcs` formatted data frame from the source, `current`, a `pddcs` formatted data frame from DCS or WDI, and `alpha`, the significance level for a two-tailed test.
Here we use a value of 0.05 for `alpha` (the default), but which level you should use will be context-dependent.
# Compare new (source) and current (WDI) datasets
res <- compare_datasets(new = dl$source, current = dl$wdi, alpha = 0.05)
head(res)
#> # A tibble: 6 × 13
#> iso3c year indicator value note source current_value current_source diff
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl>
#> 1 AFG 2005 SH.MED.NUMW… 0.582 Plau… who 0.582 wdi 0
#> 2 AFG 2006 SH.MED.NUMW… 0.44 Plau… who 0.44 wdi 0
#> 3 AFG 2007 SH.MED.NUMW… 0.496 Plau… who 0.496 wdi 0
#> 4 AFG 2008 SH.MED.NUMW… 0.497 Plau… who 0.497 wdi 0
#> 5 AFG 2009 SH.MED.NUMW… 0.608 Plau… who 0.608 wdi 0
#> 6 AFG 2013 SH.MED.NUMW… 0.250 Plau… who 0.250 wdi 0
#> # … with 4 more variables: outlier <lgl>, p_value <dbl>, n_diff <int>,
#> # n_outlier <int>
The output of `compare_datasets()` adds seven additional columns to the source dataset. See the documentation (`?compare_datasets`) for details on each column.
Comparison is done by merging the two datasets (left join on `new`), calculating the absolute difference between the two `value` columns, and then running outlier detection on the `diff` column. You should look for both large differences in values (`diff`) and large p-values (`p_value`) to identify outliers or other possible unwanted changes in the data.
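The merge-and-diff step can be sketched in base R as follows. `compare_sketch()` is a hypothetical illustration of the logic just described, not the package's actual implementation.

```r
# Hypothetical sketch of the comparison logic: left join on the new data,
# then the absolute difference between the two value columns.
compare_sketch <- function(new, current) {
  m <- merge(
    new,
    setNames(current[, c("iso3c", "year", "value")],
             c("iso3c", "year", "current_value")),
    by = c("iso3c", "year"), all.x = TRUE
  )
  m$diff <- abs(m$value - m$current_value)
  m  # outlier detection would then be run on the `diff` column
}

# Toy data: one year is missing from `current`, so its diff is NA
new     <- data.frame(iso3c = "AUS", year = 1996:1998, value = c(0.6, 10.2, 10.3))
current <- data.frame(iso3c = "AUS", year = 1996:1997, value = c(10.4, 10.2))
compare_sketch(new, current)
```

Because the join keeps all rows from `new`, observations with no counterpart in the current dataset get an `NA` difference rather than being dropped.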
In the case where a few values for a specific country are substantially different from the current dataset in WDI, they will be identified as outliers with large p-values. On the other hand, it might be the case that most or all values for a specific country have changed. In that case there are unlikely to be any outliers, but the changes can be found by inspecting the `diff` and `n_diff` columns.
A few examples are given below.
For Australia four values are different from the observations in WDI, but only two of the differences are identified as outliers.
cols <- c("iso3c", "year", "value", "current_value", "diff", "outlier", "n_diff", "n_outlier")
res[res$iso3c == "AUS", ][cols]
#> # A tibble: 21 × 8
#> iso3c year value current_value diff outlier n_diff n_outlier
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 AUS 1996 0.599 10.4 9.76 TRUE 4 2
#> 2 AUS 2000 10.1 NA NA FALSE 4 2
#> 3 AUS 2001 10.6 9.75 0.847 FALSE 4 2
#> 4 AUS 2002 9.99 NA NA FALSE 4 2
#> 5 AUS 2003 9.99 NA NA FALSE 4 2
#> 6 AUS 2004 10.2 NA NA FALSE 4 2
#> 7 AUS 2005 9.76 9.76 0 FALSE 4 2
#> 8 AUS 2006 0.596 10.8 10.2 TRUE 4 2
#> 9 AUS 2007 10.2 10.2 0 FALSE 4 2
#> 10 AUS 2008 10.3 10.3 0 FALSE 4 2
#> # … with 11 more rows
res[res$iso3c == "AUS" & res$diff > 0 & !is.na(res$diff), ][cols]
#> # A tibble: 4 × 8
#> iso3c year value current_value diff outlier n_diff n_outlier
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 AUS 1996 0.599 10.4 9.76 TRUE 4 2
#> 2 AUS 2001 10.6 9.75 0.847 FALSE 4 2
#> 3 AUS 2006 0.596 10.8 10.2 TRUE 4 2
#> 4 AUS 2017 14.5 12.6 1.99 FALSE 4 2
For Belize there are no outlier differences, but 5 out of 9 observations are completely different from those in WDI.
res[res$iso3c == "BLZ", ][cols]
#> # A tibble: 9 × 8
#> iso3c year value current_value diff outlier n_diff n_outlier
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 BLZ 2000 1.23 1.23 0 FALSE 5 0
#> 2 BLZ 2009 1.81 1.81 0 FALSE 5 0
#> 3 BLZ 2012 1.10 1.10 0 FALSE 5 0
#> 4 BLZ 2013 1.96 157. 155. FALSE 5 0
#> 5 BLZ 2014 2.02 177. 175. FALSE 5 0
#> 6 BLZ 2015 2.09 174. 172. FALSE 5 0
#> 7 BLZ 2016 2.14 183. 181. FALSE 5 0
#> 8 BLZ 2017 2.39 185. 183. FALSE 5 0
#> 9 BLZ 2018 2.34 2.34 0 FALSE 5 0
Another interesting case is Israel. There are no outliers, but there are minor differences between WDI and WHO for 18 of 25 observations. Since all the differences are quite small, everything might be okay. Yet it could still be worth investigating why the values for so many years have changed.
res[res$iso3c == "ISR", ][cols]
#> # A tibble: 25 × 8
#> iso3c year value current_value diff outlier n_diff n_outlier
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 ISR 1995 8.23 8.23 0 FALSE 18 0
#> 2 ISR 1996 5.56 5.56 0 FALSE 18 0
#> 3 ISR 1997 5.50 5.50 0 FALSE 18 0
#> 4 ISR 1998 5.60 5.60 0 FALSE 18 0
#> 5 ISR 1999 5.71 5.71 0 FALSE 18 0
#> 6 ISR 2000 7.76 6.02 1.73 FALSE 18 0
#> 7 ISR 2001 7.79 6.12 1.68 FALSE 18 0
#> 8 ISR 2002 7.68 6.05 1.63 FALSE 18 0
#> 9 ISR 2003 7.34 5.92 1.42 FALSE 18 0
#> 10 ISR 2004 7.10 5.78 1.32 FALSE 18 0
#> # … with 15 more rows