compare_datasets(new, current, alpha = 0.05)
new: data.frame. A pddcs formatted data frame.
current: data.frame. A pddcs formatted data frame.
alpha: numeric. Significance level for a two-tailed test. Defaults to 0.05.
A tibble with the following columns added to new:
current_value: Value in current dataset.
current_source: Source in current dataset.
diff: Absolute difference between value and current_value.
outlier: TRUE if the diff value is an outlier.
p_value: p-value for the diff value.
n_diff: Sum of diff by country.
n_outlier: Sum of outlier by country.
The main usage of compare_datasets() is to compare the individual country-year values in the new source dataset with the current values in WDI/DCS.
Comparison is done by merging the two datasets (left join on new), calculating the absolute difference between the two value columns, and then running outlier detection on the diff column.
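The merge and difference steps can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the toy data and the column names country, year, value, and source are assumptions about the pddcs format, and the outlier detection step is left out because the method is not described here.

```r
# Toy inputs. The column names country, year, value, and source are
# assumptions about the pddcs format, not taken from the package.
new <- data.frame(
  country = c("AAA", "AAA", "BBB"),
  year    = c(2019, 2020, 2020),
  value   = c(10, 12, 5),
  source  = "who",
  stringsAsFactors = FALSE
)
current <- data.frame(
  country = c("AAA", "AAA", "BBB"),
  year    = c(2019, 2020, 2020),
  value   = c(10, 15, 5),
  source  = "wdi",
  stringsAsFactors = FALSE
)

# Rename the current columns so they do not clash with those in `new`.
names(current)[names(current) == "value"]  <- "current_value"
names(current)[names(current) == "source"] <- "current_source"

# Left join on `new`: every row of `new` is kept, matched or not.
merged <- merge(new, current, by = c("country", "year"), all.x = TRUE)

# Absolute difference between the two value columns.
merged$diff <- abs(merged$value - merged$current_value)

# Sum of diff by country, repeated on each row of that country
# (assumed layout for the n_diff column).
merged$n_diff <- ave(merged$diff, merged$country, FUN = sum)
```

Using `all.x = TRUE` keeps rows of `new` that have no match in `current`; those rows get NA in `current_value` and `diff`.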
Users should look for both large differences in values (diff) and large p-values (p_value) to identify outliers or other possible unwanted changes in the data.
When only a few values for a specific country differ substantially from the current dataset in WDI/DCS, they should pop out as outliers with large p-values. On the other hand, most or all values for a specific country might have changed. In that case there are unlikely to be any outliers, but the changes can be found by inspecting the diff and n_diff columns.
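One way to inspect a result for both cases is sketched below. The data frame is a hand-made stand-in for the output of compare_datasets(), with invented values and an assumed country column, so treat it as an illustration rather than real output.

```r
# Hand-made stand-in for a compare_datasets() result. The values are
# invented for illustration and the country column name is an assumption.
res <- data.frame(
  country   = c("AAA", "AAA", "BBB", "BBB"),
  year      = c(2019, 2020, 2019, 2020),
  diff      = c(0, 8, 2, 2),
  outlier   = c(FALSE, TRUE, FALSE, FALSE),
  n_diff    = c(8, 8, 4, 4),
  n_outlier = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Case 1: a few values stand out. List flagged rows, largest diff first.
flagged <- res[res$outlier, ]
flagged <- flagged[order(-flagged$diff), ]

# Case 2: most or all values for a country changed. Reduce to one row
# per country and order by the total difference.
by_country <- unique(res[, c("country", "n_diff", "n_outlier")])
by_country <- by_country[order(-by_country$n_diff), ]

# Countries with changes but no flagged outliers.
shifted <- by_country$country[by_country$n_diff > 0 & by_country$n_outlier == 0]
```

Here `flagged` holds the single outlying row for AAA, while `shifted` picks up BBB, whose values all moved by the same amount without any one of them being flagged.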
if (FALSE) {
  # Fetch indicator from source
  df <- fetch_indicator("SH.MED.NUMW.P3", "who")
  # Compare with WDI
  dl <- compare_with_wdi(df)
  # Compare new (source) and current (WDI) datasets
  res <- compare_datasets(new = dl$source, current = dl$wdi)
}