Compare datasets — compare_datasets • pddcs

Compare new datasets from source with current datasets in WDI/DCS.

compare_datasets(new, current, alpha = 0.05)

Arguments

new: data.frame: A pddcs-formatted data frame with the new data from source.
current: data.frame: A pddcs-formatted data frame with the current data in WDI/DCS.
alpha: numeric: Significance level for a two-tailed test. Defaults to 0.05.

Value

A tibble with the following columns added to new:

current_value: Value in current dataset.
current_source: Source in current dataset.
diff: Absolute difference between value and current_value.
outlier: TRUE if the diff value is an outlier.
p_value: p-value for the diff value.
n_diff: Sum of diff by country.
n_outlier: Sum of outlier by country.

Details

The main use of compare_datasets() is to compare the individual country-year values in the new source dataset with the current values in WDI/DCS.

Comparison is done by merging the two datasets (left join on new), calculating the absolute difference between the two value columns, and then running outlier detection on the diff column.
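The three steps above can be sketched in base R. This is a minimal illustration, not the pddcs implementation: the column names (country, year, value, source), the toy data, and the z-score outlier rule are all assumptions made for the example.

```r
# Hypothetical toy data; the real pddcs column layout may differ.
keys <- data.frame(
  country = rep(c("AAA", "BBB"), each = 4),
  year    = rep(2001:2004, times = 2)
)
new <- cbind(keys, value = c(1.0, 1.1, 1.2, 9.0, 2.0, 2.1, 2.2, 2.3),
             source = "who")
current <- cbind(keys, current_value = c(1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2, 2.3),
                 current_source = "wdi")

alpha <- 0.05

# Step 1: left join on new (all.x = TRUE keeps every row of new)
res <- merge(new, current, by = c("country", "year"), all.x = TRUE)

# Step 2: absolute difference between the two value columns
res$diff <- abs(res$value - res$current_value)

# Step 3: outlier detection on diff.
# Assumed rule: flag |z| above the two-tailed normal critical value.
z <- (res$diff - mean(res$diff, na.rm = TRUE)) / sd(res$diff, na.rm = TRUE)
res$outlier <- !is.na(z) & abs(z) > qnorm(1 - alpha / 2)

res[res$outlier, c("country", "year", "value", "current_value", "diff")]
```

In this toy data only the AAA value for 2004 differs from the current dataset, so it is the only row flagged.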

Users should look for both large differences in values (diff) and large p-values (p_value) to identify outliers or other possible unwanted changes in the data.

If a few values for a specific country are substantially different from the current dataset in WDI/DCS, they should pop out as outliers with large p-values. On the other hand, most or all values for a specific country might have changed. In that case there are unlikely to be any outliers, but the changes can be found by inspecting the diff and n_diff columns.
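The second scenario can be illustrated with a small base R sketch. It assumes a hypothetical result table and the same kind of z-score rule as a stand-in for the package's outlier detection. The per-country columns follow the descriptions under Value (sum of diff and sum of outlier by country).

```r
# Hypothetical comparison result: every BBB value has shifted by 0.5,
# while AAA is unchanged.
res <- data.frame(
  country       = rep(c("AAA", "BBB"), each = 4),
  year          = rep(2001:2004, times = 2),
  value         = c(1, 2, 3, 4, 2.5, 3.5, 4.5, 5.5),
  current_value = c(1, 2, 3, 4, 2.0, 3.0, 4.0, 5.0)
)
res$diff <- abs(res$value - res$current_value)

# Assumed outlier rule (not necessarily what pddcs does):
# flag |z| above the two-tailed normal critical value at alpha = 0.05.
z <- (res$diff - mean(res$diff)) / sd(res$diff)
res$outlier <- abs(z) > qnorm(1 - 0.05 / 2)

# Per-country summaries, following the column descriptions under Value
res$n_diff    <- ave(res$diff, res$country, FUN = sum)
res$n_outlier <- ave(as.numeric(res$outlier), res$country, FUN = sum)

unique(res[, c("country", "n_diff", "n_outlier")])
```

No single BBB row stands out under this rule, because half of all rows carry the same difference. The change is still visible in n_diff for BBB.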

Examples


if (FALSE) {
  library(pddcs)

  # Fetch indicator from source
  df <- fetch_indicator("SH.MED.NUMW.P3", "who")

  # Compare with WDI
  dl <- compare_with_wdi(df)

  # Compare new (source) and current (WDI) datasets
  res <- compare_datasets(new = dl$source, current = dl$wdi)
}