Data preparation reference
This is the reference deep-dive on the two data inputs that feed every PTI app:
- A named list of administrative-boundary
sftibbles (saved asadmin_bounds.rds). - A multi-sheet Excel metadata template describing your indicators (
mtdt.xlsx).
If you just want a recipe, start with Build a PTI. This page is for when that walkthrough refers you here for a data-shape question, or when a validator tells you something specific is wrong with your inputs.
The package’s own validate_geometries() and validate_metadata() functions encode every rule on this page. If you’re unsure whether your data conforms, run them — they’re authoritative.
1. Boundary shapes — admin_bounds.rds
1.1 Top-level structure
shp_dta is a named list of sf tibbles, one element per administrative level. It is normally saved as a single .rds file (commonly admin_bounds.rds) and read with readRDS() at app launch.
The list element names follow the package convention admin<N>_<HumanName>, where <N> is a digit 0–9 and <HumanName> is a single word naming the level (for example, Oblast, District, Hexagon). The bundled Ukraine sample uses levels 0, 1, 2, and 4 — admin3 is intentionally skipped to demonstrate that level numbers do not need to be contiguous, but each level you do include must be uniquely numbered.
1.2 Naming convention
| Element name | Meaning |
|---|---|
admin0_Country |
Mandatory. The country polygon. PTI apps require it even if you do not display admin0 in the UI. |
admin1_Region |
First-level subdivision (province, oblast, region, …) |
admin2_District |
Second-level subdivision |
admin<N>_<Name> |
Pattern for any level. <N> is the integer level depth; <Name> is one human-readable word. No spaces, no colons. |
If your country uses different administrative terminology (e.g. admin1_Wilaya, admin2_Commune), use it — only the admin<N>_ prefix matters to the package; the name suffix is for display.
1.3 Required columns per layer
Every sf tibble in the list must, at minimum, carry these columns:
| Column | Type | Purpose |
|---|---|---|
admin<N>Pcod |
character |
Unique polygon identifier (P-code). One per row at this level. |
admin<N>Name |
character |
Human-readable polygon name (one per row). |
area |
numeric |
Polygon area, in km². Compute after re-projecting to an appropriate UTM. |
geometry |
sfc_(MULTI)POLYGON |
Spatial geometry. Must be POLYGON or MULTIPOLYGON only. |
Both admin<N>Pcod and admin<N>Name use the level number of the current layer. For example, in admin2_District, those columns are admin2Pcod and admin2Name.
str(ukr_shp$admin1_Oblast, max.level = 1)
#> sf [27 × 5] (S3: sf/tbl_df/tbl/data.frame)
#> - attr(*, "sf_column")= chr "geometry"
#> - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
#> ..- attr(*, "names")= chr [1:4] "admin0Pcod" "admin1Pcod" "admin1Name" "area"1.4 Parent-child mapping (cascade rule)
Sub-admin layers must also carry the parent-level P-code columns, so the package can join up the hierarchy without re-computing keys.
The rule: a layer at depth N must contain the admin<k>Pcod column for every level k < N that exists elsewhere in the list. Names are not required at parent levels (only Pcods).
Schematically:
admin0_Country [admin0Pcod, admin0Name]
▲
│ every admin1 row carries admin0Pcod
│
admin1_Region [admin0Pcod, admin1Pcod, admin1Name]
▲
│ every admin2 row carries admin0Pcod + admin1Pcod
│
admin2_District [admin0Pcod, admin1Pcod, admin2Pcod, admin2Name]
If a P-code value appears in admin2_District$admin1Pcod but does not appear in admin1_Region$admin1Pcod, that’s an orphan child — validate_geometries() will flag it.
1.5 Geometry constraints
-
Geometry types:
POLYGONandMULTIPOLYGONonly.POINT,LINESTRING,GEOMETRYCOLLECTIONare rejected. -
Validity: each geometry should pass
sf::st_is_valid(). Runsf::st_make_valid()on raw shapes before saving. - CRS: all layers must be in EPSG:4326 (WGS84). The package requires a consistent CRS across every layer in the list, and EPSG:4326 is the project standard. If your source data is in a different CRS, re-project to EPSG:4326 before assembling the list.
-
Area: compute
areain km² in EPSG:4326 usingsf::st_area()with s2 enabled (the default).sf::st_area()returns aunitsobject; convert and drop the attribute withas.numeric(units::set_units(sf::st_area(geometry), "km^2")). For country-level PTI apps the resulting 1–5% area approximation is fit for purpose.
1.6 Save format
Save the assembled list as compressed RDS:
saveRDS(my_shp, "app-data/admin_bounds.rds", compress = "gz")A compressed Ukraine-sized RDS is typically a few MB.
1.7 What validate_geometries() catches
validate_geometries(my_shp) runs the following checks per layer:
- Layer name parses to
admin<digit>_<name>. - Element is an
sfobject. - Has a column called
geometryof classsfc. - All geometries are
POLYGONorMULTIPOLYGON. -
admin<N>Pcodandadmin<N>Namecolumns exist. -
admin<N>Pcodis unique (no duplicate IDs). -
admin<N>Nameis unique (no duplicate names). - No
NAs inadmin<N>Pcodoradmin<N>Name. - All parent-level P-code columns exist.
- All parent P-code values appear in their parent layer (cascade rule, §1.4).
It also verifies, across the full list, that get_mt(my_shp) produces a row for every polygon at the most-disaggregated level — i.e. that the levels form a strict hierarchy with no gaps.
1.8 What validate_geometries() does not check
- CRS consistency between layers.
- Whether
areais in km² versus m² (it cannot tell). - Whether geometries are topologically valid (use
sf::st_is_valid()separately). - Whether the polygons cover the country (gaps are not detected).
2. Metadata Excel template — mtdt.xlsx
The metadata template is a single .xlsx file with one sheet per admin level, plus three book-keeping sheets. fct_template_reader() parses it into the list shape that launch_pti() consumes.
2.1 Sheet inventory
| Sheet name | Required? | Purpose |
|---|---|---|
general |
yes | One-row tibble with country-level metadata (country column). |
metadata |
yes | One row per indicator. Fourteen columns describing each indicator (see §2.3). |
admin<N>_<Name> |
one or more | Wide-format indicator values. One column per indicator at that admin level. Sheet names must match the names in your admin_bounds.rds exactly. |
weights_table |
optional | Pre-defined weight schemes. Each scheme is a ws<digit>.. column group. See §2.5. |
point_data |
optional | Reserved for future point-overlay support. Currently ignored by the pipeline; safe to omit. |
The bundled ukr_mtdt_full is the canonical example:
The populated sheets here are general, admin1_Oblast, admin2_Rayon, admin4_Hexagon, and metadata — note that the admin-level sheet names match the slot names in ukr_shp exactly. This is mandatory.
2.2 The general sheet
A one-row tibble. The pipeline currently uses one column:
| Column | Type | Purpose |
|---|---|---|
country |
character | Display name of the country (e.g. "Ukraine"). |
Other columns are tolerated but ignored.
ukr_mtdt_full$general
#> # A tibble: 1 × 1
#> country
#> <chr>
#> 1 Ukraine2.3 The metadata sheet
This is the indicator catalogue. One row per indicator. Fourteen columns. Every column listed below is required (write a default if you have nothing meaningful for it).
| Column | Type | Required | Purpose |
|---|---|---|---|
var_code |
character | yes | The indicator’s column name as it appears in the admin sheets. Must match exactly. No :.
|
var_name |
character | yes | Display name shown in the dashboard UI. No :.
|
var_description |
character | recommended | Long-form description: how the indicator was constructed, sources used, units, caveats. |
var_order |
integer | yes | Sort order within a pillar. Indicators are listed pillar_group then var_order ascending. |
var_units |
character | optional | Unit of measurement (e.g. "km", "%", "USD", "persons/km²"). Display-only. |
spatial_level |
character | yes | The most disaggregated admin level where this indicator has data. Value must equal an admin<N>_<Name> sheet name in this same file. |
pillar_group |
numeric | yes | Numeric code grouping indicators into pillars. All indicators with the same pillar_group form one pillar. |
pillar_name |
character | yes | Human-readable pillar name (e.g. "Human capital"). All rows sharing a pillar_group must share pillar_name. No :.
|
pillar_description |
character | optional | Long-form pillar description. Display-only. |
fltr_exclude_pti |
logical | yes |
TRUE excludes this indicator from the PTI calculation. FALSE (default) includes it. |
fltr_exclude_explorer |
logical | yes |
TRUE excludes this indicator from the Data Explorer tab. FALSE (default) includes it. |
fltr_overlay_pti |
logical | yes | Reserved overlay flag. Set FALSE unless you know you need it. |
fltr_overlay_explorer |
logical | yes | Reserved overlay flag. Set FALSE. |
legend_revert_colours |
logical | yes |
TRUE reverses the colour scale for this indicator’s legend (use when high values are bad). |
Rules of thumb:
- All
fltr_*columns andlegend_revert_coloursmust be booleans, not strings. If Excel renders them as"TRUE"/"FALSE"strings,validate_read_metadata()will complain. -
var_codevalues must be valid R column names (no spaces, no leading digits, no special characters except_). - The
:character breaks app compilation inpillar_nameandvar_namebecause it’s used internally as a separator.
The bundled metadata sheet illustrates the layout:
ukr_mtdt_full$metadata[, c(
"var_code", "var_name", "spatial_level",
"pillar_group", "pillar_name", "fltr_exclude_pti"
)]
#> # A tibble: 9 × 6
#> var_code var_name spatial_level pillar_group pillar_name fltr_exclude_pti
#> <chr> <chr> <chr> <chr> <chr> <lgl>
#> 1 var_nval3_sk… var_nva… admin1_Oblast 1 Pilar 1 FALSE
#> 2 var_nval6_na… var_nva… admin2_Rayon 1 Pilar 1 FALSE
#> 3 var_nval15_s… var_nva… admin2_Rayon 1 Pilar 1 FALSE
#> 4 var_nvalinf_… var_nva… admin2_Rayon 1 Pilar 1 FALSE
#> 5 var_nval60_n… var_nva… admin4_Hexag… 1 Pilar 1 FALSE
#> 6 var_nval4_sm… var_nva… admin4_Hexag… 1 Pilar 1 FALSE
#> 7 var_nvalinf_… var_nva… admin4_Hexag… 1 Pilar 1 FALSE
#> 8 var_nvalinf_… var_nva… admin4_Hexag… 1 Pilar 1 FALSE
#> 9 var_nvalinf_… var_nva… admin4_Hexag… 1 Pilar 1 FALSE2.4 Per-admin sheets
One sheet per admin level you have indicator data for. Sheet name must be identical to the corresponding slot name in admin_bounds.rds (e.g. admin2_Rayon). Wide format: rows are polygons, columns are indicators.
Required columns:
| Column | Type | Purpose |
|---|---|---|
admin<N>Pcod |
character | Foreign key. Must match admin<N>Pcod in the corresponding sf layer 1-to-1. |
admin<N>Name |
character | Mirror of the geometry’s name column. |
Parent admin<k>Pcod
|
character | All k < N ancestor P-codes (cascade rule, §1.4). |
year |
integer | Vintage of the values in this row. May vary across rows; pipeline aggregates by P-code rather than year. |
area |
numeric | Area in km², usually copied from the geometry layer. |
Then one column per indicator measured at this level. Column name must equal the indicator’s var_code in the metadata sheet. Type should be numeric (integer is OK).
fct_template_reader() drops indicator columns that are not declared in the metadata sheet, and drops admin sheets that have no remaining indicator columns. So the system tolerates “extra” columns in your raw data — they are silently filtered out.
str(ukr_mtdt_full$admin1_Oblast)
#> tibble [27 × 8] (S3: tbl_df/tbl/data.frame)
#> $ admin0Pcod : chr [1:27] "UK" "UK" "UK" "UK" ...
#> $ admin1Pcod : chr [1:27] "UK01" "UK02" "UK03" "UK04" ...
#> $ admin1Name : chr [1:27] "Cherkasy" "Chernihiv" "Chernivtsi" "Crimea" ...
#> $ year : num [1:27] NA NA NA NA NA NA NA NA NA NA ...
#> $ var_nval6_na_adm12 : num [1:27] 2 2 5 1 1 4 6 3 2 NA ...
#> $ var_nval3_skewd_adm1 : num [1:27] -1000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
#> $ var_nval15_small_skewd_adm12: num [1:27] 9.2e-07 9.2e-07 9.2e-07 9.2e-07 9.2e-07 ...
#> $ var_nvalinf_unif_adm124 : num [1:27] -1.56 -8.91 3.18 2.65 1.56 ...2.5 The optional weights_table sheet
Lets you ship pre-defined weight schemes alongside the dashboard. Users can pick from these in the UI instead of building weights manually.
Layout: one row per indicator (var_code), then column groups named ws<digit>..<column> — one group per weight scheme. Each scheme has at least:
| Column | Type | Purpose |
|---|---|---|
ws<digit>..name |
character | Display name of the weight scheme (e.g. "Equal weights", "Poverty-heavy"). |
ws<digit>..value |
numeric | The weight assigned to this var_code under this scheme. Sum within a scheme is normalised internally. |
Multiple schemes use successive ws1.., ws2.., ws3.. prefixes. fct_convert_weight_to_clean() (internal) reshapes this into the per-scheme list the calculator consumes.
Omit the sheet entirely if you want users to build all weights from the UI.
2.6 What validate_metadata() catches
validate_metadata(shp_path, mtdt_path) runs the file through validate_read_shp(), then validate_read_metadata(), then executes the full calculation pipeline under all-equal weights and confirms the number of scored pillars equals the number of indicators. Specifically:
- The shapes file is readable as a non-empty named list (
validate_read_shp()). - The metadata file is readable as a non-empty
metadatasheet (validate_read_metadata()). - All
fltr_*columns are read as logical, not string. - The pipeline runs without error and produces the expected count of scored pillars.
2.7 What validate_metadata() does not check (yet)
These are tracked under issue #7 and will be added incrementally:
- ID-mismatch reports between
admin<N>Pcodin your sheets and thesflayer. - Per-indicator data-type and NA-percentage summaries.
- CSV-format inputs (currently
.xlsxonly). - Auto-generated metadata from a bare CSV.
3. Common pitfalls
A non-exhaustive list of things that have bitten developers before. Most are caught by the validators; a few are not.
-
fltr_*stored as strings. Excel’s autoformat sometimes savesTRUE/FALSEas text. Set the cell type to Boolean (or usewritexl::write_xlsx()from R with logical columns) sofct_template_reader()reads them correctly. -
Indicator column not in
metadatasheet. If the indicator column exists inadmin1_Oblastbut is not declared in themetadatasheet, it is silently dropped at template load. App users will not see it. -
var_codemismatch. A typo between the indicator column name in the admin sheet andvar_codein the metadata sheet means the indicator is dropped without an error. -
spatial_levelnot matching a sheet. If the metadata saysspatial_level = "admin3_District"but no such sheet exists, the indicator is dropped. -
admin0_Countrymissing. The pipeline requires the country polygon even if the UI only displays admin1+. -
Lower-case admin keys.
admin1_oblast≠admin1_Oblast. Names are case-sensitive acrossadmin_bounds.rdsslots,metadata$spatial_levelvalues, and admin sheet names. -
:in pillar or variable names. The package uses:internally as a separator;:in display strings will fragment them. -
Pillar metadata inconsistent. Two rows with the same
pillar_groupmust share the samepillar_nameandpillar_description, or rendering becomes ambiguous. -
Area in m² not km².
validate_geometries()cannot detect this; the legend just shows surprisingly large numbers. Always convert withunits::set_units(., "km^2")before saving. -
CRS mismatch between layers. The package does not enforce a single CRS. Inconsistent CRSs can produce wrong centroids and wrong
area. Re-project once at the top of your data-prep pipeline and reuse. -
Orphan child P-codes.
admin2_District$admin1Pcodvalues must all appear inadmin1_Region$admin1Pcod.validate_geometries()warns on orphans. - Non-numeric indicator columns. Indicator columns must be numeric. Strings, factors, dates are rejected by the pipeline.
4. Worked end-to-end example
This example assembles a metadata Excel file from the bundled data, then validates it. Runnable in any session that has devPTIpack installed.
library(devPTIpack)
library(writexl)
data(ukr_shp)
data(ukr_mtdt_full)
# 1. Save the bundled boundary list as the kind of .rds a user would ship.
shp_path <- tempfile(fileext = ".rds")
saveRDS(ukr_shp, shp_path)
# 2. Build a metadata template by lifting the bundled metadata + admin sheets
# into a list of data frames in the shape writexl expects.
sheets <- list(
general = ukr_mtdt_full$general,
metadata = ukr_mtdt_full$metadata,
admin1_Oblast = ukr_mtdt_full$admin1_Oblast,
admin2_Rayon = ukr_mtdt_full$admin2_Rayon
)
mtdt_path <- tempfile(fileext = ".xlsx")
writexl::write_xlsx(sheets, mtdt_path)
# 3. Read it back through the package's template reader to confirm the
# round-trip works.
parsed <- fct_template_reader(mtdt_path)
names(parsed)
# 4. Validate the geometries (in-memory).
validate_geometries(ukr_shp)
# 5. Validate the on-disk pair (shapes + metadata) end-to-end.
validate_metadata(shp_path = shp_path, mtdt_path = mtdt_path)If both validators report pass, you can launch the app:
launch_pti(
shp_dta = readRDS(shp_path),
inp_dta = fct_template_reader(mtdt_path),
app_name = "Worked example"
)See also
- Build a PTI — the task-oriented walkthrough this page is referenced from.
- PTI Methodology — the conceptual framing (what PTI is, why it exists, the math).
-
?validate_geometries,?validate_metadata— the authoritative validator references. -
?ukr_shp,?ukr_mtdt_full— the bundled sample inputs that all examples on this page use.
External worked-example repositories (private — request access from the listed maintainer):
-
zamPTIdata— Zambia data-prep pipeline. -
kgzPTIdata— Kyrgyz Republic data-prep pipeline.
Both private; request access from Eduard Bukin.