Exploratory Data Analysis and Quality Assessment - Metro Manila#
1. Overview#
This analysis presents an Exploratory Data Analysis (EDA) and a Quality Assessment (QA) of large-scale mobility data collected in the Philippines in 2023, specifically focusing on the Metro Manila area. The EDA aims to uncover key characteristics of the dataset, such as its temporal dynamics, spatial distribution, and patterns of user activity. Building on these insights, the QA component assesses the dataset’s coverage, temporal stability, and representativeness, while identifying potential structural biases and limitations that could affect downstream analytical and policy applications. The analysis is conducted primarily at an aggregate level, focusing on identifying systematic trends within the data rather than reconstructing individual mobility behaviors.
1.1 Executive Summary#
The analyzed dataset contains approximately 4.4 billion GPS observations from 27.18 million users during 2023, with high overall temporal coverage (96.62%).
Overall assessment: The dataset is suitable for constructing aggregate, presence-based indicators of urban space usage within Metro Manila. Its primary strength lies in its extensive spatial coverage across a dense metropolitan area. However, its use requires explicit treatment of temporal regime shifts, periods of spatial concentration, and strong heterogeneity in user activity.
Key Findings#
1. Temporal dynamics are affected by structural changes in the data-generation process.
The dataset is stable during the first part of the year but exhibits a clear structural break from 10 July 2023, in which both the number of GPS observations and active users increase without a corresponding expansion in spatial coverage. Within this post-shift period, a phase of increased spatial concentration is observed between 27 July and 20 August, during which activity becomes highly concentrated in a small subset of locations.
In addition, a 13-day data gap in September (18-30 September) introduces a structural discontinuity. These shifts reflect changes in data generation rather than clear behavioral mobility changes and must be treated explicitly in longitudinal analysis.
2. Spatial coverage is high, but activity is unevenly distributed across locations.
Spatial coverage across Metro Manila is near complete, with almost all hexagonal units recording observations during the year. Visitation patterns, as expected, follow a heavy-tailed distribution: a small fraction of locations accounts for a large share of activity. During anomalous periods a limited number of highly visited hexagons dominate the dataset (top 1%of visited hexes account for 60% of observations). At the same time, the number of observed users is positively associated with resident population at the aggregate level, indicating that the dataset captures broad spatial patterns of activity within the metropolitan area.
3. User activity is highly heterogeneous and daily traces are sparse.
User contributions follow a heavy-tailed pattern, with a small subset of highly active users generating a large share of observations, while most users produce sparse and localized daily traces. A large fraction of users generates only a few observations per day and remains within one or two spatial units. From 10 July onward, user-level distributions become more homogeneous, with reduced variability and lower per-user activity, suggesting a shift in tracking intensity and/or user composition. As a result, the dataset is structurally more robust as a presence signal than as a basis for detailed individual mobility reconstruction.
Implications for Policy Use#
Taken together, these findings indicate that the dataset should be interpreted primarily as a coverage and presence indicator, not as a complete representation of individual mobility behavior.
Accordingly:
Temporal comparisons require normalization and regime awareness. Raw time-series comparisons across periods with different sampling characteristics are not valid without adjusting for variations in user coverage and tracking intensity.
Spatial indicators should account for concentration effects. Metrics based on averages or totals can be dominated by a small number of highly active locations, particularly during anomalous periods, and should therefore rely on distribution-aware summaries.
Individual-level mobility metrics require strong aggregation. Trajectory-based indicators are sensitive to user sparsity and contribution inequality and should not be used without filtering or normalization.
Overall, the dataset is well suited for aggregate, presence-based analyses of urban space usage at daily or coarser temporal resolution within Metro Manila. However, longitudinal analyses require particular caution: direct comparisons across time periods may reflect changes in sampling intensity or user coverage rather than real changes in mobility or space usage. Accordingly, analyses should:
(i) exclude or explicitly model anomalous periods, (ii) avoid comparing metrics across periods with different data-generation regimes unless appropriately normalized, and (iii) interpret spatial patterns in light of potential concentration effects.
Importantly, the dataset does not support the construction of a full-year baseline of the Urban Space Usage Index, as regime shifts in tracking intensity and user coverage would introduce structural breaks unrelated to underlying spatial dynamics. A valid annual baseline must therefore be derived from temporally stable regimes or constructed using explicit normalization procedures to ensure interpretability and comparability over time.
2. Context and Objectives#
This analysis is conducted to generate high-frequency indices of urban space use (e.g., retail centers, construction sites, manufacturing zones, financial centers, residential areas). These indicators aim to inform policy-relevant analyses of urban activity, resilience, and responses to external shocks, including major events and climate-related disruptions.
The analysis focuses primarily on the spatial dimension of the data. Mobility observations generated by individuals are used to characterize aggregate patterns of activity across locations rather than to reconstruct individual mobility trajectories. Users are therefore treated as proxies for mobility intensity and the popularity of places.
Accordingly, the EDA and QA prioritize the assessment of aggregate, presence-based spatial signals, with particular attention to coverage, temporal stability, and sensitivity to spatial and temporal aggregation choices. User-level trajectory analyses are explicitly outside the scope of this evaluation.
The findings of this assessment directly inform methodological decisions in subsequent project phases, including the selection of spatial resolution, normalization strategies, and temporal smoothing procedures used to construct the Urban Space Usage Index.
3. Dataset Overview#
The mobility dataset used in this analysis is the Veraset Movement dataset, provided by Veraset as part of the Mobility Data collection from the Development Data Partnership. This dataset consists of anonymized, high-frequency mobile device location pings collected through a network of mobile applications and software development kits (SDKs). Each record captures a device’s geographic coordinates (latitude and longitude), a UTC timestamp indicating when the observation was recorded, and a device identifier.
For this particular study, the analysis focuses on mobility patterns in Metro Manila (Philippines) during the year 2023, using the dataset to examine spatial and temporal movement dynamics within the metropolitan region.
3.1 Data Sources and Structure#
The dataset is collected and stored at a daily granularity and includes the following fields:
uid(string): Unique identifier of the userdatetime(datetime, UTC): Timestamp of the observation, recorded in UTChex_id(string): H3 spatial index at resolution 7latitude(float): Latitude coordinatelongitude(float): Longitude coordinatecountry(category): Country code
A preliminary check confirms that the country field is constant and equal to PH for all records, indicating that the dataset corresponds to a single national context and that this variable does not contribute to analytical variability. While the original dataset covers the entire Philippines, this analysis focuses specifically on the Metro Manila metropolitan area. Therefore, during the data loading process, only observations located within the geographic boundaries of Metro Manila are retained for the analysis.
All timestamps are reported in UTC and must be converted to the appropriate local time zone before performing any time-based analysis. After conversion, records associated with a given partition date may shift to the previous or following local calendar day, depending on the time zone. In addition, a small subset of records may exhibit minor temporal offsets due to signal transmission delays. Furthermore, as specified in Veraset’s data documentation, the data feed may include observations from adjacent dates up to three days after the nominal delivery date. For example, events captured on January 1 (based on utc_timestamp) may appear in data feeds distributed between January 1 and January 4.
To properly account for both time zone conversion effects and feed delivery lags, it is therefore necessary to load data ranging from one day prior to up to three days after the target date to ensure that all relevant records are captured.
3.2 Dataset Statistics#
Metric |
Value |
|---|---|
Total GPS points |
4.4 billion |
Total users |
27.18 million |
Total active areas |
1041 |
Time span |
Jan-Dec 2023 |
Missing days |
January 1st and September 18th 3 AM - 30th 3 AM |
Spatial unit |
Default H3 res 7; analyzed at res 8 for finer urban-scale analysis (lat/lng; aggregable to any H3 resolution). |
Temporal coverage |
96.62% |
Spatial coverage |
99.52% |
Avg. points / day |
12,426,375 |
Avg. users / day |
733,539 |
Avg. hexes / day |
869 |
3.3 Dataset Temporal Coverage#
We assess the temporal coverage of the dataset relative to the 2023 calendar year. Overall temporal coverage is high, with observations available for 96.62% of all days and hours in 2023 (Figure 1).
Two explicit data gaps are identified. First, observations are missing on January 1st between 00:00 and 08:00, corresponding to the beginning of the year. Second, a continuous interruption in the data occurs from September 18th at 08:00 to September 30th at 03:00, amounting to a 13-day gap. Because September 18th and September 30th contain only partial observations (8 and 16 hours, respectively), both days are excluded to ensure consistency in daily aggregation. Consequently, the period September 18th to September 30th (inclusive) is treated as missing in the daily analysis. For the same reason, January 1st is also excluded, as it contains incomplete observations. We recommend treating September 18-30 as structurally missing and avoiding interpolation across this interval in time-series analyses, as it introduces a discontinuity in the data-generating process.
Outside of these periods, the dataset contains observations for every day and hour of the year.
Figure 1. Heatmap showing the total number of GPS points recorded for each hour of the day across the calendar year. The colorbar represents the absolute number of GPS observations (in millions), with lighter shades indicating higher volumes. White areas indicate periods with no recorded observations.
3.4 Dataset Spatial Coverage#
The dataset provides coverage of Metro Manila at the latitude-longitude level and is aggregated by default to H3 hexagons at resolution 7. For the purposes of this analysis, the data are re-aggregated to H3 resolution 8 (corresponding to an average cell area of approximately 0.73 km²) to capture spatial variation within a dense urban environment. Given the availability of latitude-longitude coordinates, the dataset can be aggregated to any H3 spatial resolution.
Spatial coverage is defined as the share of H3 hexagons covering the study area that record at least one observation during the year. Under this definition, the spatial coverage of the dataset is 99.52%. To ensure that observations located near administrative boundaries are properly captured, the study area was extended using a small spatial buffer around Metro Manila. As a result, the dataset achieves near-complete spatial coverage, effectively approximating full coverage of the metropolitan area.
Figure 2. Spatial coverage of H3 hexagons (resolution 8) in Metro Manila, 2023. Hexagons that record at least one GPS observation during the year are shown in blue. Hexagons with no recorded observations are shown in orange. Spatial coverage is defined as the share of H3 hexagons with at least one observation over the study period.
4. Spatial Data Quality Assessment#
This section examines the spatial dimension of the mobility dataset by analyzing spatial coverage, concentration, and sparsity across H3 hexagonal units. The analysis focuses on the distribution of observations and users across space, the degree of spatial inequality, and the relationship between observed users and resident population.
4.1 Spatial Distribution#
The map on the left in Figure 3 shows the spatial distribution of mobility observations collected throughout 2023 across the metropolitan area of Manila. Data are aggregated on an H3 hexagonal grid (resolution 8) and expressed as the average number of points per hexagon over the year.
As expected for an urban area, observations are concentrated across all the area of interest. Areas with the fewest observations (lighter colors) include the natural area “La Mesa Watershed” in the northeast, and water bodies such as Laguna de Bay in the southeast (map on the right). The presence of points over water bodies is likely due to individuals carrying GPS-enabled devices while traveling by ferry, boat, or ship, and these points will need to be addressed in subsequent steps of the analysis.
Figure 3. (left) Spatial distribution of GPS observations shown as the average annual number of records per H3 hexagon (resolution 8). Higher values represent a greater concentration of recorded activity. The color scale is log₁₀-transformed, with darker blue tones indicating areas with more observations. (right) H3 hexagon grid overlaid on an OpenStreetMap (OSM) base map for spatial reference.
4.2 Total Points and Unique Users per Visited Spatial Unit#
Daily activity at the hex level is generally high. On average, each hexagon contains approximately 14,300 points generated by 1,370 users. However, these averages are influenced by highly active locations: the median values are lower, with a hexagon containing 5,166 points generated by 288 users. The difference between the mean and the median indicates a right-skewed spatial distribution of activity.
The time series of the points-per-hexagon and users-per-hexagon metrics (Figure 4, left and right panels, respectively) shows the presence of different regimes throughout the year. In particular, several time intervals display a marked decrease in both the number of GPS observations per hexagon and the number of active users per hexagon.
During these intervals, the aggregate number of GPS points and active users remains comparable to other periods (see Figure 16). The decline in the per-hexagon metrics therefore does not appear to be driven by a reduction in overall activity. Instead, it suggests a strong spatial imbalance, where observations become concentrated in a small number of highly visited hexagons.
For reference, notable periods include: 1-20 march, 10-12 july, 27 july-20 august.
Figure 4. Daily values of points per hexagon (left) and users per hexagon (right). The black line represents the 50th percentile (median), the blue line the 75th percentile, and the orange line the 90th percentile. The shaded area indicates periods of missing data.
To further investigate this phenomenon, we analyze the distribution of observations across hexagons using Lorenz curves. The curves computed for the anomalous periods and for a baseline period show that, during the anomalous intervals, observations are substantially more concentrated in a small number of hexagons (Figure 5), as indicated by the higher Gini coefficients.
One exception is the period in March, which exhibits a Gini coefficient comparable to that of a reference period, in this case May 2023. The lower values of the per-hexagon metrics are consistent with a general reduction in activity, as both the number of active users and the total number of GPS observations decrease during this interval (see Section 6.1). Consequently, the lower averages observed in this period are expected and do not appear to result from increased spatial concentration of visits.
Although this pattern will be explored in more detail in the following subsection, the preliminary evidence suggests that these regimes should be interpreted with caution. In particular, shifts in the distribution of visits across hexagons may reflect changes in sampling or spatial coverage and not change in mobility.
To assess whether the observed patterns could instead be explained by real-world events affecting mobility, we also investigated potential exogenous events occurring in the metropolitan area of Manila during the identified periods. However, no major events were found that could plausibly account for the observed concentration of observations. For example, no significant events were recorded in August 2023 that could explain the changes observed during the 2023-07-27 - 2023-08-20 interval.
In Section 4.3, we examine the spatial distribution of observations by identifying the hexagons that account for a large share of activity during these anomalous periods.
Figure 5. Lorenz curves of the distribution of GPS observations across hexagons for selected time intervals in 2023, compared with a baseline period. The dashed line represents the line of perfect equality.
4.3 Inequality of Distribution of Observations across Spatial Units#
We analyze the distribution of observations across spatial units (H3 hexes) to assess the degree of spatial inequality, as most hexes receive relatively few observations, while a small fraction accumulates a disproportionately large share of GPS points.
This inequality is apparent in both the rank‑frequency distribution (Figure 6, left), which exhibits a heavy‑tailed pattern, and in the Lorenz curve (Figure 6, right), with an associated Gini coefficient of 0.657. Such levels of spatial inequality are typical in human mobility data, where visit frequencies are known to follow heavy‑tailed distributions. Empirical analyses of large-scale GPS and mobile phone datasets consistently show that most locations are infrequently visited, while a few highly attractive locations capture the majority of visits [GHB08, BBG+18].
Figure 6. Rank-size distribution of total GPS observations per H3 hexagon, illustrating a pronounced heavy-tailed distribution (left). Lorenz curve showing the cumulative share of observations by ranked hexagons (right), with the corresponding Gini coefficient (0.66).
We further characterize spatial inequality by examining the share of total observations captured by the most frequently visited hexes. The results shown in Figure 7 highlight that the top 1% of hexes account for about 10% of all GPS points during the first part of the year. However, following the change in data collection regime on July 10th, we observe a pronounced peak, indicating that observations became highly concentrated in a few popular hexes. For example, in late August, the top 1% of hexes accounted for approximately 60% of all observations, reflecting an extreme concentration of activity.
These fluctuations across data collection regimes suggest that such periods should be carefully considered in subsequent analyses, particularly by excluding high-regime intervals where the top 1% of hexes dominate the dataset.
Figure 7. Time series of the share of total GPS observations captured by the highest-ranked H3 hexagons at different thresholds: top 1% (blue), top 2% (orange), and top 5% (green) most visited hexagons.
The spatial distribution of GPS observations per hexagon reveals distinct patterns across the different data collection regimes in 2023. Figure 8 illustrates this by displaying the log-transformed number of points per hexagon for a reference period (March) and 27 July-20 August.
During the reference period, observations are fairly evenly spread across hexagons, indicating consistent spatial coverage. In contrast, the July-August intervals exhibit a pronounced spatial concentration of visits, with a small subset of hexagons accounting for a disproportionately large share of observations.
While some seasonal mobility variation is expected, the sharp changes observed here suggest that these patterns are more likely driven by shifts in data collection practices than by true changes in mobility behavior.
Figure 8. Spatial distribution of the log₁₀-transformed number of GPS observations per hexagon for March and 27 July-20 August.
4.4 Spatial Representativeness#
GPS observations are collected through GPS-enabled devices and therefore represent only a subset of the total population within the study area. Specifically, they capture individuals who own smartphones, have specific applications installed, and have enabled location-sharing permissions accessible to the data provider. Consequently, the derived mobility metrics should not be interpreted as measures of the true total population. Rather, they serve as high-frequency proxies for the spatial and temporal distribution of observed device users.
To assess spatial representativeness and identify potential representation bias, we examine the relationship between the resident population and observed users at the H3 hexagon level. Specifically, we correlate the average daily number of unique users per hexagon with the corresponding resident population counts (data extracted from WorldPop). At the aggregate level, population and observed users exhibit a strong positive association (Figure 9; Pearson’s r ≈ 0.62), indicating that densely populated areas of the Manila metropolitan region tend to have higher numbers of observed users, though some spatial variability remains.
Figure 9. Bivariate kernel density plot of resident population estimates (WorldPop) versus the average daily number of unique users per H3 hexagon (Pearson correlation, r ≈ 0.62). The dashed line represents the linear fit. Red color intensity corresponds to higher point density.
4.5 Effects of User and Point Thresholds on Spatial Units#
The hexagons within the Metro Manila area generally exhibit high activity in terms of GPS observations, with only a small proportion showing very low daily activity. As shown in Figure 10, on average, fewer than 1% of hexagons record one or fewer GPS points per day. This share increases to 1.4% for hexagons with two or fewer points, and 2.26% for those with five or fewer points.
Number of Points per Hex |
Share of Hexes (%) |
|---|---|
≤ 1 |
0.85 ± 0.47 |
≤ 2 |
1.42 ± 0.64 |
≤ 5 |
2.26 ± 0.77 |
Figure 10. Time series of the share of H3 hexagons recording ≤1 (blue), ≤2 (orange), and ≤5 (green) GPS points. The figure illustrates the proportion of low-activity hexagons over time.
Moreover, only a small proportion of hexagons experience visits from very few unique users (Figure 11). On average, about 2.4% of hexagons record visits from no more than one user per day. This share increases to 3.5% for hexagons with two or fewer users, and 4.2% for those with three or fewer users. These findings suggest that most spatial units exhibit high levels of usage and broad user participation, with only a few hexagons being driven by a very small number of individuals. This is expected in a metropolitan area like Metro Manila, where urban areas typically have higher user engagement.
Number of Users per Hex |
Share of Hexes (%) |
|---|---|
≤ 1 |
2.42 ± 0.68 |
≤ 2 |
3.52 ± 0.77 |
≤ 3 |
4.21 ± 0.99 |
Figure 11. Time series of the share of H3 hexagons visited by ≤1 (blue), ≤2 (orange), and ≤5 (green) unique users.
4.6 Data Quality Implications#
The spatial distribution of observations is consistent with expectations for a dense metropolitan area. Activity is observed across nearly the entire study region and is concentrated in highly populated urban locations. Areas with the lowest activity correspond primarily to natural areas and water bodies (e.g., La Mesa Watershed and Laguna de Bay). Observations over water bodies likely reflect movement-related activity (e.g., ferry or boat travel) rather than stationary presence and may be filtered depending on the analytical objective.
The distribution of visits across hexagons is skewed, with a small number of locations accounting for a large share of total observations (Gini coefficient ≈ 0.66). This heavy-tailed structure is typical of human mobility data and reflects the concentration of activity in highly attractive urban locations. As a result, simple averages can be dominated by high-activity areas, and distribution-aware statistics (e.g., medians, quantiles, or inequality measures) provide a more robust characterization of spatial patterns.
The positive correlation between resident population and observed users (Pearson’s r ≈ 0.62) suggests that the dataset captures the spatial distribution of activity reasonably well, with more densely populated areas generating higher numbers of observed users. This relationship supports the spatial representativeness of the dataset at an aggregate level.
Some temporal intervals exhibit unusually strong spatial concentration of observations. In particular, during parts of July and August the share of activity captured by the most visited hexagons increases substantially, with the top 1% of hexes accounting for up to approximately 60% of all recorded observations. Such patterns likely reflect changes in sampling and should therefore be interpreted with caution.
Despite this concentration, most spatial units maintain relatively high daily activity levels. This supports the use of relatively fine spatial resolutions for aggregate analyses within Metro Manila.
5. User Activity Assessment#
This section examines the user-level dimension of the dataset by analyzing patterns of activity, spatial behavior, and contribution inequality across individuals. Specifically, we investigate the distribution of observations per user, the prevalence of sparse daily traces, and the extent to which user activity is spatially localized.
Given that our primary objective is to construct an Urban Space Index, we do not focus on individual mobility descriptors such as radius of gyration, nor do we attempt to characterize users’ mobility through detailed trajectory reconstruction. Instead, our analysis emphasizes aggregated behavioral patterns that are directly relevant to the proposed index.
5.1 Total Points, Visited Hexes, and Active Window per User#
User-level activity exhibits strong heterogeneity, as evident in Figure 12. The median user records approximately 5 GPS points, 1 visited location, and 203 minutes of observed activity per day, while mean values are higher: 23 GPS points, 2 locations, and 357 minutes. In general, a small share of highly active users accounts for most of the observations while the majority of users generate sparse, discontinuous daily traces.
Figure 12. Daily metrics per user: number of GPS points (left), number of visited hexagons (center), and active window duration in minutes (right). Lines show the mean or selected quantiles (50th, 75th, and 90th percentiles, as indicated). Shaded regions indicate periods with missing data.
As evident from Figure 12, the same intervals during which visits concentrate in a small number of hexagons also exhibit clear changes in the user-level metrics. In particular, these periods are characterized by a marked divergence between the median and the upper quantiles. From 10 July onward, user-level values substantially decrease. In particular, upper quantiles flatten and overall variability declines, especially for the number of visited hexagons and active window duration. This pattern is consistent with a change in the data-generation process, likely reflecting an expansion in the user base combined with less intensive per-user tracking. As a result, user-level metrics are not directly comparable across periods and should be interpreted within homogeneous temporal regimes.
5.2 Inequality of Distribution of Observations across Users#
We assess user‑level contribution patterns to evaluate the degree of user‑representation inequality in the dataset. The distribution of GPS observations across users is highly uneven: most users contribute relatively few observations, while a small fraction of highly active users accounts for a disproportionate share of all recorded points.
This imbalance is clearly visible in the rank-size distribution, which exhibits a pronounced heavy-tailed shape (Figure 13, left). Consistently, the Lorenz curve diverges from the equity line, with an associated Gini coefficient of 0.683, highligting strong concentration of GPS observations among a small share of users (Figure 13, right).
From a data quality perspective, this pattern is typical of passively collected mobility data. However, it has important analytical implications, particularly for user‑weighted metrics and for any analysis that relies on individual‑level mobility intensity.
Figure 13. Rank-size distribution of total GPS observations per user, showing a pronounced heavy-tailed distribution in individual activity levels (left). Lorenz curve depicting the cumulative share of GPS observations by ranked users (right), with the corresponding Gini coefficient (0.683) quantifying inequality in user-level contribution to total activity.
5.3 Effects of Point and Hexes Thresholds on Users#
A large share of users exhibits very sparse daily activity (Figure 14). On a typical day, approximately 19% of users generate at most one GPS point, nearly 33% record two or fewer points, and about 50% record five or fewer points. This proportion increases to 64% when considering users with ten or fewer daily observations.
These results indicate that most users contribute limited and intermittent mobility traces. As a result, trajectory-level analyses of individual mobility require restricting the sample to high-quality users with sufficiently dense trajectories, typically through strict filtering criteria. However, applying such filters would substantially reduce the number of retained users and may introduce selection bias in the resulting analyses.
Number of Points per User |
Share of Users (%) |
|---|---|
≤ 1 |
19.05 ± 11.76 |
≤ 2 |
33.13 ± 14.7 |
≤ 3 |
40.32 ± 16.39 |
≤ 5 |
50.67 ± 17.24 |
≤ 10 |
64.42 ± 16.31 |
Figure 14. Daily percentage of users generating ≤1 (blue), ≤2 (orange), ≤3 (green), ≤5 (red), or ≤10 (purple) GPS observations. The shaded area denotes periods of missing data.
Most users are also spatially localized on a given day. On average, 73% of users visit at most one hex, and 87% visit two or fewer hexes. This pattern indicates that most daily activity reflects presence in a single location. From a QA perspective, this behavior implies that user-level mobility metrics are likely to be data-sparse unless substantial temporal or spatial aggregation is applied, or minimum-activity thresholds are enforced (Figure 15). In addition, these results indicate that trajectory-level analyses should filter users not only by the number of GPS points but also by the number of visited hexagons. Many users may generate multiple observations within the same location, which provides limited value for trajectory-based mobility analyses.
Number of Hexes per User |
Share of Users (%) |
|---|---|
≤ 1 |
73.02 ± 9.32 |
≤ 2 |
87.02 ± 8.21 |
Figure 15. Daily percentage of users visiting ≤1 (blue) or ≤2 (orange) unique H3 hexagons. The shaded area denotes periods of missing data.
5.4 Data Quality Implications#
User contributions follow a heavy-tailed distribution, meaning that averages are often dominated by a small subset of highly active users, while the behavior of low-activity users is under-represented. Where user-level comparability is required, analyses should rely on distribution-aware summaries (e.g., medians or quantiles) and stratify results by user activity level.
A large share of users generates sparse daily traces. On a typical day, about half of users record five or fewer GPS observations, and nearly two thirds record ten or fewer points. User activity is also highly localized, with approximately 73% of users visiting at most one hexagon per day and about 87% visiting no more than two. These patterns indicate that daily observations often reflect presence in a single location or simple movement behavior rather than multi-step trajectories.
From 10 July onward, there is a noticeable reduction in user-level activity. This change is consistent with a shift in the data-generation process and implies that user-level metrics are not directly comparable across periods.
Overall, these characteristics are consistent with passively collected mobility data. The dataset provides a robust signal of user presence and relative spatial activity, while detailed individual-level mobility metrics require substantial aggregation or filtering.
6. Temporal Data Quality Assessment#
This section examines the temporal dimension of the mobility dataset by analyzing daily activity levels, intra-day (hourly) patterns, and differences between weekdays and weekends. The analysis focuses on the evolution over time of the number of observations, active users, and visited spatial units, as well as on circadian mobility rhythms reflected in hourly distributions. Taken together, these dimensions provide insight into the temporal stability and representativeness of the dataset and support the identification of temporal gaps, irregular sampling patterns, and structural changes in recording intensity that may affect downstream analyses.
6.1 Total Points, Unique Users, and Visited Hexes per Day#
Figure 16 shows the daily evolution of GPS observations, active users, and visited hexagons. The time series reveals a clear structural break on 10 July 2023, after which the dataset transitions into distinct regimes characterized by different sampling properties.
Baseline regime (January - 9th July)
During the baseline period, the three indicators evolve consistently. The dataset records on average 5.9 million GPS points per day, 207k users, and approximately 869 visited hexes. Variability is high for points and users (CV ≈ 57.6% and 67.5%), while spatial coverage remains stable (CV ≈ 1.4%), indicating a consistent spatial exploration. Importantly, variations in observations and users tend to move together, suggesting a relatively stable sampling regime. In this phase, increases in the number of users are generally accompanied by proportional increases in recorded observations and spatial coverage, indicating consistent data collection intensity.
Sampling Shift (10th July).
A clear structural break is observed on 10 July 2023. On this date, both the number of GPS observations and active users increase abruptly relative to the baseline period, while the number of visited hexagons remains broadly stable. This sudden increase suggests a change in sampling intensity or user coverage instead of a sudden expansion in mobility. This date defines the boundary between regimes. Metrics before and after this point are not directly comparable without normalization.
T1) Post-shift regime (10th july - 31st December)
From 10 July onward, the dataset enters a regime with substantially higher activity levels. On average, this period records 21.6 million points and 1.33 million users per day, while the number of visited hexes remains comparable to the baseline. Relative to the average values in the baseline period, GPS observations increase by approximately +265% and the number of users by +538%, while the number of visited hexagons remains essentially unchanged (+1%). Variability is moderate for points and users (CV ≈ 47.1% and 26.3%) and low for hexes (CV ≈ 1.2%), indicating a stable spatial footprint despite the increase in volume.
T2) Spatial concentration regime (27th July - 20th August).
Within the post-shift period, we observe a phase of spatial concentration. During this interval, observations and users remain elevated (≈11.5 million points and 1.45 million users on average), while the number of visited hexes declines to approximately 826. At the same time, variability in points and users decreases (CV ≈ 33.3% and 25.1%), whereas spatial variability increases (CV ≈ 3%). This indicates that activity becomes more concentrated in a smaller set of locations, consistent with the spatial analysis in Section 4.
In conclusion, the first part of the year can be interpreted as a baseline regime. From 10 July onward, the dataset reflects a different sampling regime with higher activity levels and a phase of spatial concentration. As a result, temporal comparisons across these periods should be conducted with caution and require appropriate normalization or regime-specific analysis.
Regime |
Main Signal |
Primary Driver |
Usability |
CV Points (%) |
CV Users (%) |
CV Hexes (%) |
|---|---|---|---|---|---|---|
Baseline (Jan 2 - Jul 9) |
Stable co-movement across metrics |
Stable sampling regime |
High |
57.6 |
67.5 |
1.4 |
T1) Post-shift (Jul 10 - Dec) |
Higher volume, stable spatial footprint |
Increased sampling intensity / expanded user base |
Medium (requires normalization) |
47.1 |
26.3 |
1.2 |
T2) Spatial concentration (Jul 27 - Aug 20) |
Elevated activity concentrated in fewer hexes |
Sampling shift with spatial concentration |
Medium-Low (use with caution) |
33.3 |
25.1 |
3.0 |
Figure 16. Daily time series of total GPS observations (top), number of unique active users (middle), and number of visited spatial units at H3 resolution 8 (bottom). Shaded regions indicate data regimes: baseline period (no shade), Post-shift Period (T1; blue), and Spatial Concentration (T2; orange).
Relationships among Total GPS Points, Unique Users, and Unique Visited Hexes#
To evaluate how the relationships among total GPS observations, active users, and visited hexagons evolve over time, we compare their daily correlations across the baseline period, the post-shift regime (T1), and the spatial concentration phase (T2). The three periods form distinct clusters of points, each corresponding to a different data sampling regime (Figure 17).
During the baseline period (black points), the three dimensions exhibit a compact and consistent scaling structure. GPS observations increase strongly with the number of users (r ≈ 0.92) and with visited hexagons (r ≈ 0.68). The relationship between users and visited hexagons is moderate (r ≈ 0.55), suggesting as expected that additional users tend to concentrate within already active areas.
In the post-shift regime (T1) (blue points), the relationships become more dispersed as the number of points and users increases. While GPS observations still increase with the number of users (r ≈ 0.68), the association between users and visited hexagons becomes substantially weaker (r ≈ 0.25). This indicates that increases in user counts are less systematically associated with broader spatial coverage. The relationship between observations and visited hexagons also weakens (r ≈ 0.21), suggesting a decoupling between overall activity volume and spatial extent.
The spatial concentration phase (T2) (orange points) exhibits a distinct pattern. The relationship between users and observations remains moderate (r ≈ 0.61), but the association between visited hexagons and the other metrics is substantially weaker (r ≈ 0.31 with observations). This reflects a regime in which activity becomes concentrated in a smaller subset of locations, with high volumes of observations generated without a corresponding expansion in spatial coverage.
Figure 17. Scatter plots illustrating the daily relationships among total GPS observations, number of unique users, and number of visited hexagons. Points are color-coded by regime (baseline, T1, T2, T3) to highlight differences in scaling patterns across periods.
6.2 Circadian Rhythm: Hour-of-day Distributions#
The hour-of-day distributions of points, active users, and active spatial hexes exhibit a clear circadian pattern, as shown in Figure 18. Activity is lowest during nighttime hours, increases in the early morning, remains relatively stable from late morning through early evening, and gradually declines thereafter. The consistency of these daily profiles across indicators provides an important quality check, suggesting that, despite temporal fluctuations in data volume, the dataset captures realistic daily mobility rhythms [JYG+16].
Figure 18. Average hourly distribution of total GPS observations, unique users, and active hexagons across all available days in 2023 (local time). Values are aggregated across the full observation period.
6.3 Weekend vs. Weekday#
We also investigate how mobility and activity patterns differ between weekdays and weekends (Figure 19). Weekdays exhibit a higher volume of observations (on average, +15%) and a greater number of users (on average, +18%) compared to weekends. However, the number of visited areas remains largely consistent across weekdays and weekends, with only a negligible increase on weekdays (+0.14%).
Figure 19. Comparison of mean hourly GPS observations between weekdays and weekends, aggregated over the study period. Curves reflect differences in intra-day activity profiles.
We further examined whether weekday-weekend differences are reflected in the total number of observations, users, and visited hexes within the same week.
To assess this, for each week we compared the total number of observations, users, and hexes recorded on weekdays versus weekends and computed the relative percentage difference. The results indicate that the number of observations and users is significantly higher on weekdays than on weekends. Statistical tests confirm these differences (observations: Wilcoxon (\(p \approx 6.9 \cdot 10^{-4}\)), t-test (\(p \approx 6.4 \cdot 10^{-4}\)); users: Wilcoxon (\(p \approx 2.37 \cdot 10^{-6}\)), t-test (\(p \approx 2.57 \cdot 10^{-5}\))).
In contrast, the number of visited hexes does not show a statistically significant difference between weekdays and weekends (Wilcoxon (\(p \approx 0.566\)); t-test (\(p \approx 0.517\))), indicating that while activity levels and user participation vary across the week, the spatial coverage of visited areas remains largely consistent.
6.4 Data Quality Implications#
The temporal patterns observed in user counts, observation volumes, and spatial coverage highlight several important considerations for data quality and interpretation. In particular, variations in recorded activity are likely driven by changes in the data-generation process and not in changes in underlying mobility dynamics.
First, the structural break observed on 10 July indicates a shift in sampling intensity or user coverage. Both GPS observations and active users increase abruptly, while spatial coverage remains broadly stable. This pattern is unlikely to reflect a change in mobility behavior and instead suggests modifications in the data collection process (e.g., changes in application coverage or data ingestion). As a result, absolute activity levels before and after this transition are not directly comparable without normalization.
Second, during the spatial concentration phase, the number of visited hexagons declines while the number of active users remains high. This indicates that observations become concentrated in a smaller subset of locations rather than reflecting a reduction in overall activity. Metrics based on spatial coverage or per-area averages are therefore sensitive to this effect and should be interpreted with caution during this interval.
We recommend treating the September 18-30 as structurally missing and avoiding interpolation across this period, as it introduces a discontinuity that may bias longitudinal comparisons.
The dataset also reproduces well-established temporal mobility rhythms, showing a clear circadian pattern. In addition, weekday activity levels are higher than those observed on weekends as confirmed by statistical tests. These patterns indicate that the dataset captures realistic daily mobility dynamics.
In summary, robust temporal analysis should be conducted within a single data regime. Longitudinal analyses should either exclude anomalous periods or normalize metrics by active user counts to reduce sensitivity to variations in sampling intensity.
7. Data Quality Assessment Summary#
This assessment evaluated the temporal, spatial, and user-level properties of the mobility dataset to determine its suitability for constructing an Urban Space Usage Index.
Overall, the dataset provides broad coverage of Metro Manila and internally consistent aggregate signals. At the same time, it exhibits structural unevenness across time, space, and users that directly affects how indicators should be constructed and interpreted.
7.1 Temporal Dimension#
The dataset achieves high temporal coverage (96.62%) over 2023. Daily patterns of observations, active users, and visited spatial units remain broadly stable during most of the observation period, although the time series exhibits a structural break on 10 July, followed by a phase of increased spatial concentration of observations and a data interruption in September.
These changes are likely due to variations in the data-generation process and not caused by mobility shifts and therefore limit direct longitudinal comparability. Raw time-series comparisons across periods with different sampling characteristics should be avoided without normalization.
7.2 Spatial Dimension#
Spatial coverage across the Metro Manila study area is high (99.52% of H3 cells record at least one observation during the year). Activity, as expected, follows a heavy-tailed distribution, with a small fraction of spatial units accounting for a disproportionate share of total observations.
Temporal analyses reveal intervals during which observations become unusually concentrated in a limited number of locations, indicating changes in the spatial distribution of recorded activity. During these periods, a small subset of highly visited hexagons captures a large share of total observations.
These characteristics are expected from passively collected mobility data and do not invalidate the dataset, but spatial indicators should rely on distribution-aware statistics and that periods of extreme spatial concentration should be treated with caution in downstream analyses.
7.3 User Dimension#
User contributions are highly heterogeneous. A small subset of highly active users accounts for a large share of observations, while most users produce sparse and localized daily traces: approximately 50% of users generate five or fewer observations per day.
From 10 July onward, user activity becomes more homogeneous, with a noticeable flattening of user-level distributions and reduced variability. This pattern is consistent with a shift in the data-generation process, possibly reflecting changes in user composition and/or tracking intensity.
This structure limits the robustness of individual-level mobility metrics and trajectory-based indicators without substantial aggregation or filtering. While trajectory analyses remain possible when restricting the sample to users with sufficiently dense activity traces, aggregate presence-based measures at the spatial-unit level provide a more stable signal of activity and are therefore more appropriate for index construction.
7.4 Overall Assessment#
The dataset is suitable for constructing an Urban Space Usage Index when interpreted as a signal of relative presence and activity rather than as a comprehensive representation of individual mobility behavior.
Its principal strengths include near-complete spatial coverage across Metro Manila and internally consistent aggregate patterns within identified temporal regimes. Its primary limitations arise from time-varying sampling intensity, the presence of distinct data-generation regimes, periods of spatial concentration of activity, and user-level sparsity.
Accordingly, index construction should:
Rely on temporally stable regimes or apply normalization across regimes;
Interpret spatial patterns using distribution-aware metrics, particularly during periods of strong concentration;
Avoid reliance on individual-level trajectory metrics;
Treat any annual baseline with caution, ensuring that structural regime shifts are explicitly accounted for.
When these safeguards are implemented, the dataset provides a reliable foundation for high-frequency monitoring of urban space usage.
Suitable and Non-Suitable Applications#
Suitable Applications
The dataset is well-suited for:
Aggregate, presence-based indicators of urban space usage
Relative comparisons across locations within homogeneous temporal regimes
Urban analyses at hourly, daily, or coarser temporal resolution
Event- or shock-related analyses, provided anomalous periods are explicitly handled
Non-Suitable Applications
Without substantial aggregation or correction, the dataset is not well-suited for:
Individual-level mobility profiling or behavioral inference
Fine-grained trajectory reconstruction or detailed OD analysis
Interpretation of raw activity volumes as absolute measures of mobility or population presence
Longitudinal comparisons that assume stable sampling intensity over time
DO |
DON’T |
|---|---|
Aggregate and smooth temporally (e.g., weekly averages) and analyze within homogeneous data regimes (baseline vs post-shift). |
Do not interpret peaks or drops as real mobility changes without checking for regime shifts, sampling changes, or missing data. |
Interpret temporal trends jointly with diagnostic indicators (active users, points per user, users per hex, regime flags). |
Do not attribute peaks or drops as real mobility effects before verifying that they are not driven by tracking intensity, ingestion changes, or missing data periods. |
Use presence-based or relative metrics (for example, active users per hex, share of total activity, standardized change relative to a local baseline) to improve comparability. |
Do not treat raw point counts as absolute measures of mobility or population activity. |
Interpret spatial patterns using distribution-aware statistics (e.g., medians, quantiles) and account for concentration effects. |
Do not rely on averages alone when activity is highly concentrated in a small number of locations. |
For trajectory- or origin-destination-based analyses, enforce minimum data sufficiency thresholds (for example, minimum number of points per day or minimum active window) and report sensitivity to threshold choices. |
Do not compute individual-level mobility or origin-destination metrics on sparse traces, and do not apply aggressive filtering without explicitly assessing potential selection bias. |
8. References#
- 1
Shan Jiang, Yingxiang Yang, Siddharth Gupta, Daniele Veneziano, Shounak Athavale, and Marta C González. The timegeo modeling framework for urban mobility without travel surveys. Proceedings of the National Academy of Sciences, 113(37):E5370–E5378, 2016.
- 2
Marta C Gonzalez, Cesar A Hidalgo, and Albert-Laszlo Barabasi. Understanding individual human mobility patterns. nature, 453(7196):779–782, 2008.
- 3
Hugo Barbosa, Marc Barthelemy, Gourab Ghoshal, Charlotte R James, Maxime Lenormand, Thomas Louail, Ronaldo Menezes, José J Ramasco, Filippo Simini, and Marcello Tomasini. Human mobility: models and applications. Physics Reports, 734:1–74, 2018.
- 4
Adrian Dobra, Nathalie E Williams, and Nathan Eagle. Spatiotemporal detection of unusual human population behavior using mobile phone data. PloS one, 10(3):e0120449, 2015.
- 5
Takahiro Yabe, Nicholas KW Jones, Nancy Lozano-Gracia, Maham Faisal Khan, Satish V Ukkusuri, Samuel Fraiberger, and Aleister Montfort. Location data reveals disproportionate disaster impact amongst the poor: a case study of the 2017 puebla earthquake using mobilkit. arXiv preprint arXiv:2107.13590, 2021.
- 6
World Bank. World Development Report 2021 : Data for Better Lives. World Bank, 2021. License: CC BY 3.0 IGO. URL: http://hdl.handle.net/10986/35218.