Example: Cross-dataset Comparison
Comparing datasets that describe different populations.
Scenario
What someone tries to do:
- Compare employment statistics from two surveys
- Chart trends across different census releases
What they expect:
- Direct comparison of values from both datasets
Why it's wrong (or risky)
Every dataset describes a universe: the population it covers. Values computed over incompatible universes use different denominators, so comparing them directly produces misleading results.
Example:
| Dataset | Universe | Employment Rate |
|---|---|---|
| Survey A | All adults (18+) | 65% |
| Survey B | Working-age adults (18-64) | 72% |
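These rates are not directly comparable: Survey B excludes adults aged 65 and over, a group that includes most retirees, so its rate is computed over a smaller population with a higher share of workers.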
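To see how much the universe alone can move the number, here is a small worked example. The counts are invented for illustration and do not come from either survey; they are chosen so that a single population yields both rates in the table.

```python
# Hypothetical counts, invented for illustration only.
population = {
    "18-64": {"people": 1000, "employed": 720},
    "65+": {"people": 200, "employed": 60},
}


def employment_rate(age_groups):
    """Employment rate over the given age groups (the universe)."""
    people = sum(population[g]["people"] for g in age_groups)
    employed = sum(population[g]["employed"] for g in age_groups)
    return employed / people


# Same population, two different universes.
rate_all_adults = employment_rate(["18-64", "65+"])  # Survey A's universe
rate_working_age = employment_rate(["18-64"])        # Survey B's universe

print(f"All adults (18+):    {rate_all_adults:.0%}")   # 65%
print(f"Working-age (18-64): {rate_working_age:.0%}")  # 72%
```

Under these assumed counts, the seven-point gap comes entirely from the choice of denominator and says nothing about any real difference in employment.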
What Invariant detects
- Claim violated: Datasets have different universe definitions
- Evidence: Universe mismatch between `survey_a` and `survey_b`
- Rule: `UniverseComparabilityRule`
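The shape of such a finding can be sketched in a few lines. This is not Invariant's implementation or API: the `Universe` and `Finding` classes and the `check_universe_comparability` function are hypothetical names, and modelling a universe as an age range is an assumption made to match the example table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Universe:
    """Hypothetical universe description: an inclusive age range."""
    label: str
    min_age: int
    max_age: Optional[int] = None  # None means no upper bound


@dataclass(frozen=True)
class Finding:
    """Hypothetical record mirroring the claim / evidence / rule fields."""
    rule: str
    claim: str
    evidence: str


def check_universe_comparability(name_a, universe_a, name_b, universe_b):
    """Return a Finding when the two universes differ, otherwise None."""
    bounds_a = (universe_a.min_age, universe_a.max_age)
    bounds_b = (universe_b.min_age, universe_b.max_age)
    if bounds_a == bounds_b:
        return None
    return Finding(
        rule="UniverseComparabilityRule",
        claim="Datasets have different universe definitions",
        evidence=(
            f"Universe mismatch between {name_a} ({universe_a.label}) "
            f"and {name_b} ({universe_b.label})"
        ),
    )


all_adults = Universe("All adults (18+)", min_age=18)
working_age = Universe("Working-age adults (18-64)", min_age=18, max_age=64)

finding = check_universe_comparability(
    "survey_a", all_adults, "survey_b", working_age
)
print(finding)
```

The sketch only reports that the universes differ. How a real check decides between the outcomes described next is not specified here.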
Acknowledge required
Datasets have different universes. Comparison requires explicit acknowledgment.
Warn
For partial comparability, results include disclosure about universe differences.
Typical remediations
- Acknowledge the difference — Accept the comparison with disclosed caveats
- Filter to common universe — Restrict both datasets to comparable populations
- Use separate visualizations — Show datasets side-by-side with clear labels
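Filtering to a common universe could look like the sketch below. It assumes respondent-level records are available for the broader survey, which is not always the case, and the records and helper names are hypothetical.

```python
# Hypothetical respondent-level records for the broader survey (18+).
survey_a_records = [
    {"age": 25, "employed": True},
    {"age": 40, "employed": True},
    {"age": 58, "employed": False},
    {"age": 63, "employed": True},
    {"age": 70, "employed": False},
    {"age": 82, "employed": False},
]


def restrict_to_age_range(records, min_age, max_age):
    """Keep only respondents inside the inclusive age range."""
    return [r for r in records if min_age <= r["age"] <= max_age]


def employment_rate_from_records(records):
    """Share of respondents who are employed."""
    if not records:
        raise ValueError("no records left in the chosen universe")
    return sum(r["employed"] for r in records) / len(records)


# Restrict Survey A to Survey B's universe (18-64) before comparing.
common_universe = restrict_to_age_range(survey_a_records, 18, 64)

rate_full = employment_rate_from_records(survey_a_records)
rate_common = employment_rate_from_records(common_universe)

print(f"Survey A, all adults: {rate_full:.0%}")    # 50%
print(f"Survey A, ages 18-64: {rate_common:.0%}")  # 75%
```

Only the restricted rate shares a universe with Survey B, so it is the one to put next to Survey B's figure.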
What to do next
- Concepts: Universe — What universes are and why they matter
- Integration: Query Lifecycle — How acknowledgment flows work