Core Concepts
This guide explains the key concepts in Invariant's domain model. Understanding these concepts is essential for using the library effectively.
Catalog Hierarchy
Invariant organizes statistical data in a hierarchy: Study → Dataset → DataProduct → Variable.
Study
A Study is a data collection effort with a documented methodology. It is the top-level organizational unit.
Study(
    id=StudyId(...),
    name="South Africa Census 2021",
    owner_org="Statistics South Africa",
    methodology_summary="Full enumeration of all households",
)
A study may produce multiple datasets (e.g., demographic tables, housing tables, economic tables).
Dataset
A Dataset is a concrete table produced by a study at a specific level of aggregation.
Dataset(
    id=DatasetId(...),
    study_id=study.id,
    name="Census 2021 Demographics",
    reference_date=date(2021, 10, 10),
    universe_id=universe.id,  # Links to population definition
    reference_system_version_id=geo_version.id,  # Links to geography version
)
Key properties:
- reference_date: What point in time the data describes
- collection_start/end: When data was gathered
- universe_id: Which population this data represents
- reference_system_version_id: Which version of the unit system (geography, facilities, etc.)
DataProduct
A DataProduct is what users actually query. It defines the grain and available variables.
DataProduct(
    id=DataProductId(...),
    dataset_id=dataset.id,
    name="Population by Geography and Demographics",
    kind=DataProductKind.FACT,  # or INDICATOR
    grain=GrainSpec(keys=[geo_var.id, age_var.id, sex_var.id]),
    variables=[geo_var, age_var, sex_var, population_var],
)
Two kinds:
- FACT: Contains measures (counts, sums) that can be aggregated
- INDICATOR: Contains derived values (rates, percentages) with special aggregation rules
Variable
A Variable is a column in a data product with semantic meaning.
Variable(
    id=VariableId(...),
    data_product_id=dp.id,
    name="population",
    role=VariableRole.MEASURE,  # or DIMENSION, INDICATOR
    data_type=DataType.INT,
    unit="persons",
)
Three roles:
- DIMENSION: Used to slice and filter (geography, age_group, sex)
- MEASURE: Additive facts that can be summed (population, count)
- INDICATOR: Derived values that need special handling (rate, percentage)
Variable Types
Understanding variable types is critical because they determine what operations are valid.
Dimensions
Dimensions are classificatory variables used to slice data.
| Property | Example |
|---|---|
| Finite domain | sex ∈ {Male, Female} |
| Not additive | You can't sum "Male + Female" to get a number |
| Often hierarchical | Country → Province → District |
Dimensions go in GROUP BY clauses.
Measures
Measures are additive facts you can sum.
| Property | Example |
|---|---|
| Numeric | population = 1234567 |
| Meaningful under SUM | Total population = sum of all cells |
| May support MIN/MAX | Smallest population in a ward |
Measures are safe to aggregate across dimensions.
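As a minimal illustration of dimensions acting as grouping keys and measures as summed values, the sketch below uses plain Python rather than Invariant's API. The rows, the figures, and the `sum_by` helper are all invented for this example.

```python
from collections import defaultdict

# Hypothetical rows: (province, sex, population). Figures are invented.
rows = [
    ("Gauteng", "Male", 120),
    ("Gauteng", "Female", 130),
    ("Limpopo", "Male", 40),
    ("Limpopo", "Female", 60),
]


def sum_by(rows, key_index, measure_index=2):
    """Group rows by one dimension column and sum the measure column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[measure_index]
    return dict(totals)


by_province = sum_by(rows, key_index=0)  # dimension: province
by_sex = sum_by(rows, key_index=1)       # dimension: sex
```

Because the measure is additive, the grand total is the same whichever dimension the rows are grouped by.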
Indicators
Indicators are derived values that cannot be naively aggregated.
| Property | Example |
|---|---|
| Derived from measures | unemployment_rate = unemployed / labour_force |
| NOT additive | You can't sum 30% + 40% to get 70% |
| Require recomputation | To aggregate, sum numerator and denominator separately, then divide |
This is where Invariant adds the most value. It prevents users from accidentally averaging percentages or summing rates.
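To make the "Require recomputation" row concrete, here is a small sketch in plain Python (not Invariant's API) that compares averaging two district rates with recomputing the rate from summed components. The district figures and the `rate` helper are made up for illustration.

```python
# Hypothetical district-level figures (invented for illustration).
districts = [
    {"unemployed": 30, "labour_force": 100},    # 30% rate
    {"unemployed": 400, "labour_force": 1000},  # 40% rate
]


def rate(unemployed, labour_force):
    """Unemployment rate as a percentage."""
    return 100 * unemployed / labour_force


# Naive: averaging the per-district rates ignores district size.
naive_average = sum(
    rate(d["unemployed"], d["labour_force"]) for d in districts
) / len(districts)

# Recompute: sum numerator and denominator separately, then divide.
total_unemployed = sum(d["unemployed"] for d in districts)
total_labour_force = sum(d["labour_force"] for d in districts)
recomputed = rate(total_unemployed, total_labour_force)
```

The naive average gives 35%, while the recomputed rate is 430 / 1100, roughly 39.1%, because the larger district carries more weight.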
Universes
A Universe defines the population to which data values apply.
Universe(
    id=UniverseId(...),
    label="All Residents",
    definition="All persons resident in South Africa at the time of enumeration",
    inclusions=["Citizens", "Permanent residents", "Temporary residents present on census night"],
    exclusions=["Diplomatic personnel", "Foreign military"],
)
Universes matter because:
- Datasets with different universes may not be comparable
- "Unemployment rate among all persons" ≠ "Unemployment rate among working-age persons"
- Cross-dataset operations check universe compatibility
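The effect of the universe on a rate can be shown with a few invented numbers. The sketch below is plain Python, and `same_universe` is a hypothetical helper that assumes the simplest possible compatibility rule (matching IDs); the source does not say how Invariant actually decides compatibility.

```python
# Invented figures: the same numerator measured against two different universes.
unemployed = 50
all_persons = 1000
working_age_persons = 500

rate_all = 100 * unemployed / all_persons                  # universe: all persons
rate_working_age = 100 * unemployed / working_age_persons  # universe: working-age persons


def same_universe(universe_id_a, universe_id_b):
    """Assumed rule for illustration: two datasets are comparable only if IDs match."""
    return universe_id_a == universe_id_b
```

The same 50 unemployed persons yield a 5% rate in one universe and a 10% rate in the other, which is why comparing the two figures directly would mislead.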
Reference Systems
A Reference System is any set of units you can group data by. Geography is the most common, but the concept is broader.
What Counts as a Reference System
| Type | Examples |
|---|---|
| Geography | Provinces, districts, wards |
| Facilities | Hospitals, clinics, schools |
| Organizations | Departments, agencies |
| Programs | Social programs, funding streams |
Why Reference Systems Matter
Reference systems change over time:
- Geographic boundaries are redrawn
- Facilities open and close
- Codes get reassigned
Invariant tracks versions of reference systems and supports crosswalks for mapping between versions.
ReferenceSystemVersion(
    id=ReferenceSystemVersionId(...),
    reference_system_id=geo_system.id,
    label="2021 Boundaries",
    valid_from=date(2021, 1, 1),
)

Crosswalk(
    id=CrosswalkId(...),
    from_version_id=geo_2011.id,
    to_version_id=geo_2021.id,
    method=CrosswalkMethod.AREA_WEIGHTED,
)
When comparing data across versions, the kernel checks whether a crosswalk exists and, if so, how to apply it.
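For intuition, an area-weighted crosswalk can be pictured as redistributing each old unit's value across new units in proportion to overlapping area. The sketch below is plain Python under that assumption; the unit names, weights, and `apply_crosswalk` helper are hypothetical and do not reflect how Invariant stores or applies crosswalks.

```python
from collections import defaultdict

# Hypothetical weights: share of each 2011 unit's area that falls in each 2021 unit.
# The weights for each source unit sum to 1.
weights = {
    "old_A": {"new_X": 1.0},
    "old_B": {"new_X": 0.25, "new_Y": 0.75},
}

# Invented 2011 population counts per old unit.
population_2011 = {"old_A": 1000, "old_B": 400}


def apply_crosswalk(values, weights):
    """Redistribute values from source units to target units by weight."""
    result = defaultdict(float)
    for source_unit, value in values.items():
        for target_unit, share in weights[source_unit].items():
            result[target_unit] += value * share
    return dict(result)


population_on_2021_boundaries = apply_crosswalk(population_2011, weights)
```

This kind of redistribution only makes sense for additive measures; the total (1400 here) is preserved because each source unit's weights sum to 1.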
Indicator Definitions
An IndicatorDefinition specifies how an indicator variable is computed and how it can be aggregated.
IndicatorDefinition(
    variable_id=unemployment_rate_var.id,
    indicator_type=IndicatorType.PERCENT,
    aggregation_policy=AggregationPolicy.RECOMPUTE,
    numerator_ref=VariableRef(dp_id, unemployed_var.id),
    denominator_ref=VariableRef(dp_id, labour_force_var.id),
    formula="unemployed / labour_force * 100",
)
Aggregation Policies
| Policy | Meaning | Example |
|---|---|---|
| NOT_AGGREGATABLE | Cannot be aggregated at all | Gini coefficient |
| RECOMPUTE | Can aggregate by recomputing from components | Unemployment rate |
| ALLOW_LIST | Only specific aggregations allowed | MIN/MAX for thresholds |
When a query tries to SUM or AVG an indicator:
1. Kernel checks the aggregation_policy
2. If RECOMPUTE, it can substitute the computation using numerator/denominator
3. If NOT_AGGREGATABLE, it blocks the query
4. If ALLOW_LIST, it checks if the requested aggregation is allowed
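The four steps above can be sketched as a small dispatch function. This is an illustration in plain Python, not the kernel's code: the enum is a local stand-in that mirrors the policy names, the `decide` function and its return strings are hypothetical, and blocking a request that is missing from the allow list is an assumption.

```python
from enum import Enum


class AggregationPolicy(Enum):
    # Local stand-in mirroring the policy names above; not imported from Invariant.
    NOT_AGGREGATABLE = "not_aggregatable"
    RECOMPUTE = "recompute"
    ALLOW_LIST = "allow_list"


def decide(policy, requested, allowed=frozenset()):
    """Return a hypothetical decision for an aggregation requested on an indicator."""
    if policy is AggregationPolicy.RECOMPUTE:
        # Substitute a recomputation from numerator and denominator.
        return "recompute"
    if policy is AggregationPolicy.NOT_AGGREGATABLE:
        return "block"
    if policy is AggregationPolicy.ALLOW_LIST:
        # Assumption: anything not explicitly listed is blocked.
        return "allow" if requested in allowed else "block"
    raise ValueError(f"unknown policy: {policy!r}")
```

For example, `decide(AggregationPolicy.ALLOW_LIST, "MAX", allowed={"MIN", "MAX"})` returns `"allow"`, while the same policy with `"SUM"` returns `"block"`.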
The Validation Gate
Every query passes through a validation gate that produces one of four statuses:
| Status | Meaning | User Experience |
|---|---|---|
| ALLOW | Query is valid | Execute immediately |
| WARN | Valid but with caveats | Execute with visible disclaimer |
| REQUIRE_ACK | Valid after user acknowledges risk | Modal: "You're comparing unlike things" |
| BLOCK | Invalid, cannot execute | Error with remediation suggestions |
The gate checks:
- Aggregation rules (can this variable be summed?)
- Comparability (are these datasets comparable?)
- Reference system alignment (same geography version?)
- Universe compatibility (same population?)
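One plausible way to reduce several check results to a single gate status is to take the most severe one. The source does not state how Invariant combines checks, so the severity ordering, the `Status` enum, and the `combine` helper below are all assumptions made for this sketch.

```python
from enum import IntEnum


class Status(IntEnum):
    # Local stand-in for the four gate statuses, ordered by assumed severity.
    ALLOW = 0
    WARN = 1
    REQUIRE_ACK = 2
    BLOCK = 3


def combine(statuses):
    """Return the most severe status; with no checks at all, default to ALLOW."""
    return max(statuses, default=Status.ALLOW)


# Hypothetical per-check results for a single query.
check_results = [Status.ALLOW, Status.WARN, Status.REQUIRE_ACK]
overall = combine(check_results)
```

Under this rule a single BLOCK from any check is enough to stop the query, regardless of what the other checks return.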
Disclosures
Every query result can include disclosures — messages about data provenance and transformations.
Disclosure(
    disclosure_type="data_source",
    text="Census 2021, Statistics South Africa",
)

Disclosure(
    disclosure_type="crosswalk",
    text="2011 data redistributed to 2021 boundaries using area-weighted interpolation",
)

Disclosure(
    disclosure_type="suppression",
    text="3 cells suppressed per census-standard policy",
)
Disclosures are accumulated through the query lifecycle and returned with results. They enable transparency about what happened to the data.
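Accumulation through the lifecycle might look like the following sketch, in which each stage appends to a shared list that is returned alongside the rows. The `Disclosure` dataclass here is a stand-in with only the two fields shown above, and `QueryContext` and the result dictionary are hypothetical names, not Invariant's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Disclosure:
    # Stand-in with the two fields shown above; not Invariant's class.
    disclosure_type: str
    text: str


@dataclass
class QueryContext:
    # Hypothetical container carried through the query lifecycle.
    disclosures: list = field(default_factory=list)

    def disclose(self, disclosure_type, text):
        """Record a disclosure in the order it was produced."""
        self.disclosures.append(Disclosure(disclosure_type, text))


ctx = QueryContext()
ctx.disclose("data_source", "Census 2021, Statistics South Africa")
ctx.disclose(
    "crosswalk",
    "2011 data redistributed to 2021 boundaries using area-weighted interpolation",
)

# Disclosures travel with the result rather than being logged separately.
result = {"rows": [], "disclosures": list(ctx.disclosures)}
```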
Query Plans
A QueryPlan is the normalized representation of a query that the kernel validates and executes.
QueryPlan(
    query_id="q-abc123",
    intent=QueryIntent.TABLE,
    operations=[
        SelectOp(
            data_product_id=dp.id,
            dimension_ids=[geo_var.id, sex_var.id],
            metrics=[Metric(population_var.id, AggregationType.SUM)],
            filters=[],
            group_by_ids=[geo_var.id, sex_var.id],
        )
    ],
    presentation=PresentationSpec(format=PresentationFormat.TABLE),
)
Query plans use VariableId references (not names) for stability when variables are renamed.
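The benefit of ID-based references can be seen in a toy example. The catalog dictionary, the ID string, and the two stored plans below are invented for illustration and are much simpler than real QueryPlan objects.

```python
# Hypothetical catalog mapping stable variable IDs to current names.
catalog_by_id = {"var-001": "population"}

# Two ways a stored plan could refer to the same variable.
plan_by_id = ["var-001"]
plan_by_name = ["population"]

# The variable is renamed after the plans were saved.
catalog_by_id["var-001"] = "total_population"

# The ID-based plan still resolves; the name-based plan no longer matches anything.
names_now = set(catalog_by_id.values())
resolved_by_id = [catalog_by_id[variable_id] for variable_id in plan_by_id]
dangling_names = [name for name in plan_by_name if name not in names_now]
```

After the rename, `resolved_by_id` picks up the new name automatically, while the name-based plan is left holding a reference to a name that no longer exists.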
Summary
| Concept | Purpose |
|---|---|
| Study | Organizational unit for data collection efforts |
| Dataset | Concrete table with temporal and universe metadata |
| DataProduct | Queryable unit with grain and variables |
| Variable | Column with semantic role (dimension/measure/indicator) |
| Universe | Population definition for comparability |
| ReferenceSystem | Unit system (geography, facilities) with versioning |
| IndicatorDefinition | How derived values are computed and aggregated |
| ValidationResult | Gate output with status and disclosures |
| Disclosure | Provenance information attached to results |
For formal definitions of all terms, see the generated glossary.