Core Concepts

This guide explains the key concepts in Invariant's domain model. Understanding these concepts is essential for using the library effectively.

Catalog Hierarchy

Invariant organizes statistical data in a hierarchy:

Study
  └── Dataset
        └── DataProduct
              └── Variable

Study

A Study is a data collection effort together with its methodology. It is the top-level organizational unit.

Study(
    id=StudyId(...),
    name="South Africa Census 2021",
    owner_org="Statistics South Africa",
    methodology_summary="Full enumeration of all households",
)

A study may produce multiple datasets (e.g., demographic tables, housing tables, economic tables).

Dataset

A Dataset is a concrete table produced by a study at a specific level of aggregation.

Dataset(
    id=DatasetId(...),
    study_id=study.id,
    name="Census 2021 Demographics",
    reference_date=date(2021, 10, 10),
    universe_id=universe.id,  # Links to population definition
    reference_system_version_id=geo_version.id,  # Links to geography version
)

Key properties:
- reference_date: the point in time the data describes
- collection_start/end: when the data was gathered
- universe_id: which population the data represents
- reference_system_version_id: which version of the unit system (geography, facilities, etc.) applies

DataProduct

A DataProduct is what users actually query. It defines the grain and available variables.

DataProduct(
    id=DataProductId(...),
    dataset_id=dataset.id,
    name="Population by Geography and Demographics",
    kind=DataProductKind.FACT,  # or INDICATOR
    grain=GrainSpec(keys=[geo_var.id, age_var.id, sex_var.id]),
    variables=[geo_var, age_var, sex_var, population_var],
)

Two kinds:
- FACT: contains measures (counts, sums) that can be aggregated
- INDICATOR: contains derived values (rates, percentages) with special aggregation rules
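The grain can be read as a uniqueness rule: each combination of the grain keys should identify at most one row. The source does not spell this out, so treat it as an assumption. The sketch below is plain Python, not Invariant's API; `grain_violations` and the sample rows are hypothetical.

```python
from collections import Counter

def grain_violations(rows, grain_keys):
    """Return the key combinations that occur in more than one row."""
    counts = Counter(tuple(row[k] for k in grain_keys) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Invented rows at the intended grain (geo, age, sex).
rows = [
    {"geo": "GP", "age": "0-14", "sex": "F", "population": 10},
    {"geo": "GP", "age": "0-14", "sex": "M", "population": 12},
    {"geo": "GP", "age": "0-14", "sex": "F", "population": 11},  # repeats a key
]

violations = grain_violations(rows, ["geo", "age", "sex"])
```

Here `violations` holds the one repeated key combination, which signals that the rows do not match the declared grain.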

Variable

A Variable is a column in a data product with semantic meaning.

Variable(
    id=VariableId(...),
    data_product_id=dp.id,
    name="population",
    role=VariableRole.MEASURE,  # or DIMENSION, INDICATOR
    data_type=DataType.INT,
    unit="persons",
)

Three roles:
- DIMENSION: used to slice and filter (geography, age_group, sex)
- MEASURE: additive facts that can be summed (population, count)
- INDICATOR: derived values that need special handling (rate, percentage)

Variable Types

Understanding variable types is critical because they determine what operations are valid.

Dimensions

Dimensions are classificatory variables used to slice data.

Property	Example
Finite domain	`sex` ∈ {Male, Female}
Not additive	You can't sum "Male + Female" to get a number
Often hierarchical	Country → Province → District

Dimensions go in GROUP BY clauses.

Measures

Measures are additive facts you can sum.

Property	Example
Numeric	`population = 1234567`
Meaningful under SUM	Total population = sum of all cells
May support MIN/MAX	Smallest population in a ward

Measures are safe to aggregate across dimensions.
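As a small illustration of summing a measure across dimensions, the sketch below groups rows by the dimensions to keep and sums over the rest. It is plain Python with invented figures; `sum_measure` is a hypothetical helper, not part of Invariant.

```python
from collections import defaultdict

# Invented rows at grain (province, sex).
rows = [
    {"province": "Gauteng", "sex": "Male", "population": 100},
    {"province": "Gauteng", "sex": "Female", "population": 110},
    {"province": "Limpopo", "sex": "Male", "population": 40},
    {"province": "Limpopo", "sex": "Female", "population": 50},
]

def sum_measure(rows, group_by, measure):
    """Sum `measure` over every dimension not listed in `group_by`."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dim] for dim in group_by)
        totals[key] += row[measure]
    return dict(totals)

by_province = sum_measure(rows, ["province"], "population")  # collapses sex
grand_total = sum_measure(rows, [], "population")            # collapses everything
```

The provincial totals add up to the grand total, which is the property that makes a measure additive.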

Indicators

Indicators are derived values that cannot be naively aggregated.

Property	Example
Derived from measures	`unemployment_rate = unemployed / labour_force`
NOT additive	You can't sum 30% + 40% to get 70%
Require recomputation	To aggregate, sum numerator and denominator separately, then divide

This is where Invariant adds the most value. It prevents users from accidentally averaging percentages or summing rates.
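To see why recomputation matters, the sketch below compares a naive average of two district rates with the rate recomputed from summed components. The figures are invented and the code is plain Python rather than Invariant's API.

```python
# Invented district-level counts with very different labour force sizes.
districts = [
    {"unemployed": 30, "labour_force": 100},    # 30% on its own
    {"unemployed": 400, "labour_force": 1000},  # 40% on its own
]

def rate(unemployed, labour_force):
    """Unemployment rate as a percentage."""
    return unemployed / labour_force * 100

# Naive: average the two percentages, ignoring district size.
naive_average = sum(
    rate(d["unemployed"], d["labour_force"]) for d in districts
) / len(districts)

# Recompute: sum numerator and denominator separately, then divide.
recomputed = rate(
    sum(d["unemployed"] for d in districts),
    sum(d["labour_force"] for d in districts),
)
```

The naive average lands at 35%, while the recomputed rate is 430 / 1100, roughly 39.1%, because the larger district carries more weight.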

Universes

A Universe defines the population to which data values apply.

Universe(
    id=UniverseId(...),
    label="All Residents",
    definition="All persons resident in South Africa at the time of enumeration",
    inclusions=["Citizens", "Permanent residents", "Temporary residents present on census night"],
    exclusions=["Diplomatic personnel", "Foreign military"],
)

Universes matter because:
- Datasets with different universes may not be comparable
- "Unemployment rate among all persons" ≠ "Unemployment rate among working-age persons"
- Cross-dataset operations check universe compatibility

Reference Systems

A Reference System is any set of units you can group data by. Geography is the most common, but the concept is broader.

What Counts as a Reference System

Type	Examples
Geography	Provinces, districts, wards
Facilities	Hospitals, clinics, schools
Organizations	Departments, agencies
Programs	Social programs, funding streams

Why Reference Systems Matter

Reference systems change over time:
- Geographic boundaries are redrawn
- Facilities open and close
- Codes get reassigned

Invariant tracks versions of reference systems and supports crosswalks for mapping between versions.

ReferenceSystemVersion(
    id=ReferenceSystemVersionId(...),
    reference_system_id=geo_system.id,
    label="2021 Boundaries",
    valid_from=date(2021, 1, 1),
)

Crosswalk(
    id=CrosswalkId(...),
    from_version_id=geo_2011.id,
    to_version_id=geo_2021.id,
    method=CrosswalkMethod.AREA_WEIGHTED,
)

When a query compares data across versions, the kernel checks whether a crosswalk exists and determines how to apply it.

Indicator Definitions
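A minimal sketch of what an area-weighted crosswalk does to an additive measure, assuming the crosswalk can be expressed as shares of each source unit's area falling in each target unit. The unit names, weights, and `apply_crosswalk` helper are hypothetical; Invariant's own crosswalk representation may differ.

```python
# Hypothetical weights: share of each 2011 unit's area inside each 2021 unit.
# Shares for one source unit sum to 1.
weights = {
    "ward_A_2011": {"ward_X_2021": 0.75, "ward_Y_2021": 0.25},
    "ward_B_2011": {"ward_Y_2021": 1.0},
}

# Invented 2011 populations.
population_2011 = {"ward_A_2011": 1000, "ward_B_2011": 400}

def apply_crosswalk(values, weights):
    """Redistribute additive values from source units to target units."""
    out = {}
    for source_unit, value in values.items():
        for target_unit, share in weights[source_unit].items():
            out[target_unit] = out.get(target_unit, 0.0) + value * share
    return out

population_on_2021_boundaries = apply_crosswalk(population_2011, weights)
```

Because each source unit's shares sum to 1, the overall total is preserved. Area weighting assumes the measure is spread evenly within each source unit, and it applies only to additive measures; indicators would need their components redistributed and then recomputed.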

An IndicatorDefinition specifies how an indicator variable is computed and can be aggregated.

IndicatorDefinition(
    variable_id=unemployment_rate_var.id,
    indicator_type=IndicatorType.PERCENT,
    aggregation_policy=AggregationPolicy.RECOMPUTE,
    numerator_ref=VariableRef(dp_id, unemployed_var.id),
    denominator_ref=VariableRef(dp_id, labour_force_var.id),
    formula="unemployed / labour_force * 100",
)

Aggregation Policies

Policy	Meaning	Example
`NOT_AGGREGATABLE`	Cannot be aggregated at all	Gini coefficient
`RECOMPUTE`	Can aggregate by recomputing from components	Unemployment rate
`ALLOW_LIST`	Only specific aggregations allowed	MIN/MAX for thresholds

When a query tries to SUM or AVG an indicator:
1. The kernel checks the aggregation_policy
2. If RECOMPUTE, it can substitute a computation based on the numerator and denominator
3. If NOT_AGGREGATABLE, it blocks the query
4. If ALLOW_LIST, it checks whether the requested aggregation is allowed

The Validation Gate
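The policy check above can be sketched as a small dispatch function. The enum here is a local stand-in that mirrors the policy names in the table, and the `check_indicator_aggregation` function, its return values, and the `allowed` parameter are hypothetical rather than Invariant's actual interface.

```python
from enum import Enum

class AggregationPolicy(Enum):
    """Local stand-in mirroring the policy names in the table above."""
    NOT_AGGREGATABLE = "not_aggregatable"
    RECOMPUTE = "recompute"
    ALLOW_LIST = "allow_list"

def check_indicator_aggregation(policy, requested, allowed=frozenset()):
    """Return a (decision, reason) pair for aggregating an indicator."""
    if policy is AggregationPolicy.NOT_AGGREGATABLE:
        return ("block", "indicator cannot be aggregated")
    if policy is AggregationPolicy.RECOMPUTE:
        return ("recompute", "aggregate numerator and denominator, then divide")
    if policy is AggregationPolicy.ALLOW_LIST:
        if requested in allowed:
            return ("allow", f"{requested} is on the allow list")
        return ("block", f"{requested} is not on the allow list")
    raise ValueError(f"unknown policy: {policy!r}")

decision, reason = check_indicator_aggregation(
    AggregationPolicy.ALLOW_LIST, "SUM", allowed={"MIN", "MAX"}
)
```

In this example SUM is rejected because only MIN and MAX are on the allow list.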

Every query passes through a validation gate that produces one of four statuses:

Status	Meaning	User Experience
`ALLOW`	Query is valid	Execute immediately
`WARN`	Valid but with caveats	Execute with visible disclaimer
`REQUIRE_ACK`	Valid after user acknowledges risk	Modal: "You're comparing unlike things"
`BLOCK`	Invalid, cannot execute	Error with remediation suggestions

The gate checks:
- Aggregation rules (can this variable be summed?)
- Comparability (are these datasets comparable?)
- Reference system alignment (same geography version?)
- Universe compatibility (same population?)

Disclosures
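One plausible way to turn several check results into a single gate status is to take the most severe one. The source does not say how Invariant combines them, so this ordering rule is an assumption, and `Status` and `combine` are hypothetical names.

```python
from enum import IntEnum

class Status(IntEnum):
    """Gate statuses, ordered from least to most severe (assumed ordering)."""
    ALLOW = 0
    WARN = 1
    REQUIRE_ACK = 2
    BLOCK = 3

def combine(statuses):
    """Overall status: the most severe individual result, ALLOW if none."""
    return max(statuses, default=Status.ALLOW)

# Invented results for the four checks listed above.
checks = {
    "aggregation": Status.ALLOW,
    "comparability": Status.WARN,
    "reference_system": Status.REQUIRE_ACK,
    "universe": Status.ALLOW,
}

overall = combine(checks.values())
```

With these inputs the overall status is `REQUIRE_ACK`, since one check demands acknowledgement and none blocks.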

Every query result can include disclosures — messages about data provenance and transformations.

Disclosure(
    disclosure_type="data_source",
    text="Census 2021, Statistics South Africa",
)

Disclosure(
    disclosure_type="crosswalk",
    text="2011 data redistributed to 2021 boundaries using area-weighted interpolation",
)

Disclosure(
    disclosure_type="suppression",
    text="3 cells suppressed per census-standard policy",
)

Disclosures are accumulated through the query lifecycle and returned with results. They enable transparency about what happened to the data.
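A minimal sketch of accumulating disclosures as a query moves through its stages. The `Disclosure` dataclass is a local stand-in with the two fields shown above; `QueryContext`, `suppress_small_cells`, and the threshold are hypothetical and do not describe Invariant's suppression policy.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Disclosure:
    disclosure_type: str
    text: str

@dataclass
class QueryContext:
    """Carries disclosures from stage to stage."""
    disclosures: list = field(default_factory=list)

    def disclose(self, disclosure_type, text):
        self.disclosures.append(Disclosure(disclosure_type, text))

def suppress_small_cells(cells, threshold, ctx):
    """Blank out cells below `threshold` and record how many were hidden."""
    kept = {k: (v if v >= threshold else None) for k, v in cells.items()}
    hidden = sum(1 for v in kept.values() if v is None)
    if hidden:
        ctx.disclose("suppression", f"{hidden} cells suppressed (threshold {threshold})")
    return kept

ctx = QueryContext()
ctx.disclose("data_source", "Census 2021, Statistics South Africa")
result = suppress_small_cells({"a": 3, "b": 50, "c": 1}, 5, ctx)
```

After both stages, `ctx.disclosures` holds the data-source note followed by the suppression note, ready to be returned alongside `result`.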

Query Plans

A QueryPlan is the normalized representation of a query that the kernel validates and executes.

QueryPlan(
    query_id="q-abc123",
    intent=QueryIntent.TABLE,
    operations=[
        SelectOp(
            data_product_id=dp.id,
            dimension_ids=[geo_var.id, sex_var.id],
            metrics=[Metric(population_var.id, AggregationType.SUM)],
            filters=[],
            group_by_ids=[geo_var.id, sex_var.id],
        )
    ],
    presentation=PresentationSpec(format=PresentationFormat.TABLE),
)

Query plans use VariableId references (not names) for stability when variables are renamed.
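The benefit of ID-based references can be shown with a toy catalog: a saved plan that stores IDs still resolves after a rename. The IDs, the dictionary catalog, and the tuple-based plan are invented for illustration and are much simpler than a real QueryPlan.

```python
# Hypothetical catalog mapping stable variable IDs to display names.
catalog = {"var-001": "population"}

# A saved plan stores the ID, not the name.
plan_metrics = [("var-001", "SUM")]

# The variable is later renamed; its ID does not change.
catalog["var-001"] = "total_population"

# The saved plan still resolves, now under the new name.
resolved = [(catalog[var_id], agg) for var_id, agg in plan_metrics]
```

Had the plan stored the name "population", the lookup would have failed after the rename.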

Summary

Concept	Purpose
Study	Organizational unit for data collection efforts
Dataset	Concrete table with temporal and universe metadata
DataProduct	Queryable unit with grain and variables
Variable	Column with semantic role (dimension/measure/indicator)
Universe	Population definition for comparability
ReferenceSystem	Unit system (geography, facilities) with versioning
IndicatorDefinition	How derived values are computed and aggregated
ValidationResult	Gate output with status and disclosures
Disclosure	Provenance information attached to results

For formal definitions of all terms, see the generated glossary.