Quickstart¶
What you're about to do¶
- Install the sample project
- Run a valid query
- Run an invalid query
- Watch Invariant explain + gate it
- Fix the query and rerun
Prerequisites¶
- Python 3.12+
- pip
- 5 minutes of attention span
1) Install the sample project¶
The repository includes Census Explorer, a demo CLI that shows Invariant in action with South African census-style data.
cd examples/sample-project
pip install -e ../../ # Install invariant
pip install -e . # Install census-explorer
This gives you the census-explorer command. See Sample Project for full details.
2) Run a valid query¶
3) Run an invalid query¶
census-explorer validate aa0e8400-e29b-41d4-a716-446655440002 \
-m unemployment_rate:SUM -d geography_code
Blocked
Expected: Invariant rejects the query and returns:
Status: BLOCK
Can Execute: No
Issues:
[INDICATOR_AGG_NOT_ALLOWED] Cannot aggregate indicator 'unemployment_rate' with SUM
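You can't meaningfully sum percentages: adding the unemployment rates of several areas does not give the rate of the combined area. Invariant catches this as a semantic error.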
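A quick worked example shows why the sum has no meaning. The figures below are made up for illustration and are not taken from the sample project's data.

```python
# Made-up unemployment rates (percent) for three areas; not from the sample data.
rates_percent = {"area_a": 10.0, "area_b": 30.0, "area_c": 70.0}

# SUM treats the rates as if they were counts and simply adds them up.
summed = sum(rates_percent.values())

print(f"SUM of rates: {summed:.1f}%")
# The result is above 100%, which no unemployment rate can be.
print("Valid percentage?", 0.0 <= summed <= 100.0)
```

The sum grows with the number of areas rather than with how many people are unemployed, which is why the gate refuses it.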
4) Fix the query¶
census-explorer validate aa0e8400-e29b-41d4-a716-446655440002 \
-m unemployment_rate:NONE -d geography_code
Allowed
Expected: the query now passes, because it requests the indicator without aggregation.
What just happened¶
- You submitted a query (aggregate unemployment_rate with SUM)
- Invariant checked the semantic rules (unemployment_rate is an indicator, cannot sum)
- The gate returned BLOCK with an explanation
- After you changed the aggregation to NONE, the same check let the query through
Populating the Catalog¶
The sample project loads its catalog from JSON, but you can also build it programmatically. Here's how the pieces fit together:
from invariant.catalog.domain.entities.study import Study
from invariant.catalog.domain.entities.dataset import Dataset
from invariant.catalog.domain.entities.data_product import DataProduct
from invariant.catalog.domain.entities.variable import Variable
from invariant.identity.domain.entities.concept import Concept
from invariant.identity.domain.entities.universe import Universe
from invariant.identity.domain.entities.variable_semantics import VariableSemantics
from invariant.semantic.domain.entities.indicator_definition import IndicatorDefinition
from invariant.shared.contracts.ids import (
StudyId, DatasetId, DataProductId, VariableId, ConceptId, UniverseId,
)
from invariant.shared.contracts.enums import (
DataProductKind, VariableRole, DataType,
IndicatorType, AggregationPolicy,
)
from invariant.shared.contracts.value_objects import GrainSpec, VariableRef
# 1. Create a Study (the data collection effort)
study = Study(
id=StudyId.create(),
name="Census 2021",
owner_org="Statistics South Africa",
description="National population and household census",
methodology_summary="Full enumeration with post-enumeration survey",
)
# 2. Define a Universe (who/what the data covers)
universe = Universe(
id=UniverseId.create(),
label="SA Residents",
definition="All usual residents in South Africa at census reference date",
inclusions=["Citizens", "Permanent residents", "Refugees"],
exclusions=["Diplomats", "Foreign military", "Visitors"],
)
# 3. Create a Dataset (a concrete table from the study)
dataset = Dataset(
id=DatasetId.create(),
study_id=study.id,
name="Person Level Data",
description="Individual-level census records",
universe_id=universe.id,
)
# 4. Define Concepts (semantic identity for cross-dataset comparison)
pop_concept = Concept(
id=ConceptId.create(),
label="Total Population",
description="Count of all persons in the universe",
canonical_unit="persons",
)
unemp_concept = Concept(
id=ConceptId.create(),
label="Unemployment Rate",
description="Proportion of labour force that is unemployed",
canonical_unit="percent",
)
# 5. Create Variables (columns in the data product)
dp_id = DataProductId.create()
geo_var = Variable(
id=VariableId.create(),
data_product_id=dp_id,
name="geography_code",
role=VariableRole.DIMENSION,
data_type=DataType.STRING,
description="Geographic area code",
)
pop_var = Variable(
id=VariableId.create(),
data_product_id=dp_id,
name="population",
role=VariableRole.MEASURE,
data_type=DataType.INT,
description="Total population count",
unit="persons",
)
employed_var = Variable(
id=VariableId.create(),
data_product_id=dp_id,
name="employed",
role=VariableRole.MEASURE,
data_type=DataType.INT,
description="Number of employed persons",
)
unemployed_var = Variable(
id=VariableId.create(),
data_product_id=dp_id,
name="unemployed",
role=VariableRole.MEASURE,
data_type=DataType.INT,
description="Number of unemployed persons",
)
unemp_rate_var = Variable(
id=VariableId.create(),
data_product_id=dp_id,
name="unemployment_rate",
role=VariableRole.INDICATOR,
data_type=DataType.FLOAT,
description="Unemployment rate as percentage",
unit="percent",
)
# 6. Link Variables to Concepts via VariableSemantics
# This enables cross-dataset comparison
pop_semantics = VariableSemantics(
variable_id=pop_var.id,
concept_id=pop_concept.id,
unit="persons",
)
unemp_rate_semantics = VariableSemantics(
variable_id=unemp_rate_var.id,
concept_id=unemp_concept.id,
unit="percent",
comparability_group="official_unemployment", # Only compare like-for-like
)
# 7. Build a FACT Data Product (contains measures)
fact_product = DataProduct(
id=dp_id,
dataset_id=dataset.id,
name="Census Facts",
kind=DataProductKind.FACT,
grain=GrainSpec(keys=[geo_var.id]), # One row per geography
variables=[geo_var, pop_var, employed_var, unemployed_var],
)
# 8. Build an INDICATOR Data Product (contains derived indicators)
indicator_dp_id = DataProductId.create()
indicator_geo_var = Variable(
id=VariableId.create(),
data_product_id=indicator_dp_id,
name="geography_code",
role=VariableRole.DIMENSION,
data_type=DataType.STRING,
)
indicator_unemp_var = Variable(
id=VariableId.create(),
data_product_id=indicator_dp_id,
name="unemployment_rate",
role=VariableRole.INDICATOR,
data_type=DataType.FLOAT,
unit="percent",
)
indicator_product = DataProduct(
id=indicator_dp_id,
dataset_id=dataset.id,
name="Census Indicators",
kind=DataProductKind.INDICATOR,
grain=GrainSpec(keys=[indicator_geo_var.id]),
variables=[indicator_geo_var, indicator_unemp_var],
)
# 9. Define how the indicator is computed (for validation)
unemp_rate_def = IndicatorDefinition(
variable_id=indicator_unemp_var.id,
indicator_type=IndicatorType.PERCENT,
aggregation_policy=AggregationPolicy.RECOMPUTE, # Can reaggregate via formula
numerator_ref=VariableRef(
data_product_id=fact_product.id,
variable_id=unemployed_var.id,
),
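denominator_ref=VariableRef(
data_product_id=fact_product.id,
# Placeholder: the fact product above defines no labour_force variable
# (labour_force = employed + unemployed), so this freshly created ID does
# not resolve to any variable. A real catalog would reference the id of
# an existing variable here.
variable_id=VariableId.create(),
),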
formula="unemployed / (employed + unemployed) * 100",
)
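To make the RECOMPUTE idea concrete, here is a minimal standalone sketch of rolling two areas up into one by applying the formula above to summed counts. It does not use Invariant's API: AreaCounts and unemployment_rate are hypothetical helpers, and the counts are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AreaCounts:
    """Hypothetical per-area counts (stand-ins for the employed/unemployed measures)."""
    employed: int
    unemployed: int


def unemployment_rate(employed: int, unemployed: int) -> float:
    """Apply the formula: unemployed / (employed + unemployed) * 100."""
    labour_force = employed + unemployed
    if labour_force == 0:
        raise ValueError("labour force is zero; the rate is undefined")
    return unemployed / labour_force * 100


# Invented counts: one large area with a low rate, one small area with a high rate.
areas = [AreaCounts(employed=900, unemployed=100), AreaCounts(employed=70, unemployed=30)]

# Per-area rates, and their unweighted mean (which ignores area size).
per_area = [unemployment_rate(a.employed, a.unemployed) for a in areas]
naive_mean = sum(per_area) / len(per_area)

# RECOMPUTE: sum the underlying measures first, then apply the formula once.
total_employed = sum(a.employed for a in areas)
total_unemployed = sum(a.unemployed for a in areas)
recomputed = unemployment_rate(total_employed, total_unemployed)

print(f"mean of rates: {naive_mean:.1f}%")
print(f"recomputed:    {recomputed:.1f}%")
```

With these counts the unweighted mean is 20.0% while the recomputed rate is about 11.8%, because the larger area dominates the combined labour force. Only the recomputed figure describes the merged area.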
Key relationships¶
| Entity | Purpose | Example |
|---|---|---|
| Study | A data collection effort | "Census 2021" |
| Universe | Who/what the data covers | "SA Residents" |
| Concept | What a variable measures | "Total Population" |
| Dataset | A concrete table from a study | "Person Level Data" |
| DataProduct | Queryable structure (FACT or INDICATOR) | "Census Facts" |
| Variable | A column (DIMENSION, MEASURE, or INDICATOR) | "population" |
| VariableSemantics | Links a Variable to a Concept (enables cross-dataset comparison) | "population → Total Population" |
| IndicatorDefinition | How an indicator aggregates | "unemployment_rate: RECOMPUTE" |
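As a rough illustration of how VariableSemantics could support like-for-like comparison, the sketch below compares two variables by their semantic links. SemanticsSketch is a hypothetical stand-in that uses plain strings for ids, and the rule itself (same concept, same unit, same comparability group) is an assumption for illustration, not Invariant's documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SemanticsSketch:
    """Hypothetical stand-in for VariableSemantics; ids are plain strings here."""
    variable_id: str
    concept_id: str
    unit: str
    comparability_group: Optional[str] = None


def comparable(a: SemanticsSketch, b: SemanticsSketch) -> bool:
    """Assumed rule: variables compare only if concept, unit and group all match."""
    return (
        a.concept_id == b.concept_id
        and a.unit == b.unit
        and a.comparability_group == b.comparability_group
    )


# Two variables from different (imaginary) datasets using the official definition.
census_rate = SemanticsSketch("var-1", "unemployment-rate", "percent", "official_unemployment")
survey_rate = SemanticsSketch("var-2", "unemployment-rate", "percent", "official_unemployment")
# Same concept and unit, but a different (invented) definition group.
expanded_rate = SemanticsSketch("var-3", "unemployment-rate", "percent", "expanded_unemployment")

print(comparable(census_rate, survey_rate))    # True
print(comparable(census_rate, expanded_rate))  # False
```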
Aggregation rules¶
Invariant enforces semantic aggregation rules:
- MEASURE variables (like population) can be summed, averaged, etc.
- INDICATOR variables (like unemployment_rate) follow their IndicatorDefinition:
  - NOT_AGGREGATABLE: cannot be aggregated at all
  - RECOMPUTE: must recalculate from numerator/denominator
  - ALLOW_LIST: only specified aggregations permitted
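The three policies can be read as a small decision function. The sketch below is a standalone approximation, not Invariant's implementation: PolicySketch mirrors the policy names listed above, direct_aggregation_allowed is a hypothetical helper, and treating NONE as always permitted is an assumption based on the fixed query in step 4.

```python
from enum import Enum
from typing import FrozenSet


class PolicySketch(Enum):
    """Hypothetical mirror of the three aggregation policies named above."""
    NOT_AGGREGATABLE = "NOT_AGGREGATABLE"
    RECOMPUTE = "RECOMPUTE"
    ALLOW_LIST = "ALLOW_LIST"


def direct_aggregation_allowed(
    policy: PolicySketch,
    aggregation: str,
    allow_list: FrozenSet[str] = frozenset(),
) -> bool:
    """Return True if an indicator may be aggregated directly with `aggregation`."""
    # Assumption: requesting the indicator as-is (NONE) is always fine.
    if aggregation == "NONE":
        return True
    # ALLOW_LIST: only the explicitly listed aggregations pass.
    if policy is PolicySketch.ALLOW_LIST:
        return aggregation in allow_list
    # NOT_AGGREGATABLE never aggregates; RECOMPUTE must go through the
    # numerator/denominator instead of aggregating the indicator column.
    return False


print(direct_aggregation_allowed(PolicySketch.RECOMPUTE, "SUM"))   # False
print(direct_aggregation_allowed(PolicySketch.RECOMPUTE, "NONE"))  # True
print(direct_aggregation_allowed(PolicySketch.ALLOW_LIST, "MAX", frozenset({"MIN", "MAX"})))  # True
```

Under this reading, the quickstart's unemployment_rate:SUM query is blocked because the indicator's policy is RECOMPUTE, while unemployment_rate:NONE passes.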