Populating the Catalog
How catalog metadata gets into Invariant, and how it relates to your data.
The two things you provide
Invariant needs two separate things:
flowchart TB
    subgraph your["Your System"]
        subgraph data["1. Your Data"]
            d1["Your tables, files,<br/>or data warehouse"]
        end
        subgraph meta["2. Catalog Metadata"]
            m1["Describes your data:"]
            m2["• Column names & types"]
            m3["• Measures vs indicators"]
            m4["• Universe definitions"]
        end
    end
    meta --> inv["Invariant validates queries"]
Invariant never touches your data files. It only reads the catalog metadata to validate queries. Your query engine executes the actual queries.
Implementing the CatalogStore
You provide catalog metadata by implementing the CatalogStore port:
from typing import Protocol

# DataProductId, DataProduct, Universe, etc. are Invariant's domain types.
class CatalogStore(Protocol):
    def get_data_product(self, id: DataProductId) -> DataProduct | None: ...
    def get_universe(self, id: UniverseId) -> Universe | None: ...
    def get_indicator_definition(self, var_id: VariableId) -> IndicatorDefinition | None: ...
    def get_catalog_snapshot(self, product_ids: set[DataProductId]) -> CatalogSnapshot: ...
    # ... other methods
How you store the metadata is up to you. Invariant doesn't care whether it comes from:
- A database (Postgres, SQLite)
- Configuration files (JSON, YAML)
- An external metadata API
- Hard-coded in your application
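Because CatalogStore is declared as a Protocol, an adapter only has to provide methods with matching names and signatures; it does not need to inherit from anything. The sketch below illustrates this with a trimmed, hypothetical stand-in protocol (CatalogStoreLike, with one method and plain str and dict in place of Invariant's domain types). The runtime_checkable decorator is added here purely for the demonstration; whether Invariant's own protocol uses it is not stated.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class CatalogStoreLike(Protocol):
    """Hypothetical, trimmed stand-in for the CatalogStore port."""

    def get_data_product(self, id: str) -> Optional[dict]: ...


class DictCatalogStore:
    """Hard-coded metadata in a dict. Note: no base class."""

    def __init__(self, products: dict):
        self._products = products

    def get_data_product(self, id: str) -> Optional[dict]:
        # Return None for unknown ids, mirroring the port's signature.
        return self._products.get(id)


store = DictCatalogStore({"p1": {"name": "Population by Age"}})

# True: the class has a method with the name the protocol requires.
matches = isinstance(store, CatalogStoreLike)
```

A static type checker performs the fuller check (argument and return types); isinstance against a runtime-checkable protocol only confirms that the method exists.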
Example implementation
class YourCatalogStore:
    """Adapt your metadata storage to Invariant's interface."""

    def __init__(self, db_connection):
        self.db = db_connection

    def get_data_product(self, id: DataProductId) -> DataProduct | None:
        # Fetch from your storage
        row = self.db.query("SELECT * FROM data_products WHERE id = ?", id)
        if not row:
            return None
        # Convert to Invariant's domain objects
        return DataProduct(
            id=id,
            name=row["name"],
            kind=DataProductKind(row["kind"]),
            variables=[self._to_variable(v) for v in row["variables"]],
            grain=GrainSpec(keys=row["grain_keys"]),
        )

    def _to_variable(self, row) -> Variable:
        return Variable(
            id=VariableId(row["id"]),
            name=row["name"],
            role=VariableRole(row["role"]),  # MEASURE, INDICATOR, or DIMENSION
            data_type=DataType(row["data_type"]),
        )
The sample project includes a complete JsonCatalogStore implementation you can use as a reference.
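The enum conversions in _to_variable above also act as a validation step: if VariableRole is a standard Python Enum (an assumption; the stand-in below is not Invariant's class), constructing it from an unknown string raises ValueError, so a mistyped role surfaces when metadata is loaded and not later at query time. A small hypothetical helper, parse_role, can make that error easier to read:

```python
from enum import Enum


class VariableRole(Enum):  # stand-in; assumed to behave like Invariant's enum
    MEASURE = "MEASURE"
    INDICATOR = "INDICATOR"
    DIMENSION = "DIMENSION"


def parse_role(raw: str) -> VariableRole:
    """Convert a stored string to a role, naming the bad value on failure."""
    try:
        return VariableRole(raw)
    except ValueError:
        allowed = ", ".join(r.value for r in VariableRole)
        raise ValueError(
            f"unknown role {raw!r}; expected one of: {allowed}"
        ) from None


role = parse_role("INDICATOR")  # VariableRole.INDICATOR
```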
How rules get defined
Rules aren't separate configuration. They're implicit in your catalog metadata:
| You define... | Invariant enforces... |
|---|---|
| role: MEASURE | Can be summed |
| role: INDICATOR | Cannot be summed |
| role: DIMENSION | Used for grouping only |
| universe_id on dataset | Datasets with different universes require acknowledgment to compare |
| reference_system_version_id | Queries across versions require crosswalks |
| aggregation_policy: NOT_AGGREGATABLE on indicator | Blocks aggregation attempts |
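To make the first three rows of the table concrete, here is a toy check that maps a variable's role to a decision for SUM. It is an illustration only, not Invariant's validator: the VariableRole enum is a stand-in, check_sum is a hypothetical name, and the "ALLOW"/"BLOCK" strings are borrowed from the runtime diagram later on this page.

```python
from enum import Enum


class VariableRole(Enum):  # stand-in mirroring the roles named in the table
    MEASURE = "MEASURE"
    INDICATOR = "INDICATOR"
    DIMENSION = "DIMENSION"


def check_sum(role: VariableRole) -> tuple[str, str]:
    """Return (decision, reason) for SUM(variable), following the table."""
    if role is VariableRole.MEASURE:
        return ("ALLOW", "measures can be summed")
    if role is VariableRole.INDICATOR:
        return ("BLOCK", "indicators cannot be summed")
    # Only DIMENSION remains.
    return ("BLOCK", "dimensions are used for grouping only")


decision, reason = check_sum(VariableRole.INDICATOR)
```

The point of the sketch is that no rule is configured separately: the decision follows entirely from the role recorded in the catalog.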
Example: Defining an indicator that can't be summed
{
  "data_products": [{
    "id": "...",
    "kind": "INDICATOR",
    "variables": [
      {"name": "unemployment_rate", "role": "INDICATOR", "data_type": "DECIMAL"}
    ]
  }],
  "indicator_definitions": [{
    "variable_id": "...",
    "indicator_type": "PERCENT",
    "aggregation_policy": "NOT_AGGREGATABLE",
    "numerator_variable": "unemployed_count",
    "denominator_variable": "labor_force"
  }]
}
Now when someone queries SUM(unemployment_rate), Invariant blocks it automatically.
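The arithmetic behind this rule is easy to check by hand. The sketch below uses invented figures for two regions and the numerator_variable and denominator_variable names from the definition above. It shows that adding rates produces a meaningless number, while a combined rate has to be rebuilt from the underlying counts. Whether Invariant performs such a recomputation itself is not stated here; the code only illustrates why the sum is blocked.

```python
# Hypothetical figures for two regions (not real data).
regions = [
    {"unemployed_count": 50, "labor_force": 1_000},   # rate = 5%
    {"unemployed_count": 300, "labor_force": 2_000},  # rate = 15%
]

# Summing the per-region rates: roughly 0.20, which describes neither
# region nor the two together.
summed_rates = sum(r["unemployed_count"] / r["labor_force"] for r in regions)

# Rebuilding the rate from numerator and denominator: 350 / 3000,
# roughly 0.1167, the actual combined unemployment rate.
combined_rate = (
    sum(r["unemployed_count"] for r in regions)
    / sum(r["labor_force"] for r in regions)
)

print(f"sum of rates:  {summed_rates:.4f}")
print(f"combined rate: {combined_rate:.4f}")
```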
The complete flow
Setup (once)
flowchart LR
    subgraph setup["Setup Phase"]
        A["1. Your data<br/>(tables, files, etc.)"] --> C["3. Implement CatalogStore"]
        B["2. Catalog metadata<br/>(describes your data)"] --> C
        C --> D["4. Wire up Invariant"]
    end
Runtime (every query)
flowchart TB
    Q["User query:<br/>SUM population BY geography_code"] --> V
    subgraph V["1. Invariant Validates"]
        V1["Check variable roles"]
        V2["Check universe compatibility"]
        V3["Return ALLOW or BLOCK"]
    end
    V -->|ALLOW| E
    V -->|BLOCK| X["Reject with explanation"]
    subgraph E["2. Your Engine Executes"]
        E1["Query your data"]
    end
    E --> R["Results + Disclosures"]
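In application code, the runtime diagram amounts to a guard in front of your query engine. The sketch below shows that control flow only. Decision, run_query, toy_validate, and toy_execute are all hypothetical names, the validator is a string-matching placeholder rather than Invariant's API, and the disclosures that accompany results in the diagram are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    """Hypothetical validation result: a status plus an explanation."""
    status: str            # "ALLOW" or "BLOCK"
    explanation: str = ""


def run_query(
    query: str,
    validate: Callable[[str], Decision],
    execute: Callable[[str], list],
) -> list:
    """Validate first; hand the query to the engine only on ALLOW."""
    decision = validate(query)
    if decision.status == "BLOCK":
        # Reject with the explanation; the engine is never called.
        raise PermissionError(decision.explanation)
    return execute(query)


# Toy stand-ins so the sketch runs on its own.
def toy_validate(query: str) -> Decision:
    if "unemployment_rate" in query:
        return Decision("BLOCK", "indicators cannot be summed")
    return Decision("ALLOW")


def toy_execute(query: str) -> list:
    return [("ZA", 100)]  # placeholder rows, not real data


rows = run_query("SUM population BY geography_code", toy_validate, toy_execute)
```

Passing validate and execute as callables keeps the two responsibilities separate, matching the split described earlier: Invariant decides, your engine touches the data.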
Catalog entities
| Entity | Purpose | Example |
|---|---|---|
| Study | Top-level grouping | "Census 2021", "Labour Force Survey 2023" |
| Dataset | Logical table in a study | "Population by Age", "Employment by Province" |
| Data Product | Queryable asset with schema | Table with columns, grain, universe |
| Variable | Column with semantic type | population (MEASURE), unemployment_rate (INDICATOR) |
| Universe | Population definition | "All residents", "Working-age adults (15-64)" |
| Reference System | Boundary scheme with versions | "SA Municipalities 2021" |
| Indicator Definition | How derived values work | unemployment_rate = unemployed / labor_force |
Next steps
- Sample Project — See a complete working example
- Minimal Integration — Implement your first CatalogStore
- Query Lifecycle — How validation flows through the system