Populating the Catalog

How data and metadata get into Invariant.

The two things you provide

Your system holds two separate things, and Invariant reads only the second:

flowchart TB
    subgraph your["Your System"]
        subgraph data["1. Your Data"]
            d1["Your tables, files,<br/>or data warehouse"]
        end
        subgraph meta["2. Catalog Metadata"]
            m1["Describes your data:"]
            m2["• Column names & types"]
            m3["• Measures vs indicators"]
            m4["• Universe definitions"]
        end
    end
    meta --> inv["Invariant validates queries"]

Invariant never touches your data files. It only reads the catalog metadata to validate queries. Your query engine executes the actual queries.

Implementing the CatalogStore

You provide catalog metadata by implementing the CatalogStore port:

from typing import Protocol


class CatalogStore(Protocol):
    def get_data_product(self, id: DataProductId) -> DataProduct | None: ...
    def get_universe(self, id: UniverseId) -> Universe | None: ...
    def get_indicator_definition(self, var_id: VariableId) -> IndicatorDefinition | None: ...
    def get_catalog_snapshot(self, product_ids: set[DataProductId]) -> CatalogSnapshot: ...
    # ... other methods

How you store the metadata is up to you. Invariant doesn't care whether it comes from:

  • A database (Postgres, SQLite)
  • Configuration files (JSON, YAML)
  • An external metadata API
  • Hard-coded in your application
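For the hard-coded option, a minimal sketch might look like the following. The `Variable`, `DataProduct`, and `VariableRole` classes here are simplified stand-ins written for this example, not Invariant's actual domain objects, and the product id is hypothetical. Only `get_data_product` is shown; a real store would implement the remaining port methods as well.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


# Stand-in types for illustration only. The real classes come from Invariant
# and may have different fields.
class VariableRole(Enum):
    MEASURE = "MEASURE"
    INDICATOR = "INDICATOR"
    DIMENSION = "DIMENSION"


@dataclass(frozen=True)
class Variable:
    name: str
    role: VariableRole


@dataclass(frozen=True)
class DataProduct:
    id: str
    name: str
    variables: tuple[Variable, ...] = ()


class InMemoryCatalogStore:
    """Hard-coded catalog: metadata lives in a plain dict keyed by product id."""

    def __init__(self, products: list[DataProduct]):
        self._products = {p.id: p for p in products}

    def get_data_product(self, id: str) -> DataProduct | None:
        # Unknown ids yield None, mirroring the port's return type.
        return self._products.get(id)


store = InMemoryCatalogStore([
    DataProduct(
        id="population_by_age",  # hypothetical id
        name="Population by Age",
        variables=(Variable("population", VariableRole.MEASURE),),
    )
])
```

Because the store is just a lookup, swapping the dict for a database or an API call later does not change the calling code.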

Example implementation

class YourCatalogStore:
    """Adapt your metadata storage to Invariant's interface."""

    def __init__(self, db_connection):
        self.db = db_connection

    def get_data_product(self, id: DataProductId) -> DataProduct | None:
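        # Fetch from your storage. `self.db.query` is a placeholder for
        # whatever parameterized lookup your data-access layer provides.
        row = self.db.query("SELECT * FROM data_products WHERE id = ?", id)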
        if not row:
            return None

        # Convert to Invariant's domain objects
        return DataProduct(
            id=id,
            name=row["name"],
            kind=DataProductKind(row["kind"]),
            variables=[self._to_variable(v) for v in row["variables"]],
            grain=GrainSpec(keys=row["grain_keys"]),
        )

    def _to_variable(self, row) -> Variable:
        return Variable(
            id=VariableId(row["id"]),
            name=row["name"],
            role=VariableRole(row["role"]),  # MEASURE, INDICATOR, or DIMENSION
            data_type=DataType(row["data_type"]),
        )

The sample project includes a complete JsonCatalogStore implementation you can use as a reference.

How rules get defined

Rules aren't separate configuration. They're implicit in your catalog metadata:

| You define... | Invariant enforces... |
| --- | --- |
| role: MEASURE | Can be summed |
| role: INDICATOR | Cannot be summed |
| role: DIMENSION | Used for grouping only |
| universe_id on dataset | Datasets with different universes require acknowledgment to compare |
| reference_system_version_id | Queries across versions require crosswalks |
| aggregation_policy: NOT_AGGREGATABLE on indicator | Blocks aggregation attempts |
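To make the first three rows concrete, here is a sketch of how a role could map to a SUM decision. It is an illustration of the rule as stated in the table, not Invariant's internal code. The `check_sum` function and its reason strings are hypothetical, and `VariableRole` is a stand-in enum.

```python
from enum import Enum


class VariableRole(Enum):  # stand-in for Invariant's own enum
    MEASURE = "MEASURE"
    INDICATOR = "INDICATOR"
    DIMENSION = "DIMENSION"


def check_sum(role: VariableRole) -> tuple[bool, str]:
    """Return (allowed, reason) for applying SUM to a variable with this role."""
    if role is VariableRole.MEASURE:
        return True, "measures can be summed"
    if role is VariableRole.INDICATOR:
        return False, "indicators cannot be summed"
    # Only DIMENSION remains.
    return False, "dimensions are used for grouping only"
```

The point of the sketch is that no rule file is consulted: the decision follows directly from the role recorded in the catalog.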

Example: Defining an indicator that can't be summed

{
  "data_products": [{
    "id": "...",
    "kind": "INDICATOR",
    "variables": [
      {"name": "unemployment_rate", "role": "INDICATOR", "data_type": "DECIMAL"}
    ]
  }],
  "indicator_definitions": [{
    "variable_id": "...",
    "indicator_type": "PERCENT",
    "aggregation_policy": "NOT_AGGREGATABLE",
    "numerator_variable": "unemployed_count",
    "denominator_variable": "labor_force"
  }]
}

Now when someone queries SUM(unemployment_rate), Invariant blocks it automatically.
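The check behind that block can be pictured roughly as follows. This is a simplified sketch under stated assumptions: definitions are keyed by variable name here for brevity (the JSON above keys them by variable_id), and `validate_aggregation` with its return shape is hypothetical rather than part of Invariant's API.

```python
# Hypothetical lookup keyed by variable name, for illustration only.
INDICATOR_DEFINITIONS = {
    "unemployment_rate": {
        "indicator_type": "PERCENT",
        "aggregation_policy": "NOT_AGGREGATABLE",
    },
}


def validate_aggregation(function: str, variable: str) -> tuple[str, str]:
    """Return (decision, explanation) for an aggregate such as SUM(variable)."""
    definition = INDICATOR_DEFINITIONS.get(variable)
    if definition and definition["aggregation_policy"] == "NOT_AGGREGATABLE":
        return (
            "BLOCK",
            f"{function}({variable}) rejected: {variable} is NOT_AGGREGATABLE",
        )
    # Variables without an indicator definition are not restricted by this check.
    return "ALLOW", f"{function}({variable}) permitted"
```

In this sketch, `validate_aggregation("SUM", "unemployment_rate")` yields a BLOCK decision, while a variable with no indicator definition passes this particular check.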

The complete flow

Setup (once)

flowchart LR
    subgraph setup["Setup Phase"]
        A["1. Your data<br/>(tables, files, etc.)"] --> C["3. Implement CatalogStore"]
        B["2. Catalog metadata<br/>(describes your data)"] --> C
        C --> D["4. Wire up Invariant"]
    end

Runtime (every query)

flowchart TB
    Q["User query:<br/>SUM population BY geography_code"] --> V

    subgraph V["1. Invariant Validates"]
        V1["Check variable roles"]
        V2["Check universe compatibility"]
        V3["Return ALLOW or BLOCK"]
    end

    V -->|ALLOW| E
    V -->|BLOCK| X["Reject with explanation"]

    subgraph E["2. Your Engine Executes"]
        E1["Query your data"]
    end

    E --> R["Results + Disclosures"]
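In application code, the runtime flow above amounts to a validate-then-execute gate. The sketch below assumes you pass in two callables: `validate`, wrapping whatever Invariant entry point you use, and `execute`, wrapping your query engine. `Decision`, `run_query`, and the result dict shape are hypothetical names chosen for this example, and disclosures are omitted for brevity.

```python
from __future__ import annotations

from typing import Callable, NamedTuple


class Decision(NamedTuple):
    allowed: bool
    explanation: str


def run_query(
    query: str,
    validate: Callable[[str], Decision],
    execute: Callable[[str], list[dict]],
) -> dict:
    """Validate first; only an allowed query ever reaches the engine."""
    decision = validate(query)
    if not decision.allowed:
        # BLOCK path: reject with the validator's explanation, never execute.
        return {"status": "BLOCK", "explanation": decision.explanation}
    # ALLOW path: hand the query to your own engine.
    return {"status": "ALLOW", "rows": execute(query)}
```

Keeping the engine behind the gate means a blocked query costs nothing to reject, since no data is read.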

Catalog entities

| Entity | Purpose | Example |
| --- | --- | --- |
| Study | Top-level grouping | "Census 2021", "Labour Force Survey 2023" |
| Dataset | Logical table in a study | "Population by Age", "Employment by Province" |
| Data Product | Queryable asset with schema | Table with columns, grain, universe |
| Variable | Column with semantic type | population (MEASURE), unemployment_rate (INDICATOR) |
| Universe | Population definition | "All residents", "Working-age adults (15-64)" |
| Reference System | Boundary scheme with versions | "SA Municipalities 2021" |
| Indicator Definition | How derived values work | unemployment_rate = unemployed / labor_force |

Next steps