YAML Asset Definitions

Semantic assets can be defined in YAML files for easier management and version control. The YAML loader validates schemas and cross-references before loading into the catalog.

Directory Structure

your-project/
└── assets/
    ├── datasets/
    │   ├── population.yml
    │   └── employment.yml
    ├── dimensions/
    │   ├── geography.yml
    │   ├── time.yml
    │   └── demographics.yml
    ├── geo_hierarchies/
    │   └── south_africa.yml
    ├── metrics/
    │   ├── population/
    │   │   ├── total_population.yml
    │   │   └── households.yml
    │   └── employment/
    │       ├── labour_force.yml
    │       └── unemployment_rate.yml
    ├── materializations/
    │   └── province_yearly.yml
    └── policies/
        └── comparability.yml

Metrics can be nested in subdirectories for organization.


Dataset Definition

assets/datasets/population.yml:

name: population_facts
kind: FACT  # or DIMENSION

physical_ref:
  schema: analytics
  table: population

grain_keys:
  geo:
    - geo_code
  time:
    - year
  other:
    - age_group
    - sex

time_config:
  column: year
  grain: YEAR
  supported_grains:
    - YEAR

geography_config:
  hierarchy_name: south_africa
  level_column: geo_level
  code_column: geo_code

# Reference dimensions by name
dimensions:
  geography:
    join_key: geo_code
  demographics:
    join_key: age_group

# Optional quality configuration
quality:
  suppression_column: is_suppressed
  suppression_threshold: 5
  confidence_column: confidence_level

Required fields:

Field                Description
-------------------  -------------------
name                 Unique dataset name
kind                 FACT or DIMENSION
physical_ref.schema  Database schema
physical_ref.table   Table name

Valid kind values: FACT, DIMENSION

Valid grain values: DAY, WEEK, MONTH, QUARTER, YEAR
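These constraints can be mirrored in a small pre-flight check before files reach the loader. A sketch operating on an already-parsed YAML document (a plain dict); the `check_dataset` function is illustrative, not the library's validator:

```python
VALID_DATASET_KINDS = {"FACT", "DIMENSION"}

def check_dataset(doc: dict) -> list[str]:
    """Return error messages for a parsed dataset definition."""
    errors = []
    for field in ("name", "kind"):
        if field not in doc:
            errors.append(f"{field}: required field is missing")
    if "kind" in doc and doc["kind"] not in VALID_DATASET_KINDS:
        errors.append(f"kind: invalid value {doc['kind']!r}")
    # physical_ref must carry both schema and table
    physical = doc.get("physical_ref") or {}
    for field in ("schema", "table"):
        if field not in physical:
            errors.append(f"physical_ref.{field}: required field is missing")
    return errors
```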


Dimension Definition

assets/dimensions/geography.yml:

name: geography

attributes:
  code:
    expr: geo_code
    data_type: STRING
    semantic_type: CATEGORY

  name:
    expr: geo_name
    data_type: STRING
    semantic_type: CATEGORY

  level:
    expr: geo_level
    data_type: STRING
    semantic_type: CATEGORY

  parent_code:
    expr: parent_geo_code
    data_type: STRING
    semantic_type: CATEGORY

Required fields:

Field       Description
----------  ----------------------------------
name        Unique dimension name
attributes  At least one attribute (non-empty)

Valid data_type values: STRING, INTEGER, DECIMAL, DATE, TIMESTAMP

Valid semantic_type values: CATEGORY, ORDINAL, CONTINUOUS


GeoHierarchy Definition

assets/geo_hierarchies/south_africa.yml:

name: south_africa

levels:
  - country
  - province
  - district
  - municipality
  - ward

parent_relationships:
  province:
    parent_level: country
  district:
    parent_level: province
  municipality:
    parent_level: district
  ward:
    parent_level: municipality

rollup_rules:
  default_allowed: true
  overrides:
    - from_level: ward
      to_level: country
      allowed: false

Required fields:

Field   Description
------  ---------------------------------
name    Unique hierarchy name
levels  Ordered list (coarsest to finest)

Note: Parent relationships must reference levels that exist in the levels list.
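That consistency rule is straightforward to check on a parsed hierarchy document. A sketch, with an illustrative function name:

```python
def check_hierarchy(doc: dict) -> list[str]:
    """Verify parent_relationships only reference declared levels."""
    levels = set(doc.get("levels", []))
    errors = []
    for level, rel in (doc.get("parent_relationships") or {}).items():
        if level not in levels:
            errors.append(f"parent_relationships.{level}: unknown level")
        parent = (rel or {}).get("parent_level")
        if parent not in levels:
            errors.append(
                f"parent_relationships.{level}.parent_level: "
                f"unknown level {parent!r}")
    return errors
```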


Metric Definitions

Simple Aggregation

assets/metrics/population/total_population.yml:

name: total_population
kind: SIMPLE_AGG

# Spec fields (can be at root or under 'spec')
dataset_name: population_facts
expr: population_count
agg: SUM

# Optional filters
filters:
  - column: is_valid
    operator: "="
    value: true

additivity:
  type: ADDITIVE
  across_time: true
  across_geo: true
  rollup_policy: ALLOW

valid_geo_levels:
  - country
  - province
  - district
  - municipality

valid_time_grains:
  - YEAR

# Optional
unit:
  name: persons
  scale: 1

comparability:
  methodology_id: CENSUS_2021
  methodology_version: "1.0"
  population_definition: All residents

Required fields for SIMPLE_AGG:

Field         Description
------------  ---------------------
name          Unique metric name
kind          SIMPLE_AGG
dataset_name  Source dataset
expr          Column or expression
agg           Aggregation function

Valid agg values: SUM, COUNT, COUNT_DISTINCT, AVG, MIN, MAX

Ratio Metric

assets/metrics/employment/unemployment_rate.yml:

name: unemployment_rate
kind: RATIO

numerator: unemployed_count
denominator: labour_force
ratio_format: PERCENTAGE
join_intent: N_TO_1_ONLY
join_intent_rationale: Both metrics share the same grain

additivity:
  type: NON_ADDITIVE
  rollup_policy: RECOMPUTE

valid_geo_levels:
  - province
  - district

Required fields for RATIO:

Field        Description
-----------  --------------------------
numerator    Name of numerator metric
denominator  Name of denominator metric

Valid ratio_format values: PERCENTAGE, DECIMAL, PER_1000, PER_10000, PER_100000

Valid join_intent values: N_TO_1_ONLY, SAFE_ONE_TO_MANY
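The format values map naturally onto scale factors applied when the ratio is evaluated. The factors below are an assumed interpretation of each value, not taken from the library:

```python
# Assumed scale factor per ratio_format value (illustrative).
RATIO_SCALES = {
    "DECIMAL": 1,
    "PERCENTAGE": 100,
    "PER_1000": 1_000,
    "PER_10000": 10_000,
    "PER_100000": 100_000,
}

def evaluate_ratio(numerator: float, denominator: float,
                   ratio_format: str = "DECIMAL") -> float:
    """Divide and scale according to the configured ratio_format."""
    if denominator == 0:
        raise ZeroDivisionError("denominator metric evaluated to zero")
    return numerator / denominator * RATIO_SCALES[ratio_format]
```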

Derived Metric

assets/metrics/employment/employment_rate.yml:

name: employment_rate
kind: DERIVED

expr: "100 - unemployment_rate"
deps:
  - unemployment_rate

additivity:
  type: NON_ADDITIVE
  rollup_policy: RECOMPUTE

Required fields for DERIVED:

Field  Description
-----  ----------------------------------
expr   Expression using dependent metrics
deps   List of dependency metric names

Weighted Average

assets/metrics/prices/weighted_avg_price.yml:

name: weighted_avg_price
kind: WEIGHTED_AVG

value_expr: unit_price
weight_metric: quantity_sold

additivity:
  type: NON_ADDITIVE
  rollup_policy: RECOMPUTE

Required fields for WEIGHTED_AVG:

Field          Description
-------------  -------------------------
value_expr     Expression for the value
weight_metric  Name of the weight metric

Additivity Configuration

additivity:
  type: ADDITIVE      # or SEMI_ADDITIVE, NON_ADDITIVE
  across_time: true   # Can aggregate across time?
  across_geo: true    # Can aggregate across geography?
  rollup_policy: ALLOW  # or RECOMPUTE, FORBID

Policy     When to use
---------  -----------------------------------------
ALLOW      Safe to sum (counts, totals)
RECOMPUTE  Recompute from components (rates, ratios)
FORBID     Cannot be aggregated (indices, rankings)

Materialization Definition

assets/materializations/province_yearly.yml:

name: population_province_yearly
dataset_name: population_facts

metrics:
  - total_population
  - households

source:
  type: PROFILE  # or QUERY

grain:
  geo_level: province
  time_grain: YEAR
  dimensions:
    - sex

refresh:
  strategy: INTERVAL  # or DATASET_RELEASE, MANUAL
  interval_minutes: 1440  # Required for INTERVAL
  # cron_expression: "0 0 * * *"  # Alternative

storage:
  schema: marts
  table: population_province_yearly

retention_days: 365  # Optional

Required fields:

Field             Description
----------------  -------------------------------------------
name              Unique materialization name
dataset_name      Source dataset
metrics           List of metrics to materialize (non-empty)
source.type       PROFILE or QUERY
refresh.strategy  DATASET_RELEASE, INTERVAL, or MANUAL
storage.schema    Target schema
storage.table     Target table

Note: interval_minutes is required when strategy is INTERVAL.
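That conditional requirement can be expressed as a small check on the parsed refresh block. A sketch (the function is illustrative; it also accepts cron_expression as the documented alternative):

```python
VALID_STRATEGIES = {"DATASET_RELEASE", "INTERVAL", "MANUAL"}

def check_refresh(refresh: dict) -> list[str]:
    """Validate a materialization's refresh configuration."""
    errors = []
    strategy = refresh.get("strategy")
    if strategy not in VALID_STRATEGIES:
        errors.append(f"refresh.strategy: invalid value {strategy!r}")
    # INTERVAL needs a schedule: interval_minutes, or the
    # cron_expression alternative shown above.
    if strategy == "INTERVAL" and not (
            "interval_minutes" in refresh or "cron_expression" in refresh):
        errors.append(
            "refresh.interval_minutes: required when strategy is INTERVAL")
    return errors
```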


Comparability Policy

assets/policies/comparability.yml:

default_policy: WARN  # or ALLOW, FORBID

forbid_on_mismatch:
  - methodology_id

warn_on_mismatch:
  - methodology_version
  - population_definition

allow_override_flag: allow_incomparable
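One plausible reading of this policy: fields listed under forbid_on_mismatch block a comparison, fields under warn_on_mismatch only warn, and any other mismatched field falls back to default_policy. A sketch of that interpretation (the function name and precedence order are assumptions, not taken from the library):

```python
def mismatch_action(field: str, policy: dict) -> str:
    """Return ALLOW, WARN, or FORBID for a mismatched metadata field."""
    if field in policy.get("forbid_on_mismatch", []):
        return "FORBID"
    if field in policy.get("warn_on_mismatch", []):
        return "WARN"
    return policy.get("default_policy", "WARN")
```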

Validating Assets

Use the CLI tool to validate YAML assets:

# Validate assets in current directory
python -m invariant_contrib.wazimap.tools.validate_assets

# Validate assets in specific directory
python -m invariant_contrib.wazimap.tools.validate_assets /path/to/project

# Strict mode - treat warnings as errors
python -m invariant_contrib.wazimap.tools.validate_assets --strict

# Only show errors, suppress warnings
python -m invariant_contrib.wazimap.tools.validate_assets --quiet

# Output as JSON
python -m invariant_contrib.wazimap.tools.validate_assets --json

Example output:

[ERROR] assets/metrics/broken.yml:kind: invalid value 'INVALID', must be one of: ['DERIVED', 'RATIO', 'SIMPLE_AGG', 'WEIGHTED_AVG']
[ERROR] assets/datasets/missing.yml:physical_ref.schema: required field 'schema' is missing
[WARNING] assets/metrics/rate.yml:spec.numerator: unknown metric 'nonexistent_metric'

Found 2 error(s) and 1 warning(s)

JSON output:

{
  "path": "/home/user/project",
  "errors": [
    {
      "file_path": "assets/metrics/broken.yml",
      "field_path": "kind",
      "message": "invalid value 'INVALID', must be one of: ['DERIVED', 'RATIO', 'SIMPLE_AGG', 'WEIGHTED_AVG']",
      "severity": "ERROR"
    }
  ],
  "warnings": [
    {
      "file_path": "assets/metrics/rate.yml",
      "field_path": "spec.numerator",
      "message": "unknown metric 'nonexistent_metric'",
      "severity": "WARNING"
    }
  ],
  "summary": {
    "error_count": 1,
    "warning_count": 1,
    "success": false
  }
}

Exit codes:

Code  Meaning
----  ------------------------------------------------------
0     Success (no errors, or only warnings without --strict)
1     Errors found (or warnings with --strict)
2     Usage error (e.g., path doesn't exist)

Validation Checks

The schema validator performs these checks:

Required Fields

Each asset type has required fields that must be present.

Type Checking

Field values must match expected types (string, list, dict, etc.).

Enum Validation

Enum fields must use valid values:

Field             Valid values
----------------  ----------------------------------------------------
kind (dataset)    FACT, DIMENSION
kind (metric)     SIMPLE_AGG, RATIO, DERIVED, WEIGHTED_AVG
agg               SUM, COUNT, COUNT_DISTINCT, AVG, MIN, MAX
data_type         STRING, INTEGER, DECIMAL, DATE, TIMESTAMP
semantic_type     CATEGORY, ORDINAL, CONTINUOUS
additivity.type   ADDITIVE, SEMI_ADDITIVE, NON_ADDITIVE
rollup_policy     ALLOW, RECOMPUTE, FORBID
ratio_format      PERCENTAGE, DECIMAL, PER_1000, PER_10000, PER_100000
join_intent       N_TO_1_ONLY, SAFE_ONE_TO_MANY
time grain        DAY, WEEK, MONTH, QUARTER, YEAR
refresh.strategy  DATASET_RELEASE, INTERVAL, MANUAL
source.type       PROFILE, QUERY

Cross-Reference Validation

References between assets are checked:

  • Metric numerator/denominator/deps must reference existing metrics
  • Metric dataset_name must reference existing dataset
  • Dataset geography_config.hierarchy_name must reference existing geo hierarchy
  • Dataset dimensions keys must reference existing dimensions
  • Materialization dataset_name must reference existing dataset
  • Materialization metrics must reference existing metrics

Note: Cross-reference errors produce warnings, not errors, since assets may be loaded in any order.
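The shape of such a check is simple once all assets are indexed by name. A sketch over parsed metric specs and a set of known dataset names (function and parameter names are illustrative):

```python
def check_metric_references(metrics: dict[str, dict],
                            datasets: set[str]) -> list[str]:
    """Emit warnings for references to assets that are not defined."""
    warnings = []
    for name, spec in metrics.items():
        # Ratio components and derived-metric deps must name known metrics.
        for ref_field in ("numerator", "denominator"):
            ref = spec.get(ref_field)
            if ref is not None and ref not in metrics:
                warnings.append(f"{name}.{ref_field}: unknown metric '{ref}'")
        for dep in spec.get("deps", []):
            if dep not in metrics:
                warnings.append(f"{name}.deps: unknown metric '{dep}'")
        ds = spec.get("dataset_name")
        if ds is not None and ds not in datasets:
            warnings.append(f"{name}.dataset_name: unknown dataset '{ds}'")
    return warnings
```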


Loading Assets Programmatically

from invariant_contrib.wazimap.infrastructure.yaml_schema import (
    validate_assets,
    SchemaError,
    SchemaErrorSeverity,
)
from pathlib import Path

# Validate
errors = validate_assets(Path("/path/to/project"))

# Check for blocking errors
blocking_errors = [e for e in errors if e.severity == SchemaErrorSeverity.ERROR]

if blocking_errors:
    for error in blocking_errors:
        print(f"{error.file_path}:{error.field_path}: {error.message}")
    raise ValueError(f"Found {len(blocking_errors)} validation errors")

# Load assets using your store implementation
# (Store implementation depends on your infrastructure)

Best Practices

Organize Metrics by Domain

metrics/
├── population/
│   ├── total.yml
│   └── by_age.yml
├── employment/
│   ├── labour_force.yml
│   └── unemployment.yml
└── education/
    └── enrollment.yml

Use Consistent Naming

# Good: Clear, consistent naming
name: total_population
name: unemployment_rate
name: enrollment_count

# Avoid: Inconsistent or unclear
name: pop
name: UE_rate
name: students

Document Methodology

comparability:
  methodology_id: CENSUS_2021
  methodology_version: "1.0"
  population_definition: All persons resident in South Africa on census night

Specify Valid Grains

Be explicit about what aggregation levels make sense:

# This metric only makes sense at province level and above
valid_geo_levels:
  - country
  - province

# Annual data only
valid_time_grains:
  - YEAR

Use RECOMPUTE for Ratios

# Ratios should always use RECOMPUTE, not FORBID
kind: RATIO
additivity:
  type: NON_ADDITIVE
  rollup_policy: RECOMPUTE  # System will recalculate correctly
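A quick worked example of why: averaging pre-computed rates weights every region equally, while recomputing from the additive components weights by denominator size. The district figures below are made up for illustration:

```python
# Two hypothetical districts rolling up to one province.
districts = [
    {"unemployed": 50,  "labour_force": 1_000},   # 5% rate
    {"unemployed": 300, "labour_force": 2_000},   # 15% rate
]

# Naively averaging the district rates ignores denominator size:
naive = sum(d["unemployed"] / d["labour_force"] for d in districts) / len(districts)

# RECOMPUTE: sum the additive components first, then divide once:
recomputed = (sum(d["unemployed"] for d in districts)
              / sum(d["labour_force"] for d in districts))

print(f"naive={naive:.4f} recomputed={recomputed:.4f}")
```

The naive average gives 10%, while the correct province-level rate is 350 / 3000, about 11.7%.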

CI Validation

For continuous integration, use the comprehensive validation script that performs additional checks beyond schema validation.

Running CI Validation

# Validate assets in current directory
python scripts/validate_semantic_assets.py

# Validate assets in specific directory
python scripts/validate_semantic_assets.py /path/to/project

# Strict mode - treat warnings as errors
python scripts/validate_semantic_assets.py --strict

# Output as JSON (for CI parsing)
python scripts/validate_semantic_assets.py --json

CI Validation Checks

The CI validator performs four validation passes:

Pass               Description
-----------------  ------------------------------------------------------------
Schema Validation  Required fields, types, enum values (via yaml_schema module)
Unique Names       No duplicate names within asset categories
Acyclic DAG        No circular dependencies between metrics
Cross-References   All references point to existing assets

Error Codes

Code                Severity       Description
------------------  -------------  -------------------------------
SCHEMA_ERROR        ERROR/WARNING  Schema validation failure
DUPLICATE_NAME      ERROR          Multiple assets with same name
CYCLIC_DEPENDENCY   ERROR          Circular metric dependency
UNKNOWN_REFERENCE   WARNING        Reference to non-existent asset
MISSING_ASSETS_DIR  ERROR          assets/ directory not found

Example: Detecting Circular Dependencies

# metrics/a.yml
name: metric_a
kind: DERIVED
expr: "metric_b * 2"
deps:
  - metric_b

# metrics/b.yml
name: metric_b
kind: DERIVED
expr: "metric_a + 1"
deps:
  - metric_a  # Creates cycle!

[ERROR] CYCLIC_DEPENDENCY: (global): Cyclic metric dependency detected: metric_a -> metric_b -> metric_a

GitHub Actions Integration

Add to .github/workflows/ci.yml:

- name: Validate semantic assets
  run: |
    python scripts/validate_semantic_assets.py path/to/assets --strict

JSON Output for CI

python scripts/validate_semantic_assets.py --json

{
  "path": "/home/user/project",
  "errors": [
    {
      "severity": "ERROR",
      "code": "CYCLIC_DEPENDENCY",
      "message": "Cyclic metric dependency detected: a -> b -> a",
      "file_path": null,
      "field_path": null
    }
  ],
  "warnings": [],
  "summary": {
    "error_count": 1,
    "warning_count": 0,
    "success": false
  }
}