YAML Asset Definitions¶
Semantic assets can be defined in YAML files, which makes them easier to manage and to keep under version control. The YAML loader validates each file's schema and its cross-references before loading the assets into the catalog.
Directory Structure¶
your-project/
└── assets/
    ├── datasets/
    │   ├── population.yml
    │   └── employment.yml
    ├── dimensions/
    │   ├── geography.yml
    │   ├── time.yml
    │   └── demographics.yml
    ├── geo_hierarchies/
    │   └── south_africa.yml
    ├── metrics/
    │   ├── population/
    │   │   ├── total_population.yml
    │   │   └── households.yml
    │   └── employment/
    │       ├── labour_force.yml
    │       └── unemployment_rate.yml
    ├── materializations/
    │   └── province_yearly.yml
    └── policies/
        └── comparability.yml
Metric files can be nested in subdirectories (for example, one per domain) to keep them organized.
Dataset Definition¶
assets/datasets/population.yml:
name: population_facts
kind: FACT  # or DIMENSION
physical_ref:
  schema: analytics
  table: population
grain_keys:
  geo:
    - geo_code
  time:
    - year
  other:
    - age_group
    - sex
time_config:
  column: year
  grain: YEAR
  supported_grains:
    - YEAR
geography_config:
  hierarchy_name: south_africa
  level_column: geo_level
  code_column: geo_code
# Reference dimensions by name
dimensions:
  geography:
    join_key: geo_code
  demographics:
    join_key: age_group
# Optional quality configuration
quality:
  suppression_column: is_suppressed
  suppression_threshold: 5
  confidence_column: confidence_level
Required fields:
| Field | Description |
|---|---|
| name | Unique dataset name |
| kind | FACT or DIMENSION |
| physical_ref.schema | Database schema |
| physical_ref.table | Table name |
Valid kind values: FACT, DIMENSION
Valid grain values: DAY, WEEK, MONTH, QUARTER, YEAR
Dimension Definition¶
assets/dimensions/geography.yml:
name: geography
attributes:
  code:
    expr: geo_code
    data_type: STRING
    semantic_type: CATEGORY
  name:
    expr: geo_name
    data_type: STRING
    semantic_type: CATEGORY
  level:
    expr: geo_level
    data_type: STRING
    semantic_type: CATEGORY
  parent_code:
    expr: parent_geo_code
    data_type: STRING
    semantic_type: CATEGORY
Required fields:
| Field | Description |
|---|---|
| name | Unique dimension name |
| attributes | At least one attribute (non-empty) |
Valid data_type values: STRING, INTEGER, DECIMAL, DATE, TIMESTAMP
Valid semantic_type values: CATEGORY, ORDINAL, CONTINUOUS
GeoHierarchy Definition¶
assets/geo_hierarchies/south_africa.yml:
name: south_africa
levels:
  - country
  - province
  - district
  - municipality
  - ward
parent_relationships:
  province:
    parent_level: country
  district:
    parent_level: province
  municipality:
    parent_level: district
  ward:
    parent_level: municipality
rollup_rules:
  default_allowed: true
  overrides:
    - from_level: ward
      to_level: country
      allowed: false
Required fields:
| Field | Description |
|---|---|
| name | Unique hierarchy name |
| levels | Ordered list (coarsest to finest) |
Note: Parent relationships must reference levels that exist in the levels list.
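The rule in the note can be sketched in a few lines of Python. This is only an illustration of the check, not the loader's actual code: the function name check_parent_relationships and the message wording are hypothetical, and the input is assumed to be the already-parsed YAML mapping.

```python
def check_parent_relationships(levels, parent_relationships):
    """Return a list of problems; an empty list means every reference resolves.

    Hypothetical helper: it only checks that each child and parent level
    named in parent_relationships also appears in the levels list.
    """
    known = set(levels)
    problems = []
    for child, relation in parent_relationships.items():
        parent = relation.get("parent_level")
        if child not in known:
            problems.append(f"unknown child level '{child}'")
        if parent not in known:
            problems.append(f"'{child}' has unknown parent level '{parent}'")
    return problems


hierarchy = {
    "levels": ["country", "province", "district"],
    "parent_relationships": {
        "province": {"parent_level": "country"},
        "district": {"parent_level": "region"},  # 'region' is not a declared level
    },
}

problems = check_parent_relationships(
    hierarchy["levels"], hierarchy["parent_relationships"]
)
print(problems)
```

Here the second relationship is flagged because region never appears in levels.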
Metric Definitions¶
Simple Aggregation¶
assets/metrics/population/total_population.yml:
name: total_population
kind: SIMPLE_AGG
# Spec fields (can be at root or under 'spec')
dataset_name: population_facts
expr: population_count
agg: SUM
# Optional filters
filters:
- column: is_valid
operator: "="
value: true
additivity:
type: ADDITIVE
across_time: true
across_geo: true
rollup_policy: ALLOW
valid_geo_levels:
- country
- province
- district
- municipality
valid_time_grains:
- YEAR
# Optional
unit:
name: persons
scale: 1
comparability:
methodology_id: CENSUS_2021
methodology_version: "1.0"
population_definition: All residents
Required fields for SIMPLE_AGG:
| Field | Description |
|---|---|
| name | Unique metric name |
| kind | SIMPLE_AGG |
| dataset_name | Source dataset |
| expr | Column or expression |
| agg | Aggregation function |
Valid agg values: SUM, COUNT, COUNT_DISTINCT, AVG, MIN, MAX
Ratio Metric¶
assets/metrics/employment/unemployment_rate.yml:
name: unemployment_rate
kind: RATIO
numerator: unemployed_count
denominator: labour_force
ratio_format: PERCENTAGE
join_intent: N_TO_1_ONLY
join_intent_rationale: Both metrics share the same grain
additivity:
type: NON_ADDITIVE
rollup_policy: RECOMPUTE
valid_geo_levels:
- province
- district
Required fields for RATIO:
| Field | Description |
|---|---|
| numerator | Name of numerator metric |
| denominator | Name of denominator metric |
Valid ratio_format values: PERCENTAGE, DECIMAL, PER_1000, PER_10000, PER_100000
Valid join_intent values: N_TO_1_ONLY, SAFE_ONE_TO_MANY
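The page lists the ratio_format names but not the arithmetic behind them. The sketch below shows one plausible reading, in which each format is a multiplier applied to numerator / denominator. The scale factors, the function name format_ratio, and the choice to return None for a zero denominator are all assumptions made for illustration.

```python
# Assumed scale factors: the source names the formats but does not define them.
RATIO_SCALE = {
    "DECIMAL": 1,
    "PERCENTAGE": 100,
    "PER_1000": 1_000,
    "PER_10000": 10_000,
    "PER_100000": 100_000,
}


def format_ratio(numerator, denominator, ratio_format):
    """Scale numerator / denominator according to ratio_format (hypothetical helper)."""
    if ratio_format not in RATIO_SCALE:
        raise ValueError(f"invalid ratio_format {ratio_format!r}")
    if denominator == 0:
        return None  # assumption: an undefined ratio is reported as missing
    # Multiply before dividing to limit floating-point rounding.
    return numerator * RATIO_SCALE[ratio_format] / denominator


print(format_ratio(25, 100, "PERCENTAGE"))
print(format_ratio(3, 1000, "PER_1000"))
```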
Derived Metric¶
assets/metrics/employment/employment_rate.yml:
name: employment_rate
kind: DERIVED
expr: "100 - unemployment_rate"
deps:
- unemployment_rate
additivity:
type: NON_ADDITIVE
rollup_policy: RECOMPUTE
Required fields for DERIVED:
| Field | Description |
|---|---|
| expr | Expression using dependent metrics |
| deps | List of dependency metric names |
Weighted Average¶
assets/metrics/prices/weighted_avg_price.yml:
name: weighted_avg_price
kind: WEIGHTED_AVG
value_expr: unit_price
weight_metric: quantity_sold
additivity:
type: NON_ADDITIVE
rollup_policy: RECOMPUTE
Required fields for WEIGHTED_AVG:
| Field | Description |
|---|---|
| value_expr | Expression for the value |
| weight_metric | Name of the weight metric |
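As a reminder of what a weighted average computes, here is a minimal sketch using the field names from the example above. It assumes the usual definition, sum(value * weight) / sum(weight). The function name and the in-memory rows are hypothetical, and the system's actual evaluation may differ.

```python
def weighted_average(rows, value_key="unit_price", weight_key="quantity_sold"):
    """Standard weighted mean over a list of row dicts (illustrative only)."""
    total_weight = sum(row[weight_key] for row in rows)
    if total_weight == 0:
        return None  # assumption: no weight means no defined average
    weighted_sum = sum(row[value_key] * row[weight_key] for row in rows)
    return weighted_sum / total_weight


rows = [
    {"unit_price": 10.0, "quantity_sold": 1},
    {"unit_price": 20.0, "quantity_sold": 3},
]

simple_mean = sum(row["unit_price"] for row in rows) / len(rows)
print(simple_mean)             # unweighted mean of the two prices
print(weighted_average(rows))  # mean weighted by quantity sold
```

The two results differ because the higher price carries three times the weight, which is also why such a metric is declared NON_ADDITIVE.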
Additivity Configuration¶
additivity:
type: ADDITIVE # or SEMI_ADDITIVE, NON_ADDITIVE
across_time: true # Can aggregate across time?
across_geo: true # Can aggregate across geography?
rollup_policy: ALLOW # or RECOMPUTE, FORBID
| Policy | When to use |
|---|---|
| ALLOW | Safe to sum (counts, totals) |
| RECOMPUTE | Recompute from components (rates, ratios) |
| FORBID | Cannot be aggregated (indices, rankings) |
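A small worked example shows why rates call for RECOMPUTE rather than ALLOW. The district figures are made up for illustration. The metric names unemployed_count and labour_force are taken from the ratio example earlier on this page.

```python
# Two hypothetical districts that roll up into one province.
districts = [
    {"unemployed_count": 10, "labour_force": 100},  # 10% unemployment
    {"unemployed_count": 90, "labour_force": 300},  # 30% unemployment
]


def rate(unemployed, labour_force):
    """Unemployment rate as a percentage."""
    return 100 * unemployed / labour_force


# ALLOW-style rollup: adding the district rates gives a meaningless number.
summed_rates = sum(rate(d["unemployed_count"], d["labour_force"]) for d in districts)

# RECOMPUTE-style rollup: sum the components first, then take the ratio.
total_unemployed = sum(d["unemployed_count"] for d in districts)
total_labour_force = sum(d["labour_force"] for d in districts)
recomputed_rate = rate(total_unemployed, total_labour_force)

print(summed_rates)
print(recomputed_rate)
```

Summing the district rates gives 40, which is not a rate for any real population, while recomputing from the summed components gives the province's 25%.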
Materialization Definition¶
assets/materializations/province_yearly.yml:
name: population_province_yearly
dataset_name: population_facts
metrics:
- total_population
- households
source:
type: PROFILE # or QUERY
grain:
geo_level: province
time_grain: YEAR
dimensions:
- sex
refresh:
strategy: INTERVAL # or DATASET_RELEASE, MANUAL
interval_minutes: 1440 # Required for INTERVAL
# cron_expression: "0 0 * * *" # Alternative
storage:
schema: marts
table: population_province_yearly
retention_days: 365 # Optional
Required fields:
| Field | Description |
|---|---|
| name | Unique materialization name |
| dataset_name | Source dataset |
| metrics | List of metrics to materialize (non-empty) |
| source.type | PROFILE or QUERY |
| refresh.strategy | DATASET_RELEASE, INTERVAL, or MANUAL |
| storage.schema | Target schema |
| storage.table | Target table |
Note: interval_minutes is required when strategy is INTERVAL.
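A conditional requirement like this one is easy to express in code. The sketch below is a stand-alone illustration, not the validator's implementation: check_refresh and its message strings are hypothetical. It applies the note literally and does not try to model the commented-out cron_expression alternative.

```python
VALID_STRATEGIES = {"DATASET_RELEASE", "INTERVAL", "MANUAL"}


def check_refresh(refresh):
    """Return error strings for a parsed 'refresh' mapping (hypothetical helper)."""
    errors = []
    strategy = refresh.get("strategy")
    if strategy not in VALID_STRATEGIES:
        errors.append(f"refresh.strategy: invalid value {strategy!r}")
    elif strategy == "INTERVAL" and "interval_minutes" not in refresh:
        errors.append("refresh.interval_minutes: required when strategy is INTERVAL")
    return errors


print(check_refresh({"strategy": "INTERVAL"}))
print(check_refresh({"strategy": "INTERVAL", "interval_minutes": 1440}))
print(check_refresh({"strategy": "MANUAL"}))
```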
Comparability Policy¶
assets/policies/comparability.yml:
default_policy: WARN # or ALLOW, FORBID
forbid_on_mismatch:
- methodology_id
warn_on_mismatch:
- methodology_version
- population_definition
allow_override_flag: allow_incomparable
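The page does not spell out how the policy is evaluated, so the following is only one plausible interpretation. It assumes that two metrics' comparability blocks are compared field by field, that a mismatch on a forbid_on_mismatch field yields FORBID, that a mismatch on a warn_on_mismatch field yields WARN, that any other mismatch falls back to default_policy, and that the strictest verdict wins. The function name, the CENSUS_2011 identifier, and the omission of allow_override_flag handling are all choices made for this illustration.

```python
SEVERITY = {"ALLOW": 0, "WARN": 1, "FORBID": 2}

policy = {
    "default_policy": "WARN",
    "forbid_on_mismatch": ["methodology_id"],
    "warn_on_mismatch": ["methodology_version", "population_definition"],
}


def comparability_verdict(policy, left, right):
    """Return ALLOW, WARN, or FORBID for two comparability mappings (assumed semantics)."""
    outcome = "ALLOW"
    for field in sorted(set(left) | set(right)):
        if left.get(field) == right.get(field):
            continue  # matching fields never restrict a comparison
        if field in policy.get("forbid_on_mismatch", []):
            verdict = "FORBID"
        elif field in policy.get("warn_on_mismatch", []):
            verdict = "WARN"
        else:
            verdict = policy.get("default_policy", "WARN")
        if SEVERITY[verdict] > SEVERITY[outcome]:
            outcome = verdict  # keep the strictest verdict seen so far
    return outcome


census_2021 = {"methodology_id": "CENSUS_2021", "methodology_version": "1.0"}
census_2021_v2 = {"methodology_id": "CENSUS_2021", "methodology_version": "2.0"}
census_2011 = {"methodology_id": "CENSUS_2011", "methodology_version": "1.0"}  # hypothetical

print(comparability_verdict(policy, census_2021, census_2021))
print(comparability_verdict(policy, census_2021, census_2021_v2))
print(comparability_verdict(policy, census_2021, census_2011))
```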
Validating Assets¶
Use the CLI tool to validate YAML assets:
# Validate assets in current directory
python -m invariant_contrib.wazimap.tools.validate_assets
# Validate assets in specific directory
python -m invariant_contrib.wazimap.tools.validate_assets /path/to/project
# Strict mode - treat warnings as errors
python -m invariant_contrib.wazimap.tools.validate_assets --strict
# Only show errors, suppress warnings
python -m invariant_contrib.wazimap.tools.validate_assets --quiet
# Output as JSON
python -m invariant_contrib.wazimap.tools.validate_assets --json
Example output:
[ERROR] assets/metrics/broken.yml:kind: invalid value 'INVALID', must be one of: ['DERIVED', 'RATIO', 'SIMPLE_AGG', 'WEIGHTED_AVG']
[ERROR] assets/datasets/missing.yml:physical_ref.schema: required field 'schema' is missing
[WARNING] assets/metrics/rate.yml:spec.numerator: unknown metric 'nonexistent_metric'
Found 2 error(s) and 1 warning(s)
JSON output:
{
"path": "/home/user/project",
"errors": [
{
"file_path": "assets/metrics/broken.yml",
"field_path": "kind",
"message": "invalid value 'INVALID', must be one of: ['DERIVED', 'RATIO', 'SIMPLE_AGG', 'WEIGHTED_AVG']",
"severity": "ERROR"
}
],
"warnings": [
{
"file_path": "assets/metrics/rate.yml",
"field_path": "spec.numerator",
"message": "unknown metric 'nonexistent_metric'",
"severity": "WARNING"
}
],
"summary": {
"error_count": 1,
"warning_count": 1,
"success": false
}
}
Exit codes:
| Code | Meaning |
|---|---|
| 0 | Success (no errors, or only warnings without --strict) |
| 1 | Errors found (or warnings with --strict) |
| 2 | Usage error (e.g., path doesn't exist) |
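When post-processing the --json output in a script, the first two rows of the exit-code table can be reproduced from the report itself. This is a sketch under the assumption that the report has the errors and warnings arrays shown above. The helper name exit_code_for is hypothetical, and code 2 is not covered because a usage error occurs before any report is produced.

```python
import json


def exit_code_for(report, strict=False):
    """Map a parsed JSON report to the documented exit codes 0 and 1."""
    if report["errors"]:
        return 1
    if strict and report["warnings"]:
        return 1  # --strict treats warnings as errors
    return 0


# A report with no errors and one warning, modelled on the JSON example above.
raw = """
{
  "path": "/home/user/project",
  "errors": [],
  "warnings": [
    {
      "file_path": "assets/metrics/rate.yml",
      "field_path": "spec.numerator",
      "message": "unknown metric 'nonexistent_metric'",
      "severity": "WARNING"
    }
  ]
}
"""
report = json.loads(raw)

print(exit_code_for(report))
print(exit_code_for(report, strict=True))
```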
Validation Checks¶
The schema validator performs these checks:
Required Fields¶
Each asset type has required fields that must be present.
Type Checking¶
Field values must match expected types (string, list, dict, etc.).
Enum Validation¶
Enum fields must use valid values:
| Field | Valid values |
|---|---|
| kind (dataset) | FACT, DIMENSION |
| kind (metric) | SIMPLE_AGG, RATIO, DERIVED, WEIGHTED_AVG |
| agg | SUM, COUNT, COUNT_DISTINCT, AVG, MIN, MAX |
| data_type | STRING, INTEGER, DECIMAL, DATE, TIMESTAMP |
| semantic_type | CATEGORY, ORDINAL, CONTINUOUS |
| additivity.type | ADDITIVE, SEMI_ADDITIVE, NON_ADDITIVE |
| rollup_policy | ALLOW, RECOMPUTE, FORBID |
| ratio_format | PERCENTAGE, DECIMAL, PER_1000, PER_10000, PER_100000 |
| join_intent | N_TO_1_ONLY, SAFE_ONE_TO_MANY |
| time grain | DAY, WEEK, MONTH, QUARTER, YEAR |
| refresh.strategy | DATASET_RELEASE, INTERVAL, MANUAL |
| source.type | PROFILE, QUERY |
Cross-Reference Validation¶
References between assets are checked:
- Metric numerator/denominator/deps must reference existing metrics
- Metric dataset_name must reference an existing dataset
- Dataset geography_config.hierarchy_name must reference an existing geo hierarchy
- Dataset dimensions keys must reference existing dimensions
- Materialization dataset_name must reference an existing dataset
- Materialization metrics must reference existing metrics
Note: Unresolved cross-references are reported as warnings rather than errors, since assets may be loaded in any order.
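The metric-related checks can be sketched as a single pass over parsed metric definitions. This is an illustration rather than the validator's code: the function name, the tuple layout of each warning, and the employment_facts dataset name are hypothetical, and metrics are assumed to keep their spec fields at the root.

```python
def check_metric_references(metrics, dataset_names):
    """Return (metric, field, message) warnings for unresolved references."""
    metric_names = {metric["name"] for metric in metrics}
    warnings = []
    for metric in metrics:
        dataset = metric.get("dataset_name")
        if dataset is not None and dataset not in dataset_names:
            warnings.append((metric["name"], "dataset_name", f"unknown dataset '{dataset}'"))
        for field in ("numerator", "denominator"):
            ref = metric.get(field)
            if ref is not None and ref not in metric_names:
                warnings.append((metric["name"], field, f"unknown metric '{ref}'"))
        for dep in metric.get("deps", []):
            if dep not in metric_names:
                warnings.append((metric["name"], "deps", f"unknown metric '{dep}'"))
    return warnings


metrics = [
    {"name": "labour_force", "kind": "SIMPLE_AGG", "dataset_name": "employment_facts"},
    {
        "name": "unemployment_rate",
        "kind": "RATIO",
        "numerator": "unemployed_count",  # not defined anywhere in this list
        "denominator": "labour_force",
    },
]

found = check_metric_references(metrics, dataset_names={"employment_facts"})
print(found)
```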
Loading Assets Programmatically¶
from invariant_contrib.wazimap.infrastructure.yaml_schema import (
    validate_assets,
    SchemaError,
    SchemaErrorSeverity,
)
from pathlib import Path
# Validate
errors = validate_assets(Path("/path/to/project"))
# Check for blocking errors
blocking_errors = [e for e in errors if e.severity == SchemaErrorSeverity.ERROR]
if blocking_errors:
    for error in blocking_errors:
        print(f"{error.file_path}:{error.field_path}: {error.message}")
    raise ValueError(f"Found {len(blocking_errors)} validation errors")
# Load assets using your store implementation
# (Store implementation depends on your infrastructure)
Best Practices¶
Organize Metrics by Domain¶
metrics/
├── population/
│   ├── total.yml
│   └── by_age.yml
├── employment/
│   ├── labour_force.yml
│   └── unemployment.yml
└── education/
    └── enrollment.yml
Use Consistent Naming¶
# Good: Clear, consistent naming
name: total_population
name: unemployment_rate
name: enrollment_count
# Avoid: Inconsistent or unclear
name: pop
name: UE_rate
name: students
Document Methodology¶
comparability:
  methodology_id: CENSUS_2021
  methodology_version: "1.0"
  population_definition: All persons resident in South Africa on census night
Specify Valid Grains¶
Be explicit about what aggregation levels make sense:
# This metric only makes sense at province level and above
valid_geo_levels:
- country
- province
# Annual data only
valid_time_grains:
- YEAR
Use RECOMPUTE for Ratios¶
# Ratios should always use RECOMPUTE, not FORBID
kind: RATIO
additivity:
type: NON_ADDITIVE
rollup_policy: RECOMPUTE # System will recalculate correctly
CI Validation¶
For continuous integration, use the validation script, which adds unique-name, dependency-cycle, and cross-reference checks on top of schema validation.
Running CI Validation¶
# Validate assets in current directory
python scripts/validate_semantic_assets.py
# Validate assets in specific directory
python scripts/validate_semantic_assets.py /path/to/project
# Strict mode - treat warnings as errors
python scripts/validate_semantic_assets.py --strict
# Output as JSON (for CI parsing)
python scripts/validate_semantic_assets.py --json
CI Validation Checks¶
The CI validator performs four validation passes:
| Pass | Description |
|---|---|
| Schema Validation | Required fields, types, enum values (via yaml_schema module) |
| Unique Names | No duplicate names within asset categories |
| Acyclic DAG | No circular dependencies between metrics |
| Cross-References | All references point to existing assets |
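The Unique Names pass amounts to counting names per category. A minimal sketch, assuming the names have already been collected from the YAML files into one list per asset category (the function name and input shape are hypothetical):

```python
from collections import Counter


def find_duplicate_names(names_by_category):
    """Return {category: [names that appear more than once]} for offending categories."""
    duplicates = {}
    for category, names in names_by_category.items():
        repeated = sorted(name for name, count in Counter(names).items() if count > 1)
        if repeated:
            duplicates[category] = repeated
    return duplicates


names_by_category = {
    "datasets": ["population_facts"],
    "dimensions": ["geography", "time"],
    "metrics": ["total_population", "households", "total_population"],
}

print(find_duplicate_names(names_by_category))
```

Because names are counted per category, a dimension and a metric that share a name would not be flagged by this sketch.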
Error Codes¶
| Code | Severity | Description |
|---|---|---|
| SCHEMA_ERROR | ERROR/WARNING | Schema validation failure |
| DUPLICATE_NAME | ERROR | Multiple assets with same name |
| CYCLIC_DEPENDENCY | ERROR | Circular metric dependency |
| UNKNOWN_REFERENCE | WARNING | Reference to non-existent asset |
| MISSING_ASSETS_DIR | ERROR | assets/ directory not found |
Example: Detecting Circular Dependencies¶
# metrics/a.yml
name: metric_a
kind: DERIVED
expr: "metric_b * 2"
deps:
- metric_b
# metrics/b.yml
name: metric_b
kind: DERIVED
expr: "metric_a + 1"
deps:
- metric_a # Creates cycle!
[ERROR] CYCLIC_DEPENDENCY: (global): Cyclic metric dependency detected: metric_a -> metric_b -> metric_a
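A cycle like this can be found with a depth-first search over the deps graph. The sketch below is a generic implementation of that technique, not the script's actual code. find_cycle is a hypothetical name, and the input is assumed to be a mapping from each metric name to its list of dependencies.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if the graph is acyclic."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []  # metrics on the current DFS branch, in visit order

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current branch, so we have closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state.get(node, UNVISITED) == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


deps = {"metric_a": ["metric_b"], "metric_b": ["metric_a"]}
cycle = find_cycle(deps)
print(" -> ".join(cycle))
```

For the two files above, the search starts at metric_a and returns to it via metric_b, giving the same chain as the error message.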
GitHub Actions Integration¶
Add to .github/workflows/ci.yml:
- name: Validate semantic assets
  run: |
    python scripts/validate_semantic_assets.py path/to/assets --strict