# Structural Metadata

The missing category of metadata that Datasculpt produces.

## The Five Metadata Categories
| Category | What It Captures | Example |
|---|---|---|
| Technical | Physical storage details | File format, encoding, compression |
| Business | Domain meaning | "Revenue" means Q4 sales in USD |
| Operational | Lineage and freshness | Last updated 2024-01-15, from ERP |
| Governance | Access and compliance | PII, restricted to finance team |
| Structural | Layout and intent | Wide format, grain is (region, date) |
Most data catalogs and semantic layers handle the first four well. The fifth — structural metadata — is usually assumed to exist but rarely produced systematically.
## What Structural Metadata Captures

### 1. Dataset Shape
How is the data laid out?
| Shape | Layout |
|---|---|
| Long observations | Rows are atomic observations |
| Long indicators | Rows are unpivoted indicator/value pairs |
| Wide observations | Measures are columns |
| Wide time columns | Time periods are column headers |
| Series column | Time series stored in arrays |
The same logical data can have multiple physical layouts. Structural metadata makes the layout explicit.
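As a concrete illustration, here is one small dataset in two of those layouts, sketched with pandas on made-up figures:

```python
import pandas as pd

# Made-up figures in an unpivoted "long indicators" layout:
# each row is an indicator/value pair.
long_indicators = pd.DataFrame({
    "geo_id":    ["BE", "BE", "NL", "NL"],
    "indicator": ["population", "area_km2", "population", "area_km2"],
    "value":     [11_700_000, 30_689, 17_800_000, 41_543],
})

# The same facts pivoted into a "wide observations" layout:
# one column per measure.
wide_observations = long_indicators.pivot(
    index="geo_id", columns="indicator", values="value"
).reset_index()
```

Both frames carry identical facts; structural metadata is what records which layout a given dataset actually uses.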
### 2. Column Roles
What purpose does each column serve?
| Role | Purpose |
|---|---|
| Key | Contributes to uniqueness |
| Dimension | Categorical grouping variable |
| Measure | Numeric, aggregatable value |
| Time | Temporal dimension |
| Indicator name | Identifies metric in unpivoted data |
| Value | Holds value in unpivoted data |
| Series | Contains embedded time series |
| Metadata | Descriptive, non-analytical |
Without explicit roles, systems guess — and often guess wrong.
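A role assignment can be as small as a column-to-role map. Reusing the demographics columns that appear later on this page (the role choices here are illustrative assumptions):

```python
# Hypothetical role map for a demographics table; the roles are assumed for
# illustration, not inferred here.
column_roles = {
    "geo_id":     "key",
    "sex":        "dimension",
    "age_group":  "dimension",
    "population": "measure",
}

# With roles explicit, downstream code no longer has to guess.
measures   = [col for col, role in column_roles.items() if role == "measure"]
dimensions = [col for col, role in column_roles.items() if role in ("key", "dimension")]
```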
### 3. Grain
What uniquely identifies each row?
- Which columns form the unique key?
- Are there duplicates?
- Are there nulls in key columns?
Grain errors cause silent failures: joins that duplicate, aggregations that double-count.
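A grain check is cheap to run up front. A minimal sketch with pandas, assuming a candidate key has already been proposed:

```python
import pandas as pd

def check_grain(df: pd.DataFrame, key_columns: list[str]) -> dict:
    """Report how well a candidate key identifies each row."""
    keys = df[key_columns]
    return {
        "uniqueness_ratio": keys.drop_duplicates().shape[0] / len(df),
        "duplicate_rows": int(df.duplicated(subset=key_columns).sum()),
        "rows_with_null_keys": int(keys.isna().any(axis=1).sum()),
    }
```

A uniqueness ratio below 1.0, or any nulls in the key, is exactly what later surfaces as duplicating joins and double-counted aggregations.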
### 4. Structural Constraints
What invariants must hold?
- Time columns must be sequential
- Indicator names must be from a known set
- Series arrays must have consistent length
Constraints enable validation before data enters downstream systems.
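Declared constraints can be checked mechanically before loading. A minimal sketch, assuming unpivoted data with `indicator` and `series` columns (the column names and parameters are assumptions):

```python
import pandas as pd

def validate_constraints(
    df: pd.DataFrame, known_indicators: set[str], series_length: int
) -> list[str]:
    """Return structural constraint violations (empty list if the data is clean)."""
    errors = []

    # Indicator names must come from a known set.
    unknown = set(df["indicator"]) - known_indicators
    if unknown:
        errors.append(f"unknown indicators: {sorted(unknown)}")

    # Embedded series arrays must have a consistent length.
    bad_lengths = df["series"].map(len) != series_length
    if bad_lengths.any():
        errors.append(f"{int(bad_lengths.sum())} rows with an inconsistent series length")

    return errors
```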
## Why Structural Metadata Is Missing

### Systems Assume It
Data catalogs store whatever you tell them. They don't infer structure.
```python
# Catalog stores what you provide
catalog.register(
    table="demographics",
    columns=["geo_id", "sex", "age_group", "population"],
    # But: Is this wide or long? What's the grain? Which are measures?
)
```
### Manual Annotation Doesn't Scale

You could manually annotate every dataset. In practice:

- New datasets arrive faster than humans annotate
- Annotations drift as data changes
- Different teams annotate inconsistently
### Inference Is Hard

Structural inference requires:

- Understanding multiple shapes
- Handling ambiguity gracefully
- Producing audit trails
- Being deterministic for trust
Most teams build one-off scripts that break on edge cases.
## How Datasculpt Fills the Gap

### 1. Systematic Inference
Every dataset goes through the same inference pipeline. No ad-hoc scripts, and no human in the loop unless a result is ambiguous.
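None of the code below is Datasculpt's actual API; it is only a hypothetical sketch of what a fixed, uniform pass over any incoming dataset might look like:

```python
import pandas as pd

def infer_structure(df: pd.DataFrame) -> dict:
    """One fixed, deliberately naive inference pass applied to every dataset."""
    numeric = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
    non_numeric = [c for c in df.columns if c not in numeric]

    # Naive shape heuristic: an indicator/value pair suggests unpivoted data.
    is_unpivoted = {"indicator", "value"} <= set(df.columns)
    shape = "long indicators" if is_unpivoted else "wide observations"

    return {
        "shape": shape,
        "measure_candidates": numeric,
        "grain_candidate": non_numeric,
    }
```

A real pass ranks several shape hypotheses rather than picking one outright, which is what the confidence scores in the next section attach to.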
### 2. Confidence Scores

Every inference is scored:

- Shape hypothesis: 0.72 confidence
- Role assignment: 0.85 confidence
- Grain: 95% uniqueness ratio
Low confidence triggers questions, not silent guesses.
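Those scores can travel with the result itself. An illustrative record shape (the field names and the 0.80 threshold are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ScoredInference:
    shape: str
    shape_confidence: float   # e.g. 0.72
    role_confidence: float    # e.g. 0.85
    grain_uniqueness: float   # e.g. 0.95

CONFIDENCE_FLOOR = 0.80  # assumed threshold for this sketch

def needs_review(result: ScoredInference) -> bool:
    """Low confidence should trigger a question, never a silent guess."""
    return min(result.shape_confidence, result.role_confidence) < CONFIDENCE_FLOOR
```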
### 3. Audit Trails

Every decision is recorded:

- What was chosen
- What alternatives were considered
- What evidence supported the choice
- What questions were asked
Reproducible. Debuggable. Trustworthy.
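In practice this can be a plain, serializable record per decision. An illustrative example (the keys and values are assumptions, not Datasculpt's schema):

```python
# Hypothetical decision record for one structural inference.
decision_record = {
    "decision": "shape",
    "chosen": "long indicators",
    "alternatives": [
        {"hypothesis": "wide observations", "confidence": 0.41},
    ],
    "evidence": [
        "column 'indicator' has low cardinality",
        "column 'value' is numeric",
    ],
    "questions_asked": [],
}
```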
### 4. Determinism

Same input → same output. No:

- LLM randomness
- Hidden state
- Environment dependence
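Determinism is also straightforward to test. One sketch (not Datasculpt's actual test suite) is to fingerprint the serialized result and assert that it never changes for the same input:

```python
import hashlib
import json

def fingerprint(result: dict) -> str:
    """Stable hash of an inference result, suitable for regression checks."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# For any given dataset, fingerprint of the inference output must be identical
# across repeated runs, machines, and environments.
```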
## Where Datasculpt Fits

```mermaid
flowchart TD
    A[Raw Data] --> B[Datasculpt]
    B --> C[Invariant]
    C --> D[Catalog / Query Systems]
```
Datasculpt produces structural metadata (shape, grain, roles). Invariant uses it for governance. Catalogs store and serve the governed data.
## See Also
- Evidence — The facts Datasculpt extracts
- Shapes — The structural patterns Datasculpt recognizes
- Decision Records — How inferences are captured