Skip to content

Datasculpt

Wide Observations

adieyal/datasculpt

Wide Observations¶

The most common dataset shape — spreadsheet-style data with measures as columns.

The Data¶

geo_id,sex,age_group,population,unemployed,unemployment_rate
ZA-GP,F,15-24,1200000,180000,0.15
ZA-WC,F,15-24,600000,75000,0.125
ZA-GP,M,15-24,1150000,160000,0.139
ZA-WC,M,15-24,580000,70000,0.121
ZA-GP,F,25-34,1300000,120000,0.092
ZA-WC,F,25-34,650000,55000,0.085
ZA-GP,M,25-34,1250000,110000,0.088
ZA-WC,M,25-34,620000,52000,0.084

What It Looks Like¶

geo_id	sex	age_group	population	unemployed	unemployment_rate
ZA-GP	F	15-24	1200000	180000	0.15
ZA-WC	F	15-24	600000	75000	0.125
...	...	...	...	...	...

The Inference¶

from datasculpt import infer

result = infer("wide_observations.csv")

Shape Detection¶

>>> result.proposal.shape_hypothesis
<ShapeHypothesis.WIDE_OBSERVATIONS: 'wide_observations'>

>>> result.decision_record.hypotheses[0]
HypothesisScore(
    hypothesis=<ShapeHypothesis.WIDE_OBSERVATIONS>,
    score=0.72,
    reasons=[
        'Multiple numeric columns suggest measures as columns',
        'No indicator_name/value column pair detected',
        'No date-like column headers'
    ]
)

Grain Detection¶

>>> result.decision_record.grain
GrainInference(
    key_columns=['geo_id', 'sex', 'age_group'],
    confidence=0.95,
    uniqueness_ratio=1.0,
    evidence=[
        'Combination of geo_id, sex, age_group is unique',
        'All three columns have low cardinality (dimension-like)',
        'No pseudo-key signals detected'
    ]
)

Role Assignments¶

Column	Role	Evidence
geo_id	dimension	Low cardinality (2 values), string type
sex	dimension	Low cardinality (2 values), string type
age_group	dimension	Low cardinality (2 values), string type
population	measure	Integer type, high cardinality, all positive
unemployed	measure	Integer type, high cardinality, all positive
unemployment_rate	measure	Number type, values in [0, 1]

Why This Shape¶

Datasculpt detected wide_observations because:

Multiple numeric columns — population, unemployed, unemployment_rate are all numeric with high distinct ratios
No indicator pattern — No column pair matches the indicator_name/value pattern
No time in headers — Column names don't look like dates
Clear dimension/measure split — String columns have low cardinality, numeric columns have high cardinality

The Proposal¶

>>> result.proposal
InvariantProposal(
    dataset_name='wide_observations',
    dataset_kind=<DatasetKind.OBSERVATIONS>,
    shape_hypothesis=<ShapeHypothesis.WIDE_OBSERVATIONS>,
    grain=['geo_id', 'sex', 'age_group'],
    columns=[
        ColumnSpec(name='geo_id', role=<Role.DIMENSION>, ...),
        ColumnSpec(name='sex', role=<Role.DIMENSION>, ...),
        ColumnSpec(name='age_group', role=<Role.DIMENSION>, ...),
        ColumnSpec(name='population', role=<Role.MEASURE>, ...),
        ColumnSpec(name='unemployed', role=<Role.MEASURE>, ...),
        ColumnSpec(name='unemployment_rate', role=<Role.MEASURE>, ...),
    ],
    warnings=[],
    required_user_confirmations=[]
)

Downstream Implications¶

When registered with Invariant as observations: - Aggregations over measures are safe - Joins on grain columns are safe - Comparing across dimension values is comparable - The unemployment_rate may get flagged as an indicator (derived from other columns)

See Also¶

Long Indicators — Same data in unpivoted format
Grain Detection — How grain is found