Skip to content

Shapes

The structural patterns Datasculpt recognizes in datasets.

What Is a Shape?

A shape describes how data is laid out — not what it means, but how it's structured.

The same logical data can have multiple physical shapes:

Logical: GDP by region and time

Shape 1: Wide observations
| region | 2022_gdp | 2023_gdp | 2024_gdp |

Shape 2: Long observations
| region | year | gdp |

Shape 3: Long indicators
| region | year | indicator | value |

Shape 4: Wide time columns
| region | 2022 | 2023 | 2024 |

Shape 5: Series column
| region | gdp_series |

Knowing the shape is essential for correct aggregation, joining, and querying.

Visual Comparison: Same Data, Six Shapes

The table below shows identical GDP data (North: 100, 110; South: 80, 85 for years 2022-2023) in all six shapes:

Shape Layout
long_observations region, year, gdp
North, 2022, 100
North, 2023, 110
South, 2022, 80
South, 2023, 85
long_indicators region, year, indicator, value
North, 2022, gdp, 100
North, 2023, gdp, 110
South, 2022, gdp, 80
South, 2023, gdp, 85
wide_observations region, gdp_2022, gdp_2023
North, 100, 110
South, 80, 85
wide_time_columns region, indicator, 2022, 2023
North, gdp, 100, 110
South, gdp, 80, 85
series_column region, indicator, series
North, gdp, [100, 110]
South, gdp, [80, 85]
microdata hhid, indiv, zone, s1aq1, s1aq2
001, 01, North, 1, 2
001, 02, North, 2, 1
002, 01, South, 1, 3
flowchart TB
    subgraph logical["Logical Data: GDP by Region/Time"]
        data["North: 100 (2022), 110 (2023)<br>South: 80 (2022), 85 (2023)"]
    end

    logical --> long_obs["<b>long_observations</b><br>━━━━━━━━━━━━━━━<br>region │ year │ gdp<br>───────┼──────┼─────<br>North  │ 2022 │ 100<br>North  │ 2023 │ 110<br>South  │ 2022 │ 80<br>South  │ 2023 │ 85"]

    logical --> long_ind["<b>long_indicators</b><br>━━━━━━━━━━━━━━━━━━━━<br>region │ year │ ind │ val<br>───────┼──────┼─────┼─────<br>North  │ 2022 │ gdp │ 100<br>North  │ 2023 │ gdp │ 110<br>South  │ 2022 │ gdp │ 80<br>South  │ 2023 │ gdp │ 85"]

    logical --> wide_obs["<b>wide_observations</b><br>━━━━━━━━━━━━━━━━━━━━━<br>region │ gdp_2022 │ gdp_2023<br>───────┼──────────┼──────────<br>North  │ 100      │ 110<br>South  │ 80       │ 85"]

    logical --> wide_time["<b>wide_time_columns</b><br>━━━━━━━━━━━━━━━━━━━━━━━<br>region │ indicator │ 2022 │ 2023<br>───────┼───────────┼──────┼──────<br>North  │ gdp       │ 100  │ 110<br>South  │ gdp       │ 80   │ 85"]

    logical --> series["<b>series_column</b><br>━━━━━━━━━━━━━━━━━━━━━━━━<br>region │ indicator │ series<br>───────┼───────────┼────────────<br>North  │ gdp       │ [100, 110]<br>South  │ gdp       │ [80, 85]"]

The Six Shapes

Long Observations

Rows are atomic observations with dimensions and measures as columns.

region,year,gdp,population
North,2022,100,1000
North,2023,110,1020
South,2022,80,800
South,2023,85,820

Characteristics: - Each row is one observation - Dimensions: region, year - Measures: gdp, population - Grain: (region, year)

Use case: Survey data, transaction logs, sensor readings

Long Indicators

Unpivoted format with indicator name and value columns.

region,year,indicator,value
North,2022,gdp,100
North,2022,population,1000
North,2023,gdp,110
North,2023,population,1020

Characteristics: - Each row is one indicator value - Indicator column has concept names - Value column has the numeric values - Grain: (region, year, indicator)

Use case: Statistical databases, open data portals, data exchange formats

Wide Observations

Spreadsheet-style with measures as columns.

region,gdp_2022,gdp_2023,pop_2022,pop_2023
North,100,110,1000,1020
South,80,85,800,820

Characteristics: - Each row is one entity - Multiple measures and/or time periods as columns - Grain: (region)

Use case: Summary reports, dashboards, exports from BI tools

Wide Time Columns

Time periods encoded in column headers.

region,indicator,2022,2023,2024
North,gdp,100,110,120
North,population,1000,1020,1040
South,gdp,80,85,90

Characteristics: - Column names are dates (2022, 2023, 2024-01, etc.) - Each column is a time period - Grain: (region, indicator) — time is in headers

Use case: Time series from Excel, economic data, forecasts

Series Column

Time series stored as arrays in a single column.

region,indicator,series,start_year
North,gdp,"[100, 110, 120]",2022
South,gdp,"[80, 85, 90]",2022

Characteristics: - One column contains JSON arrays - Arrays have consistent length - Metadata columns describe the series

Use case: APIs returning time series, compact data exchange

Microdata

Survey and observation data with coded question columns.

hhid,indiv,zone,state,lga,s1aq1,s1aq2,v101,hv001,wt
001,01,North,Kano,Dala,1,2,3,101,1.25
001,02,North,Kano,Dala,2,1,2,101,1.25
002,01,South,Lagos,Ikeja,1,3,1,102,0.95

Characteristics: - Many columns (30-100+) with coded names following survey patterns (s1aq1, v101, hv001) - Hierarchical ID structure (household ID + individual ID) - Geography hierarchy columns (zone, state, lga) - Many low-cardinality categorical responses - Survey weight columns present

Use case: Household surveys (DHS, LSMS, MICS), demographic health surveys, census microdata

Shape Detection

Datasculpt scores all six shapes based on evidence:

result = infer("data.csv")

for h in result.decision_record.hypotheses:
    print(f"{h.hypothesis.value}: {h.score:.2f}")
    for reason in h.reasons:
        print(f"  - {reason}")

Scoring Signals

Signal Suggests Shape
Multiple numeric columns wide_observations
Indicator/value column pair long_indicators
Date-like column headers wide_time_columns
JSON arrays in values series_column
Single numeric column long_observations
Coded question columns (s1aq1, v101) microdata
Hierarchical IDs (hhid + indiv) microdata
Geography hierarchy (zone, state, lga) microdata
Many low-cardinality categoricals microdata

Ambiguity

When top shapes have similar scores (gap < 0.10), inference is ambiguous:

>>> result.decision_record.hypotheses[:2]
[
    HypothesisScore(hypothesis=WIDE_OBSERVATIONS, score=0.58),
    HypothesisScore(hypothesis=LONG_OBSERVATIONS, score=0.54)
]
# Gap is 0.04 — ambiguous

See Ambiguous Shape for handling ambiguity.

Shape Implications

For Aggregation

Shape Safe Aggregation
long_observations SUM(measure) GROUP BY dimensions
long_indicators SUM(value) WHERE indicator='x' GROUP BY dimensions
wide_observations SUM(measure_col) GROUP BY dimensions
wide_time_columns Aggregate across time columns, or unpivot first
series_column Expand series, then aggregate
microdata Weighted aggregation using survey weights, GROUP BY geography

For Joins

Shape Join Safely On
long_observations Grain columns
long_indicators Grain columns (includes indicator)
wide_observations Grain columns
wide_time_columns Grain columns (time is in headers)
series_column Grain columns (series is not a key)
microdata Hierarchical ID columns (hhid, indiv)

For Schema Evolution

Shape Schema Stability
long_observations Stable (new data = new rows)
long_indicators Stable (new indicators = new rows)
wide_observations Unstable (new measures = new columns)
wide_time_columns Unstable (new times = new columns)
series_column Stable (new times = longer arrays)
microdata Semi-stable (new questions = new columns, new respondents = new rows)

Shape to DatasetKind Mapping

Datasculpt maps shapes to Invariant's DatasetKind:

Shape DatasetKind
long_observations OBSERVATIONS
long_indicators INDICATORS_LONG
wide_observations OBSERVATIONS
wide_time_columns TIMESERIES_WIDE
series_column TIMESERIES_SERIES
microdata MICRODATA

Configuration

Tune shape detection:

from datasculpt.core.types import InferenceConfig

config = InferenceConfig(
    min_time_columns_for_wide=3,    # Require 3+ date headers for wide_time_columns
    hypothesis_confidence_gap=0.1,  # Gap for ambiguity detection
)

See Also