Quickstart

Run your first dataset inference in 5 minutes.

What You're About to Do

Install Datasculpt
Run inference on a CSV file
Examine the output: shape, grain, and column roles
See the decision record explaining each choice

Install

pip install datasculpt

Run Inference

Create a sample CSV file named demographics.csv:

geo_id,sex,age_group,population,unemployed,unemployment_rate
ZA-GP,F,15-24,1200000,180000,0.15
ZA-WC,F,15-24,600000,75000,0.125
ZA-GP,M,15-24,1150000,160000,0.139
ZA-WC,M,15-24,580000,70000,0.121
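If you would rather create the file from Python than paste it into an editor, the following is a minimal sketch using only the standard library. The helper name `write_sample` is made up for this example and is not part of Datasculpt; the rows are copied from the sample above.

```python
import csv
from pathlib import Path

# Header and rows copied from the sample above.
HEADER = ["geo_id", "sex", "age_group", "population", "unemployed", "unemployment_rate"]
ROWS = [
    ["ZA-GP", "F", "15-24", 1200000, 180000, 0.15],
    ["ZA-WC", "F", "15-24", 600000, 75000, 0.125],
    ["ZA-GP", "M", "15-24", 1150000, 160000, 0.139],
    ["ZA-WC", "M", "15-24", 580000, 70000, 0.121],
]


def write_sample(path="demographics.csv"):
    """Write the sample rows to `path` and return it as a Path (hypothetical helper)."""
    target = Path(path)
    # newline="" lets the csv module control line endings itself.
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return target


sample_path = write_sample()
```

The file lands in the current working directory, which is where the `infer("demographics.csv")` call in the next step looks for it.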

Run inference:

from datasculpt import infer

result = infer("demographics.csv")

Examine the Output

Shape

>>> result.proposal.shape_hypothesis
<ShapeHypothesis.WIDE_OBSERVATIONS: 'wide_observations'>

Datasculpt detected this as wide observations: a spreadsheet-style format in which each row is one observation and each measure is a separate column.

Grain

>>> result.decision_record.grain.key_columns
['geo_id', 'sex', 'age_group']

>>> result.decision_record.grain.uniqueness_ratio
1.0

>>> result.decision_record.grain.confidence
0.95

The grain is the minimal set of columns that uniquely identifies each row. Here, the combination of geo_id, sex, and age_group uniquely identifies each observation.
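To make the idea of a uniqueness ratio concrete, here is a minimal sketch that computes one by hand for the sample data. It assumes the ratio is the number of distinct key tuples divided by the number of rows, which is one natural reading. The function below is illustrative only and is not Datasculpt's implementation.

```python
import csv
import io

# The sample CSV from this page, inlined so the sketch is self-contained.
SAMPLE = """\
geo_id,sex,age_group,population,unemployed,unemployment_rate
ZA-GP,F,15-24,1200000,180000,0.15
ZA-WC,F,15-24,600000,75000,0.125
ZA-GP,M,15-24,1150000,160000,0.139
ZA-WC,M,15-24,580000,70000,0.121
"""


def uniqueness_ratio(rows, key_columns):
    """Distinct key tuples divided by row count (assumed definition)."""
    if not rows:
        return 0.0
    keys = {tuple(row[col] for col in key_columns) for row in rows}
    return len(keys) / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(uniqueness_ratio(rows, ["geo_id", "sex", "age_group"]))  # 1.0
print(uniqueness_ratio(rows, ["geo_id"]))                      # 0.5
```

Under this definition, a ratio of 1.0 means no two rows share the same key values, while geo_id alone scores 0.5 because each region appears twice.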

Column Roles

>>> for col in result.proposal.columns:
...     print(f"{col.name}: {col.role.value}")
geo_id: dimension
sex: dimension
age_group: dimension
population: measure
unemployed: measure
unemployment_rate: measure

Datasculpt assigned roles based on:

- Dimensions: Categorical columns with low cardinality
- Measures: Numeric columns with high cardinality
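The two heuristics above can be sketched as a small standalone function. This is a simplified illustration, not Datasculpt's scoring logic: the `guess_role` name, the 0.5 cardinality threshold, and the "unknown" fallback are all assumptions made for the example.

```python
def is_numeric(value):
    """True if the raw CSV string parses as a number."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def guess_role(values, threshold=0.5):
    """Toy heuristic: numeric + high cardinality -> measure,
    non-numeric + low cardinality -> dimension (threshold is hypothetical)."""
    if not values:
        return "unknown"
    distinct_ratio = len(set(values)) / len(values)
    numeric = all(is_numeric(v) for v in values)
    if numeric and distinct_ratio > threshold:
        return "measure"
    if not numeric and distinct_ratio <= threshold:
        return "dimension"
    return "unknown"


# A few columns from the sample CSV, as raw strings.
columns = {
    "geo_id": ["ZA-GP", "ZA-WC", "ZA-GP", "ZA-WC"],
    "age_group": ["15-24", "15-24", "15-24", "15-24"],
    "population": ["1200000", "600000", "1150000", "580000"],
    "unemployment_rate": ["0.15", "0.125", "0.139", "0.121"],
}
roles = {name: guess_role(values) for name, values in columns.items()}
print(roles)
```

On a four-row sample, cardinality ratios are coarse, so a real classifier would need more signals than this; the sketch only shows how type and cardinality pull a column toward one role or the other.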

See the Evidence

Every decision is backed by evidence:

>>> evidence = result.decision_record.column_evidence["population"]
>>> evidence.primitive_type
<PrimitiveType.INTEGER: 'integer'>

>>> evidence.distinct_ratio
1.0

>>> evidence.role_scores
{<Role.MEASURE: 'measure'>: 0.85, <Role.KEY: 'key'>: 0.15, ...}
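To see where values like these could come from, here is a rough sketch that derives a primitive type and a distinct ratio for the population column using plain Python. The helper names and the three type labels are assumptions for illustration; Datasculpt's own evidence extraction may work differently.

```python
def primitive_type(values):
    """Return 'integer', 'float', or 'string' for a list of raw CSV strings."""

    def all_parse(cast):
        # True only if every value converts without error.
        try:
            for value in values:
                cast(value)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def distinct_ratio(values):
    """Distinct values divided by total values (0.0 for an empty column)."""
    return len(set(values)) / len(values) if values else 0.0


population = ["1200000", "600000", "1150000", "580000"]
print(primitive_type(population), distinct_ratio(population))  # integer 1.0
```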

View Ranked Hypotheses

Datasculpt doesn't just pick a shape; it ranks all candidates:

>>> for h in result.decision_record.hypotheses:
...     print(f"{h.hypothesis.value}: {h.score:.2f}")
wide_observations: 0.72
long_observations: 0.65
long_indicators: 0.20
wide_time_columns: 0.10
series_column: 0.05
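The scores above do not sum to 1, so they are better read as independent scores than as probabilities. One way to gauge how decisive a ranking is involves looking at the gap between the top two candidates. The sketch below does this with the numbers from the output above; the 0.10 threshold and the idea of using a margin at all are assumptions for illustration, not documented Datasculpt behavior.

```python
# Scores copied from the ranked output above.
scores = {
    "wide_observations": 0.72,
    "long_observations": 0.65,
    "long_indicators": 0.20,
    "wide_time_columns": 0.10,
    "series_column": 0.05,
}


def top_two_margin(scores):
    """Return the best-scoring name and its lead over the runner-up."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[:2]
    return best, best_score - runner_up_score


MARGIN_THRESHOLD = 0.10  # hypothetical cutoff, chosen only for this example

best, margin = top_two_margin(scores)
is_ambiguous = margin < MARGIN_THRESHOLD
print(best, round(margin, 2), is_ambiguous)  # wide_observations 0.07 True
```

A narrow gap like this one is the kind of situation in which asking the user a question is more useful than silently picking the leader.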

Handle Ambiguous Datasets

When Datasculpt isn't confident, it generates questions:

result = infer("ambiguous.csv", interactive=True)

if result.pending_questions:
    for q in result.pending_questions:
        print(q.prompt)
        print(f"  Choices: {[c['label'] for c in q.choices]}")

Provide answers to resolve ambiguity:

from datasculpt import apply_answers

answers = {result.pending_questions[0].id: "long_observations"}
result = apply_answers(result, answers)

What Just Happened

Datasculpt ran an 8-stage pipeline:

Input → Evidence → Roles → Shape → Grain → Questions → Decision → Proposal

Evidence extraction: Analyzed each column's type, cardinality, null rate, and value distribution
Role scoring: Scored each column against 8 possible roles
Shape detection: Ranked 5 shape hypotheses
Grain inference: Found the minimal unique key
Question generation: Created questions for ambiguous aspects
Decision recording: Captured the full audit trail
Proposal generation: Produced output ready for registration
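As a mental model for the stages listed above, the toy runner below passes a state dictionary through placeholder stages in order and records what each one added. Everything here is hypothetical scaffolding: the stage bodies are stand-ins, and the real pipeline's internals are not shown on this page.

```python
# Stage names taken from the list above; the bodies below are placeholders.
STAGE_NAMES = ["evidence", "roles", "shape", "grain", "questions", "decision", "proposal"]


def make_stage(key):
    """Build a placeholder stage that stores a marker under `key`."""

    def stage(state):
        # Return a new dict so earlier states are never mutated.
        return {**state, key: f"<{key} output>"}

    return stage


def run_pipeline(source):
    """Run every stage in order, keeping a trail of (stage, keys added)."""
    state = {"input": source}
    trail = []
    for name in STAGE_NAMES:
        before = set(state)
        state = make_stage(name)(state)
        trail.append((name, sorted(set(state) - before)))
    return state, trail


state, trail = run_pipeline("demographics.csv")
print([name for name, _ in trail])
```

The point of the trail is that each later stage can only build on what earlier stages produced, and the record of who added what is available afterwards for auditing.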

Next Steps

Mental Model: Understand the core concepts
Examples: See inference on different dataset shapes
API Reference: Full function signatures