Quickstart
Run your first dataset inference in 5 minutes.
What You're About to Do
- Install Datasculpt
- Run inference on a CSV file
- Examine the output: shape, grain, and column roles
- See the decision record explaining each choice
Install
Run Inference
Create a sample CSV file:
geo_id,sex,age_group,population,unemployed,unemployment_rate
ZA-GP,F,15-24,1200000,180000,0.15
ZA-WC,F,15-24,600000,75000,0.125
ZA-GP,M,15-24,1150000,160000,0.139
ZA-WC,M,15-24,580000,70000,0.121
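If you would rather create the file from Python than from an editor, a minimal sketch follows. The filename `sample.csv` and the helper `write_sample` are assumptions made for illustration; they are not part of Datasculpt.

```python
from pathlib import Path

# The same rows as the sample above, kept verbatim.
SAMPLE_CSV = """\
geo_id,sex,age_group,population,unemployed,unemployment_rate
ZA-GP,F,15-24,1200000,180000,0.15
ZA-WC,F,15-24,600000,75000,0.125
ZA-GP,M,15-24,1150000,160000,0.139
ZA-WC,M,15-24,580000,70000,0.121
"""


def write_sample(path="sample.csv"):
    """Write the sample rows to disk and return the resulting path.

    The default filename is hypothetical; any writable path works.
    """
    target = Path(path)
    target.write_text(SAMPLE_CSV, encoding="utf-8")
    return target


csv_path = write_sample()
# Show the header row to confirm the file was written.
print(csv_path.read_text(encoding="utf-8").splitlines()[0])
```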
Run inference:
Examine the Output
Shape
Datasculpt detected this as wide observations — a spreadsheet-style format where each row is an observation with measures as columns.
Grain
>>> result.decision_record.grain.key_columns
['geo_id', 'sex', 'age_group']
>>> result.decision_record.grain.uniqueness_ratio
1.0
>>> result.decision_record.grain.confidence
0.95
The grain is the minimal set of columns that uniquely identifies each row. Here, the combination of geo_id, sex, and age_group uniquely identifies each observation.
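To make the uniqueness check concrete, here is a standard-library sketch that computes a uniqueness ratio for a candidate key. It assumes the ratio means distinct key tuples divided by row count; this page does not say how Datasculpt computes it, and the function name `uniqueness_ratio` is hypothetical.

```python
import csv
import io

SAMPLE = """\
geo_id,sex,age_group,population,unemployed,unemployment_rate
ZA-GP,F,15-24,1200000,180000,0.15
ZA-WC,F,15-24,600000,75000,0.125
ZA-GP,M,15-24,1150000,160000,0.139
ZA-WC,M,15-24,580000,70000,0.121
"""


def uniqueness_ratio(rows, key_columns):
    """Distinct key tuples divided by row count (assumed definition)."""
    if not rows:
        return 0.0
    keys = {tuple(row[col] for col in key_columns) for row in rows}
    return len(keys) / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

full_key = uniqueness_ratio(rows, ["geo_id", "sex", "age_group"])
geo_only = uniqueness_ratio(rows, ["geo_id"])
print(full_key)  # 1.0: every row has a distinct key
print(geo_only)  # 0.5: two regions across four rows
```

A ratio of 1.0 means no two rows share the same key values, while anything lower means the candidate key leaves some rows indistinguishable.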
Column Roles
>>> for col in result.proposal.columns:
...     print(f"{col.name}: {col.role.value}")
geo_id: dimension
sex: dimension
age_group: dimension
population: measure
unemployed: measure
unemployment_rate: measure
Datasculpt assigned roles based on:
- Dimensions: Categorical columns with low cardinality
- Measures: Numeric columns with high cardinality
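The rule of thumb above can be illustrated with a toy classifier. This is a sketch, not Datasculpt's scoring: the function `guess_role` and its `threshold` of 0.75 are hypothetical, and the real library scores each column against several roles rather than applying a single cutoff.

```python
import csv
import io

SAMPLE = """\
geo_id,sex,age_group,population,unemployed,unemployment_rate
ZA-GP,F,15-24,1200000,180000,0.15
ZA-WC,F,15-24,600000,75000,0.125
ZA-GP,M,15-24,1150000,160000,0.139
ZA-WC,M,15-24,580000,70000,0.121
"""


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_role(values, threshold=0.75):
    """Toy rule: numeric and mostly distinct -> measure, else dimension.

    The threshold is an illustrative assumption, not a Datasculpt setting.
    """
    if not values:
        return "dimension"
    distinct_ratio = len(set(values)) / len(values)
    if all(is_number(v) for v in values) and distinct_ratio >= threshold:
        return "measure"
    return "dimension"


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
roles = {name: guess_role([row[name] for row in rows]) for name in rows[0]}
for name, role in roles.items():
    print(f"{name}: {role}")
```

On the sample file this toy rule happens to agree with the roles listed above, but it would misclassify cases such as numeric identifiers, which is one reason a scored, multi-role approach is more robust.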
See the Evidence
Every decision is backed by evidence:
>>> evidence = result.decision_record.column_evidence["population"]
>>> evidence.primitive_type
<PrimitiveType.INTEGER: 'integer'>
>>> evidence.distinct_ratio
1.0
>>> evidence.role_scores
{<Role.MEASURE: 'measure'>: 0.85, <Role.KEY: 'key'>: 0.15, ...}
View Ranked Hypotheses
Datasculpt doesn't just pick a shape — it ranks all candidates:
>>> for h in result.decision_record.hypotheses:
...     print(f"{h.hypothesis.value}: {h.score:.2f}")
wide_observations: 0.72
long_observations: 0.65
long_indicators: 0.20
wide_time_columns: 0.10
series_column: 0.05
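One way to read a ranking like this is to look at the gap between the top two candidates. The sketch below uses a plain dictionary holding the scores printed above as a stand-in for the hypothesis objects; the helper `top_two_margin` is hypothetical, and this page does not say whether Datasculpt itself uses such a margin.

```python
# Scores copied from the output above, as a stand-in for the real objects.
scores = {
    "wide_observations": 0.72,
    "long_observations": 0.65,
    "long_indicators": 0.20,
    "wide_time_columns": 0.10,
    "series_column": 0.05,
}


def top_two_margin(scores):
    """Return the best-scoring name and its lead over the runner-up."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    return best, best_score - runner_up_score


best, margin = top_two_margin(scores)
print(best, round(margin, 2))  # wide_observations 0.07
```

A small lead, as here, suggests the two leading shapes are close calls and that the runner-up is worth inspecting.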
Handle Ambiguous Datasets
When Datasculpt isn't confident, it generates questions:
from datasculpt import infer  # assumed to be a top-level export, like apply_answers

result = infer("ambiguous.csv", interactive=True)
if result.pending_questions:
    for q in result.pending_questions:
        print(q.prompt)
        print(f"  Choices: {[c['label'] for c in q.choices]}")
Provide answers to resolve ambiguity:
from datasculpt import apply_answers
answers = {result.pending_questions[0].id: "long_observations"}
result = apply_answers(result, answers)
What Just Happened
Datasculpt ran an 8-stage pipeline. The stages include:
- Evidence extraction: Analyzed each column's type, cardinality, null rate, and value distribution
- Role scoring: Scored each column against 8 possible roles
- Shape detection: Ranked 5 shape hypotheses
- Grain inference: Found the minimal unique key
- Question generation: Created questions for ambiguous aspects
- Decision recording: Captured the full audit trail
- Proposal generation: Produced output ready for registration
Next Steps
- Mental Model — Understand the core concepts
- Examples — See inference on different dataset shapes
- API Reference — Full function signatures