Design Principles

These are the core principles that guide Datasculpt's design.

1. Determinism First

Given identical input and configuration, Datasculpt produces identical output.

What This Means

No randomness in the inference pipeline
No LLMs making decisions (any LLM output is advisory only)
No environment-dependent behavior
Same file → same fingerprint → same decision

Why It Matters

Reproducibility: Results can be verified and recreated
Testing: Tests are reliable and don't flake
Debugging: Issues can be reproduced
Trust: Users know what to expect

Implementation

# Bad: Random sampling
sample = df.sample(1000)  # Different each run

# Good: Deterministic sampling
sample = df.head(1000)  # Same each run

# Bad: Order-dependent
for col in df.columns:  # Order depends on the input's column order
    process(col)

# Good: Sorted order
for col in sorted(df.columns):  # Consistent order
    process(col)
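
The "same file → same fingerprint" property can be sketched with a content hash. This is a minimal illustration, not Datasculpt's actual fingerprinting routine, which this page does not describe; `file_fingerprint` and the sample bytes are hypothetical.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Return a hex digest that depends only on the bytes given.

    Hypothetical helper: no clock, random seed, or environment
    variable is consulted, so identical input gives identical output.
    """
    return hashlib.sha256(data).hexdigest()


# Illustrative input only.
csv_bytes = b"geo_id,date,indicator,value\nA,2020,population,1.0\n"

first = file_fingerprint(csv_bytes)
second = file_fingerprint(csv_bytes)
changed = file_fingerprint(csv_bytes + b"B,2020,population,2.0\n")
```

Here `first` and `second` are equal, while `changed` differs because the content differs.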

2. Evidence, Not Authority

Every inference is scored and justified with evidence.

What This Means

No "because I said so" decisions
Every choice has a confidence score
Alternatives are preserved, not discarded
Users can see why any decision was made

Why It Matters

Auditability: Trace any decision to evidence
Trust: Users can verify reasoning
Debugging: Understand why a decision was wrong
Override: Users can correct decisions using their own knowledge

Implementation

# Bad: Binary decision
if looks_like_indicator:
    return Role.INDICATOR_NAME

# Good: Scored decision
role_scores = {
    Role.INDICATOR_NAME: 0.85,
    Role.DIMENSION: 0.45,
    Role.METADATA: 0.10,
}
return max(role_scores, key=role_scores.get), role_scores
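
A scored decision can also carry the reasons behind it, so the choice stays traceable to evidence. The sketch below is an assumption about how that might look: `RoleDecision`, `decide_role`, the enum values, and the reason text are all hypothetical, and only three of the roles are listed.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Member names come from the snippet above; the values are assumed.
    INDICATOR_NAME = "indicator_name"
    DIMENSION = "dimension"
    METADATA = "metadata"


@dataclass
class RoleDecision:
    """Hypothetical record: the winner, every score, and the reasons."""
    selected: Role
    scores: dict
    reasons: list = field(default_factory=list)


def decide_role(scores, reasons):
    # Sort by descending score, then by name, so ties resolve the same way every run.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0].name))
    return RoleDecision(selected=ranked[0][0], scores=dict(scores), reasons=list(reasons))


decision = decide_role(
    {Role.INDICATOR_NAME: 0.85, Role.DIMENSION: 0.45, Role.METADATA: 0.10},
    ["column holds a small set of repeated string values"],  # illustrative reason
)
```

The alternatives stay available in `decision.scores` instead of being discarded.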

3. Shape Before Semantics

Focus on structure, not meaning.

What This Means

Datasculpt determines layout (long vs wide)
Datasculpt determines roles (dimension vs measure)
Datasculpt determines grain (unique key)
Datasculpt does NOT determine meaning (what "population" means)

Why It Matters

Scope clarity: Clear boundary with Invariant
Universality: Works without domain knowledge
Simplicity: Fewer assumptions to get wrong

The Boundary

Datasculpt:
  ✓ This is long_indicators shape
  ✓ "indicator" column is the indicator_name
  ✓ "value" column is the value
  ✓ Grain is (geo_id, date, indicator)

Invariant:
  ✓ "population" indicator means total headcount
  ✓ "population" is comparable across years
  ✓ "population" should not be summed across geographies
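
Grain is a purely structural claim, so it can be checked without knowing what any column means. A minimal sketch, assuming rows are plain dictionaries; `is_unique_key` is a hypothetical helper and the rows are made-up sample data.

```python
def is_unique_key(rows, key_columns):
    """Return True if no two rows share the same values for key_columns."""
    seen = set()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            return False
        seen.add(key)
    return True


# Made-up rows in long_indicators shape.
rows = [
    {"geo_id": "A", "date": "2020", "indicator": "population", "value": 1.0},
    {"geo_id": "A", "date": "2020", "indicator": "area", "value": 2.0},
    {"geo_id": "B", "date": "2020", "indicator": "population", "value": 3.0},
]

full_grain = is_unique_key(rows, ("geo_id", "date", "indicator"))
partial_grain = is_unique_key(rows, ("geo_id", "date"))
```

`full_grain` is true because every (geo_id, date, indicator) triple is distinct; `partial_grain` is false because two rows share ("A", "2020"). Neither check needs to know what "population" means.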

4. Minimal Core

Core functionality requires only pandas.

What This Means

pip install datasculpt → works immediately
Heavy dependencies are optional adapters
Core is fast and lightweight

Why It Matters

Adoption: Low barrier to entry
Deployment: Minimal container size
Maintenance: Fewer version conflicts
Testing: Fast test suite

Implementation

datasculpt/
├── core/           # Only pandas
│   ├── evidence.py
│   ├── roles.py
│   └── ...
└── adapters/       # Optional deps
    ├── frictionless_adapter.py  # requires frictionless
    └── dataprofiler_adapter.py  # requires dataprofiler
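
One common way to keep adapter dependencies optional is to import them only when the adapter is used and fail with a clear message otherwise. The sketch below shows that general pattern under assumed names (`has_optional_dependency`, `load_adapter_dependency`); it is not taken from Datasculpt's adapters.

```python
import importlib
import importlib.util


def has_optional_dependency(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def load_adapter_dependency(module_name: str):
    """Import an optional module on demand, or explain what is missing."""
    if not has_optional_dependency(module_name):
        raise ImportError(
            f"This adapter requires the optional module '{module_name}'. "
            "Install it to enable the adapter; the core works without it."
        )
    return importlib.import_module(module_name)


# The standard-library 'json' module stands in for an optional dependency here.
json_module = load_adapter_dependency("json")
```

Because the import happens inside the function, importing the core never touches the optional package.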

5. Multi-Candidate Scoring

Rank alternatives instead of making binary choices.

What This Means

Shape detection scores all 5 hypotheses
Role assignment scores all 8 roles
Grain inference tests multiple candidates
Ambiguity is surfaced, not hidden

Why It Matters

Visibility: See what was considered
Debugging: Understand close calls
Confidence: Know when uncertain
Interactive: Present options to users

Implementation

# Bad: First match wins
for shape in shapes:
    if matches(shape):
        return shape

# Good: Score all, rank
scores = {shape: score_shape(shape) for shape in shapes}
ranked = sorted(scores.items(), key=lambda x: -x[1])
selected = ranked[0][0]
is_ambiguous = ranked[0][1] - ranked[1][1] < threshold
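
The ranking fragment above can be made self-contained as follows. This is a sketch under assumptions: `rank_candidates` is a hypothetical name, the default threshold of 0.1 and the scores are illustrative, and ties are broken by candidate name to keep the result deterministic.

```python
def rank_candidates(scores, threshold=0.1):
    """Rank candidates by score and flag a close call between the top two."""
    if not scores:
        raise ValueError("at least one candidate is required")
    # Descending score, then name, so equal scores always rank the same way.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    selected = ranked[0][0]
    # A single candidate cannot be ambiguous.
    is_ambiguous = len(ranked) > 1 and (ranked[0][1] - ranked[1][1]) < threshold
    return selected, is_ambiguous, ranked


close_call = rank_candidates({"long_indicators": 0.62, "wide_observations": 0.58})
clear_call = rank_candidates({"long_indicators": 0.90, "wide_observations": 0.30})
```

In `close_call` the top two scores are within the threshold, so the ambiguity flag is set; in `clear_call` it is not. Both return the full ranking so the runner-up stays visible.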

6. Reversible Decisions

Users can override any inference.

What This Means

Interactive mode generates questions
Answers override automated decisions
Overrides are recorded in the decision record

Why It Matters

Control: Users have final say
Domain knowledge: Humans know context
Edge cases: Handle what automation can't

Implementation

# Automated decision
result = infer("data.csv")
# shape = wide_observations (automated)

# User override
answers = {question.id: "long_indicators"}
result = apply_answers(result, answers)
# shape = long_indicators (user choice)
# decision_record.answers = {"q_123": "long_indicators"}
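
To illustrate how an override can be applied and recorded without losing the automated result, here is a simplified stand-in. `InferenceResult` and `apply_shape_answer` are hypothetical; the real `infer` and `apply_answers` signatures are not documented on this page.

```python
from dataclasses import dataclass, field


@dataclass
class InferenceResult:
    """Hypothetical result: the chosen shape plus the answers that led to it."""
    shape: str
    answers: dict = field(default_factory=dict)


def apply_shape_answer(result, question_id, answer):
    # Build a new result instead of mutating, so the automated decision
    # remains available for comparison.
    return InferenceResult(
        shape=answer,
        answers={**result.answers, question_id: answer},
    )


automated = InferenceResult(shape="wide_observations")
overridden = apply_shape_answer(automated, "q_123", "long_indicators")
```

`overridden.answers` records which question produced the change, while `automated` is left untouched.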

7. Explicit Over Implicit

Surface assumptions rather than hiding them.

What This Means

Warnings for low confidence
Required confirmations for risky decisions
Diagnostics for edge cases
Notes in evidence explaining signals

Why It Matters

No surprises: Users know what's uncertain
Trust: System is transparent
Debugging: Issues are visible

Implementation

# Bad: Silent fallback
if confidence < 0.8:
    return default_grain

# Good: Explicit warning
if confidence < 0.8:
    proposal.warnings.append(
        f"Grain confidence is {confidence:.2f}. "
        "Consider verifying the unique key columns."
    )
    return inferred_grain
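
A runnable version of the warning pattern might look like the sketch below. `GrainProposal` and `propose_grain` are hypothetical names, and the 0.8 cutoff is taken from the fragment above as a default, not as a documented setting.

```python
from dataclasses import dataclass, field


@dataclass
class GrainProposal:
    """Hypothetical proposal: the inferred grain plus any surfaced warnings."""
    grain: tuple
    confidence: float
    warnings: list = field(default_factory=list)


def propose_grain(grain, confidence, min_confidence=0.8):
    proposal = GrainProposal(grain=tuple(grain), confidence=confidence)
    if confidence < min_confidence:
        # Keep the inferred grain, but say out loud that it is uncertain.
        proposal.warnings.append(
            f"Grain confidence is {confidence:.2f}. "
            "Consider verifying the unique key columns."
        )
    return proposal


uncertain = propose_grain(("geo_id", "date", "indicator"), 0.65)
confident = propose_grain(("geo_id", "date", "indicator"), 0.95)
```

Both calls return the same grain; only `uncertain` carries a warning.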