Mental Model

Where Datasculpt fits and how it thinks.

Where Datasculpt Sits

flowchart TD
    A[Raw Data] --> B[Datasculpt]
    B --> C[Invariant]
    C --> D[Catalog]

Datasculpt works upstream of catalogs, semantic layers, and governance engines. It produces the structural metadata these systems assume exists.

The Problem

Most data systems assume structural understanding exists but don't produce it.

When you load a CSV into a semantic layer, it assumes you know:

  • Is this long or wide format?
  • Which columns form the unique key?
  • Which columns are dimensions vs measures?

If you guess wrong, errors are silent:

  • Joins break without error messages
  • Aggregations produce wrong numbers
  • Metrics drift over time

The Solution

Datasculpt infers and explains structural intent. It makes implicit assumptions explicit.

The Three Outputs

1. Shape

What structural pattern does the data follow?

| Shape | Description | Example |
|---|---|---|
| long_observations | Rows are atomic observations | Survey responses |
| long_indicators | Unpivoted indicator/value pairs | Statistical data |
| wide_observations | Measures as columns | Spreadsheets |
| wide_time_columns | Time periods in headers | Yearly data |
| series_column | Arrays/objects in cells | Embedded time series |
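
To make the difference between shapes concrete, the sketch below holds the same facts first as wide_observations and then as long_indicators. It is plain Python, not Datasculpt's API; the column names and the `unpivot` helper are invented for illustration.

```python
# Illustrative only: plain Python, not Datasculpt's API.
# Column names (country, year, gdp, population) are invented for the example.

# wide_observations: each measure has its own column.
wide_rows = [
    {"country": "FR", "year": 2020, "gdp": 2.6, "population": 67.4},
    {"country": "DE", "year": 2020, "gdp": 3.8, "population": 83.2},
]


def unpivot(rows, id_columns):
    """Turn wide rows (measures as columns) into indicator/value pairs."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col in id_columns:
                continue
            # One output row per (identifier columns, measure) pair.
            long_rows.append({**ids, "indicator": col, "value": value})
    return long_rows


# long_indicators: one row per indicator/value pair.
long_rows = unpivot(wide_rows, id_columns=["country", "year"])
```

Two wide rows with two measures each become four long rows. The information is the same, but the key changes: the long form needs the indicator column to identify a row.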

2. Grain

What uniquely identifies each row?

The grain is the minimal set of columns whose combined values are unique for every row. Most data errors are grain errors: joins that silently duplicate rows, aggregations that double-count.
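
The definition of a minimal unique column set can be sketched as a brute-force search. This is an assumption about how one might test grain candidates, not a description of Datasculpt's implementation; `find_grain` and the sample rows are hypothetical.

```python
from itertools import combinations


def find_grain(rows, columns):
    """Return the smallest column combination whose values are unique per row.

    Brute force: tries every 1-column set, then every 2-column set, and so on.
    Returns None if even all candidate columns together contain duplicates.
    """
    for size in range(1, len(columns) + 1):
        for candidate in combinations(columns, size):
            seen = {tuple(row[col] for col in candidate) for row in rows}
            if len(seen) == len(rows):
                return candidate
    return None


rows = [
    {"country": "FR", "year": 2020, "gdp": 2.6},
    {"country": "FR", "year": 2021, "gdp": 2.9},
    {"country": "DE", "year": 2020, "gdp": 3.8},
    {"country": "DE", "year": 2021, "gdp": 4.2},
]

# Only non-measure columns are offered as candidates here: a measure such as
# gdp can be unique by coincidence without being a meaningful key.
grain = find_grain(rows, ["country", "year"])
```

Neither `country` nor `year` is unique on its own, so the search returns the pair. The search is exponential in the number of candidate columns, so it only suits small candidate sets.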

3. Column Roles

What purpose does each column serve?

| Role | Description |
|---|---|
| key | Contributes to uniqueness |
| dimension | Categorical grouping |
| measure | Numeric, aggregatable |
| time | Temporal dimension |
| indicator_name | Names in unpivoted data |
| value | Values in unpivoted data |
| series | Embedded time series |
| metadata | Descriptive, non-analytical |

Evidence → Hypotheses → Decision

Datasculpt uses a three-phase process:

Phase 1: Evidence

Extract facts about each column:

  • Primitive type (string, integer, number, boolean, date)
  • Structural type (scalar, array, object)
  • Statistics (null rate, cardinality, value distribution)
  • Parse results (date parsing, JSON detection)

Evidence is objective — it's what we observe, not what we interpret.
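
As a rough sketch of what evidence for one column might look like, the code below computes a primitive type, null rate, and cardinality from raw values. The `ColumnEvidence` class and `collect_evidence` function are hypothetical names, and the real evidence record likely carries more fields (structural type, parse results).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ColumnEvidence:
    """Hypothetical evidence record; not Datasculpt's actual type."""
    name: str
    primitive_type: str  # "string", "integer", "number", "boolean", "date", "unknown"
    null_rate: float
    cardinality: int


def primitive_type_of(value):
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, date):
        return "date"
    if isinstance(value, str):
        return "string"
    return "unknown"


def collect_evidence(name, values):
    non_null = [v for v in values if v is not None]
    type_counts = Counter(primitive_type_of(v) for v in non_null)
    # Most frequently observed type; "unknown" when the column is entirely null.
    dominant = type_counts.most_common(1)[0][0] if type_counts else "unknown"
    return ColumnEvidence(
        name=name,
        primitive_type=dominant,
        null_rate=(len(values) - len(non_null)) / len(values) if values else 0.0,
        cardinality=len(set(non_null)),
    )


evidence = collect_evidence("year", [2020, 2021, None, 2020])
```

Nothing here interprets the column. Whether `year` is a time dimension or part of the key is left to the hypothesis phase.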

Phase 2: Hypotheses

Score competing interpretations:

  • Each shape hypothesis gets a score (0.0 to 1.0)
  • Each column role gets a score
  • Grain candidates are tested for uniqueness

Hypotheses are ranked, not binary. We don't pick one and throw away alternatives.
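
Ranking rather than picking can be illustrated with a few lines of Python. The scores below are made up, and `rank_hypotheses` is a hypothetical helper; how Datasculpt actually derives scores from evidence is not shown here.

```python
def rank_hypotheses(scores):
    """Sort (name, score) pairs from most to least plausible.

    Ties are broken by name so the ordering is deterministic.
    """
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores for illustration; real scores would come from evidence.
shape_scores = {
    "long_observations": 0.35,
    "long_indicators": 0.10,
    "wide_observations": 0.80,
    "wide_time_columns": 0.72,
    "series_column": 0.05,
}

ranked = rank_hypotheses(shape_scores)
best, runner_up = ranked[0], ranked[1]
```

All five hypotheses survive in `ranked`, so a later step can see that `wide_time_columns` was a close second instead of only learning which shape won.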

Phase 3: Decision

Record the final choice with justification:

  • Selected shape with confidence
  • Alternative shapes that were considered
  • Evidence supporting each choice
  • Questions for ambiguous aspects

Decisions are auditable — you can trace why any choice was made.
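
One way to picture an auditable decision is as a plain record that keeps the selected shape next to the alternatives, evidence, and open questions. The `ShapeDecision` class and its field names are assumptions made for this sketch, not Datasculpt's actual output schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ShapeDecision:
    """Hypothetical decision record; field names are illustrative."""
    selected: str
    confidence: float
    alternatives: dict  # shape name -> score of each rejected hypothesis
    evidence: list = field(default_factory=list)   # facts supporting the choice
    questions: list = field(default_factory=list)  # open points for a reviewer


decision = ShapeDecision(
    selected="wide_observations",
    confidence=0.80,
    alternatives={"wide_time_columns": 0.72, "long_observations": 0.35},
    evidence=[
        "columns gdp and population are numeric",
        "country and year together are unique",
    ],
    questions=["Are gdp and population separate measures?"],
)

# Serialising with sorted keys gives a stable, diffable audit record.
audit_record = json.dumps(asdict(decision), sort_keys=True, indent=2)
```

Because the rejected alternatives and their scores travel with the decision, a reviewer can trace the choice without rerunning the inference.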

Why Determinism Matters

Given identical input and configuration, Datasculpt produces identical output.

  • No LLMs in the decision loop
  • No random sampling
  • No hidden state

This means:

  • Results are reproducible
  • Tests are reliable
  • Debugging is tractable
  • Trust is earned, not assumed
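
A determinism guarantee is easy to check in a test: run the same inference twice and compare fingerprints of the output. The sketch below uses a stand-in function, `infer_stub`, because the real call signature is not given here; only the fingerprinting pattern is the point.

```python
import hashlib
import json


def fingerprint(result):
    """Hash a JSON-serialisable result in a canonical form."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def infer_stub(rows):
    # Stand-in for an inference call; NOT Datasculpt's API.
    columns = sorted({col for row in rows for col in row})
    return {"columns": columns, "row_count": len(rows)}


rows = [{"country": "FR", "year": 2020}, {"country": "DE", "year": 2020}]

first = fingerprint(infer_stub(rows))
second = fingerprint(infer_stub(rows))

# Identical input and configuration should give an identical fingerprint.
assert first == second
```

Sorting keys before hashing matters: without a canonical form, two equal results could serialise differently and look like a determinism failure.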

When to Use Interactive Mode

Use interactive=True when:

  • You want to review ambiguous decisions
  • The domain requires human confirmation
  • You're building a registration workflow

In interactive mode, Datasculpt generates questions for:

  • Ambiguous shape (close scores between top hypotheses)
  • Low-confidence grain (uniqueness ratio < 1.0)
  • Uncertain role assignments
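
The first two triggers can be sketched as simple checks. The function names, the `margin` threshold of 0.1, and the question wording are all assumptions for illustration; Datasculpt's actual thresholds and question format may differ.

```python
def uniqueness_ratio(rows, grain):
    """Distinct grain-key combinations divided by total rows."""
    if not rows:
        return 1.0
    keys = {tuple(row[col] for col in grain) for row in rows}
    return len(keys) / len(rows)


def questions_for(ranked_shapes, rows, grain, margin=0.1):
    """Build review questions. `margin` is an assumed threshold.

    `ranked_shapes` is a list of (name, score) pairs, best first,
    with at least two entries.
    """
    questions = []
    (best, best_score), (second, second_score) = ranked_shapes[0], ranked_shapes[1]
    if best_score - second_score < margin:
        questions.append(f"Shape is ambiguous: {best} or {second}?")
    ratio = uniqueness_ratio(rows, grain)
    if ratio < 1.0:
        questions.append(
            f"Grain {list(grain)} has uniqueness ratio {ratio:.2f}. Is it correct?"
        )
    return questions


rows = [
    {"country": "FR", "year": 2020},
    {"country": "FR", "year": 2020},  # duplicate key
    {"country": "DE", "year": 2020},
    {"country": "DE", "year": 2021},
]
ranked = [("wide_observations", 0.80), ("wide_time_columns", 0.72)]
questions = questions_for(ranked, rows, grain=("country", "year"))
```

With these inputs both checks fire: the top two shapes are within the margin, and the duplicate row pulls the uniqueness ratio below 1.0.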

Next Steps

  • Examples — See these concepts in action
  • Concepts — Deep dive into each concept