Ambiguous Shape

This example shows what happens when Datasculpt can't confidently distinguish between candidate shapes.

The Problem

Some datasets genuinely fit multiple interpretations. Consider:

region,category,count,total,rate
North,A,100,1000,0.10
North,B,150,1000,0.15
South,A,80,800,0.10
South,B,120,800,0.15

Is this:

- Wide observations with three measures (count, total, rate)?
- Long indicators, if category represents different metrics?

Both are defensible. The data structure alone doesn't tell us.
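To make the two readings concrete, the sketch below loads the sample with Python's standard csv module and builds both views by hand. It does not use Datasculpt. The names wide_view and long_view are illustrative, and taking count as the indicator value in the long reading is an assumption made only for this example.

```python
import csv
import io

SAMPLE = """region,category,count,total,rate
North,A,100,1000,0.10
North,B,150,1000,0.15
South,A,80,800,0.10
South,B,120,800,0.15
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Reading 1: wide observations.
# Each row is one observation keyed by (region, category) with three measures.
wide_view = {
    (r["region"], r["category"]): {
        "count": int(r["count"]),
        "total": int(r["total"]),
        "rate": float(r["rate"]),
    }
    for r in rows
}

# Reading 2: long indicators.
# "category" names the metric, so each region holds one value per indicator.
# Using "count" as the value is an assumption for illustration.
long_view = {}
for r in rows:
    long_view.setdefault(r["region"], {})[r["category"]] = int(r["count"])

print(wide_view[("North", "A")])  # {'count': 100, 'total': 1000, 'rate': 0.1}
print(long_view["North"])         # {'A': 100, 'B': 150}
```

Both views are built from the same five lines of CSV, which is why the structure alone cannot settle the question.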

Detecting Ambiguity

from datasculpt import infer

result = infer("ambiguous.csv", interactive=True)

>>> result.decision_record.hypotheses[:2]
[
    HypothesisScore(
        hypothesis=<ShapeHypothesis.WIDE_OBSERVATIONS>,
        score=0.58,
        reasons=['Multiple numeric columns suggest measures']
    ),
    HypothesisScore(
        hypothesis=<ShapeHypothesis.LONG_INDICATORS>,
        score=0.52,
        reasons=['category column has indicator-like values']
    )
]

The gap between the top two scores is only 0.06, which is below the default threshold of 0.10.

>>> round(result.decision_record.hypotheses[0].score - result.decision_record.hypotheses[1].score, 2)
0.06

>>> from datasculpt.core.types import InferenceConfig
>>> InferenceConfig().hypothesis_confidence_gap
0.1
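The gap rule can be sketched in a few lines. This is a minimal illustration of the comparison described here, not Datasculpt's actual implementation: ScoredHypothesis and is_ambiguous are hypothetical names, and the strict "less than" comparison is an assumption based on the phrase "below the default threshold".

```python
from dataclasses import dataclass


@dataclass
class ScoredHypothesis:
    """Hypothetical stand-in for a scored shape hypothesis."""
    name: str
    score: float


def is_ambiguous(hypotheses, gap_threshold=0.10):
    """Return True when the top two scores are closer than gap_threshold."""
    ranked = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    if len(ranked) < 2:
        # A single hypothesis has no runner-up to be confused with.
        return False
    return (ranked[0].score - ranked[1].score) < gap_threshold


scores = [
    ScoredHypothesis("wide_observations", 0.58),
    ScoredHypothesis("long_indicators", 0.52),
    ScoredHypothesis("long_observations", 0.45),
]
print(is_ambiguous(scores))  # True
```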

Generated Questions

When ambiguity is detected in interactive mode, Datasculpt generates questions:

>>> result.pending_questions
[
    Question(
        id='q_abc12345',
        type=<QuestionType.CHOOSE_ONE>,
        prompt='The dataset shape is ambiguous. Please select the most appropriate shape:',
        choices=[
            {'value': 'wide_observations', 'label': 'Wide Observations', 'score': 0.58},
            {'value': 'long_indicators', 'label': 'Long Indicators', 'score': 0.52},
            {'value': 'long_observations', 'label': 'Long Observations', 'score': 0.45}
        ],
        default='wide_observations',
        rationale='Score gap between top hypotheses is 0.06, below threshold 0.10'
    )
]
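If you want to show such a question to a person, one option is to turn it into a numbered text prompt. The helper below is a hypothetical sketch that works on plain values shaped like the fields shown above (prompt, choices, default). It is not part of Datasculpt.

```python
def render_question(prompt, choices, default):
    """Format a choose-one question as a numbered text prompt."""
    lines = [prompt]
    for number, choice in enumerate(choices, start=1):
        marker = " (default)" if choice["value"] == default else ""
        lines.append(
            f"  {number}. {choice['label']} [score {choice['score']:.2f}]{marker}"
        )
    return "\n".join(lines)


choices = [
    {"value": "wide_observations", "label": "Wide Observations", "score": 0.58},
    {"value": "long_indicators", "label": "Long Indicators", "score": 0.52},
    {"value": "long_observations", "label": "Long Observations", "score": 0.45},
]
text = render_question(
    "The dataset shape is ambiguous. Please select the most appropriate shape:",
    choices,
    default="wide_observations",
)
print(text)
```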

Resolving Ambiguity

Provide an answer to resolve the ambiguity:

from datasculpt import apply_answers

answers = {result.pending_questions[0].id: "long_indicators"}
result = apply_answers(result, answers)

>>> result.proposal.shape_hypothesis
<ShapeHypothesis.LONG_INDICATORS: 'long_indicators'>

>>> result.pending_questions
[]  # No more questions

Multiple Ambiguities

A dataset can have multiple ambiguous aspects:

>>> result.pending_questions
[
    Question(
        id='q_shape_123',
        prompt='The dataset shape is ambiguous...',
        ...
    ),
    Question(
        id='q_grain_456',
        prompt='Please confirm or select the grain...',
        ...
    ),
    Question(
        id='q_role_789',
        prompt='What is the role of column "category"?',
        ...
    )
]

Provide all answers at once:

answers = {
    'q_shape_123': 'long_indicators',
    'q_grain_456': ['region', 'category'],
    'q_role_789': 'indicator_name'
}
result = apply_answers(result, answers)
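Before submitting a batch of answers, it can help to check that every pending question is covered and that no key is misspelled. How apply_answers treats partial or unknown answers is not stated here, so the check below is a defensive sketch. check_answers is a hypothetical helper, not a Datasculpt function.

```python
def check_answers(pending_ids, answers):
    """Return (missing, unknown) question ids for a proposed answers dict."""
    pending = set(pending_ids)
    provided = set(answers)
    missing = sorted(pending - provided)   # questions with no answer yet
    unknown = sorted(provided - pending)   # answer keys matching no question
    return missing, unknown


pending_ids = ["q_shape_123", "q_grain_456", "q_role_789"]
answers = {
    "q_shape_123": "long_indicators",
    "q_grain_456": ["region", "category"],
    "q_typo": "indicator_name",  # misspelled id, for demonstration
}

missing, unknown = check_answers(pending_ids, answers)
print(missing)  # ['q_role_789']
print(unknown)  # ['q_typo']
```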

Tuning Sensitivity

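Adjust the confidence gap threshold to make ambiguity detection more or less sensitive. A larger threshold flags more datasets as ambiguous; a smaller one flags fewer: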

from datasculpt.core.types import InferenceConfig

# More sensitive (more questions)
config = InferenceConfig(hypothesis_confidence_gap=0.15)

# Less sensitive (fewer questions); this assignment replaces the one above
config = InferenceConfig(hypothesis_confidence_gap=0.05)

result = infer("data.csv", config=config, interactive=True)
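One way to choose a threshold is to see how many datasets each candidate value would flag. The sketch below uses made-up (top, runner-up) score pairs and a hypothetical count_flagged helper. It assumes a dataset is flagged when the gap is strictly below the threshold.

```python
def count_flagged(score_pairs, gap_threshold):
    """Count pairs whose top-minus-runner-up gap is below gap_threshold."""
    return sum(
        1 for top, runner_up in score_pairs if (top - runner_up) < gap_threshold
    )


# Hypothetical (top, runner-up) scores from four datasets.
score_pairs = [(0.58, 0.52), (0.80, 0.40), (0.65, 0.53), (0.71, 0.69)]

for threshold in (0.05, 0.10, 0.15):
    print(threshold, count_flagged(score_pairs, threshold))
```

With these sample pairs, the counts are 1, 2, and 3 for thresholds 0.05, 0.10, and 0.15, which matches the direction described above: raising the threshold produces more questions.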

Non-Interactive Mode

Without interactive=True, Datasculpt picks the top-scoring hypothesis but records the ambiguity:

result = infer("ambiguous.csv")  # No interactive flag

>>> result.proposal.shape_hypothesis
<ShapeHypothesis.WIDE_OBSERVATIONS: 'wide_observations'>

>>> result.proposal.warnings
['Shape detection confidence is low (gap: 0.06). Consider manual review.']
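In a batch pipeline, the recorded warnings can be used to route low-confidence results to a person. The sketch below uses SimpleNamespace objects as stand-ins for real results, carrying only the proposal.warnings attribute shown above. needs_review is a hypothetical helper, and treating any warning as a reason for review is an assumption.

```python
from types import SimpleNamespace


def needs_review(result):
    """Flag a result for manual review when its proposal carries warnings."""
    return bool(result.proposal.warnings)


# Stand-ins shaped like the attributes shown above, not real Datasculpt results.
confident = SimpleNamespace(
    proposal=SimpleNamespace(shape_hypothesis="long_observations", warnings=[])
)
uncertain = SimpleNamespace(
    proposal=SimpleNamespace(
        shape_hypothesis="wide_observations",
        warnings=[
            "Shape detection confidence is low (gap: 0.06). Consider manual review."
        ],
    )
)

results = {"clean.csv": confident, "ambiguous.csv": uncertain}
review_queue = [name for name, result in results.items() if needs_review(result)]
print(review_queue)  # ['ambiguous.csv']
```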

When Ambiguity Is Expected

Some datasets are genuinely ambiguous without domain context:

Scenario	Why Ambiguous
Survey data	Rows could be responses or pivoted metrics
Aggregated reports	Multiple interpretations of granularity
ETL staging tables	Intermediate format, not yet normalized
API exports	Format depends on client expectations

In these cases, interactive mode is the right approach: let domain experts resolve the ambiguity.

Decision Record

Even when the shape is ambiguous, the decision record captures the full analysis:

>>> record = result.decision_record
>>> record.selected_hypothesis
<ShapeHypothesis.WIDE_OBSERVATIONS: 'wide_observations'>

>>> record.hypotheses
[
    HypothesisScore(hypothesis=WIDE_OBSERVATIONS, score=0.58, ...),
    HypothesisScore(hypothesis=LONG_INDICATORS, score=0.52, ...),
    HypothesisScore(hypothesis=LONG_OBSERVATIONS, score=0.45, ...),
    ...
]

>>> record.answers
{'q_abc12345': 'wide_observations'}  # If resolved via answer