Ambiguous Shape
What Datasculpt does when it can't confidently distinguish between candidate shapes.
The Problem
Some datasets genuinely fit multiple interpretations. Consider:
region,category,count,total,rate
North,A,100,1000,0.10
North,B,150,1000,0.15
South,A,80,800,0.10
South,B,120,800,0.15
Is this:
- Wide observations with three measures (count, total, rate)?
- Long indicators if category represents different metrics?
Both are defensible. The data structure alone doesn't tell us.
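To make the two readings concrete, here is a minimal standard-library sketch that loads the sample rows and arranges them both ways. It does not use Datasculpt. The variable names are illustrative, and the choice of `count` as the value column in the long reading is an assumption, since the data does not say which column would carry the indicator value.

```python
import csv
import io

SAMPLE = """region,category,count,total,rate
North,A,100,1000,0.10
North,B,150,1000,0.15
South,A,80,800,0.10
South,B,120,800,0.15
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Reading 1 (wide observations): each row is one observation keyed by
# (region, category), carrying three measures side by side.
wide = {
    (r["region"], r["category"]): {
        "count": int(r["count"]),
        "total": int(r["total"]),
        "rate": float(r["rate"]),
    }
    for r in rows
}

# Reading 2 (long indicators): `category` names an indicator and each row
# contributes one value for it. Using `count` as the value is an assumption.
long_rows = [
    {"region": r["region"], "indicator": r["category"], "value": int(r["count"])}
    for r in rows
]
```

Both structures are built from the same four rows without error, which is exactly why the file alone cannot settle the question.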
Detecting Ambiguity
from datasculpt import infer
result = infer("ambiguous.csv", interactive=True)
>>> result.decision_record.hypotheses[:2]
[
HypothesisScore(
hypothesis=<ShapeHypothesis.WIDE_OBSERVATIONS>,
score=0.58,
reasons=['Multiple numeric columns suggest measures']
),
HypothesisScore(
hypothesis=<ShapeHypothesis.LONG_INDICATORS>,
score=0.52,
reasons=['category column has indicator-like values']
)
]
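The gap between the top two scores is only 0.06, below the default threshold of 0.10, so the shape is treated as ambiguous. Round the difference when checking it, because raw float subtraction leaves a small representation error:
>>> top, runner_up = result.decision_record.hypotheses[:2]
>>> round(top.score - runner_up.score, 2)
0.06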
>>> from datasculpt.core.types import InferenceConfig
>>> InferenceConfig().hypothesis_confidence_gap
0.1
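The check described above can be sketched in plain Python. This is not Datasculpt's implementation: the function names are hypothetical, and whether the library compares with `<` or `<=` is an assumption.

```python
def confidence_gap(scores):
    """Return the difference between the two highest scores."""
    top, runner_up = sorted(scores.values(), reverse=True)[:2]
    return top - runner_up


def is_ambiguous(scores, gap_threshold=0.10):
    """Flag the result as ambiguous when the top two scores are too close.

    Assumption: a strict `<` comparison against the threshold.
    """
    return confidence_gap(scores) < gap_threshold


scores = {
    "wide_observations": 0.58,
    "long_indicators": 0.52,
    "long_observations": 0.45,
}
ambiguous = is_ambiguous(scores)  # gap of about 0.06 is below 0.10
```

Only the top two scores matter here. The third hypothesis affects the outcome only if it overtakes one of them.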
Generated Questions
When ambiguity is detected in interactive mode, Datasculpt generates questions:
>>> result.pending_questions
[
Question(
id='q_abc12345',
type=<QuestionType.CHOOSE_ONE>,
prompt='The dataset shape is ambiguous. Please select the most appropriate shape:',
choices=[
{'value': 'wide_observations', 'label': 'Wide Observations', 'score': 0.58},
{'value': 'long_indicators', 'label': 'Long Indicators', 'score': 0.52},
{'value': 'long_observations', 'label': 'Long Observations', 'score': 0.45}
],
default='wide_observations',
rationale='Score gap between top hypotheses is 0.06, below threshold 0.10'
)
]
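As a rough illustration of how such a question could be assembled from ranked scores, the sketch below builds a plain dictionary with the same fields as the output above. The function `build_shape_question` and the dictionary layout are hypothetical. Datasculpt's own `Question` type and ID generation are not reproduced here.

```python
def build_shape_question(scores, gap_threshold=0.10):
    """Return a question dict when the top two scores are close, else None."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    gap = ranked[0][1] - ranked[1][1]
    if gap >= gap_threshold:
        return None  # confident enough, nothing to ask

    return {
        "type": "choose_one",
        "prompt": "The dataset shape is ambiguous. "
                  "Please select the most appropriate shape:",
        # Choices are listed best-first so the default is the top hypothesis.
        "choices": [
            {"value": name, "label": name.replace("_", " ").title(), "score": score}
            for name, score in ranked
        ],
        "default": ranked[0][0],
        "rationale": (
            f"Score gap between top hypotheses is {gap:.2f}, "
            f"below threshold {gap_threshold:.2f}"
        ),
    }


question = build_shape_question(
    {"wide_observations": 0.58, "long_indicators": 0.52, "long_observations": 0.45}
)
```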
Resolving Ambiguity
Pass an answer, keyed by question ID, to apply_answers to resolve the ambiguity:
from datasculpt import apply_answers
answers = {result.pending_questions[0].id: "long_indicators"}
result = apply_answers(result, answers)
>>> result.proposal.shape_hypothesis
<ShapeHypothesis.LONG_INDICATORS: 'long_indicators'>
>>> result.pending_questions
[] # No more questions
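If answers come from a form or a config file, it can help to check them against the offered choices before applying them. The helper below is a hypothetical sketch that works on choice dictionaries shaped like the ones shown earlier. Whether `apply_answers` performs its own validation is not stated here, so treat this as an optional guard.

```python
def validate_choice(choices, answer):
    """Return `answer` if it matches one of the offered choice values."""
    allowed = [choice["value"] for choice in choices]
    if answer not in allowed:
        raise ValueError(f"{answer!r} is not one of {allowed}")
    return answer


choices = [
    {"value": "wide_observations", "label": "Wide Observations", "score": 0.58},
    {"value": "long_indicators", "label": "Long Indicators", "score": 0.52},
    {"value": "long_observations", "label": "Long Observations", "score": 0.45},
]
checked = validate_choice(choices, "long_indicators")
```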
Multiple Ambiguities
A dataset can have multiple ambiguous aspects:
>>> result.pending_questions
[
Question(
id='q_shape_123',
prompt='The dataset shape is ambiguous...',
...
),
Question(
id='q_grain_456',
prompt='Please confirm or select the grain...',
...
),
Question(
id='q_role_789',
prompt='What is the role of column "category"?',
...
)
]
Provide all answers at once:
answers = {
'q_shape_123': 'long_indicators',
'q_grain_456': ['region', 'category'],
'q_role_789': 'indicator_name'
}
result = apply_answers(result, answers)
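When several questions are pending, a quick check that every question ID has an answer avoids a partially resolved result. The helper name `missing_answers` is hypothetical, and the IDs are the illustrative ones used above.

```python
def missing_answers(question_ids, answers):
    """Return the question IDs that have no entry in `answers`, in order."""
    return [qid for qid in question_ids if qid not in answers]


pending_ids = ["q_shape_123", "q_grain_456", "q_role_789"]
partial = {"q_shape_123": "long_indicators"}

unanswered = missing_answers(pending_ids, partial)
```

With a real result, the ID list could be built from `result.pending_questions`, for example `[q.id for q in result.pending_questions]`.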
Tuning Sensitivity
Raise the confidence gap threshold to flag more datasets as ambiguous, or lower it to flag fewer:
from datasculpt import infer
from datasculpt.core.types import InferenceConfig
# More sensitive: a larger required gap means more questions
sensitive = InferenceConfig(hypothesis_confidence_gap=0.15)
# Less sensitive: a smaller required gap means fewer questions
relaxed = InferenceConfig(hypothesis_confidence_gap=0.05)
# Pass whichever config suits the workflow
result = infer("data.csv", config=sensitive, interactive=True)
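The effect of the threshold on the example scores from this page can be worked through directly. This is plain arithmetic, not a call into Datasculpt, and it assumes a strict "gap below threshold" comparison.

```python
# Top two scores from the example: 0.58 and 0.52.
gap = 0.58 - 0.52  # about 0.06

# For each candidate threshold, would this dataset trigger a question?
outcomes = {threshold: gap < threshold for threshold in (0.05, 0.10, 0.15)}

for threshold, asks in outcomes.items():
    print(f"threshold={threshold:.2f} -> {'ask' if asks else 'accept top hypothesis'}")
```

At 0.05 the example is accepted without a question. At the default 0.10 and at 0.15 it is flagged as ambiguous.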
Non-Interactive Mode
Without interactive=True, Datasculpt picks the top-scoring hypothesis but records the ambiguity:
result = infer("ambiguous.csv") # No interactive flag
>>> result.proposal.shape_hypothesis
<ShapeHypothesis.WIDE_OBSERVATIONS: 'wide_observations'>
>>> result.proposal.warnings
['Shape detection confidence is low (gap: 0.06). Consider manual review.']
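In a batch pipeline, those warnings can be used to route low-confidence files to manual review. The sketch below extracts the gap from warning strings. It assumes the wording shown in the example above, which may differ between versions, and the function name is hypothetical.

```python
import re

# Matches the "(gap: 0.06)" fragment in the example warning text.
GAP_PATTERN = re.compile(r"\(gap: (\d+(?:\.\d+)?)\)")


def low_confidence_gaps(warnings):
    """Return the gap values found in a list of warning strings."""
    gaps = []
    for warning in warnings:
        match = GAP_PATTERN.search(warning)
        if match:
            gaps.append(float(match.group(1)))
    return gaps


example_warnings = [
    "Shape detection confidence is low (gap: 0.06). Consider manual review."
]
gaps = low_confidence_gaps(example_warnings)
needs_review = bool(gaps)
```

Parsing message text is fragile. Where possible, prefer the scores in `result.decision_record.hypotheses`.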
When Ambiguity Is Expected
Some datasets are genuinely ambiguous without domain context:
| Scenario | Why Ambiguous |
|---|---|
| Survey data | Rows could be responses or pivoted metrics |
| Aggregated reports | Multiple interpretations of granularity |
| ETL staging tables | Intermediate format, not yet normalized |
| API exports | Format depends on client expectations |
In these cases, interactive mode is the right approach: it lets domain experts resolve the ambiguity.
Decision Record
Even when ambiguous, the decision record captures the full analysis:
>>> record = result.decision_record
>>> record.selected_hypothesis
<ShapeHypothesis.WIDE_OBSERVATIONS: 'wide_observations'>
>>> record.hypotheses
[
HypothesisScore(hypothesis=WIDE_OBSERVATIONS, score=0.58, ...),
HypothesisScore(hypothesis=LONG_INDICATORS, score=0.52, ...),
HypothesisScore(hypothesis=LONG_OBSERVATIONS, score=0.45, ...),
...
]
>>> record.answers
{'q_abc12345': 'wide_observations'} # If resolved via answer
See Also
- Grain Detection — Ambiguity in unique key detection
- Mental Model — The evidence → hypothesis → decision pipeline