
Grain Detection

How Datasculpt finds the minimal unique key for a dataset.

What Is Grain?

The grain is the minimal set of columns that uniquely identifies each row. It's the answer to: "What combination of columns, when taken together, has no duplicates?"
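As a concrete sketch (plain Python, not part of the Datasculpt API), a column set is the grain when its value tuples contain no duplicates:

```python
# Hypothetical 8-row demographics table: 2 geos x 2 sexes x 2 age groups.
rows = [
    {"geo_id": "ZA-GP", "sex": "F", "age_group": "0-14"},
    {"geo_id": "ZA-GP", "sex": "F", "age_group": "15+"},
    {"geo_id": "ZA-GP", "sex": "M", "age_group": "0-14"},
    {"geo_id": "ZA-GP", "sex": "M", "age_group": "15+"},
    {"geo_id": "ZA-WC", "sex": "F", "age_group": "0-14"},
    {"geo_id": "ZA-WC", "sex": "F", "age_group": "15+"},
    {"geo_id": "ZA-WC", "sex": "M", "age_group": "0-14"},
    {"geo_id": "ZA-WC", "sex": "M", "age_group": "15+"},
]

def is_unique_key(rows, columns):
    """True when the column combination has no duplicate value tuples."""
    return len({tuple(r[c] for c in columns) for r in rows}) == len(rows)

print(is_unique_key(rows, ["geo_id", "sex"]))               # False: pairs repeat
print(is_unique_key(rows, ["geo_id", "sex", "age_group"]))  # True: the grain
```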

Why Grain Matters

Most data errors are grain errors:

| Error | Caused By |
|---|---|
| Duplicated rows after join | Wrong grain assumption |
| Aggregation double-counting | Grain finer than expected |
| Missing rows after join | Grain coarser than expected |
| Metric drift | Grain changed over time |
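The first and third failure modes can be shown in a few lines (hypothetical data, plain Python): a lookup table assumed to be at `geo_id` grain actually has a duplicate key, so the join fans out, and a key missing from the lookup silently drops a fact row:

```python
# Fact table at (geo_id, date) grain.
facts = [("ZA-GP", "2024-01-01", 10), ("ZA-WC", "2024-01-01", 20)]

# Lookup assumed to be at geo_id grain -- but the key repeats,
# and ZA-WC is missing entirely.
lookup = [("ZA-GP", "Gauteng"), ("ZA-GP", "Gauteng Province")]

# Inner join on geo_id.
joined = [
    (geo, date, value, name)
    for (geo, date, value) in facts
    for (lookup_geo, name) in lookup
    if lookup_geo == geo
]

# The ZA-GP fact row fanned out to two rows; the ZA-WC row vanished.
print(len(joined))  # 2
```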

Basic Detection

from datasculpt import infer

result = infer("demographics.csv")

>>> result.decision_record.grain
GrainInference(
    key_columns=['geo_id', 'sex', 'age_group'],
    confidence=0.95,
    uniqueness_ratio=1.0,
    evidence=[
        'Combination of geo_id, sex, age_group is unique (8/8 rows)',
        'All columns have low cardinality (dimension-like)',
        'No pseudo-key signals detected'
    ]
)

Detection Algorithm

Datasculpt tests grain candidates in order:

Step 1: Single Columns

Test each column individually for uniqueness:

# Column cardinality
geo_id: 2 unique values → not unique
sex: 2 unique values → not unique
age_group: 2 unique values → not unique
population: 8 unique values → unique! (but is it a key?)
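The single-column scan amounts to a distinct count per column (a plain-Python sketch over the 8-row demographics example; the numbers match the output above):

```python
from itertools import product

# 8-row demographics sample: 2 geos x 2 sexes x 2 age groups,
# with a distinct population value per row.
rows = [
    {"geo_id": g, "sex": s, "age_group": a, "population": p}
    for p, (g, s, a) in enumerate(
        product(["ZA-GP", "ZA-WC"], ["F", "M"], ["0-14", "15+"]), start=100
    )
]

cardinalities = {col: len({r[col] for r in rows}) for col in rows[0]}
print(cardinalities)
# {'geo_id': 2, 'sex': 2, 'age_group': 2, 'population': 8}

unique_alone = [c for c, n in cardinalities.items() if n == len(rows)]
print(unique_alone)  # ['population'] -- unique, but measure-like
```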

Step 2: Pseudo-Key Detection

Check if unique columns are "real" keys or just artifacts:

# Pseudo-key signals
population:
  - is_monotonic_sequence: False
  - is_uuid_like: False
  - name_signal_penalty: 0.0  # No "id", "row", "index" in name
  → Probably real key (but measure-like, so demoted)
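The signals above can be approximated with simple checks. This is a hedged sketch: the exact heuristics and weights inside Datasculpt may differ, and the penalty values here are illustrative:

```python
import re

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)
# Name fragments that suggest an artifact key (illustrative penalties).
NAME_PENALTIES = {"row": 0.30, "index": 0.25, "idx": 0.25, "uuid": 0.25, "guid": 0.25}

def is_monotonic_sequence(values):
    """Consecutive integers like 1, 2, 3, ... (a row-number artifact)."""
    return all(b == a + 1 for a, b in zip(values, values[1:]))

def is_uuid_like(values):
    return all(isinstance(v, str) and UUID_RE.match(v) for v in values)

def name_signal_penalty(name):
    lower = name.lower()
    return max((p for frag, p in NAME_PENALTIES.items() if frag in lower), default=0.0)

print(is_monotonic_sequence([1, 2, 3, 4]))  # True
print(name_signal_penalty("row_id"))        # 0.3
print(name_signal_penalty("population"))    # 0.0
```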

Step 3: Column Combinations

Test 2-column, then 3-column, then 4-column combinations:

# 2-column tests
(geo_id, sex): 4 unique → not unique for 8 rows
(geo_id, age_group): 4 unique → not unique
(sex, age_group): 4 unique → not unique

# 3-column tests
(geo_id, sex, age_group): 8 unique → unique! ✓
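Steps 1 and 3 together form a breadth-first search over column combinations: smallest combinations first, stopping at the first fully unique one. A sketch (not the actual implementation):

```python
from itertools import combinations, product

rows = [
    {"geo_id": g, "sex": s, "age_group": a}
    for g, s, a in product(["ZA-GP", "ZA-WC"], ["F", "M"], ["0-14", "15+"])
]

def find_grain(rows, candidates, max_cols=4):
    """Smallest column combination whose value tuples are all distinct."""
    for k in range(1, max_cols + 1):
        for combo in combinations(candidates, k):
            if len({tuple(r[c] for c in combo) for r in rows}) == len(rows):
                return list(combo)
    return None

# Measure-like columns (population) are excluded from the candidates.
print(find_grain(rows, ["geo_id", "sex", "age_group"]))
# ['geo_id', 'sex', 'age_group']
```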

Step 4: Minimality Check

Ensure the key is minimal — no subset is also unique:

# Is (geo_id, sex, age_group) minimal?
(geo_id, sex): 4 unique → not unique (need all 3)
(geo_id, age_group): 4 unique → not unique
(sex, age_group): 4 unique → not unique
✓ Yes, all 3 columns are required
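Minimality can be verified directly: the key itself is unique, and no proper subset of it is. A standalone sketch:

```python
from itertools import combinations, product

rows = [
    {"geo_id": g, "sex": s, "age_group": a}
    for g, s, a in product(["ZA-GP", "ZA-WC"], ["F", "M"], ["0-14", "15+"])
]

def is_unique(rows, cols):
    return len({tuple(r[c] for c in cols) for r in rows}) == len(rows)

def is_minimal_key(rows, key_cols):
    """The key is unique, but no proper subset of it is."""
    if not is_unique(rows, key_cols):
        return False
    return not any(
        is_unique(rows, subset)
        for k in range(1, len(key_cols))
        for subset in combinations(key_cols, k)
    )

print(is_minimal_key(rows, ["geo_id", "sex", "age_group"]))  # True
```

Note that a smallest-combinations-first search (as in the previous step) returns a minimal key by construction: any unique subset would have been found at a smaller size. This check is how you would confirm that explicitly.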

Pseudo-Key Penalties

Some columns look unique but aren't meaningful keys:

| Pattern | Penalty | Example |
|---|---|---|
| `row_id`, `row_num` | 0.30 | Auto-generated row numbers |
| `index`, `idx` | 0.25 | DataFrame indices |
| `uuid`, `guid` | 0.25 | Random identifiers |
| `created_at`, `timestamp` | 0.20 | Ingestion timestamps |
| Monotonic integers | 0.20 | 1, 2, 3, 4, 5... |

>>> ev = result.decision_record.column_evidence["row_id"]
>>> ev.role_scores
{<Role.KEY: 'key'>: 0.70, ...}  # Penalized from 1.0

Uniqueness Ratio

When no combination is fully unique, Datasculpt reports the best ratio:

>>> result.decision_record.grain
GrainInference(
    key_columns=['geo_id', 'date'],
    confidence=0.75,
    uniqueness_ratio=0.95,  # 95% of rows are unique
    evidence=[
        'Combination of geo_id, date has 95% unique rows',
        '5% of rows are duplicates'
    ],
    diagnostics=GrainDiagnostics(
        duplicate_groups=3,
        max_group_size=2,
        rows_with_null_in_key=0,
        example_duplicate_keys=[
            ('ZA-GP', '2024-01-01'),
            ('ZA-WC', '2024-01-15'),
            ('ZA-KZN', '2024-02-01')
        ]
    )
)
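The ratio and diagnostics can be computed with a `Counter` over key tuples. One plausible definition of the ratio is distinct key tuples over total rows; Datasculpt's exact formula may differ. The field names below mirror the `GrainDiagnostics` record above:

```python
from collections import Counter

# Hypothetical (geo_id, date) keys with one duplicated tuple.
keys = [
    ("ZA-GP", "2024-01-01"), ("ZA-GP", "2024-01-02"),
    ("ZA-WC", "2024-01-01"), ("ZA-GP", "2024-01-01"),  # duplicate
]

def grain_diagnostics(keys):
    counts = Counter(keys)
    dupes = {k: n for k, n in counts.items() if n > 1}
    return {
        "uniqueness_ratio": len(counts) / len(keys),
        "duplicate_groups": len(dupes),
        "max_group_size": max(dupes.values(), default=1),
        "example_duplicate_keys": sorted(dupes)[:3],
    }

print(grain_diagnostics(keys))
```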

Handling Nulls in Keys

Nulls in key columns affect uniqueness:

>>> result.decision_record.grain.diagnostics
GrainDiagnostics(
    rows_with_null_in_key=12,
    null_columns=['geo_id'],  # geo_id has nulls
    ...
)

>>> result.proposal.warnings
['12 rows have NULL values in key column geo_id']
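Counting null-affected rows is straightforward. A sketch mirroring the `rows_with_null_in_key` and `null_columns` fields:

```python
# Hypothetical rows with nulls in a key column.
rows = [
    {"geo_id": "ZA-GP", "date": "2024-01-01"},
    {"geo_id": None,    "date": "2024-01-01"},
    {"geo_id": None,    "date": "2024-01-02"},
]
key_cols = ["geo_id", "date"]

rows_with_null_in_key = sum(
    any(r[c] is None for c in key_cols) for r in rows
)
null_columns = sorted({c for r in rows for c in key_cols if r[c] is None})

print(rows_with_null_in_key, null_columns)  # 2 ['geo_id']
```

Whether two NULLs count as equal for uniqueness purposes is a policy choice; this sketch only flags null-bearing rows rather than deciding that question.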

Survey-Aware Patterns

Datasculpt recognizes survey and roster patterns:

| Pattern | Interpretation |
|---|---|
| `person_id`, `respondent_id` | Survey respondent |
| `member_id`, `household_member` | Household roster member |
| `visit_id`, `interview_id` | Survey visit/wave |

These get boosted as likely key contributors.
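A name-based boost can be sketched as a substring match against the patterns in the table above. The boost value here is an assumption for illustration, not Datasculpt's actual weight:

```python
SURVEY_KEY_PATTERNS = [
    "person_id", "respondent_id",     # survey respondent
    "member_id", "household_member",  # household roster member
    "visit_id", "interview_id",       # survey visit/wave
]

def survey_key_boost(column_name, boost=0.10):
    """Return a key-role score boost for survey/roster identifier names."""
    name = column_name.lower()
    if any(pattern in name for pattern in SURVEY_KEY_PATTERNS):
        return boost
    return 0.0

print(survey_key_boost("respondent_id"))  # 0.1
print(survey_key_boost("population"))     # 0.0
```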

Interactive Grain Confirmation

When confidence is low, Datasculpt asks for confirmation:

result = infer("data.csv", interactive=True)

>>> result.pending_questions[0]
Question(
    id='q_grain_xyz',
    type=<QuestionType.CHOOSE_MANY>,
    prompt='Please confirm or select the grain (unique key columns) for this dataset:',
    choices=[
        {'value': ['geo_id', 'date'], 'label': 'Inferred: geo_id, date (95% unique)'},
        {'value': ['geo_id'], 'label': 'geo_id'},
        {'value': ['date'], 'label': 'date'},
        {'value': ['geo_id', 'date', 'category'], 'label': 'geo_id, date, category'},
        ...
    ],
    default=['geo_id', 'date'],
    rationale='Uniqueness ratio is 95%, below 100% threshold'
)

Provide the correct grain:

from datasculpt import apply_answers

answers = {result.pending_questions[0].id: ['geo_id', 'date', 'category']}
result = apply_answers(result, answers)

>>> result.decision_record.grain.key_columns
['geo_id', 'date', 'category']

>>> result.decision_record.grain.confidence
1.0  # User confirmed

Shape-Aware Grain

Grain detection considers the dataset shape:

| Shape | Grain Behavior |
|---|---|
| `wide_observations` | Dimensions form grain, measures excluded |
| `long_indicators` | Dimensions + `indicator_name` form grain |
| `wide_time_columns` | Dimensions form grain, time columns excluded |
| `series_column` | Dimensions form grain, series column excluded |

# Long indicators: indicator_name is part of grain
>>> result.decision_record.grain.key_columns
['geo_id', 'date', 'indicator']

# Wide observations: only dimensions
>>> result.decision_record.grain.key_columns
['geo_id', 'sex', 'age_group']

Configuration

Tune grain detection behavior:

from datasculpt.core.types import InferenceConfig

config = InferenceConfig(
    max_grain_columns=4,        # Max columns to test in combination
    min_uniqueness_confidence=0.95,  # Threshold for "unique enough"
)

result = infer("data.csv", config=config)

See Also