Grain¶

The minimal set of columns that uniquely identifies each row.

What Is Grain?¶

Grain answers: "What combination of columns has no duplicate values?"

geo_id,sex,age_group,population
ZA-GP,F,15-24,1200000
ZA-WC,F,15-24,600000
ZA-GP,M,15-24,1150000

The grain is (geo_id, sex, age_group) — this combination is unique for every row.

Why Grain Matters¶

Most data errors are grain errors. They cause silent failures:

Error	Symptom	Cause
Duplicated rows after join	Row count explodes	Joined on subset of grain
Missing rows after join	Row count drops	Joined on superset of grain
Wrong aggregation	Metrics are wrong	Summed at wrong granularity
Metric drift	Numbers change over time	Grain changed without notice

GrainInference Structure¶

@dataclass
class GrainInference:
    key_columns: list[str]      # The inferred grain
    confidence: float           # 0.0 to 1.0
    uniqueness_ratio: float     # Fraction of rows that are unique
    evidence: list[str]         # Reasons for the inference
    diagnostics: GrainDiagnostics | None  # Details about issues

Example¶

>>> result.decision_record.grain
GrainInference(
    key_columns=['geo_id', 'sex', 'age_group'],
    confidence=0.95,
    uniqueness_ratio=1.0,
    evidence=[
        'Combination of geo_id, sex, age_group is unique (8/8 rows)',
        'All columns have low cardinality (dimension-like)',
        'No pseudo-key signals detected'
    ]
)

Uniqueness Ratio¶

The fraction of rows with unique grain values:

Ratio	Interpretation
1.0	Perfect — every row is unique
0.95+	Good — few duplicates
0.90–0.95	Acceptable — some duplicates
< 0.90	Problematic — many duplicates

>>> result.decision_record.grain.uniqueness_ratio
0.95  # 95% of rows are unique

Confidence¶

How certain Datasculpt is about the grain:

Confidence	Meaning
0.95+	High confidence — clear grain
0.80–0.95	Medium — some uncertainty
< 0.80	Low — may need confirmation

Low confidence triggers questions in interactive mode.

Diagnostics¶

When there are issues, diagnostics provide details:

@dataclass
class GrainDiagnostics:
    duplicate_groups: int              # Number of duplicate key combinations
    max_group_size: int                # Largest duplicate group
    rows_with_null_in_key: int         # Rows with nulls in key columns
    null_columns: list[str]            # Which key columns have nulls
    example_duplicate_keys: list[tuple]  # Sample duplicates (max 3)

Example with Issues¶

>>> result.decision_record.grain.diagnostics
GrainDiagnostics(
    duplicate_groups=3,
    max_group_size=2,
    rows_with_null_in_key=5,
    null_columns=['geo_id'],
    example_duplicate_keys=[
        ('ZA-GP', '2024-01-01'),
        ('ZA-WC', '2024-01-15')
    ]
)

Detection Algorithm¶

Step 1: Identify Candidates¶

Columns that could participate in grain: - KEY role columns (high uniqueness) - DIMENSION role columns - TIME role columns - INDICATOR_NAME (in long_indicators shape)

Excluded: - MEASURE columns - SERIES columns - METADATA columns

Step 2: Test Single Columns¶

Check if any single column is unique:

for col in candidates:
    if df[col].nunique() == len(df):
        # col alone is the grain
        return [col]

Step 3: Apply Pseudo-Key Penalties¶

Single-column grains are checked for pseudo-key signals:

Signal	Penalty
Name matches `row_id`, `row_num`	0.30
Name matches `index`, `idx`	0.25
Name matches `uuid`, `guid`	0.25
Values are monotonic sequence	0.20
Name matches `created_at`, `timestamp`	0.20

Penalized columns are deprioritized in favor of multi-column grains.

Step 4: Test Combinations¶

Test 2-column, then 3-column, then 4-column combinations:

for size in [2, 3, 4]:
    for combo in combinations(candidates, size):
        if df[list(combo)].drop_duplicates().shape[0] == len(df):
            return list(combo)

Step 5: Select Minimal Grain¶

The first unique combination found is the grain. It's minimal because: - Single columns are tested first - Smaller combinations before larger

Shape-Aware Grain¶

Grain detection considers the dataset shape:

Shape	Grain Behavior
`long_observations`	Dimensions + time form grain
`long_indicators`	Dimensions + time + indicator_name form grain
`wide_observations`	Dimensions form grain (measures excluded)
`wide_time_columns`	Dimensions form grain (time columns excluded)
`series_column`	Dimensions form grain (series column excluded)

Example: Long Indicators¶

# Shape: long_indicators
# Data: geo_id, date, indicator, value

>>> result.decision_record.grain.key_columns
['geo_id', 'date', 'indicator']  # indicator_name is part of grain

Example: Wide Time Columns¶

# Shape: wide_time_columns
# Data: geo_id, indicator, 2022, 2023, 2024

>>> result.decision_record.grain.key_columns
['geo_id', 'indicator']  # Time columns excluded

Interactive Confirmation¶

When confidence is low, Datasculpt asks for confirmation:

result = infer("data.csv", interactive=True)

>>> result.pending_questions[0]
Question(
    prompt='Please confirm or select the grain (unique key columns):',
    choices=[
        {'value': ['geo_id', 'date'], 'label': 'Inferred: geo_id, date (95% unique)'},
        {'value': ['geo_id'], 'label': 'geo_id'},
        {'value': ['geo_id', 'date', 'category'], 'label': 'geo_id, date, category'}
    ]
)

Apply the user's choice:

from datasculpt import apply_answers

answers = {q.id: ['geo_id', 'date', 'category']}
result = apply_answers(result, answers)

>>> result.decision_record.grain.confidence
1.0  # User confirmed

Configuration¶

Tune grain detection:

from datasculpt.core.types import InferenceConfig

config = InferenceConfig(
    max_grain_columns=4,            # Max columns to test
    min_uniqueness_confidence=0.95, # Threshold for "good enough"
)

Common Issues¶

No Unique Grain Found¶

If no combination up to max_grain_columns is unique:

>>> result.decision_record.grain
GrainInference(
    key_columns=['geo_id', 'date'],  # Best found
    uniqueness_ratio=0.85,           # Only 85% unique
    evidence=['No combination up to 4 columns is fully unique']
)

Nulls in Key Columns¶

Nulls in key columns reduce uniqueness:

>>> result.decision_record.grain.diagnostics.rows_with_null_in_key
12  # 12 rows have nulls in key columns

>>> result.proposal.warnings
['12 rows have NULL values in key column geo_id']

Pseudo-Key Dominance¶

If the only unique column is a pseudo-key:

>>> result.decision_record.grain.evidence
['row_id is unique but appears to be a pseudo-key (name pattern)',
 'Falling back to multi-column grain: [geo_id, date]']