Roles¶
The purpose each column serves in the dataset structure.
What Is a Role?¶
A role describes what purpose a column serves — not its data type, but its structural function.
| Column | Type | Role |
|---|---|---|
| geo_id | string | dimension — groups observations |
| sex | string | dimension — groups observations |
| age_group | string | dimension — groups observations |
| population | integer | measure — aggregatable value |
| unemployed | integer | measure — aggregatable value |
Core Roles¶
Key¶
Contributes to uniqueness. Part of the grain.
Signals: - High distinct ratio (approaching 1.0) - No nulls - Not a pseudo-key (see below)
Dimension¶
Categorical grouping variable. Low cardinality.
Signals: - Low distinct ratio (< 0.1) - String or categorical type - Meaningful value labels
Measure¶
Numeric, aggregatable value. What you SUM, AVG, or COUNT.
Signals: - Numeric type (integer or number) - High distinct ratio - Non-negative values (often)
Time¶
Temporal dimension. Dates or timestamps.
Signals: - Parses as date or datetime - Column name suggests time (date, year, period, month) - Sequential values
Indicator Name¶
In unpivoted data, the column that names the metric.
Signals: - Low cardinality - Values look like concept names (population, gdp, rate) - Paired with a value column
Value¶
In unpivoted data, the column that holds the metric value.
Signals: - Numeric type - High distinct ratio - Paired with an indicator name column
Series¶
Contains embedded time series (arrays or objects).
Signals: - Structural type is ARRAY or OBJECT - Arrays have consistent length - Values are numeric
Metadata¶
Descriptive, non-analytical. Notes, comments, labels.
Signals: - High null rate - Long string values - Column name suggests metadata (notes, comments, description)
Survey-Specific Roles¶
These roles are primarily scored when microdata shape is detected. They capture the structural patterns common in household surveys, health surveys (DHS), and living standards measurement studies (LSMS).
Respondent ID¶
Primary unit identifier. The main entity being surveyed (household, person).
Signals:
- High cardinality (one per unit)
- Name patterns: hhid, hh_id, person_id, caseid, respondent_id, pid
- Low null rate
- Integer or string type
Subunit ID¶
Secondary identifier within a primary unit. Identifies individuals within households, children within mothers.
Signals:
- Low-moderate cardinality (typically 1-N where N is small)
- Name patterns: indiv, member_num, child_num, line_num, roster_num
- Integer type common (roster line numbers)
Cluster ID¶
Sampling cluster or enumeration area identifier. Used for complex survey designs.
Signals:
- Moderate cardinality (fewer clusters than respondents)
- Name patterns: ea, cluster, psu, stratum, segment
- Integer or string type
Survey Weight¶
Sampling weights for weighted estimates. Required for proper inference from complex surveys.
Signals:
- Numeric type (required)
- Positive values (weights are always positive)
- Name patterns: weight, wgt, wght, hh_weight, pweight, expansion_factor
- Moderate distinct ratio (weights often repeated within strata)
Question Response¶
Coded survey question answers. The actual response data collected from respondents.
Signals:
- Low cardinality (categorical responses, typically 2-20 options)
- Name patterns: s1aq1, v101, hv001, q1, var_01 (survey coding conventions)
- Most relevant in microdata shape
Geography Level¶
Administrative hierarchy levels. Geographic classification at various administrative divisions.
Signals:
- Low-moderate cardinality (enumerable administrative units)
- Name patterns: zone, region, state, province, lga, district, ward, admin1, admin2
- String type (geographic names) or integer (codes)
Role Scoring¶
Each column is scored against all roles. The highest score wins.
Scoring Factors¶
| Factor | Roles Affected |
|---|---|
| Distinct ratio | KEY (high), DIMENSION (low), MEASURE (high) |
| Null rate | KEY (low), METADATA (high) |
| Primitive type | MEASURE (numeric), DIMENSION (string), TIME (date) |
| Column name patterns | All roles (regex matching) |
| Value patterns | INDICATOR_NAME (concept-like), TIME (sequential) |
Example Scores¶
>>> ev = result.decision_record.column_evidence["population"]
>>> ev.role_scores
{
<Role.MEASURE>: 0.90,
<Role.KEY>: 0.15,
<Role.DIMENSION>: 0.05,
<Role.TIME>: 0.0,
<Role.INDICATOR_NAME>: 0.0,
<Role.VALUE>: 0.0,
<Role.SERIES>: 0.0,
<Role.METADATA>: 0.02
}
population scores highest as MEASURE because:
- Numeric type (integer)
- High distinct ratio (all unique values)
- Name doesn't match dimension patterns
Pseudo-Key Detection¶
Some columns appear unique but aren't meaningful keys:
| Pattern | Example | Penalty |
|---|---|---|
| Row numbers | row_id, row_num | 0.30 |
| Indices | index, idx | 0.25 |
| UUIDs | uuid, guid | 0.25 |
| Timestamps | created_at, ingested_at | 0.20 |
| Monotonic sequences | 1, 2, 3, 4, 5... | 0.20 |
Shape-Conditional Roles¶
Role scoring considers the dataset shape:
| Shape | Role Adjustments |
|---|---|
long_indicators |
Boost INDICATOR_NAME for low-cardinality string columns |
wide_time_columns |
Date-header columns get TIME role |
series_column |
Array columns get SERIES role |
# In long_indicators shape:
>>> ev = evidence["indicator"]
>>> ev.role_scores[Role.INDICATOR_NAME]
0.85 # Boosted because shape is long_indicators
Role Assignment¶
After scoring, roles are assigned:
>>> for col in result.proposal.columns:
... print(f"{col.name}: {col.role.value}")
geo_id: dimension
sex: dimension
age_group: dimension
population: measure
unemployed: measure
Low-Confidence Assignments¶
When the top role score is low or close to alternatives, Datasculpt flags uncertainty:
result = infer("data.csv", interactive=True)
>>> result.pending_questions
[
Question(
prompt='What is the role of column "category"?',
choices=[
{'value': 'dimension', 'label': 'dimension (0.45)'},
{'value': 'indicator_name', 'label': 'indicator_name (0.42)'},
{'value': 'metadata', 'label': 'metadata (0.35)'}
]
)
]
Role Implications¶
For Aggregation¶
| Role | Aggregation Behavior |
|---|---|
| MEASURE | Can be SUMmed, AVGed, etc. |
| DIMENSION | Use in GROUP BY |
| TIME | Use in GROUP BY or time series functions |
| INDICATOR_NAME | Filter before aggregating |
| VALUE | Aggregate after filtering by indicator |
For Joins¶
| Role | Join Behavior |
|---|---|
| KEY | Join on this column |
| DIMENSION | Can join on this (with care) |
| MEASURE | Never join on measures |
| TIME | Join on time ranges |
For Grain¶
| Role | Grain Participation |
|---|---|
| KEY | Always part of grain |
| DIMENSION | Often part of grain |
| TIME | Usually part of grain |
| INDICATOR_NAME | Part of grain in long_indicators |
| MEASURE | Never part of grain |
| SERIES | Never part of grain |