Glossary¶
Terminology used in Datasculpt.
A¶
Ambiguity¶
The condition in which Datasculpt cannot confidently distinguish between two or more interpretations. Measured by the score gap between competing hypotheses. Triggers questions in interactive mode.
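As an illustration of the score-gap idea, the sketch below flags ambiguity when the two best hypothesis scores are close. The function names, the `threshold` of 0.1, and the single-candidate convention are assumptions made for this example, not Datasculpt's actual API or defaults.

```python
def score_gap(scores: dict[str, float]) -> float:
    """Return the gap between the best and second-best hypothesis scores."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2:
        # Assumed convention: a single candidate has nothing to compete with.
        return 1.0
    return ranked[0] - ranked[1]


def is_ambiguous(scores: dict[str, float], threshold: float = 0.1) -> bool:
    """Treat the result as ambiguous when the gap is below the (assumed) threshold."""
    return score_gap(scores) < threshold


close = {"long_observations": 0.72, "wide_observations": 0.68}
clear = {"long_observations": 0.90, "long_indicators": 0.30}
print(is_ambiguous(close), is_ambiguous(clear))
```

A small gap means the top two interpretations are nearly tied, which is the situation where a question to the user is more useful than a guess.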
Array Profile¶
Statistics about array-type columns: average length, min/max length, consistency.
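A minimal sketch of how such statistics could be computed for a column of Python lists. The key names are hypothetical, and "consistency" is interpreted here as "all arrays have the same length", which is an assumption about the term.

```python
from statistics import mean


def array_profile(values):
    """Summarize array lengths for a column; None entries are skipped."""
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        return None
    return {
        "avg_length": mean(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        # Assumed meaning of "consistency": every array has the same length.
        "consistent_length": min(lengths) == max(lengths),
    }


print(array_profile([[1, 2, 3], [4, 5, 6], None]))
```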
C¶
Cardinality¶
The number of distinct values in a column. High cardinality suggests uniqueness (keys, measures). Low cardinality suggests categories (dimensions).
Column Evidence¶
See Evidence.
Confidence¶
A score from 0.0 to 1.0 indicating certainty in an inference. Low confidence triggers warnings or questions.
D¶
Decision Record¶
Complete audit trail for an inference run. Contains selected hypothesis, alternatives, evidence, questions, and answers.
Dimension¶
A column role for categorical grouping variables. Typically low cardinality; used in GROUP BY clauses.
Distinct Ratio¶
The fraction of a column's values that are distinct. Calculated as unique_count / row_count, so 1.0 means every row holds a different value.
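The relationship between cardinality and distinct ratio can be sketched in a few lines. The function names are illustrative, and counting `None` as a value like any other is an assumption of this example.

```python
def cardinality(values):
    """Number of distinct values in the column."""
    return len(set(values))


def distinct_ratio(values):
    """unique_count / row_count; an empty column is reported as 0.0."""
    if not values:
        return 0.0
    return cardinality(values) / len(values)


region = ["north", "south", "north", "east"]
print(cardinality(region), distinct_ratio(region))
```

A ratio near 1.0 points toward key-like or measure-like columns, while a ratio near 0 points toward dimensions.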
E¶
Evidence¶
Objective facts about a column: type, cardinality, null rate, value distribution. Separated from interpretation.
G¶
Grain¶
The minimal set of columns that uniquely identifies each row. Also called the "unique key" or "natural key".
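One way to make "minimal set of columns" concrete is a brute-force search over column combinations, smallest first. This is a sketch of the definition only, not Datasculpt's inference algorithm; it is exponential in the number of columns, and `infer_grain` is a hypothetical name.

```python
from itertools import combinations


def infer_grain(rows, columns):
    """Return the smallest column combination whose values are unique per row."""
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            keys = [tuple(row[c] for c in combo) for row in rows]
            if len(set(keys)) == len(keys):
                return combo
    return None  # no combination identifies rows uniquely


rows = [
    {"country": "FR", "year": 2022, "value": 1.0},
    {"country": "FR", "year": 2023, "value": 1.0},
    {"country": "DE", "year": 2022, "value": 2.0},
    {"country": "DE", "year": 2023, "value": 2.0},
]
print(infer_grain(rows, ["country", "year", "value"]))
```

In practice the candidate list would usually exclude measures and pseudo-keys, since a column can be unique by accident without being a meaningful part of the grain.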
Grain Diagnostics¶
Details about grain quality: duplicate count, null count, example duplicates.
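The three diagnostics named above could be derived as follows. The field names mirror the definition but are assumptions, as is counting a duplicate as "every row beyond the first with the same grain values".

```python
from collections import Counter


def grain_diagnostics(rows, grain, max_examples=3):
    """Report duplicates and nulls for a candidate grain (a list of column names)."""
    keys = [tuple(row.get(c) for c in grain) for row in rows]
    counts = Counter(keys)
    duplicated = [key for key, n in counts.items() if n > 1]
    return {
        # Rows beyond the first occurrence of each repeated key.
        "duplicate_count": sum(n - 1 for n in counts.values() if n > 1),
        # Rows where any grain column is missing.
        "null_count": sum(1 for key in keys if any(v is None for v in key)),
        "example_duplicates": duplicated[:max_examples],
    }


rows = [
    {"country": "FR", "year": 2022},
    {"country": "FR", "year": 2022},
    {"country": "DE", "year": None},
]
print(grain_diagnostics(rows, ["country", "year"]))
```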
H¶
Hypothesis¶
A candidate interpretation of a dataset. Shape hypotheses correspond to the five structural patterns listed under Shape. Hypotheses are scored and ranked.
Hypothesis Score¶
A score from 0.0 to 1.0 for a shape hypothesis, with supporting reasons.
I¶
Indicator¶
In statistical data, a named metric (e.g., "population", "gdp"). In long_indicators shape, stored as indicator_name/value pairs.
Inference¶
The process of determining structure from data. Datasculpt infers shape, grain, and roles.
Interactive Mode¶
Mode where Datasculpt generates questions for ambiguous aspects instead of guessing.
Invariant Proposal¶
Output ready for registration with Invariant. Contains shape, grain, columns, warnings.
K¶
Key¶
A column role for uniqueness contributors. Part of the grain.
L¶
Long Indicators¶
Dataset shape where metrics are stored as indicator_name/value pairs, one per row.
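To show what the shape looks like in practice, the sketch below pivots indicator_name/value rows into one record per key. The helper name and its parameters are hypothetical; only the `indicator_name` and `value` role names come from this glossary.

```python
def pivot_indicators(rows, key_columns, name_col="indicator_name", value_col="value"):
    """Turn long indicator rows into one wide record per key combination."""
    wide = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        # Start each record with its key columns, then add one field per indicator.
        record = wide.setdefault(key, dict(zip(key_columns, key)))
        record[row[name_col]] = row[value_col]
    return list(wide.values())


long_rows = [
    {"country": "FR", "year": 2022, "indicator_name": "population", "value": 68.0},
    {"country": "FR", "year": 2022, "indicator_name": "gdp", "value": 2.8},
]
print(pivot_indicators(long_rows, ["country", "year"]))
```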
Long Observations¶
Dataset shape where each row is an atomic observation with dimensions and measures as columns.
M¶
Measure¶
A column role for numeric, aggregatable values. Typically high cardinality; used in aggregations such as SUM, AVG, and COUNT.
Metadata¶
A column role for descriptive, non-analytical columns. Notes, comments, labels.
N¶
Null Rate¶
The fraction of missing values in a column. High null rates suggest optional or metadata columns.
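A minimal sketch of the calculation, assuming that only `None` counts as missing. Real data may also need to treat empty strings or NaN as missing, which this example deliberately leaves out.

```python
def null_rate(values):
    """Fraction of entries that are None; an empty column is reported as 0.0."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


print(null_rate(["note", None, "checked", None]))
```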
P¶
Primitive Type¶
Basic data type: string, integer, number, boolean, date, datetime.
Pseudo-Key¶
A column that appears unique but doesn't represent a meaningful business key. Examples: row_id, uuid, created_at.
Q¶
Question¶
In interactive mode, a prompt for user input to resolve ambiguity. Types: choose_one, choose_many, confirm.
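The three question types suggest a small data structure. The classes and fields below are hypothetical and only illustrate how each type constrains a valid answer; they are not taken from Datasculpt's code.

```python
from dataclasses import dataclass, field
from enum import Enum


class QuestionType(Enum):
    CHOOSE_ONE = "choose_one"
    CHOOSE_MANY = "choose_many"
    CONFIRM = "confirm"


@dataclass
class Question:
    prompt: str
    type: QuestionType
    choices: list[str] = field(default_factory=list)

    def accepts(self, answer) -> bool:
        """Check that an answer has the form this question type expects."""
        if self.type is QuestionType.CONFIRM:
            return isinstance(answer, bool)
        if self.type is QuestionType.CHOOSE_ONE:
            return answer in self.choices
        # choose_many: a list in which every entry is one of the choices
        return isinstance(answer, list) and all(a in self.choices for a in answer)


q = Question("Which columns form the grain?", QuestionType.CHOOSE_MANY,
             ["country", "year", "row_id"])
print(q.accepts(["country", "year"]))
```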
R¶
Role¶
The structural purpose of a column: key, dimension, measure, time, indicator_name, value, series, metadata.
Role Score¶
A likelihood score (0.0 to 1.0) for each possible role assignment.
S¶
Series¶
A column role for embedded time series stored as arrays or objects.
Shape¶
The structural pattern of a dataset: long_observations, long_indicators, wide_observations, wide_time_columns, series_column.
Shape Result¶
The output of shape detection: selected hypothesis, ranked alternatives, ambiguity status.
Structural Metadata¶
Metadata about layout and intent: shape, grain, roles. Distinct from technical, business, operational, or governance metadata.
Structural Type¶
How values are structured: scalar, array, object.
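For Python values, the three structural types map naturally onto built-in containers. The mapping below is an assumption for illustration: dicts are treated as objects, lists and tuples as arrays, and everything else (including strings) as scalars.

```python
def structural_type(value):
    """Classify a single value as 'scalar', 'array', or 'object'."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, (list, tuple)):
        return "array"
    return "scalar"


print([structural_type(v) for v in (42, "FR", [1, 2, 3], {"2022": 1.5})])
```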
T¶
Time¶
A column role for temporal dimensions. Date or datetime type.
U¶
Uniqueness Ratio¶
The fraction of rows with unique grain values. 1.0 means no duplicates.
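A sketch of one plausible calculation: the number of distinct grain tuples divided by the row count. This is an assumed reading of the definition; it agrees with the stated property that 1.0 means no duplicates, but the exact formula Datasculpt uses may differ.

```python
def uniqueness_ratio(rows, grain):
    """Distinct grain tuples divided by row count (assumed formula)."""
    keys = [tuple(row.get(c) for c in grain) for row in rows]
    if not keys:
        return 1.0  # assumed convention: an empty dataset has no duplicates
    return len(set(keys)) / len(keys)


rows = [
    {"country": "FR", "year": 2022},
    {"country": "FR", "year": 2022},
    {"country": "DE", "year": 2022},
]
print(uniqueness_ratio(rows, ["country", "year"]))
```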
V¶
Value¶
A column role in unpivoted data that holds the numeric value paired with an indicator name.
Value Profile¶
Distribution statistics for numeric columns: min, max, mean, and ratios for bounded ranges.
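The sketch below computes min, max, and mean, plus two example range ratios. Which bounded ranges Datasculpt actually checks is not stated here, so the `ratio_in_0_1` and `ratio_non_negative` fields are illustrative assumptions.

```python
from statistics import mean


def value_profile(values):
    """Summarize a numeric column; non-numeric entries and booleans are ignored."""
    nums = [v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if not nums:
        return None
    n = len(nums)
    return {
        "min": min(nums),
        "max": max(nums),
        "mean": mean(nums),
        # Assumed example ranges: proportions in [0, 1] and non-negative values.
        "ratio_in_0_1": sum(0 <= v <= 1 for v in nums) / n,
        "ratio_non_negative": sum(v >= 0 for v in nums) / n,
    }


print(value_profile([0.2, 0.5, 1.0, 3.0]))
```

A column whose values all fall in [0, 1] is a hint that it may hold rates or shares rather than counts.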
W¶
Wide Observations¶
Dataset shape with measures as columns. Spreadsheet-style layout.
Wide Time Columns¶
Dataset shape with time periods encoded in column headers (e.g., 2022, 2023, 2024).
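To illustrate, the sketch below detects year-like headers and unpivots them into period/value rows. The year-only pattern (1900 to 2099) and the `period` and `value` output names are assumptions; other period encodings such as quarters or months would need their own patterns.

```python
import re

# Assumed pattern: four-digit years from 1900 to 2099.
YEAR_HEADER = re.compile(r"(19|20)\d{2}")


def time_columns(columns):
    """Return the column names that look like years."""
    return [c for c in columns if YEAR_HEADER.fullmatch(str(c))]


def melt_time_columns(rows, id_columns):
    """Unpivot year columns into one row per (id columns, period)."""
    out = []
    for row in rows:
        for col in time_columns(row.keys()):
            record = {c: row[c] for c in id_columns}
            record["period"] = col
            record["value"] = row[col]
            out.append(record)
    return out


wide = [{"country": "FR", "2022": 1.0, "2023": 1.1, "2024": 1.2}]
print(time_columns(wide[0].keys()))
print(melt_time_columns(wide, ["country"]))
```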