Microdata

Microdata is record-level survey or observation data with hierarchical IDs, coded question columns, and geography levels.

The Data

hhid,indiv,zone,state,lga,weight,s1aq1,s1aq2,v101,v102
1001,1,North Central,Kogi,Dekina,1.234,1,35,2,3
1001,2,North Central,Kogi,Dekina,1.234,2,28,2,1
1002,1,South West,Lagos,Ikeja,0.987,2,42,1,2
1002,2,South West,Lagos,Ikeja,0.987,1,19,1,4
1003,1,North Central,Kwara,Ilorin,1.456,1,55,3,2

What It Looks Like

hhid	indiv	zone	state	lga	weight	s1aq1	s1aq2	v101	v102
1001	1	North Central	Kogi	Dekina	1.234	1	35	2	3
1001	2	North Central	Kogi	Dekina	1.234	2	28	2	1
1002	1	South West	Lagos	Ikeja	0.987	2	42	1	2
...	...	...	...	...	...	...	...	...	...

The Inference

from datasculpt import infer

result = infer("household_survey.csv")

Shape Detection

>>> result.proposal.shape_hypothesis
<ShapeHypothesis.MICRODATA: 'microdata'>

>>> result.decision_record.hypotheses[0]
HypothesisScore(
    hypothesis=<ShapeHypothesis.MICRODATA>,
    score=0.75,
    reasons=[
        'Found respondent ID column(s): hhid',
        'Found subunit ID column(s): indiv',
        'Found geography hierarchy (3 levels): zone, state, lga',
        'Found survey weight column: weight',
        'Moderate ratio (40%) of coded question columns'
    ]
)
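The 40% figure in the last reason is consistent with four of the ten columns being coded question columns. A minimal check of that arithmetic, using the column names from the sample file (the variable names here are illustrative, not part of Datasculpt):

```python
# Column names from the sample file, in order.
columns = ["hhid", "indiv", "zone", "state", "lga", "weight",
           "s1aq1", "s1aq2", "v101", "v102"]

# Columns this page treats as coded question columns.
question_columns = {"s1aq1", "s1aq2", "v101", "v102"}

# Share of columns that are coded question columns.
ratio = sum(name in question_columns for name in columns) / len(columns)
print(f"{ratio:.0%}")  # 40%
```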

Grain Detection

>>> result.decision_record.grain
GrainInference(
    key_columns=['hhid', 'indiv'],
    confidence=0.95,
    uniqueness_ratio=1.0,
    evidence=[
        'Combination of hhid, indiv is unique',
        'Hierarchical ID structure: primary (hhid) + subunit (indiv)'
    ]
)
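The uniqueness claim can be checked by hand on the sample rows. The sketch below assumes that uniqueness_ratio means distinct key combinations divided by row count; that definition and the helper name are assumptions for illustration, not Datasculpt's implementation.

```python
import csv
import io

# The five sample rows from "The Data" section.
SAMPLE = """hhid,indiv,zone,state,lga,weight,s1aq1,s1aq2,v101,v102
1001,1,North Central,Kogi,Dekina,1.234,1,35,2,3
1001,2,North Central,Kogi,Dekina,1.234,2,28,2,1
1002,1,South West,Lagos,Ikeja,0.987,2,42,1,2
1002,2,South West,Lagos,Ikeja,0.987,1,19,1,4
1003,1,North Central,Kwara,Ilorin,1.456,1,55,3,2
"""


def uniqueness_ratio(rows, key_columns):
    """Distinct key tuples divided by row count (1.0 means the key is unique)."""
    keys = [tuple(row[col] for col in key_columns) for row in rows]
    return len(set(keys)) / len(keys)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(uniqueness_ratio(rows, ["hhid"]))           # 0.6
print(uniqueness_ratio(rows, ["hhid", "indiv"]))  # 1.0
```

On its own, hhid repeats across household members, so only the combined key identifies a row.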

Role Assignments

Column	Role	Evidence
hhid	respondent_id	Matches household ID pattern (hhid)
indiv	subunit_id	Matches individual/member pattern
zone	geography_level	Geographic administrative level
state	geography_level	Geographic administrative level
lga	geography_level	Local Government Area pattern
weight	survey_weight	Matches weight column pattern
s1aq1	question_response	LSMS section 1a question pattern
s1aq2	question_response	LSMS section 1a question pattern
v101	question_response	DHS variable pattern
v102	question_response	DHS variable pattern

Why This Shape

Datasculpt detected microdata because:

Hierarchical ID structure - hhid identifies households, and indiv identifies individuals within each household
Coded question columns - s1aq1 and s1aq2 follow the LSMS pattern (section 1, part a, questions 1 and 2); v101 and v102 follow the DHS pattern
Geography hierarchy - zone, state, and lga form an ordered geographic hierarchy, from broad to narrow
Survey weight present - the weight column points to a complex survey design
No indicator/value pattern - unlike long_indicators, each question has its own column of values
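One way to read "ordered, broad to narrow" is that every value at a narrower level belongs to exactly one value at the level above it. The sketch below tests that property on the sample rows; is_nested is a hypothetical helper, and whether Datasculpt checks nesting this way is an assumption.

```python
from collections import defaultdict

# Geography columns from the five sample rows.
rows = [
    {"zone": "North Central", "state": "Kogi", "lga": "Dekina"},
    {"zone": "North Central", "state": "Kogi", "lga": "Dekina"},
    {"zone": "South West", "state": "Lagos", "lga": "Ikeja"},
    {"zone": "South West", "state": "Lagos", "lga": "Ikeja"},
    {"zone": "North Central", "state": "Kwara", "lga": "Ilorin"},
]


def is_nested(rows, levels):
    """True if each narrower value maps to exactly one value at the level above."""
    for broad, narrow in zip(levels, levels[1:]):
        parents = defaultdict(set)
        for row in rows:
            parents[row[narrow]].add(row[broad])
        if any(len(seen) > 1 for seen in parents.values()):
            return False
    return True


print(is_nested(rows, ["zone", "state", "lga"]))  # True
print(is_nested(rows, ["lga", "state", "zone"]))  # False
```

The reversed order fails because North Central contains both Kogi and Kwara. Real data may reuse a name under different parents, in which case the check would need to compare full paths rather than single values.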

Detection Signals

The microdata detector looks for these patterns:

Signal	Example Patterns	Weight
Respondent ID	hhid, hh_id, household_id, personid	+0.15
Subunit ID	indiv, member_num, child_num	+0.10
Cluster ID	ea, cluster, psu	+0.05
Geography levels	zone, state, region, lga, district	+0.10
Survey weight	weight, wgt, hh_weight	+0.05
Question columns	s1aq1, v101, hv001, q1	+0.30 (if >50% of columns)
High column count	50+ columns	+0.25
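The weights in the table read as additive contributions. The sketch below sums them for whichever signals fire, assuming a plain sum capped at 1.0; the signal keys and the microdata_score function are hypothetical. The table lists no weight for a moderate question-column ratio and no base score, so this sketch does not try to reproduce the 0.75 reported for the sample file.

```python
# Weights copied from the table above; the dictionary keys are illustrative names.
SIGNAL_WEIGHTS = {
    "respondent_id": 0.15,
    "subunit_id": 0.10,
    "cluster_id": 0.05,
    "geography_levels": 0.10,
    "survey_weight": 0.05,
    "question_columns_majority": 0.30,  # only when >50% are coded question columns
    "high_column_count": 0.25,          # 50+ columns
}


def microdata_score(signals):
    """Sum the weights of the signals that fired, capped at 1.0 (assumed rule)."""
    total = sum(SIGNAL_WEIGHTS[name] for name in signals)
    return min(round(total, 2), 1.0)


# Signals with a listed weight that the sample file triggers.
fired = ["respondent_id", "subunit_id", "geography_levels", "survey_weight"]
print(microdata_score(fired))  # 0.4
```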

Survey Type Inference

Datasculpt can infer the survey type from question column patterns:

>>> from datasculpt.core.microdata import infer_survey_type

>>> infer_survey_type("s1aq1")
'LSMS'

>>> infer_survey_type("hv001")
'DHS'

>>> infer_survey_type("v101")
'DHS'

>>> infer_survey_type("wm1")
'MICS'
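Name-based inference of this kind can be approximated with regular expressions. The patterns below are guesses chosen only to match the four examples above, and guess_survey_type is a hypothetical stand-in; the actual rules inside infer_survey_type are not shown on this page.

```python
import re

# Hypothetical patterns, checked in order; not Datasculpt's real rules.
SURVEY_PATTERNS = [
    ("LSMS", re.compile(r"s\d+[a-z]?q\d+[a-z]?", re.IGNORECASE)),       # e.g. s1aq1
    ("DHS", re.compile(r"(hv|v)\d{3}[a-z]?", re.IGNORECASE)),           # e.g. hv001, v101
    ("MICS", re.compile(r"(hh|hl|wm|cm|ch)\d+[a-z]?", re.IGNORECASE)),  # e.g. wm1
]


def guess_survey_type(column_name):
    """Return the first survey type whose pattern matches the whole name, else None."""
    for survey_type, pattern in SURVEY_PATTERNS:
        if pattern.fullmatch(column_name):
            return survey_type
    return None


for name in ["s1aq1", "hv001", "v101", "wm1", "hhid"]:
    print(name, guess_survey_type(name))
```

Matching the whole name with fullmatch keeps ID columns such as hhid from being mistaken for question columns.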

The Proposal

>>> result.proposal
InvariantProposal(
    dataset_name='household_survey',
    dataset_kind=<DatasetKind.MICRODATA>,
    shape_hypothesis=<ShapeHypothesis.MICRODATA>,
    grain=['hhid', 'indiv'],
    columns=[
        ColumnSpec(name='hhid', role=<Role.RESPONDENT_ID>, ...),
        ColumnSpec(name='indiv', role=<Role.SUBUNIT_ID>, ...),
        ColumnSpec(name='zone', role=<Role.GEOGRAPHY_LEVEL>, ...),
        ColumnSpec(name='state', role=<Role.GEOGRAPHY_LEVEL>, ...),
        ColumnSpec(name='lga', role=<Role.GEOGRAPHY_LEVEL>, ...),
        ColumnSpec(name='weight', role=<Role.SURVEY_WEIGHT>, ...),
        ColumnSpec(name='s1aq1', role=<Role.QUESTION_RESPONSE>, ...),
        ColumnSpec(name='s1aq2', role=<Role.QUESTION_RESPONSE>, ...),
        ColumnSpec(name='v101', role=<Role.QUESTION_RESPONSE>, ...),
        ColumnSpec(name='v102', role=<Role.QUESTION_RESPONSE>, ...),
    ],
    warnings=[],
    required_user_confirmations=[]
)