Survey & ID module#

Variables#

countrycode

countrycode is a string variable that specifies the 3-character country code used by the World Bank to identify each country. Although there are different naming conventions, it is necessary to use those specified to ensure that the data for each country is appropriately labeled.

survname

survname codes the acronym of the survey.

survey

survey codes the type of survey (e.g., LFS for Labor Force Survey).

icls_v

Underlying version of the International Conference of Labour Statisticians that is being used in the survey to code concepts of work and employment.

Most commonly, surveys harmonized to GLD will either follow ICLS-13 or ICLS-19, that is, the directives set out during the 13th or the 19th conference, especially pertaining employment.

In ICLS-13 all work – other than household work – is seen as employment. Thus, subsistence farmers are as employed as the CEO of an international conglomerate.

The below screenshot (Figure 13) is from the questionnaire of the Zimbabwean 2014 LFS, where any yes answer will skip to questions on main employment (Q25). As highlighted, work for a wage is treated in the same than work on any agricultural holding. This survey follows ICLS-13.



Figure 13 - Example of 2014 ZWE LFS questionnaire Example of 2014 ZWE LFS questionnaire

Five years later, the Zimbabwean statistics office, ZimStat, switched to ICLS-19. In ICLS-19, only work for market exchange is considered employment (treating subsistence farming in the same way as household labour). Thus, an additional question is added to differentiate what kind of farming is taking place. The below screenshot (Figure 14) is part of the set of agriculture questions. If the agricultural work on the own agricultural holding is only or mostly for market exchange (codes 1 and 2) the individual should be asked about their first main job (MJ1). If agricultural production is only or mainly for own consumption, then the questionnaire continues, here asking whether they work for others for hire.



Figure 14 - Example of 2019 ZWE LFS questionnaire Example of 2019 ZWE LFS questionnaire

If the survey asks questions to understand what kind of farming takes place (subsistence or market exchange) and defines a skip pattern to lead to employment questions based on that, the survey questionnaire follows ICLS-19, otherwise it follows ICLS-13.

isced_version

Underlying version of the International Standard Classification of Education (ISCED) used in the survey. Acceptable values are either isced_1997 or isced_2011.

isco_version Underlying version of the International Standard Classification of Occupations (ISCO) used in the survey. Acceptable values are either isco_1988 or isco_2008

isic_version

Underlying version of the International Standard Industrial Classification of All Economic Activities (ISIC) used in the survey. Acceptable values are either isic_2, isic_3, isic_3.1, or isic_4.

year

year is a numeric variable that denotes the year in which the implementation of the household survey was begun. For example, if a survey was implemented during October 2018 and September 2019, the year would be 2018.

vermast

vermast codes the version of the master file (original data) being used in the harmonization.

veralt

veralt codes the version of the harmonization.

harmonization

harmonization codes the kind of harmonization (GLD or GMD). For GLD surveys this will always be GLD.

int_year

int_year is a numeric variable that specifies the year when the survey questionnaire was administered to the household.

int_month

int_month is a numeric variable that specifies the month when the survey questionnaire was administered to the household.

hhid

hhid specifies the unique household identification number in the data file. The original format, string or numeric, of original data should be kept. If there is Household ID in the original data, hhid and hhid_orig should be the same. If hhid_orig is missing, it is constructed by “variable names in raw data” variables.

pid

This variable allows identification of individuals. Variable will vary in length depending on how the identification code was constructed in each country. Depending on individual countries, this variable may be a concatenation of several variables in the raw data file. Keep format (string or numeric) of original data. If there is Personal ID in the original data, pid and pid_orig should be the same. If pid_orig is missing, it is constructed by “variable names in raw data” variables.

weight

weight contains household weights, typically inversely proportional to the probability of the household being selected for the sample, that should be applied to all analysis to make the results representative of the population.

weight_m

weight contains household weights, typically inversely proportional to the probability of the household being selected for the sample, that should be applied to all analysis to make the results representative of the population for each month. To be added only if present in the raw data and survey reports estimate results per month.

weight_q

weight contains household weights, typically inversely proportional to the probability of the household being selected for the sample, that should be applied to all analysis to make the results representative of the population for each quarter. To be added only if present in the raw data and survey reports estimate results per quarter.

psu

Primary sampling unit (psu) refers to sampling units that are selected in the first (primary) stage of multi- stage sample design. These sampling units typically correspond to a number of large aggregate units (clusters), each of which contains sub-units. For example, a primary sampling unit can represent the set of all housing units contained in a well-defined geographic area, such as a municipality or a group of contiguous municipalities. Primary sampling units are numeric and country-specific. A unique identifier is created for each primary sampling unit. In Stata, users are advised to specify the primary sampling unit through the svyset command.

ssu

Secondary sampling unit code (if present).

strata

Unit defining the first stage stratification strategy.

wave

In case of the survey being rolled out over several waves (e.g. quarterly), codes the information of the iteration of the survey.

panel

A string variable denoting which panel the individual belongs to. A panel is defined as all individuals who entered a survey at the same time (e.g., Q3 of 2020) and are scheduled to exit at the same time after a fixed number of survey waves (e.g., after four quarters).

Note that due to attrition not all intakes may exit at the same time. This variable is only to be coded if the concept is present in the raw data already.

visit_no

A numeric variable denoting the visit number (e.g., first visit coded as 1, second visit as 2, …) within a panel. This variable is only to be coded if the concept is present in the raw data already.

Lessons Learned and common challenges#

Coding IDs correctly is integral to allow for analytical tools to leverage the information. The coding of the identifications should follow from the survey structure and should not be built via a sequential index (e.g., gen hhid = _n). Since observations may be ordered differently, possibly even across different vintages of the same file, the coding may lead to different outcomes.

* Create hhid like this:
gen hhid = psu * 100 + hh
* Note like this:
gen hhid = _n

When creating hhid and pid, especially from string variables or from group(varlist) or concat(varlist) functions, users should try to create them from roster data files first where all information or observations are available. In addition, the order of the variables in the varlist option above must be the same across the files. Across the data files, the order and the sort on the variables in the varlist must be done in the same way across files.

When the hhid and pid are in numeric format but less precision, it is recommended to bring them the accurate precision level so it can be used in the merging correctly. For example, the value of the hhid for an observation might be 100021210121 (a long number), users should format the variable by “format %15.0g hh”.

In case a household survey is conducted more than once per year – e.g. quarterly HH surveys – you may want to use this as panel data, in which case the household ID can remain as is. However, if you want to use the data as cross-sectional, then new HHIDs can be constructed for each HH for each quarter.

Quarter

Quarter 1

Quarter 2

Quarter 3

Quarter 4

hhid_orig

hhid=1

hhid=1

hhid=1

hhid=1

hhid

hhid=1Q1

hhid=1Q2

hhid=1Q3

hhid=1Q4

hhid should never be missing and if there is any missing this variable should be checked.

assert missing(hhid)

It is recommended to check the uniqueness level of the data files with identifier variables at the corresponding level of the data (i.e. household vs individual level data).

hhid and pid need to be unique in the database.

isid hhid pid
cap destring pid, replace 
duplicates report hhid pid 
local n=r(unique_value)
`N'!= `n'

Ensure that country is a three-letter country code.

cap confirm str3 var country _rc!=0

Harmonizers should also ensure that country codes are updated according to the ISO country codes. Some common adjustments include the following:

replace countrycode="XKX" if countrycode=="KSV"
replace countrycode="TLS" if countrycode=="TMP" 
replace countrycode="PSE" if countrycode=="WBG"
replace countrycode="COD" if countrycode=="ZAR"

Furthermore, harmonizers should check that the years used are in an appropriate range.

The year needs to be a four-digit number in the range of 1980 to the current year (assumed here to be 2020).

assert year >= 1900 & year <= 2050

Overview of Variables#

Module

Variable label

Variable name

Notes

Survey & ID

ISO 3 Letter country code

countrycode

Survey & ID

Survey acronym

survname

No spaces, no underscores, split sections by “-” (e.g. “ETC-II”)

Survey & ID

Survey long name

survey

Possible names are: LFS, LSMS, … [I am unsure of this difference, some surveys contain either this or the previous variable, have yet to see one with both]

Survey & ID

Version of the ICLS followed

icls_v

Defines the labor force definitions used according to the rules set out by the nth International Conference of Labour Statisticians.

Survey & ID

Version of ISCED used

isced_version

Survey & ID

Version of ISCO used

isco_version

Survey & ID

Verstion of ISIC used

isic_version

Survey & ID

Year of survey start

year

Survey & ID

Master (Source) data version

vermast

Survey & ID

Alternate (Harmonized) data version

veralt

Survey & ID

Kind of harmonization

harmonization

Survey & ID

Year of interview start

int_year

For HH and Individual interviews in that HH earliest possible date

Survey & ID

Month of interview start

int_month

For HH and Individual interviews in that HH earliest possible date

Survey & ID

Household ID

hhid

Survey & ID

Personal ID

pid

Survey & ID

Survey weights

weight

Survey & ID

Primary sampling unit

psu

Survey & ID

Secondary sampling unit

ssu

Survey & ID

Stratification (of PSU)

strata

Survey & ID

Wave of the survey (e.g., Q1 for quarter 1)

wave

Survey & ID

Panel the individual belongs to

panel

Only code if concept already in survey

Survey & ID

Visit number in panel order

visit_no

Only code if concept already in survey