Chapter 7 Dictionary
The SARMD harmonized variables in Figure 7.1 are organized into seven categories: basic, demographic, education, assets, house, labor, and welfare.
Figure 7.2 displays which variables are currently available for each household survey. The different colors indicate a variable’s percentage of missing values.
7.1 Basic survey characteristics
The essential variables that identify each dataset and each individual within a household are countrycode, survey, year, vermast, veralt, idh, and idp. Household weights are defined homogeneously for all members of a household in the variable wgt. To obtain individual-level estimates from a household-level dataset, multiply the household weight by the household size: pop_wgt=wgt*hsize. When working with an individual-level dataset, use wgt. The dates of the survey are recorded as int_year and int_month. The geographical location within the country’s administrative division is recorded as subnatid1, and in some cases subnatid2 provides a more specific location. Unfortunately, the variables psu and strata are only available for a few surveys. The same is true for the spatial deflator spdef.
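As a rough illustration of the weighting rule above, the sketch below builds pop_wgt from wgt and hsize on a made-up household-level table and compares a household-weighted mean with an individual-weighted one. The records, the welfare values, and the helper names add_pop_wgt and weighted_mean are assumptions for illustration and are not part of SARMD.

```python
# Hypothetical household-level records; all values are made up for illustration.
households = [
    {"idh": "h1", "wgt": 120.0, "hsize": 5, "welfare": 900.0},
    {"idh": "h2", "wgt": 80.0, "hsize": 3, "welfare": 1500.0},
    {"idh": "h3", "wgt": 100.0, "hsize": 7, "welfare": 600.0},
]


def add_pop_wgt(rows):
    """Return copies of the rows with pop_wgt = wgt * hsize added."""
    return [{**row, "pop_wgt": row["wgt"] * row["hsize"]} for row in rows]


def weighted_mean(rows, value, weight):
    """Weighted mean of `value`, using `weight` as the estimation weight."""
    total_weight = sum(row[weight] for row in rows)
    return sum(row[value] * row[weight] for row in rows) / total_weight


households = add_pop_wgt(households)

# Estimated number of individuals represented by the sample.
population = sum(row["pop_wgt"] for row in households)

# Mean over households versus mean over individuals.
mean_household = weighted_mean(households, "welfare", "wgt")
mean_individual = weighted_mean(households, "welfare", "pop_wgt")
```

In this toy table the larger households are also the poorer ones, so the individual-weighted mean is lower than the household-weighted mean.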
Short description of each variable
countrycode
: String variable with ISO3 code of each country. It takes the following values:
Country | countrycode |
---|---|
Afghanistan | AFG |
Bangladesh | BGD |
Bhutan | BTN |
India | IND |
Maldives | MDV |
Nepal | NPL |
Pakistan | PAK |
Sri Lanka | LKA |
survey
: String variable with the acronym of the survey name. For instance, the acronym of the Household Income and Expenditure Survey of Bangladesh is HIES.
year
: Four-digit numeric variable that indicates the year to which the consumption welfare aggregate refers.
vermast
: Two-digit string variable that indicates the version of the master (raw) data.
veralt
: Two-digit string variable that indicates the version of the harmonization (alternative) collection.
idh
: String variable that serves as the household identifier.
idp
: String variable that serves as the individual identifier.
wgt
: Numeric variable, also called an estimation weight, that is used to obtain estimates of population parameters of interest. The weight of a given individual may be interpreted as the number of individuals from the population that are represented by this sample unit. For example, if a random sample of 25 individuals has been selected from a population of 100, then each of the 25 sampled individuals may be viewed as representing 4 individuals of the population (Lavallée and Beaumont 2015).
pop_wgt
: Numeric variable that is used to obtain estimates of population parameters of interest when the survey data is collapsed at the household level. You may calculate it as pop_wgt=wgt*hsize.
int_year
: Four-digit numeric variable that specifies the year in which each household was interviewed. This value may differ from the variable year, since some households may have been interviewed in a different year than the one to which the welfare aggregate refers.
int_month
: Numeric variable that specifies the month of the interview (e.g., 1=January, 2=February, 3=March, etc.).
subnatid1
: Country-specific categorical string variable that identifies the highest level of subnational regional identifiers at which the survey is representative.
subnatid2
: Country-specific categorical string variable that identifies the second highest level of subnational regional identifiers at which the survey is representative.
psu
: Identifies the primary sampling unit, that is, the groups selected in the first stage of a multi-stage sample.
strata
: Identifies the sampling strata.
spdef
: Identifies the spatial deflator, if available.
7.2 Demographic variables
There are currently six essential demographic variables in SARMD: age, hsize, male, relationcs, relationharm, and marital. Population pyramids such as the one provided in Figure 7.4 make it possible to see how countries’ demographics change over time, while also showing whether the marital status of individuals has been harmonized adequately. For example, about 63.7 percent of Afghans are under 25 years of age, which reflects a steep pyramid age structure.
age
is a numeric variable that indicates the age of an individual in years. age is an important variable for most socio-economic analyses and must be established as accurately as possible, especially for children under 5 years of age. Values of age>=98 must be coded as 98. Missing values should be recorded as age=. and never as age==99. According to SARMD, the youngest country in the region is Afghanistan and the oldest is Sri Lanka.
hsize
is a numeric variable that measures household size. Household size is close to seven in Afghanistan, Pakistan and Maldives, and closer to 4-5 in the rest of the countries.
Figure 7.5 shows how average household size declines as we move to higher quintiles of per capita expenditures. Average household size is 9.66 in the poorest per capita expenditure quintile of Afghanistan, compared to 3.91 in the richest per capita expenditure quintile of Sri Lanka. Having more children (individuals < 15 years old) is negatively correlated with household per capita expenditures. We would expect higher poverty rates among large households and households with more children.
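The coding rules for age described above (top-coding at 98 and a true missing value instead of 99) can be sketched as follows. The function name harmonize_age and the missing_codes parameter are hypothetical; which raw codes mean "missing" depends on each survey and is an assumption here. Python's None stands in for Stata's missing value.

```python
def harmonize_age(raw_age, missing_codes=()):
    """Top-code age at 98 and map survey-specific missing codes to None.

    `missing_codes` lists the raw values that a given survey uses for
    "unknown age" (an assumption; check each questionnaire).
    """
    if raw_age is None or raw_age in missing_codes:
        return None
    return min(raw_age, 98)


raw_ages = [0, 4, 37, 98, 99, 104, None]

# Survey where 99 is a genuine age: it is top-coded to 98.
ages = [harmonize_age(a) for a in raw_ages]

# Survey where 99 is the raw code for "unknown": it becomes missing.
ages_with_code = [harmonize_age(a, missing_codes=(99,)) for a in raw_ages]
```

Either way, the value 99 never appears in the harmonized variable.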
male
indicates the gender of the household member.
- 1=Male
- 0=Female
relationcs
refers to the original categories that indicate the relationship of an individual to the household head. They change from survey to survey and may include categories like:
- Household Head
- Husband/Wife
- Son/Daughter
- Grandchildren
- Grandfather/Grandmother
- Father or mother
- Father-in-law/Mother-in-law
- Brother or sister
- Brother-in-law/sister-in-law
- Nephew or niece
- Domestic servants
- Employee
- Other family relative
- Unrelated member
relationharm
refers to the simplified categories that indicate the relationship of an individual to the household head:
- 1=Household Head
- 2=Spouse
- 3=Children
- 4=Parents
- 5=Other relatives
- 6=Non-relatives
marital
classifies individuals according to their marital status into these simplified categories:
- 1=Married
- 2=Never Married
- 3=Living Together
- 4=Divorced/separated
- 5=Widowed
7.3 Education
There are currently eight essential education variables in SARMD: literacy, atschool, everattend, ed_mod_age, educy, educat4, educat5, and educat7.
7.3.1 Dummy variables of education
The dashboard below summarizes the share of the population that is literate (literacy), attends school (atschool), and has ever attended school (everattend).
literacy
is a dummy variable that indicates whether an individual is able to both read and write. A person is considered literate only if she can both read and write, not just one or the other. A semi-literate person (one who can read but cannot write) is classified as illiterate. When the survey asks only whether a person can read but not whether they can write, literacy cannot be determined and must be coded as missing. The adult literacy rate, which refers to the population aged 15 and over, is an indicator that measures the accumulated achievement of the education system. The youth literacy rate, which is the literacy rate in the population aged 15-24, reflects the outcomes of primary education over roughly the previous 10 years and is a measure of recent educational progress. Figure 7.6 allows the user to plot literacy rates at a subnational level by gender and age groups.
atschool
is a dummy variable that indicates whether an individual is currently enrolled in school.
everattend
is a dummy variable that indicates whether an individual has ever attended school.
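The literacy rule described above can be written as a small function. This is a minimal sketch: the function name code_literacy and the inputs can_read and can_write are hypothetical, and None represents a question that was not asked or not answered.

```python
def code_literacy(can_read, can_write):
    """Return 1 (literate), 0 (illiterate), or None (cannot be determined).

    Inputs are True, False, or None when the question was not asked
    or not answered.
    """
    # Literacy requires both answers; with only one of them it is missing.
    if can_read is None or can_write is None:
        return None
    return 1 if (can_read and can_write) else 0


examples = {
    "reads and writes": code_literacy(True, True),
    "reads only (semi-literate)": code_literacy(True, False),
    "neither": code_literacy(False, False),
    "writing not asked": code_literacy(True, None),
}
```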
7.3.2 Numerical variables of education
ed_mod_age
is the minimum age at which the education module of the questionnaire is applied.
educy
is a numeric variable that measures the number of years of education of an individual. It should not include pre-school.
7.3.3 Categorical variables of education
The dashboard below summarizes the categorical variables of education. For instance, secondary education is everything from the end of primary to before tertiary (for example, grade 7 through 12). Figure 7.7 allows the user to present the absolute and relative frequencies of these categorical variables by survey.
educat4
is a numeric categorical variable that presents the level of education of an individual in four categories:
- 1=No education
- 2=Primary (complete or incomplete)
- 3=Secondary (complete or incomplete)
- 4=Tertiary (complete or incomplete)
educat5
is a numeric categorical variable that presents the level of education of an individual in five categories:
- 1=No education
- 2=Primary incomplete
- 3=Primary complete but secondary incomplete
- 4=Secondary complete
- 5=Some tertiary/post-secondary
educat7
is a numeric categorical variable that presents the level of education of an individual in seven categories:
- 1=No education
- 2=Primary incomplete
- 3=Primary complete
- 4=Secondary incomplete
- 5=Secondary complete
- 6=Higher than secondary but not university
- 7=University incomplete or complete
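Judging from the category labels above, the three variables appear to nest, so educat5 and educat4 could in principle be derived from educat7. The mapping below is inferred from those labels only and is an assumption, not a documented SARMD rule; the names EDUCAT7_TO_EDUCAT5, EDUCAT7_TO_EDUCAT4, and collapse are hypothetical.

```python
# Mapping inferred from the category labels (an assumption, not a SARMD rule).
EDUCAT7_TO_EDUCAT5 = {
    1: 1,  # No education
    2: 2,  # Primary incomplete
    3: 3,  # Primary complete -> Primary complete but secondary incomplete
    4: 3,  # Secondary incomplete -> same category
    5: 4,  # Secondary complete
    6: 5,  # Higher than secondary but not university -> Some tertiary
    7: 5,  # University incomplete or complete -> Some tertiary
}

EDUCAT7_TO_EDUCAT4 = {
    1: 1,  # No education
    2: 2,  # Primary (complete or incomplete)
    3: 2,
    4: 3,  # Secondary (complete or incomplete)
    5: 3,
    6: 4,  # Tertiary (complete or incomplete)
    7: 4,
}


def collapse(educat7, mapping):
    """Map an educat7 code to a coarser scale; missing stays missing."""
    return None if educat7 is None else mapping[educat7]


educat7_values = [1, 2, 3, 4, 5, 6, 7, None]
educat5_values = [collapse(v, EDUCAT7_TO_EDUCAT5) for v in educat7_values]
educat4_values = [collapse(v, EDUCAT7_TO_EDUCAT4) for v in educat7_values]
```

A cross-tabulation of the harmonized variables against such a mapping is one way to spot surveys where the three scales disagree.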
7.4 Durable assets
The sixteen binary asset variables (1=Yes, 0=No) in SARMD indicate whether households have access to a particular durable asset. They do not indicate the quantity of assets available or who owns the asset within the household. The variables are defined at the household level: they do not represent whether each individual owns a particular asset, but whether the household as a whole has access to it. Therefore, a household where every member owns a cellphone and a household where only one member owns a cellphone both have cellphone=1 and cannot be distinguished.
The harmonization of these asset variables is limited by their availability in the household questionnaire. For example, cow, chicken, and buffalo cannot be harmonized if a survey does not cover livestock activities. It may also be that some of these assets are unnecessary (a fan in cold weather), obsolete (land phone), or too basic (lamp) to be included in a questionnaire.
Quality checks were conducted to make sure that these variables could only be equal to 0 or 1. We also verified that the value was the same within each household. A deeper look at these asset variables revealed some interesting trends. Figure 7.8 displays the percentage of households that have access to an asset by country for the latest survey round available. It shows that cellphones are the most accessible assets and that there can be a wide range between the minimum and the maximum.
Figure 7.9 shows a clear exponential trend between the percentage of households that have access to electricity and the percentage of households that have access to a refrigerator. A similar relationship with electricity was found for television, washing machine, and fan. A quality check was conducted to identify the number of observations where the household had no access to electricity, but still had access to an asset. In some cases, it might seem illogical, for example, for a household to own a television if it does not have access to electricity. In Afghanistan (2013), 7,771 out of 20,773 households seemed to have a television and no electricity. However, a mistake in the harmonization process was identified and this number was reduced to 3,119 out of 20,773 households once the mistake was fixed. Still, 13-20% of observations in Afghanistan have consistently reported having a television and no electricity.
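The three quality checks described above (binary values only, a constant value within each household, and assets reported without electricity) can be sketched on individual-level records as follows. The sample rows and the function names are hypothetical and only illustrate the logic; they do not reproduce the checks actually run on SARMD.

```python
# Hypothetical individual-level records; the last row contains a bad value.
rows = [
    {"idh": "h1", "idp": "1", "electricity": 1, "television": 1},
    {"idh": "h1", "idp": "2", "electricity": 1, "television": 1},
    {"idh": "h2", "idp": "1", "electricity": 0, "television": 1},
    {"idh": "h3", "idp": "1", "electricity": 0, "television": 0},
    {"idh": "h3", "idp": "2", "electricity": 0, "television": 2},
]


def non_binary(rows, var):
    """Rows whose value of `var` is neither 0, 1, nor missing (None)."""
    return [r for r in rows if r[var] not in (0, 1, None)]


def inconsistent_households(rows, var):
    """Ids of households whose members do not share one value of `var`."""
    values = {}
    for r in rows:
        values.setdefault(r["idh"], set()).add(r[var])
    return sorted(idh for idh, seen in values.items() if len(seen) > 1)


def asset_without_electricity(rows, asset):
    """Ids of households that report the asset but no electricity."""
    return sorted(
        {r["idh"] for r in rows if r[asset] == 1 and r["electricity"] == 0}
    )


bad_values = non_binary(rows, "television")
mixed = inconsistent_households(rows, "television")
tv_no_power = asset_without_electricity(rows, "television")
```

A flagged household is not necessarily an error: as the Afghanistan example shows, part of the mismatch can persist even after harmonization mistakes are fixed.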
bicycle
is a dummy variable that indicates the availability of bicycles in the household (1=Yes, 0=No).
buffalo
is a dummy variable that indicates the availability of buffaloes in the household (1=Yes, 0=No).
cellphone
is a dummy variable that indicates the availability of cellphones in the household (1=Yes, 0=No).
chicken
is a dummy variable that indicates the availability of chickens in the household (1=Yes, 0=No).
computer
is a dummy variable that indicates the availability of computers in the household (1=Yes, 0=No).
cow
is a dummy variable that indicates the availability of cows in the household (1=Yes, 0=No).
fan
is a dummy variable that indicates the availability of fans in the household (1=Yes, 0=No).
lamp
is a dummy variable that indicates the availability of lamps in the household (1=Yes, 0=No).
landphone
is a dummy variable that indicates the availability of land phones in the household (1=Yes, 0=No).
motorcar
is a dummy variable that indicates the availability of motor cars in the household (1=Yes, 0=No).
motorcycle
is a dummy variable that indicates the availability of motorcycles in the household (1=Yes, 0=No).
radio
is a dummy variable that indicates the availability of radios in the household (1=Yes, 0=No).
refrigerator
is a dummy variable that indicates the availability of refrigerators in the household (1=Yes, 0=No).
sewingmachine
is a dummy variable that indicates the availability of sewing machines in the household (1=Yes, 0=No).
television
is a dummy variable that indicates the availability of televisions in the household (1=Yes, 0=No).
washingmachine
is a dummy variable that indicates the availability of washing machines in the household (1=Yes, 0=No).
7.5 Housing
Housing is an essential component of household living conditions. For example, the living conditions of the Afghan population are to a large extent determined by the conditions of housing, including facilities for drinking water and sanitation. Most people live in dwellings that are constructed with non-durable materials and in conditions of overcrowding, meaning that there are more than three persons per room. The large majority of urban dwellers live in slums or in inadequate housing.
SARMD includes the following 12 harmonized variables related to housing:
electricity
is a dummy variable that indicates the availability of electricity in the household (1=Yes, 0=No).
internet
is a dummy variable that indicates the availability of internet in the household (1=Yes, 0=No).
ownhouse
is a dummy variable that indicates the ownership status of the dwelling unit by the household residing in it.
- 0=No: refers to renters, squatters, housing received for free, among others
- 1=Yes: includes ownership whether or not full payment has yet been made
piped_water
is a dummy variable that indicates the availability of piped water in the household (1=Yes, 0=No).
sar_improved_toilet
is a dummy variable that indicates whether a household has access to an improved type of sanitation facility using country-specific definitions.
- 0=Unimproved
- 1=Improved
sar_improved_water
is a dummy variable that indicates whether a household has access to an improved source of drinking water using country-specific definitions.
- 0=Unimproved
- 1=Improved
sewage_toilet
is a dummy variable that indicates the availability of a sewage toilet in the household (1=Yes, 0=No).
toilet_jmp
is a categorical variable that indicates the type of toilet using the Joint Monitoring Program categories:
- 1=Flush to piped sewer system
- 2=Flush to septic tank
- 3=Flush to pit latrine
- 4=Flush to somewhere else
- 5=Flush, don’t know where
- 6=Ventilated improved pit latrine
- 7=Pit latrine with slab
- 8=Pit latrine without slab/open pit
- 9=Composting toilet
- 10=Bucket toilet
- 11=Hanging toilet/hanging latrine
- 12=No facility/bush/field
- 13=Other
toilet_orig
is a categorical variable that indicates the type of toilet using the original categories provided by the survey.
urban
is a dummy variable that indicates whether a household is located in an urban or rural area.
- 0=Rural
- 1=Urban
water_jmp
is a categorical variable that indicates the source of drinking water using the Joint Monitoring Program categories:
- 1=Piped into dwelling
- 2=Piped into compound, yard or plot
- 3=Public tap / standpipe
- 4=Tubewell, Borehole
- 5=Protected well
- 6=Unprotected well
- 7=Protected spring
- 8=Unprotected spring
- 9=Rain water
- 10=Tanker-truck or other vendor
- 11=Cart with small tank / drum
- 12=Surface water (river, stream, dam, lake, pond)
- 13=Bottled water
- 14=Other
water_orig
is a categorical variable that indicates the source of drinking water using the original categories provided by the survey.
7.6 Labor
empstat
is a categorical variable that indicates the type of employment of the first job:
- 1=Paid Employee
- 2=Non-Paid Employee
- 3=Employer
- 4=Self-employed
- 5=Other, workers not classifiable by status
empstat_2
is a categorical variable that indicates the type of employment of the second job:
- 1=Paid Employee
- 2=Non-Paid Employee
- 3=Employer
- 4=Self-employed
- 5=Other, workers not classifiable by status
empstat_2_year
firmsize_l
indicates the firm size.
industry
classifies the first job of any individual with a job, i.e., lstatus=1, and is missing otherwise. These single-digit codes are based on the UN International Standard Industrial Classification (revision 3.1).
- 1=Agriculture, Hunting, Fishing, etc.
- 2=Mining
- 3=Manufacturing
- 4=Public Utility Services
- 5=Construction
- 6=Commerce
- 7=Transport and Communications
- 8=Financial and Business Services
- 9=Public Administration
- 10=Other Services, Unspecified
industry_2
classifies the second job of any individual with a job, i.e., lstatus=1, and is missing otherwise. These single-digit codes are based on the UN International Standard Industrial Classification (revision 3.1).
- 1=Agriculture, Hunting, Fishing, etc.
- 2=Mining
- 3=Manufacturing
- 4=Public Utility Services
- 5=Construction
- 6=Commerce
- 7=Transport and Communications
- 8=Financial and Business Services
- 9=Public Administration
- 10=Other Services, Unspecified
industry_orig
is a categorical variable that indicates the original country-specific industry codes for the first job.
industry_orig_2
is a categorical variable that indicates the original country-specific industry codes for the second job.
lb_mod_age
is a numerical variable that indicates the age at which the labor module starts being applied.
lstatus
is a categorical variable that indicates the labor force status of an individual. All persons are considered active in the labor force if they presently have a job (formal or informal) or do not have a job but are actively seeking work (unemployed).
- 1=Employed
- 2=Unemployed
- 3=Not in labor force
njobs
indicates the total number of jobs of an individual.
nlfreason
is a categorical variable that indicates the reason an individual is not in the labor force. This variable is constructed for all those who are not presently employed and are not looking for work (lstatus=3) and is missing otherwise.
- 1=Student
- 2=Housewife
- 3=Retired
- 4=Disabled
- 5=Other
occup
is a categorical variable that classifies jobs according to the following one-digit occupational classification:
- 1=Managers
- 2=Professionals
- 3=Technicians and associate professionals
- 4=Clerical support workers
- 5=Service and sales workers
- 6=Skilled agricultural, forestry and fishery workers
- 7=Craft and related trades workers
- 8=Plant and machine operators, and assemblers
- 9=Elementary occupations
- 10=Armed forces occupations
- 99=Other/unspecified
ocusec
is a categorical variable that classifies jobs according to their sector of activity:
- 1=Public sector, Central Government, Army, NGO
- 2=Private
- 3=State owned
- 4=Public or State-owned, but cannot distinguish
unitwage
states the time unit of the wage for the first job of any paid employee (lstatus=1 & empstat=1). It should be missing otherwise.
- 1=Daily
- 2=Weekly
- 3=Every two weeks
- 4=Every two months
- 5=Monthly
- 6=Quarterly
- 7=Every six months
- 8=Annually
- 9=Hourly
- 10=Other
unitwage_2
states the time unit of the wage for the second job of any paid employee (lstatus=1 & empstat=1). It should be missing otherwise.
- 1=Daily
- 2=Weekly
- 3=Every two weeks
- 4=Every two months
- 5=Monthly
- 6=Quarterly
- 7=Every six months
- 8=Annually
- 9=Hourly
- 10=Other
wage
indicates the last wage payment of the first job, where the time unit is given by unitwage.
wage_2
indicates the last wage payment of the second job, where the time unit is given by unitwage_2.
whours
indicates the number of hours worked in the last week.
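Several of the labor definitions above state when a variable must be missing: industry unless lstatus=1, nlfreason unless lstatus=3, and unitwage unless lstatus=1 & empstat=1. A minimal sketch of how those conditions could be checked on individual records is shown below. The records and the function names labor_flags and in_labor_force are hypothetical, and the participation rate is unweighted for simplicity.

```python
def labor_flags(person):
    """Return the names of variables that are set when they should be missing."""
    flags = []
    lstatus = person.get("lstatus")
    # industry is defined only for individuals with a job (lstatus=1).
    if person.get("industry") is not None and lstatus != 1:
        flags.append("industry")
    # nlfreason is defined only for those not in the labor force (lstatus=3).
    if person.get("nlfreason") is not None and lstatus != 3:
        flags.append("nlfreason")
    # unitwage is defined only for paid employees (lstatus=1 & empstat=1).
    paid_employee = lstatus == 1 and person.get("empstat") == 1
    if person.get("unitwage") is not None and not paid_employee:
        flags.append("unitwage")
    return flags


def in_labor_force(person):
    """Employed (1) or unemployed (2) individuals are in the labor force."""
    return person.get("lstatus") in (1, 2)


# Hypothetical records: person 3 and person 4 each break one rule.
people = [
    {"idp": "1", "lstatus": 1, "empstat": 1, "industry": 3, "unitwage": 5},
    {"idp": "2", "lstatus": 3, "nlfreason": 1},
    {"idp": "3", "lstatus": 2, "industry": 6},
    {"idp": "4", "lstatus": 1, "empstat": 4, "industry": 1, "unitwage": 1},
]

report = {p["idp"]: labor_flags(p) for p in people}
participation = sum(in_labor_force(p) for p in people) / len(people)
```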
7.7 Welfare
cpi
is the value of the Consumer Price Index, based on 2011, used to convert local currency units.
cpiperiod
indicates the periodicity of the Consumer Price Index, which could be by year, year and month, year and quarter, or weighted.
pline_int
provides the value of the international poverty line.
pline_nat
provides the value of the national poverty line.
poor_int
is a dummy variable that indicates whether an individual has been classified as poor (1=Yes, 0=No) as a result of being below the international poverty line.
poor_nat
is a dummy variable that indicates whether an individual has been classified as poor (1=Yes, 0=No) as a result of being below the national poverty line.
ppp
provides the value of the 2011 Purchasing Power Parity exchange rate.
welfare
is the welfare aggregate that is compared to the international poverty line to estimate international poverty.
welfaredef
is the spatially deflated welfare aggregate that is compared to the poverty lines to estimate poverty.
welfarenat
is the welfare aggregate that is compared to the national poverty line to estimate national poverty.
welfarenom
is the welfare aggregate in nominal terms.
welfareother
presents a welfare aggregate when a welfare type different from that of welfare, welfarenom, or welfaredef is used.
welfaretype
specifies the type of welfare measure for the variables welfare, welfarenom and welfaredef.
- INC=income
- CONS=consumption
- EXP=expenditure
welfareothertype
specifies the type of welfare measure for the variable welfareother.
- INC=income
- CONS=consumption
- EXP=expenditure
welfshprosperity
presents a welfare aggregate for shared prosperity (if different from poverty).
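To illustrate how a poverty dummy and a weighted headcount relate to a welfare aggregate and a poverty line, the sketch below flags individuals whose welfare falls below the line and computes the weighted share of poor. It assumes that the welfare value and the line are already expressed in the same units; the records, the line value, and the function names flag_poor and headcount are made up for illustration and do not reproduce the SARMD procedure.

```python
# Hypothetical individual-level records, welfare in the same units as the line.
people = [
    {"idp": "1", "wgt": 100.0, "welfare": 1.50},
    {"idp": "2", "wgt": 300.0, "welfare": 2.40},
    {"idp": "3", "wgt": 100.0, "welfare": 1.89},
    {"idp": "4", "wgt": 500.0, "welfare": 5.00},
]

PLINE = 1.90  # illustrative poverty line, same units as welfare


def flag_poor(welfare, pline):
    """1 if welfare is below the line, 0 otherwise, None if welfare is missing."""
    if welfare is None:
        return None
    return 1 if welfare < pline else 0


def headcount(rows, pline):
    """Weighted share of individuals below the line, ignoring missing welfare."""
    valid = [r for r in rows if r["welfare"] is not None]
    poor_weight = sum(
        r["wgt"] for r in valid if flag_poor(r["welfare"], pline) == 1
    )
    return poor_weight / sum(r["wgt"] for r in valid)


poor = [flag_poor(p["welfare"], PLINE) for p in people]
rate = headcount(people, PLINE)
```

With household-level data, the same calculation would use pop_wgt instead of wgt so that each household counts once per member.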
References
Lavallée, Pierre, and Jean-François Beaumont. 2015. “Why We Should Put Some Weight on Weights. Survey Insights: Methods from the Field, Weighting: Practical Issues and ‘How to’ Approach.” https://surveyinsights.org/?p=6255.