Chapter 7 Dictionary


The SARMD harmonized variables in Figure 7.1 are organized into seven categories: basic, demographic, education, assets, house, labor, and welfare.

Figure 7.1: Seven categories of harmonized variables

Figure 7.2 displays which variables are currently available for each household survey. The different colors indicate a variable’s percentage of missing values.

SARMD Availability and missing observations

Figure 7.2: SARMD Availability and missing observations

7.1 Basic survey characteristics

Figure 7.3: Basic survey characteristics

The essential variables that identify each dataset and individual within a household are countrycode, survey, year, vermast, veralt, idh, and idp. Household weights are homogenoursly defined for all the members of the household in the variable wgt. So, to have individual level estimation when working on a household level dataset, you need to multiply household size household weight, pop_wgt=wgt*hsize. When working on an indidividual household survey, you use wgt. The dates of the survey are recorded as int_year and int_month. The geographical location within the country’s administrative division is recorded as subnatid1, and in some cases, subnatid2 may provide a more specific location. Unfortunately, the variables psu and strata are only available for a few surveys. The same is true for the spatial deflator spdef.

Short description of each variable

  • countrycode: String variable with ISO3 code of each country. It takes the following values:
Table 7.1: ISO3 codes of SAR countries
Country countrycode
Afghanistan AFG
Bangladesh BGD
Bhutan BTN
India IND
Maldives MDV
Nepal NPL
Pakistan PAK
Sri Lanka LKA
  • survey: String variable with the acronym of the name of the survey. For instance, the acronym of the Household Income and Expenditure Survey of Bangladesh would be HIES.

  • year: Four–digit numeric variable that refers to the year to which the consumption welfare aggregate refers.

  • vermast: Two–digit string variable that indicates the version of the master (raw) data.

  • veralt: Two–digit string variable that indicates the version of the harmonization (alternative) collection.

  • idh: String variable that serves as household id or household identificator.

  • idp: String variable that serves as individual id or individual identificator.

  • wgt: Numeric variable, also called an estimation weight, that is used to obtain estimates of population parameters of interest. The weight of a given individual may be interpreted as the number of individuals from the population that are represented by this sample unit. For example, if a random sample of 25 individuals has been selected from a population of 100, then each of the 25 sampled individuals may be viewed as representing 4 individuals of the population (Lavallée and Beaumont (2015)).

  • pop_wgt: Numeric variable that is used to obtain estimates of population parameters of interest when the survey data is collapsed at the household level. You may calculate it like this: =wgt*hsize

  • int_year: Four–digit numeric variable that specifies the year the household survey was conducted in each household. This value may differ from variable year as it could be the case that some households were interviewed in a different year than the one to which the welfare aggregate refers.

  • int_month: Numeric variable that specifies the month of the interview (e.g., 1=January, 2=February, 3=March, etc.).

  • subnatid1: Country-specific categorical string variable to identify the highest level of subnational regional identifiers at which the survey is representative.

  • subnatid2: Country-specific categorical string variable to identify the second highest level of subnational regional identifiers at which the survey is representative.

  • psu identifies the primary sampling unit, which are the groups selected as the first stage of a multi-stage sample.

  • strata identifies the sampling strata.

  • spdef identifies the spatial deflator if avaialable.

7.2 Demographic variables

There are currently six essential demographic variables in SARMD: age, hsize, male, relationcs, relationharm and marital. Population pyramids such as the one provided in Figure 7.4 allow to see how countries’ demographics change over time while showing whether the marital status of individuals has been harmonized adequately. For example, about 63.7 percent of Afghans are under 25 years of age, which reflects a steep pyramid age structure.

Population pyramids

Figure 7.4: Population pyramids

  • age is a numeric variable that indicates the age of an individual in years. age is an important variable for most socio-economic analyses and must be established as accurately as possible especially for children aged less than 5 years. age>= 98 must be coded as 98. Missing values should be recorded as age=., never use age==99. According to SARMD, the youngest country in the region is Afghanistan and the oldest is Sri Lanka.

  • hsize is a numeric variable that measures household size. Household size is close to seven in Afghanistan, Pakistan and Maldives, and much closer to 4-5 in the rest of the countries.

Figure 7.5 shows how average household size declines as we move to higher quintiles of per capita expenditures. Average household size is 9.66 in the poorest per capita expenditure quintile of Afghanistan, compared to 3.91 in the richest per capita expenditure quintile of Sri Lanka. Having more children (individuals < 15 years old) is negatively correlated with household per capita expenditures. We would expect higher poverty rates among large households and households with more children.

Household composition and expenditures

Figure 7.5: Household composition and expenditures

  • male indicates the gender of the household member.

    • 1=Male
    • 0=Female
  • relationcs refers to the original categories that indicate the relationship of an individual to the household head. They change from survey to survey and may include categories like:

    • Household Head
    • Husband/Wife
    • Son/Daughter
    • Grandchildren
    • Grandfather/Grandmother
    • Father or mother
    • Father-in-law/Mother-in-law
    • Brother or sister
    • Brother-in-law/sister-in-law
    • Nephew or niece
    • Domestic servants
    • Employee
    • Other family relative
    • Unrelated member
  • relationharm refers to the simplified categories that indicate the relationship of an individual to the household head:

    • 1=Household Head
    • 2=Spouse
    • 3=Children
    • 4=Parents
    • 5=Other relatives
    • 6=Non-relatives
  • marital classifies individuals according to their marital status into these simplified categories:

    • 1=Married
    • 2=Never Married
    • 3=Living Together
    • 4=Divorced/separated
    • 5=Widowed

7.3 Education

There are currently eight essential education variables in SARMD: literacy, atschool, everattend, ed_mod_age, educy, educat4, educat5, and educat7.

7.3.1 Dummy variables of education

The dashboard below summarizes the share of the population that is lerate (literacy), attends school (atschool), and has ever attended school (everattend).

Harmonized asset ownership in latest household survey round

Figure 7.6: Harmonized asset ownership in latest household survey round

  • literacy is a dummy variable that indicates whether an individual is able to both read and write. A person is considered literate if she can both read and write and not just one or the other. A semi-literate person (one who can read, but cannot write) is said to be illiterate. In the case where the survey asks only whether a person can read but does not ask if they can write, literacy cannot be determined, and must be coded as missing. The adult literacy rate – referring to the population aged 15 and over – is an indicator that measures the accumulated achievement of the education system. The youth literacy rate – the literacy rate in the population aged 15-24 – reflects the outcomes of primary education over roughly the previous 10 years and is a measure of recent educational progress. Figure 7.6 allows the user to plot literacy rates at a subnational level by gender and age groups.

  • atschool is a dummy variable that indicates whether individual is currently enrolled in school.

  • everattend is a dummy variable that indicates whether individual has ever attended school.

7.3.2 Numerical variables of education

  • ed_mod_age is the minimum age level at which the education module of the questionnaire is applied.

  • educy is a numeric variable that measures the number of years of education of an individual. It should not include pre-school.

7.3.3 Categorical variables of education

The dashboard below summarizes the categorical viariables of education. For instance, secondary education is everything from the end of primary to before tertiary (for example, grade 7 through 12). Figure 7.7 allows the user to present the absolute and relative frequencies of these categorical variables by survey.

Level of education categories

Figure 7.7: Level of education categories

  • educat4 is a numeric categorical variable that presents the level of education of an individual in four categories:
    • 1=No education
    • 2=Primary (complete or incomplete)
    • 3=Secondary (complete or incomplete)
    • 4=Tertiary (complete or incomplete)
  • educat5 is a numeric categorical variable that presents the level of education of an individual in five categories:
    • 1=No education
    • 2=Primary incomplete
    • 3=Primary complete but secondary incomplete
    • 4=Secondary complete
    • 5=Some tertiary/post-secondary
  • educat7 is a numeric categorical variable that presents the level of education of an individual in seven categories:
    • 1=No education
    • 2=Primary incomplete
    • 3=Primary complete
    • 4=Secondary incomplete
    • 5=Secondary complete
    • 6=Higher than secondary but not university
    • 7=University incomplete or complete

7.4 Durable assets

The sixteen asset binary variables (1=Yes, 0=No) in SARMD represent whether households have access to a particular durable asset. They do not indicate the quantity of assets available or who is the owner of the asset within the household. The variables are defined at the household level and do not represent whether each individual owns an asset in particular, but whether the household as a whole has access to it. Therefore, a household where every member owns a cellphone and a household where only one member owns a cellphone are both cellphone=1 and cannot be distinguished.

The harmonization of these asset variables is limited by their availability in the household questionnaire. For example, cow, chicken, and buffalo cannot be harmonized if a survey does not cover live-stocking activities. It may also be that some of these assets are unnecessary (a fan in cold weather), obsolete (land phone), or too basic (lamp) to be included in a questionnaire.

Quality checks were conducted to make sure that these variables could only be equal to 0 or 1. We also verified that the value was the same within each household. A deeper look at these asset variables allowed to identify some interesting trends. Figure 7.8 displays the percentage of households that have access to an asset by country for the latest survey round available. It shows that cellphones are the most accessible assets and that there can be a wide range between the minimum and the maximum.

Harmonized asset ownership in latest household survey round

Figure 7.8: Harmonized asset ownership in latest household survey round

Figure 7.9 shows a clear exponential trend between the percentage of households that have access to electricity and the percentage of households that have access to a refrigerator. A similar relationship with electricity was found for television, washing machine, and fan. A quality check was conducted to identify the number of observations where the household had no access to electricity, but still had access to an asset. In some cases, it might seem illogical, for example, for a household to own a television if it does not have access to electricity. In Afghanistan (2013), 7,771 out of 20,773 households seemed to have a television and no electricity. However, a mistake in the harmonization process was identified and this number was reduced to 3,119 out of 20,773 households once the mistake was fixed. Still, 13-20% of observations in Afghanistan have consistently reported having a television and no electricity.

Access to refrigerator and access to electricity across surveys

Figure 7.9: Access to refrigerator and access to electricity across surveys

  • bicycle is a dummy variable that indicates the availability of bicycles in the household (1=Yes, 0=No).

  • buffalo is a dummy variable that indicates the availability of buffaloes in the household (1=Yes, 0=No).

  • cellphone is a dummy variable that indicates the availability of cellphones in the household (1=Yes, 0=No).

  • chicken is a dummy variable that indicates the availability of chicken in the household (1=Yes, 0=No).

  • computer is a dummy variable that indicates the availability of computers in the household (1=Yes, 0=No).

  • cow is a dummy variable that indicates the availability of cows in the household (1=Yes, 0=No).

  • fan is a dummy variable that indicates the availability of fans in the household (1=Yes, 0=No).

  • lamp is a dummy variable that indicates the availability of lamps in the household (1=Yes, 0=No).

  • landphone is a dummy variable that indicates the availability of land phones in the household (1=Yes, 0=No).

  • motorcar is a dummy variable that indicates the availability of motor cars in the household (1=Yes, 0=No).

  • motorcycle is a dummy variable that indicates the availability of motorcycles in the household (1=Yes, 0=No).

  • radio is a dummy variable that indicates the availability of radios in the household (1=Yes, 0=No).

  • refrigerator is a dummy variable that indicates the availability of refrigerators in the household (1=Yes, 0=No).

  • sewingmachine is a dummy variable that indicates the availability of sewing machines in the household (1=Yes, 0=No).

  • television is a dummy variable that indicates the availability of televisions in the household (1=Yes, 0=No).

  • washingmachine is a dummy variable that indicates the availability of washing machines in the household (1=Yes, 0=No).

7.5 Housing

Housing is an essential component of household living conditions. For example, the living conditions of the Afghan population are to a large extent determined by the conditions of housing, including facilities for drinking water and sanitation. Most people live in dwellings that are constructed with non-durable materials and in conditions of overcrowding, meaning that there are more than three persons per room. The large majority of urban dwellers live in slums or in inadequate housing.

SARMD includes the following 12 harmonized variables related to housing:

  • electricity is a dummy variable that indicates the availability of electricity in the household (1=Yes, 0=No).

  • internet is a dummy variable that indicates the availability of internet in the household (1=Yes, 0=No).

  • ownhouse is a dummy variable that indicates the ownership status of the dwelling unit by the household residing in it.

    • 0 == No: refers to renters, squatters, housing received for free, among others
    • 1 == Yes: includes ownership whether or not full-payment has yet been made
  • piped_water is a dummy variable that indicates the availability of piped water in the household (1=Yes, 0=No).

  • sar_improved_toilet is a dummy variable that indicates whether a household has access to an improved type of sanitation facility using country-specific definitions.

    • 0 == Unimproved
    • 1 == Improved
  • sar_improved_water is a dummy variable that indicates whether a household has access to an improved source of drinking water using country-specific definitions.

    • 0 == Unimproved
    • 1 == Improved
  • sewage_toilet is a dummy variable that indicates the availability of a sewage toilet in the household (1=Yes, 0=No).

  • toilet_jmp is a categorical variable that indicates the type of toilet using the Joint Monitoring Program categories:

    • 1=Flush to piped sewer system
    • 2=Flush to septic tank
    • 3=Flush to pit latrine
    • 4=Flush to somewhere else
    • 5=Flush, don’t know where
    • 6=Ventilated improved pit latrine
    • 7=Pit latrine with slab
    • 8=Pit latrine without slab/open pit
    • 9=Composting toilet
    • 10=Bucket toilet
    • 11=Hanging toilet/hanging latrine
    • 12=No facility/bush/field
    • 13=Other
  • toilet_orig is a categorical variable that indicates the type of toilet using the original categories provided by the survey.

  • urban is a dummy variable that indicates whether a household is located in an urban or rural area.

    • 0=Rural
    • 1=Urban
  • water_jmp is a categorical variable that indicates the source of drinking water using the Joint Monitoring Program categories:

    • 1=Piped into dwelling
    • 2=Piped into compound, yard or plot
    • 3=Public tap / standpipe
    • 4=Tubewell, Borehole
    • 5=Protected well
    • 6=Unprotected well
    • 7=Protected spring
    • 8=Unprotected spring
    • 9=Rain water
    • 10=Tanker-truck or other vendor
    • 11=Cart with small tank / drum
    • 12=Surface water (river, stream, dam, lake, pond)
    • 13=Bottled water
    • 14=Other
  • water_orig is a categorical variable that indicates the source of drinking water using the original categories provided by the survey.

7.6 Labor

  • empstat is a categorical variable that indicates the type of employment of the first job:

    • 1=Paid Employee
    • 2=Non-Paid Employee
    • 3=Employer
    • 4=Self-employed
    • 5=Other, workers not classifiable by status
  • empstat_2 is a categorical variable that indicates the type of employment of the second job:

    • 1=Paid Employee
    • 2=Non-Paid Employee
    • 3=Employer
    • 4=Self-employed
    • 5=Other, workers not classifiable by status
  • empstat_2_year

  • firmsize_l indicates the firm size.

  • industry classifies the first job of any individual with a job, i.e., lstatus=1, and is missing otherwise. These single digit codes are based on the UN International Standard Industrial Classification (revision 3.1).

    • 1=Agriculture, Hunting, Fishing, etc.
    • 2=Mining
    • 3=Manufacturing
    • 4=Public Utility Services
    • 5=Construction
    • 6=Commerce
    • 7=Transport and Communications
    • 8=Financial and Business Services
    • 9=Public Administration
    • 10=Others Services, Unspecified
  • industry_2 classifies the second job of any individual with a job, i.e., lstatus=1, and is missing otherwise. These single digit codes are based on the UN International Standard Industrial Classification (revision 3.1).

    • 1=Agriculture, Hunting, Fishing, etc.
    • 2=Mining
    • 3=Manufacturing
    • 4=Public Utility Services
    • 5=Construction
    • 6=Commerce
    • 7=Transport and Communications
    • 8=Financial and Business Services
    • 9=Public Administration
    • 10=Others Services, Unspecified
  • industry_orig is a categorical variable that indicates the original country-specific industry codes for the first job.

  • industry_orig_2 is a categorical variable that indicates the original country-specific industry codes for the second job.

  • lb_mod_age is a numerical variable that indicates the age at which the labor module starts being applied.

  • lstatus is a categorical variable that indicates the labor force status of an individual. All persons are considered active in the labor force if they presently have a job (formal or informal) or do not have a job but are actively seeking work (unemployed).

  • 1=Employed
  • 2=Unemployed
  • 3=Not in labor force

  • njobs indicates the total number of jobs of an individual.

  • nlfreason is a categorical variable that indicates the reason for an individual to not be in the labor force. This variable is constructed for all those who are not presently employed and are not looking for work with lstatus=3 and missing otherwise.

    • 1=Student
    • 2=Housewife
    • 3=Retired
    • 4=Disabled
    • 5=Other
  • occup is a categorical variable that classifies jobs according to the following 1 digit occupational classification:

    • 1=Managers
    • 2=Professionals
    • 3=Technicians and associate professionals
    • 4= Clerical support workers
    • 5=Service and sales workers
    • 6=Skilled agricultural, forestry and fishery workers
    • 7=Craft and related trades workers
    • 8=Plant and machine operators, and assemblers
    • 9=Elementary occupations
    • 10=Armed forces occupations
    • 99=Other/unspecified
  • ocusec is a categorical variable that classifies jobs according to their sector of activity:

    • 1=Public sector, Central Government, Army, NGO
    • 2=Private
    • 3=State owned
    • 4=Public or State-owned, but cannot distinguish
  • unitwage states the first job’s time measurement unit of an employed of any individual lstatus=1 & empstat=1. Should be missing otherwise.

    • 1=Daily
    • 2=Weekly
    • 3=Every two weeks
    • 4=Every two months
    • 5=Monthly
    • 6=Quarterly
    • 7=Every six months
    • 8=Annually
    • 9=Hourly
    • 10=Other
  • unitwage_2 states the second job’s time measurement unit of an employed of any individual lstatus=1 & empstat=1. Should be missing otherwise.

    • 1=Daily
    • 2=Weekly
    • 3=Every two weeks
    • 4=Every two months
    • 5=Monthly
    • 6=Quarterly
    • 7=Every six months
    • 8=Annually
    • 9=Hourly
    • 10=Other
  • wage indicates the last wage payment of the first job where the time unit is unitwage.

  • wage_2 indicates the last wage payment of the second job where the time unit is unitwage_2.

  • whours indicates the number of hours worked in the last week.

7.7 Welfare

  • cpi is the value of the Consumer Price Index based on 2011 to convert local currency units.

  • cpiperiod indicates the periodicity of Consumer Price Index, which could be by year, year and month, year and quarter, or weighted.

  • pline_int provides the value of the international poverty line.

  • pline_nat provides the value of the national poverty line.

  • poor_int is a dummy variable that indicates whether an individual has been classified as poor (1=Yes, 0=No) as a result of being below the international poverty line.

  • poor_nat is a dummy variable that indicates whether an individual has been classified as poor (1=Yes, 0=No) as a result of being below the national poverty line.

  • ppp provides the value of the 2011 Purchasing Power Parity exchange rate.

  • welfare is the welfare aggregate used to compare to the international poverty line to estimate international poverty.

  • welfaredef is the spatially-deflated welfare aggregate used to compare to the poverty lines to estimate poverty.

  • welfarenat is the welfare aggregate used to compare to the national poverty line to estimate national poverty.

  • welfarenom is the welfare aggregate in nominal terms.

  • welfareother presents a welfare aggregate if different welfare type is used from welfare, welfarenom, or welfaredef.

  • welfaretype specifies the type of welfare measure for the variables welfare, welfarenom and welfaredef.

    • INC=income
    • CONS=consumption
    • EXP=expenditure
  • welfareothertype specifies the type of welfare measure for the variable welfareother.

    • INC=income
    • CONS=consumption
    • EXP=expenditure
  • welfshprosperity presents a welfare aggregate for shared prosperity (if different from poverty).

References

Lavallée, Pierre, and Jean-François Beaumont. 2015. “Why We Should Put Some Weight on Weights. Survey Insights: Methods from the Field, Weighting: Practical Issues and ‘How to’ Approach.” https://surveyinsights.org/?p=6255.