Chapter 1 Introduction

The South Asia Regional Micro Database (SARMD) is a regional database of socioeconomic indicators established in 2014 and managed by the South Asia Region Team for Statistical Development (SARTSD). It consists of raw household survey data, documentation, questionnaires, and a repository of do-files used to reconstruct harmonized variables, consumption aggregates, and poverty estimates. SARMD currently covers the eight countries in the region (Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, and Sri Lanka), comprises forty-three surveys, and contains close to a hundred harmonized variables covering the 1993-2017 period.

SARMD follows the Global Monitoring Database (GMD) harmonization guidelines, including the construction of the welfare aggregate that is used for global poverty monitoring. Because GMD contains a limited number of variables, SARMD provides users with a larger set of harmonized variables that are comparable across the countries of the South Asia region, though they may not be comparable with those of countries in other regions. As GMD expands, new variables could be incorporated until users can be referred to GMD alone.

Harmonization facilitates the use of household data at a regional level because all the databases in SARMD share a common set of approximately 100 variables with the same variable names (for example, welfare), variable labels, and value labels. Strictly speaking, however, this does not mean that a given variable measures exactly the same thing in every dataset. For example, even though the variable welfare refers to the per capita welfare aggregate used to calculate official poverty rates and inequality indicators, it may include imputed rents in some countries (e.g., Bhutan and Bangladesh) but not in others (e.g., India and Afghanistan). Differences in questionnaires, sampling frames, recall periods, and several other factors make strict harmonization and perfect comparability across countries impossible.

The best solution to the lack of perfect comparability across countries is known as ex-post harmonization, a harmonization effort that starts from non-harmonized raw data. Two main approaches to ex-post harmonization can be followed:

The minimum common denominator approach refers to harmonizing variables according to the minimum common set of characteristics that all datasets share with respect to one variable. For example, if the minimum number of industrial sectors common to all surveys in all countries were two (e.g., agriculture and non-agriculture), the harmonized variable industcat, which is a categorical variable for industrial sectors, would only have two values in all datasets. This approach guarantees that all the variables are fully comparable, but it suffers from loss of information. Imagine, for example, that a single survey in the SARMD collection only has these two categories for industrial sectors while all the other surveys have five or more. If that were the case, the industcat ex-post harmonized variable would not capture all the valuable information on industrial sectors across countries.
The benchmark approach refers to harmonizing each variable according to a previously established ideal definition. For instance, at the beginning of the project, it could be agreed that the ideal number of industrial sectors should be ten (1 agriculture, 2 mining, 3 manufacturing, etc.). It would not matter whether any of the surveys actually has ten categories for industrial sectors. If some of them have more, it would be necessary to aggregate some of the categories into broader ones. If some of them have fewer than ten categories, it might not be possible to disaggregate those categories into smaller ones.
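The two approaches can be contrasted with a small sketch. It is written in Python purely for illustration (the SARMD repository itself consists of Stata do-files), and every raw sector label and mapping below is hypothetical. Only the three benchmark codes named in the text are used; the remaining sectors are not listed there and are left out.

```python
# Benchmark codes named in the text; the other seven sectors are not listed
# there, so this sketch leaves them out.
BENCHMARK = {1: "agriculture", 2: "mining", 3: "manufacturing"}

# Hypothetical raw labels from a detailed survey: several raw labels are
# aggregated into one benchmark code.
DETAILED_SURVEY = {
    "crops": 1,
    "livestock": 1,
    "coal mining": 2,
    "textiles": 3,
    "food processing": 3,
}

# Hypothetical survey with only two raw categories: "non-agriculture" cannot
# be split into benchmark sectors, so it maps to None (missing).
COARSE_SURVEY = {"agriculture": 1, "non-agriculture": None}


def to_benchmark(raw_values, mapping):
    """Recode raw sector labels to benchmark codes; unmapped labels become None."""
    return [mapping.get(value) for value in raw_values]


def to_common_denominator(benchmark_codes):
    """Collapse benchmark codes to 1 = agriculture, 2 = non-agriculture."""
    return [None if code is None else (1 if code == 1 else 2)
            for code in benchmark_codes]


detailed = to_benchmark(["crops", "textiles", "coal mining"], DETAILED_SURVEY)
coarse = to_benchmark(["agriculture", "non-agriculture"], COARSE_SURVEY)

print(detailed)                                  # [1, 3, 2]
print([BENCHMARK[code] for code in detailed])    # sector labels for the codes
print(coarse)                                    # [1, None]
print(to_common_denominator(detailed))           # [1, 2, 2]
```

Under the benchmark mapping the detailed survey keeps its sector detail, while the coarse survey yields a missing value where disaggregation is impossible. Under the minimum common denominator, both surveys end up with two categories only.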

As you may imagine, neither approach to ex-post harmonization is perfect. SARMD follows the benchmark approach. The main reason for this choice is that household surveys get better over time, and the minimum common denominator approach applied to both old and new surveys would inevitably drag down the content of SARMD to the old surveys’ methodological structure. In contrast, the benchmark approach takes advantage of the structure of new surveys. Once the benchmark stops reflecting the features of newer surveys, there is always room for updating and refining the benchmark and readapting old surveys to it.

Despite the disadvantages of ex-post harmonization, harmonized variables in SARMD may be used for research purposes in many ways. Take, for instance, the dichotomous variables (i.e., variables that take one of only two possible values, say Yes=1, No=0) of access to assets. The dashboard below (Figure 1.1) shows the share of households at a subnational level (state or province) with access to each of the assets in SARMD. For example, household access to at least one bicycle is averaged at a subnational level and plotted on a map in Figure 1.1 for the most recent survey available in each country. Even though survey characteristics vary and the data come from different years, access to a bicycle appears to be considerably more common for households in Eastern India than in other areas of South Asia. One of our findings, presented in Chapter 12, is that in some cases poorer households are more likely to have access to a bicycle.

Figure 1.1: Subnational averages for harmonized variables
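The subnational averages behind a map like Figure 1.1 can be sketched as follows. This is an illustrative Python sketch with made-up data, not the code used to build the dashboard. The names bicycle and wgt are harmonized variable names mentioned in this book; treating wgt as a household weight, the identifier name subnat, and the choice to skip missing answers are all assumptions.

```python
from collections import defaultdict

# Hypothetical household records. `subnat` is a made-up name for the state or
# province identifier; `wgt` is assumed to be a household sampling weight.
households = [
    {"subnat": "A", "bicycle": 1, "wgt": 2.0},
    {"subnat": "A", "bicycle": 0, "wgt": 2.0},
    {"subnat": "B", "bicycle": 1, "wgt": 1.0},
    {"subnat": "B", "bicycle": 1, "wgt": 3.0},
    {"subnat": "B", "bicycle": None, "wgt": 5.0},  # missing answer, skipped
]


def weighted_share(records, indicator, group="subnat", weight="wgt"):
    """Weighted share of records with indicator == 1 within each group."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for row in records:
        if row[indicator] is None:
            continue  # assumption: missing answers are excluded from the average
        numerator[row[group]] += row[weight] * row[indicator]
        denominator[row[group]] += row[weight]
    return {g: numerator[g] / denominator[g] for g in denominator}


shares = weighted_share(households, "bicycle")
print(shares)  # {'A': 0.5, 'B': 1.0}
```

Because the variable is coded 0/1, its weighted mean within a state or province is directly the share of households with access to the asset.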

Parts of the book

This book gives users of SARMD a complete understanding of the harmonization of variables. It is organized in four parts:

Part 1 begins with an explanation of how to access the SARMD collection through the Stata command datalibweb. The next chapter presents an overview of the current poverty status in South Asia, followed by a chapter covering some basic concepts in poverty measurement.

Part 2 presents a metadata analysis of SARMD, which may be broadly described as an overview of the household survey questionnaires and other survey characteristics. This part studies the household surveys’ raw data, sampling methodology, coverage, data capture methods, and ability to measure food, non-food, durables, and housing expenditures.

Part 3 presents the harmonized variables in SARMD. The names of harmonized variables in this book are always emphasized as country, year, wgt, hsize, age, male, bicycle to help identify harmonized variables more easily. We inspect the quality of these harmonized variables and verify that each harmonized variable has been constructed properly. Two kinds of quality checks of the harmonized data have been conducted. A static quality check evaluates the coherence of each variable and its consistency with the rest of the variables in the dataset. At the individual level, for example, the variable age, which refers to the age of the individual, might be required to pass the following quality checks:

It should not have negative values (internal coherence): age>=0
All of its values should fall within a reasonable range (e.g., 0-120)¹: age>=0 & age<=120
The number of years of education of an individual, educy, should be lower than age (internal consistency with other variables): educy<age
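The individual-level checks listed above can be sketched as a small function that reports which checks a record fails. This is an illustrative Python sketch with made-up records, not the code used for the book's quality checks. The education check follows the expression educy<age as written, and skipping records with a missing age is an assumption.

```python
# Hypothetical individual records using the harmonized names `age` and `educy`.
individuals = [
    {"age": 34, "educy": 12},     # passes every check
    {"age": -1, "educy": None},   # fails the non-negativity and range checks
    {"age": 130, "educy": 10},    # fails the range check
    {"age": 8, "educy": 9},       # fails the education check
    {"age": None, "educy": 5},    # missing age: no check can be evaluated
]


def static_checks(row, max_age=120):
    """Return the names of the static checks that a record fails."""
    failures = []
    age, educy = row["age"], row["educy"]
    if age is None:
        return failures  # assumption: records with missing age are skipped
    if age < 0:
        failures.append("age>=0")
    if not (0 <= age <= max_age):
        failures.append("age in range")
    if educy is not None and not (educy < age):
        failures.append("educy<age")
    return failures


report = [static_checks(row) for row in individuals]
for row, failures in zip(individuals, report):
    print(row, failures)
```

A record that fails a check is not necessarily a harmonization error; the sketch only flags cases that deserve a closer look at the raw data.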

At the household level, for example, a high number of households that report having access to a television while not having access to electricity, that is, television==1 & electricity==0, may raise questions on whether these variables were harmonized incorrectly or whether the inconsistency comes from the raw data.
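A household-level cross-check of this kind can be sketched as the share of households that report a television but no electricity. The Python sketch below uses made-up data. The text does not define what counts as a "high number", so the 5 percent review threshold is purely hypothetical.

```python
# Hypothetical household records using the harmonized names from the text.
households = [
    {"television": 1, "electricity": 1},
    {"television": 1, "electricity": 0},  # inconsistent
    {"television": 0, "electricity": 0},
    {"television": 1, "electricity": 0},  # inconsistent
    {"television": 0, "electricity": 1},
]

# Hypothetical threshold above which the pair of variables is reviewed.
REVIEW_THRESHOLD = 0.05


def inconsistency_rate(records, asset="television", requirement="electricity"):
    """Share of complete records that report the asset without the requirement."""
    valid = [r for r in records
             if r[asset] is not None and r[requirement] is not None]
    flagged = [r for r in valid if r[asset] == 1 and r[requirement] == 0]
    return len(flagged) / len(valid) if valid else 0.0


rate = inconsistency_rate(households)
needs_review = rate > REVIEW_THRESHOLD
print(rate, needs_review)  # 0.4 True
```

A high rate does not say where the problem lies. The same comparison has to be repeated on the raw data to tell a harmonization error from an inconsistency in the survey itself.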

On the other hand, a dynamic quality check evaluates inconsistencies in the measurement of harmonized data across survey rounds and gives a better overview of whether categorical variables have changed over time. We tabulate the absolute and relative frequencies of the categories in categorical variables such as marital, urban, and educat7. In this part, it is also possible to see how poor_int, the poverty dummy indicator used to generate extreme poverty headcounts, changes over time.
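A dynamic check of this kind can be sketched as a tabulation of a categorical variable by survey year, followed by a comparison of the relative frequencies between two years. The Python sketch below uses made-up data for the harmonized variables year and urban. The helper names are hypothetical, and the frequencies are unweighted for simplicity.

```python
from collections import Counter, defaultdict

# Hypothetical (year, urban) pairs, one per individual.
records = [
    (2010, 1), (2010, 0), (2010, 0), (2010, 0),
    (2016, 1), (2016, 1), (2016, 1), (2016, 0),
]


def tabulate_by_year(pairs):
    """Map year -> {category: (absolute frequency, relative frequency)}."""
    counts = defaultdict(Counter)
    for year, category in pairs:
        counts[year][category] += 1
    table = {}
    for year, counter in sorted(counts.items()):
        total = sum(counter.values())
        table[year] = {cat: (n, n / total) for cat, n in sorted(counter.items())}
    return table


def largest_shift(table, first, last):
    """Largest absolute change in any category's relative frequency."""
    categories = set(table[first]) | set(table[last])
    return max(
        abs(table[last].get(c, (0, 0.0))[1] - table[first].get(c, (0, 0.0))[1])
        for c in categories
    )


table = tabulate_by_year(records)
for year, row in table.items():
    print(year, row)
print(largest_shift(table, 2010, 2016))  # 0.5
```

A large shift between two survey rounds may reflect a real change, a change in the questionnaire, or a harmonization problem, so it is a prompt for inspection rather than a verdict.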

Part 4 presents a set of socioeconomic findings at the regional level using SARMD. Each of these analytical notes is a brief example of the potential analyses that may be conducted using SARMD. These examples provide the Stata code that may be used to replicate the results. That way, users become more familiar with SARMD and may use these pieces of code as starting points for their own projects.

¹ Although surveys may record people older than 120 years, such ages are unlikely and are also problematic for anonymization.