Chapter 14 Understanding poverty through household and individual level characteristics

Abstract:

We apply household-level multivariate regression modeling to identify key characteristics of poor households in South Asia. In the linear regression model, the dependent variable is the logarithm of expenditures per capita divided by the $1.90 poverty line. In the logit model, the dependent variable is binary and indicates whether the household is poor or not. This type of analysis has allowed us to reject a few hypotheses. For example, it is not true in general that female-headed households have lower levels of expenditures per capita. It is true, however, that urban households have significantly higher expenditures per capita. Increasing the size of the household increases the probability of falling below the poverty line. This analysis has allowed to quantify the magnitude of these effects using harmonized variables for surveys available in SARMD.

Regression analysis is commonly undertaken to identify the effects of different characteristics on expenditures per capita and poverty. Some of these characteristics may include independent variables at the individual (age, gender, marital status), household (household size, housing, access to assets), community (urban/rural), and regional (state/province) levels. These variables have been harmonized in SARMD. For example, the characteristics of the household head, such as age, educy, and male, may be identified using relationharm==1 or relationcs==1.

In this chapter, we provide a template for any user attempting to explain the levels of expenditure per capita (the dependent variable) as a function of a variety of harmonized independent variables contained in SARMD. In the linear regression model, the dependent variable is the logarithm of expenditures per capita divided by the $1.90 poverty line. In the logit model, the dependent variable is binary and indicates whether the household is poor or not.

A typical analysis of expenditures per capita may look something like this:

\[ \begin{equation} \ln \left( \frac{y_{i}}{1.90} \right) = \beta_{0} +\beta_1 X_{i}+\beta_2 Z_{i}+\varepsilon _{i} \end{equation} \]

This equation is said to be in the semi-log form because only the dependent variable is in logarithm. The coefficients in the semi-log model are partial or semi-elasticities. The model assumes a linear relationship between $\ln (y_{i}/1.90)$, $X_{i}$, and $Z_{i}$, where $X_{i}$ are independent variables for household head $i$ and $Z_{i}$ are independent variables for household $i$. The $\beta$’s are the coefficients we are trying to estimate. $\varepsilon _{i}$ represents a normally-distributed random error term.

The regression analysis of a binary variable determines the probability of an outcome rather than an alternative outcome, in this case, falling below the poverty line or not:

\[ \begin{equation} poor\_int = \begin{cases} 1 & if & \ln \left( \frac{y_{i}}{1.90} \right) < 0\\ 0 & if & \ln \left( \frac{y_{i}}{1.90} \right) \geq 0 \end{cases} \end{equation} \]

A two-step sample design, first selecting clusters and then households, generates a sample in which households are not randomly distributed over space. One complication of clustering is that the error term is correlated within cluster. Interviewing several households on a same block reduces survey costs, but may also lead to data being influenced by common unobserved cluster-specific characteristics. Surveys are also commonly designed to generate statistics for population subgroups defined by geographical area or ethnicity. Stratification guarantees there will be enough observations to permit estimates for each of these groups. Deaton (1997) provides an example. Suppose we know average rural income is lower than average urban income. A stratified survey would be two identical surveys, one rural and one urban, each of which would estimate average income. The average income for the whole country would be calculated by weighting together the urban and rural means using the proportion of the population in each group as weights.

Unfortunately, psu and strata are not always available in SARMD as shown in Figure 14.1, but we use them whenever they are available. We selected male, age, hsize, urban, electricity, ownhouse, and subnatid1 as control variables because they are present in most of the surveys and they have a low percentage of missing observations.

Figure 14.1: Summary statistics

An example of how to conduct these regressions is provided below:

* Preamble
clear all
eststo clear

* Open data
datalibweb, country(BGD) year(2016) type(SARMD) mod(IND) ///
vermast(01) veralt(03) surveyid(HIES)

* Declare survey design for dataset
svyset psu [pw=pop_wgt], strata(strata)

* Generate dependent variable
gen ln_welfare_perc=ln(welfare/cpi/ppp/365*12/1.90)

* Define control variables
gen age_squared=age^2
local controls "i.male age age_squared hsize i.urban ///
i.electricity i.ownhouse i.subnatid1"

* Describe data
sum ln_welfare_perc poor_int `controls' ///
  if relationharm==1 [aw=pop_wgt]

* Conduct regressions
eststo model1_linear: svy: reg ln_welfare_perc ///
`controls' if relationharm==1

eststo model2_logit: svy: logit poor_int ///
`controls' if relationharm==1

* Present results in table
esttab *, se r2 label

* Present results in figure
coefplot

14.1 Linear model results

Figure 14.2 displays the results of the linear model regression by variable. The confidence interval gives a range of values such that if the experiment was run many times (e.g., 10,000 times), the range would contain the true parameter 95% of the time. Note how the confidence intervals become wider when we incorporate the survey design variables psu and strata. In a semi-log model, semi elasticities are interpreted as proportionate changes rather than levels of changes. For example, given $\beta=0.02$, a one-unit change in $x$ is associated with a proportionate increase of 2% in $E(y|x)$.

Figure 14.2: Linear regression estimates

Based on this analysis, we provide the following highlights:

Urban expenditures per capita are 20-30% above rural, once controlling for other household characteristics.

Age of the household head follows the expected non-linear behavior. The coefficient on age is positive, while the coefficient on age squared is negative.

The mixed results obtained from gender of the household head must be treated with caution, but they suggest it is not true in general that female-headed households have lower levels of expenditures per capita as one may expect.

Increasing the size of the household consistently reduces expenditures per capita within a range from -2% to -13%.

Figure 14.3 displays the results of the linear model regression by survey.

Figure 14.3: Linear regression estimates

14.2 Logit model results

Logistic regression, also called a logit model, is used to model dichotomous outcome variables. In the logit model the log odds ratio of the outcome is modeled as a linear combination of the predictor variables:

\[ \begin{equation} \log \left( \frac{p}{1-p} \right) = \beta_{0} +\beta_1 X_{i}+\beta_2 Z_{i}+\varepsilon _{i} \end{equation} \] where $p$ is the probability of being poor (defined as expenditures per capita being below the $1.90 poverty line). In this model, the dependent variable maps probability ranging between 0 and 1 to log odds ratio ranging from negative infinity to positive infinity.

Figure 14.4 displays the results of the logit model regression by variable. As expected, the signs are reversed to those obtained from the linear regression.

Figure 14.4: Linear regression estimates

Figure 14.5 displays the results of the logit model regression by survey.

Figure 14.5: Linear regression estimates

You may access our full Stata do-file by accessing the following link.

References

Deaton, Angus S. 1997. The Analysis of Household Surveys : A Microeconometric Approach to Development Policy. Washington, D.C. : The World Bank. http://documents.worldbank.org/curated/en/593871468777303124/The-Analysis-of-Household-Surveys-A-Microeconometric-Approach-to-Development-Policy.