C Research design for impact evaluation

Development Research in Practice focuses on tools, workflows, and practical guidance for implementing research projects. All research team members, including field staff and research assistants, also need to understand research design and specifically how research design choices affect data work. Without going into too much technical detail, because there are many excellent resources on how to design impact evaluations, this appendix presents a brief overview of the most common methods of causal inference, focusing on their implications for data structure and analysis. This appendix is intended to be a reference, especially for junior team members, for understanding how treatment and control groups are constructed for common methods of causal inference, the data structures needed to estimate the corresponding effects, and specific code tools designed for each method.

Research team members who will do the data work need to understand the study design for several reasons. First, if team members do not know how to calculate the correct estimator for the study, they will not be able to assess the statistical power of the research design. This negatively affects their ability to make real-time decisions in the field, where trade-offs about allocating scarce resources between tasks are inevitable, such as deciding whether to increase the sample size or to improve response rates. Second, understanding how data need to be organized to produce meaningful analysis will save time throughout a project. Third, being familiar with the various approaches to causal inference will make it easier to recognize research opportunities: many of the most interesting projects occur because people in the field recognize the opportunity to implement one of these methods in response to an unexpected event.

This appendix is divided into two sections. The first covers methods of causal inference in experimental and quasi-experimental research designs. The second discusses how to measure treatment effects and structure data for specific methods, including cross-sectional randomized control trials, difference-in-differences designs, regression discontinuity, instrumental variables, matching, and synthetic controls.

Understanding causality, inference, and identification

The types of inputs that impact evaluations are typically concerned with—usually called “treatments”—are also commonly referred to as “programs” or “interventions.” Treatments are observed and measured in order to obtain estimates of study-specific treatment effects, which are the changes in outcomes attributable to the treatment (Abadie and Cattaneo (2018)). The primary goal of research design is to establish causal identification for a treatment effect. Causal identification means establishing that a change in an input directly altered an outcome. When a study is well identified, it is possible to say with confidence that the estimate of the treatment effect would, with an infinite amount of data, be accurate.

Under this condition, it is possible to draw evidence from the limited samples that are actually accessible, using statistical techniques to express the uncertainty due to not having infinite data. Without identification, it is not possible to say whether the estimate would be accurate, even with unlimited data; therefore, changes in outcomes cannot be attributed to the treatment in the small samples to which researchers typically have access. Having more data is, therefore, not a substitute for having a well-identified research design, so it is important to understand how a study identifies its estimate of treatment effects. This understanding allows estimates to be calculated and interpreted appropriately.

All of the study designs discussed here use the potential outcomes framework (Athey and Imbens (2017b)) to compare a group that received some treatment to another, counterfactual, group. Each of these approaches can be used in two types of designs: experimental designs, in which the research team is directly responsible for creating the variation in treatment, and quasi-experimental designs, in which the team identifies a “natural” source of variation and uses it for identification. Neither type is inherently better or worse, and both types are capable of achieving causal identification in different contexts.

Estimating treatment effects using control groups

The key assumption behind estimating treatment effects is that every person, facility, village, or whatever the unit of intervention is, has two possible states: their outcomes if they do not receive some treatment and their outcomes if they do receive that treatment. Each unit’s treatment effect is the individual difference between the outcomes that would be realized in the treated state and those that would be realized in the untreated state, and the average treatment effect (ATE) is the average of these individual differences across the potentially treated population. Most research designs attempt to estimate this parameter by establishing a counterfactual. A counterfactual is a statistical description of what would have happened to specific individuals in an alternative scenario—for example, a different treatment assignment outcome. Several resources provide more or less mathematically intensive approaches to understanding how various methods do this. Impact Evaluation in Practice (Gertler et al. (2016)) is a strong general guide to these methods. Causal Inference (Hernán and Robins (2010)) and Causal Inference: The Mixtape (Cunningham 2021) provide more detailed approaches to the tools. Mostly Harmless Econometrics (Angrist and Pischke (2008)) and Mastering ’Metrics (Angrist and Pischke (2014)) are excellent resources on the statistical principles behind all econometric approaches.

Intuitively, the problem of causal inference is as follows: it is not possible to observe the same unit in both its treated and untreated states simultaneously, so measuring and averaging these effects directly is impossible (Rubin (2003)). Instead, researchers typically make inferences from samples. Causal inference methods are those in which it is possible to identify and estimate an average treatment effect (or, in some designs, other types of treatment effects) by comparing averages between groups. Every research design is based on a way of comparing the outcomes of treated groups against those of another set of “control” observations. These designs all serve to establish that the outcomes in the control group would have been identical on average to those of the treated group in the absence of the treatment. Then, the mathematical properties of averages imply that the calculated difference in averages is equivalent to the average difference, which is the parameter of interest. In this framework, almost all causal inference methods can be described as a series of between-group comparisons.
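
In the potential outcomes notation used by these resources, this logic can be written compactly. As an illustrative sketch, let \(Y_i(1)\) and \(Y_i(0)\) denote the outcomes of unit \(i\) with and without the treatment, and let \(D_i\) indicate treatment status:

\[
\tau_i = Y_i(1) - Y_i(0), \qquad \mathrm{ATE} = \mathbb{E}[\tau_i].
\]

When assignment is unrelated to the potential outcomes, as a successful randomization guarantees in expectation,

\[
\mathbb{E}[Y_i \mid D_i = 1] - \mathbb{E}[Y_i \mid D_i = 0] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] = \mathrm{ATE},
\]

so the observable difference in group averages recovers the unobservable average of individual treatment effects.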

Most of the methods encountered in impact evaluation research rely on some variant of this approach, which is designed to maximize the ability to estimate the effect of the treatment to be evaluated. The focus on identifying treatment effects, however, means that several essential features of causal identification methods are not common in other types of statistical and data science work. First, the econometric models and estimating equations used here do not attempt to create a predictive or comprehensive model of how outcomes are generated. Typically, causal inference designs are not concerned with predictive accuracy, so the models they produce will not predict outcomes or fit the data as well as models built with predictive data science approaches.

Second, when control variables or other variables are included in estimating equations, there is no guarantee that the parameters obtained for those variables are marginal effects in the same way that parameters for the treatment effect(s) are. They can be interpreted only as correlative averages, unless there are additional sources of identification. The models that will be constructed and estimated are intended to do exactly one thing: to express the intention of a project’s research design and to estimate accurately the effect of the treatment it is evaluating. In other words, these models tell the story of the research design in a way that clarifies the exact comparison being made between control and treatment groups.

Designing experimental and quasi-experimental research

Experimental research designs explicitly allow the research team to change the condition of the populations being studied, often in the form of government programs, nongovernmental organization projects, new regulations, information campaigns, and many more types of interventions (A. V. Banerjee and Duflo (2009); see the DIME Wiki at https://dimewiki.worldbank.org/Experimental_Methods). The classic experimental causal inference method is the randomized control trial (RCT; see the DIME Wiki at https://dimewiki.worldbank.org/Randomized_Control_Trials). In RCTs, the treatment group is randomized. That is, from an eligible population, a random group of units is placed in the treatment state. Another way to think about these designs is how they establish the control group: a random subset of units is not placed in the treatment state, so that it may serve as a counterfactual for the subset that is.

A randomized control group, intuitively, is meant to measure how things would have turned out for the treatment group if its members had not been treated. The RCT approach is particularly effective at doing this, as evidenced by its broad credibility in fields ranging from clinical medicine to development. As a result, RCTs are very popular tools for determining the causal impact of specific programs or policy interventions, as evidenced by the awarding of the 2019 Nobel Prize in Economics to Abhijit Banerjee, Esther Duflo, and Michael Kremer “for their experimental approach to alleviating global poverty” (NobelPrize.org (2020)). However, many types of interventions are impractical or unethical to approach effectively using an experimental strategy; for this reason, the ability to access “big questions” through RCT approaches is sometimes limited (Deaton (2009)).

Randomized designs all share several major statistical concerns. The first is the fact that it is always possible to select, by chance, a control group that is not in fact very similar to the treatment group. This risk is called randomization noise, and all RCTs need to assess how randomization noise affects the estimates that are obtained. Second, take-up and implementation fidelity (how closely work carried out in the field corresponds to its planning and intention) are extremely important because programs will, by definition, have no effect if the population intended to be treated does not accept or does not receive the treatment (for an example, see Jung and Hasan (2016)). Loss of statistical power occurs quickly and is highly nonlinear: 70 percent take-up or efficacy doubles the required sample, and 50 percent quadruples it (McKenzie 2011). Such effects are also very hard to correct ex post, because they require strong assumptions about the randomness or lack of randomness of take-up and fidelity. Therefore, field time and descriptive work must be dedicated to understanding how these effects play out in a given study.
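
A back-of-the-envelope calculation illustrates why this loss of power is nonlinear. Assuming, as a simplification, that incomplete take-up scales the detectable effect down proportionally, a take-up rate \(p\) shrinks an effect of size \(\tau\) to \(p\tau\); because the required sample size is inversely proportional to the square of the effect size, it grows by a factor of \(1/p^2\):

\[
\frac{N_{\mathrm{required}}(p)}{N_{\mathrm{required}}(1)} \approx \frac{1}{p^{2}}, \qquad \frac{1}{0.7^{2}} \approx 2, \qquad \frac{1}{0.5^{2}} = 4.
\]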

Quasi-experimental research designs, by contrast, use causal inference methods based on events not controlled by the research team (see the DIME Wiki at https://dimewiki.worldbank.org/Quasi-Experimental_Methods). Instead, they rely on “experiments of nature,” in which natural variation can be argued to approximate the type of exogenous variation in treatment availability that a researcher would attempt to create with an experiment (DiNardo (2016)). Unlike carefully planned experimental designs, quasi-experimental designs typically require the extra luck of having access to data collected at the right times and places to exploit events that occurred in the past or having the ability to collect data in a time and place where an event that produces causal identification occurred or will occur. Therefore, these methods often use secondary data, or they use primary data in a cross-sectional retrospective method, including administrative data or other new classes of routinely collected information.

Quasi-experimental designs therefore can sometimes access a much broader range of questions than experimental designs, and much less effort is required to produce the treatment and control groups. However, these designs require in-depth understanding of the precise events the researcher wishes to use in order to know what data to acquire and how to model the corresponding experimental design. Additionally, because the population exposed to such events is limited by the scale of the event, quasi-experimental designs are often power-constrained. Because the research team cannot change the population of the study or the treatment assignment, statistical power is typically maximized by ensuring that sampling for data collection is designed to match the study objectives and that attrition from the sampled groups is minimized.

Obtaining treatment effects from specific research designs

Cross-sectional designs

A cross-sectional research design is any type of study that observes data in only one time period and directly compares treatment and control groups. Such data are easy to collect and handle because it is not necessary to track units across time. If the time period is after a treatment has been fully delivered, then the outcome values at that time already reflect the effect of the treatment. If the study is experimental, the treatment and control groups are randomly constructed from the population eligible to receive each treatment. By construction, each unit’s receipt of the treatment is unrelated to any of its other characteristics, and the ordinary least squares regression of outcome on treatment, without any control variables or adjustments other than for the design (such as clustering and stratification), produces an unbiased estimate of the ATE.

Cross-sectional designs can also exploit variations in nonexperimental data to argue that observed correlations do in fact represent causal effects. This causation can be true unconditionally, which is to say that some random event, such as winning the lottery, is a truly random process and can provide information about the effect of receiving a large amount of money (Imbens, Rubin, and Sacerdote (2001)). It can also be true conditionally, which is to say that, once the characteristics that would affect both the likelihood of exposure to a treatment and the outcome of interest are controlled for, the process is as good as random. For example, a study could argue that, once risk preferences are taken into account, exposure to an earthquake is unpredictable (among people with the same risk preferences), and any excess differences after the event (after accounting for differences caused by risk preferences) are caused by the event itself (Callen (2015)).

For cross-sectional designs, the data must carefully retain the research design variables that describe the treatment assignment process itself (whether experimental or not), as well as detailed information about differences in data quality and attrition across groups (Athey and Imbens (2017a)). Only controls implied by the design are needed to construct the appropriate estimator: clustering of the standard errors is required at the level at which the treatment is assigned to observations, and variables that were used to stratify the treatment must be included as controls in the form of strata fixed effects (Barrios 2014). Randomization inference can be used to estimate the underlying variability in the randomization process. Balance checks—statistical tests of the similarity of treatment and control groups—are often reported as evidence of an effective randomization and are particularly important when the design is quasi-experimental, because then the randomization process cannot be simulated explicitly. However, controls for balance variables are usually superfluous in experimental designs, because it is certain that the treatment and the balance factors are not correlated in the data-generating process (McKenzie 2017).
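
To make this estimator concrete, the following minimal Stata sketch corresponds to a stratified, clustered cross-sectional experiment; the variable names (outcome, treatment, strata, cluster_id) are placeholders for a project’s own design variables.

* The coefficient on treatment estimates the ATE. Strata fixed effects
* reflect the stratified randomization, and standard errors are clustered
* at the level at which treatment was assigned.
regress outcome i.treatment i.strata, vce(cluster cluster_id)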

The analysis of a randomized design is typically straightforward and well understood. A typical analysis will include a description of the sampling and randomization results, with analyses such as summary statistics for the eligible population and balance checks for randomization and sample selection. The main results will usually be primary regression specifications for outcomes of interest, with appropriate adjustments for multiple hypothesis testing (for an example, see Armand et al. (2017)). These will be followed by additional specifications with adjustments for nonresponse, imbalance, and other potential contaminations. Robustness checks might include randomization-inference analysis or other placebo regression approaches. Various user-written code tools are available to help with the complete process of data analysis, including analyzing balance (iebaltab; see the DIME Wiki at https://dimewiki.worldbank.org/iebaltab) and visualizing treatment effects (iegraph; see the DIME Wiki at https://dimewiki.worldbank.org/iegraph). Extensive tools and methods are available for analyzing selective nonresponse (Özler 2017).
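
As a hypothetical example of the balance-checking step, iebaltab compares means of baseline characteristics across assignment groups; the covariate names below are placeholders, and the full set of options for formatting and exporting tables is documented on the DIME Wiki page cited above.

* Balance table for a set of baseline covariates across treatment groups.
iebaltab age hhsize income_baseline, grpvar(treatment)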

Difference-in-differences

Whereas cross-sectional designs draw their estimates of treatment effects from differences in outcome levels in a single measurement, difference-in-differences designs (abbreviated as DD, DiD, diff-in-diff, and other variants; see the DIME Wiki at https://dimewiki.worldbank.org/Difference-in-Differences) estimate treatment effects from changes in outcomes between two or more rounds of measurement. In the simplest form of these designs, three comparison averages are used to compute effect estimates—the baseline level of treatment units, the baseline level of nontreatment units, and the endline level of nontreatment units (Torres-Reyna (2015)). The estimated treatment effect is the excess growth of units that receive the treatment in the period they receive it: calculating that value is equivalent to taking the difference in means between treatment and nontreatment units at endline and subtracting the difference in means at baseline (McKenzie (2012)). The regression model includes an indicator variable for treatment assignment and an indicator variable for time period, and the treatment effect estimate corresponds to the interaction of treatment and time: it indicates the group of observations for which the treatment is active.
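
A minimal two-period, two-group sketch of this regression in Stata might look like the following, where outcome, treatment, post, and unit_id are placeholder names for a project’s own variables; the ieddtab command discussed below produces standardized tables for the same model.

* Two-by-two difference-in-differences. The coefficient on the interaction
* term 1.treatment#1.post is the estimated treatment effect; standard
* errors are clustered on the unit of treatment assignment.
regress outcome i.treatment##i.post, vce(cluster unit_id)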

This “two-way fixed effects” design depends on the assumption that, in the absence of the treatment, the outcome of the two groups would have changed at the same rate over time, typically referred to as the parallel trends assumption (Friedman 2013). Experimental approaches satisfy this requirement in expectation, but a given randomization should still be checked for pretrends as an extension of balance checking (McKenzie 2020). More complex designs with multiple treatment groups or multiple time periods require correspondingly adjusted models (Baker, Larcker, and Wang 2021).

There are two main types of data structures for difference-in-differences: repeated cross-sectional and panel data. In repeated cross-sectional designs, each successive round of data collection contains a random sample of observations from the treated and untreated groups; as in cross-sectional designs, both the randomization and sampling processes are critically important to maintain alongside the data. Panel data structures are used to observe the exact same units at different times, so that the same units can be analyzed both before and after they have (or have not) received treatment (Jakiela 2019). This structure allows each unit’s baseline outcome (the outcome before the intervention) to be used as an additional control for its endline outcome, which can provide increases in power and robustness (McKenzie 2015). When tracking individuals over time for this purpose, maintaining sampling and tracking records is especially important because attrition will remove a unit’s information from all time periods, not just the one in which it is unobserved. Panel-style experiments therefore require more effort in fieldwork for studies using original data (Torres-Reyna (2007)). Because the baseline and endline may be far apart in time, creating careful records during the first round makes it possible to follow up with the same subjects and to account properly for attrition across rounds (Özler 2017).
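
For panel structures, the baseline outcome can enter directly as a control, as in the following sketch of an ANCOVA-style specification; it assumes the data have been reshaped so that each row is one unit, with placeholder names outcome_base and outcome_end for the baseline and endline measurements.

* Controlling for the baseline outcome, rather than differencing it out,
* can increase power and robustness as noted above (McKenzie 2015).
regress outcome_end i.treatment outcome_base i.strata, vce(cluster cluster_id)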

As with cross-sectional designs, difference-in-differences designs are widespread. Therefore, many standardized tools are available for analysis. DIME’s ietoolkit Stata package includes the ieddtab command, which produces standardized tables for reporting results (see https://dimewiki.worldbank.org/ieddtab). For more complicated versions of the model (and they can get quite complicated quite quickly), an online dashboard can be used to simulate counterfactual results (Kondylis and Loeser 2019a). As in cross-sectional designs, these main specifications will always be accompanied by balance checks (using baseline values) as well as by randomization, selection, and attrition analysis. In trials of this type, reporting experimental design and execution using the CONSORT style is common in many disciplines and is useful for tracking data over time (Schulz, Altman, and Moher (2010)).

Regression discontinuity

Regression discontinuity (RD) designs exploit sharp breaks or limits in policy designs to separate a single group of potentially eligible recipients into comparable groups of individuals who do and do not receive a treatment (see the DIME Wiki at https://dimewiki.worldbank.org/Regression_Discontinuity). These designs differ from cross-sectional and difference-in-differences designs in that the group eligible to receive treatment is not defined directly but instead is created during the treatment implementation. In an RD design, there is typically some program or event that has limited availability because of practical considerations or policy choices and is therefore made available only to individuals who meet a certain threshold requirement.

The intuition of this design is that an underlying running variable serves as the sole determinant of access to the program, and a strict cutoff determines the value of this variable at which eligibility stops (Imbens and Lemieux (2008)). Common examples are test score thresholds and income thresholds (Evans 2013). The idea is that individuals who are just above the threshold are very nearly indistinguishable from those who are just below it, and their outcomes after treatment are therefore directly comparable (Lee and Lemieux (2010)). The key assumption here is that the running variable cannot be manipulated directly by the potential recipients. If the running variable is time (what is commonly called an “event study”), there are special considerations (Hausman and Rapson (2018)). Similarly, spatial discontinuity designs are handled differently because of their multidimensionality (Kondylis and Loeser 2019b).

RD designs are, once implemented, similar in analysis to cross-sectional or difference-in-differences designs. Depending on the available data, the analytical approach will compare individuals who are narrowly on the inclusion side of the discontinuity with those who are narrowly on the exclusion side (Cattaneo, Idrobo, and Titiunik (2019)). The regression model will be identical to that of the corresponding research design, depending on whether the data have one or more time periods and whether the same units are known to be observed repeatedly.

The treatment effect will be identified by the addition of a control for the running variable—meaning that the treatment effect estimate will be automatically valid only for a subset of observations in a window around the cutoff. In many cases, the treatment effects estimated will be “local” rather than “average” when they cannot be assumed to hold for the entire sample. In the RD model, the functional form of the running variable control and the size of that window, often referred to as the choice of bandwidth for the design, are the critical parameters for the result (Calonico et al. (2019)). Therefore, RD analysis often includes extensive robustness checks using a variety of both functional forms and bandwidths as well as placebo tests for nonrealized locations of the cutoff.
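
One common way to write this local specification, shown here only as a sketch, is

\[
Y_i = \alpha + \tau\,\mathbf{1}\{X_i \geq c\} + f(X_i - c) + \varepsilon_i, \qquad \text{estimated for } |X_i - c| \leq h,
\]

where \(X_i\) is the running variable, \(c\) is the cutoff, \(f(\cdot)\) is the chosen functional form (often separate linear slopes on each side of the cutoff), \(h\) is the bandwidth, and \(\tau\) is the local treatment effect at the cutoff. The robustness checks described above amount to varying \(f(\cdot)\), \(h\), and the placebo location of \(c\).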

In the analytical stage, RD designs often include a substantial component of visual evidence. These visual presentations help to suggest both the functional form of the underlying relationship and the type of change observed at the discontinuity; they also help to avoid pitfalls in modeling that are difficult to detect with parameterized hypothesis tests (Pischke (2018)). Because these designs are more flexible than others, an extensive set of commands helps to assess the efficacy and results from these designs under various assumptions (Calonico, Cattaneo, and Titiunik (2014)). These packages support the testing and reporting of robust plotting and estimation procedures, tests for manipulation of the running variable, and tests for power, sample size, and randomization inference approaches that will complement the main regression approach used for point estimates.
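
The sketch below illustrates this workflow using the rdrobust family of user-written commands associated with the references above, together with the companion rddensity command for manipulation testing; outcome and score are placeholder variable names, and the cutoff and bandwidth values shown are arbitrary.

* Visual evidence of the relationship on each side of the cutoff.
rdplot outcome score, c(0)

* Local polynomial estimate with data-driven bandwidth selection, followed
* by a robustness check using a manually chosen bandwidth.
rdrobust outcome score, c(0)
rdrobust outcome score, c(0) h(5)

* Test for manipulation of the running variable around the cutoff.
rddensity score, c(0)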

Instrumental variables

Instrumental variables (IV) designs, unlike the previous approaches, begin by assuming that the treatment delivered in the study in question is linked to the outcome in a pattern such that its effect is not directly identifiable. Instead, similar to RD designs, IV designs attempt to focus on a subset of the variation in treatment take-up and assess a limited window of variation that can be argued to be unrelated to other factors (Angrist and Krueger (2001)). To do so, the IV approach selects an instrument for the treatment status—an otherwise-unrelated predictor of exposure to treatment that affects the take-up status of an individual (see the DIME Wiki at https://dimewiki.worldbank.org/Instrumental_Variables). Whereas RD designs are “sharp”—treatment status is strictly determined by which side of a cutoff an individual is on—IV designs are “fuzzy,” meaning that the values of the instrument(s) do not strictly determine the treatment status but instead influence the probability of treatment.

As in RD designs, the fundamental form of the regression is similar to either cross-sectional or difference-in-differences designs. However, instead of controlling for the instrument directly, the IV approach typically uses the two-stage-least-squares estimator (Bond (2020)). This estimator first forms a prediction of the probability that each unit receives treatment using a regression of treatment status against the instrumental variable(s). That prediction will, by assumption, be the portion of the actual treatment that is due to the instrument and not to any other source; because the instrument is unrelated to all other factors, this portion of the treatment variation can be used to estimate relevant effect sizes.
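
As a sketch of the two stages, with \(D_i\) the treatment, \(Z_i\) the instrument, and \(X_i\) any design controls:

\[
\text{First stage:} \quad D_i = \pi_0 + \pi_1 Z_i + X_i'\gamma + v_i,
\]
\[
\text{Second stage:} \quad Y_i = \beta_0 + \tau \hat{D}_i + X_i'\delta + \varepsilon_i,
\]

where \(\hat{D}_i\) is the fitted value from the first stage and \(\tau\) is the IV estimate of the treatment effect, interpretable (under the usual additional assumptions) as a local average treatment effect for units whose treatment status is moved by the instrument.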

IV estimators are known to have very high variances relative to other methods, particularly when the relationship between the instrument and the treatment is weak (Andrews, Stock, and Sun 2019). IV designs furthermore rely on strong but untestable assumptions about the relationship between the instrument and the outcome (Bound, Jaeger, and Baker (1995)). Therefore, IV designs face scrutiny on the strength and exogeneity of the instrument, and tests for sensitivity to alternative specifications and samples are usually required. However, the method has special experimental cases that are significantly easier to assess: for example, a randomized treatment assignment can be used as an instrument for the eventual take-up of the treatment itself (for an example, see Iacovone, Maloney, and Mckenzie (2019)), especially in cases when take-up is expected to be low or in circumstances when the treatment is available to those who are not specifically assigned to it (“encouragement designs”).

In practice, various packages can be used to analyze data and report results from IV designs. Although the built-in Stata command ivregress is often used to create the final results, the built-in packages are not sufficient on their own. The first stage of the design should be tested extensively to demonstrate the strength of the relationship between the instrument and the treatment variable being instrumented (Stock and Yogo (2005)). This testing can be done using the weakiv and weakivtest commands (Pflueger and Wang (2015)). Additionally, tests should be run that identify and exclude individual observations or clusters that have extreme effects on the estimator, using customized bootstrap or leave-one-out approaches (Young 2019). Finally, bounds can be constructed allowing for imperfections in the exogeneity of the instrument using loosened assumptions, particularly when the underlying instrument is not directly randomized (Clarke and Matta (2018)).
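
A minimal sketch of this workflow with the built-in commands is shown below, using placeholder names for the outcome, treatment, instrument, controls, and cluster variable.

* Two-stage least squares with one endogenous treatment and one instrument.
ivregress 2sls outcome x1 x2 (treatment = instrument), vce(cluster cluster_id)

* Built-in first-stage diagnostics; the weakiv and weakivtest commands
* cited above provide additional weak-instrument tests.
estat firststage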

Matching

Matching methods use observable characteristics of individuals to construct treatment and control groups that are as similar as possible to each other, either before a randomization process or after the collection of nonrandomized data (see the DIME Wiki at https://dimewiki.worldbank.org/Matching). Matching groups of observations within a data set may result in one-to-one matches or the creation of mutually matched groups; the result of a matching process is similar in concept to the use of randomization strata. In this way, the method can be conceptualized as averaging across the results of a large number of “micro-experiments” in which the units in each potential treatment group are verifiably similar except for their treatment status.

When matching is performed before a randomization process, it can be done on any observable characteristics, including baseline outcomes, if they are available. The randomization should record an identifier for each matched set, because these sets become equivalent to randomization strata and require controls in analysis. This approach dramatically reduces the number of potential randomizations relative to an unmatched design and therefore reduces the variance that the randomization itself introduces into the estimates.

When matching is done ex post in order to substitute for randomization, it is based on the assertion that, within the matched groups, the assignment of treatment is as good as random. However, because many matching models rely on a specific linear model, such as propensity score matching (PSM), they are open to the criticism of “specification searching,” meaning that researchers can try different models of matching until one, by chance, leads to the desired result. Analytical approaches have shown that the better the fit of the matching model, the more likely it is to have arisen by chance and therefore to be biased (King and Nielsen (2019)). Newer methods, such as coarsened exact matching (Iacus, King, and Porro (2012)), are designed to remove some of the dependence on functional form. In all ex post cases, prespecification of the exact matching model can prevent some of the potential criticisms on this front, but ex post matching in general is not regarded as a strong identification strategy.

Analysis of data from matching designs is relatively straightforward; the simplest design only requires using controls (indicator variables) for each group or, in the case of propensity scoring and similar approaches, weighting the data appropriately in order to balance the analytical samples on the selected variables. The teffects suite in Stata provides a wide variety of estimators and analytical tools for various designs (Cooperative (2015)). The coarsened exact matching (cem) package applies the nonparametric approach (Blackwell et al. (2009)). The iematch command in the ietoolkit package produces matchings based on a single continuous matching variable (see the DIME Wiki at https://dimewiki.worldbank.org/iematch). In any of these cases, detailed reporting of the matching model is required, including the resulting effective weights of observations, because in some cases the lack of overlapping supports for treatment and control means that a large number of observations will be weighted near zero and the estimated effect will be generated using a subset of the data.
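
As a hypothetical sketch of the propensity score approach using the built-in teffects suite, with placeholder names for the outcome, treatment, and matching covariates:

* Propensity score matching estimate of the average treatment effect on
* the treated. The reporting described above should include the implied
* weights and the overlap of propensity scores between groups.
teffects psmatch (outcome) (treatment age hhsize income_baseline), atet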

Synthetic control

Synthetic control is a relatively new method for the case when appropriate counterfactual units do not exist for a treatment of interest, and there are often very few treated units (or only one) (Abadie, Diamond, and Hainmueller (2015)). For example, finding valid comparators for state- or national-level policy changes that can be analyzed only as a single unit is typically very difficult because the set of potential comparators is usually small and diverse with no close matches for the treated unit. Intuitively, the synthetic control method works by constructing a counterfactual version of the treated unit using a weighted average of the other units available (Abadie, Diamond, and Hainmueller (2010)). This approach is particularly effective when the lower-level components of the units would be directly comparable: people, households, and businesses in the case of states and countries, or passengers and cargo shipments in the case of transport corridors, for example (Gobillon and Magnac (2016)). In those situations, the weighted average of the untreated units can be thought of as a balanced comparator because it is constructed to match the composition of the treated unit.

To construct this estimator, the synthetic control method requires retrospective data on the treatment unit and possible comparators, including historical data on the outcome of interest for all units (for an example, see Fernandes, Hillberry, and Berg (2016)). The counterfactual blend is chosen by optimizing the prediction of past outcomes on the basis of potential input characteristics and typically selects a small set of comparators to weight into the final analysis. These data sets therefore may not have a large number of variables or observations, but the length of the time series both before and after implementation of the treatment is a key source of power for the estimate, as is the number of counterfactual units available. Visualizations are often excellent demonstrations of these results. The synth package provides functionality for use in Stata and R; however, because the number of possible parameters and implementations of the design is large, the package can be complex to operate (Abadie, Diamond, and Hainmueller (2014)).
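
A minimal sketch of the synth command for Stata is shown below; it assumes a panel identified by unit and year variables, an outcome y, predictor variables x1 and x2, and placeholder values for the treated unit and treatment period.

* The data must be declared as a panel before running synth.
tsset unit year

* Construct the synthetic control: donor weights are chosen to match the
* treated unit on the predictors and on pre-treatment outcomes. The treated
* unit (trunit) and treatment period (trperiod) shown here are placeholders.
synth y x1 x2 y(2000) y(2005), trunit(17) trperiod(2010) fig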