# C Research design for impact evaluation

*Development Research in Practice* focuses on tools, workflows, and practical
guidance for implementing research projects.
While not central to the content of this book,
we think it is essential for all research team members
– including field staff and research assistants –
to understand research design,
and specifically how research design choices impact data work.
Without going into too much technical detail,
as there are many excellent resources on impact evaluation design,
this appendix presents a brief overview
of the most common causal inference methods,
focusing on implications for data structure and analysis.
This appendix is intended to be a reference,
especially for junior team members,
to obtain an understanding of the way in which each
causal inference method constructs treatment and control groups,
the data structures needed to estimate the corresponding effects,
and specific code tools designed for each method.

It is essential for the research team members who will do the data work to understand the study design, for several reasons. If you do not know how to calculate the correct estimator for your study, you will not be able to assess the statistical power of your research design. You will also be unable to make decisions in the field when you inevitably have to allocate scarce resources between tasks like maximizing sample size and ensuring follow-up with specific individuals. You will save time by understanding the way your data needs to be organized to produce meaningful analytics throughout your projects. Just as importantly, familiarity with each of these approaches will allow you to keep your eyes open for research opportunities: many of the most interesting projects occur because people in the field recognize the opportunity to implement one of these methods in response to an unexpected event.

This appendix is split into two sections. The first covers causal inference methods in experimental and quasi-experimental research designs. The second discusses how to measure treatment effects and structure data for specific methods, including cross-sectional randomized control trials, difference-in-differences designs, regression discontinuity, instrumental variables, matching, and synthetic controls.

## Understanding causality, inference, and identification

When we are discussing the types of inputs – “treatments” – commonly referred to as
“programs” or “interventions”, we are typically attempting to obtain estimates
of program-specific **treatment effects**.
These are the changes in outcomes attributable to the treatment.^{304}
The primary goal of research design is to establish **causal identification** for an effect.
Causal identification means establishing that a change in an input directly altered an outcome.
When a study is well-identified, then we can say with confidence
that our estimate of the treatment effect would,
with an infinite amount of data,
give us a precise estimate of that treatment effect.
Under this condition, we can proceed to draw evidence from the limited samples we have access to,
using statistical techniques to express the uncertainty of not having infinite data.
Without identification, we cannot say that the estimate would be accurate,
even with unlimited data, and therefore cannot attribute it to the treatment
in the small samples that we typically have access to.
More data is not a substitute for a well-identified experimental design.
Therefore it is important to understand how exactly your study
identifies its estimate of treatment effects,
so you can calculate and interpret those estimates appropriately.

All the study designs we discuss here use the potential outcomes framework^{305}
to compare a group that received some treatment to another, counterfactual group.
Each of these approaches can be used in two types of designs:
**experimental** designs, in which the research team
is directly responsible for creating the variation in treatment,
and **quasi-experimental** designs, in which the team
identifies a “natural” source of variation and uses it for identification.
Neither type is inherently better or worse,
and both types are capable of achieving causal identification in different contexts.

### Estimating treatment effects using control groups

The key assumption behind estimating treatment effects is that every
person, facility, or village (or whatever the unit of intervention is)
has two possible states: their outcomes if they do not receive some treatment
and their outcomes if they do receive that treatment.
Each unit’s treatment effect is the individual difference between these two states,
and the **average treatment effect (ATE)** is the average of all
individual differences across the potentially treated population.
This is the parameter that most research designs attempt to estimate,
by establishing a **counterfactual**^{306}
for the treatment group against which outcomes can be directly compared.
There are several resources that provide more or less mathematically intensive
approaches to understanding how various methods do this.
*Impact Evaluation in Practice*^{307}
is a strong general guide to these methods.
*Causal Inference*^{308} and
*Causal Inference: The Mixtape*^{309}
provide more detailed mathematical approaches to the tools.
*Mostly Harmless Econometrics*^{310}
and *Mastering Metrics*^{311}
are excellent resources on the statistical principles behind all econometric approaches.

Intuitively, the problem is as follows: we can never observe the same unit
in both their treated and untreated states simultaneously,
so measuring and averaging these effects directly is impossible.^{312}
Instead, we typically make inferences from samples.
**Causal inference** methods are those in which we are able to estimate the
average treatment effect without observing individual-level effects,
but through some comparison of averages with a **control** group.
Every research design is based on a way of comparing another set of observations –
the “control” observations – against the treatment group.
They all work to establish that the control observations would have been
identical *on average* to the treated group in the absence of the treatment.
Then, the mathematical properties of averages imply that the calculated
difference in averages is equivalent to the average difference:
exactly the parameter we are seeking to estimate.
Therefore, almost all designs can be accurately described
as a series of between-group comparisons.
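
The claim that a difference in averages recovers the average difference can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the population size, the outcome distributions, and all variable names are hypothetical. It generates both potential outcomes for every unit (something we never observe in real data), randomizes treatment, and compares the true average treatment effect with the difference in observed group means.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: every unit has two potential outcomes.
n = 10_000
y0 = [random.gauss(10, 2) for _ in range(n)]        # outcome if untreated
effect = [random.gauss(1.5, 1) for _ in range(n)]   # unit-level treatment effect
y1 = [a + e for a, e in zip(y0, effect)]            # outcome if treated

# True ATE: the average of individual differences (unobservable in practice).
true_ate = statistics.mean(b - a for a, b in zip(y0, y1))

# Randomly assign half of the units to treatment.
# Only one potential outcome is observed for each unit.
treated = set(random.sample(range(n), n // 2))
obs_treatment = [y1[i] for i in range(n) if i in treated]
obs_control = [y0[i] for i in range(n) if i not in treated]

# Difference in averages between the two observed groups.
diff_in_means = statistics.mean(obs_treatment) - statistics.mean(obs_control)

print(f"true ATE: {true_ate:.2f}, difference in means: {diff_in_means:.2f}")
```

The two numbers differ only by sampling noise, because randomization makes the control group identical on average to the treated group in the absence of treatment.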

Most of the methods that you will encounter rely on some variant of this strategy, which is designed to maximize their ability to estimate the effect of an average unit being offered the treatment you want to evaluate. The focus on identification of the treatment effect, however, means there are several essential features of causal identification methods that are not common in other types of statistical and data science work. First, the econometric models and estimating equations used do not attempt to create a predictive or comprehensive model of how the outcome of interest is generated. Typically, causal inference designs are not interested in predictive accuracy, and the estimates and predictions that they produce will not be as good at predicting outcomes or fitting the data as other models. Second, when control variables or other variables are used in estimation, there is no guarantee that the resulting parameters are marginal effects. They can only be interpreted as correlative averages, unless there are additional sources of identification. The models you will construct and estimate are intended to do exactly one thing: to express the intention of your project’s research design, and to accurately estimate the effect of the treatment it is evaluating. In other words, these models tell the story of the research design in a way that clarifies the exact comparison being made between control and treatment.

### Designing experimental and quasi-experimental research

Experimental research designs explicitly allow the research team
to change the condition of the populations being studied,^{313}
often in the form of government programs, NGO projects, new regulations,
information campaigns, and many more types of interventions.^{314}
The classic experimental causal inference method
is the **randomized control trial (RCT)**.^{315}
In randomized control trials, the treatment group is randomized –
that is, from an eligible population,
a random group of units are given the treatment.
Another way to think about these designs is how they establish the control group:
a random subset of units are *not* given access to the treatment,
so that they may serve as a counterfactual for those who are.
A randomized control group, intuitively, is meant to represent
how things would have turned out for the treated group
if they had not been treated, and it is particularly effective at doing so
as evidenced by its broad credibility in fields ranging from clinical medicine to development.
Therefore RCTs are very popular tools for determining the causal impact
of specific programs or policy interventions,
as evidenced by the awarding of the 2019 Nobel Prize in Economics
to Abhijit Banerjee, Esther Duflo and Michael Kremer
“for their experimental approach to alleviating global poverty.”^{316}
However, there are many other types of interventions that are impractical or unethical
to effectively approach using an experimental strategy,
and therefore there are limitations to accessing “big questions”
through RCT approaches.^{317}

Randomized designs all share several major statistical concerns.
The first is the fact that it is always possible to select a control group,
by chance, which is not in fact very similar to the treatment group.
This feature is called randomization noise, and all RCTs share the need to assess
how randomization noise may impact the estimates that are obtained.
(More detail on this later.)
Second, take-up and implementation fidelity are extremely important,
since programs will by definition have no effect
if the population intended to be treated
does not accept or does not receive the treatment.^{318}
Loss of statistical power occurs quickly and is highly nonlinear:
70% take-up or efficacy doubles the required sample, and 50% quadruples it.^{319}
Such effects are also very hard to correct ex post,
since they require strong assumptions about the randomness or non-randomness of take-up.
Therefore a large amount of field time and descriptive work
must be dedicated to understanding how these effects played out in a given study,
and may overshadow the effort put into the econometric design itself.
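
The take-up arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes the measured effect shrinks in proportion to the take-up rate and that the sample needed for a given level of power is inversely proportional to the square of the effect size; the function name `sample_inflation` is our own and not part of any package.

```python
def sample_inflation(take_up: float) -> float:
    """Factor by which the required sample grows when only a share
    `take_up` of the group assigned to treatment is effectively treated.

    Assumes the measured effect is diluted to take_up * effect and that
    the required sample is proportional to 1 / effect**2.
    """
    if not 0 < take_up <= 1:
        raise ValueError("take_up must be in (0, 1]")
    return 1 / take_up ** 2


for rate in (1.0, 0.7, 0.5):
    print(f"take-up {rate:.0%}: sample must be {sample_inflation(rate):.2f}x larger")
```

Under these assumptions, 70% take-up requires roughly twice the sample and 50% take-up requires four times the sample, matching the figures quoted above.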

**Quasi-experimental** research designs,^{320}
by contrast, are causal inference methods based on events not controlled by the research team.
Instead, they rely on “experiments of nature”,
in which natural variation can be argued to approximate
the type of exogenous variation in treatment availability
that a researcher would attempt to create with an experiment.^{321}
Unlike carefully planned experimental designs,
quasi-experimental designs typically require the extra luck
of having access to data collected at the right times and places
to exploit events that occurred in the past,
or having the ability to collect data in a time and place
where an event that produces causal identification occurred or will occur.
Therefore, these methods often use either secondary data,
or they use primary data in a cross-sectional retrospective method,
including administrative data or other new classes of routinely-collected information.

Quasi-experimental designs therefore can access a much broader range of questions, and with much less effort in terms of executing an intervention. However, they require in-depth understanding of the precise events the researcher wishes to address in order to know what data to use and how to model the underlying natural experiment. Additionally, because the population exposed to such events is limited by the scale of the event, quasi-experimental designs are often power-constrained. Since the research team cannot change the population of the study or the treatment assignment, power is typically maximized by ensuring that sampling for data collection is carefully designed to match the study objectives and that attrition from the sampled groups is minimized.

## Obtaining treatment effects from specific research designs

### Cross-sectional designs

A cross-sectional research design is any type of study that observes data in only one time period and directly compares treatment and control groups. This type of data is easy to collect and handle because you do not need to track individuals across time. If this point in time is after a treatment has been fully delivered, then the outcome values at that point in time already reflect the effect of the treatment. If the study is experimental, the treatment and control groups are randomly constructed from the population that is eligible to receive each treatment. By construction, each unit’s receipt of the treatment is unrelated to any of its other characteristics and the ordinary least squares (OLS) regression of outcome on treatment, without any control variables, is an unbiased estimate of the average treatment effect.

Cross-sectional designs can also exploit variation in non-experimental data
to argue that observed correlations do in fact represent causal effects.
This can be true unconditionally – which is to say that something random,
such as winning the lottery, is a true random process and can tell you about the effect
of getting a large amount of money.^{322}
It can also be true conditionally – which is to say that once the
characteristics that would affect both the likelihood of exposure to a treatment
and the outcome of interest are controlled for,
the process is as good as random:
like arguing that once risk preferences are taken into account,
exposure to an earthquake is unpredictable and post-event differences
are causally related to the event itself.^{323}
For cross-sectional designs, what needs to be carefully maintained in data
is the treatment randomization process itself (whether experimental or not),
as well as detailed information about differences
in data quality and attrition across groups.^{324}
Only these details are needed to construct the appropriate estimator:
clustering of the standard errors is required at the level
at which the treatment is assigned to observations,
and variables which were used to stratify the treatment
must be included as controls (in the form of strata fixed effects).^{325}
**Randomization inference** can be used
to estimate the underlying variability in the randomization process.
**Balance checks**^{326}
are often reported as evidence of an effective randomization,
and are particularly important when the design is quasi-experimental
(since then the randomization process cannot be simulated explicitly).
However, controls for balance variables are usually unnecessary in RCTs,
because it is certain that the true data-generating process
has no correlation between the treatment and the balance factors.^{327}
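
To make the idea of randomization inference concrete, the following sketch re-runs a simple randomization many times while holding the observed outcomes fixed, and asks how often a placebo assignment produces an estimate as large as the real one. The data are simulated and every value is hypothetical. It assumes a simple unstratified, unclustered assignment; a real analysis would re-run the exact procedure used in the study, including any strata and clusters.

```python
import random
import statistics

random.seed(2)

# Hypothetical study: 40 units, 20 of them randomly assigned to treatment.
assignment = [1] * 20 + [0] * 20
random.shuffle(assignment)
outcome = [random.gauss(0, 1) + 1.0 * t for t in assignment]


def diff_in_means(y, d):
    """Difference in mean outcome between treated (d == 1) and control units."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return statistics.mean(treated) - statistics.mean(control)


observed = diff_in_means(outcome, assignment)

# Re-draw the assignment many times under the sharp null of no effect
# for any unit, keeping the observed outcomes unchanged.
draws = 2000
placebo = []
for _ in range(draws):
    fake = assignment[:]
    random.shuffle(fake)
    placebo.append(diff_in_means(outcome, fake))

# Share of placebo estimates at least as extreme as the observed estimate.
p_value = sum(abs(b) >= abs(observed) for b in placebo) / draws
print(f"observed difference: {observed:.2f}, randomization p-value: {p_value:.3f}")
```

The spread of the `placebo` estimates is a direct picture of randomization noise for this particular design.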

Analysis is typically straightforward *once you have a strong understanding of the randomization*.
A typical analysis will include a description of the sampling and randomization results,
with analyses such as summary statistics for the eligible population,
and balance checks for randomization and sample selection.
The main results will usually be a primary regression specification
(with multiple hypotheses^{328}
appropriately adjusted for),
and additional specifications with adjustments for non-response, balance, and other potential contamination.
Robustness checks might include randomization-inference analysis or other placebo regression approaches.
There are a number of user-written code tools that are also available
to help with the complete process of data analysis,
including to analyze balance^{329}
and to visualize treatment effects.^{330}
Extensive tools and methods for analyzing selective non-response are available.^{331}

### Difference-in-differences

Where cross-sectional designs draw their estimates of treatment effects
from differences in outcome levels in a single measurement,
**differences-in-differences**^{332}
designs (abbreviated as DD, DiD, diff-in-diff, and other variants)
estimate treatment effects from *changes* in outcomes
between two or more rounds of measurement.
In these designs, three control groups are used –
the baseline level of treatment units,
the baseline level of non-treatment units,
and the endline level of non-treatment units.^{333}
The estimated treatment effect is the excess growth
of units that receive the treatment, in the period they receive it:
calculating that value is equivalent to taking
the difference in means at endline and subtracting
the difference in means at baseline
(hence the singular “difference-in-differences”).^{334}
The regression model includes a control variable for treatment assignment,
and a control variable for time period,
but the treatment effect estimate corresponds to
an interaction variable for treatment and time:
it indicates the group of observations for which the treatment is active.
This model depends on the assumption that,
in the absence of the treatment,
the outcome of the two groups would have changed at the same rate over time,
typically referred to as the **parallel trends** assumption.^{335}
Experimental approaches satisfy this requirement in expectation,
but a given randomization should still be checked for pre-trends
as an extension of balance checking.^{336}
There are two main types of data structures for differences-in-differences:
**repeated cross-sections** and **panel data**.
In repeated cross-sections, each successive round of data collection contains a random sample
of observations from the treated and untreated groups;
as in cross-sectional designs, both the randomization and sampling processes
are critically important to maintain alongside the data.
In panel data structures, we attempt to observe the exact same units
at different points in time, so that we see the same individuals
both before and after they have received treatment (or not).^{337}
This allows each unit’s baseline outcome (the outcome before the intervention) to be used
as an additional control for its endline outcome,
which can provide large increases in power and robustness.^{338}
When tracking individuals over time for this purpose,
maintaining sampling and tracking records is especially important,
because attrition will remove that unit’s information
from all points in time, not just the one they are unobserved in.
Panel-style experiments therefore require a lot more effort in field work
for studies that use original data.^{339}
Since baseline and endline may be far apart in time,
it is important to create careful records during the first round
so that follow-ups can be conducted with the same subjects,
and attrition across rounds can be properly taken into account.^{340}
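
The basic two-group, two-period calculation can be written out directly from four group means. The numbers below are hypothetical and chosen only to keep the arithmetic easy to follow. The sketch shows that subtracting the baseline difference from the endline difference gives the same value as the excess growth of the treatment group, and how the four means map to the coefficients of the regression with a treatment indicator, a time indicator, and their interaction.

```python
# Hypothetical mean outcomes by group and survey round.
means = {
    ("treatment", "baseline"): 10.0,
    ("treatment", "endline"): 15.0,
    ("control", "baseline"): 9.0,
    ("control", "endline"): 12.0,
}

# Difference in means at endline minus difference in means at baseline.
diff_endline = means[("treatment", "endline")] - means[("control", "endline")]
diff_baseline = means[("treatment", "baseline")] - means[("control", "baseline")]
did = diff_endline - diff_baseline

# Equivalent view: excess growth of the treatment group over the control group.
growth_treatment = means[("treatment", "endline")] - means[("treatment", "baseline")]
growth_control = means[("control", "endline")] - means[("control", "baseline")]
excess_growth = growth_treatment - growth_control

# With only these four cells, the regression coefficients are:
coefficients = {
    "constant": means[("control", "baseline")],
    "treatment": diff_baseline,       # control variable for treatment assignment
    "post": growth_control,           # control variable for time period
    "treatment_x_post": did,          # the treatment effect estimate
}

print(did, excess_growth, coefficients)
```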

As with cross-sectional designs, difference-in-differences designs are widespread.
Therefore there exist a large number of standardized tools for analysis.
Our `ietoolkit` Stata package includes the `ieddtab` command
which produces standardized tables for reporting results.^{341}
For more complicated versions of the model
(and they can get quite complicated quite quickly),
you can use an online dashboard to simulate counterfactual results.^{342}
As in cross-sectional designs, these main specifications
will always be accompanied by balance checks (using baseline values),
as well as randomization, selection, and attrition analysis.
In trials of this type, reporting experimental design and execution
using the CONSORT style is common in many disciplines
and will help you to track your data over time.^{343}

### Regression discontinuity

**Regression discontinuity (RD)** designs exploit sharp breaks or limits
in policy designs to separate a single group of potentially eligible recipients
into comparable groups of individuals who do and do not receive a treatment.^{344}
These designs differ from cross-sectional and diff-in-diff designs
in that the group eligible to receive treatment is not defined directly,
but instead created during the treatment implementation.
In an RD design, there is typically some program or event
that has limited availability due to practical considerations or policy choices
and is therefore made available only to individuals who meet a certain threshold requirement.
The intuition of this design is that there is an underlying **running variable**
that serves as the sole determinant of access to the program,
and a strict cutoff determines the value of this variable at which eligibility stops.^{345}
Common examples are test score thresholds and income thresholds.^{346}
The intuition is that individuals who are just above the threshold
will be very nearly indistinguishable from those who are just under it,
and their post-treatment outcomes are therefore directly comparable.^{347}
The key assumption here is that the running variable cannot be directly manipulated
by the potential recipients.
If the running variable is time (what is commonly called an “event study”),
there are special considerations.^{348}
Similarly, spatial discontinuity designs are handled a bit differently due to their multidimensionality.^{349}

Regression discontinuity designs are, once implemented,
very similar in analysis to cross-sectional or difference-in-differences designs.
Depending on the data that is available,
the analytical approach will center on the comparison of individuals
who are narrowly on the inclusion side of the discontinuity,
compared against those who are narrowly on the exclusion side.^{350}
The regression model will be identical to that of the corresponding research design,
i.e., contingent on whether data has one or more time periods
and whether the same units are known to be observed repeatedly.
The treatment effect will be identified, however, by the addition of a control
for the running variable – meaning that the treatment effect estimate
will only be applicable for observations in a small window around the cutoff:
in the lingo, the treatment effects estimated will be “local” rather than “average”.
In the RD model, the functional form of the running variable control and the size of that window,
often referred to as the choice of **bandwidth** for the design,
are the critical parameters for the result.^{351}
Therefore, RD analysis often includes extensive robustness checking
using a variety of both functional forms and bandwidths,
as well as placebo testing for non-realized locations of the cutoff.
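
The role of the running variable control and the bandwidth can be seen in a small simulation. This is a simplified sketch, not the procedure implemented by any particular package: it assumes a sharp cutoff, a linear functional form on each side, a hand-picked bandwidth, and simulated data. The names `CUTOFF`, `BANDWIDTH`, and `fit_line` are hypothetical.

```python
import random

random.seed(3)

CUTOFF = 50.0      # hypothetical eligibility threshold on the running variable
BANDWIDTH = 10.0   # hypothetical window; real analyses test several bandwidths
TRUE_JUMP = 4.0    # effect built into the simulated data

# Simulated data: units scoring at or above the cutoff receive the treatment.
data = []
for _ in range(4000):
    score = random.uniform(0, 100)
    treated = score >= CUTOFF
    y = 20 + 0.3 * score + TRUE_JUMP * treated + random.gauss(0, 1)
    data.append((score, y))


def fit_line(points):
    """OLS of y on (score - CUTOFF); returns (intercept at the cutoff, slope)."""
    xs = [x - CUTOFF for x, _ in points]
    ys = [y for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Keep only observations within the bandwidth, then fit each side separately.
window = [(x, y) for x, y in data if abs(x - CUTOFF) <= BANDWIDTH]
below = [(x, y) for x, y in window if x < CUTOFF]
above = [(x, y) for x, y in window if x >= CUTOFF]

# The local treatment effect is the gap between the two fitted lines at the cutoff.
jump = fit_line(above)[0] - fit_line(below)[0]
print(f"estimated jump at the cutoff: {jump:.2f}")
```

Re-running the last block with different values of `BANDWIDTH`, or with a placebo `CUTOFF`, is the simplest form of the robustness checks described above.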

In the analytical stage, regression discontinuity designs
often include a large component of visual evidence presentation.
These presentations help to suggest both the functional form
of the underlying relationship and the type of change observed at the discontinuity,
and help to avoid pitfalls in modeling that are difficult to detect with hypothesis tests.^{352}
Because these designs are so flexible compared to others,
there is an extensive set of commands that help assess
the efficacy and results from these designs under various assumptions.^{353}
These packages support the testing and reporting
of robust plotting and estimation procedures,
tests for manipulation of the running variable,
and tests for power, sample size, and randomization inference approaches
that will complement the main regression approach used for point estimates.

### Instrumental variables

**Instrumental variables (IV)** designs, unlike the previous approaches,
begin by assuming that the treatment delivered in the study in question is
linked to the outcome in a pattern such that its effect is not directly identifiable.
Instead, similar to regression discontinuity designs,
IV attempts to focus on a subset of the variation in treatment take-up
and assesses that limited window of variation that can be argued
to be unrelated to other factors.^{354}
To do so, the IV approach selects an **instrument**
for the treatment status – an otherwise-unrelated predictor of exposure to treatment
that affects the take-up status of an individual.^{355}
Whereas regression discontinuity designs are “sharp” –
treatment status is completely determined by which side of a cutoff an individual is on –
IV designs are “fuzzy”, meaning that they do not completely determine
the treatment status but instead influence the *probability* of treatment.

As in regression discontinuity designs,
the fundamental form of the regression
is similar to either cross-sectional or difference-in-differences designs.
However, instead of controlling for the instrument directly,
the IV approach typically uses the **two-stage-least-squares (2SLS)** estimator.^{356}
This estimator forms a prediction of the probability that the unit receives treatment
based on a regression against the instrumental variable.
That prediction will, by assumption, be the portion of the actual treatment
that is due to the instrument and not any other source,
and since the instrument is unrelated to all other factors,
this portion of the treatment can be used to assess its effects.
Unfortunately, these estimators are known
to have very high variances relative to other methods,
particularly when the relationship between the instrument and the treatment is small.^{357}
IV designs furthermore rely on strong but untestable assumptions
about the relationship between the instrument and the outcome.^{358}
Therefore IV designs face intense scrutiny on the strength and exogeneity of the instrument,
and tests for sensitivity to alternative specifications and samples
are usually required with an instrumental variables analysis.
However, the method has special experimental cases that are significantly easier to assess:
for example, a randomized treatment *assignment* can be used as an instrument
for the eventual take-up of the treatment itself,^{359}
especially in cases where take-up is expected to be low,
or in circumstances where the treatment is available
to those who are not specifically assigned to it (“encouragement designs”).
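
The experimental special case can be sketched with simulated data in which a randomized offer serves as the instrument for take-up. With a single binary instrument and no control variables, the two-stage estimate reduces to a ratio: the effect of assignment on the outcome divided by the effect of assignment on take-up. All values and names below are hypothetical, and the sketch assumes a constant treatment effect.

```python
import random
import statistics

random.seed(4)

TRUE_EFFECT = 2.0  # hypothetical constant effect of actually taking up the treatment

rows = []
for _ in range(20_000):
    assigned = random.random() < 0.5      # randomized offer (the instrument)
    motivation = random.gauss(0, 1)       # unobserved; drives take-up and outcomes
    took_up = motivation + 1.2 * assigned > 0.8
    outcome = 5 + TRUE_EFFECT * took_up + motivation + random.gauss(0, 1)
    rows.append((assigned, took_up, outcome))


def mean_where(index, condition):
    """Mean of column `index` over the rows that satisfy `condition`."""
    return statistics.mean(float(r[index]) for r in rows if condition(r))


# Naive comparison of those who took up against those who did not:
# biased, because motivated units are more likely to take up the treatment.
naive = mean_where(2, lambda r: r[1]) - mean_where(2, lambda r: not r[1])

# Effect of assignment on the outcome, and on take-up (the first stage).
reduced_form = mean_where(2, lambda r: r[0]) - mean_where(2, lambda r: not r[0])
first_stage = mean_where(1, lambda r: r[0]) - mean_where(1, lambda r: not r[0])

iv_estimate = reduced_form / first_stage
print(f"naive: {naive:.2f}, first stage: {first_stage:.2f}, IV: {iv_estimate:.2f}")
```

Because the estimate is a ratio, a small `first_stage` magnifies any noise in `reduced_form`, which is the source of the high variance noted above.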

In practice, there are a variety of packages that can be used
to analyze data and report results from instrumental variables designs.
While the built-in Stata command `ivregress` will often be used
to create the final results, the built-in packages are not sufficient on their own.
The **first stage** of the design should be extensively tested,
to demonstrate the strength of the relationship between
the instrument and the treatment variable being instrumented.^{360}
This can be done using the `weakiv` and `weakivtest` commands.^{361}
Additionally, tests should be run that identify and exclude individual
observations or clusters that have extreme effects on the estimator,
using customized bootstrap or leave-one-out approaches.^{362}
Finally, bounds can be constructed allowing for imperfections
in the exogeneity of the instrument using loosened assumptions,
particularly when the underlying instrument is not directly randomized.^{363}

### Matching

**Matching** methods use observable characteristics of individuals
to directly construct treatment and control groups to be as similar as possible
to each other, either before a randomization process
or after the collection of non-randomized data.^{364}
Matching observations may be one-to-one or many-to-many;
in any case, the result of a matching process
is similar in concept to the use of randomization strata
in simple randomized control trials.
In this way, the method can be conceptualized
as averaging across the results of a large number of “micro-experiments”
in which the randomized units are verifiably similar aside from the treatment.

When matching is performed before a randomization process,
it can be done on any observable characteristics,
including outcomes, if they are available.
The randomization should then record an indicator for each matching set,
as these become equivalent to randomization strata and require controls in analysis.
This approach is stratification taken to its most extreme:
it reduces the number of potential randomizations dramatically
from the possible number that would be available
if the matching was not conducted,
and therefore reduces the variance caused by the study design.
When matching is done ex post in order to substitute for randomization,
it is based on the assertion that within the matched groups,
the assignment of treatment is as good as random.
However, since most matching models rely on a specific linear model,
such as **propensity score matching**,^{365}
they are open to the criticism of “specification searching”,
meaning that researchers can try different models of matching
until one, by chance, leads to the final result that was desired;
analytical approaches have shown that the better the fit of the matching model,
the more likely it is that it has arisen by chance and is therefore biased.^{366}
Newer methods, such as **coarsened exact matching**,^{367}
are designed to remove some of the dependence on linearity.
In all ex-post cases, pre-specification of the exact matching model
can prevent some of the potential criticisms on this front,
but ex-post matching in general is not regarded as a strong identification strategy.

Analysis of data from matching designs is relatively straightforward;
the simplest design only requires controls (indicator variables) for each group
or, in the case of propensity scoring and similar approaches,
weighting the data appropriately in order to balance the analytical samples on the selected variables.
The `teffects` suite in Stata provides a wide variety
of estimators and analytical tools for various designs.^{368}
The coarsened exact matching (`cem`) package applies the nonparametric approach.^{369}
DIME’s `iematch` command in the `ietoolkit` package
produces matchings based on a single continuous matching variable.^{370}
In any of these cases, detailed reporting of the matching model is required,
including the resulting effective weights of observations,
since in some cases the lack of overlapping supports for treatment and control
means that a large number of observations will be weighted near zero
and the estimated effect will be generated based on a subset of the data.
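
The "micro-experiments" intuition can be illustrated by matching on coarsened cells of a single observable. The sketch below is a stripped-down illustration of the idea behind coarsened matching, not the algorithm implemented by `cem`, `teffects`, or `iematch`. The data are simulated, and the five-year age bins, the variable names, and the weighting by number of treated units are all assumptions made for the example.

```python
import random
import statistics
from collections import defaultdict

random.seed(5)

TRUE_EFFECT = 1.0  # effect built into the simulated data

# Simulated non-randomized data: older units are more likely to be treated,
# and age also raises the outcome, so a raw comparison is confounded.
units = []
for _ in range(5000):
    age = random.uniform(20, 60)
    treated = random.random() < (age - 10) / 60
    y = 0.1 * age + TRUE_EFFECT * treated + random.gauss(0, 1)
    units.append((age, treated, y))

naive = (statistics.mean(y for _, t, y in units if t)
         - statistics.mean(y for _, t, y in units if not t))

# Coarsen age into five-year cells and compare outcomes within each cell.
cells = defaultdict(lambda: {True: [], False: []})
for age, treated, y in units:
    cells[int(age // 5)][treated].append(y)

weighted_sum = 0.0
n_treated_matched = 0
for cell in cells.values():
    if cell[True] and cell[False]:   # keep only cells with overlap
        n_t = len(cell[True])
        gap = statistics.mean(cell[True]) - statistics.mean(cell[False])
        weighted_sum += n_t * gap
        n_treated_matched += n_t

# Average of within-cell differences, weighted by the number of treated units.
matched_estimate = weighted_sum / n_treated_matched
print(f"naive: {naive:.2f}, matched: {matched_estimate:.2f}")
```

Reporting `n_treated_matched` alongside the estimate shows how much of the sample actually contributes to the result once cells without overlap are dropped.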

### Synthetic control

**Synthetic control** is a relatively new method
for the case when appropriate counterfactual individuals
do not exist in reality and there are very few (often only one) treatment units.^{371}
For example, state- or national-level policy changes
that can only be analyzed as a single unit
are typically very difficult to find valid comparators for,
since the set of potential comparators is usually small and diverse
and therefore there are no close matches to the treated unit.
Intuitively, the synthetic control method works
by constructing a counterfactual version of the treated unit
using an average of the other units available.^{372}
This is a particularly effective approach
when the lower-level components of the units would be directly comparable:
people, households, businesses, and so on in the case of states and countries;
or passengers or cargo shipments in the case of transport corridors, for example.^{373}
This is because in those situations the average of the untreated units
can be thought of as balancing by matching the composition of the treated unit.

To construct this estimator, the synthetic controls method requires
retrospective data on the treatment unit and possible comparators,
including historical data on the outcome of interest for all units.^{374}
The counterfactual blend is chosen by optimizing the prediction of past outcomes
based on the potential input characteristics,
and typically selects a small set of comparators to weight into the final analysis.
These datasets therefore may not have a large number of variables or observations,
but the extent of the time series both before and after the implementation
of the treatment is a key source of power for the estimate,
as is the number of counterfactual units available.
Visualizations are often excellent demonstrations of these results.
The `synth` package provides functionality for use in Stata and R,
although since there are a large number of possible parameters
and implementations of the design it can be complex to operate.^{375}
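
The weighting logic can be illustrated with a toy example. This sketch is not how the `synth` package works internally: it uses made-up outcome series for one treated unit and three hypothetical comparators, ignores input characteristics other than past outcomes, and finds the weights by brute-force search over a coarse grid rather than by constrained optimization.

```python
import statistics

PRE_PERIODS = 6  # the first six periods are pre-treatment, the last three are post

# Hypothetical outcome series for the treated unit and three comparator units.
treated_unit = [14.0, 14.2, 15.6, 15.8, 17.2, 17.4, 21.8, 22.0, 23.4]
donors = {
    "A": [10, 11, 12, 13, 14, 15, 16, 17, 18],
    "B": [20, 19, 21, 20, 22, 21, 23, 22, 24],
    "C": [5, 7, 6, 8, 7, 9, 8, 10, 9],
}
names = list(donors)


def synthetic(weights):
    """Weighted average of the donor series, period by period."""
    return [
        sum(w * donors[name][t] for name, w in zip(names, weights))
        for t in range(len(treated_unit))
    ]


def pre_treatment_mse(weights):
    """Mean squared gap between treated and synthetic unit before treatment."""
    series = synthetic(weights)
    return sum((treated_unit[t] - series[t]) ** 2 for t in range(PRE_PERIODS)) / PRE_PERIODS


# Candidate weights: non-negative, in steps of 0.1, summing to one.
grid = [(a / 10, b / 10, (10 - a - b) / 10) for a in range(11) for b in range(11 - a)]
best_weights = min(grid, key=pre_treatment_mse)

# Estimated effect: average post-treatment gap between treated and synthetic unit.
counterfactual = synthetic(best_weights)
gaps = [treated_unit[t] - counterfactual[t] for t in range(PRE_PERIODS, len(treated_unit))]
estimated_effect = statistics.mean(gaps)

print(dict(zip(names, best_weights)), round(estimated_effect, 2))
```

Plotting `treated_unit` against `counterfactual` over time gives the kind of visualization that is typically used to present these results.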

**Counterfactual:** A statistical description of what would have happened to specific individuals in an alternative scenario, for example, a different treatment assignment outcome.

https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up

https://blogs.worldbank.org/impactevaluations/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios

**Balance checks:** Statistical tests of the similarity of treatment and control groups.

https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments

https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments

https://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice

https://blogs.worldbank.org/impactevaluations/revisiting-difference-differences-parallel-trends-assumption-part-i-pre-trend

https://blogs.worldbank.org/impactevaluations/what-are-we-estimating-when-we-estimate-difference-differences

https://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow

https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments

https://blogs.worldbank.org/impactevaluations/econometrics-sandbox-event-study-designs-co

https://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn

https://blogs.worldbank.org/impactevaluations/spatial-jumps

See Iacovone, Maloney, and McKenzie (2019) for an example.

**Propensity Score Matching (PSM):** An estimation method that controls for the likelihood that each unit of observation would receive treatment as predicted by observable characteristics.