# C Research design for impact evaluation

*Development Research in Practice* focuses on tools, workflows, and practical
guidance for implementing research projects. All research team members,
including field staff and research assistants, also need to understand research
design and specifically how research design choices affect data work. Without
going into too much technical detail, because there are many excellent resources
on how to design impact evaluations, this appendix presents a brief overview of
the most common methods of causal inference, focusing on their implications for
data structure and analysis. This appendix is intended to be a reference,
especially for junior team members, for understanding how treatment and control
groups are constructed for common methods of causal inference, the data
structures needed to estimate the corresponding effects, and specific code tools
designed for each method.

Research team members who will do the data work need to understand the study design for several reasons. First, if team members do not know how to calculate the correct estimator for the study, they will not be able to assess the statistical power of the research design. This negatively affects their ability to make real-time decisions in the field, where trade-offs about allocating scarce resources between tasks are inevitable, such as deciding between increasing sample size or increasing response rates. Second, understanding how data need to be organized to produce meaningful analytics will save time throughout a project. Third, being familiar with the various approaches to causal inference will make it easier to recognize research opportunities: many of the most interesting projects occur because people in the field recognize the opportunity to implement one of these methods in response to an unexpected event.

This appendix is divided into two sections. The first covers methods of causal inference in experimental and quasi-experimental research designs. The second discusses how to measure treatment effects and structure data for specific methods, including cross-sectional randomized control trials, difference-in-differences designs, regression discontinuity, instrumental variables, matching, and synthetic controls.

## Understanding causality, inference, and identification

The types of inputs that impact evaluations are typically concerned with—
usually called “treatments”—are also commonly referred to as “programs” or
“interventions.” Treatments are observed and measured in order to obtain
estimates of study-specific *treatment effects*, which are the changes in
outcomes attributable to the treatment (Abadie and Cattaneo (2018)). The primary
goal of research design is to establish causal identification for a treatment
effect. *Causal identification* means establishing that a change in an input
directly altered an outcome. When a study is well identified, it is possible to
say with confidence that the estimate of the treatment effect would, with an
infinite amount of data, be precise.

Under this condition, it is possible to draw evidence from the limited samples that are actually accessible, using statistical techniques to express the uncertainty due to not having infinite data. Without identification, it is not possible to say whether the estimate would be accurate, even with unlimited data; therefore, changes in outcomes cannot be attributed to the treatment in the small samples to which researchers typically have access. Having more data is, therefore, not a substitute for having a well-identified experimental design, so it is important to understand how a study identifies its estimate of treatment effects. This understanding allows estimates to be calculated and interpreted appropriately.

All of the study designs discussed here use the potential outcomes framework (Athey and Imbens (2017b)) to compare a group that received some treatment to another, counterfactual, group. Each of these approaches can be used in two types of designs: experimental designs, in which the research team is directly responsible for creating the variation in treatment, and quasi-experimental designs, in which the team identifies a “natural” source of variation and uses it for identification. Neither type is implicitly better or worse, and both types are capable of achieving causal identification in different contexts.

### Estimating treatment effects using control groups

The key assumption behind estimating treatment effects is that every person,
facility, village, or whatever the unit of intervention is, has two possible
states: their outcomes if they do not receive some treatment and their outcomes
if they do receive that treatment. Each unit’s treatment effect is the
individual difference between the outcomes that would be realized in the treated
state and those that would be realized in the untreated state, and the *average
treatment effect* (ATE) is the average of these individual differences across
the potentially treated population. Most research designs attempt to estimate
this parameter by establishing a counterfactual. A *counterfactual* is a
statistical description of what would have happened to specific individuals in
an alternative scenario—for example, a different treatment assignment outcome.
Several resources provide more or less mathematically intensive approaches to
understanding how various methods do this. *Impact Evaluation in Practice*
(Gertler et al. (2016)) is a strong general guide to these methods. *Causal
Inference* (Hernán and Robins (2010)) and *Causal Inference: The Mixtape*
(Cunningham 2021) provide more detailed approaches to the tools. *Mostly
Harmless Econometrics* (Angrist and Pischke (2008)) and *Mastering ’Metrics*
(Angrist and Pischke (2014)) are excellent resources on the statistical principles
behind all econometric approaches.

Intuitively, the problem of causal inference is as follows: it is not possible to observe the same unit in both its treated and untreated states simultaneously, so measuring and averaging these effects directly is impossible (Rubin (2003)). Instead, researchers typically make inferences from samples. Causal inference methods are those in which it is possible to identify and estimate an average treatment effect (or, in some designs, other types of treatment effects) by comparing averages between groups. Every research design is based on a way of comparing the outcomes of treated groups against those of another set of “control” observations. These designs all serve to establish that the outcomes in the control group would have been identical on average to those of the treated group in the absence of the treatment. Then, the mathematical properties of averages imply that the calculated difference in averages is equivalent to the average difference, which is the parameter of interest. In this framework, almost all causal inference methods can be described as a series of between-group comparisons.

Most of the methods encountered in impact evaluation research rely on some variant of this approach, which is designed to maximize the ability to estimate the effect of the treatment to be evaluated. The focus on identifying treatment effects, however, means that several essential features of causal identification methods are not common in other types of statistical and data science work. First, the econometric models and estimating equations used here do not attempt to create a predictive or comprehensive model of how outcomes are generated. Typically, causal inference designs are not interested in predictive accuracy, so the estimates and predictions that they produce are not as good at predicting outcomes or fitting the data as those of other data science approaches.

Second, when control variables or other variables are included in estimating equations, there is no guarantee that the parameters obtained for those variables are marginal effects in the same way that parameters for the treatment effect(s) are. They can be interpreted only as correlative averages, unless there are additional sources of identification. The models that will be constructed and estimated are intended to do exactly one thing: to express the intention of a project’s research design and to estimate accurately the effect of the treatment it is evaluating. In other words, these models tell the story of the research design in a way that clarifies the exact comparison being made between control and treatment groups.

### Designing experimental and quasi-experimental research

Experimental research designs explicitly allow the research team to change the
condition of the populations being studied, often in the form of government
programs, nongovernmental organization projects, new regulations, information
campaigns, and many more types of interventions (A. V. Banerjee and Duflo (2009); see
the DIME Wiki at https://dimewiki.worldbank.org/Experimental_Methods). The
classic experimental causal inference method is the *randomized control trial*
(RCT; see the DIME Wiki at
https://dimewiki.worldbank.org/Randomized_Control_Trials ). In RCTs, the
treatment group is randomized. That is, from an eligible population, a random
group of units is placed in the treatment state. Another way to think about
these designs is how they establish the control group: a random subset of units
is *not* placed in the treatment state, so that it may serve as a counterfactual
for the subset that is.

A randomized control group, intuitively, is meant to measure how things would have turned out for the treatment group if its members had not been treated. The RCT approach is particularly effective at doing this, as evidenced by its broad credibility in fields ranging from clinical medicine to development. As a result, RCTs are very popular tools for determining the causal impact of specific programs or policy interventions, as evidenced by the awarding of the 2019 Nobel Prize in Economics to Abhijit Banerjee, Esther Duflo, and Michael Kremer “for their experimental approach to alleviating global poverty” (NobelPrize.org (2020)). However, many types of interventions are impractical or unethical to approach effectively using an experimental strategy; for this reason, the ability to access “big questions” through RCT approaches is sometimes limited (Deaton (2009)).

Randomized designs all share several major statistical concerns. The first is the fact that it is always possible to select, by chance, a control group that is not in fact very similar to the treatment group. This risk is called randomization noise, and all RCTs need to assess how randomization noise affects the estimates that are obtained. Second, take-up and implementation fidelity (how closely work carried out in the field corresponds to its planning and intention) are extremely important because programs will, by definition, have no effect if the population intended to be treated does not accept or does not receive the treatment (for an example, see Jung and Hasan (2016)). Loss of statistical power occurs quickly and is highly nonlinear: 70 percent take-up or efficacy doubles the required sample, and 50 percent quadruples it (McKenzie 2011). Such effects are also very hard to correct ex post, because they require strong assumptions about the randomness or lack of randomness of take-up and fidelity. Therefore, field time and descriptive work must be dedicated to understanding how these effects play out in a given study.

*Quasi-experimental* research designs, by contrast, use causal inference methods
based on events not controlled by the research team (see the DIME Wiki at
https://dimewiki.worldbank.org/Quasi-Experimental_Methods). Instead, they
rely on “experiments of nature,” in which natural variation can be argued to
approximate the type of exogenous variation in treatment availability that a
researcher would attempt to create with an experiment (DiNardo (2016)).
Unlike carefully planned experimental designs, quasi-experimental designs
typically require the extra luck of having access to data collected at the right
times and places to exploit events that occurred in the past or having the
ability to collect data in a time and place where an event that produces causal
identification occurred or will occur. Therefore, these methods often use
secondary data, or they use primary data in a cross-sectional retrospective
method, including administrative data or other new classes of routinely
collected information.

Quasi-experimental designs therefore can sometimes access a much broader range of questions than experimental designs, and much less effort is required to produce the treatment and control groups. However, these designs require in-depth understanding of the precise events the researcher wishes to use in order to know what data to acquire and how to model the corresponding experimental design. Additionally, because the population exposed to such events is limited by the scale of the event, quasi-experimental designs are often power-constrained. Because the research team cannot change the population of the study or the treatment assignment, statistical power is typically maximized by ensuring that sampling for data collection is designed to match the study objectives and that attrition from the sampled groups is minimized.

## Obtaining treatment effects from specific research designs

### Cross-sectional designs

A *cross-sectional* research design is any type of study that observes data in
only one time period and directly compares treatment and control groups. Such
data are easy to collect and handle because it is not necessary to track units
across time. If the time period is after a treatment has been fully delivered,
then the outcome values at that time already reflect the effect of the
treatment. If the study is experimental, the treatment and control groups are
randomly constructed from the population eligible to receive each treatment. By
construction, each unit’s receipt of the treatment is unrelated to any of its
other characteristics, and the ordinary least squares regression of outcome on
treatment, without any control variables or adjustments other than for the
design (such as clustering and stratification), produces an unbiased estimate of
the ATE.

Cross-sectional designs can also exploit variations in nonexperimental data to argue that observed correlations do in fact represent causal effects. This causation can be true unconditionally, which is to say that some random event, such as winning the lottery, is a truly random process and can provide information about the effect of receiving a large amount of money (Imbens, Rubin, and Sacerdote (2001)). It can also be true conditionally, which is to say that, once the characteristics that would affect both the likelihood of exposure to a treatment and the outcome of interest are controlled for, the process is as good as random. For example, a study could argue that, once risk preferences are taken into account, exposure to an earthquake is unpredictable (among people with the same risk preferences), and any excess differences after the event (after accounting for differences caused by risk preferences) are caused by the event itself (Callen (2015)).

For cross-sectional designs, what must be carefully maintained in data are the
research design variables describing the treatment randomization process itself
(whether experimental or not) as well as detailed information about differences
in data quality and attrition across groups (Athey and Imbens (2017a)). Only
design controls for the randomization process are needed to construct the
appropriate estimator. Clustering of the standard errors is required at the
level at which the treatment is assigned to observations, and variables that
were used to stratify the treatment must be included as controls in the form of
strata fixed effects (Barrios 2014). *Randomization inference* can be used to
estimate the underlying variability in the randomization process. *Balance
checks*—statistical tests of the similarity of treatment and control groups—are
often reported as evidence of an effective randomization and are particularly
important when the design is quasi-experimental, because then the randomization
process cannot be simulated explicitly. However, controls for balance variables
are usually superfluous in experimental designs, because it is certain that the
treatment and the balance factors are not correlated in the data-generating
process (McKenzie 2017).

Analysis of randomization is typically straightforward and well understood. A
typical analysis will include a description of the sampling and randomization
results, with analyses such as summary statistics for the eligible population
and balance checks for randomization and sample selection. The main results will
usually be primary regression specifications for outcomes of interest, with
appropriate adjustments for multiple hypothesis testing (for an example, see
Armand et al. (2017)). These will be followed by additional specifications with
adjustments for nonresponse, imbalance, and other potential contaminations.
Robustness checks might include randomization-inference analysis or other
placebo regression approaches. Various user-written code tools are available to
help with the complete process of data analysis, including analyzing balance
(`iebaltab`

; see the DIME Wiki at https://dimewiki.worldbank.org/iebaltab) and
visualizing treatment effects (`iegraph`

; see the DIME Wiki at
https://dimewiki.worldbank.org/iegraph). Extensive tools and methods are
available for analyzing selective nonresponse (Özler 2017).

### Difference-in-differences

Whereas cross-sectional designs draw their estimates of treatment effects from
differences in outcome levels in a single measurement,
*difference-in-differences* designs (abbreviated as DD, DiD, diff-in-diff, and
other variants; see the DIME Wiki at https://dimewiki.worldbank.org/Difference-in-Differences)
estimate treatment effects from *changes* in
outcomes between two or more rounds of measurement. In the simplest form of
these designs, three control group averages are used to compute effect
estimates—the baseline level of treatment units, the baseline level of
nontreatment units, and the endline level of nontreatment units (Torres-Reyna (2015)).
The estimated treatment effect is the excess growth of units that receive the
treatment in the period they receive it: calculating that value is equivalent to
taking the difference in means between treatment and nontreatment units at
endline and subtracting the difference in means at baseline
(McKenzie (2012)). The regression model includes a control variable for
treatment assignment and a control variable for time period, and the treatment
effect estimate corresponds to an interaction variable for treatment and time:
it indicates the group of observations for which the treatment is active.

This “two-way fixed effects” design depends on the assumption that, in the
absence of the treatment, the outcome of the two groups would have changed at
the same rate over time, typically referred to as the *parallel trends*
assumption (Friedman 2013). Experimental approaches satisfy this requirement in
expectation, but a given randomization should still be checked for pretrends as
an extension of balance checking (McKenzie 2020). More complex designs with
multiple treatment groups or multiple time periods require correspondingly
adjusted models (Baker, Larcker, and Wang 2021).

There are two main types of data structures for *difference-in-differences*:
repeated cross-sectional and panel data. In *repeated cross-sectional designs*,
each successive round of data collection contains a random sample of
observations from the treated and untreated groups; as in cross-sectional
designs, both the randomization and sampling processes are critically important
to maintain alongside the data. *Panel* data structures are used to observe the
exact same units at different times, so that the same units can be analyzed both
before and after they have (or have not) received treatment (Jakiela 2019). This
structure allows each unit’s baseline outcome (the outcome before the
intervention) to be used as an additional control for its endline outcome, which
can provide increases in power and robustness (McKenzie 2015). When tracking
individuals over time for this purpose, maintaining sampling and tracking
records is especially important because attrition will remove that unit’s
information from all time periods, not just the one in which they are
unobserved. Panel-style experiments therefore require more effort in fieldwork
for studies using original data (Torres-Reyna (2007)). Because the baseline and endline
may be far apart in time, creating careful records during the first round makes
it possible to follow up with the same subjects and to account properly for
attrition across rounds (Özler 2017).

As with cross-sectional designs, difference-in-differences designs are
widespread. Therefore, many standardized tools are available for analysis.
DIME’s `ietoolkit`

Stata package includes the `ieddtab`

command, which
produces standardized tables for reporting results (see
https://dimewiki.worldbank.org/ieddtab).
For more complicated versions of the model (and they
can get quite complicated quite quickly), an online dashboard can be used to
simulate counterfactual results (Kondylis and Loeser 2019a). As in
cross-sectional designs, these main specifications will always be accompanied by
balance checks (using baseline values) as well as by randomization, selection,
and attrition analysis. In trials of this type, reporting experimental design
and execution using the CONSORT style is common in many disciplines and is
useful for tracking data over time (Schulz, Altman, and Moher (2010)).

### Regression discontinuity

*Regression discontinuity* (RD) designs exploit sharp breaks or limits in policy
designs to separate a single group of potentially eligible recipients into
comparable groups of individuals who do and do not receive a treatment (see the
DIME Wiki at https://dimewiki.worldbank.org/Regression_Discontinuity). These
designs differ from cross-sectional and difference-in-differences designs in
that the group eligible to receive treatment is not defined directly but instead
is created during the treatment implementation. In an RD design, there is
typically some program or event that has limited availability because of
practical considerations or policy choices and is therefore made available only
to individuals who meet a certain threshold requirement.

The intuition of this design is that an underlying *running variable* serves as
the sole determinant of access to the program, and a strict cutoff determines
the value of this variable at which eligibility stops (Imbens and Lemieux (2008)).
Common examples are test score thresholds and income thresholds (Evans 2013).
The intuition is that individuals who are just above the threshold are very
nearly indistinguishable from those who are just below it, and their outcomes
after treatment are therefore directly comparable (Lee and Lemieux (2010)). The key
assumption here is that the running variable cannot be manipulated directly by
the potential recipients. If the running variable is time (what is commonly
called an “event study”), there are special considerations
(Hausman and Rapson (2018)). Similarly, spatial discontinuity designs are handled
differently because of their multidimensionality (Kondylis and Loeser 2019b).

RD designs are, once implemented, similar in analysis to cross-sectional or difference-in-differences designs. Depending on the available data, the analytical approach will compare individuals who are narrowly on the inclusion side of the discontinuity with those who are narrowly on the exclusion side (Cattaneo, Idrobo, and Titiunik (2019)). The regression model will be identical to the corresponding research designs—that is, contingent on whether data have one or more time periods and whether the same units are known to be observed repeatedly.

The treatment effect will be identified by the addition of a control for the
running variable—meaning that the treatment effect estimate will be
automatically valid only for a subset of observations in a window around the
cutoff. In many cases, the treatment effects estimated will be “local” rather
than “average” when they cannot be assumed to hold for the entire sample. In the
RD model, the functional form of the running variable control and the size of
that window, often referred to as the choice of *bandwidth* for the design, are
the critical parameters for the result (Calonico et al. (2019)). Therefore, RD
analysis often includes extensive robustness checks using a variety of both
functional forms and bandwidths as well as placebo tests for nonrealized
locations of the cutoff.

In the analytical stage, RD designs often include a substantial component of visual evidence. These visual presentations help to suggest both the functional form of the underlying relationship and the type of change observed at the discontinuity; they also help to avoid pitfalls in modeling that are difficult to detect with parameterized hypothesis tests (Pischke (2018)). Because these designs are more flexible than others, an extensive set of commands helps to assess the efficacy and results from these designs under various assumptions (Calonico, Cattaneo, and Titiunik (2014)). These packages support the testing and reporting of robust plotting and estimation procedures, tests for manipulation of the running variable, and tests for power, sample size, and randomization inference approaches that will complement the main regression approach used for point estimates.

### Instrumental variables

*Instrumental variables* (IV) designs, unlike the previous approaches, begin by
assuming that the treatment delivered in the study in question is linked to the
outcome in a pattern such that its effect is not directly identifiable. Instead,
similar to RD designs, IV designs attempt to focus on a subset of the variation
in treatment take-up and assess a limited window of variation that can be argued
to be unrelated to other factors (Angrist and Krueger (2001)). To do so, the IV
approach selects an *instrument* for the treatment status—an otherwise-unrelated
predictor of exposure to treatment that affects the take-up status of an
individual (see the DIME Wiki at
https://dimewiki.worldbank.org/Instrumental_Variables ). Whereas RD designs
are “sharp”—treatment status is strictly determined by which side of a cutoff an
individual is on—IV designs are “fuzzy,” meaning that the values of the
instrument(s) do not strictly determine the treatment status but instead
influence the probability of treatment.

As in RD designs, the fundamental form of the regression is similar to either
cross-sectional or difference-in-differences designs. However, instead of
controlling for the instrument directly, the IV approach typically uses the
*two-stage-least-squares* estimator (Bond (2020)). This estimator first forms a
prediction of the probability that each unit receives treatment using a
regression of treatment status against the instrumental variable(s). That
prediction will, by assumption, be the portion of the actual treatment that is
due to the instrument and not to any other source; because the instrument is
unrelated to all other factors, this portion of the treatment variation can be
used to estimate relevant effect sizes.

IV estimators are known to have very high variances relative to other methods,
particularly when the relationship between the instrument and the treatment is
weak (Andrews, Stock, and Sun 2019). IV designs furthermore rely on strong but
untestable assumptions about the relationship between the instrument and the
outcome (Bound, Jaeger, and Baker (1995)). Therefore, IV designs face scrutiny on the
strength and exogeneity of the instrument, and tests for sensitivity to
alternative specifications and samples are usually required. However, the method
has special experimental cases that are significantly easier to assess: for
example, a randomized treatment *assignment* can be used as an instrument for
the eventual take-up of the treatment itself (for an example, see
Iacovone, Maloney, and Mckenzie (2019)), especially in cases when take-up is expected to be low
or in circumstances when the treatment is available to those who are not
specifically assigned to it (“encouragement designs”).

In practice, various packages can be used to analyze data and report results
from IV designs. Although the built-in Stata command `ivregress`

is often used
to create the final results, the built-in packages are not sufficient on their
own. The first stage of the design should be tested extensively to demonstrate
the strength of the relationship between the instrument and the treatment
variable being instrumented (Stock and Yogo (2005)). This testing can be done using the
`weakiv`

and `weakivtest`

commands (Pflueger and Wang (2015)). Additionally, tests
should be run that identify and exclude individual observations or clusters that
have extreme effects on the estimator, using customized bootstrap or
leave-one-out approaches (Young 2019). Finally, bounds can be constructed
allowing for imperfections in the exogeneity of the instrument using loosened
assumptions, particularly when the underlying instrument is not directly
randomized (Clarke and Matta (2018)).

### Matching

*Matching* methods use observable characteristics of individuals to construct
treatment and control groups that are as similar as possible to each other,
either before a randomization process or after the collection of nonrandomized
data (see the DIME Wiki at https://dimewiki.worldbank.org/Matching). Matching
groups of observations within a data set may result in one-to-one matches or the
creation of mutually matched groups; the result of a matching process is similar
in concept to the use of randomization strata. In this way, the method can be
conceptualized as averaging across the results of a large number of
“micro-experiments” in which the units in each potential treatment group are
verifiably similar except for their treatment status.

When matching is performed before a randomization process, it can be done on any observable characteristics, including baseline outcomes, if they are available. The randomization should record an indicator identifier for each matching set, as these sets become equivalent to randomization strata and require controls in analysis. This approach reduces the number of potential randomizations dramatically from the possible number that would be available if the matching was not conducted and therefore reduces the variance caused by the study design.

When matching is done ex post in order to substitute for randomization, it is
based on the assertion that, within the matched groups, the assignment of
treatment is as good as random. However, because many matching models rely on a
specific linear model, such as *propensity score matching* (PSM), they are open
to the criticism of “specification searching,” meaning that researchers can try
different models of matching until one, by chance, leads to the desired result.
Analytical approaches have shown that the better the fit of the matching model,
the more likely it is to have arisen by chance and therefore to be biased
(King and Nielsen (2019)). Newer methods, such as *coarsened exact matching*
(Iacus, King, and Porro (2012)), are designed to remove some of the dependence on functional
form. In all ex post cases, prespecification of the exact matching model can
prevent some of the potential criticisms on this front, but ex post matching in
general is not regarded as a strong identification strategy.

Analysis of data from matching designs is relatively straightforward; the
simplest design only requires using controls (indicator variables) for each
group or, in the case of propensity scoring and similar approaches, weighting
the data appropriately in order to balance the analytical samples on the
selected variables. The `teffects`

suite in Stata provides a wide variety of
estimators and analytical tools for various designs (Cooperative (2015)). The coarsened
exact matching (`cem`

) package applies the nonparametric approach
(Blackwell et al. (2009)). The `iematch`

command in the `ietoolkit`

package produces
matchings based on a single continuous matching variable (see the DIME Wiki at
https://dimewiki.worldbank.org/iematch). In any of these cases, detailed
reporting of the matching model is required, including the resulting effective
weights of observations, because in some cases the lack of overlapping supports
for treatment and control means that a large number of observations will be
weighted near zero and the estimated effect will be generated using a subset of
the data.

### Synthetic control

*Synthetic control* is a relatively new method for the case when appropri- ate
counterfactual units do not exist for a treatment of interest, and often there
are very few (or only one) treatment units (Abadie, Diamond, and Hainmueller (2015)). For
example, finding valid comparators for state- or national-level policy changes
that can be analyzed only as a single unit is typically very difficult because
the set of potential comparators is usually small and diverse with no close
matches for the treated unit. Intuitively, the synthetic control method works by
constructing a counterfactual version of the treated unit using an average of
the other units available (Abadie, Diamond, and Hainmueller (2010)). This approach is particularly
effective when the lower-level components of the units would be directly
comparable: people, households, businesses, and so on in the case of states and
countries or passengers or cargo shipments in the case of transport corridors,
for example (Gobillon and Magnac (2016)). In those situations, the average of the
untreated units can be thought of as balancing because it matches the
composition of the treated unit.

To construct this estimator, the synthetic control method requires retrospective
data on the treatment unit and possible comparators, including historical data
on the outcome of interest for all units (for an example, see
Fernandes, Hillberry, and Berg (2016)). The counterfactual blend is chosen by optimizing the
prediction of past outcomes on the basis of potential input characteristics and
typically selects a small set of comparators to weight into the final analysis.
These data sets therefore may not have a large number of variables or
observations, but the extent of the time series both before and after
implementation of the treatment are key sources of power for the estimate, as
are the number of counterfactual units available. Visualizations are often
excellent demonstrations of these results. The `synth`

package provides
functionality for use in Stata and R; however, because the number of possible
parameters and implementations of the design is large, the package can be
complex to operate (Abadie, Diamond, and Hainmueller (2014)).