Introduction

Welcome to Development Research in Practice: The DIME Analytics Data Handbook. This book is intended to teach all users of development data how to handle data effectively, efficiently, and ethically. An empirical revolution has rapidly changed the face of development research over the last decade. Increasingly, researchers are working not just with complex data, but with original data: datasets collected by the research team themselves or acquired through a unique agreement with a project partner. Research teams must carefully document how original data is created, handled, and analyzed. These tasks now carry as much weight in determining the quality of the evidence as the research design and the statistical approaches do. At the same time, the scope and scale of empirical research projects is expanding: more people are working on the same data over longer timeframes. For that reason, the central premise of this book is that data work is a “social process”. This means that the many different people on a team need to have the same ideas about what is to be done, and when, where, and by whom, so that they can collaborate effectively on a large, long-term research project.

Despite the growing importance of managing data work, little practical guidance is available for practitioners. There are few guides to the conventions, standards, and best practices that are fast becoming a necessity for empirical research. Development Research in Practice aims to fill that gap. It covers the full data workflow for a complex research project using original data. We share the lessons, tools, and processes developed within the World Bank’s Development Impact Evaluation (DIME) department, and compile them into a single narrative of best practices for data work. This book is not sector-specific; it will not teach you econometrics, or how to design an impact evaluation. There are many excellent existing resources on those topics. Instead, it will teach you how to think about all aspects of your research from a data perspective, how to structure research projects to ensure data quality, and how to institute transparent and reproducible workflows. We realize that adopting these workflows may have significant upfront learning costs, but we are convinced that these investments pay off quickly, as you will both save time and improve the quality of your research going forward.

How to read this book

This book aims to be a highly practical resource so the reader can immediately begin to collaborate more effectively on large, long-term research projects that use the methods and tools discussed. This introduction outlines the basic philosophies that motivate this book and our approach to research data. We want all readers to understand at the outset our mindset that research data work is primarily about communicating effectively within a team and that standardization and simplification of data tasks is a major enabler of effective collaboration. The main chapters of this book will walk you through the data workflow at each stage of an empirical research project, from design to publication. The figure in this introduction visualizes the data workflow. Chapters 1 and 2 contextualize the workflow, and set the stage for the hands-on data tasks which are described in detail in Chapters 3 to 7.

Chapter 1 describes the principles of credible, transparent, and reproducible research, and discusses what they imply for how data work should be planned, documented, and shared. These principles motivate the practices recommended throughout the rest of the book.

Chapter 2 teaches you to structure your data work for collaborative research while ensuring the privacy and security of research participants. It discusses the importance of planning data work and selecting the associated tools in advance, long before any data is acquired. It also describes ethical concerns common to development data, pitfalls in the legal and practical management of data, and how to respect the rights of research participants at all stages of data work.

Chapter 3 turns to the measurement framework, and how to translate research design to a data work plan. It details DIME’s data map template, a set of tools to communicate the project’s data requirements both across the team and across time. It also discusses how to implement random sampling and random assignment in a reproducible and credible manner.

Chapter 4 covers data acquisition. It starts with the legal and institutional frameworks for data ownership and licensing, to ensure that you are aware of the rights and responsibilities of using data collected by the research team or by others. It provides a deep dive on collecting high-quality primary electronic survey data, including developing and deploying survey instruments. Finally, it discusses secure data handling during transfer, sharing, and storage, which is essential in protecting the privacy of respondents in any data.

Chapter 5 describes data processing tasks. It details how to construct “tidy” data at the appropriate units of analysis, how to ensure uniquely identified datasets, and how to routinely incorporate data quality checks into the workflow. It also provides guidance on de-identification and cleaning of personally-identified data, focusing on how to understand and structure data so that it is ready for indicator construction and analytical work.

Chapter 6 discusses data analysis tasks. It begins with data construction: the creation of new variables from the data acquired or collected in the field. It introduces core principles for writing analytical code and for creating, exporting, and storing research outputs such as figures and tables reproducibly using dynamic documents.

Chapter 7 outlines the publication of research outputs, including manuscripts, code, and data. This chapter discusses how to effectively collaborate on technical writing using dynamic documents. It also covers how and why to publish datasets in an accessible, citable, and safe fashion. Finally, it provides guidelines for preparing functional and informative reproducibility packages that contain all the code, data, and meta-information needed for others to evaluate and reproduce your work.

Each chapter starts with a box that provides a summary of the most important points, takeaways for different types of readers, and a list of key tools and resources for implementing the recommended practices. After reading each chapter, you should understand what tasks will be performed at every stage of the workflow and how to implement them according to best practices. You should also understand how the various stages of the workflow tie together, and which inputs are required and which outputs are produced at each stage. The references and links contained in each chapter will lead you to detailed descriptions of individual ideas, tools, and processes to refer to when you need to implement the tasks yourself.

Demand for Safe Spaces Case Study

Throughout this Handbook, we will refer to a completed DIME project, Demand for Safe Spaces: Avoiding Harassment and Stigma, to provide a case study of the empirical research tasks described in this book. You will find boxes in each chapter with examples of how the practices and workflows described in that chapter were applied in this real-life example. All the code examples and diagrams referenced in the case study can be accessed directly through this book’s GitHub repository. We have made minor adaptations to the original study materials for function and clarity. All original materials can be found in the project’s reproducibility package.

The Demand for Safe Spaces study is summarized in its abstract as follows. What are the costs to women of harassment on public transit? This study randomizes the price of a women-reserved “safe space” in Rio de Janeiro and crowdsources information on 22,000 rides. Women in the public space experience harassment once a week. A fifth of riders are willing to forgo 20 percent of the fare to ride in the “safe space”. Randomly assigning riders to the “safe space” reduces physical harassment by 50 percent, implying a cost of $1.45 per incident. Implicit Association Tests show that women face a stigma for riding in the public space that may outweigh the benefits of the safe space.

The Demand for Safe Spaces study used novel original data from three sources. It collected information on 22,000 metro rides from a crowdsourcing app (referred to as crowdsourced ride data in the case study examples), a survey of randomly sampled commuters on the platform (referred to as the platform survey), and data from an implicit association test. The research team first elicited revealed preferences for the women-reserved cars, and then randomly assigned riders across the reserved and non-reserved cars to measure differences in the incidence of harassment. The use of a customized app allowed the researchers to assign data collection tasks and to vary the assigned ride space (women-reserved car versus public car) and the associated payout across rides. In addition, the team administered social norm surveys and implicit association tests to a random sample of men and women commuters to document a potential side effect of reserved spaces: stigma against women who choose to ride in the public space.

For the purposes of the Handbook, we focus on the protocols, methods, and data used in the Demand for Safe Spaces study, rather than the results. To learn more about the findings from this study, and for more details on how it was conducted, we encourage readers to download the working paper. The material for all the examples in the book can be found at https://github.com/worldbank/dime-data-handbook/tree/master/examples.

The Demand for Safe Spaces study repository can be accessed at https://github.com/worldbank/rio-safe-space.

The working paper Demand for Safe Spaces: Avoiding Harassment and Stigma is available at https://openknowledge.worldbank.org/handle/10986/33853.

The DIME Wiki: A complementary resource

Throughout the book, you will find many references to the DIME Wiki. The DIME Wiki is a free online collection of impact evaluation resources and best practices. This handbook and the DIME Wiki are meant to go hand in hand: the handbook provides the narrative structure and workflow, while the Wiki dives into specific implementation details, offers detailed code examples, and provides a more exhaustive set of references for each topic. Importantly, the DIME Wiki is a living resource that is continuously updated and improved by the authors of this book and by external contributors. We welcome all readers to register as Wiki users and contribute directly.

Standardizing data work

In the past, data work was often treated as a “black box” in research. A published manuscript might exhaustively detail research designs, estimation strategies, and theoretical frameworks, but typically reserve very little space for describing how the data was actually collected and handled. With such a paper, it is almost impossible to assess the quality of the data or whether the results could be reproduced. In the past decade, this has started to change,1 in part due to increasing requirements by publishers and funders to release code and data.

Data handling and documentation are key skills for researchers and research staff. Standard processes and documentation practices are important throughout the research process to accurately convey and implement the intended research design,2 and to minimize security risks: better protocols and processes lower the probability of data leakages, security breaches, and loss of personal information. When data work is done in an ad hoc manner, it is very difficult for others to understand what is being done; a reader simply has to trust that the researchers did these things right. Most importantly, if any part of the data pipeline breaks down, research results become unreliable3 and cannot be faithfully interpreted as an accurate picture of the intended research design.4 Because we almost never have “laboratory” settings5 in this type of research, such a failure has a very high cost: we will have wasted not only the investments made in knowledge generation but also the research opportunity itself.6

Accurate and reproducible data management and analysis are essential to the success and credibility of modern research. Standardizing and documenting data handling processes makes it possible to evaluate and understand the data work alongside any final research outputs. An important component of this is process standardization.7 Process standardization means that there is little ambiguity about how something ought to be done, and therefore the tools to do it can be set in advance. Standard processes help other people understand your work, and they also make your work easier to document. Process standardization and documentation should allow readers of your code to: (1) quickly understand what a particular process or output is supposed to be doing; (2) evaluate whether or not it does that thing correctly; and (3) modify it, either to test alternative hypotheses or to adapt it into their own work. This book will discuss specific standards recommended by DIME Analytics, but we are more interested in convincing readers to agree on and adopt a standard within their research teams than in insisting on the particular standards we recommend.
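
As a small illustration of these three goals, consider the hypothetical Stata sketch below. It uses only a dataset that ships with Stata; the variable names and cutoff are placeholders, not part of any DIME standard. A purpose comment states what the code is supposed to do (1), an explicit check lets a reader evaluate an assumption (2), and a clearly named parameter makes the code easy to modify (3).

```stata
* Purpose: flag low-mileage cars in Stata's built-in auto dataset (hypothetical example)
local mpg_cutoff = 20                 // (3) a named parameter that is easy to modify
sysuse auto, clear                    // load a dataset that ships with Stata
isid  make                            // (2) check: each row is uniquely identified
gen   low_mpg = (mpg < `mpg_cutoff')  // (1) what: flag cars below the mileage cutoff
label variable low_mpg "Below `mpg_cutoff' miles per gallon"
tabulate low_mpg                      // quick sanity check of the new variable
```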

Standardizing coding practices

Modern quantitative research relies heavily on statistical software tools, written in various coding languages, to standardize analytical work. Outputs like regression tables and data visualizations are created using code in statistical software for two primary reasons: using a standard command or package ensures that the work is done correctly, and it ensures that the same procedure can be confirmed or checked at a later date or using different data. Keeping a clear, human-readable record of these code and data structures is critical. While it is often possible to perform nearly all the relevant tasks through an interactive user interface, or even through software such as Excel, we strongly advise against this practice. In the context of statistical analysis, the practice of writing all work using standard code is widely accepted. To support this practice, DIME now maintains portfolio-wide standards for how analytical code should be managed and made accessible before, during, and after release or publication.
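
To make this concrete, here is a minimal sketch of a code-generated regression table. It assumes the user-written estout package (installable with ssc install estout); the dataset, model, and file name are purely illustrative. Rerunning the script reproduces the identical table, which no point-and-click or copy-paste workflow can guarantee.

```stata
* A minimal sketch: produce and export a regression table entirely through code
* (assumes the user-written estout package: ssc install estout)
sysuse auto, clear             // built-in example dataset
regress price mpg weight       // the analysis itself
esttab using "price_regression.tex", se label replace  // write the table to disk
```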

Over the last few years, DIME has extended the same principles to the preparation of data for analysis, which often involves just as much (or more) of the data manipulation done over the life cycle of a research project. A major aim of this book is to encourage research teams to think about the tools and processes they use for designing, collecting, and handling data just as they do for analytical tasks. Correspondingly, a major contribution of DIME Analytics has been developing tools and standard practices for implementing these tasks using statistical software.

While we assume that you will do nearly all data work using code, many development researchers come from economics and statistics backgrounds and often see code as a means to an end rather than an output in itself. We believe this must change somewhat: development practitioners must think about their code and programming workflows just as methodologically as they think about their research workflows, and treat code and data as research outputs, just as manuscripts and briefs are.

This approach arises because we see the code as the “recipe” for the analysis. The code tells others exactly what was done and how they can do it again in the future, and it provides a roadmap and knowledge base for further original work.8 Performing every task through written code creates a record of everything you did.9 It also prevents the direct interaction with data files that can lead to non-reproducible processes.10 Finally, DIME Analytics has invested a lot of time in developing code as a learning tool: the examples we have written and the commands we provide are designed to provide a framework for common practice across the entire DIME team, so that everyone is able to read, review, and provide feedback on the work of others starting from the same basic ideas about how various tasks are done.
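
Random assignment is a good example of why this record matters: an assignment drawn by hand or in a spreadsheet cannot be verified after the fact, while one written in code can be rerun and checked line by line. The sketch below is a hypothetical illustration of one common version-sort-seed pattern, using a built-in example dataset; it is not code from any DIME project.

```stata
* A minimal sketch of reproducible random assignment (hypothetical example)
version 16                     // pin the random-number generator across Stata releases
sysuse census, clear           // built-in example dataset (one row per US state)
isid state                     // confirm the sort variable uniquely identifies rows
sort state                     // fix the sort order so the draw is replicable
set seed 20200415              // a seed chosen once and recorded in the code
gen rand = runiform()          // one random draw per observation
sort rand                      // order observations by the random draw
gen treatment = (_n <= _N/2)   // assign the first half to treatment
```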

Most specific code tools have a learning and adaptation process, meaning you will become most comfortable with each tool only by using it in real-world work. To support your process of learning reproducible tools and workflows, we reference free and open-source tools wherever possible and point to more detailed instructions when relevant. Stata,11 a proprietary software package, is the notable exception here due to its persistent popularity in development economics and econometrics. This book also includes, in Appendix A, the DIME Analytics Coding Guide, which contains guidance on how to write good code, instructions for using the code examples in this book, and our Stata Style Guide. DIME projects are strongly encouraged to explicitly adopt and follow coding style guides in their work. Style guides harmonize code within and across teams, making it easier to understand and reuse code, which ultimately helps teams build on each other’s best practices. Some of the programming languages used at DIME already have well-established and commonly used style guides, such as the Tidyverse style guide for R and PEP-8 for Python.12 Stata has relatively few resources of this type, which is why we have created and included one here that we hope will be an asset to all Stata users.

The team behind this book

DIME is the Development Impact Evaluation department of the World Bank.13 Its mission is to generate high-quality and operationally relevant data and research to transform development policy, help reduce extreme poverty, and secure shared prosperity.14 DIME develops customized data and evidence ecosystems to produce actionable information and recommend specific policy pathways to maximize impact. The department conducts research in 60 countries with 200 agencies, leveraging a US$180 million research budget to shape the design and implementation of US$18 billion in development finance. DIME also provides advisory services to 30 multilateral and bilateral development agencies.15 DIME research is organized into four primary topic pillars: Economic Transformation and Growth; Gender, Economic Opportunity, and Fragility; Governance and Institution Building; and Infrastructure and Climate Change. Over the years, DIME has employed dozens of research economists and hundreds of full-time research assistants, field coordinators, and other staff. The team has conducted over 325 impact evaluations. Development Research in Practice exists to take advantage of that concentration and scale of research, to synthesize many resources for data collection and research, and to make DIME tools available to the larger community of development researchers.

As part of its broader mission, DIME invests in public goods to improve the quality and reproducibility of development research around the world. One key early innovation at DIME was the creation of DIME Analytics, the team responsible for writing and maintaining this book. DIME Analytics is a centralized unit that develops high-quality research practices and ensures their adoption across the department’s portfolio. This is done through an intensive, collaborative innovation cycle: DIME Analytics onboards and supports research assistants and field coordinators, provides standard tools and workflows to all teams, delivers hands-on support when new tasks or challenges arise, and then develops and integrates lessons from those engagements to bring to the full team. Resources developed and tested in DIME are converted into public goods for the global research community, through open-access trainings and open-source tools. Appendix B, the DIME Analytics Resource Directory, provides an introduction to these public materials.

DIME Analytics has invested many hours over the past years learning from data work across DIME’s portfolio, identifying inefficiencies and barriers to success, developing tools and trainings, and standardizing the best-practice workflows adopted in DIME projects. It has also invested significant energy in the language and materials used to teach these workflows to new team members and, in many cases, in software tools that support these workflows explicitly. DIME team members often work on diverse portfolios of projects with a wide range of teammates, and we have found that standardizing core processes across all projects results in higher-quality work with fewer opportunities for mistakes. In that way, the Analytics team is DIME’s method of institutionalizing tools and practices, developed and refined over time, that give the department a common base of knowledge and practice. In 2018, for example, DIME adopted universal reproducibility checks conducted by the Analytics team. The lessons from this practice led to rapid improvement: 50% of papers submitted in 2018 required significant revision to pass, while 64% of papers submitted in 2019 passed without any revision required.

Looking ahead

While adopting the workflows and mindsets described in this book requires an up-front investment, it will save you (and your collaborators) a lot of time and hassle very quickly. In part this is because you will learn how to implement essential practices directly; in part because you will find tools for the more advanced practices; and, most importantly, because you will acquire the mindset of doing research with a high-quality data focus.

For some readers, the number of new tools and practices recommended in this book may seem daunting. We know from experience at DIME that full-scale adoption is possible: over the last few years, the full DIME portfolio has transitioned to transparent and reproducible workflows, with a fair share of hiccups along the way. The authors of this book supported that at-scale transition, and we hope that by sharing our lessons learned and resources, the learning curve for readers will be less steep. In the summary boxes at the beginning of each chapter, we provide a list of the key tools and resources to help readers prioritize. In many cases we also offer “second-best” practices: easy-to-implement measures that increase transparency and reproducibility when full-scale adoption of the recommended workflows is not immediately feasible. In fact, we encourage teams to adopt one new practice at a time rather than rebuild their whole workflow from scratch right away. We hope that by the end of the book, all readers will have learned how to handle data more efficiently, effectively, and ethically at all stages of the research process.

Figure 0.1: Overview of development research data tasks


  1. Swanson et al. (2020)

  2. Vilhuber (2020)

  3. McCullough, McGeary, and Harrison (2008)

  4. https://blogs.worldbank.org/impactevaluations/more-replication-economics

  5. See Baldwin and Mvukiyehe (2015) for an example.

  6. Camerer et al. (2016)

  7. Process standardization: agreement within a research team about how all tasks of a specific type will be approached.

  8. Hamermesh (2007)

  9. Ozier (2019)

  10. Chang and Li (2015)

  11. StataCorp, LLC (2019)

  12. See DIME Analytics Coding Standards: https://github.com/worldbank/dime-standards

  13. https://www.worldbank.org/en/research/dime

  14. Legovini, Di Maro, and Piza (2015)

  15. Legovini et al. (2019)