Building a Reproducibility Package (Flagship Edition)
World Bank Reproducible Research Initiative · reproducibility.worldbank.org
What Is a Reproducibility Package?
A reproducibility package includes everything needed to replicate the findings in a paper:
Includes | Details |
---|---|
Documentation | README, Data Availability Statement (DAS), figure/table mapping |
Code | All code files required to go from the original data to the results in the paper |
Data | All raw data needed and/or detailed access instructions to obtain it |
Standard: Computational Reproducibility
A third party can reproduce the exact findings in the paper using the data, code, and documentation provided by the author.
Verified packages are published in the Reproducible Research Repository (RRR).
Reproducibility Workflow
Step | What Happens |
---|---|
Submission | Authors prepare the reproducibility package and submit it for verification. |
Verification | The reproducibility team tests whether the results can be fully reproduced using the submitted code and data. A detailed verification report is issued. |
Publication | If reproducible, the package is published on the Reproducible Research Repository (RRR) with a DOI, metadata, and verification seal. |
Components of a Good Reproducibility Package: Flagships
Flagship projects typically involve multiple datasets, chapters, and contributors, which adds complexity to reproducibility. The table below outlines the essential components of a high-quality package, with specific tips to support coordination and transparency in flagship workflows.
Component | Description & Flagship-Specific Tips |
---|---|
README File | Critical for flagships: serves as the main guide for replicators. • Provide step-by-step instructions for how to run the code and reproduce results. • Include a list of exhibits, indicating which are generated by the package and which are taken from external sources (with citations). • Include a Data Availability Statement (see below). • If the project structure is complex (e.g., organized by chapter or module), describe the folder layout to help others navigate it. Use our templates: Markdown · Word |
Data Availability Statement (DAS) | Essential for flagships: these often use a mix of public, restricted, and internal datasets. • List every dataset used, regardless of size or access level. • Clearly describe the access conditions for each dataset: e.g., public (include URL), restricted (how the team obtained it), or internal WB access only (with process and a contact name if possible). • Include the access date, since datasets may be updated before project completion. Example DAS for Flagship |
Code Files | Organize scripts by task (e.g., `cleaning.R`, `analysis.do`) and manage them with a single main script (`main.R`, `main.do`, or equivalent). • List all dependencies explicitly (e.g., R packages, ado files, Python libraries). • For flagships: use a modular structure by chapter or module, and agree on folder naming and structure across all contributors. Use our templates: Stata · R |
Data | Data is often the trickiest part for flagships due to multiple sources. • Keep raw and processed data in separate folders. • Document all data transformations in code; if manual edits were made, explain them in the README. • Remove any unused datasets before submission. • If using internally produced data (e.g., from other WB teams), provide as much detail as possible: dataset title, source team, contact person (if applicable), and whether it could be shared on DDH/MDL under a restricted license. • Maintain consistent dataset versions across chapters and authors, and store original data in permanent, team-accessible locations. |
Final Outputs | Include all raw outputs used in the paper (e.g., CSVs, LaTeX tables, plots). • If any outputs were sent to the design/publication team, specify which ones to avoid mismatches between the paper and the package. |
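The single-main-script pattern described in the Code Files row can be sketched in a few lines. This is a hypothetical illustration in Python (analogous to a `main.R` or `main.do`); the function names, the sample records, and the statistic computed are all made up for the example:

```python
# main.py -- hypothetical single entry point for a reproducibility package.
# Running every step from one place lets a replicator reproduce the results
# with one command instead of guessing the order of scripts.

def clean(raw_rows):
    """Cleaning step: drop records with missing values (illustrative rule)."""
    return [row for row in raw_rows if row["value"] is not None]

def analyze(rows):
    """Analysis step: compute the statistic behind a (made-up) exhibit."""
    return sum(row["value"] for row in rows) / len(rows)

def run_pipeline(raw_rows):
    # Order matters: always raw data -> cleaning -> analysis.
    cleaned = clean(raw_rows)
    return analyze(cleaned)

if __name__ == "__main__":
    sample = [{"value": 2.0}, {"value": 4.0}, {"value": None}]
    print(run_pipeline(sample))  # prints 3.0
```

In a real package each step would live in its own script and read from the raw-data folder; the point is only that one command runs the full chain.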
Common Pitfalls & How to Avoid Them
Problem | Solution |
---|---|
Version control: code results ≠ report exhibits | Run the full code right before submission and make sure outputs match your manuscript. Archive final outputs. |
Manual Excel edits not documented | Document all manual steps (e.g., which Excel tab produces which figure). Automate when possible. |
Pipeline starts from intermediate data | Archive raw data and document the entire cleaning pipeline. |
Instability: results vary across runs | Control random seeds. Test stability. Contact the reproducibility team for help. |
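The "results vary across runs" pitfall usually comes down to uncontrolled randomness. A minimal Python sketch of the fix, where `simulate` and the seed values are purely illustrative:

```python
import random

def simulate(n, seed=12345):
    # A fixed, documented seed makes stochastic steps (bootstraps,
    # simulations, random splits) repeatable run after run.
    rng = random.Random(seed)  # local generator: avoids hidden global state
    return [rng.random() for _ in range(n)]

# Two runs with the same seed produce identical draws.
assert simulate(5) == simulate(5)
# A different seed gives different draws -- a quick stability check.
assert simulate(5, seed=999) != simulate(5)
```

The same idea applies in Stata (`set seed`) and R (`set.seed`): fix the seed once, in code, and note it in the README.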
Start Early: Timeline & Submission Steps
Phase | Action Items |
---|---|
Kickoff | • Assign a reproducibility lead per chapter or module. • Define the folder and file structure for the whole team. • Align on data sources and archive raw versions from day 1. |
During project | • Update the README and DAS progressively. • Automate figures/tables as much as possible. • Keep scripts modular and coordinated across chapters. |
Before submission | • Run the entire pipeline using your main script. • Clean the repository: include only necessary data, code, and outputs. • Ensure exhibits match the report exactly. • Review our checklist to make sure everything is ready. |
To submit | Fill out the Reproducibility Verification Request Form and share your package. |
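One way to verify that exhibits match the report exactly before submission is to compare freshly generated outputs against the archived copies by checksum. A sketch using Python's standard library; the file contents shown are hypothetical, and in practice you would read the bytes of files such as a table CSV and its archived copy:

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 fingerprint of an output's contents."""
    return hashlib.sha256(data).hexdigest()

# Self-contained stand-ins for a regenerated output and its archived copy.
fresh = b"country,gdp\nA,2.0\nB,4.0\n"
archived = b"country,gdp\nA,2.0\nB,4.0\n"

print(digest(fresh) == digest(archived))  # True: the exhibit is unchanged
```

If any digest differs, rerun the pipeline and update the archived output before submitting, so the package and the manuscript cannot drift apart.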
Start documenting from day 1; it will save time at submission.
Questions or support?
reproducibility@worldbank.org
Reproducible Research Resources