PRWP Reproducibility Protocol
This protocol outlines the step-by-step process for verifying the reproducibility of research packages submitted to the World Bank’s PRWP verification team. It ensures that research findings can be independently replicated using the provided code, data, and instructions. The protocol covers the entire workflow: receiving submissions, verifying completeness and data access, running the package in a clean environment, tracking changes with version control, checking consistency with the original manuscript, and publishing the final reproducibility package.
1. Receive submission
- Submissions come in through MS Form.
- Power Automate sends an automatic reply to the authors, indicating feedback will be provided within two days.
- Project coordinator:
- Log new submissions in the GitHub project.
- Open an issue with the format ID: TYPE_ISO_YEAR_NUMBER (e.g., RR_NGA_2024_213).
- Type options are:
- PP: published paper.
- RR: Policy Research Working Paper.
- FR: Flagships and reports (this category includes databases so far).
- Assign the reviewer, download the files, and store them in the designated folder (a OneDrive folder named after the ID).
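The ID format above can be checked automatically before an issue is opened. The following is a minimal sketch, not part of the team's tooling: the function name `parse_submission_id` is hypothetical, and the pattern assumes that ISO is a three-letter uppercase country code, YEAR has four digits, and NUMBER is one or more digits, all inferred from the example `RR_NGA_2024_213`.

```python
import re

# Type codes listed in step 1 of this protocol.
SUBMISSION_TYPES = {"PP", "RR", "FR"}

# Assumption: ISO is three uppercase letters, YEAR is four digits, and
# NUMBER is one or more digits (inferred from the example RR_NGA_2024_213).
ID_PATTERN = re.compile(
    r"(?P<type>[A-Z]{2})_(?P<iso>[A-Z]{3})_(?P<year>\d{4})_(?P<number>\d+)"
)


def parse_submission_id(submission_id: str) -> dict:
    """Split an ID into its parts, raising ValueError if it is malformed."""
    match = ID_PATTERN.fullmatch(submission_id)
    if match is None:
        raise ValueError(
            f"ID does not follow TYPE_ISO_YEAR_NUMBER: {submission_id!r}"
        )
    parts = match.groupdict()
    if parts["type"] not in SUBMISSION_TYPES:
        raise ValueError(f"Unknown submission type: {parts['type']!r}")
    return parts
```

For example, `parse_submission_id("RR_NGA_2024_213")` returns the four parts as strings, while an ID with an unlisted type code raises `ValueError`.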
2.a Verify completeness
- Reviewer verifies the package includes all components listed in the checklist here:
- Code files that produce results from original data (ideally, only the necessary files).
- Data files required for analysis.
- Raw outputs generated by the code.
- A README file explaining how to run the package, data sources, and any access restrictions. Alternatively, data details can be provided in a separate data availability statement.
- A link to the manuscript (if published) or a copy of the manuscript itself.
- Note any missing components, even if they don’t prevent the package from running.
- If there is no README or information on the data, return the package to the author. If there’s no main script but the README is clear, proceed but recommend adding a main script.
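A first pass over the checklist can be scripted before the manual review. The sketch below is illustrative only: the function names and the extension lists are assumptions, not the team's official checklist, and a file's presence says nothing about whether its content is adequate.

```python
from pathlib import Path

# Assumption: these extensions are illustrative, not an official list.
CODE_EXTENSIONS = {".do", ".r", ".py"}
DATA_EXTENSIONS = {".dta", ".csv", ".xlsx", ".rds", ".parquet"}


def check_completeness(package_dir: str) -> dict:
    """Return which checklist components appear to be present in a package."""
    files = [p for p in Path(package_dir).rglob("*") if p.is_file()]
    suffixes = {p.suffix.lower() for p in files}
    return {
        "readme": any(p.name.lower().startswith("readme") for p in files),
        "code": bool(suffixes & CODE_EXTENSIONS),
        "data": bool(suffixes & DATA_EXTENSIONS),
    }


def missing_components(package_dir: str) -> list:
    """List the components that were not found, for noting in the issue."""
    found = check_completeness(package_dir)
    return [name for name, present in found.items() if not present]
```

An empty list from `missing_components` only means that files of each kind exist. The reviewer still needs to read the README and confirm that the data and outputs are the ones the code uses.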
2.b Verify data access
- Confirm the status of the data is clear: public, private, or public but restricted from resharing.
- If data cannot be included (private or restricted), verify that the README or data availability statement provides clear access instructions.
- Ensure data access instructions are sufficiently detailed.
- The Project coordinator starts the metadata entry to confirm that the data sources are described clearly enough for publication.
3. Start Clean Environment
- Start with clean environments to avoid dependency conflicts. Follow these instructions:
- Stata: Link to instructions
- R: Link to instructions
- Python: Link to instructions
4. Version control with Git
- Locate the folder of the reproducibility package and add a `.gitignore` file. You can use this template. Adjust it as needed, since the template ignores many file types.
- Create a Git repository to track changes.
- Use GitHub Desktop to create the repository in the package’s location.
- Commit the initial package received from the authors.
- Important: We only use GitHub Desktop locally and do not publish the repository on GitHub.com.
This will help you see if the outputs are changing after you run the code.
5. Run the package
- Before running the package, delete the author’s outputs and the intermediate data. This ensures that the code goes from raw data to analysis results and that all outputs are created by the code. Sometimes the authors used xlsx files to create figures. Make sure this is explained in the README, in which case you should not delete these files.
- Make sure you followed step 3 and are starting from a clean slate, and that this is included in the code if needed (for instance, `sysdir set PLUS` in Stata and setting `version` in the main script).
- Check if it runs from start to finish by only changing the top-level directory.
- Document any modifications required to run the package (this will be done automatically in git).
- Troubleshoot if the package does not run, often due to missing dependencies.
- Use the GitHub issue to document any meaningful changes you had to make for the code to run.
- Commit to the Git repo as `first-run`.
Document these critical aspects as you prepare and run the package:
- Dependencies
- List all libraries and packages required, saved in the ado folder (Stata), `requirements.txt` (Python), or `renv.lock` (R).
- Changes Made
- Note any adjustments, such as installing packages, modifying paths, or code changes. Document substantive changes in the GitHub issue.
- System Information
- Record the OS, processor, memory, and software version (including Stata edition). You will include this in the report (step 9).
- Time Spent
- Log time spent at each stage as a comment in the GitHub issue, tagging the Project coordinator.
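Part of the system information listed above can be collected with Python's standard library. This is a minimal sketch under stated assumptions: `collect_system_info` and `total_memory_gb` are hypothetical helper names, and the memory lookup relies on `os.sysconf`, which is not available on Windows, so the function returns `None` there. The Stata edition and the R version are not covered and still need to be recorded by hand.

```python
import os
import platform


def total_memory_gb():
    """Return physical memory in GB, or None where os.sysconf cannot report it."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows; some names are missing elsewhere.
        return None
    if page_size < 0 or page_count < 0:
        return None
    return round(page_size * page_count / 1024**3, 1)


def collect_system_info() -> dict:
    """Gather OS, processor, memory, and Python version for the report."""
    return {
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        "processor": platform.processor(),  # may be an empty string
        "memory_gb": total_memory_gb(),
        "python_version": platform.python_version(),
    }
```

The resulting dictionary can be pasted into the GitHub issue and reused when drafting the report.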
6. Verify stability
- Run the package again to ensure consistent outputs:
- If runtime is under a day, run twice.
- If runtime exceeds a day, run once.
- Commit results as `second-run`.
- Use GitHub Desktop to track differences in outputs across runs. Use the slider for graphs and the changes window for `.txt` or `.csv` files.
- If discrepancies occur, document them in the GitHub issue. Use `reprun` (Stata) to detect where inconsistencies start. The Project coordinator may return the package to the authors with detailed feedback.
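As a complement to the visual comparison in GitHub Desktop, outputs from two runs can be compared by file hash. The sketch below is one possible approach, not the team's procedure: `hash_outputs` and `changed_files` are hypothetical names, and it assumes that a copy of each run's output folder is available on disk.

```python
import hashlib
from pathlib import Path


def hash_outputs(output_dir: str) -> dict:
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    root = Path(output_dir)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(first_run: dict, second_run: dict) -> list:
    """Return paths that are missing from one run or differ between runs."""
    all_paths = set(first_run) | set(second_run)
    return sorted(p for p in all_paths if first_run.get(p) != second_run.get(p))
```

A byte-level comparison is strict: a file that embeds a timestamp or other run-specific metadata will be flagged even if the results are identical, so any file listed by `changed_files` still needs to be inspected before it is reported as a discrepancy.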
7. Confirm initial run
- If the package runs successfully:
- Note the run time.
- Notify the Project coordinator.
- Send confirmation to authors with an estimated report delivery date.
- Request any missing information if needed.
8. Verify consistency with the manuscript
- For papers with more than 10 exhibits in the appendix, randomly select 10 exhibits for review using this randomization code.
- Compare raw outputs to paper exhibits:
- Tables: Ensure consistent observation counts, coefficients, standard errors, signs, and significance indicators.
- Graphs: Verify axes, legends, and visual values match.
- Note non-analytical outputs (e.g., timelines) or tables missing from the code outputs.
- Document discrepancies in the report (step 9). If the authors provided outputs, compare them to your results.
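The random selection of exhibits mentioned above can be made repeatable by fixing a seed. The following is only an illustration of the idea and is not the randomization code linked in this step: the function name `select_exhibits`, the default seed, and the exhibit labels are all hypothetical.

```python
import random


def select_exhibits(exhibits: list, k: int = 10, seed: int = 0) -> list:
    """Return all exhibits if there are k or fewer, else a seeded sample of k.

    The sample is returned in the original order of the exhibits.
    """
    if len(exhibits) <= k:
        return list(exhibits)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    indices = sorted(rng.sample(range(len(exhibits)), k))
    return [exhibits[i] for i in indices]


# Hypothetical usage: 15 appendix tables, 10 of which are reviewed.
appendix = [f"Table A{i}" for i in range(1, 16)]
selected = select_exhibits(appendix, k=10, seed=213)
```

Noting the seed alongside the selection, for example in the GitHub issue, would let someone else reproduce the same draw.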
9. Draft Reproducibility Report
- Use the Overleaf template here or the template generator (internal only) here.
- Document findings using the established color scheme.
- Compile code suggestions to be shared with authors by the Project coordinator.
- Include system specifications in the report (if someone else ran the package, include their specifications).
- Submit the draft to the Project coordinator for review.
- Log time spent drafting the report in the GitHub issue.
10. Prepare Metadata and Publication
- Ensure all relevant information is included in the metadata.
- The reproducibility package should include:
- Code and data (if sharable).
- README (PDF version).
- Reproducibility report.
- Modified BSD-3 license here.
- Zip the package for final archiving.
- Follow the metadata protocol to create the package in the editor (internal only).
- Considerations:
- Determine if the data can be publicly shared.
- If data sharing is restricted, clearly describe the limitations following these protocols (internal view only).
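The zipping step can be scripted so that the same files are archived each time. This is a minimal sketch, not the team's archiving procedure: `zip_package` is a hypothetical name, and excluding the local `.git` folder is an assumption based on the repository being used only locally for tracking changes.

```python
import zipfile
from pathlib import Path

# Assumption: the local Git history is not part of the published package.
EXCLUDED_DIRS = {".git"}


def zip_package(package_dir: str, zip_path: str) -> list:
    """Zip the package folder, skipping excluded directories.

    Returns the relative paths that were written to the archive.
    """
    root = Path(package_dir)
    target = Path(zip_path).resolve()
    archived = []
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            if not path.is_file() or EXCLUDED_DIRS & set(relative.parts):
                continue
            if path.resolve() == target:
                continue  # do not add the archive to itself
            archive.write(path, relative.as_posix())
            archived.append(relative.as_posix())
    return archived
```

The returned list can be checked against the components listed above (code, data if sharable, README, report, and license) before the archive is uploaded. Restricted data still has to be removed from the folder by hand, since this sketch does not know which files are sharable.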
11. Publish to RRR
- Draft the catalog entry on QA.
- Submit the entry to the Team Leader for review.
- If approved, the Project coordinator publishes it on the production server.
- Update the GitHub repo status and resolve pending issues.
- Once a week, send published projects to the Microdata Library Program Manager for approval and DOI creation.
12. Send Report and Package to Authors
- The Project coordinator sends the final report and package to the authors, including any unresolved issues.
- Discuss optional suggestions with the authors before publication. If changes are made, the authors need to resubmit the package and a revised report must be prepared.