reprun

Title

reprun - This command is used to automate a reproducibility check for a single Stata do-file, or a set of do-files called by a main do-file. The command should be used interactively; reprun will execute one run of the do-file and record the state of Stata after the execution of each line. It will then run the entire do-file a second time and flag all potential reproducibility errors found by comparing the Stata state to the first run after each line. Debugging and reporting options are available.

Syntax

reprun "do-file.do" [using "/directory/"] , [verbose] [compact] [suppress(rng|srng|dsum|loop)] [debug] [noclear]

By default, reprun will execute the complete do-file specified in “do-file.do” once (Run 1), and record the “seed RNG state”, “sort order RNG”, and “data checksum” after the execution of every line, as well as the exact data in certain cases. reprun will then execute the do-file a second time (Run 2), and find all changes and mismatches in these states throughout Run 2. A table of mismatches will be reported in the Results window, as well as in a SMCL file in a new directory called /reprun/ in the same location as the do-file. If the using argument is supplied, the /reprun/ directory containing the SMCL file will be stored in that location instead.

options           Description
verbose           Report all lines where Run 1 and Run 2 mismatch or change for any value
compact           Report only lines where Run 1 and Run 2 mismatch and change for either the seed or sort RNG
suppress(types)   Suppress reporting of state changes that do not result in mismatches for seed RNG state (rng), sort order RNG (srng), and/or data checksum (dsum), for any reporting setting
debug             Save all records of Stata states in Run 1 and Run 2 for inspection in the /reprun/ folder
noclear           Do not reset the Stata state before beginning reproducibility Run 1

Description

The reprun command is intended to be used to check the reproducibility of a do-file or set of do-files (called by a main do-file) that are ready to be transferred to other users or published. The command will ensure that the outputs produced by the do-file or set of do-files are stable across runs, such that they do not produce reproducibility errors caused by incorrectly managed randomness in Stata. To do so, reprun will check three key sources of reproducibility failure at each point in execution of the do-file(s): the state of the random number generator, the sort order of the data, and the contents of the data itself (see detailed description below).

After completing Run 2, reprun will report all lines where there are mismatches between Run 1 and Run 2 in any of these values. Lines where changes lead to mismatches will be highlighted. Problems should be approached top-to-bottom, as solving earlier issues will often resolve later ones. Additionally, addressing issues from left-to-right in the table is effective. RNG states are responsible for most errors, followed by unstable sorts, while data mismatches are typically symptoms of these reproducibility failures rather than causes in and of themselves.

Mismatches are defined as follows:

Seed RNG State: A mismatch occurs whenever the RNG state differs from Run 1 to Run 2, except any time the RNG state is exactly equivalent to set seed 12345 in Run 1 (the initialization default). By default, reprun invokes clear and set seed 12345 to match the default Stata state before beginning Run 1. The noclear option prevents this behavior; this is not recommended unless you have a rare issue that you need to check at the very beginning of the file. Most projects should quickly set the randomization seed appropriately for replicability.
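As a minimal sketch of why setting the seed matters, the following lines show that draws repeat only when the RNG state is restored. The variable names and the seed value are placeholders chosen for illustration; this is not part of reprun itself.

```stata
* Minimal sketch; variable names and the seed value are placeholders
clear
set obs 10

* Consecutive draws come from different RNG states, so they differ
set seed 346290
gen draw1 = runiform()
gen draw2 = runiform()

* Resetting the same seed restores the RNG state, so the draws repeat
set seed 346290
gen draw3 = runiform()

* Passes: draw1 and draw3 were generated from the same RNG state
assert draw1 == draw3
```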

Sort Order RNG: Since the sort RNG state should always differ between Run 1 and Run 2, a mismatch is defined as any line where the sort RNG state is advanced and checksum fails to match when compared with the Run 1 data (as a CSV) at the same line. This mismatch occurs when the sort order RNG is used in a command that results in the data taking a different order between the two runs. Users should never manually set the sortseed (See help seed and help sortseed) to override these mismatches; instead, they should implement a unique sort on the data using a command like isid (See help isid).
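A unique sort of the kind recommended above might look like the following sketch. It uses the auto dataset shipped with Stata purely as an illustration; the choice of make as the identifier is an assumption that holds for that dataset.

```stata
* Illustration with the auto dataset shipped with Stata
sysuse auto, clear

* mpg has ties, so it does not uniquely identify observations
capture isid mpg
display "isid mpg return code: " _rc   // nonzero means mpg is not a unique identifier

* make is unique in this dataset, so sorting on it gives one well-defined order
isid make, sort
```

Because isid exits with an error when the variables are not a unique identifier, a do-file written this way stops at the problem line instead of silently continuing with an arbitrary order.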

Data Checksum: A mismatch occurs whenever the checksum of the data (as a CSV) in Run 2 fails to match the checksum of the Run 1 data at the same line. Users should understand that lines where only the data checksum fails to match are unlikely to be where problems originate in the code; these mismatches are generally consequences of earlier reproducibility failures in randomization or sorting. Users should also note that results from datasignature are only unique up to the sort order of each column independently; hence, we do not use this command.

Options

By default, reprun returns a list of mismatches in Stata state between Run 1 and Run 2. This means that any time the state of the random number generator, the sort order of the data, or the contents of the data itself do not match Run 1 during Run 2, a flag will be generated for the corresponding line of code. The user may modify this reporting in several ways using options.

Line flagging options

The verbose option can be used to produce even more detail than the default. If the verbose option is specified, then any line in which the state changes during Run 1 or Run 2, or in which the state mismatches between the runs, will be flagged and reported. This is intended to allow the user to do a deep-dive into the function and structure of the do-file’s execution.

The compact option, by contrast, produces less detailed reporting, but is often a good first step to begin locating issues in the code. If the compact option is specified, then only those lines where the seed or sort order RNG both changes during Run 1 or Run 2 and mismatches between the runs will be flagged and reported. Data checksum mismatches alone will be ignored, as will RNG mismatches not accompanied by a change in the state. This is intended to reduce the reporting of “cascading” differences, which arise when some state value changes inconsistently at a single point and remains inconsistent for the remainder of the run (making every subsequent data change a mismatch, for example).

The suppress() option is used to hide the reporting of changes that do not lead to mismatches (especially when the verbose option is specified) for one or more of the types. In particular, since the sort order RNG frequently changes and should not be forced to match between runs, it will very often have changes that do not produce errors. Specifying suppress(srng) will therefore remove a great deal of unhelpful output from the reporting table. To do this for all states, write suppress(rng srng dsum). Suppressing loop will clean up the display of loops so that the titles are only shown on the first line; if combined with compact, however, the titles may not display at all.
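For instance, the line flagging options might be combined as in the sketch below. It assumes reprun is installed and that a do-file named myfile.do (a placeholder) exists in the working directory.

```stata
* Assumes reprun is installed; myfile.do is a placeholder file name

* Detailed report, hiding sort order RNG changes that do not cause mismatches
reprun "myfile.do", verbose suppress(srng)

* Detailed report, hiding non-mismatching changes for all three state types
reprun "myfile.do", verbose suppress(rng srng dsum)
```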

Reporting and debugging options

The debug option allows the user to save all of the underlying materials used by reprun in the /reprun/ folder where the reporting SMCL file will be written. This will include copies of all do-files for each run for manual inspection and text files of the states of Stata after each line. This is automatically cleaned up after execution if debug is not specified.
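A call that keeps the underlying materials could look like this sketch, where myfile.do and path/to/report are placeholder names and reprun is assumed to be installed.

```stata
* Assumes reprun is installed; file and folder names are placeholders
* Keep the per-run do-file copies and state records in the /reprun/ folder under path/to/report
reprun "myfile.do" using "path/to/report", debug
```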

Other options

By default, reprun invokes clear and set seed 12345 to match the default Stata state before beginning Run 1. noclear prevents this behavior. It is not recommended unless you have a rare issue that you need to check at the very beginning of the file, because most projects should very quickly set these states appropriately for reproducibility.

Note on Reproducibility of certain commands

by and bysort: Users will often use by and bysort or equivalent commands to produce “group-level” statistics. The syntax used is usually something like bysort groupvarname : egen newvarname = function(varlist). However, we note that such an approach necessarily introduces an instability in the sort order within each group. reprun will flag these instances as indeterminate sorts, since they can introduce issues later in the code when code is order-dependent, and they will do so right away for functions like rank() or other approaches like bysort groupvarname : gen newvarname = _n. To avoid this, and to write truly reproducible code, users should use the less common but fully reproducible unique sorting syntax of bysort groupvarname (uniqueidvar) ... to ensure a unique sort with by-able commands. For commands with by() options, users should check whether this syntax is available, or remember to re-sort uniquely before any further processes are done. If bysort or the equivalent is called in intermediate or user-written commands that cannot be made to return the data sorted uniquely, those lines will continue to be flagged by reprun. There is not a technical solution to this, to the best of our knowledge; therefore, the flag will remain as a reminder that the user should implement a unique sort after the indicated lines.
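The difference between the two bysort forms can be sketched with the auto dataset shipped with Stata. The variable names order_unstable and order_stable are hypothetical, and make serves as the unique ID only because it happens to be unique in this dataset.

```stata
* Illustration with the auto dataset shipped with Stata
sysuse auto, clear
isid make   // confirm that make uniquely identifies observations

* Order within each foreign group is left arbitrary, so _n is not reproducible
bysort foreign : gen order_unstable = _n

* Order within each group is fixed by make, so _n is reproducible
bysort foreign (make) : gen order_stable = _n
```

After the second command the data are sorted by foreign and make, which is a unique sort, so later order-dependent commands start from a well-defined state.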

merge m:m and set sortseed: These commands will be flagged interactively by reprun with warnings following the results table, regardless of whether any instability is obviously introduced according to the Stata RNG states. This is because merge m:m and set sortseed, while they often appear to work reproducibly, generally have the function of creating false stability that masks underlying issues in the code. In the case of merge m:m, the data that is produced is always sort-dependent in both datasets, and almost always meaningless as a result. In the case of set sortseed, the command often works to hide an instability in the underlying code that is sort-dependent. Users should instead remove all instances of these commands, and fix whatever issues in the process are causing their results to depend on the (indeterminate) sort order of the data.
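What the fix looks like depends on the project, but one common pattern, sketched here under assumptions, is to confirm that the merge key is unique in one dataset and use merge m:1 instead. The group-level dataset and the variable mean_price are hypothetical and built from the auto dataset only for illustration.

```stata
* Hypothetical example: merge group-level statistics onto the auto dataset
sysuse auto, clear

preserve
    * Build a group-level dataset with one row per value of foreign
    collapse (mean) mean_price = price, by(foreign)
    isid foreign                 // the merge key must be unique in this dataset
    tempfile groupstats
    save `groupstats'
restore

* m:1 states the intended relationship; assert(match) stops if any row fails to match
merge m:1 foreign using `groupstats', assert(match) nogenerate

* Finish with a unique sort so later commands do not depend on merge order
isid make, sort
```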

Examples

Example 1

This is the most basic usage of reprun. Specified in any of the following ways, either in the Stata command window or as part of a new do-file, reprun will execute the complete do-file “myfile.do” once (Run 1), and record the “seed RNG state”, “sort order RNG”, and “data checksum” after the execution of every line, as well as the exact data in certain cases. reprun will then execute “myfile.do” a second time (Run 2), and find all changes and mismatches in these states throughout Run 2. A table of mismatches will be reported in the Results window, as well as in a SMCL file in a new directory called /reprun/ in the same location as “myfile.do”.

reprun "myfile.do"

or

reprun "path/to/folder/myfile.do"

or

local myfolder "/path/to/folder"
reprun "`myfolder'/myfile.do"

Example 2

This example is similar to example 1, but the /reprun/ directory containing the SMCL file will be stored in the location specified by the using argument.

reprun "myfile.do" using "path/to/report"

or

reprun "path/to/folder/myfile.do" using "path/to/report"

or

local myfolder "/path/to/folder"
reprun "`myfolder'/myfile.do" using "`myfolder'/report"

Example 3

Assume “myfile1.do” contains the following code:

sysuse census, clear
isid state, sort
gen group = runiform() < .5

Running a reproducibility check on this do-file using reprun will generate a table listing mismatches in Stata state between Run 1 and Run 2.

reprun "myfile1.do"

In “myfile1.do”, Line 3 (gen group = runiform() < .5) generates a new variable group based on a random uniform distribution. The RNG state will differ between Run 1 and Run 2 unless the random seed is explicitly set before this command. As a result, a mismatch in the “seed RNG state” as well as “data checksum” will be flagged.

The issue can be resolved by setting a seed before the command:

sysuse census, clear
isid state, sort
set seed 346290
gen group = runiform() < .5

Running the reproducibility check on the modified do-file using reprun will confirm that there are no mismatches in Stata state between Run 1 and Run 2.

Example 4

Using the verbose option generates more detailed tables that report every line where Run 1 and Run 2 mismatch or change for any value. In addition to the output in Example 3, it will also report line 2 for changes in “sort order RNG” and “data checksum”.

reprun "myfile1.do", verbose

Example 5

Assume “myfile2.do” contains the following code:

sysuse auto, clear
sort mpg 
gen sequence = _n

Running a reproducibility check on this do-file using reprun will generate a table listing mismatches in Stata state between Run 1 and Run 2.

reprun "myfile2.do"

In “myfile2.do”, Line 2 sorts the data by the non-unique variable mpg, causing the sort order to vary between runs. This results in a mismatch in the “sort order RNG”. Consequently, Line 2 and Line 3 (gen sequence = _n) will be flagged for “data checksum” mismatches due to the differences in sort order, leading to discrepancies in the generated sequence variable.

The issue can be resolved by sorting the data on a unique combination of variables:

sysuse auto, clear
sort mpg make
gen sequence = _n

Example 6

Using the compact option generates less detailed tables where only lines with mismatched seed or sort order RNG changes during Run 1 or Run 2, and mismatches between the runs, are flagged and reported. The output will be similar to Example 5, except that line 3 will no longer be flagged for “data checksum”.

reprun "myfile2.do", compact

Example 7

reprun will perform a reproducibility check on a do-file, including all do-files it calls recursively. For example, the main do-file might contain the following code that calls on “myfile1.do” (Example 3) and “myfile2.do” (Example 5):

local myfolder "/path/to/folder"
do "`myfolder'/myfile1.do"
do "`myfolder'/myfile2.do"

The reproducibility check is then run on the main do-file:

reprun "main.do"

reprun on “main.do” performs reproducibility checks across “main.do”, as well as “myfile1.do”, and “myfile2.do”. The output will include tables for each do-file, illustrating the following process:

  • main.do: The initial check reveals no mismatches in “main.do”, indicating no discrepancies introduced directly by it.

  • Sub-file 1 (“myfile1.do”): reprun steps into “myfile1.do”, where Line 3 is flagged for mismatches, as shown in Example 3. This table will show the issues specific to “myfile1.do”.

  • Return to “main.do”: After checking “myfile1.do”, reprun returns to “main.do”. Here, Line 2 is flagged because it calls “myfile1.do”, reflecting the issues from the sub-file.

  • Sub-file 2 (“myfile2.do”): reprun then steps into “myfile2.do”, where Line 2 is flagged for mismatches, as detailed in Example 5.

  • Return to “main.do” (final check): After checking “myfile2.do”, reprun returns to “main.do”. Line 3 in “main.do” is flagged due to the issues in “myfile2.do” propagating up.

In summary, reprun provides a comprehensive view by stepping through each do-file, showing where mismatches occur and how issues in sub-files impact the main do-file.

Feedback, bug reports and contributions

Read more about these commands on this repo where this package is developed. Please provide any feedback by opening an issue. PRs with suggestions for improvements are also greatly appreciated.

Authors

LSMS Team, The World Bank lsms@worldbank.org
DIME Analytics, The World Bank dimeanalytics@worldbank.org