ado-management-with-repado
Using repado
for Ado-File Management
This is a comprehensive guide on how to use repado
. For a shorter and more technical syntax documentation, refer to the helpfile.
Background
To create a fully reproducible reproducibility package for a research project, you need the following four elements:
- The exact same data
- The exact same project code
- The exact same project code dependencies
- The exact same computation environment
Sharing the same project data and code is straightforward and is typically done by publishing the data and code to an archive. Using the same computation environment is more complicated and often involves using software like Docker. However, this article will focus on how to share the same project code dependencies and will primarily cover Stata.
Reproducible Code Dependencies
Writing code is about creating precise instructions that a computer can follow. This precision is crucial for ensuring reproducibility in research. The code used in a project includes both the code written by the team and the code within all the commands that the project code relies on. Users running different versions of these commands may obtain different results, even if they use the same project code and data. Therefore, it’s essential to establish a method for version controlling all the code (including the code for any commands used) that the project code depends on, which is commonly referred to as code dependencies.
All programming languages have two types of code dependencies: built-in commands and community-contributed commands. It’s equally important to version control both of these.
Version Control of Built-In Commands
The most effective way to version control built-in commands is to run the exact same version of the programming software. This is typically more straightforward in open-source software, as older software versions can be easily installed. In Stata, while running the exact same version is ideal, built-in commands can be version controlled using the version
command. The version
command allows Stata users to utilize built-in commands from any version equal to or older than the version they have installed.
This command can be added to the top of each script, as it doesn’t consume much time to execute. However, if the project uses a main script for one-click reproducibility, it’s sufficient to place this line at the top of the main script. See example below. Be sure to replace 12.1
with the version the project is targeting. Unless a specific new feature is required, it’s advisable not to choose the most recent version to ensure reproducibility for as many users as possible.
version 12.1
Version Control of Community-Contributed Commands
In most other programming languages, there are version-controlled archives for community-contributed commands (or libraries). For R, there is CRAN, and for Python, there are archives like pip
and conda
. Tools for these archives make older versions of commands available, and in these languages, these tools can be used to specify the required commands and their exact versions when reproducing a project. When a user reproduces such a project, these tools install the exact version of these dependencies.
In Stata, the primary archive for community-contributed commands is SSC (Statistical Software Components). SSC is not version controlled, and only the most recent version of a command is made available. For this reason, there is no tool in Stata that functions similarly to those in R and Python.
However, the repado
command in the repkit
package provides a Stata solution to the challenge of version controlling community-contributed commands.
Using repado
The Best Practice repado
Implements
In Stata, the best practice for version controlling community-contributed commands is to first include the ado-files of the commands used in the reproducibility package. Then, your project code should code that ensures all users utilize these files exclusively. To achieve this, the project must incorporate some technical code to set the “ado-paths”. The motivation behind repado
is to encapsulate this technical code within an accessible command, making this feature available to a wider audience of Stata users.
You can view your default ado-paths by running the adopath
command within your Stata installation. Simply enter adopath
and press Enter, and you will see output in the following format. The exact paths will differ:
[1] (BASE) "C:\Program Files\Stata17\ado\base/"
[2] (SITE) "C:\Program Files\Stata17\ado\site/"
[3] "."
[4] (PERSONAL) "C:\Users\user123\ado\personal/"
[5] (PLUS) "C:\Users\user123\ado\plus/"
[6] (OLDPLACE) "c:\ado/"
These locations are where your Stata installation looks for commands, from top to bottom. If Stata finds the command a user has used in the first folder, it will use that command. If it doesn’t find a command with that name in the first folder, it will continue to search in subsequent folders until it finds a matching command or exhausts the list and throws an error, such as command [...] is unrecognized
.
Built-in commands are located in the BASE
path, which should not be changed. Commands installed via ssc install
and net install
are placed in the PLUS
folder. However, as seen in the example, the default PLUS
location is specific to each user. In practice, different users may have different versions of the commands required for a project installed in their PLUS
folder.
The objective of repado
is to ensure that all users employ commands installed in a shared location, and this shared location is the exclusive source for the project’s command dependencies.
Setting Up the Project’s Ado-Folder
Using repado
is best done at the beginning of the project, but it can be implemented at any point of a project. The first requirement is to create a folder referred to as the ado-folder. A suitable name for this folder is ado
, but it can be given any name.
The ado-folder should be created in a location distinct from the project code. This separation is crucial to differentiate code owned by the project team from code authored by others. Commands from SSC can be republished as per the GPL v3
license. However, if commands are installed from other sources, different licenses might apply.
A Simple Example
Consider a project with the following folder structure. This example project will be used throughout this guide.
myproject/
├─ code/
│ ├─ ado/
│ ├─ projectcode/
├─ data/
├─ outputs/
In this case, the basic use of repado
would be as follows:
* Set user root folderglobal root "C:\Users\user123\github\myproject"
* Point repado to the ado-folderadopath("${root}/code/ado") mode(strict) repado,
Upon re-running the adopath
command, the output will now be:
[1] (BASE) "C:\Program Files\Stata17\ado\base/"
[2] (PLUS) "C:\Users\user123\github\myproject/code/ado"
It’s important to note three key aspects after running repado
.
First, the BASE
folder remains intact. Removing it would render Stata’s built-in commands unavailable.
Second, the PLUS
folder now points to the ado-folder shared within the project. The project folder can be shared via various means, such as Git repositories, syncing services (OneDrive, Dropbox), network drives, or even in a .zip archive (common for reproducibility packages but less so for collaborative sharing).
Third, all other paths are removed, meaning commands installed by the user in other locations are no longer available. Although this might initially seem inconvenient, it serves as a useful tool to ensure that all commands needed for the project are contained within the project’s ado-folder. Later, we will explore how this can be relaxed using the mode()
option. However, for teams aiming to guarantee project reproducibility, it’s advisable to use mode(strict)
as the default.
Remember that all settings discussed here are reset when Stata is restarted.
Installing Commands in the Ado-Folder
The project now has a project ado-folder and a mechanism to ensure that all users employ commands installed in that folder. However, no commands have been installed in the folder yet. An easy way to do so is facilitated by how repado
changed the ado-paths.
As mentioned earlier, ssc install
and net install
place packages in the PLUS
folder. Since the PLUS
folder now points to the project ado-folder, any actions like ssc install
, ssc update
, and adoupdate
will affect only the project ado-folder, provided Stata is not restarted.
In the repado
workflow, you should avoid including ssc install
or any other commands that install or update community-contributed commands within your do-file. Instead, this should be done interactively through the “Command” window in Stata’s main interface. Only one user needs to install each required command into the project ado-folder. Subsequently, any other team member or future project reproducer will use the version of the commands in the project ado-folder without needing to install anything in their Stata installations.
Using the strict mode (mode(strict)
) ensures that the project code exclusively source commands from the project ado-folder, guaranteeing uniformity in command versions for all current and future users.
When Not to Use Strict Mode
In short, strict mode (mode(strict)
) should be used almost always and especially for reproducibility packages. However, the no-strict mode (mode(nostrict)
) exists to enable a user to utilize a command already installed on their computer before deciding to install it in the project ado-folder. In no-strict mode, the project ado-folder path is set to PERSONAL
instead of PLUS
. PERSONAL
is given the second highest priority after BASE
in this mode, meaning PERSONAL
is given higher priority compared to PLUS
. Thus, a command installed in the project ado-folder will be used over a command with the same name (typically the same command but a different version) in the user’s default PLUS
folder.
It’s important to note that the no-strict mode workflow is not guaranteed to be reproducible and is only intended to be used temporarily.
Drawbacks of repado
Once repado
is set up and utilized, there are no drawbacks from a reproducibility perspective. It aligns with all the gold standards for future-proofing the reproducibility of a reproducibility package. However, there are two aspects that users of repado
should consider.
Licenses and Republishing of Commands
The first aspect pertains to licenses and whether republishing of commands is allowed. This is particularly relevant to reproducibility packages published publicly. Commands from SSC are published under the GPL v3
, where republishing is permitted. However, it’s important to review the GPL v3
for specific requirements regarding how to republish the commands. If other sources are used, the project team will need to ascertain whether they are allowed to publish a reproducibility package containing these commands.
repado
Cannot Install Itself
The other aspect to consider is that, even after repado
is set up in the project code, each user needs to install repkit
on their own computers before the repado
functionality will work on their computers. This is the case even if repkit
is installed in the project ado-folder. Below, we suggest our recommended mitigation for this aspect, as well as an alternative approach that achieves the same results but involves more lines of code.
We will never add any commands to the repkit
package that are used for analysis or generate outcomes in a project. This is why it is acceptable to depend on users installing repkit
before reproducing the results of a project.
Initially, this functionality was implemented as a feature in the ieboilstart
command in the ietoolkit
package. However, it was later realized that this was not a good idea, as other commands in ietoolkit
are used to generate results. To version control all dependencies used for generating results, the functionality now in repado
cannot share a package with any command used for that purpose.
Mitigation
One mitigation to this is to include code that tests if repkit
is already installed on a user’s computer. A do-file intended for potential wide distribution should never include code that automatically installs anything on other users’ computers. That practice is inconsiderate and is generally considered bad practice. However, it is a good practice to include a polite prompt instructing the user to install repkit
in order to run the code if the package is not already installed. See the example below:
* Make sure that repkit is installednot, prompt user to install it from ssc
* If
cap which repkit if _rc == 111 {
di as error "{pstd}You need to have {cmd:repkit} installed to run this reproducibility package. Click {stata ssc install repkit} to do so.{p_end}"
}
Alternative
As an alternative, the code from repado
can be copied into the project’s main file. repkit
is published using the MIT license where this is perfectly allowed. The ado-folder must still be created in the same way as described above. Afterward, you can add that folder using one of the two examples below. Both examples assume the same folder structure as previously outlined.
Before using any of these examples, ensure they are utilizing the most recent version of the code used in the repado
command here.
Example 1: In this example, the same results are achieved as when using repado
in strict mode. This version maintains only the project ado-folder and the BASE
folder in the ado-paths. After running this code, you can still install commands in the project ado-folder using ssc install
and use other commands that use the PLUS
folder.
* Set user root folderglobal root "C:\Users\user123\github\myproject"
PLUS to adopath and list it first, then list BASE first.
* Set means that BASE is first and PLUS is second.
* This sysdir set PLUS "${root}/code/ado"
adopath ++ PLUS
adopath ++ BASE
rank 3 until only BASE and PLUS,
* Keep removing adopaths with rank 1 and 2, are left in the adopaths
* that has local morepaths 1
while (`morepaths' == 1) {
capture adopath - 3
if _rc local morepaths 0
}
Example 2: This example is identical to the first one, with the exception that PLUS
is not used. This eliminates the risk of any user accidentally using adoupdate
to update all commands in the project ado-folder. This aspect is especially useful when not using Git, as mistakes like this cannot be easily reverted. If Git is employed, and all content in the ado-folder is tracked in the repository, this difference becomes less significant.
* Set user root folderglobal root "C:\Users\user123\github\myproject"
adopath as an unnamed path and list it first, then list BASE first.
* Set the project means that BASE is first and PLUS is second.
* This adopath ++ "${root}/code/ado"
adopath ++ BASE
rank 3 until only BASE and the project ado-folder,
* Keep removing adopaths with rank 1 and 2, are left in the adopaths
* that has local morepaths 1
while (`morepaths' == 1) {
capture adopath - 3
if _rc local morepaths 0
}
It is equally possible to create a future-proof reproducibility package using repado
or either of the two examples provided here. We have offered repado
in the form of a command to make this process easier and to abstract away technical code. In the end, it is up to each team to decide which method to employ.