A The DIME Analytics Coding Guide

Most academic programs that prepare students for the type of work discussed in this book spend a disproportionately small amount of time teaching coding skills, relative to the share of professional time their graduates will spend writing code in the first few years of their careers. Recent master’s-level graduates joining our team have shown very good theoretical knowledge while requiring a lot of training in practical skills. To us, this is like an architecture graduate having learned how to sketch, describe, and discuss the concepts and requirements of a new building very well – but without the technical ability to contribute to a blueprint following professional standards that can be used and understood by others. The reasons for this are a topic for another book, but in today’s data-driven world, people working in quantitative development research must be proficient collaborative programmers, and that includes more than being able to compute correct numbers.

This appendix begins with a section on some general and language-agnostic principles on how to write “good” code. Good code is code that both generates correct results and is easily read and adapted by other professionals. The second section in this appendix contains instructions on how to access and use the code examples in this book. The last section contains the DIME Analytics Stata Style Guide. Widely accepted and used style guides are common in most programming languages, but we have never found a sufficiently encompassing style guide for Stata. We created this style guide having in mind practices that, in our experience, greatly improve the quality of research projects coded in Stata. We hope that this guide can help increase the emphasis given to using, improving, sharing, and standardizing code style among the Stata community. We believe these resources can help anyone write more understandable code and, like an architect, create a blueprint that can be understood and used by everyone in their trade.

Writing good code

“Good” code has two elements: (1) it is correct, in that it doesn’t produce any errors and its outputs are the objects intended, and (2) it is useful and comprehensible to someone who hasn’t seen it before (or even someone who has, weeks, months, or years later). Many researchers have only been trained to code correctly. But we believe that when your code runs on your computer and you get the desired results, you are only half-done writing good code. Good code is easy to read and replicate, making it easier to spot mistakes. Good code reduces sampling, randomization, and cleaning errors. Good code can easily be reviewed by others before it’s published and can be re-used afterwards. We always tell people to “code as if a stranger is reading it”.

You should think of good code in terms of three major elements: structure, syntax, and style. The structure is the environment and file organization your code lives in: good structure means that it is easy to find individual pieces of code, within and across files, that correspond to specific tasks and outputs. It also means that functional code blocks are sufficiently independent from each other such that they can be shuffled around, repurposed, and even deleted without affecting the execution of other portions. The syntax is the literal language of your code. Good syntax means that your code is readable in terms of how its mechanics implement ideas – it should not require arcane reverse-engineering to figure out what a code chunk is trying to do. It should use common commands in a generally accepted way so others can easily follow and reconstruct your intentions. Finally, style is the way that the non-functional elements of your code convey its purpose. Elements like spacing, indentation, and naming conventions (or lack thereof) can make your code much more (or much less) accessible to someone who is reading it for the first time and needs to understand it quickly and accurately.

One key tool for writing good code is using help documentation. Whether or not you are new to the programming language you are using – Stata, R, Python, or one of the many others – you should constantly revisit help files for the most common commands. Often you will learn they can do something you never realized. We cannot emphasize enough how important it is that you get into the habit of reading help files, especially for commands you are very familiar with. Most of us have a help file window open at all times. Similarly, you will always run into commands or uses of commands that you have not seen before or whose functionality you do not remember. Every time that happens, you should look up the help file for that command. We often encounter the conception that help files are only for beginners. We could not disagree with that conception more. The only way to get better at the programming language you use is to constantly read help files.

Using the code examples in this book

Providing some standardization for Stata code style is also a goal of our team. Stata is one of several programming languages used at DIME, but since few high-quality resources based on Stata exist relative to its importance and frequency of use, we decided to use Stata for the examples in this book. This book includes several code blocks where we demonstrate good implementations of some of the most common tasks in data work. We have ensured that each code block runs independently, is well-formatted, and uses built-in commands as much as possible. We hope that these snippets will provide a foundation for your code style. We try to comment code examples generously (as you should), and you should reference Stata help files by writing help [command] whenever you do not understand the command that is being used. For example, if you are not familiar with isid, then write help isid, and the help file for the command isid will open. You should do this even if you know isid but have not read the help file for that command in a while, as there is always something new to learn.

In the book, code examples are presented like the following:

* Load the auto dataset
    sysuse auto.dta , clear

* Run a simple regression
    reg price mpg rep78 headroom , coefl

* Transpose and store the output
    matrix results = r(table)'

* Load the results into memory
    clear
    svmat results , n(col)

You can access the raw code used in examples in this book in several ways. We use GitHub to version control all the content of this book, including the code examples. To see the code examples on GitHub, go to https://github.com/worldbank/dime-data-handbook/tree/master/code. We only use Stata’s built-in datasets in our code examples, so you do not need to download any data. If you have Stata installed on your computer, then you will already have the data files used in the code.

A less technical way to access the code is to click the individual file in the URL above, then click the button that says Raw. You will then get to a page that looks like the one at https://raw.githubusercontent.com/worldbank/dime-data-handbook/master/code/code.do. There, you can copy the code from your browser window to your do-file editor with the formatting intact. This method is only practical for a single file at a time. If you want to download all code used in this book, you can do that at https://github.com/worldbank/dime-data-handbook/archive/master.zip. That link downloads a .zip file with all the content used in writing this book, including the LaTeX code used for the book itself. After extracting the .zip file, you will find all the code in a folder called /code/.

While we use built-in commands as much as possible in this book, we point to user-written commands when they provide important new functionality. In particular, we point to two suites of Stata commands developed by DIME Analytics, ietoolkit and iefieldkit, which DIME Analytics wrote to standardize core data collection, management, and analysis workflows.

When you encounter code that employs user-written commands, you will not be able to run them or read their help files until you have installed the commands. The most common place to distribute user-written commands for Stata is the Boston College Statistical Software Components (SSC) archive. The user-written commands in our code examples are all from the SSC archive. So, if your installation of Stata does not recognize a command in our code, for example randtreat, then type ssc install randtreat in Stata.

Some commands on SSC are distributed in packages. This is the case, for example, of ieboilstart. That means that you will not be able to install it using ssc install ieboilstart. If you do, Stata will suggest that you instead use findit ieboilstart, which will search SSC (among other places) and see if there is a package that contains a command called ieboilstart. Stata will find ieboilstart in the package ietoolkit, so to use this command you will type ssc install ietoolkit in Stata instead.

We understand that it can be confusing to work with packages for the first time, but this is the best way to set up your Stata installation to benefit from other people’s work that has been made publicly available. Once you get used to installing commands like this, it will not be confusing at all. Furthermore, code that relies on user-written commands should install them at the beginning of the master do-file, so that the user does not have to search for packages manually.
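As a minimal sketch of that last point, the block below could sit near the top of a master do-file. The package list is only an illustration, and the check assumes that each package contains a command with the same name as the package, which to our knowledge holds for ietoolkit and iefieldkit. Running it requires an internet connection whenever a package is missing.

```stata
* Install user-written packages from SSC if they are not yet installed
* (illustrative list: include the packages your own project uses)
foreach package in ietoolkit iefieldkit {

    * which returns error code 111 when the command is not found
    cap which `package'
    if (_rc == 111) {
        ssc install `package'
    }
}
```

Checking before installing means the code does not contact SSC on every run once the packages are in place.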

The DIME Analytics Stata Style Guide

Programming languages used in computer science always have style guides associated with them. Sometimes they are official guides that are universally agreed upon, such as PEP 8 for Python. More commonly, there are well-recognized but non-official style guides like the JavaScript Standard Style for JavaScript or Hadley Wickham’s style guide for R. Google, for example, maintains its own style guides for all languages that are used in its projects.

Aesthetics is an important part of style guides, but not the main point. Neither is telling you which commands to use: there are plenty of guides to Stata’s extensive functionality. The important function is to allow programmers who are likely to work together to share conventions and understandings of what the code is doing. Style guides therefore help improve the quality of the code in that language that is produced by all programmers in a community. It is through a shared style that newer programmers can learn from more experienced programmers how certain coding practices are more or less error-prone. Broadly-accepted style conventions make it easier to borrow solutions from each other and from examples online without causing bugs that might only be found too late. Similarly, globally standardized style guides make it easier to solve each other’s problems and to collaborate or move from project to project, and from team to team.

There is room for personal preference in style guides, but style guides are first and foremost about quality and standardization – especially when collaborating on code. We believe that a commonly used Stata style guide would improve the quality of all code written in Stata, which is why we have begun the one included here. You do not necessarily need to follow our style guide precisely. We encourage you to write your own style guide if you disagree with us. The best style guide would be the one adopted the most widely. All style rules introduced in this section reflect the way we suggest you code, but the most important thing is that you style your code consistently. This guide allows our team to have a consistent code style.

Commenting code

Comments do not change the output of code, but without them, your code will not be accessible to your colleagues. It will also take you a much longer time to edit code you wrote in the past if you did not comment it well. So, comment a lot: do not only write what your code is doing but also why you wrote it the way you did. In general, try to write simpler code that needs less explanation, even if you could use an elegant and complex method in less space, unless the advanced method is a widely accepted one.

There are three types of comments in Stata and they have different purposes:

Commenting multiple lines

/*
    This is a do-file with examples of comments in Stata. This
    type of comment is used to document all of the do-file or a large
    section of it
*/

Commenting a single line

* Standardize settings, explicitly set version, and clear memory
* (This comment is used to document a task covering at maximum a few lines of code)
    ieboilstart, version(13.1)
    `r(version)'

Inline comments

* Open the dataset
    sysuse auto.dta // Built in dataset (This comment is used to document a single line)

Abbreviating commands

Stata commands can often be abbreviated in the code. You can tell if a command can be abbreviated if the help file indicates an abbreviation by underlining part of the name in the syntax section at the top. Only built-in commands can be abbreviated; user-written commands cannot. (Many commands additionally allow abbreviations of options: these are always acceptable at the shortest allowed abbreviation.) Although Stata allows some commands to be abbreviated to one or two characters, this can be confusing – two-letter abbreviations can rarely be “pronounced” in an obvious way that connects them to the functionality of the full command. Therefore, command abbreviations in code should not be shorter than three characters, with the exception of tw for twoway and di for display, and abbreviations should only be used when a widely accepted abbreviation exists. We do not abbreviate local, global, save, merge, append, or sort. The following is a list of accepted abbreviations of common Stata commands:

Abbreviation    Command
tw              twoway
di              display
gen             generate
mat             matrix
reg             regress
lab             label
sum             summarize
tab             tabulate
bys             bysort
qui             quietly
noi             noisily
cap             capture
forv            forvalues
prog            program
hist            histogram

Abbreviating variable names

Never abbreviate variable names. Instead, write them out completely. The behavior of your code may change if a variable is later introduced whose name exactly matches the abbreviation. ieboilstart executes the command set varabbrev off by default, and will therefore break any code using variable abbreviations.

Using wildcards and lists in Stata for variable lists (*, ?, and -) is also discouraged, because the functionality of the code may change if the dataset is changed or even simply reordered. If you intend explicitly to capture all variables of a certain type, prefer unab or lookfor to build that list in a local macro, which can then be checked to have the right variables in the right order.
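The following sketch illustrates this with unab on Stata’s built-in auto dataset. The local macro name tVars is our own choice for the illustration. The wildcard is expanded exactly once, and the resulting list is checked before it is used.

```stata
* Load the built-in auto dataset
sysuse auto.dta , clear

* Expand the wildcard once and store the result in a local macro
unab tVars : t*

* Check that the list holds exactly the expected variables, in order
assert "`tVars'" == "trunk turn"

* Use the checked list instead of the wildcard from here on
sum `tVars'
```

If the dataset later gains another variable starting with t, or the variables are reordered, the assert stops the code instead of letting it silently operate on a different list.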

Writing loops

In Stata examples and other programming languages, it is common for the name of the local generated by foreach or forvalues to be something as simple as i or j. In Stata, however, loops generally index a real object, and looping commands should name that index descriptively. One-letter indices are acceptable only for general examples; for looping through iterations with i; and for looping across matrices with i, j. In all other cases, index names should describe what the code is looping over – for example household members, crops, or medicines. Even counters should be explicitly named. This makes code much more readable, particularly in nested loops.

GOOD:

* Loop over crops
foreach crop in potato cassava maize {
    * do something to `crop'
}

GOOD:

* Loop over crops
local crops potato cassava maize
foreach crop of local crops {
    * Loop over plot number
    forvalues plot_num = 1/10 {
        * do something to `crop' in `plot_num'
    } // End plot loop
} // End crop loop

BAD:

* Loop over crops
foreach i in potato cassava maize {
    * do something to `i'
}

Using whitespace

In Stata, adding one or many spaces does not make a difference to code execution, and this can be used to make the code much more readable. We are all very well trained in using whitespace in software like PowerPoint and Excel: we would never present a PowerPoint presentation where the text does not align or submit an Excel table with unstructured rows and columns. The same principles apply to coding. In the example below the exact same code is written twice, but in the better example whitespace is used to signal to the reader that the central object of this segment of code is the variable employed. Organizing the code like this makes it much quicker to read, and small typos stand out more, making them easier to spot.

ACCEPTABLE:

* Create dummy for being employed
gen employed = 1
replace employed = 0 if (_merge == 2)
lab var employed "Person exists in employment data"
lab def yesno 1 "Yes" 0 "No"
lab val employed yesno

BETTER:

* Create dummy for being employed
gen      employed = 1
replace  employed = 0 if (_merge == 2)
lab var  employed "Person exists in employment data"
lab def           yesno 1 "Yes" 0 "No"
lab val  employed yesno

Indentation is another type of whitespace that makes code more readable. Any segment of code that is repeated in a loop or conditional on an if-statement should be indented 4 spaces relative to both the loop or conditional statement and the closing curly brace. Similarly, continuing lines of code should be indented under the initial command. If a segment is in a loop inside a loop, then it should be indented another 4 spaces, making it 8 spaces more indented than the main code. In some code editors this indentation can be achieved by using the tab button. However, the type of tab used in the Stata do-file editor does not always display the same across platforms, such as when publishing the code on GitHub. Therefore we recommend that indentation be 4 manual spaces instead of a tab.

GOOD:

* Loop over crops
foreach crop in potato cassava maize {

    * Loop over plot number
    forvalues plot_num = 1/10 {
        gen crop_`crop'_`plot_num' = "`crop'"
    }
}
local sampleSize = `c(N)'
if (`sampleSize' <= 100) {
    gen use_sample = 0
}
else {
    gen use_sample = 1
}

BAD:

* Loop over crops
foreach crop in potato cassava maize {
* Loop over plot number
forvalues plot_num = 1/10 {
gen crop_`crop'_`plot_num' = "`crop'"
}
}
local sampleSize = `c(N)'
if (`sampleSize' <= 100) {
gen use_sample = 0
}
else {
gen use_sample = 1
}

Writing conditional expressions

All conditional (true/false) expressions should be within at least one set of parentheses. The negation of logical expressions should use bang (!) and not tilde (~). Always use explicit truth checks (if `value' == 1) rather than implicits (if `value'). Always use the missing(`var') function instead of arguments like (if `var' >= .). Always consider whether missing values will affect the evaluation of conditional expressions.

GOOD:

replace gender_string = "Woman" if (gender == 1)
replace gender_string = "Man"   if ((gender != 1) & !missing(gender))

BAD:

replace gender_string = "Woman" if gender == 1
replace gender_string = "Man"   if (gender ~= 1)
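To see why missing values matter here, consider the sketch below, which uses the built-in auto dataset, where rep78 is missing for some cars. Stata treats missing values as larger than any number, so a condition like (rep78 > 3) is also true for missing observations. The local macro names are our own.

```stata
sysuse auto.dta , clear

* Missing values are larger than any number, so this count includes them
count if (rep78 > 3)
local countWithMissing = r(N)

* Excluding missing values explicitly gives the intended count
count if ((rep78 > 3) & !missing(rep78))
local countNoMissing = r(N)

* The difference is exactly the number of missing values in rep78
count if missing(rep78)
assert (`countWithMissing' - `countNoMissing') == r(N)
```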

Use if-else statements when applicable even if you can express the same thing with two separate if statements. When using if-else statements you are communicating to anyone reading your code that the two cases are mutually exclusive, which makes your code more readable. It is also less error-prone and easier to update if you want to change the condition.

GOOD:

if (`sampleSize' <= 100) {
    * do something
}
else {
    * do something else
}

BAD:

if (`sampleSize' <= 100) {
    * do something
}
if (`sampleSize' > 100) {
    * do something else
}

Writing file paths

All file paths should always be enclosed in double quotes, and should always use forward slashes for folder hierarchies (/). File names should be written in lower case with dashes (my-file.dta). Mac and Linux computers cannot read file paths with backslashes, and backslashes cannot easily be removed with find-and-replace because the character has other functional uses in code. File paths should always include the file extension (.dta, .do, .csv, etc.). Omitting the extension causes ambiguity if another file with the same name is created (even if there is a default file type).

File paths should also be absolute and dynamic. Absolute means that all file paths start at the root folder of the computer, often C:/ on a PC or /Users/ on a Mac. This ensures that you always get the correct file in the correct folder. Do not use cd unless there is a function that requires it. When using cd, it is easy to overwrite a file in another project folder. Many Stata functions use cd and therefore the current directory may change without warning. Relative file paths are common in many other programming languages, but unless they are relative to the location of the file running the code, this is a risky practice. In Stata, relative file paths are relative to the current directory, not to the code file being run.

Dynamic file paths use global macros for the location of the root folder. These globals should be set in a central master do-file. This makes it possible to write file paths that work very similarly to relative paths. It also achieves what setting cd is often intended to achieve: executing the code on a new system only requires updating file path globals in one location. If global names are unique, there is no risk that files are saved in the incorrect project folder. You can create multiple folder globals as needed and this is encouraged.

GOOD:

global myDocs    = "C:/Users/username/Documents"
global myProject = "${myDocs}/MyProject"
use "${myProject}/my-dataset.dta" , clear
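When several people run the same master do-file, one possible way to keep the root folder in a single place is to set it conditionally on the user running the code. This is only a sketch: the user names, folders, and the myData global are hypothetical and would need to be replaced with your own.

```stata
* Set the root folder once per user (user names and folders are hypothetical)
if ("`c(username)'" == "researcher1") {
    global myDocs = "C:/Users/researcher1/Documents"
}
else if ("`c(username)'" == "researcher2") {
    global myDocs = "/Users/researcher2/Documents"
}

* Build all other folder globals from the root
global myProject = "${myDocs}/MyProject"
global myData    = "${myProject}/data"

* Every file path in the project then starts from a global
di "${myData}/my-dataset.dta"
```

A new team member only needs to add one block with their own user name and root folder; no other file path in the project changes.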

BAD:

Relative paths

cd "C:/Users/username/Documents/MyProject"
use MyDataset.dta

Static paths

use "C:/Users/username/Documents/MyProject/MyDataset.dta"

Line breaks

Long lines of code are difficult to read if you have to scroll left and right to see the full line of code. When your line of code is wider than the text on a regular sheet of paper, you should introduce a line break. A common line breaking length is around 80 characters. Stata’s do-file editor and other code editors provide a visible guide line. Around that length, start a new line using ///. Using /// breaks the line in the code editor, while telling Stata that the same line of code continues on the next line. Recent versions of the Stata do-file editor – and many other code editors – automatically wrap code lines that are too long, but you should still use /// to make sure that the line breaks are placed where they are most readable. The /// breaks do not need to be horizontally aligned, because the indentation of the continuing lines already shows that the command continues; you may still prefer to align them when they are followed by comments that read better aligned. Break lines where it makes functional sense. You can write comments after /// just as with //, and that is usually a good thing.

The #delimit command should only be used for advanced function programming and is officially discouraged in analytical code. Never, for any reason, use /* */ to wrap a line: it is distracting and difficult to follow compared to the use of those characters to write regular comments. Line breaks and indentations may be used to highlight the placement of the option comma or other functional syntax in Stata commands.

GOOD:

graph hbar invil      /// Proportion in village
     if (priv == 1)   /// Private facilities only
   , over(statename, sort(1) descending)      /// Order states by values
     blabel(bar, format(%9.0f))               /// Label the bars
     ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") ///
     ytit("Share of private primary care visits made in own village")

BAD:

#delimit ;
graph hbar
    invil if (priv == 1)
  , over(statename, sort(1) descending) blabel(bar, format(%9.0f))
    ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%")
    ytit("Share of private primary care visits made in own village");
#delimit cr

UGLY:

graph hbar /*
*/    invil if (priv == 1)

Using boilerplate code

Boilerplate code consists of a few lines of code that always come at the top of the code file, and its purpose is to harmonize settings across users running the same code to the greatest degree possible. There is no way in Stata to guarantee that any two installations will always run code in exactly the same way. In the vast majority of cases they do, but not always, and boilerplate code can mitigate that risk. We have developed the ieboilstart command to implement many commonly-used boilerplate settings that are optimized given your installation of Stata. It requires two lines of code to execute the version setting, which avoids differences in results due to different versions of Stata. Among other things, it turns the more flag off so code never hangs while waiting to display more output; it turns varabbrev off so abbreviated variable names are rejected; and it maximizes the allowed memory usage and matrix size so that code is not rejected on other machines for violating system limits. (For example, Stata/SE and Stata/IC allow for different maximum numbers of variables, and the same is true of Stata 14 and Stata 15, so it may not be possible to run code written in one of these versions using another.) Finally, it clears all stored information in Stata memory, such as non-installed programs and globals, getting as close as possible to opening Stata fresh.

GOOD:

ieboilstart, version(13.1)
`r(version)'

OK:

set more off , perm
clear all
set maxvar 10000
version 13.1

Miscellaneous notes

Write multiple graphs as tw (xx)(xx)(xx), not tw xx||xx||xx.

In simple expressions, put spaces around each binary operator except ^. Therefore write gen z = x + y and x^2.

When order of operations applies, you may adjust spacing and parentheses: write hours + (minutes/60) + (seconds/3600), not hours + minutes / 60 + seconds / 3600. For long expressions, + and - operators should start the new line, but * and / should be used inline. For example:

gen newvar = x ///
             - (y/2) ///
             + a * (b - c)

Make sure your code doesn’t print very much to the results window as this is slow. This can be accomplished by using run file.do rather than do file.do. Interactive commands like sum or tab should be used sparingly in do-files, unless they are for the purpose of getting r()-statistics. In that case, consider using the qui prefix to prevent printing output. It is also faster to get outputs from commands like reg using the qui prefix.
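As a small illustration of the last note, the sketch below runs sum and reg quietly on the built-in auto dataset and keeps only the stored results. The local macro names are our own; the stored results r(mean) and _b[] are standard Stata features.

```stata
sysuse auto.dta , clear

* Run sum quietly: nothing is printed, but r() results are still stored
qui sum price
local meanPrice = r(mean)

* Print only the number that is needed
di "Mean price: `meanPrice'"

* Estimation commands also store their results when run quietly
qui reg price mpg
local coefMpg = _b[mpg]
```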