A The DIME Analytics Coding Guide
Most academic programs that prepare students for a career in the type of work discussed in this book spend a disproportionately small amount of time teaching their students coding skills in relation to the share of their professional time they will spend writing code in their first few years after they graduate. Recent masters-level graduates joining our team have shown very good theoretical knowledge while requiring a lot of training in practical skills. To us, this is like an architecture graduate having learned how to sketch, describe, and discuss the concepts and requirements of a new building very well – but without the technical ability to contribute to a blueprint following professional standards that can be used and understood by others. The reasons for this are a topic for another book, but in today’s data-driven world, people working in quantitative development research must be proficient collaborative programmers, and that includes more than being able to compute correct numbers.
This appendix begins with a section on some general and language-agnostic principles on how to write “good” code. Good code is code that both generates correct results and is easily read and adapted by other professionals. The second section in this appendix contains instructions on how to access and use the code examples in this book. The last section contains the DIME Analytics Stata Style Guide. Widely accepted and used style guides are common in most programming languages, but we have never found a sufficiently encompassing style guide for Stata. We created this style guide having in mind practices that, in our experience, greatly improve the quality of research projects coded in Stata. We hope that this guide can help increase the emphasis given to using, improving, sharing, and standardizing code style among the Stata community. We believe these resources can help anyone write more understandable code, and how you, like an architect, can create a blueprint that can be understood and used by everyone in your trade.
Writing good code
“Good” code has two elements: (1) it is correct, in that it doesn’t produce any errors and its outputs are the objects intended, and (2) it is useful and comprehensible to someone who hasn’t seen it before (or even someone who has, weeks, months, or years later). Many researchers have only been trained to code correctly. But we believe that when your code runs on your computer and you get the desired results, you are only half-done writing good code. Good code is easy to read and replicate, making it easier to spot mistakes. Good code reduces sampling, randomization, and cleaning errors. Good code can easily be reviewed by others before it’s published and can be re-used afterwards. We always tell people to “code as if a stranger is reading it”.
You should think of good code in terms of three major elements: structure, syntax, and style. The structure is the environment and file organization your code lives in: good structure means that it is easy to find individual pieces of code, within and across files, that correspond to specific tasks and outputs. It also means that functional code blocks are sufficiently independent from each other such that they can be shuffled around, repurposed, and even deleted without affecting the execution of other portions. The syntax is the literal language of your code. Good syntax means that your code is readable in terms of how its mechanics implement ideas – it should not require arcane reverse-engineering to figure out what a code chunk is trying to do. It should use common commands in a generally accepted way so others can easily follow and reconstruct your intentions. Finally, style is the way that the non-functional elements of your code convey its purpose. Elements like spacing, indentation, and naming conventions (or lack thereof) can make your code much more (or much less) accessible to someone who is reading it for the first time and needs to understand it quickly and accurately.
One key tool for writing good code is using help documentation. Whether or not you are new to the programming language you are using – Stata, R, Python, or one of the many others – you should constantly revisit help files for the most common commands. Often you will learn they can do something you never realized. We cannot emphasize enough how important it is that you get into the habit of reading help files, especially for commands you are very familiar with. Most of us have a help file window open at all times. Similarly, you will always run into commands or uses of commands that you have not seen before or whose functionality you do not remember. Every time that happens, you should look up the help file for that command. We often encounter the conception that help files are only for beginners. We could not disagree with that conception more. The only way to get better at the programming language you use is to constantly read help files.
Using the code examples in this book
Providing some standardization for Stata code style
is also a goal of our team.
Stata is one of several programming languages used at DIME,
but since few high-quality resources based on Stata exist
relative to its importance and frequency of use,
we decided to use Stata for the examples in this book.
This book includes several code blocks
where we demonstrate a good code execution
of some of the most common tasks in data work.
We have ensured that each code block runs independently,
and uses built-in commands as much as possible.
We hope that these snippets will provide a foundation for your code style.
We try to comment code examples generously (as you should),
and you should reference Stata help files by writing
whenever you do not understand the command that is being used.
For example, if you are not familiar with
and the help file for the command
isid will open.
You should do this even if you know
but has not read the help file for that command in a while,
as there is always something new to learn.
In the book, code examples are presented like the following:
You can access the raw code used in examples in this book in several ways. We use GitHub to version control all the content of this book, including the code examples. To see the code examples on GitHub, go to https://github.com/worldbank/dime-data-handbook/tree/master/code. We only use Stata’s built-in datasets in our code examples, so you do not need to download any data. If you have Stata installed on your computer, then you will already have the data files used in the code.
A less technical way to access the code
is to click the individual file in the URL above, then click
the button that says Raw.
You will then get to a page that looks like the one at
There, you can copy the code
from your browser window to your do-file editor with the formatting intact.
This method is only practical for a single file at a time.
If you want to download all code used in this book, you can do that at
That link downloads a
with all the content used in writing this book,
including the LaTeX code used for the book itself.
After extracting the .zip-file you will find all the code in a folder called
While we use built-in commands as much as possible in this book,
we point to user-written commands when they provide important new functionality.
In particular, we point to two suites of Stata commands developed by DIME Analytics,
which DIME Analytics wrote to standardize
core data collection, management, and analysis workflows.
When you encounter code that employs user-written commands,
you will not be able to run them or read their help files
until you have installed the commands.
The most common place to distribute user-written commands for Stata
is the Boston College Statistical Software Components (SSC) archive.302
The user-written commands in our code example are all from the SSC archive.
So, if your installation of Stata does not recognize a command in our code, for example
randtreat, then type
ssc install randtreat in Stata.
Some commands on SSC are distributed in packages.
This is the case, for example, of
That means that you will not be able to install it using
ssc install ieboilstart.
If you do, Stata will suggest that you instead use
which will search SSC (among other places) and see if there is a
package that contains a command called
Stata will find
ieboilstart in the package
so to use this command you will type
ssc install ietoolkit in Stata instead.
We understand that it can be confusing to work with packages for first time, but this is the best way to set up your Stata installation to benefit from other people’s work that has been made publicly available. Once you get used to installing commands like this it will not be confusing at all. All code with user-written commands, furthermore, is best written when it installs such commands at the beginning of the master do-file, so that the user does not have to search for packages manually.
The DIME Analytics Stata Style Guide
Aesthetics is an important part of style guides, but not the main point. Neither is telling you which commands to use: there are plenty of guides to Stata’s extensive functionality. The important function is to allow programmers who are likely to work together to share conventions and understandings of what the code is doing. Style guides therefore help improve the quality of the code in that language that is produced by all programmers in a community. It is through a shared style that newer programmers can learn from more experienced programmers how certain coding practices are more or less error-prone. Broadly-accepted style conventions make it easier to borrow solutions from each other and from examples online without causing bugs that might only be found too late. Similarly, globally standardized style guides make it easier to solve each others’ problems and to collaborate or move from project to project, and from team to team.
There is room for personal preference in style guides, but style guides are first and foremost about quality and standardization – especially when collaborating on code. We believe that a commonly used Stata style guide would improve the quality of all code written in Stata, which is why we have begun the one included here. You do not necessarily need to follow our style guide precisely. We encourage you to write your own style guide if you disagree with us. The best style guide would be the one adopted the most widely. All style rules introduced in this section are the way we suggest to code, but the most important thing is that the way you style your code is consistent. This guide allows our team to have a consistent code style.
Stata commands can often be abbreviated in the code.
You can tell if a command can be abbreviated if the help file indicates
an abbreviation by underlining part of the name in the syntax section at the top.
Only built-in commands can be abbreviated; user-written commands cannot.
(Many commands additionally allow abbreviations of options:
these are always acceptable at the shortest allowed abbreviation.)
Although Stata allows some commands to be abbreviated to one or two characters,
this can be confusing – two-letter abbreviations can rarely be “pronounced”
in an obvious way that connects them to the functionality of the full command.
Therefore, command abbreviations in code should not be shorter than three characters,
with the exception of
and abbreviations should only be used when widely a accepted abbreviation exists.
We do not abbreviate
The following is a list of accepted abbreviations of common Stata commands:
Abbreviating variable names
Never abbreviate variable names. Instead, write them out completely.
Your code may change if a variable is later introduced
that has a name exactly as in the abbreviation.
ieboilstart executes the command
set varabbrev off by default,
and will therefore break any code using variable abbreviations.
Using wildcards and lists in Stata for variable lists
-) is also discouraged,
because the functionality of the code may change
if the dataset is changed or even simply reordered.
If you intend explicitly to capture all variables of a certain type,
lookfor to build that list in a local macro,
which can then be checked to have the right variables in the right order.
In Stata examples and other code languages, it is common for the name of the local
to be something as simple as
j. In Stata, however,
loops generally index a real object, and looping commands should name that index descriptively.
One-letter indices are acceptable only for general examples;
for looping through iterations with
and for looping across matrices with
Instead, index names should describe what the code is looping over –
for example household members, crops, or medicines.
Even counters should be explicitly named.
This makes code much more readable, particularly in nested loops.
In Stata, adding one or many spaces does not make a difference to code execution,
and this can be used to make the code much more readable.
We are all very well trained in using whitespace in software like PowerPoint and Excel:
we would never present a PowerPoint presentation where the text does not align
or submit an Excel table with unstructured rows and columns.
The same principles apply to coding.
In the example below the exact same code is written twice,
but in the better example whitespace is used to signal to the reader
that the central object of this segment of code is the variable
Organizing the code like this makes it much quicker to read,
and small typos stand out more, making them easier to spot.
Indentation is another type of whitespace that makes code more readable.
Any segment of code that is repeated in a loop or conditional on an
if-statement should have indentation of 4 spaces relative
to both the loop or conditional statement as well as the closing curly brace.
Similarly, continuing lines of code should be indented under the initial command.
If a segment is in a loop inside a loop, then it should be indented another 4 spaces,
making it 8 spaces more indented than the main code.
In some code editors this indentation can be achieved by using the tab button.
However, the type of tab used in the Stata do-file editor does not always display the same across platforms,
such as when publishing the code on GitHub.
Therefore we recommend that indentation be 4 manual spaces instead of a tab.
Writing conditional expressions
All conditional (true/false) expressions should be within at least one set of parentheses.
The negation of logical expressions should use bang (
!) and not tilde (
Always use explicit truth checks (
if `value' == 1)
rather than implicits (
Always use the
instead of arguments like (
if `var' >= .).
Always consider whether missing values will affect the evaluation of conditional expressions.
if-else statements when applicable
even if you can express the same thing with two separate
if-else statements you are communicating to anyone reading your code
that the two cases are mutually exclusive, which makes your code more readable.
It is also less error-prone and easier to update if you want to change the condition.
Writing file paths
All file paths should always be enclosed in double quotes,
and should always use forward slashes for folder hierarchies (
File names should be written in lower case with dashes (
Mac and Linux computers cannot read file paths with backslashes,
and backslashes cannot easily be removed with find-and-replace
because the character has other functional uses in code.
File paths should always include the file extension
Omitting the extension causes ambiguity
if another file with the same name is created
(even if there is a default file type).
File paths should also be absolute and dynamic.
Absolute means that all
file paths start at the root folder of the computer,
C:/ on a PC or
/Users/ on a Mac.
This ensures that you always get the correct file in the correct folder.
Do not use
cd unless there is a function that requires it.
cd, it is easy to overwrite a file in another project folder.
Many Stata functions use
cd and therefore the current directory may change without warning.
Relative file paths are common in many other programming languages,
but unless they are relative to the location of the file running the code,
this is a risky practice.
In Stata, relative file paths are relative to the current directory,
not to the code file being run.
Dynamic file paths use global macros for the location of the root folder.
These globals should be set in a central master do-file.
This makes it possible to write file paths that work very similarly to relative paths.
It also achieves the functionality that setting
cd is often intended to:
executing the code on a new system only requires updating file path globals in one location.
If global names are unique, there is no risk that files are saved in the incorrect project folder.
You can create multiple folder globals as needed and this is encouraged.
Long lines of code are difficult to read if you have to scroll left and right to see the full line of code.
When your line of code is wider than text on a regular paper, you should introduce a line break.
A common line breaking length is around 80 characters.
Stata’s do-file editor and other code editors provide a visible guide line.
Around that length, start a new line using
/// breaks the line in the code editor,
while telling Stata that the same line of code continues on the next line.
Recent versions of the Stata do-file editor –
and many other code editors –
automatically wrap code lines that are too long,
but you should still use
to make sure that the lines breaks are placed
where they are most readable.
/// breaks do not need to be horizontally aligned in code,
although you may prefer to if they have comments that read better aligned,
since indentations should reflect that the command continues to a new line.
Break lines where it makes functional sense.
You can write comments after
/// just as with
//, and that is usually a good thing.
#delimit command should only be used for advanced function programming
and is officially discouraged in analytical code.303
Never, for any reason, use
/* */ to wrap a line:
it is distracting and difficult to follow compared to the use
of those characters to write regular comments.
Line breaks and indentations may be used to highlight the placement
of the option comma or other functional syntax in Stata commands.
graph hbar invil /// Proportion in village if (priv == 1) /// Private facilities only , over(statename, sort(1) descending) /// Order states by values blabel(bar, format(%9.0f)) /// Label the bars ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") /// ytit("Share of private primary care visits made in own village")
Using boilerplate code
Boilerplate code consists of a few lines of code that always come at the top of the code file,
and its purpose is to harmonize settings across users
running the same code to the greatest degree possible.
There is no way in Stata to guarantee that any two installations
will always run code in exactly the same way.
In the vast majority of cases it does, but not always,
and boilerplate code can mitigate that risk.
We have developed the
to implement many commonly-used boilerplate settings
that are optimized given your installation of Stata.
It requires two lines of code to execute the
setting, which avoids differences in results due to different versions of Stata.
Among other things, it turns the
more flag off
so code never hangs while waiting to display more output;
varabbrev off so abbrevated variable names are rejected;
and it maximizes the allowed memory usage and matrix size
so that code is not rejected on other machines for violating system limits.
(For example, Stata/SE and Stata/IC, allow for different maximum numbers of variables,
and the same happens with Stata 14 and Stata 15,
so it may not be able to run code written in one of these version using another.)
Finally, it clears all stored information in Stata memory,
such as non-installed programs and globals,
getting as close as possible to opening Stata fresh.
Write multiple graphs as
tw (xx)(xx)(xx), not
In simple expressions, put spaces around each binary operator except
gen z = x + y`` andx^2`.
When order of operations applies, you may adjust spacing and parentheses: write
hours + (minutes/60) + (seconds/3600), not
hours + minutes / 60 + seconds / 3600.
For long expressions,
- operators should start the new line,
/ should be used inline. For example:
Make sure your code doesn’t print very much to the results window as this is slow.
This can be accomplished by using
run file.do rather than
Interactive commands like
tab should be used sparingly in do-files,
unless they are for the purpose of getting
In that case, consider using the
qui prefix to prevent printing output.
It is also faster to get outputs from commands like
reg using the