Part 1: Introduction to Stata

1. Getting started with Stata

Stata is a statistical package which is:

Stata is not:

While R and Python are also valuable to learn, Stata remains the most common programming language in empirical economics and is an excellent tool for learning and applying econometric methods. The majority of papers in applied economics include analysis using Stata, and the majority of applied economists are comfortable using it.

1.1 The Stata Interface

The Stata Interface consists of the following windows:

[Figure: Stata Interface]

The role of each element is:

1.2 Running code in Stata

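There are two ways to run code in Stata: typing commands one at a time into the command window, or running a .do file that contains a sequence of commands.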

While the first option is easy for one-off commands, as you become familiar with the software it is best to use .do files.

.do files are edited in the do-file editor, which can be opened through File -> New. They are run either by clicking ‘do’ or by pressing CMD+SHIFT+d (OSX) or CTRL+d (Windows).

As a rule, your projects should be contained in one or more .do files which contain all code needed to produce your analysis. The .do file should provide everything necessary to reproduce your results, from the raw data to the final output.

1.3 Finding help

Perhaps the most important Stata command to know is help, which brings up Stata documentation. For example, typing help regress into the command window then hitting enter will bring up all documentation on the regress command in Stata:

[Figure: Stata help]
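As a small sketch of how help is typically used, the lines below look up two commands by name and then run a keyword search. The search command is not covered in this tutorial; it is included here only as a fallback for when you do not yet know the name of the command you need.

```stata
* Full documentation for the regress command
help regress

* help works the same way for any other command name
help describe

* Keyword search, for when the command name is unknown
search regression
```

Each of these can be typed directly into the command window.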

See the end of this tutorial for further resources.

2. Worked example: LaLonde (1986)

Our application will be a classic paper in labor economics, LaLonde (1986). In this paper, LaLonde investigates the impact of a randomized employment program on earnings.

Note that the snippets below may contain both code and output. Where output is shown, code lines are indicated by either . or >, and anything else is output.

2.1 Starting a .do file

.do files are the building blocks of Stata projects, so before launching into commands we need to first think about how to set them up.

We can now start looking at an example .do file. A typical project consists of several (often several dozen) interlinked .do files, each with hundreds of lines of code, but here we will just work from a single file.

You should think about the .do file as a series of text instructions which tell Stata what to do.

A .do file should start with a title and short description. My .do file, for example, starts with the following:

. /* Lalonde analysis
> Jack Blundell
> Worked example of analysis of lalonde 1986 data.
> Structure:
> 0. Setup
> 1. Load data
> 2. Rename and relabel variables
> */    

The /* and */ symbols tell Stata that the enclosed text is not to be read as Stata commands, but instead provides information and context to the researcher reading the code. It is good practice to include such non-command text when writing code, for instance section titles or short annotations describing what the code does. Stata does not read as a command any line beginning with *, any text following //, or any text enclosed between /* and */, so all three can be used to convey this information.
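The three comment styles can be sketched in a few lines, meant to be run from a .do file rather than typed into the command window. The display commands are placeholders chosen only so that the snippet produces some output. The /// continuation marker is not discussed in this tutorial; it is shown as a related piece of syntax for splitting long commands.

```stata
* A full-line comment: the line begins with an asterisk
display "Hello, Stata"   // a trailing comment, preceded by at least one space

/* A block comment can span
   several lines */
display 2 + 2

// Three slashes continue a long command on the next line
display "a long command " ///
    "split over two lines"
```

Note that * only marks a comment at the start of a line; elsewhere it is the multiplication operator.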

Here’s an example of a comment, followed by a line of code (don’t worry about the code for now):

// regress earnings on treatment for adult sub-sample
reg y t if age >= 18

While your code might seem self-explanatory to you at first, oftentimes this is not the case for your co-authors, a grader or indeed your future self.

(To avoid clutter, I exclude all comments in the code included below, but you’ll see them in the full .do file later in this tutorial.)

Next up in my .do file is the following:

. clear
. set more off

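These two commands (clear and set more off) are standard to include at the top of .do files. The first removes any data currently held in memory, and the second stops Stata from pausing when output fills the results window, so that large amounts of output can be printed without interruption.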

Next I set the working directory. For this example, we will work in a single folder which will be our only directory, though more complex projects will require further structure. For a small project, this should be where your .do file is saved, and where the data file you will use is saved. Unless otherwise specified, any output you generate will be saved here.

. cd "~/git_repos/metricsinstata/docs/part1"
/Users/jack/git_repos/metricsinstata/docs/part1
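If you are unsure which folder Stata is currently working in, a quick check along the following lines can help. This is only a sketch: pwd, c(pwd) and dir are standard Stata, but they are not used elsewhere in this tutorial.

```stata
* Print the current working directory
pwd

* The same path is available as a stored system value
display c(pwd)

* List the files in the working directory
dir
```

Running this straight after cd is a simple way to confirm that the data file you intend to load is where Stata will look for it.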

The following two commands relate to log files. A log is a record of your code and the output your code generates. While it is possible for small projects to run all your code each time and simply inspect the results in the Stata results window, for larger projects it is useful to have a record of all your output. First I close any existing log file:

capture log close

The code above will close any existing log file you have open, so that a new one can be opened. The capture command here is useful, but we will not go into the details of how it works quite yet.

The next command starts a new log:

log using "lecture1log", replace

The command tells Stata to log (create a log file containing results) and to save the file (using) as lecture1log. Any results generated will now end up in a .smcl log file in the working directory. The replace option that follows appears several times in this example, attached to different commands. It tells Stata to overwrite an existing file of the same name if there is one. Omitting replace results in an error message if an overwrite is attempted.
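As a variation on the above, the sketch below writes a plain-text log instead of a .smcl one by adding the text option. The file name scratchlog is hypothetical and used only so that this snippet does not overwrite the tutorial's own log. It is best run on its own, since it closes whatever log is currently open.

```stata
* Close any open log, then start a plain-text log
* ("scratchlog" is a hypothetical file name used only for this sketch)
capture log close
log using "scratchlog", replace text

display "This line is recorded in scratchlog.log"

log close
```

A plain-text log can be opened in any text editor, whereas a .smcl log is formatted for Stata's own viewer.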

2.2 Loading data

Stata datasets are saved with the file extension “.dta”.

The dataset containing the LaLonde data is called “nsw.dta” and is available at http://www.nber.org/~rdehejia/data/nsw.dta.

You need to download this file and place it into your working directory, which is the folder you set earlier using the cd command.

You then load the data into Stata with the use command:

. use "nsw.dta"
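The download and load steps can also be scripted, which keeps the .do file reproducible from the raw data. This is a sketch under two assumptions: that the URL given above is still reachable from your machine, and that you are happy to overwrite any existing nsw.dta in the working directory. The copy command and the clear option are not used in the tutorial's own .do file.

```stata
* Download the dataset into the working directory
* (assumes the URL above is still reachable from your machine)
copy "http://www.nber.org/~rdehejia/data/nsw.dta" "nsw.dta", replace

* Load it, discarding any data already in memory
use "nsw.dta", clear
```

Without the clear option, use refuses to load a new dataset if the data currently in memory have unsaved changes.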

To see what the data looks like now that it is in Stata, type the command browse into the command window. This brings up the data in a familiar spreadsheet-style format:

[Figure: The browse window]

Here we see what looks like an Excel spreadsheet. Each row represents an observation and each column a variable. For example, we can see that for the first observation, variable treat is 1 and variable age is 37.
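The same information can be printed in the results window, where it will also be captured by the log. A minimal sketch using the list command, which is not otherwise covered here, and assuming nsw.dta is in the working directory:

```stata
* Assumes nsw.dta has been placed in the working directory
use "nsw.dta", clear

* Print two variables for the first five observations
list treat age in 1/5
```

Unlike browse, list produces ordinary output, so it is the more convenient choice inside a .do file.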

2.3 Names and labels

“There are only two hard things in Computer Science: cache invalidation and naming things.” (Phil Karlton)

While using browse can be useful to take a look at the data, a more useful command to show what a dataset contains is describe.

. describe

Contains data from nsw.dta
  obs:           722                          
 vars:            10                          18 May 2012 09:35
 size:        20,938                          
──────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
──────────────────────────────────────────────────────────────────────────────────────────────────────
data_id         str14   %14s                  
treat           byte    %8.0g                 
age             byte    %8.0g                 
education       byte    %8.0g                 
black           byte    %8.0g                 
hispanic        byte    %8.0g                 
married         byte    %8.0g                 
nodegree        byte    %8.0g                 
re75            float   %9.0g                 
re78            float   %9.0g                 
──────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 

Here, we can now see that we have 722 observations and 10 variables. Each variable is named, but the dataset does not yet contain any variable labels.

It is possible to rename variables as follows:

. rename treat banana

This command renames variable treat, assigning it the new name banana. As banana is not a very informative variable name, we might like to change it back as follows:

. rename banana treat

In general, your variable names should be short, informative and easy to type. You can include further information via variable labels:

. label variable treat "Treatment indicator"
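The same pattern extends to the other variables. The sketch below labels four of them and then describes only those four. The label text is my own illustrative wording, not taken from the dataset's documentation, so check the meaning of each variable before using it in real work.

```stata
* Assumes nsw.dta has been placed in the working directory
use "nsw.dta", clear

* Label text below is illustrative, not taken from the data documentation
label variable age "Age (years)"
label variable education "Years of education"
label variable re75 "Earnings in 1975"
label variable re78 "Earnings in 1978"

* describe accepts a list of variables to restrict its output
describe age education re75 re78
```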

A few example variable names:

Use variable labels to fully specify what each variable represents, for example:

2.4 Abbreviations

Let's check out our data again:

. des

Contains data from nsw.dta
  obs:           722                          
 vars:            10                          18 May 2012 09:35
 size:        20,938                          
──────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
──────────────────────────────────────────────────────────────────────────────────────────────────────
data_id         str14   %14s                  
treat           byte    %8.0g                 Treatment indicator
age             byte    %8.0g                 
education       byte    %8.0g                 
black           byte    %8.0g                 
hispanic        byte    %8.0g                 
married         byte    %8.0g                 
nodegree        byte    %8.0g                 
re75            float   %9.0g                 
re78            float   %9.0g                 
──────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.

You may have noticed that rather than typing describe, the above code snippet achieves the same outcome with des.

This is a key feature of Stata code. Commands (and variables) can be abbreviated. In the rest of these tutorials, I will use full command names when first using a command, then follow standard abbreviations if the command comes up again.
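A short sketch of both kinds of abbreviation, using the variables in nsw.dta. The summarize command has not been introduced in the worked example; it is used here only because its abbreviation sum is so common.

```stata
* Assumes nsw.dta has been placed in the working directory
use "nsw.dta", clear

* Command abbreviation: these two lines run the same command
summarize age
sum age

* Variable abbreviation: educ uniquely identifies education
describe education
des educ

* An abbreviation must be unambiguous: "re7" would match both
* re75 and re78, so Stata would stop with an error
```

Abbreviations save typing, but full variable names tend to be safer in .do files that other people will read or that will be run on datasets where new variables may be added later.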

2.5 Closing the log file

The final thing to do is to close the log file:

log close

The log file you opened at the start of the do-file will now be closed and saved in your directory.
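Once the log is closed, you may want a copy that opens in any text editor. The sketch below assumes lecture1log.smcl now exists in the working directory. It uses translate and type, neither of which is covered in this tutorial.

```stata
* Convert the SMCL log to plain text
* (assumes lecture1log.smcl exists in the working directory)
translate "lecture1log.smcl" "lecture1log.log", replace

* Print the plain-text log in the results window
type "lecture1log.log"
```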

2.6 A complete .do file

Now let's take a look at what the complete .do file looks like:

/* Lalonde analysis
Jack Blundell

Worked example of analysis of lalonde 1986 data.
Structure:
0. Setup
1. Load data
2. Rename and relabel variables
*/

*** 0. Setup

clear
set more off

// set working directory
cd "~/git_repos/metricsinstata/docs/part1"

// open log 
capture log close
log using "lecture1log", replace

*** 1. Load data

use "nsw.dta"

*** 2. Rename and relabel variables

describe

rename treat banana

rename banana treat

label variable treat "Treatment indicator"

des

// close log
log close

3. General coding tips

3.1 .do file code layout

The following two sections of code (purely for illustration) contain exactly the same commands:

use "fruitdat.dta"
rename v1 fruit
label variable fruit "Type of fruit"
rename v2 sales
rename v3 price
label var sales "Sales (units, weekly)"
label var price "Price ($)"
sum price if fruit == "apple", det

// load dataset
use "fruitdat.dta"

// name variables
rename v1 fruit
rename v2 sales
rename v3 price

// label variables
label variable fruit "Type of fruit"
label var sales "Sales (units, weekly)"
label var price "Price ($)"

// detailed summary statistics for the price of apples
sum price if fruit == "apple", det

The second version is structured logically, uses blank lines to separate different tasks and provides comments. You should aim for this.

4. Common problems when first using Stata

5. Useful Stata shortcuts

6. Further resources

A large number of comprehensive Stata resources are available:

References

LaLonde, Robert J. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review, 1986