We will continue working with the dataset from Part 1. This is data on participants of an employment training programme.
First, we set our directory, load the data and take a look at the variables:
. cd "~/git_repos/metricsinstata/docs/part2"
/Users/jack/git_repos/metricsinstata/docs/part2
. use "nsw.dta", clear
. des

Contains data from nsw.dta
  obs:           722
 vars:            10                          18 May 2012 09:35
 size:        20,938
──────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
──────────────────────────────────────────────────────────────────
data_id         str14   %14s
treat           byte    %8.0g
age             byte    %8.0g
education       byte    %8.0g
black           byte    %8.0g
hispanic        byte    %8.0g
married         byte    %8.0g
nodegree        byte    %8.0g
re75            float   %9.0g
re78            float   %9.0g
──────────────────────────────────────────────────────────────────
Sorted by:
Note: We often add the option clear after use, as in the command above. This tells Stata to discard any data currently in memory, even if it has unsaved changes, before loading the new data.
After taking a look at the variable names and number of observations, the first thing to do is to inspect the distribution of each variable.
Why do we care about the distributions?
We can check whether there are any outliers. Mistakes in data are common, for example someone with “fat fingers” pressing too many zeros when typing in earnings data. These types of data errors can strongly affect results and need to be corrected or removed before analysis.
The distributions can help guide model choice. For example, some models function better when variables are approximately normally distributed. We can inspect that here.
For discrete or categorical variables, this can be achieved using the tabulate command:
. tabulate treat

      treat │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          0 │        425       58.86       58.86
          1 │        297       41.14      100.00
────────────┼───────────────────────────────────
      Total │        722      100.00
Here we can see that treat equals 1 for 297 individuals and 0 for 425.
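As a brief sketch of two common variations on tabulate, the block below assumes nsw.dta is in the working directory. It reloads the file, so any unsaved changes in memory are discarded. The missing and row options are standard tabulate options; pairing treat with black is just an illustrative choice.

```stata
use "nsw.dta", clear

* One-way table that also reports missing values, if there are any
tabulate treat, missing

* Two-way table of treat against black, with row percentages
tabulate treat black, row
```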
Let's try doing the same with the age variable:
. tab age

        age │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
         17 │         52        7.20        7.20
         18 │         75       10.39       17.59
         19 │         63        8.73       26.32
         20 │         58        8.03       34.35
         21 │         40        5.54       39.89
         22 │         42        5.82       45.71
         23 │         41        5.68       51.39
         24 │         37        5.12       56.51
         25 │         56        7.76       64.27
         26 │         35        4.85       69.11
         27 │         47        6.51       75.62
         28 │         31        4.29       79.92
         29 │         24        3.32       83.24
         30 │         12        1.66       84.90
         31 │         22        3.05       87.95
         32 │          7        0.97       88.92
         33 │         10        1.39       90.30
         34 │         11        1.52       91.83
         35 │          7        0.97       92.80
         36 │          7        0.97       93.77
         37 │          3        0.42       94.18
         38 │          7        0.97       95.15
         39 │          5        0.69       95.84
         40 │          2        0.28       96.12
         41 │          5        0.69       96.81
         42 │          5        0.69       97.51
         43 │          2        0.28       97.78
         44 │          4        0.55       98.34
         45 │          3        0.42       98.75
         46 │          3        0.42       99.17
         48 │          1        0.14       99.31
         49 │          1        0.14       99.45
         50 │          2        0.28       99.72
         54 │          1        0.14       99.86
         55 │          1        0.14      100.00
────────────┼───────────────────────────────────
      Total │        722      100.00
The variable age takes on far more values than treat, so we are better off using the summarize command:
. summarize age

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         age │        722    24.52078     6.625947        17         55
Here we can see that the mean age of our individuals is 24.5, the standard deviation is 6.63, the minimum value is 17 and the maximum value is 55.
summarize is useful for continuous data. Let's apply it to the variable re75, which represents earnings for a particular year.
. sum re75

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        re75 │        722    3042.897     5066.143         0   37431.66
You can add the option detail to get more information on a variable’s distribution:
. sum re75, detail

                            re75
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 722
25%            0              0       Sum of Wgt.         722

50%      936.308                      Mean           3042.897
                        Largest       Std. Dev.      5066.143
75%     4023.211       29897.19
90%     8920.471       32984.25       Variance       2.57e+07
95%     12205.14       36941.27       Skewness       2.958421
99%     24294.75       37431.66       Kurtosis       14.54842
This gives a lot of detail on the quantiles of the distribution and higher order moments.
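The percentiles reported by summarize with detail can also be reused in later commands, which is one way to pull out candidate outliers for inspection. The sketch below assumes nsw.dta is in the working directory and reloads it. Using the 99th percentile as a cutoff is an illustrative choice, not a rule, and cutoff is a name chosen for this example.

```stata
use "nsw.dta", clear

* summarize leaves its results behind in r(); detail adds the percentiles
quietly summarize re75, detail
display "99th percentile of re75: " r(p99)

* Copy the cutoff into a local macro so later commands cannot overwrite it
local cutoff = r(p99)

* List the observations above the cutoff so they can be checked by eye
list data_id re75 if re75 > `cutoff' & !missing(re75)
```

Local macros only exist while the code that defines them is running, so these lines should be run together as one block from a do-file.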
You will often also plot data at this stage. The histogram is an extremely quick and easy way to have a look at your data:
. histogram re75
(bin=26, start=0, width=1439.6792)

. graph export "hist_re75.png", replace
(file hist_re75.png written in PNG format)
The graph export "hist_re75.png" part of the code tells Stata to export the current graph. A “.png” image file will appear in your current working directory. The replace option tells Stata to overwrite any existing file with that name.
We see from the histogram that the distribution is highly skewed, with a few very high earners. This is a common shape for an earnings distribution.
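A short sketch of how the histogram can be adjusted, and how a skewed variable might be transformed before plotting. It assumes nsw.dta is in the working directory and reloads it. The bin count, the title text and the variable name log_re75 are illustrative choices; adding 1 before taking logs is one common workaround for zero earnings, not the only one.

```stata
use "nsw.dta", clear

* Fewer bins, y-axis in percent, with a normal density overlaid for comparison
histogram re75, bin(20) percent normal title("Earnings (re75)")

* log(re75 + 1) keeps the zero earners; log(re75) would set them to missing
generate log_re75 = log(re75 + 1)
histogram log_re75, percent
```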
Oftentimes we would like a command to only apply to a selection of observations, rather than to all observations in the dataset.
We can achieve this using the if qualifier.
For example, the following code generates a histogram only for those with earnings under 5000:
. hist re75 if re75 < 5000
(bin=23, start=0, width=214.44412)

. graph export "hist_re75_lim.png", replace
(file hist_re75_lim.png written in PNG format)
We can also generate a histogram of earnings for those with treat equal to 1:
. hist re75 if treat == 1
(bin=17, start=0, width=2201.8624)

. graph export "hist_re75_treat.png", replace
(file hist_re75_treat.png written in PNG format)
We can also chain conditions as follows:
. sum re75 if treat == 1 & re75 < 5000

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        re75 │        229    995.5063      1313.29         0   4923.263
This gives summary statistics for the earnings of all individuals for whom earnings are under 5000 and treat equals 1.
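The full set of relational operators is == (equal), != or ~= (not equal), > (greater than), < (less than), >= (greater than or equal) and <= (less than or equal). Conditions can be combined with & (and) and | (or), and negated with !.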
Note that for equality conditions, we need two equals signs.
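To make the use of conditions concrete, here is a small sketch with count, which simply reports how many observations satisfy a condition. It assumes nsw.dta is in the working directory and reloads it. The particular cutoffs are arbitrary.

```stata
use "nsw.dta", clear

* and: both conditions must hold
count if age >= 18 & age <= 25

* inrange() is a shorthand for the same condition
count if inrange(age, 18, 25)

* or: at least one condition must hold
count if black == 1 | hispanic == 1

* not equal
count if treat != 1

* ! negates whatever is inside the parentheses
count if !(re75 == 0)
```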
Two very useful commands are keep and drop. Let's say we would like to keep only two variables, re75 and treat. We can do this by typing:
keep re75 treat
All other variables will be deleted from the data in memory. The original “.dta” file on disk is unchanged unless you save over it.
Now let's say we want to delete those two variables. You can do this by typing:
drop re75 treat
This also works for observations. Let's say you want to drop all observations with earnings equal to zero:
drop if re75 == 0
Be careful to keep track of what you have dropped. Confusing results further down the line can often be traced back to a poorly executed keep or drop command.
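One way to experiment with drop without losing anything is to wrap it in preserve and restore, which take a temporary snapshot of the data and then bring it back. The sketch below assumes nsw.dta is in the working directory and reloads it. The count commands are only there to show the number of observations at each step.

```stata
use "nsw.dta", clear

* Number of observations before dropping anything
count

* Take a temporary snapshot of the data in memory
preserve

drop if re75 == 0
count
summarize re75

* Bring back the snapshot: the dropped observations return
restore
count
```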
In Stata it is common to generate new variables which are some function of other variables.
Let's generate a new variable called re75_re78, which is the sum of re75 and re78.
. generate re75_re78 = re75 + re78
. label var re75_re78 "sum of re75 and re78"
This variable will be added to our existing variables:
. des

Contains data from nsw.dta
  obs:           722
 vars:            11                          18 May 2012 09:35
 size:        23,826
──────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
──────────────────────────────────────────────────────────────────
data_id         str14   %14s
treat           byte    %8.0g
age             byte    %8.0g
education       byte    %8.0g
black           byte    %8.0g
hispanic        byte    %8.0g
married         byte    %8.0g
nodegree        byte    %8.0g
re75            float   %9.0g
re78            float   %9.0g
re75_re78       float   %9.0g                 sum of re75 and re78
──────────────────────────────────────────────────────────────────
Sorted by:
     Note: Dataset has changed since last saved.
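A few more patterns for creating variables, sketched under the assumption that nsw.dta is in the working directory (the block reloads it). The names earn_change, any_re75 and older, and the age cutoff of 25, are made up for this example.

```stata
use "nsw.dta", clear

* Difference between the two earnings variables
generate earn_change = re78 - re75
label variable earn_change "re78 minus re75"

* Indicator: 1 if re75 is positive, 0 otherwise, missing if re75 is missing
generate byte any_re75 = re75 > 0 if !missing(re75)

* generate refuses to overwrite an existing variable; use replace to edit one
generate byte older = 0
replace older = 1 if age >= 25 & !missing(age)

tabulate older
```

The !missing() checks matter because Stata treats missing values as larger than any number, so a bare age >= 25 would also be true for observations with missing age.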
Typically, a project will involve some data cleaning. This could consist of:
Naming and labeling variables
Dropping variables we don’t need
Dropping observations that we don’t want or that have a problem in one or more variables
Transforming variables
Generating new variables as functions of existing ones.
With that basic data cleaning complete, we typically save a cleaned version. This allows us to perform analysis at a later time without having to go through all the previous steps again.
Data-cleaning code can often take a very long time to run, so it's useful to “check in” your dataset after cleaning by saving it.
The command to do this is save:
save "lalonde_clean.dta", replace
You can then load the data again in your analysis section with the use command.
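Put together, the clean-then-save workflow might look like the sketch below. It assumes nsw.dta is in the working directory. The label text and the choice to drop observations with missing earnings are illustrative assumptions, and splitting the code into a cleaning part and an analysis part is just one way to organise a project.

```stata
* --- cleaning step ---
use "nsw.dta", clear

* Label wording is an assumption made for this example
label variable treat "Treatment indicator"

* Drop observations with missing earnings in either year
drop if missing(re75) | missing(re78)

save "lalonde_clean.dta", replace

* --- analysis step (can be run later, without repeating the cleaning) ---
use "lalonde_clean.dta", clear
summarize re78
```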
You’ll have recognized by now that Stata follows a fairly standard syntax across (most) commands. At its core, a Stata command is made up of a main command (e.g. sum) followed by one or more variable names (e.g. re78). Any options (e.g. det) then follow after a comma. Here is that full command:
sum re78, det
If we include a condition, it usually follows the variable names and precedes the comma:
sum re78 if re78 < 5000, det
This structure doesn’t hold for all commands, as we saw above for keep and drop. If unsure of how to use a command, remember you can type help followed by the name of the command to access Stata’s help files, which are very useful.
Even advanced Stata users regularly have to go back to the Stata help files to understand how to use a command, so it's worth getting used to the process.
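As a rough illustration of the general pattern, the sketch below shows the same command written in full and abbreviated, plus the in qualifier, which restricts a command to a range of observation numbers and sits in the same position as if. It assumes nsw.dta is in the working directory and reloads it.

```stata
* General shape of most commands:
*   command varlist if condition, options

use "nsw.dta", clear

* Open the help file for a command
help summarize

* The same command in full and with abbreviated names
summarize re78 if re78 < 5000, detail
sum re78 if re78 < 5000, det

* in restricts a command to a range of observation numbers
list data_id re78 in 1/5
```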
One rule of coding is to avoid repetition as much as possible. Reducing repetition:
makes mistakes less likely
makes it easier to update your code later
makes code easier to read
Let's say we want to inspect means by group. One option is to use if, as we have seen:
sum earnings if education == "high school"
sum earnings if education == "college"
sum earnings if education == "below high school"
A convenient alternative is to use by. You will get the same result from:
sort education
by education: sum earnings
What is going on here?
sort tells Stata to sort the data based on education. This is necessary before the by prefix.
by education: sum earnings tells Stata to perform the operation sum earnings separately for each value of education.
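The sort and by steps can be combined into one with bysort. The sketch below applies it to the training programme data rather than the education example, assuming nsw.dta is in the working directory (the block reloads it). tabstat is shown as a more compact alternative; the statistics requested are an illustrative choice.

```stata
use "nsw.dta", clear

* bysort sorts by treat and then runs the command once per group
bysort treat: summarize re78

* tabstat puts the same kind of group statistics in a single table
tabstat re78, by(treat) statistics(mean sd n)
```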
Let's say we have three variables x, y and z. Missing values for these variables are coded as -99 and we would like to drop all observations with any missing values.
One valid approach is:
drop if x == -99
drop if y == -99
drop if z == -99
This involves a lot of repetition.
We can instead use a loop, which performs an operation on each variable in a varlist. A varlist is just a list of variable names:
foreach var of varlist x y z {
    drop if `var' == -99
}
Note the asymmetric quote symbols around var in the loop: a backtick (`) on the left and an apostrophe (') on the right.
What is going on here?
foreach var of varlist tells Stata we are going to run a block of code over multiple variables.
x y z provides the variables.
{ and } enclose the code to be performed.
Each variable name is substituted in turn in place of `var'.
This is useful whenever we want to apply a single operation to multiple variables.
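A related sketch, using a small made-up dataset so it can be run on its own: instead of dropping observations straight away, the loop recodes -99 to Stata's missing value (.), which keeps the option of dropping them later. The data values are invented for this example.

```stata
* Toy data: -99 stands in for a missing value
clear
input x y z
1 2 3
-99 5 6
7 -99 9
10 11 12
end

* Recode -99 to Stata's missing value in each variable
foreach var of varlist x y z {
    replace `var' = . if `var' == -99
}

* Observations with a missing value in any of the three variables (here: 2)
count if missing(x, y, z)

* Drop them once we are sure we no longer need them
drop if missing(x, y, z)
```

Stata's built-in mvdecode command does the same recoding without a loop: mvdecode x y z, mv(-99).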
We can also loop over numbers held in a numlist.
To illustrate, let's assume that we want to generate 10 new variables named var1, var2, var3 etc. We want all observations to have a value of 1 for all these new variables. The long-winded approach is:
gen var1 = 1
gen var2 = 1
gen var3 = 1
gen var4 = 1
gen var5 = 1
gen var6 = 1
gen var7 = 1
gen var8 = 1
gen var9 = 1
gen var10 = 1
A better way to do this is:
foreach num of numlist 1/10 {
    gen var`num' = 1
}
What is going on here?
foreach num of numlist tells Stata we are going to run a block of code over a list of numbers.
1/10 means that we are looping over all integers from 1 to 10.
{ and } enclose the code to be performed.
Each integer is substituted in turn in place of `num'.
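For a simple run of consecutive numbers, Stata also offers forvalues, which does the same job with slightly less typing. The sketch below creates an empty five-observation dataset so it can be run on its own; the number of observations and the odd names are arbitrary.

```stata
* Empty dataset with five observations, just for illustration
clear
set obs 5

* Same result as the foreach loop over a numlist
forvalues i = 1/10 {
    generate var`i' = 1
}

* A step can be given too: 1(2)9 means 1, 3, 5, 7, 9
forvalues i = 1(2)9 {
    generate odd`i' = `i'
}

describe
```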