Part 2: Exploring and cleaning data

1. Setup

We will continue working with the dataset from Part 1. This is data on participants of an employment training programme.

First, we set our directory, load the data and take a look at the variables:

. cd "~/git_repos/metricsinstata/docs/part2"
/Users/jack/git_repos/metricsinstata/docs/part2

. use "nsw.dta", clear

. des

Contains data from nsw.dta
  obs:           722                          
 vars:            10                          18 May 2012 09:35
 size:        20,938                          
──────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
──────────────────────────────────────────────────────────────────────────────────────────────────────
data_id         str14   %14s                  
treat           byte    %8.0g                 
age             byte    %8.0g                 
education       byte    %8.0g                 
black           byte    %8.0g                 
hispanic        byte    %8.0g                 
married         byte    %8.0g                 
nodegree        byte    %8.0g                 
re75            float   %9.0g                 
re78            float   %9.0g                 
──────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by:

Note: We often write , clear after use. This tells Stata not to worry about losing any data currently loaded when loading the new data.

2. Inspecting distributions

After taking a look at the variable names and number of observations, the first thing to do is to inspect the distribution of each variable.

Why do we care about the distributions?

We can check if there exist any outliers. It is common for there to be mistakes in data, for example someone with “fat fingers” pressing too many zeros when typing in earnings data. These types of data errors can strongly impact results and need to be removed before analysis
The distributions can help guide model choice. For example, some models function better when variables are approximately normally distributed. We can inspect that here.

2.1 The tabulate and summarize commands

For discrete or categorical variables, this can be achieved using the tabulate command:

. tabulate treat

      treat │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          0 │        425       58.86       58.86
          1 │        297       41.14      100.00
────────────┼───────────────────────────────────
      Total │        722      100.00

Here we can see that there are 297 individuals for whom variable treat equals 1 and 425 for whom variable treat equals 0.

Lets try doing the same with the age variable:

. tab age

        age │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
         17 │         52        7.20        7.20
         18 │         75       10.39       17.59
         19 │         63        8.73       26.32
         20 │         58        8.03       34.35
         21 │         40        5.54       39.89
         22 │         42        5.82       45.71
         23 │         41        5.68       51.39
         24 │         37        5.12       56.51
         25 │         56        7.76       64.27
         26 │         35        4.85       69.11
         27 │         47        6.51       75.62
         28 │         31        4.29       79.92
         29 │         24        3.32       83.24
         30 │         12        1.66       84.90
         31 │         22        3.05       87.95
         32 │          7        0.97       88.92
         33 │         10        1.39       90.30
         34 │         11        1.52       91.83
         35 │          7        0.97       92.80
         36 │          7        0.97       93.77
         37 │          3        0.42       94.18
         38 │          7        0.97       95.15
         39 │          5        0.69       95.84
         40 │          2        0.28       96.12
         41 │          5        0.69       96.81
         42 │          5        0.69       97.51
         43 │          2        0.28       97.78
         44 │          4        0.55       98.34
         45 │          3        0.42       98.75
         46 │          3        0.42       99.17
         48 │          1        0.14       99.31
         49 │          1        0.14       99.45
         50 │          2        0.28       99.72
         54 │          1        0.14       99.86
         55 │          1        0.14      100.00
────────────┼───────────────────────────────────
      Total │        722      100.00

Variable age takes on far more values than treat, so we will be better off using the summarize command:

. summarize age

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         age │        722    24.52078    6.625947         17         55

Here we can see that the mean age of our individuals is 24.5, the standard deviation is 6.63, the minimum value is 17 and the maximum value is 55.

summarize is useful for continuous data. Lets apply it to variable re75, which represents earnings for a particular year.

. sum re75

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        re75 │        722    3042.897    5066.143          0   37431.66

You can add the option , detail to get more information on a variable’s distribution:

. sum re75, detail

                            re75
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 722
25%            0              0       Sum of Wgt.         722

50%      936.308                      Mean           3042.897
                        Largest       Std. Dev.      5066.143
75%     4023.211       29897.19
90%     8920.471       32984.25       Variance       2.57e+07
95%     12205.14       36941.27       Skewness       2.958421
99%     24294.75       37431.66       Kurtosis       14.54842

This gives a lot of detail on the quantiles of the distribution and higher order moments.

2.2. Histograms

You will often also plot data at this stage. The histogram is an extremely quick and easy way to have a look at your data:

. histogram re75
(bin=26, start=0, width=1439.6792)

. graph export "hist_re75.png", replace
(file hist_re75.png written in PNG format)

The graph export "hist_re75.png" part of the code tells Stata to export the current graph. A “.png” image file will appear in your current working directory. The , replace option tells Stata to replace any existing file that is there.

We see from the histogram that the distribution is highly skewed, with a few very high earners. This is a common shape for an earnings distribution.

3. Conditions

Oftentimes we would like a command to only apply to a selection of observations, rather than to all observations in the dataset.

We can achieve this using the if command.

For example, the following code generates a histogram only for those with earnings under 5000:

. hist re75 if re75 < 5000
(bin=23, start=0, width=214.44412)

. graph export "hist_re75_lim.png", replace
(file hist_re75_lim.png written in PNG format)

We can also generate a histogram of earnings for those with treat equal to 1:

. hist re75 if treat == 1
(bin=17, start=0, width=2201.8624)

. graph export "hist_re75_treat.png", replace
(file hist_re75_treat.png written in PNG format)

We can also chain conditions as follows:

. sum re75 if treat == 1 & re75 < 5000

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        re75 │        229    995.5063     1313.29          0   4923.263

The gives features of the distribution of earnings for all individuals for whom earnings are under 5000 and treat equals 1.

The full set of conditions is:

Note that for equality conditions, we need two equals signs.

4. Keep and drop

Two very useful commands are keep and drop. Lets say we would like to keep only two variables, re75 and treat. We can do this by typing:

keep re75 treat

All other variables will be deleted. They remain in the original file unless you save and replace the “.dta” file.

Now lets say we want to delete those two variables. You can do this by typing:

drop re75 treat

This also works for observations. Lets say you want to drop all observations with earnings equal to zero:

drop if re75 == 0

Be careful to keep track of what you have dropped. You may find something confusing further down the line which can be attributed to a poorly executed keep or drop command.

5. Generating new variables

In Stata it is common to generate new variables which are some function of other variables.

Lets generate a new variable called re75_re78, which is the sum of re75 and re78.

. generate re75_re78 = re75 + re78

. label var re75_re78 "sum of re75 and re78"

This variable will be added to our existing variables:

. des

Contains data from nsw.dta
  obs:           722                          
 vars:            11                          18 May 2012 09:35
 size:        23,826                          
──────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
──────────────────────────────────────────────────────────────────────────────────────────────────────
data_id         str14   %14s                  
treat           byte    %8.0g                 
age             byte    %8.0g                 
education       byte    %8.0g                 
black           byte    %8.0g                 
hispanic        byte    %8.0g                 
married         byte    %8.0g                 
nodegree        byte    %8.0g                 
re75            float   %9.0g                 
re78            float   %9.0g                 
re75_re78       float   %9.0g                 sum of re75 and re78
──────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.

6. Saving data

Typically, a project will involve some data cleaning. This could consist of:

Naming and labeling variables
Dropping variables we don’t need
Dropping observations which either we don’t want or have a problem in one or more variables
Transforming variables
Generating new variables as functions of existing ones.

With that basic data cleaning complete, we typically save a cleaned version. This allows us to perform analysis at a later time without having to go through all the previous steps again.

Data-cleaning code can often take a very long time to run, so its useful to “check-in” your dataset after cleaning by saving it.

The command to do this is save:

save "lalonde_clean.dta", replace

You can then load the data again in your analysis section by using the use command.

7. Standard Stata syntax

You’ll have recognized by now that Stata follows a fairly standard syntax across (most) commands. At its core, a Stata command is made up of a main command (e.g. sum) followed by one or more variable names (e.g. re78).

Any options (e.g. det) then follow after a comma. Here is that full command:

sum re78, det

If we include a condition, it usually follows the variable names and precedes the comma:

sum re78 if re78 < 5000, det

This structure doesn’t hold for all commands, as we saw above for keep and drop. If unsure of how to use a command, remember you can type help then the name of the command to access Stata’s help files, which are very useful.

Even advanced Stata users regularly have to go back to Stata help files to understand how to use a command, so its worth getting used to the process.

8. Reducing repetition

One rule of coding is to avoid repetition as much as possible. Reducing repetition:

makes mistakes less likely
makes it easier to update your code later
makes code easier to read

8.1 The “by” command

Lets say we want to inspect means by group. One option is to use if, as we have seen:

sum earnings if education == "high school"
sum earnings if education == "college"
sum earnings if education == "below high school"

A convenient alternative is to use by. You will get the same result from:

sort education
by education: sum earnings

What is going on here?

sort tells Stata to sort the data based on education. This is necessary before the command by.
by education: sum earnings tells Stata to perform operation sum earnings separately for each value of education

8.2 Looping over variables

Lets say we have three variables x, y and z. Missing values for these variables are coded as -99 and we would like to drop all observations with any missing values.

One valid approach is:

drop if x == -99
drop if y == -99
drop if z == -99

This involves a lot of repetition.

We can instead use a loop which performs an operation to all variables in a varlist. A varlist is just a list of variable names:

foreach var of varlist x y z {
    drop if `var' == - 99
}

Note the asymmetric quote symbols used in the loop.

What is going on here?

foreach var of varlist tells Stata we are going to run a block of code over multiple variables.
x y z provides the variables
{ and } enclose the code to be performed
variable names are substituted in place of var

This is useful whenever we want to apply a single operation to multiple variables.

8.3 Looping over numbers

We can also loop over numbers held in a numlist.

To illustrate, lets assume that we want to generate 10 new variables named var1, var2, var3 etc. We want all observations to have a value of 1 for all these new variables. The long-winded approach is:

gen var1 = 1
gen var2 = 1
gen var3 = 1
gen var4 = 1
gen var5 = 1
gen var6 = 1
gen var7 = 1
gen var8 = 1
gen var9 = 1
gen var10 = 1

A better way to do this is:

foreach num of numlist 1/10 {
    gen var`num' = 1
}

What is going on here?

foreach num of numlist tells Stata we are going to run a block of code over a list of numbers.
1/10 means that we are looping over all integers between 1 and 10
{ and } enclose the code to be performed
Integers are substituted in place of num