Part 3: Plotting data

1. Setup

Once again, we start by setting the working directory and loading the LaLonde (1986) data:

. cd "~/git_repos/metricsinstata/docs/part3"
/Users/jack/git_repos/metricsinstata/docs/part3
. use "nsw.dta", clear

2. Histogram (again)

We learned in part 2 how to create a histogram:

. hist re75
(bin=26, start=0, width=1439.6792)

. graph export "hist_re75.png", replace
(file hist_re75.png written in PNG format)
Histogram of 1975 earnings

This is useful for inspecting the distribution of a single continuous or ordinal variable. In this tutorial we will learn how to generate a few other useful graphs and how to make them pretty enough for publication.

3. Box plots

Sometimes we would like to inspect the distribution of a continuous variable conditional on a categorical variable. For example, in the LaLonde example we might like to see how post-training earnings vary between those who received the training (treat = 1) and those who did not (treat = 0).

. graph box re78, by(treat)

. graph export "box_re78_byt.png", replace
(file box_re78_byt.png written in PNG format)
1978 earnings by treatment

From this we can see that the 25th percentile, median and 75th percentile of 1978 earnings are higher for the treat = 1 group.
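
As a variation, the two boxes can be drawn in a single panel on a shared axis, which can make the quartiles easier to compare. The following is a minimal sketch using the over() option in place of by(). The relabel() text ("Control", "Treatment") is my own labelling and assumes treat takes only the values 0 and 1.

```stata
* Sketch: both groups in one panel, sharing a y-axis.
* relabel() numbers refer to the order of the categories:
* 1 is the first category (treat == 0), 2 is the second (treat == 1).
graph box re78, over(treat, relabel(1 "Control" 2 "Treatment")) ///
    ytitle("1978 earnings")
```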

We can make the same comparison numerically by using the summarize command (abbreviated sum) with the detail option (abbreviated det):

. sum re78 if treat == 1, det

                            re78
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 297
25%     549.2984              0       Sum of Wgt.         297

50%     4232.309                      Mean           5976.352
                        Largest       Std. Dev.      6923.796
75%     9381.295        26817.6
90%     13626.04       34099.28       Variance       4.79e+07
95%     17685.18       36646.95       Skewness       2.646437
99%     34099.28       60307.93       Kurtosis       16.72786

. sum re78 if treat == 0, det

                            re78
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 425
25%            0              0       Sum of Wgt.         425

50%     3746.701                      Mean           5090.048
                        Largest       Std. Dev.      5718.089
75%     8329.823       23483.45
90%     12429.91       29408.04       Variance       3.27e+07
95%     16328.96        30247.5       Skewness       1.569358
99%     20942.24       39483.53       Kurtosis       6.963969
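
If only the quartiles are of interest, a more compact table may be easier to read than two full sum, det listings. Here is a minimal sketch using tabstat, which is built into Stata. The particular choice of statistics is only illustrative.

```stata
* Sketch: count, mean and quartiles of 1978 earnings, one row per group.
tabstat re78, by(treat) statistics(n mean p25 p50 p75)
```

The table has one row per value of treat, plus a total row, so the 25th percentile, median and 75th percentile can be compared side by side.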

4. Scatter plots

Perhaps the most common type of plot in applied econometrics is the scatter plot. This is used when we would like to inspect the relationship between two continuous variables.

Let's make a scatter plot of re78, which represents 1978 earnings, against re75, which represents 1975 earnings.

. twoway scatter re78 re75

. graph export "scatter_re75re78.png", replace
(file scatter_re75re78.png written in PNG format)
1978 earnings and 1975 earnings

There may be something of a positive relationship here, but it is hard to tell, in part due to the skewness of the data and in part due to the many zeros. We would of course expect some positive relationship: those who earned more in 1978 are likely to be those who earned more three years earlier.

When a scatter plot looks like this, a convenient tool is binscatter, which was recently developed by Michael Stepner.

binscatter is our first example of a command which does not come bundled with Stata, but rather must be installed.

Fortunately, this is very easy in Stata. Just type the command ssc install binscatter and the package will be installed for you. You only need to do this once on each machine. Most user-built Stata packages are available through this route.
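
In a .do file that other people will run, it can be convenient to install the package only when it is missing. One possible pattern, sketched below, relies on the which command returning an error when it cannot find a command. It assumes the machine has internet access when the install is needed.

```stata
* Sketch: install binscatter only if Stata cannot already find it.
capture which binscatter
if _rc != 0 {
    ssc install binscatter
}
```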

Let's try the scatter plot again, this time using binscatter:

. binscatter re78 re75
warning: nquantiles(20) was specified, but only 13 were generated. see help file under nquantiles() for explanation.

. graph export "binscatter_re75re78.png", replace
(file binscatter_re75re78.png written in PNG format)
1978 earnings and 1975 earnings

Now we see a clear positive relationship.

The observations have been grouped into quantile bins of the x-axis variable (re75), and each displayed point is the mean of re75 and re78 within one bin. This is extremely useful, particularly when working with large datasets. A linear regression fit line is also added to the graph by default.
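
To make the binning concrete, the points can be approximated by hand with built-in commands. This is only a rough sketch of the idea, not binscatter's actual implementation. The variable name bin is made up for the example, and xtile may treat the many tied zeros in re75 differently from binscatter, so the number of bins need not match.

```stata
* Sketch: approximate binscatter's points by hand.
preserve                              // keep a copy of the full dataset
xtile bin = re75, nquantiles(20)      // assign each observation to a quantile bin of re75
collapse (mean) re78 re75, by(bin)    // one row per bin: the mean of each variable
twoway scatter re78 re75              // plot the bin means
restore                               // bring back the full dataset
```

No fit line is drawn here. As I understand the binscatter help file, its fit line is estimated on the underlying observations rather than on the bin means, so fitting a line to the collapsed data would not necessarily reproduce it.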

5. Customizing plots

The default plots in Stata are not the most attractive. Fortunately, you can customize them to your heart’s content. Here is an example where I have changed the background color to white, modified the point colors and changed the axis labels:

. twoway scatter re78 re75, mcolor(blue%50) ///
> graphregion(color(white)) ///
> xtitle("Earnings in dollars (1975)") ytitle("Earnings in dollars (1978)")
. graph export "scattermod_re75re78.png", replace
(file scattermod_re75re78.png written in PNG format)
1978 earnings and 1975 earnings

Note the use of ///. In a .do file, this tells Stata that the command continues on the next line, as if both lines had been typed as one. This is useful if you have very long lines of code that would be difficult to read if left on one line.
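
An alternative for long commands is to change the command delimiter with #delimit, so that a command ends at a semicolon rather than at the end of the line. Here is a brief sketch, reusing two of the options from the example above:

```stata
* Sketch: the same idea using #delimit instead of ///.
#delimit ;
twoway scatter re78 re75,
    mcolor(blue%50)
    graphregion(color(white));
#delimit cr
```

The final #delimit cr restores the default, where each line is one command. Like ///, this is meant for .do files rather than the Command window.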

To see the full range of options, type help twoway options.
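
Two further adjustments that often help with publication figures are choosing a graph scheme and exporting to a vector format. The sketch below assumes the built-in s1color scheme, which draws on a white background, and uses a made-up file name. PDF export may not be available on every platform or Stata version.

```stata
* Sketch: white-background scheme, thousands separators on the axes, vector output.
twoway scatter re78 re75, mcolor(blue%50) ///
    scheme(s1color) ///
    xlabel(, format(%9.0fc)) ylabel(, format(%9.0fc)) ///
    xtitle("1975 earnings") ytitle("1978 earnings")
graph export "scatter_example.pdf", replace
```

Passing scheme() as an option affects only this graph, whereas set scheme would change the default for the rest of the session.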

6. Layering

Often it is desirable to layer plots. This is also very easy in Stata. In the following plot, I combine a scatter plot of re78 on re75 for those with treat = 1 with the equivalent for those with treat = 0. Layers are drawn in the order they are listed, so the treat = 0 points appear on top.

. twoway (scatter re78 re75 if treat == 1, mcolor(blue%50)) ///
> (scatter re78 re75 if treat == 0, mcolor(dkorange%50)), ///
> legend(label(1 "Treatment") label(2 "Control")) ///
> graphregion(color(white)) ///
> xtitle("Earnings in dollars (1975)") ytitle("Earnings in dollars (1978)")

. graph export "scattermodcombined_re75re78.png", replace
(file scattermodcombined_re75re78.png written in PNG format)
1978 earnings and 1975 earnings by treatment status

Note that each layer is enclosed in its own parentheses. Options that apply to a single layer, such as mcolor(blue%50), which sets the marker color for the first layer, go inside that layer's parentheses.

It is also possible to set options for the combined plot, such as graphregion(color(white)). These options come at the end of the command.
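
The same pattern extends to other plot types. As a sketch, the following adds a linear fit line for each group using lfit. The line colors and axis titles are my own choices, and legend(order()) is used so that only the two scatter layers appear in the legend.

```stata
* Sketch: scatter layers plus one linear fit line per group.
twoway (scatter re78 re75 if treat == 1, mcolor(blue%50)) ///
       (scatter re78 re75 if treat == 0, mcolor(dkorange%50)) ///
       (lfit re78 re75 if treat == 1, lcolor(blue)) ///
       (lfit re78 re75 if treat == 0, lcolor(dkorange)), ///
       legend(order(1 "Treatment" 2 "Control")) ///
       graphregion(color(white)) ///
       xtitle("1975 earnings") ytitle("1978 earnings")
```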

7. Other plots

Above I have shown the most common plots in my experience, but here are a few more that might prove useful: