Once again, we start by setting the working directory and loading the LaLonde (1986) data:
. cd "~/git_repos/metricsinstata/docs/part3"
/Users/jack/git_repos/metricsinstata/docs/part3

. use "nsw.dta", clear
We learned in part 2 how to create a histogram:
. hist re75
(bin=26, start=0, width=1439.6792)

. graph export "hist_re75.png", replace
(file hist_re75.png written in PNG format)
This is useful for inspecting the distribution of a single continuous or ordinal variable. In this tutorial we will learn how to generate a few other useful graphs and how to make them pretty enough for publication.
Sometimes we would like to inspect the distribution of a continuous variable conditional on a categorical variable. For example, in the LaLonde data we might like to see how post-training earnings vary between those who received the training (treat = 1) and those who did not (treat = 0).
. graph box re78, by(treat)

. graph export "box_re78_byt.png", replace
(file box_re78_byt.png written in PNG format)
From this we can see that the 25th percentile, median and 75th percentile of 1978 earnings are higher for the treat = 1 group.
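The by() option draws one panel per group. As a variation, the sketch below uses over() so that both boxes share a single set of axes, which can make the comparison easier to read. The relabel() text and the axis title are illustrative choices, not part of the original example.

```stata
* Load the data used throughout this tutorial
use "nsw.dta", clear

* Draw both groups in one panel; categories are ordered 0 then 1,
* so position 1 is the control group and position 2 the treated group
graph box re78, over(treat, relabel(1 "Control" 2 "Treatment")) ///
    ytitle("1978 earnings")
```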
We can make the same comparison numerically by using the sum command (short for summarize) with the det (detail) option:
. sum re78 if treat == 1, det

                            re78
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 297
25%     549.2984              0       Sum of Wgt.         297

50%     4232.309                      Mean           5976.352
                        Largest       Std. Dev.      6923.796
75%     9381.295        26817.6
90%     13626.04       34099.28       Variance       4.79e+07
95%     17685.18       36646.95       Skewness       2.646437
99%     34099.28       60307.93       Kurtosis       16.72786

. sum re78 if treat == 0, det

                            re78
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 425
25%            0              0       Sum of Wgt.         425

50%     3746.701                      Mean           5090.048
                        Largest       Std. Dev.      5718.089
75%     8329.823       23483.45
90%     12429.91       29408.04       Variance       3.27e+07
95%     16328.96        30247.5       Skewness       1.569358
99%     20942.24       39483.53       Kurtosis       6.963969
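If only the quartiles and a few moments are of interest, a more compact table may be easier to read than two full sum, det listings. The following is a minimal sketch using the built-in tabstat command; the choice of statistics is illustrative.

```stata
* Load the data used throughout this tutorial
use "nsw.dta", clear

* One row per treatment group, one column per statistic
tabstat re78, by(treat) statistics(n p25 p50 p75 mean sd) columns(statistics)
```

The p25, p50 and p75 columns should match the 25%, 50% and 75% rows of the detailed summaries.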
Perhaps the most common type of plot in applied econometrics is the scatter plot. This is used when we would like to inspect the relationship between two continuous variables.
Let's make a scatter plot of re78, which represents 1978 earnings, against re75, which represents 1975 earnings.
. twoway scatter re78 re75

. graph export "scatter_re75re78.png", replace
(file scatter_re75re78.png written in PNG format)
There may be a positive relationship here, but it is hard to tell, partly because the data are skewed and partly because of the many zeros. We would of course expect some positive relationship: those who earned more in 1978 are likely to be those who earned more in 1975.
When a scatter plot looks like this, a convenient tool is binscatter, which was recently developed by Michael Stepner.
binscatter is our first example of a command that does not come bundled with Stata and must instead be installed.
Fortunately, this is very easy in Stata. Just type the command ssc install binscatter and the package will be installed for you. You only need to do this once on each machine. Many user-written Stata packages are available through this route.
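In a .do file that other people will run, it can be convenient to install the package only when it is missing. A minimal sketch of that pattern, assuming an internet connection is available when the install is needed:

```stata
* Check whether Stata can already find binscatter;
* capture suppresses the error and stores the return code in _rc
capture which binscatter

* A nonzero return code means the command was not found, so install it
if _rc != 0 {
    ssc install binscatter
}
```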
Let's try the scatter plot again, this time using binscatter:
. binscatter re78 re75
warning: nquantiles(20) was specified, but only 13 were generated. see help file under nquantiles() for explanation.

. graph export "binscatter_re75re78.png", replace
(file binscatter_re75re78.png written in PNG format)
Now we see a clear positive relationship.
The data has been grouped into quantiles based on the x-axis, and the displayed points are means within those bins. This is extremely useful, particularly when working with large datasets. The linear regression fit line has also been added to the graph by default.
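To make the binning step concrete, here is a rough sketch of how the points could be reproduced by hand with built-in commands. It is an approximation for illustration, not binscatter's actual implementation, and it omits the fit line. The variable name bin is hypothetical.

```stata
* Load the data used throughout this tutorial
use "nsw.dta", clear

preserve
    * Assign each observation to one of up to 20 quantile bins of re75
    xtile bin = re75, nquantiles(20)

    * Replace the data with the mean of re78 and re75 within each bin
    collapse (mean) re78 re75, by(bin)

    * Plot one point per bin
    twoway scatter re78 re75
restore
```

Because many observations share the same value of re75 (zero), several quantile cutpoints coincide and fewer than 20 distinct bins result. This is likely the reason for the warning printed by binscatter above.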
The default plots in Stata are not the most attractive. Fortunately, you can customize them to your heart’s content. Here is an example where I have changed the background color to white, modified the point colors and changed the axis labels:
. twoway scatter re78 re75, mcolor(blue%50) ///
>     graphregion(color(white)) ///
>     xtitle("Earnings in dollars (1975)") ytitle("Earnings in dollars (1978)")

. graph export "scattermod_re75re78.png", replace
(file scattermod_re75re78.png written in PNG format)
Note the use of ///. This tells Stata that the command continues on the next line of the .do file, so that several lines are read as a single command. This is useful for long commands that would be difficult to read on one line.
To see the full range of options, type help twoway options.
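As a further illustration, the sketch below combines a few more of these options: a hollow marker symbol, a title, comma-formatted tick labels and horizontal y-axis labels. The title text and the particular choices are only examples.

```stata
* Load the data used throughout this tutorial
use "nsw.dta", clear

* Hollow markers, a title, and tick labels with thousands separators
twoway scatter re78 re75, mcolor(blue%50) msymbol(oh) ///
    graphregion(color(white)) ///
    title("Earnings before and after training") ///
    xlabel(, format(%9.0fc)) ///
    ylabel(, format(%9.0fc) angle(horizontal)) ///
    xtitle("1975 earnings") ytitle("1978 earnings")
```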
Often it is desirable to layer plots. This is also very easy in Stata. In the following plot, I layer a scatter plot of re78 on re75 for those with treat=1 on top of the equivalent for those with treat=0.
. twoway (scatter re78 re75 if treat == 1, mcolor(blue%50)) ///
>     (scatter re78 re75 if treat == 0, mcolor(dkorange%50)), ///
>     legend(label(1 "Treatment") label(2 "Control")) ///
>     graphregion(color(white)) ///
>     xtitle("Earnings in dollars (1975)") ytitle("Earnings in dollars (1978)")

. graph export "scattermodcombined_re75re78.png", replace
(file scattermodcombined_re75re78.png written in PNG format)
Note that each layer is enclosed in parentheses. It is possible to set options for each element of the plot, such as mcolor(blue%50), which sets the marker color for the first layer. These options are set within the parentheses for a layer.
It is also possible to set options for the combined plot, such as graphregion(color(white)). These options come at the end of the command.
Above I have shown the most common plots in my experience, but here are a few more that might prove useful:
Regression line plot - lfit
Local polynomial smooth plot - lpoly
Mapping - spmap (user-written; install with ssc install spmap)
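The first two of these can be layered on a scatter plot in the same way as before. A minimal sketch, with colors and legend text chosen only for illustration:

```stata
* Load the data used throughout this tutorial
use "nsw.dta", clear

* Layer 1: raw data; layer 2: linear fit; layer 3: local polynomial smooth
twoway (scatter re78 re75, mcolor(blue%30)) ///
       (lfit re78 re75, lcolor(black)) ///
       (lpoly re78 re75, lcolor(dkorange)), ///
       legend(order(2 "Linear fit" 3 "Local polynomial")) ///
       graphregion(color(white)) ///
       xtitle("1975 earnings") ytitle("1978 earnings")
```

The legend(order()) option keeps only the two fitted lines in the legend, since the scatter layer needs no explanation.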