1 Regression Output Table Examples


This RMD file was one of the examples you worked through in 380. I’m including it here for a few reasons. First, you can see how RMD files that you previously used to create standalone HTML output can be combined into a book using bookdown. Second, this file loads data using read.csv("data/ceo.csv"), serving as an example of the convention of storing all data in a “data” folder inside the main project folder. Third, this file provides examples of displaying regression output side by side.

Note: Generally it’s a good idea to load your packages in a code chunk at the top rather than mixed in with the rest of the document. However, for this example I am going to load the packages in the same code chunk where I use them so that you can see exactly what package is needed for specific functions.
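
For example, a typical setup chunk at the top of an RMD file might look like the sketch below (the chunk label setup is just a convention, and include=FALSE hides the chunk and its output in the knit document). These are the packages used in this chapter.

```{r setup, include=FALSE}
library(stargazer)  # summary statistics and regression tables
library(knitr)      # kable() for tables
library(broom)      # tidy() for model output
library(multcomp)   # glht() for linear hypothesis tests
library(ggplot2)    # histograms and scatter plots
```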

The dataset ceo.csv has the following variables:

| Variable | Description |
|----------|-------------|
| salary   | 1990 compensation in dollars ($) |
| age      | age of CEO in years |
| comten   | years CEO with company |
| ceoten   | years as CEO with company |
| sales    | 1990 firm sales in $millions |
| profits  | 1990 profits in $millions |
| mktval   | 1990 market value in $millions |

1.1 Summary Statistics

The stargazer package provides an easy way to display summary statistics. You need the results='asis' code chunk option so that the HTML table renders properly in the knit output file. The warning=FALSE and message=FALSE options suppress some of the messages stargazer displays by default. One limitation of the table output is that it doesn’t leave much space horizontally; this can be controlled with HTML cell-padding options for tables.
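
In the RMD source, these options go in the chunk header. The chunk that produces the table below would begin with a line like this (the label summarystats is just an illustrative name):

```{r summarystats, results='asis', warning=FALSE, message=FALSE}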

library(stargazer)
mydata <- read.csv("data/ceo.csv")
stargazer(mydata, type = "html", summary.stat = c("n", "mean", "sd", "min", "median", "max"))
| Statistic | N | Mean | St. Dev. | Min | Median | Max |
|-----------|---|------|----------|-----|--------|-----|
| salary  | 177 | 865,864.400 | 587,589.300 | 100,000 | 707,000 | 5,299,000 |
| age     | 177 | 56.429 | 8.422 | 33 | 57 | 86 |
| comten  | 177 | 22.503 | 12.295 | 2 | 23 | 58 |
| ceoten  | 177 | 7.955 | 7.151 | 0 | 6 | 37 |
| sales   | 177 | 3,529.463 | 6,088.654 | 29 | 1,400 | 51,300 |
| profits | 177 | 207.831 | 404.454 | -463 | 63 | 2,700 |
| mktval  | 177 | 3,600.316 | 6,442.276 | 387 | 1,200 | 45,400 |

1.2 Model comparisons

Consider four models:

model1 <- lm(salary ~ sales + mktval + profits, data = mydata)
model2 <- lm(salary ~ sales + mktval + profits + age, data = mydata)
model3 <- lm(salary ~ sales + mktval + profits + ceoten, data = mydata)
model4 <- lm(salary ~ sales + mktval + profits + ceoten + age, data = mydata)

One package that allows you to report the results for all four models side by side is stargazer. This produces a table similar to the esttab output in Stata. Remember that each column is a separate model: if a variable does not have a coefficient displayed in a column, it was not included as an explanatory variable in that model.

stargazer(model1, model2, model3, model4, type = "html", report = "vc*p")

Dependent variable: salary

|  | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| sales | 15.984 (p = 0.152) | 15.369 (p = 0.169) | 18.233* (p = 0.100) | 18.061 (p = 0.106) |
| mktval | 23.831 (p = 0.137) | 23.842 (p = 0.137) | 21.157 (p = 0.183) | 21.231 (p = 0.183) |
| profits | 31.703 (p = 0.909) | 28.100 (p = 0.920) | 48.636 (p = 0.859) | 47.528 (p = 0.863) |
| age |  | 4,528.026 (p = 0.353) |  | 823.966 (p = 0.873) |
| ceoten |  |  | 12,730.860** (p = 0.026) | 12,390.430** (p = 0.042) |
| Constant | 717,062.400*** (p = 0.000) | 464,424.700* (p = 0.093) | 613,958.800*** (p = 0.000) | 570,743.300** (p = 0.042) |
| Observations | 177 | 177 | 177 | 177 |
| R2 | 0.178 | 0.182 | 0.201 | 0.202 |
| Adjusted R2 | 0.163 | 0.163 | 0.183 | 0.178 |
| Residual Std. Error | 537,420.300 (df = 173) | 537,621.100 (df = 172) | 531,159.300 (df = 172) | 532,670.100 (df = 171) |
| F Statistic | 12.464*** (df = 3; 173) | 9.559*** (df = 4; 172) | 10.846*** (df = 4; 172) | 8.633*** (df = 5; 171) |

Note: *p<0.1; **p<0.05; ***p<0.01

1.2.1 Adjusted R-Squared

Recall the definition of \(R^2\): \[ R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} \] where \(SSE\) is the explained sum of squares, \(SSR\) is the residual sum of squares, and \(SST\) is the total sum of squares.

The denominator measures the total variation in the \(y\) variable: \(SST = (n-1)Var(y)\). Thus, it has nothing to do with the explanatory variables, and adding additional \(x\) variables does not affect \(SST\). Adding an additional \(x\) variable cannot decrease how much of the variation in \(y\) is explained by the model, so \(SSE\) cannot decrease; usually it increases at least a little. Thus, adding an additional \(x\) variable cannot decrease \(R^2\), and it usually increases it at least a little. This means that an increase in \(R^2\) is not a good justification for adding an additional \(x\) variable to the model.

Adjusted \(R^2\), often denoted \(\bar{R}^2\), penalizes you for adding an additional \(x\) variable: it increases only if the new variable improves the fit by enough to offset the penalty (when adding a single variable, only if its \(t\) statistic is greater than 1 in absolute value). Adjusted \(R^2\) is defined as

\[ \bar{R}^2 = 1 - \frac{\left(\frac{SSR}{n-k-1}\right)}{\left(\frac{SST}{n-1}\right)} \]
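
To see the formula in action, here is a minimal check, assuming the models above have been estimated, that computes \(\bar{R}^2\) for model2 by hand and compares it to the value R reports:

n <- nobs(model2)                                     # 177 observations
k <- length(coef(model2)) - 1                         # number of x variables (excluding intercept)
SSR <- sum(resid(model2)^2)                           # residual sum of squares
SST <- sum((mydata$salary - mean(mydata$salary))^2)   # total sum of squares
1 - (SSR / (n - k - 1)) / (SST / (n - 1))
summary(model2)$adj.r.squared                         # should match the line above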

Look at the models above. All four models include measures of the company: sales, market value, and profits. Models 2-4 add variables measuring characteristics of the CEO. Compare models 1 and 2: adding the CEO’s age increases \(R^2\), but adjusted \(R^2\) does not increase, indicating that adding age does not improve the model. Comparing models 1 and 3, both \(R^2\) and adjusted \(R^2\) increase when adding the CEO’s tenure, indicating this variable does improve the model. Comparing models 3 and 4, we again see that adding the CEO’s age does not improve the model: \(R^2\) increases slightly but adjusted \(R^2\) goes down.
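
As a quick way to verify these comparisons, you can extract both statistics from all four models at once:

models <- list(model1 = model1, model2 = model2, model3 = model3, model4 = model4)
sapply(models, function(m) c(R2 = summary(m)$r.squared, adjR2 = summary(m)$adj.r.squared))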

1.3 Test Linear Combinations of Parameters

Consider again Model 3. I’ll display the estimates again here to demonstrate using kable() from knitr together with tidy() from broom.

library(knitr)
library(broom)
model3 <- lm(salary ~ sales + mktval + profits + ceoten, data = mydata)
kable(tidy(model3), digits = 2)

| term | estimate | std.error | statistic | p.value |
|------|----------|-----------|-----------|---------|
| (Intercept) | 613958.78 | 65486.13 | 9.38 | 0.00 |
| sales | 18.23 | 11.01 | 1.66 | 0.10 |
| mktval | 21.16 | 15.79 | 1.34 | 0.18 |
| profits | 48.64 | 273.37 | 0.18 | 0.86 |
| ceoten | 12730.86 | 5635.96 | 2.26 | 0.03 |

It looks like the coefficient on profits (48.64) is larger than the coefficient on sales (18.23), but is this difference statistically significant? The test statistic is

\[ t=\frac{\left(\hat{\beta}_{profit}-\hat{\beta}_{sales}\right)-0}{se\left(\hat{\beta}_{profit}-\hat{\beta}_{sales}\right)} \]

In R, this test can be implemented using the glht() function from the multcomp package:

library(multcomp)
## We need to know the coefficient names so we can reference them
names(coef(model3))
## [1] "(Intercept)" "sales"       "mktval"      "profits"     "ceoten"
testOfDif <- glht(model3, linfct = c("profits - sales = 0"))
summary(testOfDif)
## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = salary ~ sales + mktval + profits + ceoten, data = mydata)
## 
## Linear Hypotheses:
##                      Estimate Std. Error t value Pr(>|t|)
## profits - sales == 0     30.4      278.0   0.109    0.913
## (Adjusted p values reported -- single-step method)
confint(testOfDif)
## 
## 	 Simultaneous Confidence Intervals
## 
## Fit: lm(formula = salary ~ sales + mktval + profits + ceoten, data = mydata)
## 
## Quantile = 1.9739
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                      Estimate  lwr       upr      
## profits - sales == 0   30.4026 -518.2618  579.0671
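
The standard error in the denominator accounts for the covariance between the two estimates: \(Var(\hat{\beta}_{profit}-\hat{\beta}_{sales}) = Var(\hat{\beta}_{profit}) + Var(\hat{\beta}_{sales}) - 2Cov(\hat{\beta}_{profit},\hat{\beta}_{sales})\). As a check on what glht() is doing, you can compute the test statistic by hand from the coefficient variance-covariance matrix:

V <- vcov(model3)    # variance-covariance matrix of the coefficient estimates
b <- coef(model3)
est <- b["profits"] - b["sales"]
se <- sqrt(V["profits", "profits"] + V["sales", "sales"] - 2 * V["profits", "sales"])
est / se             # matches the t value reported by glht() above (about 0.109)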

What about ceoten compared to sales?

testOfDif2 <- glht(model3, linfct = c("ceoten - sales = 0"))
summary(testOfDif2)
## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = salary ~ sales + mktval + profits + ceoten, data = mydata)
## 
## Linear Hypotheses:
##                     Estimate Std. Error t value Pr(>|t|)  
## ceoten - sales == 0    12713       5635   2.256   0.0253 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(testOfDif2)
## 
## 	 Simultaneous Confidence Intervals
## 
## Fit: lm(formula = salary ~ sales + mktval + profits + ceoten, data = mydata)
## 
## Quantile = 1.9739
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                     Estimate   lwr        upr       
## ceoten - sales == 0 12712.6292  1590.0204 23835.2380

1.4 Log transformations

This dataset provides a good example of why visualizing the data can be helpful. By looking at histograms and scatter plots, you can see the effects of log transformations.

1.4.1 Histograms

Try making histograms of salary and log(salary).

Try making histograms of sales and log(sales).

First, here is a histogram of sales.

library(ggplot2)
ggplot(mydata, aes(x = sales)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adding fill and outline colors makes the histogram easier to read:

ggplot(mydata, aes(x = sales)) + geom_histogram(fill = "skyblue", color = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This distribution is very skewed. Here is the histogram of \(log(sales)\):

mydata$logsales <- log(mydata$sales)
ggplot(mydata, aes(x = logsales)) + geom_histogram(fill = "skyblue", color = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We notice a similar pattern with salary.

ggplot(mydata, aes(x = salary)) + geom_histogram(fill = "skyblue", color = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mydata$logsalary <- log(mydata$salary)
ggplot(mydata, aes(x = logsalary)) + geom_histogram(fill = "skyblue", color = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.4.2 Scatter plots

You can also see how log transformations spread out the data by looking at scatter plots of salary or log(salary) (on the y axis) versus sales or log(sales) (on the x axis), corresponding to the four different functional forms. I colored the points based on age just as an example; the coloring has nothing to do with the log transformations.

ggplot(mydata, aes(y = salary, x = sales, col = age)) + geom_point() + scale_colour_gradientn(colours = rainbow(3)) + geom_smooth(method = 'lm', se = FALSE, col = 'black')
## `geom_smooth()` using formula 'y ~ x'

ggplot(mydata, aes(y = salary, x = logsales, col = age)) + geom_point() + scale_colour_gradientn(colours = rainbow(3)) + geom_smooth(method = 'lm', se = FALSE, col = 'black')
## `geom_smooth()` using formula 'y ~ x'

ggplot(mydata, aes(y = logsalary, x = sales, col = age)) + geom_point() + scale_colour_gradientn(colours = rainbow(3)) + geom_smooth(method = 'lm', se = FALSE, col = 'black')
## `geom_smooth()` using formula 'y ~ x'

ggplot(mydata, aes(y = logsalary, x = logsales, col = age)) + geom_point() + scale_colour_gradientn(colours = rainbow(3)) + geom_smooth(method = 'lm', se = FALSE, col = 'black')
## `geom_smooth()` using formula 'y ~ x'

Finally, we can estimate model 4 under all four functional forms (level-level, log-level, level-log, and log-log) and compare them side by side:

model4 <- lm(salary ~ sales + mktval + profits + ceoten + age, data = mydata)
model4loglevel <- lm(logsalary ~ sales + mktval + profits + ceoten + age, data = mydata)
model4levellog <- lm(salary ~ logsales + mktval + profits + ceoten + age, data = mydata)
model4loglog <- lm(logsalary ~ logsales + mktval + profits + ceoten + age, data = mydata)
stargazer(model4, model4loglevel, model4levellog, model4loglog, type = "html", report = "vc*p")

| Dependent variable: | (1) salary | (2) logsalary | (3) salary | (4) logsalary |
|---|---|---|---|---|
| sales | 18.061 (p = 0.106) | 0.00003** (p = 0.022) |  |  |
| logsales |  |  | 126,644.400*** (p = 0.0004) | 0.200*** (p = 0.00000) |
| mktval | 21.231 (p = 0.183) | 0.00002 (p = 0.280) | 19.547 (p = 0.205) | 0.00001 (p = 0.330) |
| profits | 47.528 (p = 0.863) | 0.00002 (p = 0.932) | 23.633 (p = 0.925) | -0.00003 (p = 0.894) |
| ceoten | 12,390.430** (p = 0.042) | 0.011* (p = 0.066) | 13,284.340** (p = 0.025) | 0.013** (p = 0.025) |
| age | 823.966 (p = 0.873) | -0.001 (p = 0.880) | -1,679.167 (p = 0.740) | -0.005 (p = 0.332) |
| Constant | 570,743.300** (p = 0.042) | 13.282*** (p = 0.000) | -136,111.300 (p = 0.680) | 12.171*** (p = 0.000) |
| Observations | 177 | 177 | 177 | 177 |
| R2 | 0.202 | 0.206 | 0.247 | 0.316 |
| Adjusted R2 | 0.178 | 0.182 | 0.225 | 0.296 |
| Residual Std. Error (df = 171) | 532,670.100 | 0.548 | 517,256.700 | 0.509 |
| F Statistic (df = 5; 171) | 8.633*** | 8.853*** | 11.223*** | 15.789*** |

Note: *p<0.1; **p<0.05; ***p<0.01
For comparison, here are the same four functional forms with sales as the only explanatory variable:

model0 <- lm(salary ~ sales, data = mydata)
model0loglevel <- lm(logsalary ~ sales, data = mydata)
model0levellog <- lm(salary ~ logsales, data = mydata)
model0loglog <- lm(logsalary ~ logsales, data = mydata)
stargazer(model0, model0loglevel, model0levellog, model0loglog, type = "html", report = "vc*p")

| Dependent variable: | (1) salary | (2) logsalary | (3) salary | (4) logsalary |
|---|---|---|---|---|
| sales | 36.694*** (p = 0.00000) | 0.00004*** (p = 0.000) |  |  |
| logsales |  |  | 177,149.100*** (p = 0.000) | 0.224*** (p = 0.000) |
| Constant | 736,355.200*** (p = 0.000) | 13.347*** (p = 0.000) | -415,105.000** (p = 0.046) | 11.869*** (p = 0.000) |
| Observations | 177 | 177 | 177 | 177 |
| R2 | 0.145 | 0.168 | 0.186 | 0.281 |
| Adjusted R2 | 0.140 | 0.163 | 0.182 | 0.277 |
| Residual Std. Error (df = 175) | 545,008.600 | 0.554 | 531,513.300 | 0.515 |
| F Statistic (df = 1; 175) | 29.576*** | 35.327*** | 40.096*** | 68.345*** |

Note: *p<0.1; **p<0.05; ***p<0.01

In the log-log specification, the coefficient on logsales is an elasticity: a 1% increase in sales is associated with roughly a 0.22% increase in salary.