The purpose of this exercise is to introduce you to some of the basics of visualizing and diagnosing multiple regression models in R, using the car and effects packages.
To work through this exercise in RStudio, open a new R script (say, vismlm-ex1.R) in the editor and work from there, following this document (in another window). You are encouraged to try things yourself, but you can click Code / Hide for individual chunks. When code is shown, the icon at the upper left will copy it to the clipboard. The script for this exercise is contained in exercises/duncan-ex1.R.
It is a good idea to start a script by loading the packages that will be used. We will use the car and effects packages here. Load them into your R workspace.
library(car)
library(effects)
The data used here come from a study by O. D. Duncan (1961) on the relation between the prestige of occupations and the average income and education of those jobs. The variable type classifies the occupations as:

- bc: blue-collar
- wc: white-collar
- prof: professional or managerial

Here are just a few of the observations (using car::some()):
data(Duncan, package="carData")
car::some(Duncan)
## type income education prestige
## architect prof 75 92 90
## minister prof 21 84 87
## dentist prof 80 100 90
## undertaker prof 42 74 57
## physician prof 76 97 97
## electrician bc 47 39 53
## RR.engineer bc 81 28 67
## auto.repairman bc 22 22 26
## bartender bc 16 28 7
## waiter bc 8 32 10
NB: This data set is similar in scope to the carData::Prestige data used in the lecture, but the quantitative variables (income, education, prestige) here are scaled as percentages: the percent of those in a given occupation with a given level of income, education or prestige.
The Duncan data set is contained in the carData package. Read it in with:

library(car)
data(Duncan, package="carData")
Use the car::scatterplotMatrix() function to produce a reasonable plot of prestige, income and education.

scatterplotMatrix(~ prestige + income + education, data=Duncan,
                  id=list(n=2))
There is something unusual about the marginal distributions of the three variables. What do you think is behind this?
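One way to explore this question (a sketch, assuming the Duncan data have been loaded as above) is to compare the quantitative variables across the occupation types; a wide gap between the blue-collar and professional groups would show up as bimodality in the marginal distributions:

```r
# Compare means of income, education, prestige across occupation types.
# A large separation between 'bc' and 'prof' groups can produce the
# bimodal marginal distributions seen in the scatterplot matrix.
data(Duncan, package = "carData")
aggregate(cbind(income, education, prestige) ~ type, data = Duncan, FUN = mean)
```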
The scatterplot matrix above can be made much more useful with some of the many options available. Use help(scatterplotMatrix) (or ?scatterplotMatrix) to see the options.
One enhanced version adds data ellipses (ellipse = list(...)) and suppresses the confidence band for the loess smoother (smooth = list(spread=FALSE, ...)). What can you do to approximate this?

scatterplotMatrix(~ prestige + education + income, data=Duncan,
                  id = list(n=2),
                  regLine = list(method=lm, lty=1, lwd=2, col="black"),
                  smooth = list(smoother=loessLine, spread=FALSE,
                                lty.smooth=1, lwd.smooth=3, col.smooth="red"),
                  ellipse = list(levels=0.68, fill.alpha=0.1))
We have not used the type of occupation in this analysis yet. Use | type in the model formula to condition on occupation type.

scatterplotMatrix(~ prestige + education + income | type,
                  data=Duncan, smooth=FALSE)
Fit the linear model predicting prestige from income and education, and examine its summary:

duncan.mod <- lm(prestige ~ income + education, data=Duncan)
summary(duncan.mod)
##
## Call:
## lm(formula = prestige ~ income + education, data = Duncan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.538 -6.417 0.655 6.605 34.641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.06466 4.27194 -1.420 0.163
## income 0.59873 0.11967 5.003 1.05e-05 ***
## education 0.54583 0.09825 5.555 1.73e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 42 degrees of freedom
## Multiple R-squared: 0.8282, Adjusted R-squared: 0.82
## F-statistic: 101.2 on 2 and 42 DF, p-value: < 2.2e-16
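The summary gives t-tests for each coefficient; as a small follow-up (not part of the printed output above), you can also ask for confidence intervals on the coefficients:

```r
# 95% confidence intervals for the coefficients of the fitted model;
# assumes duncan.mod has been fit as above
confint(duncan.mod)
```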
The standard diagnostic plots for a fitted linear model are produced by plot(duncan.mod). What do they indicate about possible problems with the model?

par(mfrow = c(2,2))
plot(duncan.mod)
The diagnostic plots in the car package are more useful and more flexible. For well-behaved data they should all look unstructured, with no systematic patterns. Try a few of the methods illustrated in the lecture.

residualPlots(duncan.mod, id=list(n=2))
## Test stat Pr(>|Test stat|)
## income -0.1129 0.9107
## education 0.6725 0.5050
## Tukey test -0.0810 0.9355
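The test statistics in this table are curvature tests: for each predictor, the model is refit with that predictor's square added, and the t-statistic for the squared term is reported (Tukey's test uses the squared fitted values instead). A sketch of the income test by hand, assuming duncan.mod has been fit as above:

```r
# Curvature test for income: add a squared term and inspect its t value.
# This should reproduce the 'income' row of the residualPlots() table.
quad <- update(duncan.mod, . ~ . + I(income^2))
summary(quad)$coefficients["I(income^2)", ]
```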
influencePlot(duncan.mod, id=list(n=2))
## StudRes Hat CookD
## minister 3.1345186 0.17305816 0.56637974
## reporter -2.3970224 0.05439356 0.09898456
## conductor -1.7040324 0.19454165 0.22364122
## RR.engineer 0.8089221 0.26908963 0.08096807
In the plot above, which observation is most influential? Which observation has the greatest leverage?
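You can also answer these questions numerically with base-R influence measures (a sketch, assuming duncan.mod from above), and compare the rankings with the influencePlot() table:

```r
# Cook's distance measures overall influence on the fitted coefficients;
# hat values measure leverage (unusualness in the predictor space).
sort(cooks.distance(duncan.mod), decreasing = TRUE)[1:3]  # most influential
sort(hatvalues(duncan.mod), decreasing = TRUE)[1:3]       # highest leverage
```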
In an added-variable plot, the vertical axis, prestige | others, is actually the residual for prestige in the regression on the other variables. The {car} function is avPlots(). Try this.

avPlots(duncan.mod, id=list(n=2))
Several observations stand out in both the influence plot and in the added-variable plots, and were identified in the plots by their labels (via the id=list(n=...) argument). Try to think about why these might be unusual in the context of these data.
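An added-variable plot can be constructed by hand from two auxiliary regressions; by the Frisch-Waugh-Lovell theorem, the slope of the resulting residual-on-residual regression equals the coefficient in the full model. A sketch for income, assuming duncan.mod and the Duncan data as above:

```r
# Residuals of prestige and of income, each regressed on education
e_y <- residuals(lm(prestige ~ education, data = Duncan))
e_x <- residuals(lm(income ~ education, data = Duncan))

# The slope here equals coef(duncan.mod)["income"]
coef(lm(e_y ~ e_x))["e_x"]
```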
Effect plots are similar in spirit to added variable plots, in that they attempt to show the relationship of the outcome to a focal predictor with other predictors controlled. They differ in that, for a given focal predictor (shown in the plot), all other predictors in the model are averaged over by being set equal to their means (or medians).
In the {effects} package, allEffects() calculates the effect values for all predictors in the model; predictorEffects() is similar. Try this.
mod.effects <- allEffects(duncan.mod, residuals=TRUE)
plot(mod.effects, id=list(n=4, col="black"))
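The idea behind these effect values can be sketched by hand with predict(): vary the focal predictor over its range while holding the other predictor fixed at its mean. A minimal illustration (assuming duncan.mod and the Duncan data as above) of what allEffects() computes for income:

```r
# Effect of income on prestige, with education fixed at its mean
grid <- data.frame(
  income    = seq(min(Duncan$income), max(Duncan$income), length.out = 50),
  education = mean(Duncan$education)
)
eff <- predict(duncan.mod, newdata = grid, interval = "confidence")
head(eff)  # fitted values with confidence limits along the income range
```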