The purpose of this exercise is to introduce you to some of the basics of visualizing and diagnosing multiple regression models in R, using the car and effects packages.
To work through this exercise in RStudio, open a new R script (say, vismlm-ex1.R) in the editor and work from there, following this document (in another window). You are encouraged to try things yourself, but you can click Code / Hide for individual chunks. When code is shown, the icon at the upper left will copy it to the clipboard. The script for this exercise is contained in exercises/duncan-ex1.R.
It is a good idea to start a script by loading the packages that will be used. We will use the car and effects packages here. Load them into your R workspace.
library(car)
library(effects)
The data used here come from a study by O. D. Duncan (1961) on the relation between the prestige of occupations and the average income and education of those jobs. The variable type classifies the occupations as:

- bc: blue-collar
- wc: white-collar
- prof: professional or managerial

Here are just a few of the observations (using car::some()):
data(Duncan, package="carData")
car::some(Duncan)
## type income education prestige
## architect prof 75 92 90
## minister prof 21 84 87
## dentist prof 80 100 90
## undertaker prof 42 74 57
## physician prof 76 97 97
## electrician bc 47 39 53
## RR.engineer bc 81 28 67
## auto.repairman bc 22 22 26
## bartender bc 16 28 7
## waiter bc 8 32 10
NB: This data set is similar in scope to the carData::Prestige data used in the lecture, but the quantitative variables (income, education, prestige) here are scaled as percentages: the percent of those in a given occupation with a given level of income, education or prestige.
The Duncan data set is contained in the carData package. Read it in with:

library(car)
data(Duncan, package="carData")
Use the car::scatterplotMatrix() function to produce a reasonable plot of prestige, income and education.

scatterplotMatrix(~ prestige + income + education, data=Duncan,
                  id=list(n=2))
There is something unusual about the marginal distributions of the three variables. What do you think is behind this?
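One way to explore this question (a sketch, assuming the Duncan data have been loaded as above) is to compare the quantitative variables across the occupation types; a wide gap between the blue-collar and professional groups would show up as bimodality in the marginal distributions:

```r
# Compare means of income, education, prestige across occupation types.
# A large separation between 'bc' and 'prof' groups can produce the
# bimodal marginal distributions seen in the scatterplot matrix.
data(Duncan, package = "carData")
aggregate(cbind(income, education, prestige) ~ type, data = Duncan, FUN = mean)
```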
The scatterplot matrix above can be made much more useful with some of the many options available. Use help(scatterplotMatrix) (or ?scatterplotMatrix) to see the options.
One enhanced version adds data ellipses (ellipse = list(...)) and suppresses the confidence band for the loess smoother (smooth = list(spread=FALSE, ...)). What can you do to approximate this?

scatterplotMatrix(~ prestige + education + income, data=Duncan,
                  id = list(n=2),
                  regLine = list(method=lm, lty=1, lwd=2, col="black"),
                  smooth = list(smoother=loessLine, spread=FALSE,
                                lty.smooth=1, lwd.smooth=3, col.smooth="red"),
                  ellipse = list(levels=0.68, fill.alpha=0.1))
We have not used the type of occupation in this analysis yet. Use | type in the model formula to condition on occupation type.

scatterplotMatrix(~ prestige + education + income | type,
                  data=Duncan, smooth=FALSE)
Fit the linear model predicting prestige from income and education, and examine its summary:

duncan.mod <- lm(prestige ~ income + education, data=Duncan)
summary(duncan.mod)
##
## Call:
## lm(formula = prestige ~ income + education, data = Duncan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.538 -6.417 0.655 6.605 34.641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.06466 4.27194 -1.420 0.163
## income 0.59873 0.11967 5.003 1.05e-05 ***
## education 0.54583 0.09825 5.555 1.73e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 42 degrees of freedom
## Multiple R-squared: 0.8282, Adjusted R-squared: 0.82
## F-statistic: 101.2 on 2 and 42 DF, p-value: < 2.2e-16
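The summary gives t-tests for each coefficient; as a small follow-up (not part of the printed output above), you can also ask for confidence intervals on the coefficients:

```r
# 95% confidence intervals for the coefficients of the fitted model;
# assumes duncan.mod has been fit as above
confint(duncan.mod)
```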
The standard diagnostic plots for a fitted linear model are produced by plot(duncan.mod). What do they indicate about possible problems with the model?

par(mfrow = c(2,2))
plot(duncan.mod)
The diagnostic plots in the car package are more useful and more flexible. For well-behaved data they should all look unstructured, with no systematic patterns. Try a few of the methods illustrated in the lecture.

residualPlots(duncan.mod, id=list(n=2))
## Test stat Pr(>|Test stat|)
## income -0.1129 0.9107
## education 0.6725 0.5050
## Tukey test -0.0810 0.9355
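The test statistics in this table are curvature tests: for each predictor, the model is refit with that predictor's square added, and the t-statistic for the squared term is reported (Tukey's test uses the squared fitted values instead). A sketch of the income test by hand, assuming duncan.mod has been fit as above:

```r
# Curvature test for income: add a squared term and inspect its t value.
# This should reproduce the 'income' row of the residualPlots() table.
quad <- update(duncan.mod, . ~ . + I(income^2))
summary(quad)$coefficients["I(income^2)", ]
```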
influencePlot(duncan.mod, id=list(n=2))
## StudRes Hat CookD
## minister 3.1345186 0.17305816 0.56637974
## reporter -2.3970224 0.05439356 0.09898456
## conductor -1.7040324 0.19454165 0.22364122
## RR.engineer 0.8089221 0.26908963 0.08096807
In the plot above, which observation is most influential? Which observation has the greatest leverage?
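You can also answer these questions numerically with base-R influence measures (a sketch, assuming duncan.mod from above), and compare the rankings with the influencePlot() table:

```r
# Cook's distance measures overall influence on the fitted coefficients;
# hat values measure leverage (unusualness in the predictor space).
sort(cooks.distance(duncan.mod), decreasing = TRUE)[1:3]  # most influential
sort(hatvalues(duncan.mod), decreasing = TRUE)[1:3]       # highest leverage
```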
In an added-variable plot, the vertical axis, prestige | others, is actually the residual for prestige in the regression on the other variables. The {car} function is avPlots(). Try this.

avPlots(duncan.mod, id=list(n=2))
Several observations stand out in both the influence plot and in the added-variable plots, and were identified in the plots by their labels (via the id=list(n=...) argument). Try to think about why these might be unusual in the context of these data.
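An added-variable plot can be constructed by hand from two auxiliary regressions; by the Frisch-Waugh-Lovell theorem, the slope of the resulting residual-on-residual regression equals the coefficient in the full model. A sketch for income, assuming duncan.mod and the Duncan data as above:

```r
# Residuals of prestige and of income, each regressed on education
e_y <- residuals(lm(prestige ~ education, data = Duncan))
e_x <- residuals(lm(income ~ education, data = Duncan))

# The slope here equals coef(duncan.mod)["income"]
coef(lm(e_y ~ e_x))["e_x"]
```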
Effect plots are similar in spirit to added variable plots, in that they attempt to show the relationship of the outcome to a focal predictor with other predictors controlled. They differ in that, for a given focal predictor (shown in the plot), all other predictors in the model are averaged over by being set equal to their means (or medians).
In the {effects} package, allEffects() calculates the effect values for all predictors in the model; predictorEffects() is similar. Try this.
mod.effects <- allEffects(duncan.mod, residuals=TRUE)
plot(mod.effects, id=list(n=4, col="black"))
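The idea behind these effect values can be sketched by hand with predict(): vary the focal predictor over its range while holding the other predictor fixed at its mean. A minimal illustration (assuming duncan.mod and the Duncan data as above) of what allEffects() computes for income:

```r
# Effect of income on prestige, with education fixed at its mean
grid <- data.frame(
  income    = seq(min(Duncan$income), max(Duncan$income), length.out = 50),
  education = mean(Duncan$education)
)
eff <- predict(duncan.mod, newdata = grid, interval = "confidence")
head(eff)  # fitted values with confidence limits along the income range
```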