The car
and effects
packages are the main workhorses here.
library("car")
library("effects") # effect plots
library("arm") # coefplots
data(Prestige, package="carData")
type
factorHere it is useful to recognize that “bc” < “wc” < “prof”. Do so by making Prestige$type
an ordered factor. Note that the occupation names are the row.names
in the data.frame, not a separate variable.
Prestige$type <- ordered(Prestige$type, levels=c("bc", "wc", "prof")) # reorder levels
some(Prestige, 6)
## education income women prestige census type
## librarians 14.15 6112 77.10 58.1 2351 prof
## university.teachers 15.97 12480 19.59 84.6 2711 prof
## veterinarians 15.94 14558 4.32 66.7 3115 prof
## travel.clerks 11.43 6259 39.17 35.7 4193 wc
## bartenders 8.50 3930 15.51 20.2 6123 bc
## taxi.drivers 7.93 4224 3.59 25.1 9173 bc
Let’s get a visual overview of the data in a scatterplot matrix. I use options to specify a linear regression line (regLine=
), a loess smoothed curve (smooth=
) and a 68% (~ \(\bar{y} \pm 1 sd\)) data ellipse (ellipse=
) in each panel.
scatterplotMatrix(~ prestige + education + income + women ,
data=Prestige,
regLine = list(method=lm, lty=1, lwd=2, col="black"),
smooth=list(smoother=loessLine, spread=FALSE,
lty.smooth=1, lwd.smooth=3, col.smooth="red"),
ellipse=list(levels=0.68, fill.alpha=0.1))
car::scatterplot
has many options to add informative annotations, label unusual points, etc. This plot shows clearly that income is positively skewed and the relation with prestige
is non-linear. The id =
argument allows to identify unusual observations with their labels
scatterplot(prestige ~ income, data=Prestige,
pch = 16,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, col.smooth = "black", lwd.smooth=3,
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, col="black", cex=1.2))
## general.managers lawyers ministers physicians
## 2 17 20 24
Rather than actually transforming income
, the argument log = "x"
says to plot income
on a log scale
scatterplot(prestige ~ income, data=Prestige,
log = "x",
pch = 16,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, col.smooth = "black", lwd.smooth=3,
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, col="black", cex=1.2))
## general.managers ministers newsboys babysitters
## 2 20 53 63
The formula notation ~ income | type
says to do plot annotations conditional on occupation type. Oh, now that has a very different interpretation!
scatterplot(prestige ~ income | type, data=Prestige,
col = c("blue", "red", "darkgreen"),
pch = 15:17,
legend = list(coords="bottomright"),
smooth=list(smoother=loessLine,
var=FALSE, span=1, lwd=4))
data(Prestige)
mod0 <- lm(prestige ~ education + income + women + type,
data=Prestige)
summary(mod0)
##
## Call:
## lm(formula = prestige ~ education + income + women + type, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.748 -4.482 0.312 5.248 18.498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.813903 5.331156 -0.15 0.87899
## education 3.662356 0.645830 5.67 1.6e-07 ***
## income 0.001043 0.000262 3.98 0.00014 ***
## women 0.006443 0.030378 0.21 0.83249
## typeprof 5.905197 3.937700 1.50 0.13713
## typewc -2.917072 2.665396 -1.09 0.27663
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.13 on 92 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.835, Adjusted R-squared: 0.826
## F-statistic: 93.1 on 5 and 92 DF, p-value: <2e-16
Here we model income as log(income)`` and allow an interaction with occupation
type`.
mod1 <- lm(prestige ~ education + women +
log(income)*type, data=Prestige)
summary(mod1)
##
## Call:
## lm(formula = prestige ~ education + women + log(income) * type,
## data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.814 -4.685 0.656 3.966 16.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -152.2059 23.2499 -6.55 3.5e-09 ***
## education 2.9282 0.5883 4.98 3.1e-06 ***
## women 0.0883 0.0323 2.73 0.0076 **
## log(income) 18.9819 2.8285 6.71 1.7e-09 ***
## typeprof 85.2642 30.4582 2.80 0.0063 **
## typewc 29.4133 36.5075 0.81 0.4226
## log(income):typeprof -9.0124 3.4102 -2.64 0.0097 **
## log(income):typewc -3.8334 4.2603 -0.90 0.3706
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.27 on 90 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.875, Adjusted R-squared: 0.865
## F-statistic: 90.1 on 7 and 90 DF, p-value: <2e-16
Plots of coefficients with CI often more informative that tables of coefficients. These intervals are shown for \(\pm 1, \pm 2\) standard errors.
However, this plots the raw coefficients, which are on different scales, so the small coefficients for income
do not appear to be significant. It would be better if there was an option to plot standardized (\(\beta\)) coefficients.
arm::coefplot(mod0, col.pts="red", cex.pts=1.5)
arm::coefplot(mod1, add=TRUE, col.pts="blue", cex.pts=1.5)
This anova
call tests whether mod1
is a significant improvement over the baseline mod0
anova(mod0, mod1)
## Analysis of Variance Table
##
## Model 1: prestige ~ education + income + women + type
## Model 2: prestige ~ education + women + log(income) * type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 92 4679
## 2 90 3541 2 1138 14.5 3.6e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Effect of education
averaged over others
mod1.e1 <- predictorEffect("education", mod1)
plot(mod1.e1)
Effect of education
, but showing partial residuals. In some cases, this will help spot influential cases in the partial relation of a predictor to the response.
mod1.e1a <- predictorEffect("education", mod1, residuals=TRUE)
plot(mod1.e1a,
residuals.pch=16, id=list(n=4, col="black"))
Effect of women
averaged over others. The blue curve shows the fitted quadratic relation in the model. The red curve shows a non-parametric smoooth.
mod1.e2 <- predictorEffect("women", mod1, residuals=TRUE)
plot(mod1.e2, ylim=c(40, 65), lwd=4,
residuals.pch=16)
Effect of income
(by type
) averaged over others
plot(predictorEffect("income", mod1),
lines=list(multiline=TRUE, lwd=3),
key.args = list(x=.7, y=.35),
)
In base graphics, plot()
for an lm
object gives the “regression quartet” of diagnostic plots.
op <- par(mfrow=c(2,2))
plot(mod1, lwd=2, cex.lab=1.4)
car::qqPlot
Overall plot of residuals for mod1
qqPlot(mod1, pch=16)
## medical.technicians electronic.workers
## 31 82
QQ plots for individual variables and/or subsets
qqPlot(~ income, data=Prestige, subset = type == "prof")
## general.managers physicians
## 2 24
qqPlot(income ~ type, data=Prestige, layout=c(1, 3))
car::influencePlot(mod1, id=list(n=2, cex=1.2), cex.lab=1.4)
## StudRes Hat CookD
## general.managers -0.2834 0.28128 0.003969
## ministers 2.2992 0.19369 0.151516
## medical.technicians 2.8757 0.08799 0.092273
## farm.workers 0.9930 0.25501 0.042201
## electronic.workers 2.3233 0.08454 0.059407