Coefficient plots are often more useful than tables, but plotting raw coefficients can be misleading when the predictors are on different scales.
The arm
and modelsummary
packages are used to illustrate these plots and to uncover some other problems with naive use of coefficient plots. I compare plotting raw, standardized, and sensibly scaled coefficients.
library(arm) # coefplot
library(modelsummary) # modelplot
library(dplyr)
library(ggplot2)
How does occupational prestige depend on education (years), income ($), %women, …?
I use a classic example from the carData
package. The data come from the 1971 Canadian census; the observations are averages for occupational categories rather than individuals.
data(Prestige, package="carData")
head(Prestige, 5)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
mod0 <- lm(prestige ~ education + income + women,
data=Prestige)
summary(mod0)
##
## Call:
## lm(formula = prestige ~ education + income + women, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.825 -5.333 -0.136 5.159 17.504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.794334 3.239089 -2.10 0.039 *
## education 4.186637 0.388701 10.77 < 2e-16 ***
## income 0.001314 0.000278 4.73 7.6e-06 ***
## women -0.008905 0.030407 -0.29 0.770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.85 on 98 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.792
## F-statistic: 129 on 3 and 98 DF, p-value: <2e-16
The plots below show default coefficient plots for this model.
They are disappointing and misleading because they show the raw coefficients. It looks as though only education
has a non-zero effect. The comparison is misleading because the predictors are on different scales: it makes little sense to compare the change in prestige
for a 1-year increase in education
with the change for a $1 increase in income
.
modelplot(mod0, coef_omit="Intercept", color="blue", size=1) +
labs(title="Raw coefficients for mod0")
arm::coefplot(mod0, col.pts="red", cex.pts=1.5)
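To see what these functions are doing under the hood, here is a hand-rolled version of the same plot built from coef() and confint() with ggplot2. This is just a sketch, assuming the carData package is installed; it refits mod0 so the chunk is self-contained.

```r
library(ggplot2)

# Refit the model so this chunk stands alone
data(Prestige, package = "carData")
mod0 <- lm(prestige ~ education + income + women, data = Prestige)

# Assemble estimates and 95% confidence limits into a data frame
cc <- data.frame(term = names(coef(mod0)),
                 estimate = coef(mod0),
                 confint(mod0))
names(cc)[3:4] <- c("lo", "hi")
cc <- subset(cc, term != "(Intercept)")

# Dot-and-whisker plot, with a reference line at zero
ggplot(cc, aes(x = estimate, y = term)) +
  geom_point(color = "blue", size = 2) +
  geom_errorbarh(aes(xmin = lo, xmax = hi), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed")
```

The same raw-scale problem is visible here: the interval for income is invisibly narrow next to that for education.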
Now model income as log(income)
and add an interaction with type
of occupation.
mod1 <- lm(prestige ~ education + women +
log(income)*type, data=Prestige)
summary(mod1)
##
## Call:
## lm(formula = prestige ~ education + women + log(income) * type,
## data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.814 -4.685 0.656 3.966 16.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -152.2059 23.2499 -6.55 3.5e-09 ***
## education 2.9282 0.5883 4.98 3.1e-06 ***
## women 0.0883 0.0323 2.73 0.0076 **
## log(income) 18.9819 2.8285 6.71 1.7e-09 ***
## typeprof 85.2642 30.4582 2.80 0.0063 **
## typewc 29.4133 36.5075 0.81 0.4226
## log(income):typeprof -9.0124 3.4102 -2.64 0.0097 **
## log(income):typewc -3.8334 4.2603 -0.90 0.3706
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.27 on 90 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.875, Adjusted R-squared: 0.865
## F-statistic: 90.1 on 7 and 90 DF, p-value: <2e-16
Several packages allow multiple models to be compared in the same plot.
arm::coefplot(mod0, col.pts="red", cex.pts=1.5)
arm::coefplot(mod1, add=TRUE, col.pts="blue", cex.pts=1.5)
But WAIT: income
was entered as log(income)
in mod1
, so the income coefficients in the two models measure entirely different things. It was foolish to try to compare them, but this should have raised an error.
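A minimal sanity check before overlaying two coefficient plots is to compare the models' coefficient names. This sketch assumes the carData package is installed and refits both models so it stands alone:

```r
data(Prestige, package = "carData")
mod0 <- lm(prestige ~ education + income + women, data = Prestige)
mod1 <- lm(prestige ~ education + women + log(income) * type, data = Prestige)

# Which terms do the two models actually share?
common <- intersect(names(coef(mod0)), names(coef(mod1)))
only_0 <- setdiff(names(coef(mod0)), names(coef(mod1)))

common   # "(Intercept)" "education" "women"
only_0   # "income" -- not shared, since mod1 uses log(income)
```

Only three of mod0's four coefficients appear in mod1 at all, so an overlaid plot silently compares apples to oranges for the rest.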
An alternative is to present standardized coefficients, interpreted as the change in prestige, in standard deviation units, for a one standard deviation change in a predictor.
One way to do this is to transform all variables to standardized (\(z\)) scores. The dplyr syntax below uses scale
to transform all the numeric variables.
Prestige_std <- Prestige %>%
as_tibble() %>%
mutate(across(where(is.numeric), scale))
This accords better with expectations: in the standardized fit, income
and education
both have large, significant coefficients compared with the effect of women
.
But a nagging doubt remains: How can I interpret the numerical size of coefficients?
mod0_std <- lm(prestige ~ education + income + women,
data=Prestige_std)
mod1_std <- lm(prestige ~ education + women +
log(income)*type,
data=Prestige_std)
#summary(mod1_std)
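The interpretation question has a simple algebraic answer: each standardized slope is the raw slope rescaled by sd(x)/sd(y). A quick check of that identity, as a sketch assuming the carData package is installed (using scale() directly in the formula, which is equivalent to the Prestige_std approach for these terms):

```r
data(Prestige, package = "carData")

# Raw and standardized fits of the same model
mod0 <- lm(prestige ~ education + income + women, data = Prestige)
mod0_std <- lm(scale(prestige) ~ scale(education) + scale(income) + scale(women),
               data = Prestige)

# Standardized slope = raw slope * sd(x) / sd(y)
b_raw <- coef(mod0)["education"]
b_std <- coef(mod0_std)[2]          # slope for scale(education)
ratio <- sd(Prestige$education) / sd(Prestige$prestige)
all.equal(unname(b_std), unname(b_raw * ratio))   # TRUE
```

So a standardized coefficient of about 0.4 for education means a one-SD increase in education predicts about 0.4 SD more prestige, holding the other predictors fixed.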
These plots look more like what I was expecting: they show the relative magnitudes of the coefficients and their CIs, so one can see which differ from zero.
modelplot(mod0_std, coef_omit="Intercept", color="blue", size=1) +
labs(title="Standardized coefficients for mod0")
arm::coefplot(mod0_std, col.pts="red", cex.pts=1.5)
arm::coefplot(mod1_std, add=TRUE, col.pts="blue", cex.pts=1.5)
A better substantive comparison of the coefficients could use understandable scales for the predictors, e.g., a month of education, $50 of income, or 10% of women's participation.
Note that the effect of this is just to multiply each coefficient and its standard error by a constant factor; the statistical conclusions of significance are unchanged.
Prestige_scaled <- Prestige %>%
mutate(education = 12 * education,
income = income / 50,
women = women / 10)
mod0_scaled <- lm(prestige ~ education + income + women,
data=Prestige_scaled)
summary(mod0_scaled)
##
## Call:
## lm(formula = prestige ~ education + income + women, data = Prestige_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.825 -5.333 -0.136 5.159 17.504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.7943 3.2391 -2.10 0.039 *
## education 0.3489 0.0324 10.77 < 2e-16 ***
## income 0.0657 0.0139 4.73 7.6e-06 ***
## women -0.0891 0.3041 -0.29 0.770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.85 on 98 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.792
## F-statistic: 129 on 3 and 98 DF, p-value: <2e-16
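The t values above are identical to those of mod0, since rescaling a predictor by a constant rescales its coefficient and standard error by the same constant. A quick check of that invariance, as a sketch assuming the carData package is installed (refitting both models so the chunk stands alone):

```r
data(Prestige, package = "carData")
mod0 <- lm(prestige ~ education + income + women, data = Prestige)

# Rescale the predictors to interpretable units
Prestige_scaled <- transform(Prestige,
                             education = 12 * education,  # years -> months
                             income = income / 50,        # units of $50
                             women = women / 10)          # units of 10%
mod0_scaled <- lm(prestige ~ education + income + women, data = Prestige_scaled)

# t statistics (and hence p-values) are unchanged by the rescaling
all.equal(summary(mod0)$coefficients[, "t value"],
          summary(mod0_scaled)$coefficients[, "t value"])   # TRUE
```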
arm::coefplot(mod0_scaled, col.pts="red", cex.pts=1.5,
main = "Coefficients for prestige with scaled predictors",
varnames=c("intercept",
"education\n(/month)",
"income\n(/$50K)",
"women\n(/10%)")
)