8  Collinearity & Ridge Regression

Some of my collinearity diagnostics have large values, or small values, or whatever they are not supposed to be.

  • What is bad?
  • If bad, what can I do about it?

In univariate multiple regression models, we usually hope to have high correlations between the outcome \(y\) and each of the predictors, \(x_1, x_2, \dots\), but high correlations among the predictors can cause problems in estimating and testing their effects. The quote above shows a typical quandary of researchers trying to understand these problems and take steps to resolve them. This chapter illustrates the problems of collinearity, describes diagnostic measures to assess its effects, and presents some novel visual tools for these purposes using the VisCollin package.

One class of solutions for collinearity involves regularization methods such as ridge regression. Another collection of graphical methods, generalized ridge trace plots, implemented in the genridge package, sheds further light on what is accomplished by this technique.

Packages

In this chapter we use the following packages. Load them now.
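The chunk at the end of this chapter records the packages used; loading them up front might look like the following (the comments note what each is used for here).

library(car)          # vif(), Anova(), dataEllipse(), confidenceEllipse()
library(carData)
library(dplyr)        # data wrangling: select(), mutate()
library(factoextra)   # fviz_pca_biplot()
library(genridge)     # Acetylene data; ridge regression tools
library(ggplot2)
library(ggrepel)      # geom_text_repel()
library(knitr)
library(MASS)         # mvrnorm()
library(patchwork)    # combining ggplots side-by-side with `+`
library(VisCollin)    # colldiag(), tableplot(), cars data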

8.1 What is collinearity?

The chapter quote above is not atypical of researchers who have read standard treatments of linear models (e.g., ???) and yet are still confused about what collinearity is, how to find its sources, and how to correct them. In Friendly & Kwan (2009), we liken this problem to that of the reader of Martin Handford’s successful series of books, Where’s Waldo. These consist of a series of full-page illustrations of hundreds of people and things and a few Waldos (a character wearing a red and white striped shirt and hat, glasses, and carrying a walking stick or other paraphernalia). Waldo was never disguised, yet the complex arrangement of misleading visual cues in the pictures made him very hard to find. Collinearity diagnostics often provide a similar puzzle.

Recall the standard classical linear model for a response variable \(y\) with a collection of predictors in \(\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p)\)

\[\begin{eqnarray*} \mathbf{y} & =& \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \cdots + \beta_p \mathbf{x}_p + \mathbf{\epsilon} \\ & = & \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon} \; , \end{eqnarray*}\] for which the ordinary least squares solution is:

\[ \widehat{\mathbf{b}} = (\mathbf{X}^\textsf{T} \mathbf{X})^{-1} \; \mathbf{X}^\textsf{T} \mathbf{y} \; , \] with sampling variances and covariances \(\text{Var} (\widehat{\mathbf{b}}) = \sigma^2 \times (\mathbf{X}^\textsf{T} \mathbf{X})^{-1}\), where \(\sigma^2\) is the variance of the residuals \(\mathbf{\epsilon}\), estimated by the mean squared error (MSE).
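These formulas are easy to verify numerically. The following sketch, using the built-in mtcars data purely for illustration, computes \(\widehat{\mathbf{b}}\) and its covariance matrix directly from the matrix formulas and compares them with lm():

X <- cbind(1, as.matrix(mtcars[, c("wt", "hp")]))  # model matrix with intercept column
y <- mtcars$mpg
b <- solve(t(X) %*% X, t(X) %*% y)                 # (X'X)^{-1} X'y
fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(by_hand = drop(b), lm = coef(fit))           # identical estimates

mse <- sum((y - X %*% b)^2) / (nrow(X) - ncol(X))  # MSE estimates sigma^2
all.equal(mse * solve(t(X) %*% X), vcov(fit),
          check.attributes = FALSE)                # identical covariance matrix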

In the limiting case, when one \(x_i\) is perfectly predictable from the other \(x\)s, i.e., \(R^2 (x_i | \text{other }x) = 1\),

  • there is no unique solution for the regression coefficients \(\mathbf{b} = (\mathbf{X}^\textsf{T} \mathbf{X})^{-1} \mathbf{X}^\textsf{T} \mathbf{y}\);
  • the standard errors \(s (b_i)\) of the estimated coefficients are infinite and t statistics \(t_i = b_i / s (b_i)\) are 0.

This extreme case reflects a situation where one or more predictors are effectively redundant, for example when you include two variables \(x\) and \(y\) and their sum \(z = x + y\) in a model, or use ipsatized scores that sum to a constant. More generally, collinearity refers to the case when there are very high multiple correlations among the predictors, such as \(R^2 (x_i | \text{other }x) \ge 0.9\). Note that you can’t tell simply by looking at the simple correlations. A large correlation \(r_{ij}\) is sufficient for collinearity, but not necessary: you can have variables \(x_1, x_2, x_3\) for which the pairwise correlations are low, but the multiple correlation is high.
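To make the last point concrete, here is a small simulated illustration (not from the data used later in this chapter): five independent predictors plus a sixth that is nearly their average. Every pairwise correlation is modest, yet the sixth variable is almost perfectly predicted by the others, so its VIF is large.

set.seed(1)
n <- 500
X <- matrix(rnorm(n * 5), n, 5)            # five independent predictors
x6 <- rowMeans(X) + rnorm(n, sd = 0.1)     # nearly a linear combination of them
D <- data.frame(X, x6)

max(abs(cor(D)[upper.tri(cor(D))]))        # largest pairwise correlation is modest
R2 <- summary(lm(x6 ~ ., data = D))$r.squared
c(R2 = R2, VIF = 1 / (1 - R2))             # yet x6 is nearly collinear with the others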

The consequences are:

  • The estimated coefficients have large standard errors, \(s(\hat{b_j})\). They are multiplied by the square root of the variance inflation factor, \(\sqrt{\text{VIF}}\), discussed below.
  • This deflates the \(t\)-statistics, \(t = \hat{b_j} / s(\hat{b_j})\) by the same factor.
  • Thus you may find a situation where the overall model is highly significant (large \(F\)-statistic), while none (or few) of the individual predictors are. This is a puzzlement!
  • Beyond this, the least squares solution may have poor numerical accuracy (Longley, 1967), because the solution depends on the determinant \(|\,\mathbf{X}^\textsf{T} \mathbf{X}\,|\), which approaches 0 as multiple correlations increase.
  • As well, recall that the coefficients \(\hat{b}\) are partial coefficients, giving the estimated change \(\Delta y\) in \(y\) when \(x\) changes by one unit \(\Delta x\), holding all other variables constant. With highly correlated predictors, the model may then be trying to estimate something that does not occur in the data.

8.1.1 Visualizing collinearity

Collinearity can be illustrated in data space for two predictors in terms of the stability of the regression plane for a linear model Y = X1 + X2. In Figure 8.1 (adapted from Fox (2016), Fig. 13.2):

  1. Panel (a) shows a case where \(X_1\) and \(X_2\) are uncorrelated, as can be seen in their scatter in the horizontal plane (+ symbols). The regression plane is well supported; a small change in Y for one observation won’t make much difference.

  2. In panel (b), \(X_1\) and \(X_2\) have a perfect correlation, \(r (x_1, x_2) = 1.0\). The regression plane is not unique; in fact, there are an infinite number of planes that fit the data equally well. Note that, if all we care about is prediction (not the coefficients), we could use \(X_1\) or \(X_2\), or both, or any weighted sum of them in a model and get the same predicted values.

  3. Panel (c) shows a typical case where there is a strong correlation between \(X_1\) and \(X_2\). The regression plane here is unique, but it is not well determined. A small change in Y can make quite a difference in the fitted value or coefficients, depending on the values of \(X_1\) and \(X_2\). Where \(X_1\) and \(X_2\) are far from their near-linear relation in the bottom plane, you can imagine that it is easy to tilt the plane substantially by a small change in \(Y\).

Figure 8.1: Effect of collinearity on the least squares regression plane. (a) Small correlation between predictors; (b) Perfect correlation; (c) Very strong correlation. The black points show the data Y values, white points are the fitted values in the regression plane, and + signs represent the values of X1 and X2. Source: Adapted from Fox (2016), Fig. 13.2.

8.1.2 Data space and \(\beta\) space

It is also useful to visualize collinearity by comparing the representation in data space with the analogous view of the confidence ellipses for coefficients in beta space. To do so, we generate data from a known model \(y = 3 x_1 + 3 x_2 + \epsilon\) with \(\epsilon \sim \mathcal{N} (0, 100)\) and various true correlations between \(x_1\) and \(x_2\), \(\rho_{12} = (0, 0.8, 0.97)\).¹

Working file: R/collin-data-beta.R

First, we use MASS::mvrnorm() to construct a list of data frames XY with specified values for the means and covariance matrices, and a corresponding list of fitted models, mods.

Code
library(MASS)
library(car)

set.seed(421)            # reproducibility
N <- 200                 # sample size
mu <- c(0, 0)            # means
s <- c(1, 1)             # standard deviations
rho <- c(0, 0.8, 0.97)   # correlations
beta <- c(3, 3)          # true coefficients

# Construct a covariance matrix from standard deviations s[1], s[2] and correlation r
Cov <- function(s, r){
  matrix(c(s[1]^2,      r * prod(s),
           r * prod(s), s[2]^2), nrow = 2, ncol = 2)
}

# Generate a dataframe of X, y for each rho
# Fit the model for each
XY <- vector(mode ="list", length = length(rho))
mods <- vector(mode ="list", length = length(rho))
for (i in seq_along(rho)) {
  r <- rho[i]
  X <- mvrnorm(N, mu, Sigma = Cov(s, r))
  colnames(X) <- c("x1", "x2")
  y <- beta[1] * X[,1] + beta[2] * X[,2] + rnorm(N, 0, 10)

  XY[[i]] <- data.frame(X, y=y)
  mods[[i]] <- lm(y ~ x1 + x2, data=XY[[i]])
}

The estimated coefficients in these models are:

coefs <- sapply(mods, coef)
colnames(coefs) <- paste("rho:", rho)
coefs
#>             rho: 0 rho: 0.8 rho: 0.97
#> (Intercept)   1.01  -0.0535     0.141
#> x1            3.18   3.4719     3.053
#> x2            1.68   2.9734     2.059

Then, we define a function to plot the data ellipse (car::dataEllipse()) for each data frame and the confidence ellipse (car::confidenceEllipse()) for the coefficients in the corresponding fitted model. In this figure, I specify the x, y limits for each plot so that the relative sizes of these ellipses are comparable and variance inflation can be assessed visually.

Code
do_plots <- function(XY, mod, r) {
  X <- as.matrix(XY[, 1:2])
  dataEllipse(X,
              levels= 0.95,
              col = "darkgreen",
              fill = TRUE, fill.alpha = 0.05,
              xlim = c(-3, 3),
              ylim = c(-3, 3), asp = 1)
  text(0, 3, bquote(rho == .(r)), cex = 2, pos = NULL)

  confidenceEllipse(mod,
                    col = "red",
                    fill = TRUE, fill.alpha = 0.1,
                    xlab = "x1 coefficient",
                    ylab = "x2 coefficient",
                    xlim = c(-5, 10),
                    ylim = c(-5, 10),
                    asp = 1)
  points(beta[1], beta[2], pch = "+", cex=2)
  abline(v=0, h=0, lwd=2)
}

op <- par(mar = c(4,4,1,1)+0.1,
          mfcol = c(2, 3),
          cex.lab = 1.5)

for (i in seq_along(rho)) {
  do_plots(XY[[i]], mods[[i]], rho[i])
}
par(op)
Figure 8.2: 95% data ellipses for x1, x2 and the corresponding 95% confidence ellipses for their coefficients. In the confidence ellipse plots, reference lines show the value (0,0) for the null hypothesis and “+” marks the true values for the coefficients. This figure adapts an example by John Fox (2022).

Recall (Section #sec-data-beta) that the confidence ellipse for \((\beta_1, \beta_2)\) is just a 90 degree rotation (and rescaling) of the data ellipse for \((x_1, x_2)\): it is wide (more variance) in any direction where the data ellipse is narrow.

The shadows of the confidence ellipses on the coordinate axes in Figure 8.2 represent the standard errors of the coefficients, and get larger with increasing \(\rho\). This is the effect of variance inflation, described in the following section.

8.2 Measuring collinearity

8.2.1 Variance inflation factors

How can we measure the effect of collinearity? The essential idea is to compare, for each predictor \(x_j\), the variance \(s^2 (\widehat{b_j})\) that its coefficient would have if it were totally unrelated to the other predictors with the actual variance it has in the given model.

For two predictors, such as those shown in Figure 8.2, the sampling variance of \(\widehat{b_1}\) can be expressed as

\[ s^2 (\widehat{b_1}) = \frac{MSE}{(n-1) \; s^2(x_1)} \; \times \; \left[ \frac{1}{1-r^2_{12}} \right] \] The first term here is the variance of \(b_1\) when the two predictors are uncorrelated. The term in brackets is the variance inflation factor (Marquardt, 1970), the amount by which the variance of the coefficient is multiplied as a consequence of the correlation \(r_{12}\) of the predictors. As \(r_{12} \rightarrow 1\), the variance approaches infinity.

More generally, with any number of predictors, this relation has a similar form, replacing the simple correlation \(r_{12}\) with the multiple correlation predicting \(x_j\) from all others,

\[ s^2 (\widehat{b_j}) = \frac{MSE}{(n-1) \; s^2(x_j)} \; \times \; \left[ \frac{1}{1-R^2_{j | \text{others}}} \right] \] So, we have that the variance inflation factors are:

\[ \text{VIF}_j = \frac{1}{1-R^2_{j \,|\, \text{others}}} \] In practice, it is often easier to think in terms of the square root, \(\sqrt{\text{VIF}_j}\), as the multiplier of the standard errors. The denominator, \(1-R^2_{j | \text{others}}\), is sometimes called tolerance, a term I don’t find particularly useful.

For the cases shown in Figure 8.2 the VIFs and their square roots are:

vifs <- sapply(mods, car::vif)
colnames(vifs) <- paste("rho:", rho)
vifs
#>    rho: 0 rho: 0.8 rho: 0.97
#> x1      1     3.09      18.6
#> x2      1     3.09      18.6

sqrt(vifs)
#>    rho: 0 rho: 0.8 rho: 0.97
#> x1      1     1.76      4.31
#> x2      1     1.76      4.31
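As a check on the definition, the VIF for \(x_1\) in the \(\rho = 0.97\) case can be computed directly from the squared multiple correlation of \(x_1\) with the other predictor (here, just \(x_2\)); the result should agree with the car::vif() values above, up to rounding.

R2 <- summary(lm(x1 ~ x2, data = XY[[3]]))$r.squared  # R^2 of x1 given the other predictor
c(VIF = 1 / (1 - R2), sqrt_VIF = sqrt(1 / (1 - R2)))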

Note that when there are terms in the model with more than one df, such as education with four levels (and hence 3 df) or a polynomial term specified as poly(x, 3), the standard VIF calculation gives results that vary with how those terms are coded in the model. Fox & Monette (1992) define generalized VIFs (GVIFs) as the inflation in the squared area of the confidence ellipse for the coefficients of such a term, relative to what would be obtained with uncorrelated data. Visually, this can be seen by comparing the areas of the ellipses in the bottom row of Figure 8.2. Because the magnitude of the GVIF increases with the number of degrees of freedom for the set of parameters, Fox & Monette suggest the analog \(\text{GVIF}^{1/(2\,\text{df})}\) as the measure of impact on standard errors.

Example: This example uses the cars dataset in the VisCollin package, containing various measures of size and performance for 406 models of automobiles from the years 1970–1982. Interest is focused on predicting gas mileage, mpg.

data(cars, package = "VisCollin")
str(cars)
#> 'data.frame':    406 obs. of  10 variables:
#>  $ make    : Factor w/ 30 levels "amc","audi","bmw",..: 6 4 22 1 12 12 6 22 23 1 ...
#>  $ model   : chr  "chevelle" "skylark" "satellite" "rebel" ...
#>  $ mpg     : num  18 15 18 16 17 15 14 14 14 15 ...
#>  $ cylinder: int  8 8 8 8 8 8 8 8 8 8 ...
#>  $ engine  : num  307 350 318 304 302 429 454 440 455 390 ...
#>  $ horse   : int  130 165 150 150 140 198 220 215 225 190 ...
#>  $ weight  : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
#>  $ accel   : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
#>  $ year    : int  70 70 70 70 70 70 70 70 70 70 ...
#>  $ origin  : Factor w/ 3 levels "Amer","Eur","Japan": 1 1 1 1 1 1 1 1 1 1 ...

We fit a model predicting gas mileage (mpg) from the number of cylinders, engine displacement, horsepower, weight, time to accelerate from 0 – 60 mph and model year (1970–1982). Perhaps surprisingly, only weight and year appear to significantly predict gas mileage. What’s going on here?

cars.mod <- lm (mpg ~ cylinder + engine + horse + weight + accel + year, 
                data=cars)
Anova(cars.mod)
#> Anova Table (Type II tests)
#> 
#> Response: mpg
#>           Sum Sq  Df F value Pr(>F)    
#> cylinder      12   1    0.99   0.32    
#> engine        13   1    1.09   0.30    
#> horse          0   1    0.00   0.98    
#> weight      1214   1  102.84 <2e-16 ***
#> accel          8   1    0.70   0.40    
#> year        2419   1  204.99 <2e-16 ***
#> Residuals   4543 385                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We check the variance inflation factors, using car::vif(). We see that most predictors have very high VIFs, indicating moderately severe multicollinearity.

vif(cars.mod)
#> cylinder   engine    horse   weight    accel     year 
#>    10.63    19.64     9.40    10.73     2.63     1.24

sqrt(vif(cars.mod))
#> cylinder   engine    horse   weight    accel     year 
#>     3.26     4.43     3.07     3.28     1.62     1.12

According to \(\sqrt{\text{VIF}}\), the standard error of cylinder has been multiplied by 3.26 and its \(t\)-value divided by this number, compared with the case when all predictors are uncorrelated. engine, horse and weight suffer a similar fate.

Connection with inverse of correlation matrix

In the linear regression model with standardized predictors, the covariance matrix of the estimated coefficients, excluding the intercept, \(\mathbf{b}^\star\), has the simpler form, \[ \mathcal{V} (\mathbf{b}^\star) = \frac{\sigma^2}{n-1} \mathbf{R}^{-1}_{X} \; , \] where \(\mathbf{R}_{X}\) is the correlation matrix among the predictors. It can then be seen that the VIF\(_j\) are just the diagonal entries of \(\mathbf{R}^{-1}_{X}\).

More generally, the matrix \(\mathbf{R}^{-1}_{X} = (r^{ij})\), when standardized to a correlation matrix as \(-r^{ij} / \sqrt{r^{ii} \; r^{jj}}\), gives the matrix of all partial correlations, \(r_{ij \,|\, \textrm{others}}\).
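Both facts are easy to check numerically for the cars model fit above. This sketch takes the predictors from the model matrix (so that exactly the same cases are used as in the fit), inverts their correlation matrix, and reads off the VIFs and partial correlations:

X <- model.matrix(cars.mod)[, -1]   # predictor columns for the cases used in the fit
R_inv <- solve(cor(X))
diag(R_inv)                         # should reproduce vif(cars.mod)
-cov2cor(R_inv)                     # partial correlations r_{ij | others} (ignore the -1 diagonal)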

8.2.2 Collinearity diagnostics

OK, we now know that large VIF\(_j\) indicate predictor coefficients whose estimation is degraded due to large \(R^2_{j \,|\, \text{others}}\). But for this to be useful, we need to determine:

  • how many dimensions in the space of the predictors are associated with nearly collinear relations?
  • which predictors are most strongly implicated in each of these?

Answers to these questions are provided using measures developed by Belsley and colleagues (Belsley et al., 1980; Belsley, 1991). These measures are based on the eigenvalues \(\lambda_1, \lambda_2, \dots, \lambda_p\) of the correlation matrix \(\mathbf{R}_{X}\) of the predictors (preferably centered and scaled, and not including the constant term for the intercept), and the corresponding eigenvectors in the columns of \(\mathbf{V}_{p \times p}\), given by the eigen decomposition \[ \mathbf{R}_{X} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\textsf{T} \; . \]

By elementary matrix algebra, the eigen decomposition of \(\mathbf{R}_{X}^{-1}\) is then \[ \mathbf{R}_{X}^{-1} = \mathbf{V} \mathbf{\Lambda}^{-1} \mathbf{V}^\textsf{T} \; , \tag{8.1}\] so \(\mathbf{R}_{X}\) and \(\mathbf{R}_{X}^{-1}\) have the same eigenvectors, and the eigenvalues of \(\mathbf{R}_{X}^{-1}\) are just \(\lambda_i^{-1}\). Using Equation 8.1, the variance inflation factors may be expressed as \[ \text{VIF}_j = \sum_{k=1}^p \frac{V^2_{jk}}{\lambda_k} \; , \] which shows that only the small eigenvalues contribute to variance inflation, and then only for those predictors that have large eigenvector coefficients on those small components. These facts lead to the following diagnostic statistics for collinearity (verified numerically in the code sketch following the list):

  • Condition indices: The smallest of the eigenvalues, those for which \(\lambda_j \approx 0\), indicate collinearity and the number of small values indicates the number of near collinear relations. Because the sum of the eigenvalues, \(\Sigma \lambda_i = p\) increases with the number of predictors \(p\), it is useful to scale them all in relation to the largest. This leads to condition indices, defined as \(\kappa_j = \sqrt{ \lambda_1 / \lambda_j}\). These have the property that the resulting numbers have common interpretations regardless of the number of predictors.

    • For completely uncorrelated predictors, all \(\kappa_j = 1\).
    • \(\kappa_j \rightarrow \infty\) as any \(\lambda_k \rightarrow 0\).
  • Variance decomposition proportions: Large VIFs indicate variables that are involved in some nearly collinear relations, but they don’t indicate which other variable(s) each is involved with. For this purpose, Belsley et al. (1980) and Belsley (1991) proposed calculating the proportion of the variance of each coefficient associated with each principal component, giving a decomposition of coefficient variance over these dimensions.
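Before turning to colldiag(), here is a small sketch that computes these quantities directly from the eigen decomposition of the predictor correlation matrix for the cars model fit above: the VIFs, reconstructed as \(\sum_k V^2_{jk} / \lambda_k\), and the condition indices \(\kappa_j\). The results should agree, up to rounding, with vif(cars.mod) and with the colldiag() output below.

X <- model.matrix(cars.mod)[, -1]          # predictors, intercept column removed
eig <- eigen(cor(X))
lambda <- eig$values
V <- eig$vectors

vif_eig <- drop(V^2 %*% (1 / lambda))      # VIF_j = sum_k V_jk^2 / lambda_k
names(vif_eig) <- colnames(X)
vif_eig                                    # compare with vif(cars.mod)

sqrt(lambda[1] / lambda)                   # condition indices kappa_j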

These measures can be calculated using VisCollin::colldiag(). For the current model, the usual display contains both the condition indices and variance proportions. However, even for a small example, it is often difficult to know what numbers to pay attention to.

(cd <- colldiag(cars.mod, center=TRUE))
#> Condition
#> Index    Variance Decomposition Proportions
#>           cylinder engine horse weight accel year 
#> 1   1.000 0.005    0.003  0.005 0.004  0.009 0.010
#> 2   2.252 0.004    0.002  0.000 0.007  0.022 0.787
#> 3   2.515 0.004    0.001  0.002 0.010  0.423 0.142
#> 4   5.660 0.309    0.014  0.306 0.087  0.063 0.005
#> 5   8.342 0.115    0.000  0.654 0.715  0.469 0.052
#> 6  10.818 0.563    0.981  0.032 0.176  0.013 0.004

Belsley (1991) recommends that the sources of collinearity be diagnosed (a) only for those components with large \(\kappa_j\), and (b) for those components for which the variance proportion is large (say, \(\ge 0.5\)) on two or more predictors. The print method for "colldiag" objects has a fuzz argument controlling this.

print(cd, fuzz = 0.5)
#> Condition
#> Index    Variance Decomposition Proportions
#>           cylinder engine horse weight accel year 
#> 1   1.000  .        .      .     .      .     .   
#> 2   2.252  .        .      .     .      .    0.787
#> 3   2.515  .        .      .     .      .     .   
#> 4   5.660  .        .      .     .      .     .   
#> 5   8.342  .        .     0.654 0.715   .     .   
#> 6  10.818 0.563    0.981   .     .      .     .

The mystery is solved, if you can read that table with these recommendations in mind. There are two nearly collinear relations among the predictors, corresponding to the two smallest dimensions.

  • Dimension 5 reflects the high correlation between horsepower and weight,
  • Dimension 6 reflects the high correlation between number of cylinders and engine displacement.

Note that the high variance proportion for year (0.787) on the second component creates no problem and should be ignored because (a) the condition index is low and (b) it shares nothing with other predictors.

8.2.3 Tableplots

The default tabular display of condition indices and variance proportions from colldiag() is what triggered the comparison to “Where’s Waldo”. It suffers from the fact that the important information — (a) how many Waldos? (b) where are they hiding — is disguised by being embedded in a sea of mostly irrelevant numbers. The simple option of using a principled fuzz factor helps considerably, but not entirely.

The simplified tabular display above can be improved to make the patterns of collinearity more visually apparent and to signify warnings directly to the eyes. A tableplot (Kwan et al., 2009) is a semi-graphic display that presents numerical information in a table using shapes proportional to the value in a cell and other visual attributes (shape type, color fill, and so forth) to encode other information.

For collinearity diagnostics, these show:

  • the condition indices, using squares whose background color signals danger (red, for condition indices > 10), warning (for values > 5), or OK (green otherwise). The value of the condition index is encoded within this using a white square whose side is proportional to the value (up to some maximum value, cond.max, that fills the cell).

  • Variance decomposition proportions are shown by filled circles whose radius is proportional to those values and are filled (by default) with shades ranging from white through pink to red. Rounded values of those diagnostics are printed in the cells.

The tableplot below (Figure 8.3) encodes all the information from the values of colldiag() printed above (but using prop.col color breaks such that variance proportions < 0.3 are shaded white). The visual message is that one should attend to collinearities with large condition indices and large variance proportions implicating two or more predictors.

tableplot(cd, title = "Tableplot of cars data", cond.max = 30 )
Figure 8.3: Tableplot of condition indices and variance proportions for the Cars data. In column 1, the square symbols are scaled relative to a maximum condition index of 30. In the remaining columns, variance proportions (times 100) are shown as circles scaled relative to a maximum of 100.

8.2.4 Collinearity biplots

As we have seen, the collinearity diagnostics are all functions of the eigenvalues and eigenvectors of the correlation matrix of the predictors in the regression model, or alternatively, the SVD of the \(\mathbf{X}\) matrix in the linear model (excluding the constant). The standard biplot (Gabriel, 1971; Gower & Hand, 1996) (see: Section 4.3) can be regarded as a multivariate analog of a scatterplot, obtained by projecting a multivariate sample into a low-dimensional space (typically of 2 or 3 dimensions) accounting for the greatest variance in the data.

However, the standard biplot is less useful for visualizing the relations among the predictors that lead to nearly collinear relations. Instead, biplots of the smallest dimensions show these relations directly, and can show other features of the data as well, such as outliers and leverage points. We use prcomp(X, scale.=TRUE) to obtain the PCA of the correlation matrix of the predictors:

cars.X <- cars |>
  select(where(is.numeric)) |>
  select(-mpg) |>
  tidyr::drop_na()
cars.pca <- prcomp(cars.X, scale. = TRUE)
cars.pca
#> Standard deviations (1, .., p=6):
#> [1] 2.070 0.911 0.809 0.367 0.245 0.189
#> 
#> Rotation (n x k) = (6 x 6):
#>             PC1     PC2    PC3    PC4     PC5     PC6
#> cylinder -0.454 -0.1869  0.168 -0.659 -0.2711 -0.4725
#> engine   -0.467 -0.1628  0.134 -0.193 -0.0109  0.8364
#> horse    -0.462 -0.0177 -0.123  0.620 -0.6123 -0.1067
#> weight   -0.444 -0.2598  0.278  0.350  0.6860 -0.2539
#> accel     0.330 -0.2098  0.865  0.143 -0.2774  0.0337
#> year      0.237 -0.9092 -0.335  0.025 -0.0624  0.0142

The standard deviations above are the square roots \(\sqrt{\lambda_j}\) of the eigenvalues of the correlation matrix, and are returned in the sdev component of the "prcomp" object. The eigenvectors are returned in the rotation component, whose directions are arbitrary. Because we are interested in seeing the relative magnitude of variable vectors, we are free to multiply them by any constant to make them more visible in relation to the scores for the cars.
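Since these standard deviations are \(\sqrt{\lambda_j}\), the ratio of the largest to each of the others gives the condition indices directly:

cars.pca$sdev[1] / cars.pca$sdev      # condition indices, sqrt(lambda_1 / lambda_j)

These should be close to (though not necessarily identical with) the colldiag() values shown earlier, because cars.X drops rows only for missing predictor values, while the model also excluded cases with missing mpg. With that connection noted, we rescale the variable vectors so they are more visible relative to the scores: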

cars.pca$rotation <- -2.5 * cars.pca$rotation    # reflect & scale the variable vectors

ggp <- fviz_pca_biplot(
  cars.pca,
  axes = 6:5,
  geom = "point",
  col.var = "blue",
  labelsize = 5,
  pointsize = 1.5,
  arrowsize = 1.5,
  addEllipses = TRUE,
  ggtheme = ggplot2::theme_bw(base_size = 14),
  title = "Collinearity biplot for cars data")

# add point labels for outlying points
dsq <- heplots::Mahalanobis(cars.pca$x[, 6:5])
scores <- as.data.frame(cars.pca$x[, 6:5])
scores$name <- rownames(scores)

ggp + geom_text_repel(data = scores[dsq > qchisq(0.95, df = 6),],
                aes(x = PC6,
                    y = PC5,
                    label = name),
                vjust = -0.5,
                size = 5)
Figure 8.4: Collinearity biplot of the Cars data, showing the last two dimensions. The projections of the variable vectors on the coordinate axes are proportional to their variance proportions. To reduce graphic clutter, only the most outlying observations in predictor space are identified by case labels. An extreme outlier (case 20) appears in the lower left corner.

As with the tabular display of variance proportions, Waldo is hiding in the dimensions associated with the smallest eigenvalues (largest condition indices).
As well, it turns out that outliers in the predictor space (also high leverage observations) can often be seen as observations far from the centroid in the space of the smallest principal components.

The projections of the variable vectors in Figure 8.4 on the Dimension 5 and Dimension 6 axes are proportional to their variance proportions shown above. The relative lengths of these variable vectors can be considered to indicate the extent to which each variable contributes to collinearity for these two near-singular dimensions.

Thus, we see again that Dimension 6 is largely determined by engine size, with a substantial (negative) relation to cylinder. Dimension 5 has its strongest relations to weight and horse.

Moreover, there is one observation, #20, that stands out as an outlier in predictor space, far from the centroid. It turns out that this vehicle, a Buick Estate wagon, is an early-year (1970) American behemoth, with an 8-cylinder, 455 cu. in, 225 horse-power engine, and able to go from 0 to 60 mph in 10 sec. (Its MPG is only slightly under-predicted from the regression model, however.)

8.3 Remedies for collinearity: What can I do?

Collinearity is often a data problem, for which there is no magic cure. Nevertheless there are some general guidelines and useful techniques to address this problem.

  • Pure prediction: If we are only interested in predicting or explaining an outcome, and not in the model coefficients or which of them are “significant”, collinearity can be largely ignored. The fitted values are unaffected by collinearity, even in the case of perfect collinearity as shown in Figure 8.1 (b).

  • Structural collinearity: Sometimes collinearity results from structural relations among the variables that relate to how they have been defined.

    • For example, polynomial terms, like \(x, x^2, x^3\) or interaction terms like \(x_1, x_2, x_1 * x_2\) are necessarily correlated. A simple cure is to center the predictors at their means, using \(x - \bar{x}, (x - \bar{x})^2, (x - \bar{x})^3\) or \((x_1 - \bar{x}_1), (x_2 - \bar{x}_2), (x_1 - \bar{x}_1) * (x_2 - \bar{x}_2)\). Centering removes the spurious ill-conditioning, thus reducing the VIFs. Note that in polynomial models, using y ~ poly(x, 3) to specify a cubic model generates orthogonal (uncorrelated) regressors, whereas in y ~ x + I(x^2) + I(x^3) the terms have built-in correlations.

    • When some predictors share a common cause, as in GNP or population in time-series or cross-national data, you can reduce collinearity by re-defining predictors to reflect per capita measures. In a related example with sports data, when you have cumulative totals (e.g., runs, hits, homeruns in baseball) for players over years, expressing these measures as per year will reduce the common effect of longevity on these measures.

  • Model re-specification:

    • Drop one or more regressors that have a high VIF if they are not deemed to be essential

    • Replace highly correlated regressors with linear combination(s) of them. For example, two related variables \(x_1\) and \(x_2\) can be replaced, without any loss of information, by their sum and difference, \(z_1 = x_1 + x_2\) and \(z_2 = x_1 - x_2\). In a dataset on fitness, say, we may have correlated predictors of resting pulse rate and pulse rate while running. Transforming these to their average and their difference gives new variables which are interpretable and less correlated.

  • Statistical remedies:

    • Transform the predictors \(\mathbf{X}\) to uncorrelated principal component scores \(\mathbf{Z} = \mathbf{X} \mathbf{V}\), and regress \(\mathbf{y}\) on \(\mathbf{Z}\). These give an identical overall model fit, with no loss of information (see the sketch following this list). A related technique is incomplete principal components regression, where some of the smallest dimensions (those causing collinearity) are omitted from the model. The trade-off is that it may be more difficult to interpret what the model means, but this can be countered with a biplot, showing the projections of the original variables into the reduced space of the principal components.

    • Use regularization methods such as ridge regression and the lasso, which counter collinearity by shrinking the coefficients towards 0, introducing a small amount of bias in exchange for reduced variance. See the genridge package and its pkgdown documentation for visualization methods.

    • Use Bayesian regression: if multicollinearity prevents a regression coefficient from being estimated precisely, then a prior on that coefficient will help to reduce its posterior variance.
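To illustrate the first of these statistical remedies, here is a minimal sketch (using the cars model from above) that regresses gas mileage on the full set of principal component scores of the predictors. Because the components span the same space as the original predictors, the overall fit is identical:

X <- model.matrix(cars.mod)[, -1]              # predictors, same rows as the model
y <- model.response(model.frame(cars.mod))     # mpg for those rows
Z <- scale(X) %*% eigen(cor(X))$vectors        # uncorrelated principal component scores
colnames(Z) <- paste0("PC", 1:ncol(Z))

pc.mod <- lm(y ~ Z)
c(R2_original   = summary(cars.mod)$r.squared, # identical overall fit
  R2_components = summary(pc.mod)$r.squared)

Dropping the last column (or two) of Z, the near-singular dimensions identified earlier, would give an incomplete principal components regression.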

Example: Centering

To illustrate the effect of centering a predictor in a polynomial model, we generate a perfect quadratic relationship, \(y = x^2\), and consider the correlations of \(x\) with \(x^2\) and with \((x - \bar{x})^2\). The correlation of \(x\) with \(x^2\) is 0.97, while the correlation of \(x\) with \((x - \bar{x})^2\) is zero.

x <- 1:20
y1 <- x^2
y2 <- (x - mean(x))^2
XY <- data.frame(x, y1, y2)

(R <- cor(XY))
#>        x    y1    y2
#> x  1.000 0.971 0.000
#> y1 0.971 1.000 0.238
#> y2 0.000 0.238 1.000

The effect of centering here is to remove the linear association in what is a purely quadratic relationship, as can be seen by plotting y1 and y2 against x.

r1 <- R[1, 2]
r2 <- R[1, 3]

gg1 <-
ggplot(XY, aes(x = x, y = y1)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", formula = y~x, linewidth = 2, se = FALSE) +
  labs(x = "X", y = "Y") +
  theme_bw(base_size = 16) +
  annotate("text", x = 5, y = 350, size = 6,
           label = paste("X Uncentered\nr =", round(r1, 3)))

gg2 <-
  ggplot(XY, aes(x = x, y = y2)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", formula = y~x, linewidth = 2, se = FALSE) +
  labs(x = "X", y = "Y") +
  theme_bw(base_size = 16) +
  annotate("text", x = 5, y = 80, size = 6,
           label = paste("X Centered\nr =", round(r2, 3)))

gg1 + gg2         # show plots side-by-side
Figure 8.5: Centering a predictor removes the built-in correlation between the linear and quadratic terms in a quadratic regression

Example: Interactions

The dataset genridge::Acetylene gives data from Marquardt & Snee (1975) on the yield of a chemical manufacturing process to produce acetylene, in relation to reactor temperature (temp), the ratio of two components (ratio) and the contact time in the reactor (time). A naive response surface model might suggest that yield is quadratic in time and that there are potential interactions among all pairs of predictors.

data(Acetylene, package = "genridge")
acetyl.mod0 <- lm(yield ~ temp + ratio + time + I(time^2) + 
                          temp:time + temp:ratio + time:ratio,
                  data=Acetylene)

(acetyl.vif0 <- vif(acetyl.mod0))
#>       temp      ratio       time  I(time^2)  temp:time temp:ratio 
#>        383      10555      18080        564       9719       9693 
#> ratio:time 
#>        225

These results are horrible! How much does centering help? I first center all three predictors and then use update() to re-fit the same model using the centered data.

Acetylene.centered <-
  Acetylene |>
  mutate(temp = temp - mean(temp),
         time = time - mean(time),
         ratio = ratio - mean(ratio))

acetyl.mod1 <- update(acetyl.mod0, data=Acetylene.centered)
(acetyl.vif1 <- vif(acetyl.mod1))
#>       temp      ratio       time  I(time^2)  temp:time temp:ratio 
#>      57.09       1.09      81.57      51.49      44.67      30.69 
#> ratio:time 
#>      33.33

This is far better, although still not great in terms of VIF. But, how much have we improved the situation by the simple act of centering the predictors? The square roots of the ratios of VIFs tell us the impact of centering on the standard errors.

sqrt(acetyl.vif0 / acetyl.vif1)
#>       temp      ratio       time  I(time^2)  temp:time temp:ratio 
#>       2.59      98.24      14.89       3.31      14.75      17.77 
#> ratio:time 
#>       2.60

Finally, we use poly(time, 2) in the model for the centered data. Because this term has more than one degree of freedom, car::vif() calculates GVIFs here. The final column gives \(\text{GVIF}^{1/(2\,\text{df})}\), the remaining effect of collinearity on the standard errors of terms in this model.

acetyl.mod2 <- lm(yield ~ temp + ratio + poly(time, 2) + 
                          temp:time + temp:ratio + time:ratio,
                  data=Acetylene.centered)

vif(acetyl.mod2, type = "term")
#>                  GVIF Df GVIF^(1/(2*Df))
#> temp            57.09  1            7.56
#> ratio            1.09  1            1.05
#> poly(time, 2) 1733.56  2            6.45
#> temp:time       44.67  1            6.68
#> temp:ratio      30.69  1            5.54
#> ratio:time      33.33  1            5.77

8.4 Ridge regression

8.4.1 What is ridge regression?

8.4.2 Univariate ridge trace plots

8.4.3 Bivariate ridge trace plots

#> 11  packages used here:
#>  car, carData, dplyr, factoextra, genridge, ggplot2, ggrepel, knitr, MASS, patchwork, VisCollin

  1. This example is adapted from one by John Fox (2022), Collinearity Diagnostics↩︎