7  Topics in Linear Models

The geometric and graphical approach of earlier chapters has already introduced some new ideas for thinking about multivariate data, models for explaining them, and graphical methods for understanding their results. These can be applied to better understand common problems that arise in data analysis.

Packages

In this chapter I use the following packages. Load them now:
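Based on the functions called later in the chapter, a minimal set would be the following (a sketch of the likely list; a few other packages, such as broom and here, are used below with the :: prefix):

library(car)        # scatterplotMatrix(), confidenceEllipse()
library(heplots)    # mark.H0()
library(dplyr)      # data wrangling verbs used throughout
library(ggplot2)    # plots, stat_ellipse(), geom_smooth()
library(forcats)    # fct_inorder()
library(patchwork)  # combining ggplot figures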

7.1 Ellipsoids in data space and \(\mathbf{\beta}\) space

It is most common to look at data and fitted models in “data space,” where axes correspond to variables, points represent observations, and fitted models are plotted as lines (or planes) in this space. As we’ve suggested, data ellipsoids provide informative summaries of relationships in data space. For linear models, particularly regression models with quantitative predictors, there is another space—“\(\mathbf{\beta}\) space”—that provides deeper views of models and the relationships among them. This discussion follows Friendly et al. (2013), Sec. 4.6.

In \(\mathbf{\beta}\) space, the axes pertain to coefficients, for example \((\beta_0, \beta_1)\) in a simple linear regression. Points in this space are models (true, hypothesized, fitted) whose coordinates represent values of these parameters. For example, one point \(\widehat{\mathbf{\beta}}_{\text{OLS}} = (\hat{\beta}_0, \hat{\beta}_1)\) represents the least squares estimate, points \(\widehat{\mathbf{\beta}}_{\text{WLS}}\) and \(\widehat{\mathbf{\beta}}_{\text{ML}}\) would give weighted least squares or maximum likelihood estimates, and the line \(\beta_1 = 0\) represents the null hypothesis that the slope is zero.

In the sense described below, data space and \(\boldsymbol{\beta}\) space are each dual to the other. In simple linear regression, for example:

  • each line like \(\mathbf{y} = a + b \mathbf{x}\) in data space corresponds to a point \((a,b)\) in \(\mathbf{\beta}\) space,
  • the set of points on any line \(\beta_1 = \alpha + \gamma \beta_0\) in \(\mathbf{\beta}\) space corresponds to a set of lines through a given point \((\alpha, \gamma)\) in data space, and
  • the geometric proposition that every pair of points defines a line in one space corresponds to the proposition that every two lines intersect in a point in the other space.

Moreover, ellipsoids in these spaces are dual and inversely related to each other (Dempster, 1969, Ch. 6). In data space, joint confidence intervals for the mean vector or joint prediction regions for the data are given by the ellipsoids \((\bar{x}_1, \bar{x}_2)^\textsf{T} \oplus c \sqrt{\mathbf{S}_{\mathbf{X}}}\), where the covariance matrix \(\mathbf{S}_{\mathbf{X}}\) depends on \(\mathbf{X}^\mathsf{T}\mathbf{X}\). In the dual \(\mathbf{\beta}\) space, joint confidence regions for the coefficients of a response variable \(y\) on \((x_1, x_2)\) are given by ellipsoids of the form \(\widehat{\mathbf{\beta}} \oplus c \sqrt{\mathbf{S}_{\mathbf{X}}^{-1}}\), and depend on \(\mathbf{(\mathbf{X}^\mathsf{T}\mathbf{X})}^{-1}\).

It is well to understand the underlying geometry connecting the ellipses for a matrix and its inverse. This can be seen in Figure 7.1, which shows an ellipse for a covariance matrix \(\mathbf{S}\), whose axes, as we saw in Chapter 4, are the eigenvectors \(\mathbf{v}_i\) of \(\mathbf{S}\) and whose radii are the square roots \(\sqrt{\lambda_i}\) of the corresponding eigenvalues. The comparable ellipse for \(2 \mathbf{S}\) has radii multiplied by \(\sqrt{2}\).

As long as \(\mathbf{S}\) is of full rank, the eigenvectors of \(\mathbf{S}^{-1}\) are identical to those of \(\mathbf{S}\), while the eigenvalues are \(1 / \lambda_i\), so the radii are \(1 / \sqrt{\lambda_i}\); the ellipse for \((2 \mathbf{S})^{-1}\) is smaller by a factor of \(\sqrt{2}\). Thus, in two dimensions, the ellipse for \(\mathbf{S}^{-1}\) is a \(90^\circ\) rotation of that for \(\mathbf{S}\). It is small in directions where the ellipse for \(\mathbf{S}\) is large, and vice versa.

Figure 7.1: Geometric properties of the ellipse for a matrix \(\mathbf{S}\) and its inverse, \(\mathbf{S}^{-1}\). The principal axes (dotted lines) are given by the eigenvectors, which are the same for \(\mathbf{S}\) and \(\mathbf{S}^{-1}\). Multiplying \(\mathbf{S}\) by 2 makes its ellipse larger by a factor of \(\sqrt{2}\), while the ellipse for \((2 \mathbf{S})^{-1}\) becomes smaller by the same factor.
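A quick numerical check makes this concrete (a minimal sketch, using an arbitrary \(2 \times 2\) covariance matrix): the eigenvectors of \(\mathbf{S}\) and \(\mathbf{S}^{-1}\) agree up to order and sign, while the eigenvalues are reciprocals.

S <- matrix(c(2,   1.5,
              1.5, 2), nrow = 2)   # an arbitrary covariance matrix, for illustration only
eS  <- eigen(S)
eSi <- eigen(solve(S))

eS$values              # lambda_1 >= lambda_2
rev(1 / eSi$values)    # reciprocals of the eigenvalues of S^{-1}, reordered: the same values
eS$vectors             # eigenvectors of S ...
eSi$vectors[, 2:1]     # ... match those of S^{-1} (up to sign), after reordering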

We illustrate these relationships in the example below.

7.1.1 Coffee, stress and heart disease

Consider the dataset coffee, giving measures of Heart (\(y\)), an index of cardiac damage, Coffee (\(x_1\)), a measure of daily coffee consumption, and Stress (\(x_2\)), a measure of occupational stress, in a contrived sample of \(n=20\) university people.1 For the sake of the example we assume that the main goal is to determine whether coffee is good or bad for your heart, and that stress represents one potential confounding variable among others (age, smoking, etc.) that might be useful to control statistically.

set.seed(1234)
load(here::here("data", "coffee.RData"))
coffee |> dplyr::sample_n(6)
#>          Group Coffee Stress Heart
#> 1 Grad_Student    104    117    92
#> 2      Student     52     86    63
#> 3 Grad_Student     76     92    58
#> 4      Student    100    123    92
#> 5      Student     64     74    63
#> 6    Professor    141    175   145

Figure 7.2 shows the scatterplot matrix, giving the marginal relations between all pairs of variables. The marginal message seems to be that coffee is bad for your heart, stress is bad for your heart and coffee consumption is also related to occupational stress.

scatterplotMatrix(~ Heart + Coffee + Stress, data=coffee,
    smooth = FALSE,
    pch = 16, col = "brown",
    cex.axis = 1.3, cex.labels = 3,
    ellipse = list(levels = 0.68, fill.alpha = 0.1))
Figure 7.2: Scatterplot matrix for the coffee data showing the pairwise relationships among Heart damage (\(y\)), Coffee consumption (\(x_1\)), and Stress (\(x_2\)), with linear regression lines and 68% data ellipses.

Yet, when we fit both variables together, we obtain the following results, suggesting that coffee is good for you—the coefficient for coffee is now negative, though non-significant. How can this be?

coffee.mod <- lm(Heart ~ Coffee + Stress, data=coffee)
broom::tidy(coffee.mod)
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)   -7.79      5.79      -1.35 0.196    
#> 2 Coffee        -0.409     0.292     -1.40 0.179    
#> 3 Stress         1.20      0.224      5.34 0.0000536

The answer is that the marginal plots of Heart vs. Coffee and of Heart vs. Stress in the first row of Figure 7.2 each ignore the other predictor. In contrast, the coefficients for coffee and stress in the multiple regression model coffee.mod are partial coefficients, giving the estimated change in heart damage for a unit change in each predictor, adjusting for (controlling for, or holding constant) the other predictor.

We can see these effects directly in added variable plots (Section 6.4), but here I consider the relationship of coffee and stress in data space and \(\mathbf{\beta}\) space and how their ellipses relate to each other and to hypothesis tests.

The left panel in Figure 7.3 is the same as that in the (3,2) cell of Figure 7.2 for the relation Stress ~ Coffee, but with data ellipses of 40% and 68% coverage. The shadows of the 40% ellipse on either axis give univariate intervals \(\bar{x} \pm 1 s_x\) (the mean plus or minus one standard deviation), shown by the thick red lines; the shadow of the 68% ellipse corresponds to an interval \(\bar{x} \pm 1.5 s_x\).

The right panel shows the joint 95% confidence region for the coefficients \((\beta_{\text{Coffee}}, \beta_{\text{Stress}})\) and individual confidence intervals in \(\mathbf{\beta}\) space. These are determined as

\[ \widehat{\mathbf{\beta}} \oplus \sqrt{d F^{.95}_{d, \nu}} \times s_e \times \mathbf{S}_X^{-1/2} \:\: , \] where \(d\) is the number of dimensions for which we want coverage, \(\nu\) is the residual degrees of freedom for \(s_e\), and \(\mathbf{S}_X\) is the covariance matrix of the predictors.

Figure 7.3: Data space and \(\mathbf{\beta}\) space representations of Coffee and Stress. Left: 40% and 68% data ellipses. Right: Joint 95% confidence ellipse (blue) for (\(\beta_{\text{Coffee}}, \beta_{\text{Stress}}\)), confidence interval generating ellipse (red) with 95% univariate shadows. \(H_0\) marks the joint hypothesis that both coefficients equal zero.

Thus, the blue ellipse in Figure 7.3 (right) is the ellipse of joint 95% coverage, using the factor \(\sqrt{2 F^{.95}_{2, \nu}}\) and covering the true values of (\(\beta_{\mathrm{Coffee}}, \beta_{\mathrm{Stress}}\)) in 95% of samples. Moreover:

  • Any joint hypothesis (e.g., \(H_0:\beta_{\mathrm{Stress}}=0, \beta_{\mathrm{Coffee}}=0\)) can be tested visually, simply by observing whether the hypothesized point, \((0, 0)\) here, lies inside or outside the joint confidence ellipse. That hypothesis is rejected here, because the point marked \(H_0\) in Figure 7.3 falls outside the joint 95% confidence ellipse.
  • The shadows of this ellipse on the horizontal and vertical axes give Scheffé joint 95% confidence intervals for the parameters, with protection for simultaneous inference (“fishing”) in a 2-dimensional space.
  • Similarly, using the factor \(\sqrt{F^{1-\alpha/d}_{1, \nu}} = t^{1-\alpha/2d}_\nu\) would give an ellipse whose 1D shadows are \(1-\alpha\) Bonferroni confidence intervals for \(d\) a priori hypotheses.

Visual hypothesis tests and \(d=1\) confidence intervals for the parameters separately are obtained from the red ellipse in Figure 7.3, which is scaled by \(\sqrt{F^{.95}_{1, \nu}} = t^{.975}_\nu\). We call this the “confidence-interval generating ellipse” (or, more compactly, the “confidence-interval ellipse”). The shadows of the confidence-interval ellipse on the axes (thick red lines) give the corresponding individual 95% confidence intervals, which are equivalent to the (partial, Type III) \(t\)-tests for each coefficient given in the standard multiple regression output shown above.
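The two scaling constants are easy to compute directly (a small check, using coffee.mod fit above), and the \(d=1\) shadows reproduce the usual confidence intervals from confint():

nu <- df.residual(coffee.mod)                      # residual df: 20 - 3 = 17
c(joint  = sqrt(2 * qf(0.95, df1 = 2, df2 = nu)),  # radius of the joint (d = 2) ellipse
  single = sqrt(qf(0.95, df1 = 1, df2 = nu)),      # radius of the CI-generating (d = 1) ellipse
  t      = qt(0.975, nu))                          # equal to the d = 1 radius

confint(coffee.mod)["Coffee", ]                    # matches beta-hat +/- t * SE:
coef(coffee.mod)["Coffee"] +
  c(-1, 1) * qt(0.975, nu) * sqrt(vcov(coffee.mod)["Coffee", "Coffee"])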

Thus, controlling for Stress, the confidence interval for the slope for Coffee includes 0, so we cannot reject the hypothesis that \(\beta_{\mathrm{Coffee}}=0\) in the multiple regression model, as we saw above in the numerical output. On the other hand, the interval for the slope for Stress excludes 0, so we reject the null hypothesis that \(\beta_{\mathrm{Stress}}=0\), controlling for Coffee consumption.

Finally, consider the relationship between the data ellipse and the confidence ellipse. These have exactly the same shape, but (with equal coordinate scaling of the axes) the confidence ellipse is a \(90^\circ\) rotation and rescaling of the data ellipse. In directions in data space where the slice of the data ellipse is wide—where we have more information about the relationship between Coffee and Stress—the projection of the confidence ellipse is narrow, reflecting greater precision of the estimates of the coefficients. Conversely, where the slice of the data ellipse is narrow (less information), the projection of the confidence ellipse is wide (less precision).
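This duality can be verified numerically (a small sketch using the objects above): the estimated covariance matrix of the slope estimates is proportional to \(\mathbf{S}_X^{-1}\), the inverse of the covariance matrix of the predictors.

S_X <- cov(coffee[, c("Coffee", "Stress")])   # covariance matrix of the predictors
n   <- nrow(coffee)

vcov(coffee.mod)[-1, -1]                      # estimated Var(beta_Coffee, beta_Stress)
sigma(coffee.mod)^2 / (n - 1) * solve(S_X)    # the same matrix: s_e^2 [(n-1) S_X]^{-1}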

Confidence ellipses are drawn using car::confidenceEllipse(), as shown in the code below.

Code for confidence ellipses
# joint 95% confidence ellipse (d = 2), the blue ellipse in Figure 7.3
confidenceEllipse(coffee.mod, 
    grid = FALSE,
    xlim = c(-2, 1), ylim = c(-0.5, 2.5),
    xlab = expression(paste("Coffee coefficient,  ", beta["Coffee"])),
    ylab = expression(paste("Stress coefficient,  ", beta["Stress"])),
    cex.lab = 1.5)
confidenceEllipse(coffee.mod, add=TRUE, draw = TRUE,   # d = 1: the CI-generating ellipse (red)
    col = "red", fill = TRUE, fill.alpha = 0.1,
    dfn = 1)
abline(h = 0, v = 0, lwd = 2)

# confidence intervals
beta <- coef( coffee.mod )[-1]
CI <- confint(coffee.mod)
lines( y = c(0,0), x = CI["Coffee",] , lwd = 6, col = 'red')
lines( x = c(0,0), y = CI["Stress",] , lwd = 6, col = 'red')
points( diag( beta ), col = 'black', pch = 16, cex=1.8)  # projections of beta-hat onto each axis

abline(v = CI["Coffee",], col = "red", lty = 2)
abline(h = CI["Stress",], col = "red", lty = 2)

text(-2.1, 2.35, "Beta space", cex=2, pos = 4)
arrows(beta[1], beta[2], beta[1], 0, angle=8, len=0.2)
arrows(beta[1], beta[2], 0, beta[2], angle=8, len=0.2)

text( -1.5, 1.85, "df = 2", col = 'blue', adj = 0, cex=1.2)
text( 0.2, .85, "df = 1", col = 'red', adj = 0, cex=1.2)

heplots::mark.H0(col = "darkgreen", pch = "+", lty = 0, pos = 4, cex = 3)  # mark H0: (0, 0)

7.2 Measurement error

7.2.1 OLS is BLUE

In classical linear models, the predictors are often considered to be fixed variables, or, if random, to be measured without error and independent of the regression errors. Either condition, along with the assumption of linearity, guarantees that the standard OLS estimators are unbiased. That is, in a simple linear regression, \(y = \beta_0 + \beta_1 x + \epsilon\), the estimated slope \(\hat{\beta}_1\) will have an expected value \(\mathcal{E} (\hat{\beta}_1)\) equal to the true population value \(\beta_1\) over repeated samples.

Not only this, but the Gauss-Markov theorem guarantees that the OLS estimator is also the most efficient, because it has the least variance among all linear unbiased estimators. The classical OLS estimator is said to be BLUE: the Best (lowest variance) Linear Unbiased Estimator.

7.2.2 Errors in predictors

Errors in the response \(y\) are accounted for in the model and measured by the mean squared error, \(\text{MSE} = \hat{\sigma}_\epsilon^2\). But in practice, of course, predictor variables are often also observed indicators, subject to their own error. Indeed, in the behavioral sciences it is rare that predictors are perfectly reliable and measured exactly. This fact is recognized in errors-in-variables regression models (Fuller, 2006) and in more general structural equation models, but is often ignored otherwise. Ellipsoids in data space and \(\beta\) space are well suited to showing the effect of measurement error in predictors on OLS estimates.

The statistical facts are well known, though perhaps counter-intuitive in certain details: measurement error in a predictor biases regression coefficients toward 0, while error in the measurement of \(y\) increases the MSE and thus the standard errors of the regression coefficients, but does not introduce bias in the coefficients.

7.2.2.1 Example

An illuminating example can be constructed by starting with the simple linear regression \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \; , \] where \(x_i\) is the true, fully reliable predictor and \(y_i\) is the response, with error variance \(\sigma_\epsilon^2\). Now consider that we don’t measure \(x_i\) exactly, but instead observe \[ x^\star_i = x_i + \eta_i \; , \] where the measurement error \(\eta_i\) is independent of the true \(x_i\) and has variance \(\sigma^2_\eta\). We can extend this example to also consider the effect of adding additional, independent error variance to \(y\), so that instead of \(y_i\) we observe

\[ y^\star_i = y_i + \nu_i \; , \] where the additional error \(\nu_i\) has variance \(\sigma^2_\nu\).

Let’s simulate an example where the true relation is \(y = 0.2 + 0.3 x\) with error standard deviation \(\sigma = 0.5\). I’ll take \(x\) to be uniformly distributed in [0, 10] and calculate \(y\) as normally distributed around that linear relation.

library(dplyr)
library(ggplot2)
library(forcats)
library(patchwork)

set.seed(123)
n <- 300

a <- 0.2    # true intercept
b <- 0.3    # true slope
sigma <- 0.5 # baseline error standard deviation

x <- runif(n, 0, 10)
y <- rnorm(n, a + b*x, sigma)
demo <- data.frame(x,y)

Then, generate alternative values \(x^\star\) and \(y^\star\) with additional error standard deviations around \(x\) given by \(\sigma_\eta = 4\) and around \(y\) given by \(\sigma_\nu = 1\).

err_y <- 1   # additional error stdev for y
err_x <- 4   # additional error stdev for x
demo  <- demo |>
  mutate(y_star = rnorm(n, y, err_y),
         x_star = rnorm(n, x, err_x))

There are four possible models we could fit and compare, using the combinations of \((x, x^\star)\) and \((y, y^\star)\):

fit_1 <- lm(y ~ x,           data = demo)   # no additional error
fit_2 <- lm(y_star ~ x,      data = demo)   # error in y
fit_3 <- lm(y ~ x_star,      data = demo)   # error in x
fit_4 <- lm(y_star ~ x_star, data = demo)   # error in x and y
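Before plotting, a quick look at the fitted slopes already tells the story (a minimal check; the second coefficient of each model is the slope):

sapply(list(fit_1 = fit_1, fit_2 = fit_2, fit_3 = fit_3, fit_4 = fit_4),
       function(mod) coef(mod)[2])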

However, to show the differences visually, we can simply plot the data for each pair and show the regression lines (with confidence bands) and the data ellipses. To do this efficiently with ggplot2, it is necessary to transform the demo data to long format with columns x and y, distinguished by name for the four combinations.

# make the demo dataset long, with names for the four conditions
df <- bind_rows(
  data.frame(x=demo$x,      y=demo$y,      name="No measurement error"),
  data.frame(x=demo$x,      y=demo$y_star, name="Measurement error on y"),
  data.frame(x=demo$x_star, y=demo$y,      name="Measurement error on x"),
  data.frame(x=demo$x_star, y=demo$y_star, name="Measurement error on x and y")) |>
  mutate(name = fct_inorder(name)) 

Then, we can plot the data in df with points, regression lines and a data ellipse, faceting by name to give the measurement error quartet.

ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.3) +
  stat_ellipse(geom = "polygon", 
               color = "blue",fill= "blue", 
               alpha=0.1, linewidth = 1.1) +
  geom_smooth(method="lm", formula = y~x, fullrange=TRUE, level=0.995,
              color = "red", fill = "red", alpha = 0.2) +
  facet_wrap(~name) +
  theme_bw(base_size = 14)
Figure 7.4: The measurement error quartet: Each plot shows the linear regression of y on x, but where additional error variance has been added to y or x or both. The widths of the confidence bands and the vertical extent of the data ellipses show the effect on precision.

Comparing the plots in the first row, you can see that when additional error is added to \(y\), the regression slope remains essentially unchanged, illustrating that the estimate is unbiased. However, the confidence bounds on the regression line become wider, and the data ellipse becomes fatter in the \(y\) direction, illustrating the loss of precision.
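This can be confirmed numerically by comparing the slope estimates and their standard errors for the models without and with added error in \(y\) (a small check using the fits from above):

rbind(no_error = summary(fit_1)$coefficients["x", ],
      error_y  = summary(fit_2)$coefficients["x", ])  # similar slopes, but a larger SE with error in y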

The effect of error in \(x\) is less kind. Comparing the first row of plots with the second row, you can see that the estimated slope decreases when errors are added to \(x\). This is called attenuation bias, and it can be shown that \[ \widehat{\beta}_{x^\star} \longrightarrow \frac{\beta}{1+\sigma^2_\eta /\sigma^2_x} \; , \] where \(\beta\) here refers to the regression slope and \(\longrightarrow\) means “converges to”, as the sample size gets large. Thus, as \(\sigma^2_\eta\) increases, \(\widehat{\beta}_{x^\star}\) becomes increasingly smaller than \(\beta\).
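We can check the attenuation formula against the simulation (a small sketch using the objects defined above; the two numbers should agree approximately):

lambda <- var(demo$x) / (var(demo$x) + err_x^2)   # reliability of x_star: 1 / (1 + sigma^2_eta / sigma^2_x)
c(predicted = b * lambda,                         # attenuated slope from the formula
  observed  = unname(coef(fit_3)["x_star"]))      # slope from the regression of y on x_star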

Beyond plots like Figure 7.4, we can see the effects of error in \(x\) or \(y\) on the model summary statistics such as the correlation \(r_{xy}\) or MSE by extracting these from the fitted models. This is easily done using dplyr::nest_by(name) and fitting the regression model to each subset, from which we can obtain the model statistics using sigma(), coef() and so forth.

model_stats <- df |>
  dplyr::nest_by(name) |>
  mutate(model = list(lm(y ~ x, data = data)),
         sigma = sigma(model),
         intercept = coef(model)[1],
         slope = coef(model)[2],
         r = sqrt(summary(model)$r.squared)) |>
  mutate(errX = stringr::str_detect(name, " x"),
         errY = stringr::str_detect(name, " y")) |>
  relocate(errX, errY, r, .after = name) |>
  select(-data) |>
  print()
#> # A tibble: 4 × 8
#> # Rowwise:  name
#>   name                errX  errY      r model sigma intercept  slope
#>   <fct>               <lgl> <lgl> <dbl> <lis> <dbl>     <dbl>  <dbl>
#> 1 No measurement err… FALSE FALSE 0.858 <lm>  0.495    0.244  0.294 
#> 2 Measurement error … FALSE TRUE  0.648 <lm>  1.09     0.0838 0.329 
#> 3 Measurement error … TRUE  FALSE 0.481 <lm>  0.844    1.22   0.0946
#> 4 Measurement error … TRUE  TRUE  0.401 <lm>  1.31     1.12   0.117

We plot the model \(R = r_{xy}\) and the estimated residual standard error in Figure 7.5 below. The lines connecting the points are approximately parallel, indicating that errors of measurement in \(x\) and \(y\) have nearly additive effects on model summaries.

p1 <- ggplot(data=model_stats, 
             aes(x = errX, y = r, 
                 group = errY, color = errY, shape = errY)) +
  geom_point(size = 4) +
  geom_line(linewidth = 1.2) +
  labs(x = "Error on X?",
       y = "Model R ",
       color = "Error on Y?",
       shape = "Error on Y?") +
  theme_bw(base_size = 14) +
  theme(legend.position = "inside",
        legend.position.inside = c(.75, .7))

p2 <- ggplot(data=model_stats, 
             aes(x = errX, y = sigma, 
                 group = errY, color = errY, shape = errY)) +
  geom_point(size = 4) +
  geom_line(linewidth = 1.2) +
  labs(x = "Error on X?",
     y = "Model residual standard error",
     color = "Error on Y?",
     shape = "Error on Y?") +
  theme_bw(base_size = 14) +
  theme(legend.position = "inside",
        legend.position.inside = c(.75, .7))

p1 + p2
Figure 7.5: Model statistics for the combinations of additional error variance in x or y or both. Left: model R; right: Residual standard error.

7.2.3 Coffee data: \(\beta\) space

In multiple regression the effects of measurement error in a predictor become more complex, because error variance in one predictor, \(x_1\), say, can affect the coefficients of other terms in the model.

Consider the marginal relation between Heart disease and Stress in the coffee data. Figure 7.6 shows this with data ellipses in data space and the corresponding confidence ellipses in \(\beta\) space. Each panel starts with the observed data (the darkest ellipse, marked \(0\)), then adds random normal error, \(\mathcal{N}(0, \delta \times \mathrm{SD}_{Stress})\), with \(\delta = \{0.75, 1.0, 1.5\}\), to the value of Stress, while keeping the mean of Stress the same. All of the data ellipses have the same vertical shadows (\(\text{SD}_{\textrm{Heart}}\)), while the horizontal shadows increase with \(\delta\), driving the slope for Stress toward 0.

In \(\beta\) space, it can be seen that the estimated coefficients, \((\beta_0, \beta_{\textrm{Stress}})\) vary along a line and approach \(\beta_{\textrm{Stress}}=0\) as \(\delta\) gets sufficiently large. The shadows of ellipses for \((\beta_0, \beta_{\textrm{Stress}})\) along the \(\beta_{\textrm{Stress}}\) axis also demonstrate the effects of measurement error on the standard error of \(\beta_{\textrm{Stress}}\).

Figure 7.6: Effects of measurement error in Stress on the marginal relationship between Heart disease and Stress. Each panel starts with the observed data (\(\delta = 0\)), then adds random normal error, \(\mathcal{N}(0, \delta \times \text{SD}_\text{Stress})\) with standard deviations multiplied by \(\delta\) = 0.75, 1.0, 1.5, to the value of Stress. Increasing measurement error biases the slope for Stress toward 0. Left: 50% data ellipses; right: 50% confidence ellipses.

Perhaps less well-known, but both more surprising and interesting, is the effect that measurement error in one variable, \(x_1\), has on the estimate of the coefficient for another variable, \(x_2\), in a multiple regression model. Figure 7.7 shows the confidence ellipses for \((\beta_{\textrm{Coffee}}, \beta_{\textrm{Stress}})\) in the multiple regression predicting Heart disease, adding random normal error \(\mathcal{N}(0, \delta \times \mathrm{SD}_{Stress})\), with \(\delta = \{0, 0.2, 0.4, 0.8\}\), to the value of Stress alone.
As can be plainly seen, while this measurement error in Stress attenuates its coefficient, it also has the effect of biasing the coefficient for Coffee toward that in the regression of Heart disease on Coffee alone.

Figure 7.7: Biasing effect of measurement error in one variable (Stress) on the coefficient of another variable (Coffee) in a multiple regression. The coefficient for Coffee is driven towards its value in the marginal model using Coffee alone, as measurement error in Stress makes it less informative in the joint model.
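A minimal sketch of this kind of demonstration (not the code used to draw Figure 7.7, and with an arbitrary seed, so the numbers will differ) adds error only to Stress and refits the joint model for each value of \(\delta\):

set.seed(42)                 # arbitrary seed for this sketch
deltas <- c(0, 0.2, 0.4, 0.8)
coefs <- sapply(deltas, function(delta) {
  noisy <- within(coffee,
                  Stress <- Stress + rnorm(length(Stress), 0, delta * sd(Stress)))
  coef(lm(Heart ~ Coffee + Stress, data = noisy))[c("Coffee", "Stress")]
})
colnames(coefs) <- paste0("delta = ", deltas)
round(coefs, 2)              # the Stress slope shrinks; the Coffee slope drifts ...
coef(lm(Heart ~ Coffee, data = coffee))["Coffee"]  # ... toward its marginal value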

  1. This example was developed by Georges Monette.