There is no excuse for failing to plot and look.
The greatest value of a picture is when it forces us to notice what we never expected to see. — John W. Tukey, Exploratory Data Analysis, 1977
These quotes from John Tukey remind us that data analysis should nearly always start with graphs to help us understand the main features of our data. It is important to understand the general patterns and trends: Are relationships increasing or decreasing? Are they approximately linear or non-linear? But it is also important to spot anomalies: “unusual” observations, groups of points that seem to differ from the rest, and so forth. As we saw with Anscombe’s quartet (Section 2.1.1), numerical summaries can hide features that are immediately apparent in a plot.
This chapter introduces a toolbox of basic graphical methods for visualizing multivariate datasets. It starts with some simple techniques to enhance the basic scatterplot with graphical annotations such as fitted lines, curves and data ellipses to summarize the relation between two variables.
To visualize more than two variables, we can view all pairs of variables in a scatterplot matrix or shift gears entirely to show multiple variables along a set of parallel axes. As the number of variables increases, we may need to suppress details with stronger summaries for a high-level reconnaissance of our data terrain, as we do by zooming out on a map. For example, we can simply remove the data points or make them nearly transparent to focus on the visual summaries provided by fitted lines or other graphical summaries.
Packages
In this chapter I use the following packages. Load them now:
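The package list itself is not shown in this extract. A plausible set, inferred from the functions used in this chapter (the data-only packages carData, palmerpenguins, HistData and ggbiplot are accessed via data() and need not be attached), is:

library(ggplot2)    # grammar-of-graphics plots
library(dplyr)      # data wrangling: mutate(), rename(), group_by(), ...
library(car)        # scatterplot(), scatterplotMatrix(), dataEllipse()
library(corrplot)   # corrplot(), corrplot.mixed(), corrMatOrder()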
3.1 Bivariate summaries
The basic scatterplot is the workhorse of multivariate data visualization, showing how one variable, \(y\), often an outcome, is explained by or varies with another, \(x\). It is a building block for many useful techniques, so it is helpful to understand how it can be used as a tool for thinking in a wider, multivariate context.
The essential idea is that we can start with a simple version of the scatterplot and add annotations to show interesting features more clearly. We consider the following here:
- Smoothers: Showing overall trends, perhaps in several forms, as visual summaries such as fitted regression lines or curves and nonparametric smoothers.
- Stratifiers: Using color, shape or other features to identify subgroups; more generally, conditioning on other variables in multi-panel displays.
- Data ellipses: A compact 2D visual summary of bivariate linear relations and uncertainty assuming normality; more generally, contour plots of bivariate density.
Example: Academic salaries
Let’s start with data on the academic salaries of faculty members collected at a U.S. college for the purpose of assessing salary differences between male and female faculty members, and perhaps addressing anomalies in compensation. The dataset carData::Salaries gives data on nine-month salaries and other variables for 397 faculty members in the 2008–2009 academic year.
data(Salaries, package = "carData")
str(Salaries)
#> 'data.frame': 397 obs. of 6 variables:
#> $ rank : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
#> $ discipline : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
#> $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
#> $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
#> $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
#> $ salary : int 139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
The most obvious, but perhaps naive, predictor of salary is yrs.since.phd. For simplicity, I’ll refer to this as years of “experience.” Before looking at differences between males and females, we would want to consider faculty rank (related also to yrs.service) and discipline, recorded here as "A" (“theoretical” departments) or "B" (“applied” departments). But, for a basic plot, we will ignore these for now to focus on what can be learned from plot annotations.
library(ggplot2)
gg1 <- ggplot(Salaries,
aes(x = yrs.since.phd, y = salary)) +
geom_jitter(size = 2) +
scale_y_continuous(labels = scales::dollar_format(
prefix="$", scale = 0.001, suffix = "K")) +
labs(x = "Years since PhD",
y = "Salary")
gg1 + geom_rug(position = "jitter", alpha = 1/4)
There is quite a lot we can see “just by looking” at this simple plot, but the main things are:
- Salary increases generally from 0–40 years since the PhD, but then maybe begins to drop off (partial retirement?);
- Variability in salary increases among those with the same experience, a “fan-shaped” pattern that signals a violation of homogeneity of variance in simple regression;
- Data beyond 50 years is thin, but there are some quite low salaries there. Adding rug plots to the X and Y axes is a simple but effective way to show the marginal distributions of the observations. Jitter and transparency help to avoid overplotting due to discrete values.
3.1.1 Smoothers
Smoothers are among the most useful graphical annotations you can add to such plots, giving a visual summary of how \(y\) changes with \(x\). The most common smoother is a line showing the linear regression for \(y\) given \(x\), expressed in math notation as \(\mathbb{E} (y | x) = b_0 + b_1 x\). If there is doubt that a linear relation is an adequate summary, you can try quadratic or other polynomial smoothers.
In ggplot2, these are easily added to a plot using geom_smooth() with method = "lm" and a model formula, which (by default) is y ~ x for a linear relation, or y ~ poly(x, k) for a polynomial of degree \(k\).
gg1 +
geom_smooth(method = "lm", formula = "y ~ x",
color = "red", fill= "pink",
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ poly(x,2)",
color = "darkgreen", fill = "lightgreen",
linewidth = 2)
This serves to highlight some of our impressions from the basic scatterplot shown in Figure 3.1, making them more apparent. And that’s precisely the point: the regression smoother draws attention to a possible pattern that we can consider as a visual summary of the data. You can think of this as showing what a linear (or quadratic) regression “sees” in the data. Statistical tests can help you decide if there is more evidence for a quadratic fit compared to the simpler linear relation.
It is useful to also show some indication of uncertainty (or inversely, precision) associated with the predicted values. Both the linear and quadratic trends are shown in Figure 3.2 with 95% pointwise confidence bands.1 These are necessarily narrower in the center of the range of \(x\) where there is typically more data; they get wider toward the highest values of experience where the data are thinner.
Non-parametric smoothers
The most generally useful idea is a smoother that tracks an average value, \(\mathbb{E} (y | x)\), of \(y\) as \(x\) varies across its range, without assuming any particular functional form, and so avoiding the need to choose among y ~ poly(x, 1), y ~ poly(x, 2), y ~ poly(x, 3), and so on.
Non-parametric smoothers attempt to estimate \(\mathbb{E} (y | x) = f(x)\) where \(f(x)\) is some smooth function. These typically use a collection of weighted local regressions for each \(x_i\) within a window centered at that value. In the method called lowess or loess (Cleveland, 1979; Cleveland & Devlin, 1988), a weight function is applied, giving greatest weight to \(x_i\) and a weight of 0 outside a window containing a certain fraction, \(s\), called the span, of the nearest neighbors of \(x_i\). The fraction \(s\) is usually within the range \(1/3 \le s \le 2/3\), and it determines the smoothness of the resulting curve: smaller values produce a wigglier curve and larger values a smoother fit. (An optimal span can be determined by \(k\)-fold cross-validation to minimize a measure of overall error of approximation.)
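To make the role of the span concrete, here is a minimal base-R sketch using the Salaries data from above (the span and degree values are arbitrary choices for illustration):

# Fit a loess smoother directly; span is the fraction of nearest neighbors
# used in each local regression (larger span = smoother curve)
lo <- loess(salary ~ yrs.since.phd, data = Salaries, span = 2/3, degree = 1)
head(predict(lo))   # smoothed values at the observed x's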
Non-parametric regression is a broad topic; see Fox (2016), Ch. 18 for a more general treatment including smoothing splines, and Wood (2006) for generalized additive models, fit using method = "gam" in ggplot2, which is the default when the largest group has more than 1,000 observations.
Figure 3.3 shows the addition of a loess smooth to the plot in Figure 3.2, suppressing the confidence band for the linear regression. The loess fit is nearly coincident with the quadratic fit, but has a slightly wider confidence band.
gg1 +
geom_smooth(method = "loess", formula = "y ~ x",
color = "blue", fill = scales::muted("blue"),
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE,
color = "red",
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ poly(x,2)",
color = "darkgreen", fill = "lightgreen",
linewidth = 2)
But now comes an important question: is it reasonable that academic salary should increase up to about 40 years since the PhD degree and then decline? The predicted salary for someone still working 50 years after earning their degree is about the same as a person at 15 years. What else is going on here?
3.1.2 Stratifiers
Very often, we have a main relationship of interest, but various groups in the data are identified by discrete factors (like faculty rank and sex, or their type of discipline here), or there are quantitative predictors for which the main relation might vary. In the language of statistical models such effects are interaction terms, as in y ~ group + x + group:x, where the term group:x fits a different slope for each group and the grouping variable is often called a moderator variable. Common moderator variables are ethnicity, health status, social class and level of education. Moderators can also be continuous variables, as in y ~ x1 + x2 + x1:x2.
I call these stratifiers, recognizing that we should consider breaking down the overall relation to see whether and how it changes over such “other” variables. Such variables are most often factors, but we can cut a continuous variable into ranges (shingles) and do the same graphically. There are two general stratifying graphical techniques:
Grouping: Identify subgroups in the data by assigning different visual attributes, such as color, shape, line style, etc. within a single plot. This is quite natural for factors; quantitative predictors can be accommodated by cutting their range into ordered intervals. Grouping has the advantage that the levels of a grouping variable can be shown within the same plot, facilitating direct comparison.
Conditioning: Showing subgroups in different plot panels. This has the advantages that relations for the individual groups are more easily discerned and that one can easily stratify by two (or more) other variables jointly, but visual comparison is more difficult because the eye must scan from one panel to another.
Recognition of the roles of visual grouping by factors within a panel and conditioning in multi-panel displays was an important advance in the development of modern statistical graphics. It began at AT&T Bell Labs in Murray Hill, NJ in conjunction with the S language, the mother of R.
Conditioning displays (originally called coplots (Chambers & Hastie, 1991)) are simply a collection of 1D, 2D or 3D plots in separate panels for subsets of the data broken down by one or more factors, or, for quantitative variables, subdivided into a factor with several overlapping intervals (shingles). The first implementation was in Trellis plots (Becker et al., 1996; Cleveland, 1985).
Trellis displays were extended in the lattice package (Sarkar, 2024), which offered:
- A graphing syntax similar to that used in statistical model formulas: y ~ x | g conditions the data by the levels of g, with | read as “given”; two or more conditioning variables are specified as y ~ x | g1 + g2 + ..., with + read as “and”.
- Panel functions that define what is plotted in a given panel. panel.xyplot() is the default for scatterplots, plotting points, but you can add panel.lmline() for regression lines, latticeExtra::panel.smoother() for loess smooths, and a wide variety of others. (A minimal sketch of this syntax follows this list.)
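Here is such a sketch, using the Salaries data from above. The particular panel function is my illustration of the idea, not code used for any figure in this chapter:

library(lattice)
# One panel per level of rank, read as "salary vs. years since PhD, given rank"
xyplot(salary ~ yrs.since.phd | rank, data = Salaries,
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)                   # the points
         panel.lmline(x, y, col = "red", lwd = 2)  # add a regression line
       })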
The car package (Fox et al., 2023) supports this graphing syntax in many of its functions. ggplot2 does not; it uses aesthetics (aes()), which map variables in the data to visual characteristics in displays.
The most obvious variable that affects academic salary is rank, because faculty typically get an increase in salary with a promotion that carries through in their future salary. What can we see if we group by rank and fit a separate smoothed curve for each?
In ggplot2 thinking, grouping is accomplished simply by adding an aesthetic, such as color = rank. What happens then is that points, lines, smooths and other geom_*() layers inherit the feature that they are differentiated by color. In the case of geom_smooth(), we get a separate fit for each subset of the data, according to rank.
# make some reusable pieces to avoid repetition
scale_salary <- scale_y_continuous(
labels = scales::dollar_format(prefix="$",
scale = 0.001,
suffix = "K"))
# position the legend inside the plot
legend_pos <- theme(legend.position = "inside",
legend.position.inside = c(.1, 0.95),
legend.justification = c(0, 1))
ggplot(Salaries,
aes(x = yrs.since.phd, y = salary,
color = rank, shape = rank)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = rank),
method = "loess", formula = "y ~ x",
linewidth = 2) +
legend_pos
Well, there is a different story here. Salaries generally occupy separate vertical levels, increasing with academic rank. The horizontal extents of the smoothed curves show their ranges. Within each rank there is some initial increase after promotion, and then some tendency to decline with increasing years. But, by and large, years since the PhD doesn’t make as much difference once we’ve taken academic rank into account.
What about discipline, which is classified, perhaps peculiarly, as “theoretical” vs. “applied”? The values are just "A" and "B", so I map these to more meaningful labels before making the plot.
Salaries <- Salaries |>
mutate(discipline =
factor(discipline,
labels = c("A: Theoretical", "B: Applied")))
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary, color = discipline)) +
geom_point() +
scale_salary +
geom_smooth(aes(fill = discipline ),
method = "loess", formula = "y ~ x",
linewidth = 2) +
labs(x = "Years since PhD",
y = "Salary") +
legend_pos
The story in Figure 3.5 is again different. Faculty in applied disciplines earn, on average, about $10,000 more per year than their theoretical colleagues.
For both groups, there is an approximately linear relation up to about 30–40 years, but the smoothed curves then diverge into the region where the data is thinner.
This result is more surprising than differences among faculty ranks. The effect of annotation with smoothed curves as visual summaries is apparent, and it provides a stimulus to think about why these differences (if they are real) exist between theoretical and applied professors, and whether theoreticians should perhaps be paid more!
3.1.3 Conditioning
The previous plots use grouping by color to plot the data for different subsets inside the same plot window, making comparison among groups easier, because they can be directly compared along a common vertical scale.2 This gets messy, however, when there are more than just a few levels, or worse—when there are two (or more) variables for which we want to show separate effects. In such cases, we can plot separate panels using the ggplot2 concept of faceting. There are two options: facet_wrap() takes one or more conditioning variables and produces a ribbon of plots for each combination of levels; facet_grid(row ~ col) takes two or more conditioning variables and arranges the plots in a 2D array identified by the row and col variables.
Let’s look at salary broken down by the combinations of discipline and rank. Here, I chose to stratify by rank using color within each panel, faceting by discipline. Because there is more going on in this plot, a linear smooth is used to represent the trend.
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary,
color = rank, shape = rank)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = rank),
method = "lm", formula = "y ~ x",
linewidth = 2, alpha = 1/4) +
facet_wrap(~ discipline) +
legend_pos
Once both of these factors are taken into account, there does not seem to be much impact of years since the PhD. Salaries in applied disciplines are noticeably greater than those in theoretical disciplines at all ranks, and there are even greater differences among ranks.
Finally, to shed light on the question that motivated this example— are there anomalous differences in salary for men and women— we can look at differences in salary according to sex, when discipline and rank are taken into account. To do this graphically, condition by both variables, but use facet_grid(discipline ~ rank) to arrange their combinations in a grid whose rows are the levels of discipline and columns are those of rank. I want to make the comparison of males and females most direct, so I use color = sex to stratify within the panels. The smoothed regression lines and error bands are calculated separately for each combination of discipline, rank and sex.
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary, color = sex)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = sex),
method = "lm", formula = "y ~ x",
linewidth = 2, alpha = 1/4) +
facet_grid(discipline ~ rank) +
theme_bw(base_size = 14) +
legend_pos
3.2 Data Ellipses
The data ellipse (Monette, 1990), or concentration ellipse (Dempster, 1969), is a remarkably simple and effective display for viewing and understanding bivariate relationships in multivariate data. The data ellipse is typically used to add a visual summary to a scatterplot that shows, all together, the means, standard deviations, correlation, and slope of the regression line for two variables, perhaps stratified by another variable. Under the classical assumption that the data are bivariate normally distributed, the data ellipse is also a sufficient visual summary, in the sense that it captures all relevant features of the data. See Friendly et al. (2013) for a complete discussion of the role of ellipsoids in statistical data visualization.
It is based on the idea that in a bivariate normal distribution, the contours of equal probability form a series of concentric ellipses. If the variables were uncorrelated and had the same variances, these would be circles, and Euclidean distance would measure the distance of each observation from the mean. When the variables are correlated, a different measure, Mahalanobis distance, is the proper way to assess how far a point is from the mean, taking the correlation into account.
To illustrate, Figure 3.8 shows a scatterplot with labels for two points, “A” and “B”. Which is further from the mean, “X”? A contour of constant Euclidean distance, shown by the red dashed circle, ignores the apparent negative correlation, so point “A” is further. The blue ellipse for Mahalanobis distance takes the correlation into account, so point “B” has a greater distance from the mean.
Mathematically, Euclidean (squared) distance for \(p\) variables, \(j = 1, 2, \dots , p\), is just a generalization of the square of a univariate standardized (\(z\)) score, \(z^2 = [(y - \bar{y}) / s]^2\),
\[ D_E^2 (\mathbf{y}) = \sum_j^p z_j^2 = \mathbf{z}^\textsf{T} \mathbf{z} = (\mathbf{y} - \bar{\mathbf{y}})^\textsf{T} \operatorname{diag}(\mathbf{S})^{-1} (\mathbf{y} - \bar{\mathbf{y}}) \; , \] where \(\mathbf{S}\) is the sample variance-covariance matrix, \(\mathbf{S} = ({n-1})^{-1} \sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}}) (\mathbf{y}_i - \bar{\mathbf{y}})^\textsf{T}\).
Mahalanobis’ distance takes the correlations into account simply by using the covariances as well as the variances, \[ D_M^2 (\mathbf{y}) = (\mathbf{y} - \bar{\mathbf{y}})^\mathsf{T} \mathbf{S}^{-1} (\mathbf{y} - \bar{\mathbf{y}}) \; . \tag{3.1}\]
In Equation 3.1, the inverse \(\mathbf{S}^{-1}\) serves to “divide” the matrix \((\mathbf{y} - \bar{\mathbf{y}})^\mathsf{T} (\mathbf{y} - \bar{\mathbf{y}})\) of squared distances by the variances (and covariances) of the variables, as in the univariate case.
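As a quick numerical check of these definitions, base R’s mahalanobis() function computes \(D_M^2\) directly. A small sketch, re-using the Salaries data from earlier in the chapter (any bivariate data would do):

# Mahalanobis squared distances for each observation
X  <- Salaries[, c("yrs.since.phd", "salary")]
DM <- mahalanobis(X, center = colMeans(X), cov = cov(X))
# Euclidean squared distance on standardized variables ignores the covariance
DE <- rowSums(scale(X)^2)
head(cbind(DM, DE))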
For \(p\) variables, the data ellipsoid \(\mathcal{E}_c\) of size \(c\) is a \(p\)-dimensional ellipse, defined as the set of points \(\mathbf{y} = (y_1, y_2, \dots y_p)\) whose squared Mahalanobis distance, \(D_M^2 ( \mathbf{y} )\), is less than or equal to \(c^2\), \[ \mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) := \{ D_M^2 (\mathbf{y}) \le c^2 \} \; . \] A computational definition recognizes that the boundary of the ellipsoid can be found by transforming a unit sphere \(\mathcal{P}\) centered at the origin, \(\mathcal{P} : \{ \mathbf{x}^\textsf{T} \mathbf{x}= 1\}\), by \(\mathbf{S}^{1/2}\) and then shifting that to the centroid of the data,
\[ \mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) = \bar{\mathbf{y}} \; \oplus \; \mathbf{S}^{1/2} \, \mathcal{P} \:\: , \] where \(\mathbf{S}^{1/2}\) represents a rotation and scaling and the notation \(\oplus\) represents translation to a new centroid, \(\bar{\mathbf{y}}\) here. The matrix \(\mathbf{S}^{1/2}\) is commonly computed as the Choleski factor of \(\mathbf{S}\). Slightly abusing notation and taking the unit sphere as given (like an identity matrix \(\mathbf{I}\)), we can write the data ellipsoid as simply:
\[ \mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) = \bar{\mathbf{y}} \; \oplus \; \sqrt{\mathbf{S}} \:\: . \tag{3.2}\]
When \(\mathbf{y}\) is (at least approximately) bivariate normal, \(D_M^2(\mathbf{y})\) has a large-sample \(\chi^2_2\) distribution (\(\chi^2\) with 2 df), so
- \(c^2 = \chi^2_2 (0.68) = 2.28\) gives a “1 standard deviation bivariate ellipse,” an analog of the standard interval \(\bar{y} \pm 1 s\), while
- \(c^2 = \chi^2_2 (0.95) = 5.99 \approx 6\) gives a data ellipse of 95% coverage.
In not-large samples, the radius \(c\) of the ellipsoid is better approximated using a quantile of an \(F_{p, n-p}\) distribution, becoming \(c =\sqrt{ 2 F_{2, n-2}^{1-\alpha} }\) in the bivariate case (\(p=2\)) for coverage \(1-\alpha\).
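A small sketch of these radius calculations, and of the construction in Equation 3.2, in base R (the helper function here is mine, for illustration only):

# Radii c for 68% and 95% coverage: chi-square vs. the small-sample F version
n <- 397                                 # e.g., the Salaries sample size
sqrt(qchisq(c(0.68, 0.95), df = 2))      # approximately 1.51 and 2.45
sqrt(2 * qf(c(0.68, 0.95), 2, n - 2))    # slightly larger for the 95% radius

# Boundary of a 2D data ellipse (Equation 3.2): a unit circle, scaled by a
# Cholesky factor of S, then shifted to the centroid
ellipse_points <- function(center, S, radius = 1, npoints = 100) {
  theta  <- seq(0, 2 * pi, length.out = npoints)
  circle <- cbind(cos(theta), sin(theta))           # the unit circle P
  sweep(radius * circle %*% chol(S), 2, center, "+")
}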
3.2.1 Ellipse properties
The essential ideas of correlation and regression and their relation to ellipses go back to Galton (1886). Galton’s goal was to predict (or explain) how a heritable trait, \(Y\), (e.g., height) of children was related to that of their parents, \(X\). He made a semi-graphic table of the frequencies of 928 observations of the average height of father and mother versus the height of their child, shown in Figure 3.9. He then drew smoothed contour lines of equal frequencies and had the wonderful visual insight that these formed concentric shapes that were tolerably close to ellipses.
He then calculated summaries, \(\text{Ave}(Y | X)\), and, for symmetry, \(\text{Ave}(X | Y)\), and plotted these as lines of means on his diagram. Lo and behold, he had a second visual insight: the lines of means of (\(Y | X\)) and (\(X | Y\)) corresponded approximately to the loci of horizontal and vertical tangents to the concentric ellipses. To complete the picture, he added lines showing the major and minor axes of the family of ellipses (which turned out to be the principal components) with the result shown in Figure 3.9.
For two variables, \(x\) and \(y\), the remarkable properties of the data ellipse are illustrated in Figure 3.10, a modern reconstruction of Galton’s diagram.
- The ellipses have the mean vector \((\bar{x}, \bar{y})\) as their center.
- The lengths of the arms of the blue dashed central cross show the standard deviations of the variables, which correspond to the shadows of the ellipse covering 40% of the data. These are the bivariate analogs of the standard intervals \(\bar{x} \pm 1 s_x\) and \(\bar{y} \pm 1 s_y\).
- More generally, shadows (projections) on the coordinate axes, or on any linear combination of them, give any standard interval, \(\bar{x} \pm k s_x\) and \(\bar{y} \pm k s_y\). Ellipses with \(k = 1, 1.5, 2.45\) have bivariate coverage 40%, 68% and 95% respectively, corresponding to these quantiles of the \(\chi^2\) distribution with 2 degrees of freedom, i.e., \(\chi^2_2 (.40) \approx 1^2\), \(\chi^2_2 (.68) \approx 1.5^2\), and \(\chi^2_2 (.95) \approx 2.45^2\). In terms of coverage, the 68% ellipse is thus the bivariate analog of a univariate \(\bar{x} \pm 1 s_x\) interval.
- The regression line predicting \(y\) from \(x\) goes through the points where the ellipses have vertical tangents. The other regression line, predicting \(x\) from \(y\), goes through the points of horizontal tangency.
- The correlation \(r(x, y)\) is the ratio of the vertical segment from the mean of \(y\) to the regression line to the vertical segment going to the top of the ellipse, as shown at the right of the figure. It is \(r = 0.46\) in this example.
- The residual standard deviation, \(s_e = \sqrt{MSE} = \sqrt{\sum (y - \hat{y})^2 / (n-2)}\), is the half-length of the vertical section of the ellipse at the mean \(\bar{x}\).
Because Galton’s values of parent and child height were recorded in class intervals of 1 in., they are shown as sunflower symbols in Figure 3.10, with multiple ‘petals’ reflecting the number of observations at each location. This plot (except for annotations) is constructed using sunflowerplot() and car::dataEllipse() for the ellipses.
data(Galton, package = "HistData")
sunflowerplot(parent ~ child, data=Galton,
xlim=c(61,75),
ylim=c(61,75),
seg.col="black",
xlab="Child height",
ylab="Mid Parent height")
y.x <- lm(parent ~ child, data=Galton) # regression of y on x
abline(y.x, lwd=2)
x.y <- lm(child ~ parent, data=Galton) # regression of x on y
cc <- coef(x.y)
abline(-cc[1]/cc[2], 1/cc[2], lwd=2, col="gray")
with(Galton,
car::dataEllipse(child, parent,
plot.points=FALSE,
levels=c(0.40, 0.68, 0.95),
lty=1:3)
)
Finally, as Galton noted in his diagram, the principal major and minor axes of the ellipse have important statistical properties. Pearson (1901) would later show that their directions are determined by the eigenvectors \(\mathbf{v}_1, \mathbf{v}_2, \dots\) of the covariance matrix \(\mathbf{S}\) and their radii by the square roots, \(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \dots\), of the corresponding eigenvalues.
3.2.2 R functions for data ellipses
A number of packages provide functions for drawing data ellipses in a scatterplot, with various features.
- car::scatterplot(): uses base R graphics to draw 2D scatterplots, with a wide variety of plot enhancements including linear and non-parametric smoothers (loess, gam), a formula method, e.g., y ~ x | group, and marking points and lines using symbol shape, color, etc. Importantly, the car package generally allows automatic identification of “noteworthy” points by their labels in the plot using a variety of methods. For example, method = "mahal" labels cases with the most extreme Mahalanobis distances; method = "r" selects points according to their value of abs(y), which is appropriate in residual plots.
- car::dataEllipse(): plots classical or robust data ellipses (using MASS::cov.trob()) for one or more groups, with the same facilities for point identification.
- heplots::covEllipses(): draws classical or robust data ellipses for one or more groups in a one-way design, and optionally for the pooled total sample, where the focus is on homogeneity of within-group covariance matrices.
- ggplot2::stat_ellipse(): uses the calculation methods of car::dataEllipse() to add unfilled (geom = "path") or filled (geom = "polygon") data ellipses in a ggplot scatterplot, using inherited aesthetics.
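For instance, a minimal sketch of covEllipses() applied to the Salaries data used earlier (the numeric variables are passed first, then the grouping factor; this is my illustration, not a figure from the text):

library(heplots)
# Within-group (and pooled) data ellipses for salary vs. years, by rank
covEllipses(Salaries[, c("yrs.since.phd", "salary")], Salaries$rank)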
3.2.3 Example: Canadian occupational prestige
These examples use the data on the prestige of 102 occupational categories and other measures from the 1971 Canadian Census, recorded in carData::Prestige.3 Our interest is in understanding how prestige (the Pineo & Porter (2008) prestige score for an occupational category, derived from a social survey) is related to census measures of the average education, income, and percent of women among incumbents in those occupations. Occupation type is a factor with levels "bc" (blue collar), "wc" (white collar) and "prof" (professional).
data(Prestige, package="carData")
# `type` is really an ordered factor. Make it so.
Prestige$type <- ordered(Prestige$type,
levels=c("bc", "wc", "prof"))
str(Prestige)
#> 'data.frame': 102 obs. of 6 variables:
#> $ education: num 13.1 12.3 12.8 11.4 14.6 ...
#> $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
#> $ women : num 11.16 4.02 15.7 9.11 11.68 ...
#> $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
#> $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
#> $ type : Ord.factor w/ 3 levels "bc"<"wc"<"prof": 3 3 3 3 3 3 3 3 3 3 ...
I first illustrate the relation between income and prestige in Figure 3.11 using car::scatterplot() with many of its bells and whistles, including marginal boxplots for the variables, the linear regression line, the loess smooth and the 68% data ellipse.
scatterplot(prestige ~ income, data=Prestige,
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
#> general.managers lawyers ministers physicians
#> 2 17 20 24
There is a lot that can be seen here:
- income is positively skewed, as is often the case.
- The loess smooth, on the scale of income, shows prestige increasing up to $15,000 (these are 1971 incomes), and then leveling off.
- The data ellipse, centered at the means, encloses approximately 68% of the data points. It adds visual information about the correlation and precision of the linear regression; but here, the non-linear trend for higher incomes strongly suggests a different approach.
- The four points identified by their labels are those with the largest Mahalanobis distances. scatterplot() prints their labels to the console.
Figure 3.12 shows a similar plot for education, which from the boxplot appears to be reasonably symmetric. The smoothed curve is quite close to the linear regression, according to which prestige increases on average by coef(lm(prestige ~ education, data=Prestige))["education"] = 5.361 points with each year of education.
scatterplot(prestige ~ education, data=Prestige,
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
#> physicians file.clerks newsboys farmers
#> 24 41 53 67
In this plot, farmers, newsboys, file.clerks and physicians are identified as noteworthy, for being furthest from the mean by Mahalanobis distance. In relation to their typical level of education, these are mostly understandable, but it is nice that farmers are rated higher in prestige than their level of education would predict.
Note that the method argument for point identification can take a vector of case numbers indicating the points to be labeled. So, to label the observations with large absolute standardized residuals in the linear model m, you can use method = which(abs(rstandard(m)) > 2).
m <- lm(prestige ~ education, data=Prestige)
scatterplot(prestige ~ education, data=Prestige,
pch = 16, cex.lab = 1.25,
boxplots = FALSE,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "black",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = which(abs(rstandard(m))>2),
col="black", cex=1.2)) |> invisible()
3.2.3.1 Plotting on a log scale
A typical remedy for the non-linear relationship of income to prestige is to plot income on a log scale. This usually makes sense, and it expresses a belief that a multiplicative or percentage increase in income has a constant impact on prestige, as opposed to the additive interpretation for income itself.
For example, the slope of the linear regression line in Figure 3.11 is given by coef(lm(prestige ~ income, data=Prestige))["income"] = 0.003. Multiplying this by 1000 says that a $1000 increase in income is associated with an average increase in prestige of about 2.9 points.
In the plot below, scatterplot(..., log = "x") re-scales the x-axis to the \(\log_e()\) scale. The slope, coef(lm(prestige ~ log(income), data=Prestige))["log(income)"] = 21.556, says that a 1% increase in income is associated with an average change of about 21.56 / 100 ≈ 0.22 in prestige.
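A quick sketch of the arithmetic behind this interpretation, using nothing beyond base R:

# For prestige ~ log(income), a 1% increase in income changes the predicted
# prestige by b * log(1.01), which is approximately b / 100
b <- coef(lm(prestige ~ log(income), data = Prestige))["log(income)"]
b * log(1.01)   # roughly 0.21 prestige points per 1% increase in income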
scatterplot(prestige ~ income, data=Prestige,
log = "x",
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen", col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
#> general.managers ministers newsboys babysitters
#> 2 20 53 63
The smoothed curve in Figure 3.14 exhibits a slight tendency to bend upwards, but a linear relation is a reasonable approximation.
3.2.3.2 Stratifying
Before going further, it is instructive to ask what we could see in the relationship between income and prestige if we stratified by type of occupation, fitting separate regressions and smooths for blue collar, white collar and professional incumbents in these occupations.
The formula prestige ~ income | type (read: prestige on income, given type) is a natural way to specify grouping by type; separate linear regressions and smooths are calculated for each group, applying the colors and point shapes specified by the col and pch arguments.
scatterplot(prestige ~ income | type, data=Prestige,
col = c("blue", "red", "darkgreen"),
pch = 15:17,
grid = FALSE,
legend = list(coords="bottomright"),
regLine = list(lwd=3),
smooth=list(smoother=loessLine,
var=FALSE, lwd.smooth=2, lty.smooth=1))
This visual analysis offers a different interpretation of the dependence of prestige on income, which appeared to be non-linear when occupation type was ignored. Instead, Figure 3.15 suggests an interaction of income by type. In a model formula this would be expressed as one of:
lm(prestige ~ income + type + income:type, data = Prestige)
lm(prestige ~ income * type, data = Prestige)
These models signify that there are different slopes (and intercepts) for the three occupational types. In this interpretation, type is a moderator variable, with a different story. The slopes of the fitted lines suggest that among blue collar workers, prestige increases sharply with their income. For white collar and professional workers, there is still an increasing relation of prestige with income, but the effect of income (slope) diminishes with higher occupational category. A different plot entails a different story.
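If you want numerical support for this visual impression, one standard follow-up (sketched here; not part of the original figures) is to compare the additive and interaction models with an F-test:

# Does allowing a different income slope for each occupation type improve the fit?
m_add <- lm(prestige ~ income + type, data = Prestige)
m_int <- lm(prestige ~ income * type, data = Prestige)
anova(m_add, m_int)   # F-test for the income:type interaction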
3.2.4 Example: Penguins data
The penguins dataset from the palmerpenguins package (Horst et al., 2022) provides further instructive examples of plots and analyses of multivariate data. The data consist of measurements of body size (flipper length, body mass, bill length and depth) of 344 penguins collected at the Palmer Research Station in Antarctica.
There were three different species of penguins (Adélie, Chinstrap & Gentoo) collected from 3 islands in the Palmer Archipelago between 2007–2009 (Gorman et al., 2014). The purpose was to examine differences in size or appearance of these species, particularly differences among the sexes (sexual dimorphism) in relation to foraging and habitat.
Here, I use a slightly altered version of the dataset, peng, renaming variables to remove the units, making factors of character variables and deleting a few cases with missing data.
data(penguins, package = "palmerpenguins")
peng <- penguins |>
rename(
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g
) |>
mutate(species = as.factor(species),
island = as.factor(island),
sex = as.factor(substr(sex,1,1))) |>
tidyr::drop_na()
str(peng)
#> tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
#> $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ bill_length : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
#> $ bill_depth : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
#> $ flipper_length: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
#> $ body_mass : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
#> $ sex : Factor w/ 2 levels "f","m": 2 1 1 1 2 1 2 1 2 2 ...
#> $ year : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
There are quite a few variables to choose from for illustrating data ellipses in scatterplots. Here I focus on the measures of their bills, bill_length and bill_depth (indicating curvature), and show how to use ggplot2 for these plots.
I’ll be using the penguins data quite a lot, so it is useful to set up custom colors like those used in Figure 3.16, and shown in Figure 3.17 with their color codes. These are shades of:
- Adelie: orange,
- Chinstrap: purple, and
- Gentoo: green.
To use these in ggplot2, I define a function peng.colors() that allows shades of light, medium and dark, and then functions scale_*_penguins() for color and fill.
peng.colors <- function(shade=c("medium", "light", "dark")) {
shade = match.arg(shade)
# light medium dark
oranges <- c("#FDBF6F", "#F89D38", "#F37A00") # Adelie
purples <- c("#CAB2D6", "#9A78B8", "#6A3D9A") # Chinstrap
greens <- c("#B2DF8A", "#73C05B", "#33a02c") # Gentoo
cols.vec <- c(oranges, purples, greens)
cols.mat <-
matrix(cols.vec, 3, 3,
byrow = TRUE,
dimnames = list(species = c("Adelie", "Chinstrap", "Gentoo"),
shade = c("light", "medium", "dark")))
# get shaded colors
cols.mat[, shade ]
}
# define color and fill scales
scale_fill_penguins <- function(shade=c("medium", "light", "dark"), ...){
shade = match.arg(shade)
ggplot2::discrete_scale(
"fill","penguins",
scales:::manual_pal(values = peng.colors(shade)), ...)
}
scale_colour_penguins <- function(shade=c("medium", "light", "dark"), ...){
shade = match.arg(shade)
ggplot2::discrete_scale(
"colour","penguins",
scales:::manual_pal(values = peng.colors(shade)), ...)
}
scale_color_penguins <- scale_colour_penguins
This is used to define a theme_penguins() function that I use simply to change the color and fill scales for the plots below.
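The definition of theme_penguins() is not shown in this extract; a minimal sketch of what it could look like, simply bundling the two scales defined above, is:

# Hypothetical sketch: return the color and fill scales as a list, so that
# "+ theme_penguins()" adds both to a ggplot in one step
theme_penguins <- function(shade = c("medium", "light", "dark"), ...) {
  shade <- match.arg(shade)
  list(scale_color_penguins(shade, ...),
       scale_fill_penguins(shade, ...))
}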
An initial plot using ggplot2, shown in Figure 3.18, uses color and point shape to distinguish the three penguin species. I annotate the plot of points with the linear regression lines, loess smooths to check for non-linearity, and 95% data ellipses to show the precision of the linear relation.
ggplot(peng,
aes(x = bill_length, y = bill_depth,
color = species, shape = species, fill=species)) +
geom_point(size=2) +
geom_smooth(method = "lm", formula = y ~ x,
se=FALSE, linewidth=2) +
geom_smooth(method = "loess", formula = y ~ x,
linewidth = 1.5, se = FALSE, alpha=0.1) +
stat_ellipse(geom = "polygon", level = 0.95, alpha = 0.2) +
theme_penguins("dark") +
theme(legend.position = "inside",
legend.position.inside = c(0.85, 0.15))
Overall, the three species occupy different regions of this 2D space and for each species the relation between bill length and depth appears reasonably linear. Given this, we can suppress plotting the data points to get a visual summary of the data using the fitted regression lines and data ellipses, as shown in Figure 3.19.
This idea, of visually thinning a graph to focus on what should be seen, becomes increasingly useful as the data become more complex. The ggplot2 framework encourages this, because we can think of various components as layers, to be included or not. Here I chose to include only the regression line and add data ellipses of 40%, 68% and 95% coverage to highlight the increasing bivariate density around the group means.
ggplot(peng,
aes(x = bill_length, y = bill_depth,
color = species, shape = species, fill=species)) +
geom_smooth(method = "lm", se=FALSE, linewidth=2) +
stat_ellipse(geom = "polygon", level = 0.95, alpha = 0.2) +
stat_ellipse(geom = "polygon", level = 0.68, alpha = 0.2) +
stat_ellipse(geom = "polygon", level = 0.40, alpha = 0.2) +
theme_penguins("dark") +
theme(legend.position = "inside",
legend.position.inside = c(0.85, 0.15))
3.2.4.1 Nonparametric bivariate density plots
While I emphasize data ellipses (because I like their beautiful geometry), other visual summaries of the bivariate density are possible and often useful.
For a single variable, stats::density() and ggplot2::geom_density() calculate a smoothed estimate of the density using nonparametric kernel methods (Silverman, 1986), whose smoothness is controlled by a bandwidth parameter, analogous to the span in a loess smoother. This idea extends to two (and more) variables (Scott, 1992). For bivariate data, MASS::kde2d() estimates the density on a square \(n \times n\) grid over the ranges of the variables.
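A minimal sketch of MASS::kde2d() on its own, before wrapping it in ggplot2 (the grid size is an arbitrary choice):

# Bivariate kernel density estimate of the bill dimensions on a 50 x 50 grid
dens <- MASS::kde2d(peng$bill_length, peng$bill_depth, n = 50)
# Base-R contour plot of the estimated density surface
contour(dens, xlab = "Bill length (mm)", ylab = "Bill depth (mm)")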
ggplot2 provides geom_density_2d(), which uses MASS::kde2d() and displays the estimate as contours— horizontal slices of the 3D surface at equally-spaced heights, projected onto the 2D plane. The ggdensity package (Otto & Kahle, 2023) extends this with geom_hdr(), computing the high density regions that bound given levels of probability and mapping these to the alpha transparency aesthetic. A method argument allows you to specify various nonparametric (method = "kde" is the default) and parametric (method = "mvnorm" gives normal data ellipses) ways to estimate the underlying bivariate distribution.
Figure 3.20 shows these side-by-side for comparison. With geom_density_2d() you can specify either the number of contour bins or the width of these bins (binwidth). For geom_hdr(), the probs argument gives a result that is easier to understand.
library(ggdensity)
library(patchwork)
p1 <- ggplot(peng,
aes(x = bill_length, y = bill_depth,
color = species)) +
geom_smooth(method = "lm", se=FALSE, linewidth=2) +
geom_density_2d(linewidth = 1.1, bins = 8) +
ggtitle("geom_density_2d") +
theme_bw(base_size = 14) +
theme_penguins() +
theme(legend.position = "inside",
legend.position.inside = c(0.85, 0.15))
p2 <- ggplot(peng,
aes(x = bill_length, y = bill_depth,
color = species, fill = species)) +
geom_smooth(method = "lm", se=FALSE, linewidth=2) +
geom_hdr(probs = c(0.95, 0.68, 0.4), show.legend = FALSE) +
ggtitle("ggdensity::geom_hdr") +
theme_bw(base_size = 14) +
theme_penguins() +
theme(legend.position = "none")
p1 + p2
3.2.5 Simpson’s paradox: marginal and conditional relationships
Because it provides a visual representation of means, variances, and correlations, the data ellipse is ideally suited as a tool for illustrating and explicating various phenomena that occur in the analysis of linear models. One class of simple, but important, examples concerns the difference between the marginal relationship between variables, ignoring some important factor or covariate, and the conditional relationship, adjusting (controlling) for that variable.
Simpson’s (1951) paradox occurs when the marginal and conditional relationships differ in direction. That is, the overall correlation in a model y ~ x might be negative, while the within-group correlations in separate models for each group, y[g] ~ x[g], might be positive, or vice versa.
This may be seen in the plots of bill length against bill depth for the penguin data shown in Figure 3.21. Ignoring penguin species, the marginal, total-sample correlation is slightly negative as seen in panel (a). The individual-sample ellipses in panel (b) show that the conditional, within-species correlations are all positive, with approximately equal regression slopes. However the group means have a negative relationship, accounting for the negative marginal correlation when species is ignored.
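This sign flip is easy to verify numerically; a quick sketch (the exact values are not reported in the text):

# Marginal correlation, ignoring species (slightly negative) ...
with(peng, cor(bill_length, bill_depth))
# ... versus the within-species correlations (all positive)
sapply(split(peng, peng$species),
       function(d) cor(d$bill_length, d$bill_depth))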
The regression line in panel (a) is that for the linear model lm(bill_depth ~ bill_length), while the separate lines in panel (b) are those for the model lm(bill_depth ~ bill_length * species), which allows a different slope and intercept for each species.
A correct analysis of the (conditional) relationship between these variables, controlling or adjusting for mean differences among species, is based on the pooled within-sample covariance matrix, a weighted average of the individual within-group \(\mathbf{S}_i\), \[ \mathbf{S}_{\textrm{within}} = \sum_{i=1}^g (n_i - 1) \mathbf{S}_i \, / \, (N - g) \:\: , \] where \(N = \sum n_i\). The result is shown in panel (c) of Figure 3.21.
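A direct way to compute \(\mathbf{S}_{\textrm{within}}\) for the two bill variables, following the formula above (a base-R sketch):

# Weighted sum of the within-species covariance matrices, divided by N - g
covs <- lapply(split(peng[, c("bill_length", "bill_depth")], peng$species),
               function(d) (nrow(d) - 1) * cov(d))
S_within <- Reduce(`+`, covs) / (nrow(peng) - length(covs))
S_within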
In this graph, the data for each species were first transformed to deviations from the species means on both variables and then translated back to the grand means. You can also see here that the shapes and sizes of the individual data ellipses are roughly comparable, but perhaps not identical. This visual idea of centering groups to a common mean will become important in Chapter 12 when we want to test the assumption of equality of error covariances in multivariate models.
The ggplot2 code for the panels in this figure is shown below. Note that for components that will be the same across panels, you can define elements (e.g., labels, theme_penguins(), legend_position) once, and then re-use them across several graphs.
labels <- labs(
x = "Bill length (mm)",
y = "Bill depth (mm)",
color = "Species",
shape = "Species",
fill = "Species")
plt1 <- ggplot(data = peng,
aes(x = bill_length,
y = bill_depth)) +
geom_point(size = 1.5) +
geom_smooth(method = "lm", formula = y ~ x,
se = TRUE, color = "gray50") +
stat_ellipse(level = 0.68, linewidth = 1.1) +
ggtitle("Ignoring species") +
labels
plt1
legend_position <-
theme(legend.position = "inside",
legend.position.inside = c(0.83, 0.16))
plt2 <- ggplot(data = peng,
aes(x = bill_length,
y = bill_depth,
color = species,
shape = species,
fill = species)) +
geom_point(size = 1.5,
alpha = 0.8) +
geom_smooth(method = "lm", formula = y ~ x,
se = TRUE, alpha = 0.3) +
stat_ellipse(level = 0.68, linewidth = 1.1) +
ggtitle("By species") +
labels +
theme_penguins("dark") +
legend_position
plt2
# center within groups, translate to grand means
means <- colMeans(peng[, 3:4])
peng.centered <- peng |>
group_by(species) |>
mutate(bill_length = means[1] + scale(bill_length, scale = FALSE),
bill_depth = means[2] + scale(bill_depth, scale = FALSE))
plt3 <- ggplot(data = peng.centered,
aes(x = bill_length,
y = bill_depth,
color = species,
shape = species,
fill = species)) +
geom_point(size = 1.5,
alpha = 0.8) +
geom_smooth(method = "lm", formula = y ~ x,
se = TRUE, alpha = 0.3) +
stat_ellipse(level = 0.68, linewidth = 1.1) +
labels +
ggtitle("Within species") +
theme_penguins("dark") +
legend_position
plt3
3.3 Scatterplot matrices
Going beyond bivariate scatterplots, a pairs plot (or scatterplot matrix) displays the scatterplots for all pairs of \(p\) variables in a \(p \times p\) matrix-like display, where the plot of variables \((x_i, x_j)\) is shown in row \(i\), column \(j\). This idea, due to Hartigan (1975b), uses small multiple plots, so that the eye can easily scan across a row or down a column to see how a given variable is related to all the others.
The most basic version is provided by pairs() in base R. When one variable is considered as an outcome or response, it is usually helpful to put it in the first row and column. For the Prestige data, in addition to income and education, we also have a measure of the percent of women in each occupational category.
Plotting these together gives Figure 3.22. In such plots, the diagonal cells give labels for the variables, but they are also a guide to interpreting what is shown. In each row, say row 2 for income, income is the vertical \(y\) variable in plots against the other variables. In each column, say column 3 for education, education is the horizontal \(x\) variable.
pairs(~ prestige + income + education + women,
data=Prestige)
The plots in the first row show what we have seen before for the relations between prestige and income and education, adding to those the plot of prestige vs. % women. Plots in the first column show the same data, but with \(x\) and \(y\) interchanged.
But this basic pairs() plot is very limited. A more feature-rich version is provided by car::scatterplotMatrix(), which can add the regression lines, loess smooths and data ellipses for each pair, as shown in Figure 3.23.
The diagonal panels show density curves for the distribution of each variable; for example, the distribution of education appears to be multi-modal, and that of women shows that most of the occupations have a low percentage of women.
The combination of the regression line with the loess smoothed curve, but without their confidence envelopes, provides about the right amount of detail to take in at a glance where the relations are non-linear. We’ve already seen (Figure 3.11) the non-linear relation between prestige and income (row 1, column 2) when occupational type is ignored. But all relations with income in column 2 are non-linear, reinforcing our idea (Section 3.2.3.1) that effects of income should be assessed on a log scale.
scatterplotMatrix(~ prestige + income + education + women,
data=Prestige,
regLine = list(method=lm, lty=1, lwd=2, col="black"),
smooth=list(smoother=loessLine, spread=FALSE,
lty.smooth=1, lwd.smooth=3, col.smooth="red"),
ellipse=list(levels=0.68, fill.alpha=0.1))
scatterplotMatrix() can also label points using the id = argument (though this can get messy) and can stratify the observations by a grouping variable with different symbols and colors. For example, Figure 3.24 uses the syntax ~ prestige + income + education + women | type to provide separate regression lines, smoothed curves and data ellipses for the three types of occupations. (The default colors are somewhat garish, so I use scales::hue_pal() to mimic the discrete color scale used in ggplot2.)
scatterplotMatrix(~ prestige + income + education + women | type,
data = Prestige,
col = scales::hue_pal()(3),
pch = 15:17,
smooth=list(smoother=loessLine, spread=FALSE,
lty.smooth=1, lwd.smooth=3, col.smooth="black"),
ellipse=list(levels=0.68, fill.alpha=0.1))
It is now easy to see why education is multi-modal: blue collar, white collar and professional occupations have largely non-overlapping years of education. As well, the distribution of % women is much higher in the white collar category.
For the penguins data, given what we’ve seen before in Figure 3.18 and Figure 3.19, we may wish to suppress details of the points (plot.points = FALSE) and loess smooths (smooth = FALSE) to focus attention on the similarity of the regression lines and data ellipses for the three penguin species. In Figure 3.25, I’ve chosen to show boxplots rather than density curves in the diagonal panels in order to highlight differences in the means and interquartile ranges of the species, and to show 68% and 95% data ellipses in the off-diagonal panels.
scatterplotMatrix(~ bill_length + bill_depth + flipper_length + body_mass | species,
data = peng,
col = peng.colors("medium"),
legend=FALSE,
ellipse = list(levels = c(0.68, 0.95),
fill.alpha = 0.1),
regLine = list(lwd=3),
diagonal = list(method = "boxplot"),
smooth = FALSE,
plot.points = FALSE,
cex.labels=1)
It can be seen that the species are widely separated in most of the bivariate plots. As well, the regression lines for species have similar slopes and the data ellipses have similar size and shape in most of the plots. From the boxplots, we can also see that Adelie penguins have shorter bill lengths than the others, while Gentoo penguins have smaller bill depth, but longer flippers and are heavier than Chinstrap and Adelie penguins.
Figure 3.25 provides a reasonably complete visual summary of the data in relation to multivariate models that ask “do the species differ in their means on these body size measures?” This corresponds to a MANOVA model, sketched below.
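A hedged sketch of that model: the four size measures form the multivariate response and species is the single factor (my illustration of the setup, not code from the figures above):

# Multivariate linear model: four responses, one factor
peng.mlm <- lm(cbind(bill_length, bill_depth, flipper_length, body_mass) ~ species,
               data = peng)
car::Anova(peng.mlm)   # MANOVA tests of species differences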
Hypothesis-error (HE) plots, described in Chapter 11, provide a better summary of the evidence for the MANOVA test of differences among means on all variables together. These give an \(\mathbf{H}\) ellipse reflecting the differences among means, to be compared with an \(\mathbf{E}\) ellipse reflecting within-group variation, and a visual test of significance.
A related question is “how well are the penguin species distinguished by these body size measures?” Here, the relevant model is linear discriminant analysis (LDA), where species plays the role of the response in the model:
peng.lda <- MASS::lda(species ~ bill_length + bill_depth + flipper_length + body_mass,
                      data = peng)
Both MANOVA and LDA depend on the assumption that the variances and correlations between the variables are the same for all groups. This assumption can be tested and visualized using the methods in Chapter 12.
3.3.1 Visual thinning
What can you do if there are even more variables than in these examples? If what you want is a high-level, zoomed-out display summarizing the pairwise relations more strongly, you can apply the idea of visual thinning to show only the most important features.
This example uses data on the rates of various crimes in the 50 U.S. states from the United States Statistical Abstracts, 1970, used by Hartigan (1975a) and Friendly (1991). These are ordered in the dataset roughly by seriousness of crime, from crimes of violence to property crimes.
data(crime, package = "ggbiplot")
str(crime)
#> 'data.frame': 50 obs. of 10 variables:
#> $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
#> $ murder : num 14.2 10.8 9.5 8.8 11.5 6.3 4.2 6 10.2 11.7 ...
#> $ rape : num 25.2 51.6 34.2 27.6 49.4 42 16.8 24.9 39.6 31.1 ...
#> $ robbery : num 96.8 96.8 138.2 83.2 287 ...
#> $ assault : num 278 284 312 203 358 ...
#> $ burglary: num 1136 1332 2346 973 2139 ...
#> $ larceny : num 1882 3370 4467 1862 3500 ...
#> $ auto : num 281 753 440 183 664 ...
#> $ st : chr "AL" "AK" "AZ" "AR" ...
#> $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
Figure 3.26 displays the scatterplot matrix for these seven variables, using only the regression line and data ellipse to show the linear relation, and the loess smooth to show potential non-linearity. To make this even more schematic, the axis tick marks and labels are also removed using the par() settings xaxt = "n", yaxt = "n".
crime |>
select(where(is.numeric)) |>
scatterplotMatrix(
plot.points = FALSE,
ellipse = list(levels = 0.68, fill=FALSE),
smooth = list(spread = FALSE,
lwd.smooth=2, lty.smooth = 1, col.smooth = "red"),
cex.labels = 2,
xaxt = "n", yaxt = "n")
We can see that all pairwise correlations are positive, pairs closer to the main diagonal tend to be more highly correlated, and in most cases the nonparametric smooth doesn’t differ much from the linear regression line. Exceptions to this appear mainly in the columns for robbery and auto (auto theft).
3.4 Corrgrams
What if you want to summarize the data even further than simple visual thinning? For example, with many variables you might want to show only the value of the correlation for each pair of variables, but do so in a way that helps you see patterns in the correlations that would be invisible in a mere table.
A corrgram (Friendly, 2002) is a visual display of a correlation matrix, where the correlation can be rendered in a variety of ways to show the direction and magnitude: circular “pac-man” (or pie) symbols, ellipses, colored bars or shaded rectangles, as shown in Figure 3.27.
Another aspect is that of effect ordering (Friendly & Kwan, 2003), ordering the levels of factors and variables in graphic displays to make important features most apparent. For variables, this means that we can arrange the variables in a matrix-like display in such a way as to make the pattern of relationships easiest to see. Methods to achieve this include using principal components and cluster analysis to put the most related variables together as described in Chapter 4.
In R, these diagrams can be created using the corrgram (Wright, 2021) and corrplot (Wei & Simko, 2024) packages, which have different features. corrgram::corrgram() is closest to Friendly (2002), in that it allows different rendering functions for the lower, upper and diagonal panels, as illustrated in Figure 3.27. For example, a corrgram similar to Figure 3.26 can be produced as follows (the result is not shown here).
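A sketch of such a call, using the panel functions exported by the corrgram package (the exact options used for Figure 3.27 are not shown in this extract, so treat the choices below as illustrative):

library(corrgram)
corrgram(crime[, 2:8],                  # the seven numeric crime variables
         order = FALSE,                 # keep the dataset order, as in Figure 3.26
         lower.panel = panel.ellipse,   # data ellipse + smooth below the diagonal
         upper.panel = panel.pie,       # "pac-man" pies above
         diag.panel  = panel.density,   # univariate densities on the diagonal
         text.panel  = panel.txt)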
With the corrplot package, corrplot() provides the rendering methods c("circle", "square", "ellipse", "number", "shade", "color", "pie"), but only one can be used at a time. The function corrplot.mixed() allows different options to be selected for the lower and upper triangles. The iconic rendering shape is colored with a gradient in relation to the correlation value. For comparison, Figure 3.28 uses ellipses below the diagonal and filled pie charts above the diagonal, using a gradient of the fill color in both cases.
crime.cor <- crime |>
dplyr::select(where(is.numeric)) |>
cor()
corrplot.mixed(crime.cor,
lower = "ellipse",
upper = "pie",
tl.col = "black",
tl.srt = 0,
tl.cex = 1.25,
addCoef.col = "black",
addCoefasPercent = TRUE)
The combination of renderings shown in Figure 3.28 is instructive. Small differences among correlation values are easier to see with the pie symbols than with the ellipses; for example, compare the values for murder with larceny and auto theft in row 1, columns 6-7 with those in column 1, rows 6-7, where the former are easier to distinguish. The shading color adds another visual cue.
The variables in Figure 3.26 and Figure 3.28 are arranged by their order in the dataset, which is not often the most useful. A better idea is to arrange the variables so that the most highly correlated variables are adjacent.
A general method described in Section 4.5 orders the variables according to the angles of the first two eigenvectors from a principal components analysis (PCA) around a unit circle. The function corrMatOrder()
provides several methods (order = c("AOE", "FPC", "hclust", "alphabet")
) for doing this, and PCA ordering is order = "AOE"
. Murder and auto theft are still first and last, but some of the intermediate crimes have been rearranged.
ord <- corrMatOrder(crime.cor, order = "AOE")
rownames(crime.cor)[ord]
#> [1] "murder" "assault" "rape" "robbery" "burglary"
#> [6] "larceny" "auto"
Using this ordering in corrplot()
produces Figure 3.29.
corrplot.mixed(crime.cor,
order = "AOE",
lower = "ellipse",
upper = "ellipse",
tl.col = "black",
tl.srt = 0,
tl.cex = 1.25,
addCoef.col = "black",
addCoefasPercent = TRUE)
In this case, where the correlations among the crime variables are all positive, the effect of variable re-ordering is subtle, but note that there is now a somewhat more pronounced pattern: the highest correlations appear near the diagonal and decrease away from it. Figure 4.27 and Figure 4.29 in Section 4.5 provide a more dramatic example of variable ordering using this method.
Variations of corrgrams are worthy replacements for a numeric table of correlations, which is often presented in publications only for archival value. Including the numeric values (rounded here for presentation purposes) makes this an attractive alternative to a boring table of correlations.
3.5 Generalized pairs plots
When a dataset contains one or more discrete variables, the traditional pairs plot cannot cope: the discrete categories plot as many overlaid points, which cannot be represented meaningfully in a scatterplot using only color and/or point symbols.
But the associations between categorical variables in a frequency table can be shown in mosaic displays (Friendly, 1994), using an array of tiles whose areas depict the cell frequencies. For an \(n\)-way frequency table, an analog of the scatterplot matrix uses mosaic plots for each pair of variables. The vcd package (Meyer et al., 2024) implements very general pairs() methods for "table" objects, and vcdExtra (Friendly, 2023) extends this to wide classes of loglinear models (Friendly, 1999). See Friendly (1999) and my book Discrete Data Analysis with R (Friendly & Meyer, 2016) for more on mosaic plots and mosaic matrices.
For example, we can tabulate the distributions of penguin species by sex and the island where they were observed using xtabs()
. ftable()
prints this three-way table more compactly. (In this example, and what follows in the chapter, I’ve changed the labels for sex from (“f”, “m”) to (“Female”, “Male”)).
# use better labels for sex
peng <- peng |>
mutate(sex = factor(sex, labels = c("Female", "Male")))
peng.table <- xtabs(~ species + sex + island, data = peng)
ftable(peng.table)
#> island Biscoe Dream Torgersen
#> species sex
#> Adelie Female 22 27 24
#> Male 22 28 23
#> Chinstrap Female 0 34 0
#> Male 0 34 0
#> Gentoo Female 58 0 0
#> Male 61 0 0
We can see immediately that the penguin species differ by island: only Adelie were observed on all three islands; Biscoe Island had no Chinstraps and Dream Island had no Gentoos.
vcd::pairs() produces all pairwise mosaic plots, as shown in Figure 3.30. The diagonal panels show the one-way frequencies by the width of the divided bars. Each off-diagonal panel shows the bivariate counts, breaking down each column variable by splitting the bars in proportion to a second variable. Consequently, the frequency of each cell is represented by its area. The purpose is to show the pattern of association between each pair, so the tiles in the mosaic are shaded according to the signed standardized residual, \(d_{ij} = (n_{ij} - \hat{n}_{ij}) / \sqrt{\hat{n}_{ij}}\), in a simple \(\chi^2 = \Sigma_{ij} \; d_{ij}^2\) test for association: blue where the observed frequency \(n_{ij}\) is significantly greater than expected \(\hat{n}_{ij}\) under independence, and red where it is less than expected. The tiles are unshaded when \(| d_{ij} | < 2\).
library(vcd)
pairs(peng.table, shade = TRUE,
lower_panel_args = list(labeling = labeling_values()),
upper_panel_args = list(labeling = labeling_values()))
The shading patterns in cells (1,3) and (3,1) of Figure 3.30 show what we’ve seen before in the table of frequencies: The distribution of species varies across island because on each island one or more species did not occur. Row 2 and column 2 show that sex is nearly exactly proportional among species and islands, indicating independence, \(\text{sex} \perp \{\text{species}, \text{island}\}\). More importantly, mosaic pairs plots can show, at a glance, all (bivariate) associations among multivariate categorical variables.
The next step, taken by John Emerson and others (Emerson et al., 2013), was to recognize that combinations of continuous and discrete (categorical) variables could be plotted in different ways:
- Two continuous variables can be shown as a standard scatterplot of points and/or bivariate density contours, or simply by numeric summaries such as a correlation value;
- A pair of one continuous and one categorical variable can be shown as side-by-side boxplots or violin plots, histograms or density plots;
- Two categorical variables could be shown in a mosaic plot or by grouped bar plots.
In the ggplot2 framework, these displays are implemented using the ggpairs()
function from the GGally package (Schloerke et al., 2024). This allows different plot types to be shown in the lower and upper triangles and in the diagonal cells of the plot matrix. As well, aesthetics such as color and shape can be used within the plots to distinguish groups directly. As illustrated below, you can define custom functions to control exactly what is plotted in any panel.
The basic, default plot shows scatterplots for pairs of continuous variables in the lower triangle and the values of correlations in the upper triangle. A combination of a discrete and a continuous variable is plotted as histograms in the lower triangle and boxplots in the upper triangle. Figure 3.31 includes sex to illustrate these combinations.
Code
ggpairs(peng, columns=c(3:6, 7),
aes(color=species, alpha=0.5),
progress = FALSE) +
theme_penguins() +
theme(axis.text.x = element_text(angle = -45))
To my eye, printing the values of correlations in the upper triangle is often a waste of graphic space. But in this example the correlations show something peculiar and interesting if you look closely: in all pairs among the penguin size measurements, the correlations within each species are positive, as we saw in Figure 3.25. Yet, in three of these panels, the overall correlation ignoring species is negative. For example, the overall correlation between bill depth and flipper length is \(r = -0.579\) in row 2, column 3; the scatterplot in the diagonally opposite cell, row 3, column 2, shows the data. These cases, where the sign of the overall correlation (ignoring a grouping variable) differs from the signs of the within-group correlations, are examples of Simpson's Paradox, explored later in Chapter XX.
The last row and column, for sex
in Figure 3.31, provides an initial glance at the issue of sex differences among penguin species that motivated the collection of these data. We can go further by also examining differences among species and island, but first we need to understand how to display exactly what we want for each pairwise plot.
ggpairs() is extremely general in that for each of the lower, upper and diag sections you can assign any of a large number of built-in functions (of the form ggally_NAME), or your own custom function for what is plotted, depending on the types of variables in each panel:
- continuous: both X and Y are continuous variables; supply this as the NAME part of a ggally_NAME() function or the name of a custom function;
- combo: one of the X and Y variables is discrete while the other is continuous, using the same convention;
- discrete: both X and Y are discrete variables.
The defaults, which were used in Figure 3.31, are:
upper = list(continuous = "cor", # correlation values
combo = "box_no_facet", # boxplots
discrete = "count") # rectangles ~ count
lower = list(continuous = "points", # just data points
combo = "facethist", # faceted histograms
discrete = "facetbar") # faceted bar plots
diag = list(continuous = "densityDiag", # density plots
discrete = "barDiag") # bar plots
Thus, ggpairs() uses ggally_cor() to print the correlation values for pairs of continuous variables in the upper triangle, and uses ggally_points() to plot scatterplots of points in the lower portion. The diagonal panels are shown as density plots (ggally_densityDiag()) for continuous variables but as bar plots (ggally_barDiag()) for discrete factors.
See the vignette ggally_plots for an illustrated list of available high-level plots. For our purpose here, which is to illustrate enhanced displays, note that for scatterplots of continuous variables there are two functions that plot the points and also add a smoother, _lm or _loess:
ls(getNamespace("GGally")) |>
stringr::str_subset("^ggally_smooth_")
#> [1] "ggally_smooth_lm" "ggally_smooth_loess"
A customized display for scatterplots of continuous variables can be any function that takes data
and mapping
arguments and returns a "ggplot"
object. The mapping
argument supplies the aesthetics, e.g., aes(color=species, alpha=0.5)
, but only if you wish to override what is supplied in the ggpairs()
call.
Here is a function, my_panel()
that plots the data points, regression line and loess smooth:
my_panel <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=lm, formula = y ~ x, se = FALSE, ...) +
geom_smooth(method=loess, formula = y ~ x, se = FALSE, ...)
p
}
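For example, this panel function could be supplied for the continuous panels in a call like the following (a hypothetical call, not shown as a figure):
# Hypothetical use of my_panel() for the lower-triangle scatterplots
ggpairs(peng, columns = 3:6,
        aes(color = species, alpha = 0.5),
        lower = list(continuous = my_panel),
        progress = FALSE)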
For this example, I want only simple summaries for the scatterplots, so I don't want to plot the data points, but I do want to add the regression line and a data ellipse.
my_panel1 <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_smooth(method=lm, formula = y ~ x, se = FALSE, ...) +
stat_ellipse(geom = "polygon", level = 0.68, ...)
p
}
Then, to show what can be done, Figure 3.32 uses my_panel1() for the scatterplots in the 4 x 4 block of plots in the upper left. The combinations of the continuous body size measures and the discrete factors species, island and sex are shown in the upper triangle by boxplots, but by faceted histograms in the lower portion. The factors are shown as rectangles with area proportional to count (poor-man's mosaic plots) above the diagonal and as faceted bar plots below.
Code
ggpairs(peng, columns=c(3:6, 1, 2, 7),
mapping = aes(color=species, fill = species, alpha=0.2),
lower = list(continuous = my_panel1),
upper = list(continuous = my_panel1),
progress = FALSE) +
theme_penguins() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(axis.text.x = element_text(angle = -45))
There is certainly a lot going on in Figure 3.32, but it does show a high-level overview of all the variables (except year) in the penguins dataset. It is probably easiest to "read" this figure by focusing on the four blocks for the combinations of 4 continuous and 3 categorical measures. In the upper left block, visual thinning of the scatterplots, showing only the data ellipses and regression lines, gives a simple view, as it did in Figure 3.25.
3.6 Parallel coordinate plots
As we have seen above, scatterplot matrices and generalized pairs plots extend data visualization to multivariate data, but these variables share one 2D space, so resolution decreases as the number of variables increases. You need a very large screen or sheet of paper to see more than, say, 5-6 variables with any clarity.
Parallel coordinate plots are an attractive alternative, with which we can visualize an arbitrary number of variables to get a visual summary of a potentially high-dimensional dataset, and perhaps recognize outliers and clusters in the data in a different way. In these plots, each variable is shown on a separate, parallel axis. A multivariate observation is then plotted by connecting its values on the successive axes with line segments.
The geometry of parallel coordinates is interesting, because what is a point in \(n\)-dimensional (Euclidean) data space becomes a line in the projective parallel coordinate space with \(n\) axes, and vice-versa: lines in parallel coordinate space correspond to points in data space. Thus, a collection of points lying on a line in data space maps to a set of lines that intersect in a common point in projective space. What this does is to map \(n\)-dimensional relations into 2D patterns we can see in a parallel coordinates plot.
Those who don't know history are doomed to plagiarize it —The author
The mathematics of parallel coordinates originated with the French mathematician Maurice d'Ocagne (1885), who sought a way to provide graphic calculation of mathematical functions with alignment diagrams or nomograms using parallel axes with different scales. A three-variable equation, for example, could be solved using three parallel axes, where known values could be marked on their scales, a line drawn between them, and the unknown read from its scale at the point where the line intersects that scale.
Henry Gannett (1880), in work preceding the Statistical Atlas of the United States for the 1890 Census (Gannett, 1898), is widely credited with being the first to use parallel coordinate plots to show data, in his case, to show the rank ordering of US states by 10 measures including population, occupations, wealth, manufacturing, agriculture and so on.
However, both d'Ocagne and Gannett were far preceded in this by André-Michel Guerry (1833), who used this method to show how the rank order of various crimes changed with the age of the accused. See Friendly (2022), Figure 7, for his version and for an appreciation of the remarkable contributions of this amateur statistician to the history of data visualization.
The use of parallel coordinates for the display of multidimensional data was rediscovered by Alfred Inselberg (1985) and extended by Edward Wegman (1990), neither of whom recognized the earlier history. Somewhat earlier, David Andrews (1972) proposed mapping multivariate observations to smooth Fourier functions composed of alternating \(\sin()\) and \(\cos()\) terms. And in my book, SAS System for Statistical Graphics (Friendly, 1991), I implemented what I called profile plots without knowing their earlier history as parallel coordinate plots.
Parallel coordinate plots present a challenge for graphic developers, in that they require a different way to think about plot construction for multiple variables, which can be quantitative, as in the original idea, or categorical factors, all to be shown along parallel axes.
Here, I use the ggpcp package (Hofmann et al., 2022), best described in VanderPlas et al. (2023), who also review the modern history.4 This takes some getting used to, because they develop pcp_*()
extensions of the ggplot2
grammar of graphics framework to allow:
- pcp_select(): selection of the variables to be plotted and their horizontal order on parallel axes;
- pcp_scale(): methods for scaling the variables to each axis;
- pcp_arrange(): methods for breaking ties in factor variables to space them out.
Then, it provides geom_pcp_*()
functions to control the display of axes with appropriate aesthetics, labels for categorical factors and so forth. Figure 3.33 illustrates this type of display, using sex and species in addition to the quantitative variables for the penguin data.
Code
peng |>
pcp_select(bill_length:body_mass, sex, species) |>
pcp_scale(method = "uniminmax") |>
pcp_arrange() |>
ggplot(aes_pcp()) +
geom_pcp_axes() +
geom_pcp(aes(colour = species), alpha = 0.8, overplot = "none") +
geom_pcp_labels() +
scale_colour_manual(values = peng.colors()) +
labs(x = "", y = "") +
theme(axis.title.y = element_blank(), axis.text.y = element_blank(),
axis.ticks.y = element_blank(), legend.position = "none")
Rearranging the order of variables and the ordering of factor levels can make a difference in what we can see in such plots. For a simple example (following VanderPlas et al. (2023)), we reorder the levels of species and islands to make it clearer which species occur on each island.
Code
peng1 <- peng |>
mutate(species = factor(species, levels = c("Chinstrap", "Adelie", "Gentoo"))) |>
mutate(island = factor(island, levels = c("Dream", "Torgersen", "Biscoe")))
peng1 |>
pcp_select(species, island, bill_length:body_mass) |>
pcp_scale() |>
pcp_arrange(method = "from-left") |>
ggplot(aes_pcp()) +
geom_pcp_axes() +
geom_pcp(aes(colour = species), alpha = 0.6, overplot = "none") +
geom_pcp_boxes(fill = "white", alpha = 0.5) +
geom_pcp_labels() +
scale_colour_manual(values = peng.colors()[c(2,1,3)]) +
theme_bw() +
labs(x = "", y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "none")
The order of variables in this plot emphasizes the relation between penguin species and the island where they were observed, and then shows the values of the quantitative body size measurements. More generally, quantitative variables can, and probably should, be ordered to place the most highly correlated variables adjacent, to minimize the degree of crossing lines from one variable to the next (Martí & Laguna, 2003). When variables are highly negatively correlated (such as bill_depth and flipper_length here), crossings can be reduced simply by reversing the scale of one of the variables, e.g., by plotting -bill_depth.
3.7 Animated tours
In the mid 17\(^{th}\) to early 19\(^{th}\)-century the Grand Tour became a coming-of-age custom for young Europeans (mainly British nobility and landed gentry) of sufficient rank and means to undertake a journey to the principal sites of Europe (Paris, Geneva, Rome, Athens, …) to complete their education by learning something of the cultural legacies in history, art, and music from antiquity to the Renaissance. Thereby, they could gain a wider appreciation of history and be prepared to play a role in polite society or in their chosen endeavors.
Travels in high-dimensional data space might be less thrilling than a journey from London through Paris and Milan to Rome. Yet, in both cases it is useful to think of the path taken, and what might be seen along the way. But there are different kinds of tours. We might simply take a meandering tour, exploring all the way, or plan a tour to see the most interesting sites, or have a tour guided by an expert. Similarly in data space, we might travel randomly to see what we can find, or be guided to find interesting features such as clusters, outliers or non-linear relations in data.
Following the demonstration in PRIM-9 (Section 3.1) of exploring multidimensional data space by rotation, Asimov (1985) developed the idea of the grand tour, a computer method for viewing multivariate statistical data via orthogonal projections onto an animated sequence of low-dimensional subspaces, like a movie. In contrast to a scatterplot matrix, which shows a static view of a data cloud projected onto all pairwise variable axes, a statistical tour is like the view of an eye moving smoothly in high-dimensional space, capturing what it sees from a given location onto the 2D plane of the computer screen.
More generally, statistical tours are a type of dynamic projection onto orthogonal axes (called a basis) that embeds data in a \(p\)-dimensional space into a \(d\)-dimensional viewing subspace. Typically, \(d=2\), and the result is displayed as scatterplots, together with vectors representing the projections of the data variables in this space. But the projected data can also be rendered in 1D as densities or histograms, or in other numbers of dimensions as glyphs, or even as parallel coordinate plots. The essential idea is that we can define, and animate, a tour path as a smooth sequence of such projections over small changes to the projection basis, which gives the orientation of the data in the viewing space.
3.7.1 Projections
The idea of a projection is fundamental to touring methods and other visualizations of high-D data, so it is useful to understand what a projection is. Quite simply, you can think of a projection as the shadow of an object or cloud of points. This is nicely illustrated by the cover image (Figure 3.34) of Douglas Hofstadter's (1979) Gödel, Escher, Bach, which shows 3D solid shapes illuminated by light sources so that their shadows form the letters G, E and B projected onto the planes formed by pairs of the three coordinate axes. The set of three 2D views is essentially the same as what we see in a scatterplot matrix, where a 3D dataset is portrayed by the set of shadows of the points on planes formed by pairs of coordinate axes.
In the simplest case, a data point \(\mathbf{x} = (x_1, x_2)\) in two dimensions can be represented geometrically as a vector from the origin as shown in Figure 3.35. This point can be projected on any one-dimensional axis \(\mathbf{p}\) by dropping a line perpendicular to \(\mathbf{p}\), which is the idea of a shadow. Mathematically, this is calculated as the product \(\mathbf{x}^\mathsf{T} \mathbf{p} = x_1 p_1 + x_2 p_2\) and suitably normalized to give the correct length. …
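As a tiny numerical illustration of this idea (the values here are arbitrary), the coordinate of the shadow of \(\mathbf{x}\) along a unit-length axis \(\mathbf{p}\) is just this inner product:
x <- c(3, 2)            # a data point in 2D
p <- c(1, 1) / sqrt(2)  # a unit-length projection axis
sum(x * p)              # x'p : the coordinate of the shadow of x along p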
More generally, a projection of an \((n \times p)\) data matrix \(\mathbf{X}\) representing \(n\) observations in \(p\) dimensions onto a \(d\)-dimensional viewing space \(\mathbf{Y}_{n \times d}\) is represented by a \(p \times d\) projection matrix \(\mathbf{P}\) as \(\mathbf{Y} = \mathbf{X} \mathbf{P}\), where the columns of \(\mathbf{P}\) are orthogonal and of unit length, i.e., \(\mathbf{P}^\mathsf{T} \mathbf{P} = \mathbf{I}_{(d \times d)}\).
For example, to project a data matrix \(\mathbf{X}\) in three dimensions onto a 2D plane, we would multiply it by a \((3 \times 2)\) orthonormal matrix \(\mathbf{P}\). The matrix \(\mathbf{P}_1\) below simply selects the first two columns of \(\mathbf{X}\).5
\[ \mathbf{X} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 10 \\ 0 & 10 & 0 \\ 0 & 10 & 10 \\ 10 & 0 & 0 \\ 10 & 0 & 10 \\ 10 & 10 & 0 \\ 10 & 10 & 10 \\ \end{bmatrix}_{8 \times 3} ;\; \mathbf{P_1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ \end{bmatrix}_{3 \times 2} \;\Rightarrow\quad \mathbf{Y} = \mathbf{X} \; \mathbf{P_1} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 10 \\ 0 & 10 \\ 10 & 0 \\ 10 & 0 \\ 10 & 10 \\ 10 & 10 \\ \end{bmatrix}_{8 \times 2} \] An oblique projection using all three dimensions is given by \(\mathbf{P_2}\) below. This produces a new 2D view in \(\mathbf{Y}\): \[ \mathbf{P_2} = \begin{bmatrix} 0.71 & -0.42 \\ 0.71 & 0.42 \\ 0 & 0.84 \\ \end{bmatrix}_{3 \times 2} \quad\Rightarrow\quad \mathbf{Y} = \mathbf{X} \; \mathbf{P_2} = \begin{bmatrix} 0 & 0 \\ 0 & 8.4 \\ 7.1 & 4.2 \\ 7.1 & 12.6 \\ 7.1 & -4.2 \\ 7.1 & 4.2 \\ 14.2 & 0 \\ 14.2 & 8.4 \\ \end{bmatrix} \]
The columns in \(\mathbf{Y}\) are simply the linear combinations of those of \(\mathbf{X}\) using the weights in each column of \(\mathbf{P_2}\)
\[\begin{aligned} \mathbf{y}_1 & = & 0.71 \mathbf{x}_1 + 0.71 \mathbf{x}_2 + 0 \mathbf{x}_3\\ \mathbf{y}_2 & = & -0.42 \mathbf{x}_1 + 0.42 \mathbf{x}_2 + 0.84 \mathbf{x}_3 \\ \end{aligned}\]
In this example, the matrix \(\mathbf{X}\) consists of 8 points at the vertices of a cube of size 10, as shown in Figure 3.36 (a). The projections \(\mathbf{Y}_1 = \mathbf{X} \mathbf{P}_1\) and \(\mathbf{Y}_2 = \mathbf{X} \mathbf{P}_2\) are shown in panels (b) and (c). To make it easier to relate the points in different views, shapes and colors are assigned so that each point has a unique combination of these attributes.6
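The matrices themselves are not constructed in the text above. Here is one way to build them in R (a sketch; the row order of X matches the data frame printed below):
# Vertices of a cube of size 10 and the two projection matrices
X  <- as.matrix(expand.grid(x1 = c(0, 10), x2 = c(0, 10), x3 = c(0, 10)))
P1 <- matrix(c(1, 0,
               0, 1,
               0, 0), nrow = 3, byrow = TRUE)
P2 <- matrix(c(0.71, -0.42,
               0.71,  0.42,
               0.00,  0.84), nrow = 3, byrow = TRUE)
Y1 <- X %*% P1    # the projection shown in Figure 3.36 (b)
Y2 <- X %*% P2    # the projection shown in Figure 3.36 (c)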
pch <- rep(15:18, times = 2)
colors <- c("red", "blue", "darkgreen", "brown")
col <- rep(colors, each = 2)
data.frame(X, pch, col)
#> x1 x2 x3 pch col
#> 1 0 0 0 15 red
#> 2 10 0 0 16 red
#> 3 0 10 0 17 blue
#> 4 10 10 0 18 blue
#> 5 0 0 10 15 darkgreen
#> 6 10 0 10 16 darkgreen
#> 7 0 10 10 17 brown
#> 8 10 10 10 18 brown
But, if we are traveling in the projection space of \(\mathbf{Y}\), we need some signposts to tell us how the new dimensions relate to those of \(\mathbf{X}\). The answer is provided simply by plotting the rows of \(\mathbf{P}\) as vectors, as shown in Figure 3.37. In these plots, each row of \(\mathbf{P}_1\) and \(\mathbf{P}_2\) appears as a vector from the origin. Its direction shows the contribution each of \(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\) makes to the new coordinates \(\mathbf{y}_1\) and \(\mathbf{y}_2\).
In \(\mathbf{P}_1\), the projected variable \(\mathbf{y}_1\) is related only to \(\mathbf{x}_1\), while \(\mathbf{y}_2\) is related only to \(\mathbf{x}_2\); \(\mathbf{x}_3\) makes no contribution, and appears at the origin. However, in the projection given by \(\mathbf{P}_2\), \(\mathbf{x}_1\) and \(\mathbf{x}_2\) make the same contribution to \(\mathbf{y}_1\), while \(\mathbf{x}_3\) contributes nothing to that horizontal axis. The vertical axis, \(\mathbf{y}_2\), is completely aligned with \(\mathbf{x}_3\); \(\mathbf{x}_1\) and \(\mathbf{x}_2\) have vertical components that are half of that for \(\mathbf{x}_3\) in absolute value.
Code
library(matlib)
op <- par(mar=c(4, 5, 1, 1)+.1)
xlim <- ylim <- c(-1.1, 1.1)
axes.x <- c(-1, 1, NA, 0, 0)
axes.y <- c(0, 0, NA, -1, 1)
labs <- c(expression(x[1]), expression(x[2]), expression(x[3]))
plot(xlim, ylim, type = "n", asp=1,
xlab = expression(y[1]), ylab = expression(y[2]),
cex.lab = 1.8)
circle(0, 0, 1, col = adjustcolor("skyblue", alpha = 0.2))
lines(axes.x, axes.y, col = "grey")
vectors(P1, labels = labs, cex.lab = 1.8, lwd = 3, pos.lab = c(4, 2, 1))
plot(xlim, ylim, type = "n", asp=1,
xlab = expression(y[1]), ylab = expression(y[2]),
cex.lab = 1.8)
circle(0, 0, 1, col = adjustcolor("skyblue", alpha = 0.2))
lines(axes.x, axes.y, col = "grey")
vectors(P2, labels = labs, cex.lab = 1.8, lwd = 3)
par(op)
3.7.1.1 Vector lengths
In Figure 3.37, the lengths of the \(\mathbf{x}\) vectors reflect the relative degree to which each variable is represented in the space of the projection, and this is important for interpretation. For the \(\mathbf{P}_1\) projection, \(\mathbf{x}_3\) has length 0, while \(\mathbf{x}_1\) and \(\mathbf{x}_2\) have length 1 and extend to the unit circle. In the projection given by \(\mathbf{P}_2\), all three \(\mathbf{x}\) vectors are of approximately the same length.
In algebra, the length of a vector \(\mathbf{x}\) is \(||\mathbf{x}|| = (\mathbf{x}^\mathsf{T} \mathbf{x})^{1/2} = \sqrt{\Sigma x_i^2}\), the Euclidean distance of the tip of the vector from the origin. In R, we can calculate the lengths of the row vectors in a projection matrix by transposing it and using matlib::len().
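For the projection matrices used here, a minimal check:
# Row-vector lengths: transpose so that rows become columns for len()
library(matlib)
len(t(P1))    # 1, 1, 0 : x3 is not represented at all in P1
len(t(P2))    # roughly 0.82, 0.82, 0.84 : all three about equally represented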
3.7.1.2 Joint-views
To interpret such projections, we want to see both the projected data and the signposts that tell us where we are in relation to the original variables. To do this, we can overlay the variable vectors represented by the rows of the projection matrix \(\mathbf{P}\) onto plots like Figure 3.36 (b) and Figure 3.36 (c) to see how the axes in a projection relate to those in the data. To place these together on the same plot, we can either center the columns of \(\mathbf{Y}\) at their means or shift the columns of \(\mathbf{P}\) to colMeans(Y). It is only the directions of the vectors that matter, so we are free to scale their lengths by any convenient factor.
Code
Y2s <- scale(Y2, scale=FALSE) # center Y2
plot(Y2s, cex = 3,
asp = 1,
pch = pch, col = col,
xlab = expression(y[1]), ylab = expression(y[2]),
xlim = c(-10, 10), ylim = c(-10, 10), cex.lab = 1.8)
r <- 7
vecs <- (r*diag(3) %*% P2)
vectors(vecs, labels = labs, cex.lab = 1.8, lwd = 2)
vectors(-vecs, labels = NULL, lty = 1, angle = 1, col = "gray")
The plot in Figure 3.38 illustrates this, centering \(\mathbf{Y}\), and multiplying the vectors in \(\mathbf{P}\) by 7. To check your understanding, try to see if you can relate what is shown in this plot to the 3D plot in Figure 3.36 (a).
The idea of viewing low-dimensional projections of data together with vectors representing the contributions of the original variables to the dimensions shown in a display is also the basis of biplot techniques (Section 4.3) we will use in relation to principal components analysis.
3.7.2 Touring methods
The trick of statistical touring methods is to generate a smooth sequence of interpolated projections \(\mathbf{P}_{(t)}\) indexed by time \(t\), \(\mathbf{P}_{(1)}, \mathbf{P}_{(2)}, \mathbf{P}_{(3)}, \dots, \mathbf{P}_{(T)}\). This gives a path of views \(\mathbf{Y}_{(t)} = \mathbf{X} \mathbf{P}_{(t)}\), that can be animated in successive frames, as shown schematically in Figure 3.39.
Asimov’s (1985) original idea of the grand tour was that of a random path, picking orthogonal projections \(\mathbf{P}_{(i)}\) at random. Given enough time, the grand tour gives a space-filling path and would eventually show every possible projection of the data. But it does so smoothly, by interpolating from one projection to the next. In the travel analogy, the path by road from London to Paris might go smoothly through Kent to Dover, thence via Amiens and Beauvais before reaching Paris. By air, the tour would follow a smoother geodesic path, and this is what the grand tour does. The sense in watching an animation of a statistical grand tour is that of continuous motion. The grand tour algorithm is described in detail by Buja et al. (2005) and Cook et al. (2008).
3.7.2.1 Guided tours
The next big idea was that, rather than traveling randomly in projection space, one could take a guided tour, following a path that leads to "interesting projections", such as those that reveal clusters, gaps in data space or outliers. This idea, called projection pursuit (Cook et al., 1995), works by defining a measure of interestingness of a data projection. In a guided tour, the next projection is chosen to increase that index, so over time the projection moves toward one that maximizes that index.
In the time since Asimov (1985), there have been many implementations of touring visualization methods. XGobi (Swayne et al., 1998), for X-Windows displays on Linux systems, provided a test-bed for dynamic, interactive graphic methods; its successor, GGobi (Cook & Swayne, 2007; Swayne et al., 2003), extended the range of touring methods to include a wider variety of projection pursuit indices.
3.7.2.2 The tourr package
The current state of art is best captured in the tourr package for R (Wickham et al., 2011; Wickham & Cook, 2024). It defines a tour to consist of three components:
- data: An \((n \times p)\) numerical data matrix to be viewed.
- path: A tour path function that produces a smoothed sequence of projection matrices \(\mathbf{P}_{(p \times d)}\) in \(d\) dimensions, for example grand_tour(d = 2) or guided_tour(index = holes).
- display: A function that renders the projected data, for example display_xy() for a scatterplot, display_depth() for a 3D plot with simulated depth, or display_pcp() for a parallel coordinates plot.
This very nicely separates the aspects of a tour, and allows one to think of and define new tour path methods and display methods. The package defines two general tour functions: animate() produces a real-time animation on a display device, and render() saves image frames to disk, such as a .gif file.
The tourr package provides a wide range of tour path methods and display methods:
# tour path methods
grep("_tour$", lsf.str("package:tourr"), value = TRUE)
#> [1] "dependence_tour" "frozen_guided_tour"
#> [3] "frozen_tour" "grand_tour"
#> [5] "guided_anomaly_tour" "guided_section_tour"
#> [7] "guided_tour" "little_tour"
#> [9] "local_tour" "new_tour"
#> [11] "planned_tour" "planned2_tour"
#> [13] "radial_tour"
# display methods
grep("display_", lsf.str("package:tourr"), value = TRUE)
#> [1] "display_andrews" "display_density2d" "display_depth"
#> [4] "display_dist" "display_faces" "display_groupxy"
#> [7] "display_idx" "display_image" "display_pca"
#> [10] "display_pcp" "display_sage" "display_scatmat"
#> [13] "display_slice" "display_stars" "display_stereo"
#> [16] "display_trails" "display_xy"
Tour path methods take a variety of optional arguments to specify the detailed behavior of the method. For example, most allow you to specify the number of dimensions (d = ) of the projections. The guided_tour() function is of particular interest here.
args(guided_tour)
#> function (index_f, d = 2, alpha = 0.5, cooling = 0.99, max.tries = 25,
#> max.i = Inf, search_f = search_geodesic, n_sample = 100,
#> ...)
#> NULL
In this, index_f specifies a function that the method tries to optimize along its path, and the package defines four indices:
- Holes (holes()): sensitive to projections with separated clusters of points, with few points near the origin.
- Central mass (cmass()): sensitive to projections with many points in the center, but perhaps with some outliers.
- Linear discriminant analysis (lda_pp()): for data with a grouping factor, optimizes a measure of separation of the group means, as in MANOVA or linear discriminant analysis.
- PDA analysis (pda_pp()): a penalized version of lda_pp() for cases of large \(p\) relative to sample size \(n\) (E.-K. Lee & Cook, 2009).
In addition, there is now a guided_anomaly_tour()
that looks for the best projection of observations that are outside the data ellipsoid, finding a view showing observations with large Mahalanobis distances from the centroid.
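As a sketch of how an index function is supplied in practice (assuming the scaled matrix peng_scaled of penguin size measures created in the next section):
# Guided tours driven by two of the built-in indices
animate(peng_scaled, guided_tour(holes()), display_xy(col = peng$species))
animate(peng_scaled, guided_tour(cmass()), display_xy(col = peng$species))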
3.7.2.3 Penguin tours
Penguins are a traveling species. They make yearly travels inland to breeding sites in early spring, repeating the patterns of their ancestors. Near the beginning of summer, adult penguins and their chicks return to the sea and spend the rest of the summer feeding there (Black et al., 2018). If they were also data scientists, they might wonder about the relations among their cousins of different species and take a tour of their measurements…
For example, using the Penguins dataset, the following calls produce grand tours in 2, 3, and 4 dimensions. The 2D tour is displayed as a scatterplot, the 3D tour using simulated depth as shown by variation in point size and transparency, and the 4D tour is shown using a parallel coordinate plot.
data(peng, package = "heplots")
peng_scaled <- scale(peng[,3:6])
colnames(peng_scaled) <- c("BL", "BD", "FL", "BM")
animate(peng_scaled, grand_tour(d = 2), display_xy())
animate(peng_scaled, grand_tour(d = 3), display_depth())
animate(peng_scaled, grand_tour(d = 4), display_pcp())
To illustrate, I’ll start with a grand tour designed to explore this 4D space of penguins. I’ll abbreviate the variables to two characters, “BL” = bill_length
, “BD” = bill_depth
, “FL” = flipper_length
, and “BM” = body_mass
and identify the penguin species using point shape (pch
) and color (col
).
As you watch this pay attention to the separation of the species and any other interesting features. What do you see?
data(peng, package = "heplots")
peng_scaled <- scale(peng[,3:6])
colnames(peng_scaled) <- c("BL", "BD", "FL", "BM")
pch <- c(15, 16, 17)[peng$species]
cex = 1.2
set.seed(1234)
animate(peng_scaled,
tour_path = grand_tour(d=2),
display_xy(col = peng$species,
palette = peng.colors("dark"),
pch = pch, cex = cex,
axis.col = "black",
axis.text.col = "black",
axis.lwd = 1.5))
Figure 3.42 shows three frames from this movie. The first (a) is the initial frame that shows the projection in the plane of bill depth and bill length. The variable vectors indicate that bill length differentiates Adelie penguins from the others. In frame (b), the three species are widely separated, with bill depth distinguishing Gentoo from the others. In frame (c) the three species are largely mixed, but two points stand out as outliers, with exceptionally long bills compared to the rest.
Let’s take the penguins on a guided tour, trying to find views that show the greatest separations among the penguin species; that is, a guided tour, optimizing the lda_pp()
index.
set.seed(1234)
animate(peng_scaled,
guided_tour(lda_pp(peng$species)),
display_xy(col = peng$species,
palette = peng.colors("dark"),
pch = pch,
cex = cex)
)
TODO: I’m trying to balance what will/can be shown in the HTML version vs. the printed PDF. Needs text here specifically for the PDF version.
(a) The lda_pp() criterion optimizes the separation of the means for species relative to within-group variation. (b) The anomaly_index() criterion optimizes the average Mahalanobis distance of points from the centroid.
These examples are intended to highlight what is possible with dynamic graphics for exploring high-dimensional data visually. Cook & Laa (2024) extend the discussion of these methods from Cook & Swayne (2007) (which used GGobi) to the tourr package. They illustrate dimension reduction, various cluster analysis methods, trees and random forests, and some machine-learning techniques.
Ideally, we should be able to interact with a tour,
- pausing when we see something interesting and saving the view for later analysis;
- selecting or highlighting unusual points,
- changing tour methods or variables displayed on the fly, and so forth.
Some packages that provide these capabilities are detourr (Hart & Wang, 2022), liminal (S. Lee, 2021) and langevitour (Harrison, 2023, 2024). The loon package (Waddell & Oldford, 2023) is a general toolkit that enables highly interactive data visualization. It provides a loon.tour package (Xu & Oldford, 2021) for using touring methods within the loon environment.
3.8 Network diagrams
A major theme throughout this chapter has been to understand how to extend data visualization from simple bivariate scatterplots to increasingly more complex situations with larger datasets. With a moderate number of variables, techniques such as smoothing, summarizing with data ellipses and fitted curves, and visual thinning can be used to tame “big \(N\)” datasets with thousands of cases.
However, "big \(p\)" datasets, with more than a moderate number (\(p\)) of variables, still remain a challenge. It is hard to see how the more advanced methods (corrgrams, parallel coordinate plots) described earlier could cope with \(p = 20, 50, 100, 500, \dots\) variables. At some point, each of these begins to break down for the purpose of visualizing associations among many variables. We are forced to thin the information presented in graphs more and more as the number of variables increases.
It turns out that there is a way to increase the number of variables displayed dramatically, if we are mainly interested in the pairwise correlations for reasonably normally distributed data. A graphical network diagram portrays variables by nodes (vertices), connected by (weighted) edges whose properties reflect the strength of connections between pairs, such as a correlation. Such diagrams can reveal properties not readily seen by other means.
As an example consider Figure 3.45, which portrays the correlations among 25 self-report items reflecting 5 factors (the "Big Five") considered in personality psychology to represent the dominant aspects of all of personality. These factors are easily remembered by the acronym OCEAN: Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. The dataset, psych::bfi
, contains data from an online sample of \(n=2800\) with 5 items for each scale.
In this figure (taken from Rodrigues (2021)), the item nodes are labeled according to the OCEAN factor they are assumed to measure. For 25 items, there are \(25 \times 24 / 2 = 300\) correlations, way too much to see. A clearer picture arises when we reduce the number of edges shown according to some criterion. Here, edges are drawn only between nodes where the correlation is considered important by a method (“glasso” = graphical LASSO) designed to make the graph optimally sparse.
The edges shown in Figure 3.45 reflect the Pearson correlation between a given pair of items by the visual attributes of color and line style: magnitude is shown by both the thickness and transparency of the edge; the sign of the correlation is shown by color and line type: solid blue for positive correlations and dashed red for negative ones.
According to some theories, the five personality factors should be largely non-overlapping, so there should not be many edges connecting items of one factor with those of another. Yet, there are quite a few cross-factor connections in Figure 3.45, so perhaps the theory is wrong, or, more likely, the 25 items are not good representatives of these underlying dimensions. The network diagram shown here is a visual tool for thought and refinement. See Costantini et al. (2015) for a tutorial on network analysis of personality data in R.
Network diagrams stem from mathematical graph theory (Bondy & Murty, 2008; West, 2001) of the abstract properties of nodes and edges used to represent pairwise relationships. These can be used to model many types of relations and processes in physical, biological, social and other sciences, where such properties as connectedness, centrality, cliques of connected nodes and so forth provide a vocabulary used to understand and explain complex systems.
For one example, Grandjean (2016) used network analysis to study the connections among 2500 Twitter users (nodes) who identified as belonging to a “digital humanities” community from the relations (edges) of who follows whom. Grandjean also used these methods to study the relationships among characters in Shakespeare’s tragedies in terms of the characters (nodes) and edges representing how often they appeared in the same scene.
The wide applicability of these ideas has led to what is now called network science (Barabási, 2016), encompassing computer networks, biological networks, cognitive and semantic networks, and social networks. Recent developments in psychology led to a framework of network psychometrics (Isvoranu et al., 2022), where, for example, symptoms of psychopathology (phobias, anxiety, substance abuse) can be conceptualized as an interconnected network of clusters and studied for possible causal relations (Robinaugh et al., 2019).
Because a network diagram can potentially reflect hundreds of variables, various graph layout algorithms have been developed to automatically position the nodes so as to generate aesthetically pleasing network visualizations that emphasize important structural properties, like clusters and central nodes, while minimizing visual clutter (many crossing lines) to promote understandability and usability.
There are quite a few R packages for constructing network diagrams, both static and dynamic / interactive, and these differ considerably in how the information required for a graph is structured as R objects, and in their flexibility to produce attractive graphs. Among these, igraph (Csárdi et al., 2024) structures the data as a dataset of vertices and edges with associated properties.
-> packages: qgraph, …
3.8.1 Crime data
For the present purposes, let’s see what network diagrams can tell us about the crime data analyzed earlier. Here, I first reorder the variables as in Figure 3.29. In the call to qgraph()
, the argument minimum = "sig"
says to show only the edges for significant correlations (at \(\alpha = 0.01\) here). In Figure 3.46, the variable nodes are positioned around a circle (layout = "circle"
), which is the default.
library(qgraph)
ord <- corrMatOrder(crime.cor, order = "AOE")
rownames(crime.cor)[ord]
#> [1] "murder" "assault" "rape" "robbery" "burglary"
#> [6] "larceny" "auto"
crime.cor <- crime.cor[ord, ord]
# "association graph": network of correlations
qgraph(crime.cor,
title = "Crime data:\ncorrelations", title.cex = 1.5,
graph = "cor",
layout = "circle",
minimum = "sig", sampleSize = nrow(crime), alpha = 0.01,
color = grey(.9), vsize = 12,
labels = rownames(crime.cor),
posCol = "blue")
In this figure, you can see the group of property crimes (auto theft, larceny, burglary) at the left separated from the violent crimes against persons at the right.
3.8.2 Partial correlations
Among the more important statistical applications of network graph theory is the idea that you can also use these diagrams to study the partial (conditional) associations among variables, with the contributions of all other variables removed, in what are called Graphical Gaussian Models (GGMs) (Højsgaard et al., 2012; Lauritzen, 1996). In a network diagram of these partial associations,
The edges between nodes represent the partial correlations between those variables.
The absence of an edge between two nodes indicates their variables are conditionally independent, given the other variables.
So, whereas a network diagram of correlations shows marginal associations ignoring other variables, one of partial correlations allows you to visualize the direct relationship between each pair of variables, removing the indirect effects that might be mediated through all other variables.
For a set of variables \(X = \{x_1, x_2, \dots, x_p \}\), the partial correlation between \(x_i\) and \(x_j\), controlling for all other variables \(Z = X \setminus \{x_i, x_j\} = x_\text{others}\), is equivalent to the correlation between the residuals of the linear regressions of \(x_i\) on all other variables \(\mathbf{Z}\) and of \(x_j\) on \(\mathbf{Z}\). (The notation \(X \setminus \{x_i, x_j\}\) is read as "\(X\) without the set \(\{x_i, x_j\}\)".)
Mathematically, let \(\hat{x}_i\) and \(\hat{x}_j\) be the predicted values from the linear regressions of \(x_i\) on \(\mathbf{Z}\) and of \(x_j\) on \(\mathbf{Z}\), respectively. The partial correlation \(p_{ij}\) between \(x_i\) and \(x_j\) controlling for \(\mathbf{Z}\) is given by: \[ p_{x_i,x_j|\mathbf{Z}} = r( x_i, x_j \mid \text{others}) = \text{cor}[ (x_i - \hat{x}_i),\; (x_j - \hat{x}_j)] \tag{3.3}\]
But, rather than running all these linear regressions, they can all be computed from the inverse of the correlation matrix (Whittaker, 1990, Ch. 5), a relation first noted by Dempster (1972). Let \(\mathbf{R}\) be the correlation matrix of the variables. Then the matrix \(\mathbf{P}\) of partial correlations can be obtained from the negative inverse, \(-\mathbf{R}^{-1}\), standardized to a correlation matrix by dividing each element by the square root of the product of the corresponding diagonal elements, \[ P_{ij} = - \frac{R^{-1}_{ij}}{\sqrt{R^{-1}_{ii} \cdot R^{-1}_{jj}}} \:\: . \]
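As a quick sketch of this computation, using the crime.cor matrix computed earlier:
# Partial correlations from the (negative) inverse of the correlation matrix
R_inv <- solve(crime.cor)
P <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
diag(P) <- 1          # set the diagonal to 1 by convention
round(P, 2)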
The practical implications of this are:
If a partial correlation is close to zero, it suggests the relationship between two variables is primarily mediated through other variables.
Non-zero partial correlations indicate a direct relationship that persists after controlling for other variables.
Figure 3.47 shows the partial correlation network for the crime data, using the qgraph() argument graph = "pcor". To provide a more interpretable result, the argument layout = "spring" positions the nodes using a force-embedded algorithm where edges act like springs, pulling connected nodes together, while unconnected nodes repel each other, pushing them apart.
qgraph(crime.cor,
title = "Crime data:\npartial correlations", title.cex = 1.5,
graph = "pcor",
layout = "spring", repulsion = 1.2,
minimum = "sig", sampleSize = nrow(crime), alpha = 0.05,
color = grey(.9), vsize = 14,
labels = rownames(crime.cor),
edge.labels = TRUE, edge.label.cex = 1.7,
posCol = "blue")
Figure 3.47 shows that, once all other crime variables are controlled for, only a few partial correlations remain significant at the \(\alpha = 0.05\) level. Of these, only the largest three in absolute value are significant at \(\alpha = 0.01\).
Thus, once all other variables are taken into account, what remains is mainly a strong positive association between burglary and larceny and a moderate one between auto theft and robbery. There also remains a moderate negative correlation between murder and larceny. The spring layout makes it clear that, with suppression of weak edges, auto theft and robbery form a cluster separated from the other variables.
3.8.3 Visualizing partial correlations
Just as you can visualize marginal association between variables in a scatterplot, you can also visualize conditional association. A partial variables plot is simply a scatterplot of the partial residuals \(e_i = (x_i - \hat{x}_i)\) from a regression of \(x_i\) on the other variables \(Z\) against those \(e_j = (x_j - \hat{x}_j)\) for another variable \(x_j\).
In this, you can use all the bells and whistles of standard scatterplots (regression lines, smooths, data ellipses, …) to listen more attentively to the story partial association has to tell. The function pvPlot()
calculates the partial residuals and then calls car::dataEllipse()
for display. The five most “unusual” observations by Mahalanobis \(D^2\) are identified with their abbreviated state labels. Figure 3.48 shows these plots for the variable pairs with the two largest partial correlations.
source("R/pvPlot.R")
# select numeric, make `st` into rownames
crime.num <- crime |>
tibble::column_to_rownames("st") |>
dplyr::select(where(is.numeric))
pvPlot(crime.num, vars = c("burglary", "larceny"),
id = list(n=5),
cex.lab = 1.5)
pvPlot(crime.num, vars = c("robbery", "auto"),
id = list(n=5),
cex.lab = 1.5)
In the pvPlot for burglary and larceny, you can see that the high partial correlation is largely driven by the extreme points at the left and right sides. Once all other variables are taken into account, Arizona (AZ) and Hawaii (HI) have a larger incidence of both crimes, while Arkansas (AK) is smaller on both.
In the pvPlot for robbery and auto theft, New York stands out as an influential, high-leverage point (see Section 6.6); Massachusetts (MA) is noteworthy because auto theft in that state is considerably higher than what would be predicted from all other variables.
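To connect these plots back to Equation 3.3, here is a minimal check (using crime.num created above) that correlating the two sets of residuals reproduces the burglary-larceny partial correlation:
# Verify Equation 3.3: partial correlation = correlation of residuals
others <- setdiff(colnames(crime.num), c("burglary", "larceny"))
e1 <- residuals(lm(reformulate(others, "burglary"), data = crime.num))
e2 <- residuals(lm(reformulate(others, "larceny"),  data = crime.num))
cor(e1, e2)   # should match the burglary-larceny entry in the pcor network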
3.9 Multivariate thinking and visualization
TODO: These are just initial notes on a chapter summary, and pointing the way to dimension reduction methods in Chapter 4.
This chapter has covered a lot of ground. We started with simple scatterplots and how to enhance them with graphical summaries and annotations …
The two curses
Multivariate data is often said to suffer from the curse of dimensionality (ref: Bellman1957), meaning that as the dimensionality of data increases, the volume of the space increases so fast that the available data become sparse, so the amount of data needed often grows exponentially with the dimensionality.
But, there is another curse here, the curse of two-dimensionality, meaning that as the dimensionality of data increases, what we can display and understand from a 2D image decreases rapidly with the number of dimensions of data. …
Package summary
For development, keep track of the packages used in each chapter.
16 packages used here: car, carData, corrgram, corrplot, dplyr, GGally, ggdensity, ggpcp, ggplot2, grid, knitr, patchwork, qgraph, tidyr, tourr, vcd
Confidence bands allow us to visualize the uncertainty around a fitted regression curve, which can be of two types: pointwise intervals or simultaneous intervals. The default setting in ggplot2::geom_smooth() calculates pointwise intervals (using stats::predict.lm(..., interval = "confidence")) at a confidence level \(1-\alpha\) for the predicted response at each value \(x_i\) of a predictor; these have the frequentist interpretation that over repeated sampling only \(100\;\alpha\)% of the predictions at \(x_i\) will be outside that interval. In contrast, simultaneous intervals are calculated so that \(1 - \alpha\) is the probability that all of them cover their corresponding true values simultaneously. These are necessarily wider than pointwise intervals. Commonly used methods for constructing simultaneous confidence bands in regression are the Bonferroni and Scheffé methods, which control the family-wise error rate over all values of \(x_i\). See for precise definitions of these terms. These are different from a prediction band, which is used to represent the uncertainty about the value of a new data point on the curve, subject to the additional variance reflected in one observation.↩︎

The classic study by Cleveland & McGill (1984); Cleveland & McGill (1985) shows that judgements of magnitude along a common scale are more accurate than those along separate, aligned scales.↩︎

The dataset was collected by Bernard Blishen, William Carroll and Catherine Moore, but apparently unpublished. A version updated to the 1981 census is described in Blishen et al. (1987).↩︎

Other implementations of parallel coordinate plots in R include: MASS::parcoord(), GGally::ggparcoord() and PairViz::pcp(). The ggpcp version used here is the most general.↩︎

This example was modified from one used by Cook et al. (2008).↩︎

Plot shapes given by pch = 15:18 correspond to: filled square (15), filled circle (16), filled triangle point-up (17), filled diamond (18).↩︎