1 Introduction – Visualizing Multivariate Data and Models in R

This material may or may not survive; it was taken from an earlier article.

1.1 Multivariate vs. multivariable methods

multivariate \(\ne\) multivariable

In this era of multivitamins, multitools, multifactor authentication and even the multiverse, it is well to understand the distinction between multivariate and multivariable methods as these terms are generally used and as I use them here in relation to statistical methods and data visualization. The distinction is simple:

Multivariate methods for linear models such as multivariate regression have more than one dependent, response or outcome variable. Other multivariate methods such as principal components analysis or factor analysis treat all variables on an equal footing.
Multivariable methods have a single dependent variable and more than one independent variables or covariates.

1.2 Why use a multivariate design

A particular research outcome (e.g., depression, neuro-cognitive functioning, academic achievement, self-concept, attention deficit hyperactivity disorders) might take on a multivariate form if it has several observed measurement scales or related aspects by which it is quantified, or if there are multiple theoretically distinct outcomes that should be assessed in conjunction with each other (e.g., using depression, generalized anxiety, and stress inventories to model overall happiness). In this situation, the primary concern of the researcher is to ascertain the impact of potential predictors on two or more response variables simultaneously.

For example, if academic achievement is measured for adolescents by their reading, mathematics, science, and history scores, the following questions are of interest:

Do predictors such as parent encouragement, socioeconomic status and school environmental variables affect all of these outcomes?
Do they affect them in the same or different ways?
How many different aspects of academic achievement can be distinguished in the predictors? Equivalently, is academic achievement unidimensional or multidimensional in relation to the predictors?

Similarly, if psychiatric patients in various diagnostic categories are measured on a battery of tests related to social skills and cognitive functioning, we might want to know:

Which measures best discriminate among the diagnostic groups?
Which measures are most predictive of positive outcomes?
Further, how are the relationships between the outcomes affected by the predictors?

Such questions obviously concern more than just the separate univariate relations of each response to the predictors. Equally, or perhaps more importantly, are questions of how the response variables are predicted jointly.

SEM

Structural equation modeling (SEM) offers another route to explore and analyze the relationships among multiple predictors and multiple responses. They have the advantage of being able to test potentially complex systems of linear equations in very flexible ways; however, these methods are often far removed from data analysis per se and except for path diagrams offer little in the way of visualization methods to aid in understanding and communicating the results. The graphical methods we describe here can also be useful in a SEM context.

1.3 Linear models: Univariate to multivariate

For classical linear models for ANOVA and regression, the step from a univariate model for a single response, \(y\), to a multivariate one for a collection of \(p\) responses, \(\mathbf{y}\) is conceptually very easy. That’s because the univariate model,

\[y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q + \epsilon_i , \]

or, in matrix terms,

\[\mathbf{y} = \mathbf{X} \; \mathbf{\beta} + \mathbf{\epsilon}, \quad\mbox{ with }\quad \mathbf{u} \sim \mathcal{N} (0, \sigma^2 \mathbf{I}) ,\]

generalizes directly to an analogous multivariate linear model (MLM),

\[\mathbf{Y} = [\mathbf{y_1}, \mathbf{y_2}, \dots, \mathbf{y_p}] = \mathbf{X} \; \mathbf{B} + \Epsilon \quad\mbox{ with }\quad \Epsilon \sim \mathcal{N} (\mathbf{0}, \mathbf{\Sigma})\]

for multiple responses (as will be discussed in detail). The design matrix, \(\mathbf{X}\) remains the same, and the vector \(\beta\) of coefficients becomes a matrix \(\mathbf{B}\), with one column for each of the \(p\) outcome variables.

Happily as well, hypothesis tests for the MLM are also straight-forward generalizations of the familiar \(F\) and \(t\)-tests for univariate response models. Moreover, there is a rich geometry underlying these generalizations which we can exploit for understanding and visualization.

1.4 Visualization is harder

However, with two or more response variables, visualizations for multivariate models are not as simple as they are for their univariate counterparts for understanding the effects of predictors, model parameters, or model diagnostics. Consequently, the results of such studies are often explored and discussed solely in terms of coefficients and significance, and visualizations of the relationships are only provided for one response variable at a time, if at all. This tradition can mask important nuances, and lead researchers to draw erroneous conclusions.

The aim of this book is to describe and illustrate some central methods that we have developed over the last ten years that aid in the understanding and communication of the results of multivariate linear models (Friendly, 2007; Friendly & Meyer, 2016). These methods rely on data ellipsoids as simple, minimally sufficient visualizations of variance that can be shown in 2D and 3D plots. As will be demonstrated, the Hypothesis-Error (HE) plot framework applies this idea to the results of multivariate tests of linear hypotheses.

Further, in the case where there are more than just a few outcome variables, the important nectar of their relationships to predictors can often be distilled in a multivariate juicer— a projection of the multivariate relationships to the predictors in the low-D space that captures most of the flavor. This idea can be applied using canonical correlation plots and with canonical discriminant HE plots.

**Projection**: The cover image from Hofstadter’s *Gödel, Bach and Escher* illustrates projection of 3D solids onto each 2D plane.

1.5 Problems in understanding and communicating MLM results

In my consulting practice within the Statistical Consulting Service at York University, I see hundreds of clients each year ranging from advanced undergraduate thesis students, to graduate students and faculty from a variety of fields. Over the last two decades, and across each of these groups, I have noticed an increasing desire to utilize multivariate methods. As researchers are exposed to the utility and power of multivariate tests, they see them as an appealing alternative to running many univariate ANOVAs or multiple regressions for each response variable separately.

However, multivariate analyses are more complicated than such approaches, especially when it comes to understanding and communicating results. Output is typically voluminous, and researchers will often get lost in the numbers. While software (SPSS, SAS and R) make tabular summary displays easy, these often obscure the findings that researchers are most interested in. The most common analytic oversights that we have observed are:

Atomistic data screening: Researchers have mostly learned the assumptions (the Holy Trinity of normality, constant variance and independence) of univariate linear models, but then apply univariate tests (e.g., Shapiro-Wilk) and diagnostic plots (normal QQ plots) to every predictor and every response.
Bonferroni everywhere: Faced with the task of reporting the results for multiple response measures and a collection of predictors for each, a common tendency is to run (and sometimes report) each of the separate univariate response models and then apply a correction for multiple testing. Not only is this confusing and awkward to report, but it is largely unnecessary because the multivariate tests provide protection for multiple testing.
Reverting to univariate visualizations: To display results, SPSS and SAS make some visualization methods available through menu choices or syntax, but usually these are the wrong (or at least unhelpful) choices, in that they generate separate univariate graphs for the individual responses.

This book to discusses a few essential procedures for multivariate linear models, how their interpretation can be aided through the use of well-crafted (though novel) visualizations, and provides replicable sample code in R to showcase their use in applied behaviorial research. A later section [ref?] provides some practical guidelines for analyzing, visualizing and reporting such models to help avoid these and other problems.

Package summary:

1 packages used here: knitr