Visualizing Generalized Canonical Discriminant and Canonical Correlation Analysis
Version 1.1.0
Description
This package includes functions for computing and visualizing generalized canonical discriminant analyses and canonical correlation analysis for a multivariate linear model (MLM). The goal is to provide ways of visualizing such models in a low-dimensional space corresponding to dimensions (linear combinations of the response variables) of maximal relationship to the predictor variables.
Traditional canonical discriminant analysis is restricted to a one-way MANOVA design and is equivalent to canonical correlation analysis between a set of quantitative response variables and a set of dummy variables coded from the factor variable. The candisc package generalizes this to multi-way MANOVA designs for all terms in a multivariate linear model (i.e., an mlm object), computing canonical scores and vectors for each term (giving a "candiscList" object).
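The equivalence described above can be checked with base R alone. The following is a minimal sketch, not the package's own code: it assumes the built-in iris data, builds the dummy variables with model.matrix(), and calls stats::cancor() (the base R function, not the candisc version). The squared canonical correlations should agree with the CanRsq values that candisc() reports for the same data in the Examples section.

```r
# Responses: the four iris measurements
Y <- as.matrix(iris[, 1:4])

# Predictors: two dummy (treatment-coded) variables for the 3-level factor
X <- model.matrix(~ Species, data = iris)[, -1]

# Canonical correlation analysis between the dummies and the responses
cc <- stats::cancor(X, Y)

# Squared canonical correlations, one per discriminant dimension
canrsq <- cc$cor^2
round(canrsq, 5)
```

With three groups there are only two dummy variables, so at most two canonical dimensions can be non-zero.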
For MLMs with more than a few response variables, these methods often provide a much simpler interpretation of the nature of effects in low-D canonical space than HE plots for pairs of responses or an HE plot matrix of all responses in variable space.
The candisc package originated as a low-D cousin of the heplots package, designed to provide visualization methods in low-dimensional space.
Visualization methods
The graphic functions are designed to provide low-rank (1D, 2D, 3D) visualizations of terms in an "mlm" via the plot.candisc() method, which plots the observations in canonical space, together with variable vectors showing the relations of the response (y) variables to the canonical variables Can1, Can2, and so on. This is the same idea as that of a biplot (Gabriel, 1971).
The HE plot heplot.candisc() and heplot3d.candisc() methods use a similar framework, but replace the observations and groupwise data ellipses in the plot with representations of the H ellipsoid, representing between-group variation in the means and the E ellipsoid reflecting the pooled within-group variation.
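To make the H and E matrices concrete, here is a hedged base R sketch for a one-way design. It is an illustration under stated assumptions (the iris data and a single factor), not the computation used inside heplot.candisc(). The names H, E and lambda are chosen here for illustration.

```r
iris.mod <- lm(cbind(Petal.Length, Sepal.Length, Petal.Width, Sepal.Width) ~ Species,
               data = iris)

# E: pooled within-group (residual) sums of squares and cross-products
E <- crossprod(residuals(iris.mod))

# H: between-group SSP, from the fitted group means centered at the grand mean
H <- crossprod(scale(fitted(iris.mod), center = TRUE, scale = FALSE))

# Eigenvalues of E^{-1} H; only min(p, groups - 1) = 2 are non-zero here
lambda <- Re(eigen(solve(E) %*% H)$values)[1:2]
round(lambda, 5)
```

The two non-zero eigenvalues correspond to the Eigenvalue column printed by candisc() for the same model, and H + E reproduces the total SSP matrix of the responses.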
Analogously, a multivariate linear (regression) model with quantitative predictors can also be represented in a reduced-rank space by means of a canonical correlation transformation of the Y and X variables to uncorrelated canonical variates, named with the prefixes Ycan and Xcan. Computation for this analysis is provided by cancor() and related methods. Visualization of these results in canonical space is provided by the plot.cancor(), heplot.cancor() and heplot3d.cancor() methods.
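The idea of transforming two sets of variables to uncorrelated canonical variates can be sketched with base R. This example is an assumption-laden illustration: it uses stats::cancor() and the built-in LifeCycleSavings data rather than the package's own cancor(), and the Xcan/Ycan column names are assigned by hand to mirror the naming described above.

```r
# Two sets of quantitative variables from a built-in data set
X <- as.matrix(LifeCycleSavings[, c("pop15", "pop75")])
Y <- as.matrix(LifeCycleSavings[, c("sr", "dpi", "ddpi")])

cc <- stats::cancor(X, Y)

# Canonical variates: centered variables times the canonical coefficients
k <- length(cc$cor)
Xcan <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1:k]
Ycan <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef[, 1:k]
colnames(Xcan) <- paste0("Xcan", 1:k)
colnames(Ycan) <- paste0("Ycan", 1:k)

# Off-diagonal correlations are zero; the diagonal holds the canonical correlations
round(cor(Xcan, Ycan), 3)
```

Each Xcan variate is correlated only with its matching Ycan variate, which is what allows the relationship to be displayed one pair of dimensions at a time.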
Discriminant analysis
Some of these visualization methods have now been extended to linear and quadratic discriminant analysis, using MASS::lda() or MASS::qda().
- predict_discrim() provides a simplified interface to prediction.
- A new plotting method, plot_discrim(), provides ggplot2 plots of the classification regions and decision boundaries in data space and in discriminant space.
- cor_lda() calculates correlations between the observed variables and the discriminant dimensions.
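The kind of quantity described in the last item can be computed by hand from a MASS::lda() fit. The sketch below is only an analogue, assuming the iris data. It does not call cor_lda() and makes no claim about how that function is implemented. The names ld.scores and struct are illustrative.

```r
library(MASS)

iris.lda <- lda(Species ~ ., data = iris)

# Scores of the observations on the discriminant dimensions LD1, LD2
ld.scores <- predict(iris.lda)$x

# Correlations of the observed variables with the discriminant dimensions
struct <- cor(iris[, 1:4], ld.scores)
round(struct, 3)
```

Up to sign, these correlations match the structure correlations of a canonical discriminant analysis of the same data, because the discriminant dimensions span the same directions as the canonical variates.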
Variable ordering
The relations among response variables in linear models can also be useful for “effect ordering” (Friendly & Kwan, 2003) of variables in other multivariate data displays, such as heatmaps or “corrgrams” (Friendly, 2002) of correlations, to make the displayed relationships more coherent. The function varOrder() implements a collection of these methods.
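One simple way to derive such an ordering is sketched below in base R. This is a hypothetical illustration and not necessarily what varOrder() does: it assumes the iris data and orders the responses by their correlation with the first canonical dimension, obtained from stats::cancor() against dummy variables for Species.

```r
Y <- as.matrix(iris[, 1:4])
X <- model.matrix(~ Species, data = iris)[, -1]
cc <- stats::cancor(X, Y)

# Scores on the first canonical dimension of the responses
can1 <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef[, 1]

# Structure correlations of each response with that dimension
r1 <- drop(cor(Y, can1))

# Order the variables along the first canonical dimension
ord <- names(sort(r1))
ord

# Correlation matrix rearranged in that order, e.g. as input to a heatmap
round(cor(Y)[ord, ord], 2)
```

The sign of a canonical dimension is arbitrary, so the ordering may come out reversed. Either direction places similar variables next to each other.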
Installation
The current official release of the candisc package can be installed from CRAN. The most recent development version can be installed from R-universe or this GitHub repo.
| Source | Command |
|---|---|
| CRAN version | install.packages("candisc") |
| R-universe | install.packages('candisc', repos = c('https://friendly.r-universe.dev')) |
| GitHub version | remotes::install_github("friendly/candisc") |
Vignettes
- A new vignette, vignette("diabetes", package="candisc"), illustrates some of the methods of this package, with a dataset on forms of diabetes.
- Another vignette, vignette("painters", package="candisc"), applies these methods to a dataset on the aesthetic ratings of classical painters.
- A more comprehensive collection of examples, illustrating multivariate regression and MANOVA methods, is contained in the vignettes for the heplots package. Use browseVignettes(package = "heplots") to see them all.
Datasets
In addition to the datasets in the heplots package, candisc includes a few more related to the statistical and graphical methods implemented here. The table below classifies these with method tags. Their names are linked to their documentation with graphical output on the pkgdown website, [http://friendly.github.io/candisc].
| dataset | rows | cols | Title | tags |
|---|---|---|---|---|
| Grass | 40 | 7 | Yields from Nitrogen nutrition of grass species | MANOVA candisc discrim |
| HSB | 600 | 15 | High School and Beyond Data | MMRA cancor |
| PsyAcad | 600 | 8 | Psychological Measures and Academic Achievement | cancor |
| Wine | 178 | 14 | Chemical composition of three cultivars of wine | MANOVA candisc discrim |
| Wolves | 25 | 12 | Wolf skulls | candisc discrim |
| cereal | 77 | 16 | Breakfast Cereal Dataset | MMRA cancor |
| painters2 | 54 | 10 | Painters Data with Historical Art Variables | MANOVA candisc discrim |
Examples
These examples will get you started.
Iris data
Using the iris data, fit the multivariate model for Species. You can feed this to candisc() to get the canonical discriminant equivalent.
library(candisc)   # candisc() and its methods; car must also be installed for car::Anova()
iris.mod <- lm(cbind(Petal.Length, Sepal.Length, Petal.Width, Sepal.Width) ~ Species, data=iris)
car::Anova(iris.mod)
#>
#> Type II MANOVA Tests: Pillai test statistic
#> Df test stat approx F num Df den Df Pr(>F)
#> Species 2 1.1919 53.466 8 290 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
iris.can <- candisc(iris.mod, data=iris)
iris.can
#>
#> Canonical Discriminant Analysis for Species:
#>
#> CanRsq Eigenvalue Difference Percent Cumulative
#> 1 0.96987 32.19193 31.907 99.12126 99.121
#> 2 0.22203 0.28539 31.907 0.87874 100.000
#>
#> Test of H0: The canonical correlations in the
#> current row and all that follow are zero
#>
#> LR test stat approx F numDF denDF Pr(> F)
#> 1 0.02344 199.145 8 288 < 2.2e-16 ***
#> 2 0.77797 13.794 3 145 5.794e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This says that although there are 4 dimensions in the iris data, only two are needed to fully account for the differences among the group means relative to within-group variation. Both are significant by the likelihood ratio tests, but the first accounts for 99.1% of these differences.
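The columns of the printout above are linked by simple arithmetic on the eigenvalues. The sketch below copies the two eigenvalues from that output and recomputes the other quantities. The variable names are illustrative, and the formulas are the standard ones for canonical analysis rather than code taken from the package.

```r
# Eigenvalues copied from the candisc() printout above
lambda <- c(32.19193, 0.28539)

canrsq  <- lambda / (1 + lambda)                # squared canonical correlations
percent <- 100 * lambda / sum(lambda)           # share of between-group discrimination
wilks   <- rev(cumprod(rev(1 / (1 + lambda))))  # LR test statistic for rows 1 and 2

round(cbind(canrsq, percent, wilks), 5)
```

The LR statistic in each row is the product of 1 / (1 + eigenvalue) over that row and all later rows, which is why it tests whether the current and all following canonical correlations are zero.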
Methods are available to extract coefficients (coef()) and scores of observations on the canonical dimensions (scores()).
# get coefficients
coef(iris.can)
#> Can1 Can2
#> Petal.Length 0.9472572 0.40103782
#> Sepal.Length -0.4269548 -0.01240753
#> Petal.Width 0.5751608 -0.58103986
#> Sepal.Width -0.5212417 -0.73526131
scores(iris.can) |> str()
#> 'data.frame': 150 obs. of 3 variables:
#> $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ Can1 : num -8.06 -7.13 -7.49 -6.81 -8.13 ...
#> $ Can2 : num -0.3 0.787 0.265 0.671 -0.514 ...
Correlations between the observed variables and the canonical dimensions are given by the $structure component of the object.
iris.can$structure
#> Can1 Can2
#> Petal.Length 0.9849513 -0.04603709
#> Sepal.Length 0.7918878 -0.21759312
#> Petal.Width 0.9728120 -0.22290236
#> Sepal.Width -0.5307590 -0.75798931
Plotting methods
The basic plot for a "candisc" object is a scatterplot of the scores() for the observations on the canonical variates, which are linear combinations of the observed variables. For ease of interpretation, it also plots vectors for the responses in the model, showing their structure correlations with the canonical dimensions.
The following plot illustrates some of the options.
#-- assign colors and symbols corresponding to species
iris.colors <- c("red", "darkgreen", "blue")
iris.pch <- 15:17
plot(
iris.can,
col = iris.colors,
pch = iris.pch,
ellipse = TRUE,
var.lwd = 2,
cex.lab = 1.4)
The heplot() method for a "candisc" object replaces the individual scores by an H ellipse showing variation of the means and an E ellipse reflecting the pooled within-group variation.
heplot(
iris.can,
fill = TRUE, fill.alpha = 0.1,
prefix = "Canonical dimension",
var.col = "black",
var.lwd = 2,
scale = 35,
lab.cex = 1.25)
Citation
To cite package candisc in publications use:
Friendly M., Fox J. (2025). candisc: Visualizing Generalized Canonical Discriminant and Canonical Correlation Analysis. R package version 1.0.0, https://CRAN.R-project.org/package=candisc.
For the theory on which these methods are based, also cite:
Friendly, M. (2007). “HE plots for Multivariate General Linear Models.” Journal of Computational and Graphical Statistics, 16(2), 421-444. https://doi.org/10.1198/106186007X208407.
References
Friendly, M. (2002). Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56(4), 316–324. https://doi.org/10.1198/000313002533
Friendly, M., & Kwan, E. (2003). Effect Ordering for Data Displays. Computational Statistics and Data Analysis, 43(4), 509–539. https://doi.org/10.1016/S0167-9473(02)00290-6
Gabriel, K. R. (1971). The Biplot Graphic Display of Matrices with Application to Principal Components Analysis. Biometrika, 58(3), 453–467.
