Exploring the idea of using association diagrams for fitting and understanding relationships in high-way categorical data fitted as loglinear models. The essential idea is to try to connect model fit statistics with some visual representation as network diagrams, with nodes representing variables in a model and edges representing pairwise associations.
One realization of this would be a function, say
assocgraph()
that takes a loglinear model (from
MASS::loglm()
, or glm(, family=poisson)
) and
builds the graph object corresponding to all two-way
terms in the model. A weight for each edge could be obtained using the
deviance for that pair, found using either
MASS::dropterm()
– the change in model fit (deviance,
LRT) due to dropping each term from a given model.MASS::addterm()
– the change in model fit (deviance,
LRT) due to adding each two-way association to the model of mutual
independence.igraph
seems to be
the base package for constructing & drawing network diagramstidygraph
provides a tidy API for graph/network manipulationggraph
uses such graph datasets to draw them better, within the
ggplot2
framework.MASS::dropterm()
tries fitting all models that differ
from the current model by dropping a single term, maintaining
marginality.MASS::addterm()
tries fitting all models that differ
from the current model by adding a single term, maintaining
marginality.vcdExtra::Kway()
fits a collection of loglinear models
for all 0-, 1-, 2-, … way associations in a K-way tablevcdExtra::seq_loglm()
fits various sequential models to
the 1-, 2-, … n-way marginal tables, corresponding to a variety of types
of loglinear models.library(vcdExtra)
library(MASS) # for loglm(), dropterm()
library(igraph) # Network Analysis and Visualization
library(ggraph) # An Implementation of Grammar of Graphics for Graphs and Networks
library(tidygraph)
library(ggplot2)
library(dplyr)
This example uses the DaytonSurvey
dataset, representing
a \(2^5\) table.
data (DaytonSurvey, package = "vcdExtra")
head(DaytonSurvey)
## cigarette alcohol marijuana sex race Freq
## 1 Yes Yes Yes female white 405
## 2 No Yes Yes female white 13
## 3 Yes No Yes female white 1
## 4 No No Yes female white 1
## 5 Yes Yes No female white 268
## 6 No Yes No female white 218
View as a frequency table
Dayton_tab <- xtabs(Freq ~ ., data=DaytonSurvey)
structable(cigarette + alcohol + marijuana ~ sex + race, data=Dayton_tab)
## cigarette Yes No
## alcohol Yes No Yes No
## marijuana Yes No Yes No Yes No Yes No
## sex race
## female white 405 268 1 17 13 218 1 117
## other 23 23 0 1 2 19 0 12
## male white 453 228 1 17 28 201 1 133
## other 30 19 1 8 1 18 0 17
A quick survey of loglm models with all 0, 1, 2, 3 way terms
mods <- Kway(Freq ~ cigarette + alcohol + marijuana + sex + race,
data = DaytonSurvey,
order = 3)
Compare varying order of associations among variables
anova(mods, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Freq ~ 1
## Model 2: Freq ~ cigarette + alcohol + marijuana + sex + race
## Model 3: Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol +
## cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana +
## alcohol:sex + alcohol:race + marijuana:sex + marijuana:race +
## sex:race
## Model 4: Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol +
## cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana +
## alcohol:sex + alcohol:race + marijuana:sex + marijuana:race +
## sex:race + cigarette:alcohol:marijuana + cigarette:alcohol:sex +
## cigarette:alcohol:race + cigarette:marijuana:sex + cigarette:marijuana:race +
## cigarette:sex:race + alcohol:marijuana:sex + alcohol:marijuana:race +
## alcohol:sex:race + marijuana:sex:race
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 31 4818.1
## 2 26 1325.9 5 3492.1 <2e-16 ***
## 3 16 15.3 10 1310.6 <2e-16 ***
## 4 6 5.3 10 10.1 0.4345
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LRstats()
gives AIC, BIC and LR Chisq tests.
LRstats(mods)
## Likelihood summary table:
## AIC BIC LR Chisq Df Pr(>Chisq)
## kway.0 4956.3 4957.8 4818.1 31 <2e-16 ***
## kway.1 1474.2 1483.0 1325.9 26 <2e-16 ***
## kway.2 183.6 207.0 15.3 16 0.4999
## kway.3 193.5 231.6 5.3 6 0.5094
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Focusing on the all-twoway model, this could have been fitted directly using the model formula notation.
mod.2way <- loglm(Freq ~ (cigarette + alcohol + marijuana + sex + race)^2,
data = DaytonSurvey)
mods$kway.2
##
## Call: glm(formula = Freq ~ cigarette + alcohol + marijuana + sex +
## race + cigarette:alcohol + cigarette:marijuana + cigarette:sex +
## cigarette:race + alcohol:marijuana + alcohol:sex + alcohol:race +
## marijuana:sex + marijuana:race + sex:race, family = family,
## data = data)
##
## Coefficients:
## (Intercept) cigaretteNo alcoholNo
## 5.9917 -3.0683 -5.7080
## marijuanaNo sexmale raceother
## -0.3938 0.1310 -2.8402
## cigaretteNo:alcoholNo cigaretteNo:marijuanaNo cigaretteNo:sexmale
## 2.0538 2.8596 0.1096
## cigaretteNo:raceother alcoholNo:marijuanaNo alcoholNo:sexmale
## -0.1354 2.9907 0.2483
## alcoholNo:raceother marijuanaNo:sexmale marijuanaNo:raceother
## 0.5073 -0.3176 0.3731
## sexmale:raceother
## 0.1456
##
## Degrees of Freedom: 31 Total (i.e. Null); 16 Residual
## Null Deviance: 4818
## Residual Deviance: 15.34 AIC: 183.6
We can use these as weights for the links in an association graph
(DT <- dropterm(mods$kway.2))
## Single term deletions
##
## Model:
## Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol +
## cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana +
## alcohol:sex + alcohol:race + marijuana:sex + marijuana:race +
## sex:race
## Df Deviance AIC
## <none> 15.34 183.58
## cigarette:alcohol 1 201.20 367.44
## cigarette:marijuana 1 513.47 679.71
## cigarette:sex 1 16.32 182.56
## cigarette:race 1 15.78 182.02
## alcohol:marijuana 1 106.96 273.20
## alcohol:sex 1 18.72 184.96
## alcohol:race 1 20.32 186.56
## marijuana:sex 1 25.16 191.40
## marijuana:race 1 18.93 185.17
## sex:race 1 16.18 182.42
Remove the <none>
model. All we want is the stats
for dropping each two-way term.
DT <- DT[-1,]
igraph
Exploring the ways to use the igraph
package to
visualize associations in multi-way tables.
Create the full graph of all two-way associations
full <- make_full_graph(5)
V()
and E()
igraph::V()
and igraph::E()
are used both
to display the node and edge information in a graph, or to assign
attributes to them.
vertexes (nodes):
V(full)
## + 5/5 vertices, from 87cb913:
## [1] 1 2 3 4 5
edges:
E(full)
## + 10/10 edges from 87cb913:
## [1] 1--2 1--3 1--4 1--5 2--3 2--4 2--5 3--4 3--5 4--5
plot()
igraph::plot()
plots the nodes and edges
plot(full)
The name
attribute of nodes is special in
igraph
.
For later use, this step assumes that the names assigned to the nodes
are in the same order as those used in the loglm()
model.
V(full)$name <- names(DaytonSurvey)[1:5]
V(full)
## + 5/5 vertices, named, from 87cb913:
## [1] cigarette alcohol marijuana sex race
E(full)
## + 10/10 edges from 87cb913 (vertex names):
## [1] cigarette--alcohol cigarette--marijuana cigarette--sex
## [4] cigarette--race alcohol --marijuana alcohol --sex
## [7] alcohol --race marijuana--sex marijuana--race
## [10] sex --race
make colors correspond to outcomes vs. demographic variables
V(full)$color <- c(rep("orange", 3),
rep("lightblue", 2))
You can also assign arbitrary variables to the edges.
Here, assign deviance and AIC as potential weights for edges
E(full)$deviance <- DT$Deviance
E(full)$aic <- DT$AIC
E(full)
## + 10/10 edges from 87cb913 (vertex names):
## [1] cigarette--alcohol cigarette--marijuana cigarette--sex
## [4] cigarette--race alcohol --marijuana alcohol --sex
## [7] alcohol --race marijuana--sex marijuana--race
## [10] sex --race
This is the key step for this visualization. For generality, need a better scheme to translate deviance into the width of the edges.
E(full)$width <- E(full)$deviance/25
igraph
objectThere are lots of igraph
functions to get details of a
given graph object.
full[]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## cigarette alcohol marijuana sex race
## cigarette . 1 1 1 1
## alcohol 1 . 1 1 1
## marijuana 1 1 . 1 1
## sex 1 1 1 . 1
## race 1 1 1 1 .
edge.attributes(full)
## $deviance
## [1] 201.19931 513.47218 16.31718 15.78347 106.95800 18.71695 20.32086
## [8] 25.16101 18.92894 16.17921
##
## $aic
## [1] 367.4376 679.7105 182.5555 182.0217 273.1963 184.9552 186.5591 191.3993
## [9] 185.1672 182.4175
##
## $width
## [1] 8.0479724 20.5388872 0.6526874 0.6313387 4.2783199 0.7486780
## [7] 0.8128344 1.0064403 0.7571574 0.6471683
ends()
delivers the edges as two-column data.frame
ends(full, es=E(full), names=T)
## [,1] [,2]
## [1,] "cigarette" "alcohol"
## [2,] "cigarette" "marijuana"
## [3,] "cigarette" "sex"
## [4,] "cigarette" "race"
## [5,] "alcohol" "marijuana"
## [6,] "alcohol" "sex"
## [7,] "alcohol" "race"
## [8,] "marijuana" "sex"
## [9,] "marijuana" "race"
## [10,] "sex" "race"
Plot the all two-way model with edge width ~ deviance
.
This uses the E(full)$width
and V(full)$color
attributes assigned above. Other attributes can be assigned with
optional arguments.
set.seed(1234)
plot(full, vertex.label.cex=2)
ggraph
???This doesn’t work …
# ggraph(full) +
# geom_edge_link(aes(width = DT$Deviance[-1]/25)) +
# geom_node_point()