Exploring the idea of using association diagrams for fitting and understanding relationships in high-way categorical data fitted as loglinear models. The essential idea is to try to connect model fit statistics with some visual representation as network diagrams, with nodes representing variables in a model and edges representing pairwise associations.
One realization of this would be a function, say
assocgraph() that takes a loglinear model (from
MASS::loglm(), or glm(, family=poisson)) and
builds the graph object corresponding to all two-way
terms in the model. A weight for each edge could be obtained using the
deviance for that pair, found using either
MASS::dropterm() – the change in model fit (deviance,
LRT) due to dropping each term from a given model.MASS::addterm() – the change in model fit (deviance,
LRT) due to adding each two-way association to the model of mutual
independence.igraph seems to be
the base package for constructing & drawing network diagramstidygraph
provides a tidy API for graph/network manipulationggraph
uses such graph datasets to draw them better, within the
ggplot2 framework.MASS::dropterm() tries fitting all models that differ
from the current model by dropping a single term, maintaining
marginality.MASS::addterm() tries fitting all models that differ
from the current model by adding a single term, maintaining
marginality.vcdExtra::Kway() fits a collection of loglinear models
for all 0-, 1-, 2-, … way associations in a K-way tablevcdExtra::seq_loglm() fits various sequential models to
the 1-, 2-, … n-way marginal tables, corresponding to a variety of types
of loglinear models.library(vcdExtra)
library(MASS) # for loglm(), dropterm()
library(igraph) # Network Analysis and Visualization
library(ggraph) # An Implementation of Grammar of Graphics for Graphs and Networks
library(tidygraph)
library(ggplot2)
library(dplyr)
This example uses the DaytonSurvey dataset, representing
a \(2^5\) table.
data (DaytonSurvey, package = "vcdExtra")
head(DaytonSurvey)
## cigarette alcohol marijuana sex race Freq
## 1 Yes Yes Yes female white 405
## 2 No Yes Yes female white 13
## 3 Yes No Yes female white 1
## 4 No No Yes female white 1
## 5 Yes Yes No female white 268
## 6 No Yes No female white 218
View as a frequency table
Dayton_tab <- xtabs(Freq ~ ., data=DaytonSurvey)
structable(cigarette + alcohol + marijuana ~ sex + race, data=Dayton_tab)
## cigarette Yes No
## alcohol Yes No Yes No
## marijuana Yes No Yes No Yes No Yes No
## sex race
## female white 405 268 1 17 13 218 1 117
## other 23 23 0 1 2 19 0 12
## male white 453 228 1 17 28 201 1 133
## other 30 19 1 8 1 18 0 17
A quick survey of loglm models with all 0, 1, 2, 3 way terms
mods <- Kway(Freq ~ cigarette + alcohol + marijuana + sex + race,
data = DaytonSurvey,
order = 3)
Compare varying order of associations among variables
anova(mods, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Freq ~ 1
## Model 2: Freq ~ cigarette + alcohol + marijuana + sex + race
## Model 3: Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol +
## cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana +
## alcohol:sex + alcohol:race + marijuana:sex + marijuana:race +
## sex:race
## Model 4: Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol +
## cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana +
## alcohol:sex + alcohol:race + marijuana:sex + marijuana:race +
## sex:race + cigarette:alcohol:marijuana + cigarette:alcohol:sex +
## cigarette:alcohol:race + cigarette:marijuana:sex + cigarette:marijuana:race +
## cigarette:sex:race + alcohol:marijuana:sex + alcohol:marijuana:race +
## alcohol:sex:race + marijuana:sex:race
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 31 4818.1
## 2 26 1325.9 5 3492.1 <2e-16 ***
## 3 16 15.3 10 1310.6 <2e-16 ***
## 4 6 5.3 10 10.1 0.4345
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LRstats() gives AIC, BIC and LR Chisq tests.
LRstats(mods)
## Likelihood summary table:
## AIC BIC LR Chisq Df Pr(>Chisq)
## kway.0 4956.3 4957.8 4818.1 31 <2e-16 ***
## kway.1 1474.2 1483.0 1325.9 26 <2e-16 ***
## kway.2 183.6 207.0 15.3 16 0.4999
## kway.3 193.5 231.6 5.3 6 0.5094
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Focusing on the all-twoway model, this could have been fitted directly using the model formula notation.
mod.2way <- loglm(Freq ~ (cigarette + alcohol + marijuana + sex + race)^2,
data = DaytonSurvey)
mods$kway.2
##
## Call: glm(formula = Freq ~ cigarette + alcohol + marijuana + sex +
## race + cigarette:alcohol + cigarette:marijuana + cigarette:sex +
## cigarette:race + alcohol:marijuana + alcohol:sex + alcohol:race +
## marijuana:sex + marijuana:race + sex:race, family = family,
## data = data)
##
## Coefficients:
## (Intercept) cigaretteNo alcoholNo
## 5.9917 -3.0683 -5.7080
## marijuanaNo sexmale raceother
## -0.3938 0.1310 -2.8402
## cigaretteNo:alcoholNo cigaretteNo:marijuanaNo cigaretteNo:sexmale
## 2.0538 2.8596 0.1096
## cigaretteNo:raceother alcoholNo:marijuanaNo alcoholNo:sexmale
## -0.1354 2.9907 0.2483
## alcoholNo:raceother marijuanaNo:sexmale marijuanaNo:raceother
## 0.5073 -0.3176 0.3731
## sexmale:raceother
## 0.1456
##
## Degrees of Freedom: 31 Total (i.e. Null); 16 Residual
## Null Deviance: 4818
## Residual Deviance: 15.34 AIC: 183.6
We can use these as weights for the links in an association graph
(DT <- dropterm(mods$kway.2))
## Single term deletions
##
## Model:
## Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol +
## cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana +
## alcohol:sex + alcohol:race + marijuana:sex + marijuana:race +
## sex:race
## Df Deviance AIC
## <none> 15.34 183.58
## cigarette:alcohol 1 201.20 367.44
## cigarette:marijuana 1 513.47 679.71
## cigarette:sex 1 16.32 182.56
## cigarette:race 1 15.78 182.02
## alcohol:marijuana 1 106.96 273.20
## alcohol:sex 1 18.72 184.96
## alcohol:race 1 20.32 186.56
## marijuana:sex 1 25.16 191.40
## marijuana:race 1 18.93 185.17
## sex:race 1 16.18 182.42
Remove the <none> model. All we want is the stats
for dropping each two-way term.
DT <- DT[-1,]
igraphExploring the ways to use the igraph package to
visualize associations in multi-way tables.
Create the full graph of all two-way associations
full <- make_full_graph(5)
V() and E()igraph::V() and igraph::E() are used both
to display the node and edge information in a graph, or to assign
attributes to them.
vertexes (nodes):
V(full)
## + 5/5 vertices, from 87cb913:
## [1] 1 2 3 4 5
edges:
E(full)
## + 10/10 edges from 87cb913:
## [1] 1--2 1--3 1--4 1--5 2--3 2--4 2--5 3--4 3--5 4--5
plot()igraph::plot() plots the nodes and edges
plot(full)
The name attribute of nodes is special in
igraph.
For later use, this step assumes that the names assigned to the nodes
are in the same order as those used in the loglm()
model.
V(full)$name <- names(DaytonSurvey)[1:5]
V(full)
## + 5/5 vertices, named, from 87cb913:
## [1] cigarette alcohol marijuana sex race
E(full)
## + 10/10 edges from 87cb913 (vertex names):
## [1] cigarette--alcohol cigarette--marijuana cigarette--sex
## [4] cigarette--race alcohol --marijuana alcohol --sex
## [7] alcohol --race marijuana--sex marijuana--race
## [10] sex --race
make colors correspond to outcomes vs. demographic variables
V(full)$color <- c(rep("orange", 3),
rep("lightblue", 2))
You can also assign arbitrary variables to the edges.
Here, assign deviance and AIC as potential weights for edges
E(full)$deviance <- DT$Deviance
E(full)$aic <- DT$AIC
E(full)
## + 10/10 edges from 87cb913 (vertex names):
## [1] cigarette--alcohol cigarette--marijuana cigarette--sex
## [4] cigarette--race alcohol --marijuana alcohol --sex
## [7] alcohol --race marijuana--sex marijuana--race
## [10] sex --race
This is the key step for this visualization. For generality, need a better scheme to translate deviance into the width of the edges.
E(full)$width <- E(full)$deviance/25
igraph objectThere are lots of igraph functions to get details of a
given graph object.
full[]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## cigarette alcohol marijuana sex race
## cigarette . 1 1 1 1
## alcohol 1 . 1 1 1
## marijuana 1 1 . 1 1
## sex 1 1 1 . 1
## race 1 1 1 1 .
edge.attributes(full)
## $deviance
## [1] 201.19931 513.47218 16.31718 15.78347 106.95800 18.71695 20.32086
## [8] 25.16101 18.92894 16.17921
##
## $aic
## [1] 367.4376 679.7105 182.5555 182.0217 273.1963 184.9552 186.5591 191.3993
## [9] 185.1672 182.4175
##
## $width
## [1] 8.0479724 20.5388872 0.6526874 0.6313387 4.2783199 0.7486780
## [7] 0.8128344 1.0064403 0.7571574 0.6471683
ends() delivers the edges as two-column data.frame
ends(full, es=E(full), names=T)
## [,1] [,2]
## [1,] "cigarette" "alcohol"
## [2,] "cigarette" "marijuana"
## [3,] "cigarette" "sex"
## [4,] "cigarette" "race"
## [5,] "alcohol" "marijuana"
## [6,] "alcohol" "sex"
## [7,] "alcohol" "race"
## [8,] "marijuana" "sex"
## [9,] "marijuana" "race"
## [10,] "sex" "race"
Plot the all two-way model with edge width ~ deviance.
This uses the E(full)$width and V(full)$color
attributes assigned above. Other attributes can be assigned with
optional arguments.
set.seed(1234)
plot(full, vertex.label.cex=2)
ggraph ???This doesn’t work …
# ggraph(full) +
# geom_edge_link(aes(width = DT$Deviance[-1]/25)) +
# geom_node_point()