Basic ideas

Exploring the idea of using association diagrams for fitting and understanding relationships in high-way categorical data fitted as loglinear models. The essential idea is to try to connect model fit statistics with some visual representation as network diagrams, with nodes representing variables in a model and edges representing pairwise associations.

One realization of this would be a function, say assocgraph() that takes a loglinear model (from MASS::loglm(), or glm(, family=poisson)) and builds the graph object corresponding to all two-way terms in the model. A weight for each edge could be obtained using the deviance for that pair, found using either

Graph packages:

Tutorials

Model fitting:

  • MASS::dropterm() tries fitting all models that differ from the current model by dropping a single term, maintaining marginality.
  • MASS::addterm() tries fitting all models that differ from the current model by adding a single term, maintaining marginality.
  • vcdExtra::Kway() fits a collection of loglinear models for all 0-, 1-, 2-, … way associations in a K-way table
  • vcdExtra::seq_loglm() fits various sequential models to the 1-, 2-, … n-way marginal tables, corresponding to a variety of types of loglinear models.

Load packages

library(vcdExtra)
library(MASS)     # for loglm(), dropterm()
library(igraph)   # Network Analysis and Visualization
library(ggraph)   # An Implementation of Grammar of Graphics for Graphs and Networks
library(tidygraph)
library(ggplot2)
library(dplyr)

Load the data

This example uses the DaytonSurvey dataset, representing a \(2^5\) table.

data (DaytonSurvey, package = "vcdExtra")
head(DaytonSurvey)
##   cigarette alcohol marijuana    sex  race Freq
## 1       Yes     Yes       Yes female white  405
## 2        No     Yes       Yes female white   13
## 3       Yes      No       Yes female white    1
## 4        No      No       Yes female white    1
## 5       Yes     Yes        No female white  268
## 6        No     Yes        No female white  218

View as a frequency table

Dayton_tab <- xtabs(Freq ~ ., data=DaytonSurvey)
structable(cigarette + alcohol + marijuana ~ sex + race, data=Dayton_tab)
##              cigarette Yes              No            
##              alcohol   Yes      No     Yes      No    
##              marijuana Yes  No Yes  No Yes  No Yes  No
## sex    race                                           
## female white           405 268   1  17  13 218   1 117
##        other            23  23   0   1   2  19   0  12
## male   white           453 228   1  17  28 201   1 133
##        other            30  19   1   8   1  18   0  17

Model survey

A quick survey of loglm models with all 0, 1, 2, 3 way terms

mods <- Kway(Freq ~ cigarette + alcohol + marijuana + sex + race, 
             data = DaytonSurvey, 
             order = 3)

Test nested models

Compare varying order of associations among variables

anova(mods, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Freq ~ 1
## Model 2: Freq ~ cigarette + alcohol + marijuana + sex + race
## Model 3: Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol + 
##     cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana + 
##     alcohol:sex + alcohol:race + marijuana:sex + marijuana:race + 
##     sex:race
## Model 4: Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol + 
##     cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana + 
##     alcohol:sex + alcohol:race + marijuana:sex + marijuana:race + 
##     sex:race + cigarette:alcohol:marijuana + cigarette:alcohol:sex + 
##     cigarette:alcohol:race + cigarette:marijuana:sex + cigarette:marijuana:race + 
##     cigarette:sex:race + alcohol:marijuana:sex + alcohol:marijuana:race + 
##     alcohol:sex:race + marijuana:sex:race
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1        31     4818.1                         
## 2        26     1325.9  5   3492.1   <2e-16 ***
## 3        16       15.3 10   1310.6   <2e-16 ***
## 4         6        5.3 10     10.1   0.4345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summarize these models

LRstats() gives AIC, BIC and LR Chisq tests.

LRstats(mods)
## Likelihood summary table:
##           AIC    BIC LR Chisq Df Pr(>Chisq)    
## kway.0 4956.3 4957.8   4818.1 31     <2e-16 ***
## kway.1 1474.2 1483.0   1325.9 26     <2e-16 ***
## kway.2  183.6  207.0     15.3 16     0.4999    
## kway.3  193.5  231.6      5.3  6     0.5094    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Direct fitting

Focusing on the all-twoway model, this could have been fitted directly using the model formula notation.

mod.2way <- loglm(Freq ~ (cigarette + alcohol + marijuana + sex + race)^2, 
                 data = DaytonSurvey)

The all two-way model

mods$kway.2
## 
## Call:  glm(formula = Freq ~ cigarette + alcohol + marijuana + sex + 
##     race + cigarette:alcohol + cigarette:marijuana + cigarette:sex + 
##     cigarette:race + alcohol:marijuana + alcohol:sex + alcohol:race + 
##     marijuana:sex + marijuana:race + sex:race, family = family, 
##     data = data)
## 
## Coefficients:
##             (Intercept)              cigaretteNo                alcoholNo  
##                  5.9917                  -3.0683                  -5.7080  
##             marijuanaNo                  sexmale                raceother  
##                 -0.3938                   0.1310                  -2.8402  
##   cigaretteNo:alcoholNo  cigaretteNo:marijuanaNo      cigaretteNo:sexmale  
##                  2.0538                   2.8596                   0.1096  
##   cigaretteNo:raceother    alcoholNo:marijuanaNo        alcoholNo:sexmale  
##                 -0.1354                   2.9907                   0.2483  
##     alcoholNo:raceother      marijuanaNo:sexmale    marijuanaNo:raceother  
##                  0.5073                  -0.3176                   0.3731  
##       sexmale:raceother  
##                  0.1456  
## 
## Degrees of Freedom: 31 Total (i.e. Null);  16 Residual
## Null Deviance:       4818 
## Residual Deviance: 15.34     AIC: 183.6

Find deviance / AIC for all pairwise terms.

We can use these as weights for the links in an association graph

(DT <- dropterm(mods$kway.2))
## Single term deletions
## 
## Model:
## Freq ~ cigarette + alcohol + marijuana + sex + race + cigarette:alcohol + 
##     cigarette:marijuana + cigarette:sex + cigarette:race + alcohol:marijuana + 
##     alcohol:sex + alcohol:race + marijuana:sex + marijuana:race + 
##     sex:race
##                     Df Deviance    AIC
## <none>                    15.34 183.58
## cigarette:alcohol    1   201.20 367.44
## cigarette:marijuana  1   513.47 679.71
## cigarette:sex        1    16.32 182.56
## cigarette:race       1    15.78 182.02
## alcohol:marijuana    1   106.96 273.20
## alcohol:sex          1    18.72 184.96
## alcohol:race         1    20.32 186.56
## marijuana:sex        1    25.16 191.40
## marijuana:race       1    18.93 185.17
## sex:race             1    16.18 182.42

Remove the <none> model. All we want is the stats for dropping each two-way term.

DT <- DT[-1,]

Exploring igraph

Exploring the ways to use the igraph package to visualize associations in multi-way tables.
Create the full graph of all two-way associations

full <- make_full_graph(5)

Functions V() and E()

igraph::V() and igraph::E() are used both to display the node and edge information in a graph, or to assign attributes to them.

vertexes (nodes):

V(full)
## + 5/5 vertices, from 87cb913:
## [1] 1 2 3 4 5

edges:

E(full)
## + 10/10 edges from 87cb913:
##  [1] 1--2 1--3 1--4 1--5 2--3 2--4 2--5 3--4 3--5 4--5

plot()

igraph::plot() plots the nodes and edges

plot(full)

Give nodes the names of the variables

The name attribute of nodes is special in igraph.

For later use, this step assumes that the names assigned to the nodes are in the same order as those used in the loglm() model.

V(full)$name <- names(DaytonSurvey)[1:5]

V(full)
## + 5/5 vertices, named, from 87cb913:
## [1] cigarette alcohol   marijuana sex       race
E(full)
## + 10/10 edges from 87cb913 (vertex names):
##  [1] cigarette--alcohol   cigarette--marijuana cigarette--sex      
##  [4] cigarette--race      alcohol  --marijuana alcohol  --sex      
##  [7] alcohol  --race      marijuana--sex       marijuana--race     
## [10] sex      --race

Node colors

make colors correspond to outcomes vs. demographic variables

V(full)$color <- c(rep("orange", 3),
                   rep("lightblue", 2))

Edge statistics

You can also assign arbitrary variables to the edges.

Here, assign deviance and AIC as potential weights for edges

E(full)$deviance <- DT$Deviance
E(full)$aic      <- DT$AIC

E(full)
## + 10/10 edges from 87cb913 (vertex names):
##  [1] cigarette--alcohol   cigarette--marijuana cigarette--sex      
##  [4] cigarette--race      alcohol  --marijuana alcohol  --sex      
##  [7] alcohol  --race      marijuana--sex       marijuana--race     
## [10] sex      --race

Make edge width ~ deviance

This is the key step for this visualization. For generality, need a better scheme to translate deviance into the width of the edges.

E(full)$width <- E(full)$deviance/25

Explore the igraph object

There are lots of igraph functions to get details of a given graph object.

full[]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##           cigarette alcohol marijuana sex race
## cigarette         .       1         1   1    1
## alcohol           1       .         1   1    1
## marijuana         1       1         .   1    1
## sex               1       1         1   .    1
## race              1       1         1   1    .
edge.attributes(full)
## $deviance
##  [1] 201.19931 513.47218  16.31718  15.78347 106.95800  18.71695  20.32086
##  [8]  25.16101  18.92894  16.17921
## 
## $aic
##  [1] 367.4376 679.7105 182.5555 182.0217 273.1963 184.9552 186.5591 191.3993
##  [9] 185.1672 182.4175
## 
## $width
##  [1]  8.0479724 20.5388872  0.6526874  0.6313387  4.2783199  0.7486780
##  [7]  0.8128344  1.0064403  0.7571574  0.6471683

ends() delivers the edges as two-column data.frame

ends(full, es=E(full), names=T) 
##       [,1]        [,2]       
##  [1,] "cigarette" "alcohol"  
##  [2,] "cigarette" "marijuana"
##  [3,] "cigarette" "sex"      
##  [4,] "cigarette" "race"     
##  [5,] "alcohol"   "marijuana"
##  [6,] "alcohol"   "sex"      
##  [7,] "alcohol"   "race"     
##  [8,] "marijuana" "sex"      
##  [9,] "marijuana" "race"     
## [10,] "sex"       "race"

Plot

Plot the all two-way model with edge width ~ deviance. This uses the E(full)$width and V(full)$color attributes assigned above. Other attributes can be assigned with optional arguments.

set.seed(1234)
plot(full, vertex.label.cex=2)

Try to do this using ggraph ???

This doesn’t work …

# ggraph(full) +
#   geom_edge_link(aes(width = DT$Deviance[-1]/25)) + 
#   geom_node_point()