Using the negative binomial distribution, as provided by MASS::glm.nb(), to account for overdispersion in the standard Poisson glm() model for count data may need more explanation than the lecture slides provide, particularly regarding the interpretation of the \(\theta\) and \(\alpha = 1/\theta\) parameters.

The NB2 variance function

MASS::glm.nb() fits the NB2 parameterization of the negative binomial, where the variance is a quadratic function of the mean:

\[\text{Var}(Y) = \mu + \frac{\mu^2}{\theta}\]

Here \(\theta\) (the size, or precision, parameter) controls the extra-Poisson variation; its reciprocal \(\alpha = 1/\theta\) is the overdispersion parameter.

Expressed in terms of \(\alpha\), the variance is:

\[\text{Var}(Y) = \mu + \alpha\mu^2 = \mu(1 + \alpha\mu)\]

As \(\theta \to \infty\) (i.e. \(\alpha \to 0\)), the variance collapses to \(\mu\) and the distribution reduces to Poisson.
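A quick simulation can confirm both facts. The values of \(\mu\) and \(\theta\) below are illustrative (not from the PhdPubs fit), and the sketch relies on base R's rnbinom(), which uses exactly this size = \(\theta\), mu = \(\mu\) parameterization:

```r
# Check Var(Y) = mu + mu^2/theta by simulation (illustrative values)
set.seed(1)
mu    <- 4
theta <- 2
y <- rnbinom(1e6, size = theta, mu = mu)
c(theoretical = mu + mu^2 / theta, simulated = var(y))
# theoretical = 12; the simulated variance should be close to 12

# As theta grows, the variance collapses toward mu (the Poisson limit)
y_big <- rnbinom(1e6, size = 1e6, mu = mu)
var(y_big)   # close to mu = 4
```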


Fit the model

data(PhdPubs, package = "vcdExtra")

library(MASS)

# Poisson baseline
phd.pois <- glm(articles ~ ., data = PhdPubs, family = poisson)

# Negative binomial (NB2)
phd.nbin <- glm.nb(articles ~ ., data = PhdPubs)

# Extract theta and alpha
theta_hat <- phd.nbin$theta
se_theta  <- phd.nbin$SE.theta
alpha_hat <- 1 / theta_hat

cat(sprintf("theta (size):        %.4f  (SE = %.4f)\n", theta_hat, se_theta))
## theta (size):        2.2670  (SE = 0.2716)
cat(sprintf("alpha = 1/theta:     %.4f\n", alpha_hat))
## alpha = 1/theta:     0.4411
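glm.nb() reports a standard error for \(\theta\) but not for \(\alpha\). A delta-method approximation, \(\text{SE}(\alpha) \approx \text{SE}(\theta)/\theta^2\), gives a rough sketch of the uncertainty in \(\alpha\); this is an approximation layered on top of the MASS output, not something the package computes itself:

```r
# Delta method for alpha = 1/theta:
#   SE(alpha) ≈ SE(theta) / theta^2  (first-order Taylor approximation)
se_alpha <- se_theta / theta_hat^2
ci_alpha <- alpha_hat + c(-1.96, 1.96) * se_alpha   # rough Wald interval

cat(sprintf("alpha = %.4f, SE ~ %.4f, approx. 95%% CI [%.4f, %.4f]\n",
            alpha_hat, se_alpha, ci_alpha[1], ci_alpha[2]))
```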

The Poisson-Gamma mixture view

The most concrete interpretation comes from writing the model as a Poisson-Gamma mixture. Each observation has its own latent rate:

\[\lambda_i = \mu_i \cdot u_i, \qquad u_i \sim \text{Gamma}(\text{shape} = \theta,\ \text{rate} = \theta)\]

where \(u_i\) is a multiplicative individual-level random effect with:

  • \(\text{E}(u_i) = 1\) — no bias, centered on the model prediction
  • \(\text{Var}(u_i) = 1/\theta = \alpha\)

Interpretation of \(\alpha\): the variance of the unobserved multiplier applied to each person’s expected count. It captures all the individual heterogeneity not explained by the predictors.
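The mixture claim can be verified numerically: integrating the Poisson likelihood against the Gamma(\(\theta\), \(\theta\)) distribution of \(u_i\) recovers exactly the negative binomial pmf that glm.nb() fits. The values of \(\mu\) and \(y\) below are illustrative:

```r
# dnbinom(y; size = theta, mu) should equal
#   ∫ dpois(y; mu * u) * dgamma(u; shape = theta, rate = theta) du
mu    <- 3       # illustrative mean
theta <- 2.267   # size parameter (as estimated above)
y     <- 2       # an example count

mix <- integrate(
  function(u) dpois(y, mu * u) * dgamma(u, shape = theta, rate = theta),
  lower = 0, upper = Inf
)$value

c(mixture = mix, negbin = dnbinom(y, size = theta, mu = mu))
# the two values agree up to numerical integration error
```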

# Parameters of the Gamma mixing distribution for u_i
shape_u <- theta_hat          # = theta
rate_u  <- theta_hat          # so that E(u) = shape/rate = 1

cat(sprintf("Gamma mixing distribution for u_i:\n"))
## Gamma mixing distribution for u_i:
cat(sprintf("  shape = rate = theta = %.4f\n", theta_hat))
##   shape = rate = theta = 2.2670
cat(sprintf("  E(u_i)   = 1  (by construction)\n"))
##   E(u_i)   = 1  (by construction)
cat(sprintf("  Var(u_i) = 1/theta = alpha = %.4f\n", alpha_hat))
##   Var(u_i) = 1/theta = alpha = 0.4411
cat(sprintf("  SD(u_i)  = %.4f\n", sqrt(alpha_hat)))
##   SD(u_i)  = 0.6642
# Approximate range of u_i (Gamma 5th–95th percentiles)
lo <- qgamma(0.05, shape = shape_u, rate = rate_u)
hi <- qgamma(0.95, shape = shape_u, rate = rate_u)
cat(sprintf("  Middle 90%% of u_i: [%.2f, %.2f]\n", lo, hi))
##   Middle 90% of u_i: [0.21, 2.28]
u_seq <- seq(0, 4, length.out = 500)
u_df  <- data.frame(
  u       = u_seq,
  density = dgamma(u_seq, shape = shape_u, rate = rate_u)
)

library(ggplot2)

ggplot(u_df, aes(x = u, y = density)) +
  geom_line(linewidth = 1.1, color = "#2171b5") +
  geom_area(alpha = 0.15, fill = "#2171b5") +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey40") +
  annotate("text", x = 1.05, y = Inf, vjust = 1.5,
           label = "E(u) = 1", color = "grey40", size = 4) +
  annotate("rect", xmin = lo, xmax = hi, ymin = 0, ymax = Inf,
           alpha = 0.08, fill = "firebrick") +
  annotate("text", x = (lo + hi) / 2, y = Inf, vjust = 3,
           label = "middle 90%", color = "firebrick", size = 3.5) +
  labs(
    title    = "Gamma mixing distribution: u_i ~ Gamma(θ, θ)",
    subtitle = sprintf("θ = %.2f,  α = 1/θ = %.2f,  SD(u_i) = %.2f",
                       theta_hat, alpha_hat, sqrt(alpha_hat)),
    x = expression(u[i] ~ "(multiplicative rate multiplier)"),
    y = "Density"
  ) +
  theme_bw(base_size = 13)
Gamma mixing distribution for the individual-level multiplier u_i. Each PhD student's true rate is their predicted count μ_i multiplied by their u_i.


In plain language: even after accounting for all predictors, individual publication rates vary by a factor with SD ≈ 0.66 around the model prediction. A student at the 5th percentile of \(u_i\) publishes at 0.21× their predicted rate; one at the 95th publishes at 2.28×.


Variance inflation relative to Poisson

The variance inflation factor (VIF) relative to Poisson depends on the mean \(\mu\):

\[\text{VIF}(\mu) = \frac{\text{Var}(Y)}{\mu} = 1 + \frac{\mu}{\theta} = 1 + \alpha\mu\]

mu_vals <- c(1, 2, 4, 6, 10)
vif     <- 1 + alpha_hat * mu_vals

data.frame(mu = mu_vals, VIF = round(vif, 2),
           `Var(Y)` = round(vif * mu_vals, 2), check.names = FALSE) |>
  knitr::kable(caption = "Variance inflation at selected values of μ (PhdPubs NB fit)")
Variance inflation at selected values of μ (PhdPubs NB fit)

  μ    VIF   Var(Y)
  1   1.44     1.44
  2   1.88     3.76
  4   2.76    11.06
  6   3.65    21.88
 10   5.41    54.11
mu_seq <- seq(0.5, 12, length.out = 200)
vif_df <- data.frame(
  mu       = mu_seq,
  Poisson  = mu_seq,
  NegBin   = mu_seq + alpha_hat * mu_seq^2
)

tidyr::pivot_longer(vif_df, -mu, names_to = "model", values_to = "variance") |>
  ggplot(aes(x = mu, y = variance, color = model)) +
  geom_line(linewidth = 1.1) +
  scale_color_manual(values = c(Poisson = "#e6550d", NegBin = "#2171b5")) +
  labs(
    title    = "Var(Y): Poisson vs. Negative Binomial",
    subtitle = sprintf("NB: Var(Y) = μ + αμ²,  α = 1/θ = %.3f", alpha_hat),
    x        = "Mean  μ",
    y        = "Var(Y)",
    color    = NULL
  ) +
  theme_bw(base_size = 13) +
  theme(legend.position = "inside", legend.position.inside = c(0.2, 0.8))
Variance of Y under Poisson (= μ) vs. negative binomial, as a function of the mean.



Testing against Poisson: is overdispersion significant?

The likelihood-ratio test compares the NB log-likelihood to the Poisson one. Under \(H_0: \theta \to \infty\) (no overdispersion), the parameter sits on the boundary of its space, so the naive \(\chi^2_1\) p-value is conservative; the usual correction halves it.

# LRT: NegBin vs Poisson (as.numeric drops the logLik attributes)
LRT  <- as.numeric(2 * (logLik(phd.nbin) - logLik(phd.pois)))
pval <- pchisq(LRT, df = 1, lower.tail = FALSE) / 2   # boundary correction

cat(sprintf("LRT statistic: %.2f  (df = 1)\n", LRT))
## LRT statistic: 179.97  (df = 1)
cat(sprintf("p-value (boundary-corrected): %.2e\n", pval))
## p-value (boundary-corrected): 2.46e-41
# AIC comparison
cat(sprintf("\nAIC — Poisson: %.1f,  NegBin: %.1f,  ΔAIC: %.1f\n",
            AIC(phd.pois), AIC(phd.nbin), AIC(phd.pois) - AIC(phd.nbin)))
## 
## AIC — Poisson: 3313.3,  NegBin: 3135.4,  ΔAIC: 178.0
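A complementary quick diagnostic, not covered in the slides, is the Pearson dispersion statistic of the Poisson fit: the sum of squared Pearson residuals divided by the residual degrees of freedom. Values well above 1 flag overdispersion before any NB model is fitted:

```r
# Pearson dispersion of the Poisson fit; a well-specified Poisson model
# gives a value near 1, so much larger values suggest overdispersion
disp <- sum(residuals(phd.pois, type = "pearson")^2) / df.residual(phd.pois)
cat(sprintf("Pearson dispersion: %.2f  (Poisson assumes ~1)\n", disp))
```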

Quick reference

Quantity                  Expression                                PhdPubs estimate
\(\theta\)                size / precision                          2.267
\(\alpha = 1/\theta\)     Var(\(u_i\)) (overdispersion)             0.441
SD of \(u_i\)             \(\sqrt{\alpha}\)                         0.664
Poisson limit             \(\theta \to \infty\), \(\alpha \to 0\)   —
VIF at \(\mu = 4\)        \(1 + \alpha\mu\)                         2.76

The \(\alpha\) parameterization (used by Stata, pscl, countreg) is more directly interpretable as an overdispersion quantity. The \(\theta\) parameterization is more natural as the shape of the Gamma mixing distribution. Either way, the test \(H_0: \theta \to \infty\) (i.e. \(\alpha = 0\)) is the test against Poisson.