GLMM Fundamentals

This page covers the statistical theory behind the GLMM tab. See that page for usage instructions.

Hierarchical Data and the Independence Problem

Most statistical models assume observations are independent. In practice, data often has a hierarchical structure where observations are grouped:

  • Students nested within schools
  • Patients nested within hospitals
  • Repeated measurements within subjects

Observations in the same group share group-level factors (school policies, hospital resources, individual physiology) and tend to be more similar to each other. Analyzing such data with GLM while ignoring group structure is a model misspecification: the data-generating process has hierarchical structure that the model fails to represent.

This misspecification has two aspects.

Wrong covariance structure: The model assumes zero covariance between observations, but observations within the same group are positively correlated. Ignoring this correlation causes both fixed-effect standard errors and prediction intervals to be underestimated. 95% confidence intervals do not actually achieve 95% coverage, and false positive rates in hypothesis tests exceed the nominal significance level. Additionally, between-group and within-group variability cannot be separated, preventing quantitative understanding of the hierarchical structure.

Possible omitted variable bias: When group-level confounders exist, fixed effect estimates become biased. For example, if school policy influences both study hours and test scores, a regression ignoring school differences cannot correctly estimate the effect of study hours. Random intercepts model mean-level differences between groups, but if confounders are correlated with predictors, this bias is not resolved. Measured confounders should be included as fixed effects.

Mixed models directly address the covariance structure problem by explicitly modeling within-group correlation as random effects. For omitted variable bias, the confounders must be included as fixed effects.

Fixed Effects and Random Effects

The "mixed" in mixed model refers to containing both fixed and random effects.

Fixed effects are the systematic relationships the researcher wants to estimate. For example, "how many points does test score increase per additional hour of study" is a fixed effect.

Random effects are a modeling device for representing unobserved group-level variability within the model. Each school's average score differs due to policies, student demographics, and other factors. This variability is not generated by a random mechanism — it stems from concrete causes — but measuring and including all those factors as predictors is rarely practical. Random effects approximate this "unobserved group difference" using a probability distribution.

The assumption $u_j \sim N(0, \sigma_u^2)$ is not a claim that school differences are generated from a normal distribution. It is a modeling assumption that enables estimation of the overall magnitude of group variability $\sigma_u^2$ and produces shrinkage estimates (see below).

Under this interpretation, the fixed/random distinction is not about whether groups were "randomly sampled" but about how to model group differences — a design choice (Gelman, 2005). When the overall pattern of group variability is of interest, random effects are appropriate. When the effect of a specific group is the target, fixed effects are more suitable.

Random Intercept Model

The simplest mixed model is the random intercept model, which gives each group its own baseline:

$$g(\mu_i) = x_i'\beta + u_{j[i]}, \quad u_j \sim N(0, \sigma_u^2)$$

where $g$ is the link function, $x_i'\beta$ is the fixed-effect linear predictor, and $u_{j[i]}$ is the random intercept for the group $j[i]$ to which observation $i$ belongs.

Intuitively, there is a single regression line shared by all groups ($x_i'\beta$), and each group's line is shifted up or down by $u_j$. The $N(0, \sigma_u^2)$ assumption is a tool for summarizing the magnitude of these shifts with a single parameter $\sigma_u^2$. The actual value of each $u_j$ is inferred from the data (see Estimation and Prediction).

For the Gaussian family, this is a linear mixed model (LMM):

$$y_i = x_i'\beta + u_{j[i]} + \varepsilon_i, \quad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_i \sim N(0, \sigma_e^2)$$

$\sigma_u^2$ is the between-group variance; $\sigma_e^2$ is the within-group (individual-level) variance.
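As a concrete illustration, data can be simulated from this model. All parameter values below are hypothetical, chosen only to make the structure visible:

```python
import numpy as np

rng = np.random.default_rng(42)
n_groups, n_per = 30, 10                 # 30 groups, 10 observations each
beta0, beta1 = 50.0, 3.0                 # hypothetical fixed intercept and slope
sigma_u, sigma_e = 2.0, 1.0              # between-group and within-group SDs

group = np.repeat(np.arange(n_groups), n_per)
u = rng.normal(0.0, sigma_u, n_groups)   # one random intercept per group
x = rng.uniform(0.0, 5.0, n_groups * n_per)

# y_i = x_i' beta + u_{j[i]} + eps_i
y = beta0 + beta1 * x + u[group] + rng.normal(0.0, sigma_e, n_groups * n_per)
```

Every observation in group $j$ shares the same draw $u_j$, which is exactly what induces the within-group correlation discussed above.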

Parameter Estimation

Estimation in mixed models is more complex than in standard regression because fixed effects $\beta$ and variance components ($\sigma_u^2$, $\sigma_e^2$) must be estimated simultaneously.

REML (Restricted Maximum Likelihood)

Maximum likelihood (ML) tends to underestimate variance components because the likelihood does not account for degrees of freedom consumed by estimating $\beta$ (the same reason sample variance uses $n-1$ rather than $n$ as divisor).

REML corrects this bias by maximizing the likelihood of residuals after projecting out $\beta$.
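The divisor analogy can be checked numerically: averaging the ML-style (divide by $n$) and REML-style (divide by $n-1$) variance estimates over many simulated small samples exposes the ML bias directly. This is a sketch of the analogy, not MIDAS code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# 100,000 small samples from N(0, 1); the true variance is 1
samples = rng.normal(0.0, 1.0, size=(100_000, n))

ml_mean = samples.var(axis=1, ddof=0).mean()    # divide by n   (ML analogue)
reml_mean = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (REML analogue)

print(ml_mean, reml_mean)  # ml_mean is close to (n-1)/n = 0.8; reml_mean to 1.0
```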

AIC/BIC Limitations

Since the REML projection depends on the fixed-effect structure, REML-based AIC/BIC cannot be compared across models with different fixed effects. To compare models with different fixed effects, refit using ML and compare ML-based AIC/BIC.

Laplace Approximation and PIRLS

For Gaussian families, the random-effect integral is analytically tractable, so REML suffices. For non-Gaussian families (Binomial, Poisson, Gamma, etc.), integrating out the random effects:

$$L(\beta, \sigma_u^2) = \int \prod_i f(y_i \mid \beta, u) \, \phi(u; 0, \sigma_u^2) \, du$$

has no closed-form solution ($\phi$ is the normal density).

The Laplace approximation replaces this integral with a second-order expansion around the mode of the integrand. MIDAS implements the algorithm described in Bates et al. (2015). The Laplace approximation may lose accuracy when group sizes are small or when events are rare in Binomial data.

  • Outer loop: Optimizes the relative covariance parameter $\theta$ via Nelder-Mead
  • Inner loop: PIRLS (Penalized IRLS) simultaneously estimates $(\beta, u)$
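The idea behind the Laplace approximation, matching a Gaussian to the integrand at its mode, can be checked on a one-dimensional toy problem: the marginal likelihood contribution of a single Poisson group. All numbers are hypothetical and this is a sketch, not the MIDAS implementation:

```python
import numpy as np
from scipy import integrate, optimize, stats

# One group of Poisson counts sharing a random intercept u (hypothetical data)
y = np.array([3, 5, 4, 6, 2])
beta, sigma_u = 1.0, 0.5                 # hypothetical fixed effect and random-effect SD

def log_integrand(u):
    lam = np.exp(beta + u)               # Poisson mean for this group
    return stats.poisson.logpmf(y, lam).sum() + stats.norm.logpdf(u, 0.0, sigma_u)

# "Exact" marginal likelihood by numerical quadrature
exact, _ = integrate.quad(lambda u: np.exp(log_integrand(u)), -5.0, 5.0)

# Laplace: locate the mode of the integrand, then use the curvature there
opt = optimize.minimize_scalar(lambda u: -log_integrand(u),
                               bounds=(-5.0, 5.0), method="bounded")
u_hat = opt.x
curv = -y.size * np.exp(beta + u_hat) - 1.0 / sigma_u**2   # d^2/du^2 of log integrand
laplace = np.exp(log_integrand(u_hat)) * np.sqrt(2.0 * np.pi / -curv)

print(exact, laplace)  # the two values agree closely
```

With many groups and correlated random effects the integral is high-dimensional, which is why the mode must itself be found by an inner optimization (PIRLS) at every outer step.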

PIRLS extends GLM's IRLS with a random-effect penalty. The penalized deviance (following Bates et al. (2015) notation, where $\theta$ is the relative covariance parameter: $\theta = \sigma_u/\sigma_e$ for Gaussian, $\theta = \sigma_u$ for Poisson/Binomial, $\theta = \sigma_u/\sqrt{\phi}$ for Gamma):

$$D_p = \sum_i d(y_i, \mu_i) + \sum_j \frac{u_j^2}{\theta^2}$$

The first term measures fit to data; the second penalizes $u_j$ values that deviate too far from the normal prior. This penalty pulls estimates for data-sparse groups toward the overall mean (see shrinkage below).
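For the Gaussian family, where the unit deviance is the squared residual, the penalized deviance and its shrinking effect can be sketched directly. The data, slope, and $\theta$ below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

y = np.array([1.2, 0.9, 1.5, 2.8, 3.1])
x = np.array([1.0, 0.5, 1.3, 2.5, 2.9])
group = np.array([0, 0, 0, 1, 1])
beta, theta = 1.0, 0.8                   # fixed slope and relative covariance parameter

def penalized_deviance(u):
    mu = x * beta + u[group]                 # linear predictor with random intercepts
    fit = np.sum((y - mu) ** 2)              # Gaussian deviance: squared residuals
    penalty = np.sum(u ** 2) / theta ** 2    # normal-prior penalty on u
    return fit + penalty

u_pen = minimize(penalized_deviance, np.zeros(2)).x

# Without the penalty, each u_j would simply be the group's mean residual
resid = y - x * beta
u_raw = np.array([resid[group == j].mean() for j in range(2)])
print(u_pen, u_raw)  # the penalized estimates are pulled toward zero
```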

Estimation and Prediction

Fixed effects $\beta$ are unknown constants, so they are "estimated." Random effects $u_j$ are treated as random variables within the model, so formally they are "predicted."

The term "prediction" for uju_j arises from the mathematical structure of the model. Because a probability distribution is assumed for uju_j, inferring its value combines two sources of information: the observed data and the assumed distribution (N(0,σu2)N(0, \sigma_u^2)). For β\beta, only the finite data contributes uncertainty. For uju_j, the model-assumed distribution also acts as an information source. This combination of two sources produces the shrinkage behavior described below.

The "prediction" terminology does not presuppose that uju_j was truly generated randomly. It is a formal distinction that accompanies the modeling decision to represent group differences with a probability distribution.

BLUP (Best Linear Unbiased Predictor) is commonly used for predicting random effects. BLUP is the linear predictor that minimizes the mean squared prediction error $E[(\hat{u}_j - u_j)^2]$ subject to the constraint that estimable linear functions of the fixed effects are unbiased. It derives from Henderson's mixed model equations.

For Gaussian, the BLUP of $u_j$ takes the form of a shrinkage estimator. For non-Gaussian families, random effects are estimated as conditional modes (posterior modes). These are obtained as the PIRLS solution and exhibit shrinkage toward the overall mean similarly to Gaussian BLUP, but are not strictly BLUPs.

The Gaussian BLUP formula:

$$\hat{u}_j = \frac{n_j}{n_j + \sigma_e^2 / \sigma_u^2} \times \bar{r}_j$$

where $n_j$ is the group size and $\bar{r}_j$ is the mean residual for group $j$ (the part not explained by fixed effects).

The coefficient $n_j / (n_j + \sigma_e^2/\sigma_u^2)$ ranges from 0 to 1, approaching 1 as group size increases:

  • Large groups: Enough data to trust the group-specific estimate, so use it nearly as-is
  • Small groups: Limited data, so pull the estimate toward the overall mean (zero)

This is also called "borrowing strength." Data-sparse groups borrow information from other groups to stabilize their estimates. Using group-specific estimates directly would have high variance; using only the overall mean would ignore group characteristics. BLUP optimally balances this bias-variance tradeoff.
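The shrinkage factor can be tabulated for a few group sizes; the variance components and mean residual below are hypothetical:

```python
import numpy as np

sigma_u2, sigma_e2 = 4.0, 16.0          # hypothetical variance components
ratio = sigma_e2 / sigma_u2             # sigma_e^2 / sigma_u^2 = 4.0

n_j = np.array([2, 10, 100])            # group sizes
shrink = n_j / (n_j + ratio)            # coefficient in the BLUP formula
u_hat = shrink * 5.0                    # BLUP when the mean residual is 5.0

for n, s, u in zip(n_j, shrink, u_hat):
    print(f"n_j={n:3d}  shrinkage={s:.3f}  u_hat={u:.3f}")
```

A group of 2 keeps only a third of its raw residual, while a group of 100 keeps nearly all of it.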

ICC (Intraclass Correlation Coefficient)

ICC measures the proportion of total variance attributable to between-group differences:

$$\text{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}$$

ICC = 0 means no group differences — all variability is at the individual level. ICC = 1 means no within-group differences — all variability is between groups.

When ICC is small, ignoring group structure and using GLM produces nearly identical results. However, the impact depends not only on ICC but also on group size. The design effect $\text{DEFF} = 1 + (\bar{n} - 1) \times \text{ICC}$ (where $\bar{n}$ is the average group size) provides a rough measure of variance inflation. Standard errors are inflated by a factor of $\sqrt{\text{DEFF}}$. When ICC is large, a mixed model is necessary.
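A quick calculation with hypothetical variance components shows how ICC and group size combine:

```python
sigma_u2, sigma_e2 = 2.0, 8.0                 # hypothetical variance components
icc = sigma_u2 / (sigma_u2 + sigma_e2)        # 0.2

n_bar = 20                                     # average group size
deff = 1 + (n_bar - 1) * icc                   # design effect
se_factor = deff ** 0.5                        # SE inflation

print(icc, round(deff, 3), round(se_factor, 2))  # → 0.2 4.8 2.19
```

Even a modest ICC of 0.2 more than doubles the standard errors once groups average 20 observations.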

Non-Gaussian Families

For non-Gaussian families, $\sigma_e^2$ is not directly defined, so a latent-variable variance is used instead. For Binomial + Logit link, the observed binary response is assumed to have a continuous latent variable underlying it. With the Logit link, this latent variable follows a logistic distribution, and its variance $\pi^2/3$ serves as the residual variance.

| Family + Link | Latent-Scale Residual Variance |
| --- | --- |
| Binomial + Logit | $\pi^2/3 \approx 3.29$ |
| Poisson + Log | $1$ (approximation from the log-normal-Poisson mixture model; accuracy decreases when $\mu$ is small) |
| Gamma + Log | $\phi$ (estimated dispersion parameter) |
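These latent-scale variances plug into the ICC formula; for example, a Binomial + Logit model with a hypothetical random-intercept variance of 1.0:

```python
import math

sigma_u2 = 1.0                          # hypothetical random-intercept variance
latent_resid = math.pi ** 2 / 3         # logistic latent-variable variance, about 3.29
icc_latent = sigma_u2 / (sigma_u2 + latent_resid)

print(round(icc_latent, 3))  # → 0.233
```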

MIDAS Implementation

MIDAS uses different algorithms depending on the distribution family.

Gaussian (LMM): Profile REML. The variance component ratio $\theta^2 = \sigma_u^2/\sigma_e^2$ is optimized via Nelder-Mead, with $\beta$ and $\sigma_e^2$ computed in closed form for each $\theta$.

Non-Gaussian (Binomial, Poisson, Gamma, etc.): Laplace approximation following Bates et al. (2015). An outer Nelder-Mead loop optimizes the relative covariance parameter $\theta$, while an inner PIRLS loop simultaneously estimates $(\beta, u)$. The dispersion parameter $\phi$ is fixed at 1 for Poisson/Binomial and estimated from profiled deviance ($D_p/n$) for Gamma.

Fixed-effect initial values come from a GLM fit (without random effects). Predictors are internally scaled (standardized) and back-transformed after estimation.

Fixed Effect Inference and the Normal Approximation

Fixed effects are tested using Wald tests. The test statistic is $z = \hat\beta_j / \text{SE}(\hat\beta_j)$, and p-values and confidence intervals are computed from the standard normal distribution.

In ordinary regression (OLS), the error variance $\sigma^2$ is estimated, so $\hat\beta_j / \text{SE}(\hat\beta_j)$ follows a t-distribution with $n - p$ degrees of freedom. In mixed models, there are multiple variance components ($\sigma_u^2$, $\sigma_e^2$), and the corresponding degrees of freedom are not uniquely determined. The normal approximation treats the estimated variance components as known.

In LMM, the normality assumptions on $u_j$ and $\varepsilon_i$ ensure that $\hat\beta_j / \text{SE}(\hat\beta_j)$ follows the standard normal exactly when the variance components are known. In non-Gaussian GLMM, the normal distribution is an asymptotic approximation. In both cases, replacing the true variance components with estimates introduces additional uncertainty, making the distribution of $\hat\beta_j / \text{SE}(\hat\beta_j)$ heavier-tailed than the standard normal.

When the number of groups is large, the variance component estimates are stable and this approximation is adequate. With few groups, confidence intervals tend to be too narrow, and the null hypothesis is rejected more often than the nominal significance level.
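The Wald computation itself is straightforward; the estimate and standard error below are hypothetical:

```python
from scipy.stats import norm

beta_hat, se = 0.42, 0.15               # hypothetical fixed-effect estimate and SE

z = beta_hat / se                        # Wald statistic
p = 2 * norm.sf(abs(z))                  # two-sided p-value from the standard normal
crit = norm.ppf(0.975)                   # about 1.96
ci = (beta_hat - crit * se, beta_hat + crit * se)

print(round(z, 2), round(p, 4), [round(v, 3) for v in ci])  # → 2.8 0.0051 [0.126, 0.714]
```

With few groups, replacing `norm` with a t-distribution of conservatively chosen degrees of freedom is a common workaround, but MIDAS reports the normal approximation as described above.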

Random Slopes and Crossed Random Effects

The random intercept model only captures group-level differences in baseline. When the effect of a predictor also varies by group, a random slope model is needed:

$$g(\mu_i) = x_i'\beta + u_{0j} + u_{1j} x_{1i}$$

where $u_{0j}$ is the random intercept and $u_{1j}$ is the random slope, jointly following a multivariate normal distribution.
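Simulating from a random slope model makes the joint distribution explicit; the covariance values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
n_groups, n_obs = 20, 200
beta = np.array([2.0, 1.5])                    # fixed intercept and slope

# Joint distribution of (u_0j, u_1j): correlated intercepts and slopes
cov = np.array([[1.0, 0.3],
                [0.3, 0.25]])
u = rng.multivariate_normal([0.0, 0.0], cov, size=n_groups)

x1 = rng.uniform(-1.0, 1.0, n_obs)
group = rng.integers(0, n_groups, n_obs)

# g(mu_i) = beta0 + beta1*x1 + u0_j + u1_j*x1
eta = beta[0] + beta[1] * x1 + u[group, 0] + u[group, 1] * x1
```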

When observations belong to multiple grouping variables (e.g., students belong to both a school and a region), crossed random effects can be used.

MIDAS currently supports random intercept models only.

See also

References

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48.

Gelman, A. (2005). Analysis of variance: why it is more important than ever. The Annals of Statistics, 33(1), 1-53.