GLMM Fundamentals

This page covers the statistical theory behind the GLMM tab. See that page for usage instructions.

Hierarchical Data and the Independence Problem

Most statistical models assume observations are independent. In practice, data often has a hierarchical structure where observations are grouped:

  • Students nested within schools
  • Patients nested within hospitals
  • Repeated measurements within subjects

Observations in the same group share group-level factors (school policies, hospital resources, individual physiology) and tend to be more similar to each other. Analyzing such data with GLM while ignoring group structure is a model misspecification: the data-generating process has hierarchical structure that the model fails to represent.

This misspecification causes several problems:

  • Underestimated standard errors: Within-group correlation means the effective sample size is smaller than the nominal n. Standard errors computed under independence do not reflect this, producing confidence intervals that are too narrow
  • Possible coefficient bias: When group-level confounders exist, fixed effect estimates become biased. For example, if school policy influences both study hours and test scores, a regression ignoring school differences cannot correctly estimate the effect of study hours. This is an omitted variable bias problem; random intercepts partially absorb group-level confounding but do not fully resolve it
  • Underestimated prediction uncertainty: Prediction intervals that ignore group-level variation understate actual variability
  • Loss of variance structure information: Without separating between-group and within-group variability, the data structure is not properly captured

Mixed models (also called multilevel models) directly address the standard error underestimation and loss of variance structure by explicitly modeling within-group correlation as random effects.

Fixed Effects and Random Effects

The "mixed" in mixed model refers to containing both fixed and random effects.

Fixed effects are the systematic relationships the researcher wants to estimate. For example, "how many points does test score increase per additional hour of study" is a fixed effect.

Random effects are a modeling device for representing unobserved group-level variability within the model. Each school's average score differs due to policies, student demographics, and other factors. This variability is not generated by a random mechanism — it stems from concrete causes — but measuring and including all those factors as predictors is rarely practical. Random effects approximate this "unobserved group difference" using a probability distribution (Hodges, 2014).

The assumption u_j \sim N(0, \sigma_u^2) is not a claim that school differences are generated from a normal distribution. It is a modeling assumption that enables estimation of the overall magnitude of group variability \sigma_u^2 and produces shrinkage estimates (see below).

Under this interpretation, the fixed/random distinction is not about whether groups were "randomly sampled" but about how to model group differences — a design choice (Gelman, 2005). When the overall pattern of group variability is of interest, random effects are appropriate. When the effect of a specific group is the target, fixed effects are more suitable.

Random Intercept Model

The simplest mixed model is the random intercept model, which gives each group its own baseline:

g(\mu_i) = x_i'\beta + u_{j[i]}, \quad u_j \sim N(0, \sigma_u^2)

where g is the link function, x_i'\beta is the fixed-effect linear predictor, and u_{j[i]} is the random intercept of the group j[i] to which observation i belongs.

Intuitively, there is a single regression line shared by all groups (x_i'\beta), and each group's line is shifted up or down by u_j. The normal distribution assumption N(0, \sigma_u^2) is a tool for summarizing the magnitude of these shifts with a single parameter \sigma_u^2. The actual value of each u_j is inferred from the data (see Estimation and Prediction).

For the Gaussian family, this is a linear mixed model (LMM):

y_i = x_i'\beta + u_{j[i]} + \varepsilon_i, \quad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_i \sim N(0, \sigma_e^2)

\sigma_u^2 is the between-group variance; \sigma_e^2 is the within-group (individual-level) variance.
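
As a minimal sketch (not MIDAS code; all names and values are illustrative), the LMM above can be simulated and its two variance components recovered with a simple method-of-moments check:

```python
import numpy as np

rng = np.random.default_rng(42)
n_groups, n_per = 200, 50
sigma_u, sigma_e = 2.0, 1.0        # true between-/within-group SDs
beta0, beta1 = 1.0, 0.5            # fixed effects

u = rng.normal(0, sigma_u, n_groups)               # one intercept per group
group = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(size=n_groups * n_per)
y = beta0 + beta1 * x + u[group] + rng.normal(0, sigma_e, x.size)

# Method-of-moments check: pooled within-group variance estimates
# sigma_e^2; the variance of the group means, minus its own sampling
# noise, estimates sigma_u^2.
resid = y - (beta0 + beta1 * x)    # uses the true fixed part for simplicity
within = np.mean([resid[group == j].var(ddof=1) for j in range(n_groups)])
means = np.array([resid[group == j].mean() for j in range(n_groups)])
between = means.var(ddof=1) - within / n_per
print(round(between, 1), round(within, 1))   # near the true 4.0 and 1.0
```

Actual fitting uses REML rather than moments (see below), but the decomposition into between- and within-group variability is the same.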

Parameter Estimation

Estimation in mixed models is more complex than in standard regression because fixed effects \beta and variance components (\sigma_u^2, \sigma_e^2) must be estimated simultaneously.

REML (Restricted Maximum Likelihood)

Maximum likelihood (ML) tends to underestimate variance components because the likelihood does not account for the degrees of freedom consumed by estimating \beta (the same reason the sample variance divides by n-1 rather than n).
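
The n versus n-1 analogy can be checked directly by simulation: dividing by n (ML-style) is biased low by \sigma^2/n on average, while dividing by n-1 (the "restricted" correction) is approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 1.0
biases_ml, biases_reml = [], []
for _ in range(20000):
    x = rng.normal(0, np.sqrt(true_var), n)
    biases_ml.append(x.var(ddof=0) - true_var)    # divide by n (ML-style)
    biases_reml.append(x.var(ddof=1) - true_var)  # divide by n-1 (REML-style)

# Average bias: close to -true_var/n = -0.2 for ML, close to 0 for REML.
print(round(np.mean(biases_ml), 2), round(np.mean(biases_reml), 2))
```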

REML corrects this bias by maximizing the likelihood of residuals after projecting out \beta.

AIC/BIC Limitations

Since the REML projection depends on the fixed-effect structure, REML-based AIC/BIC cannot be compared across models with different fixed effects. To compare models with different fixed effects, refit using ML and compare ML-based AIC/BIC.

Laplace Approximation and PIRLS

For Gaussian families, the random-effect integral is analytically tractable, so REML suffices. For non-Gaussian families (Binomial, Poisson, Gamma, etc.), integrating out the random effects:

L(\beta, \sigma_u^2) = \int \prod_i f(y_i \mid \beta, u) \cdot \phi(u; 0, \sigma_u^2) \, du

has no closed-form solution (\phi is the normal density).

The Laplace approximation replaces this integral with a second-order expansion around the mode of the integrand. MIDAS implements the algorithm described in Bates et al. (2015). The Laplace approximation may lose accuracy when group sizes are small or when events are rare in Binomial data.

  • Outer loop: Optimizes the relative covariance parameter \theta via Nelder-Mead
  • Inner loop: PIRLS (Penalized IRLS) simultaneously estimates (\beta, u)
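
The core idea of the Laplace step can be sketched in one dimension with a toy integrand (a Poisson-style log-likelihood plus a normal prior on u; this is an illustration, not the MIDAS implementation): approximate \int \exp(h(u)) \, du by \exp(h(\hat u)) \sqrt{2\pi / (-h''(\hat u))}, where \hat u is the mode of h.

```python
import numpy as np

# Toy integrand: log f(y|u) for Poisson(exp(u)) plus a N(0, sigma_u^2) prior.
y, sigma_u = 3.0, 1.0
h = lambda u: y * u - np.exp(u) - u**2 / (2 * sigma_u**2)

# Find the mode by Newton's method (h is concave, so this converges).
u_hat = 0.0
for _ in range(50):
    grad = y - np.exp(u_hat) - u_hat / sigma_u**2
    hess = -np.exp(u_hat) - 1 / sigma_u**2
    u_hat -= grad / hess

# Laplace approximation: Gaussian integral around the mode.
laplace = np.exp(h(u_hat)) * np.sqrt(2 * np.pi / -hess)

# Brute-force quadrature on a fine grid for comparison.
grid = np.linspace(-10, 10, 200001)
exact = np.sum(np.exp(h(grid))) * (grid[1] - grid[0])
print(round(laplace / exact, 3))   # close to 1 when h is nearly quadratic
```

The accuracy caveat in the text corresponds to cases where h is far from quadratic around its mode, e.g. few observations per group or rare Binomial events.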

PIRLS extends GLM's IRLS with a random-effect penalty. The penalized deviance (following Bates et al., 2015 notation, where \theta is the relative covariance parameter):

D_p = \sum_i d(y_i, \mu_i) + \sum_j \frac{u_j^2}{\theta^2}

The first term measures fit to data; the second penalizes u_j values that deviate too far from the normal prior. This penalty pulls estimates for data-sparse groups toward the overall mean (see shrinkage below).

Estimation and Prediction

Fixed effects \beta are unknown constants, so they are "estimated." Random effects u_j are treated as random variables within the model, so formally they are "predicted."

The term "prediction" for u_j arises from the mathematical structure of the model. Because a probability distribution is assumed for u_j, inferring its value combines two sources of information: the observed data and the assumed distribution N(0, \sigma_u^2). For \beta, only the finite data contributes uncertainty. For u_j, the model-assumed distribution also acts as an information source. This combination of two sources produces the shrinkage behavior described below.

The "prediction" terminology does not presuppose that u_j was truly generated randomly. It is a formal distinction that accompanies the modeling decision to represent group differences with a probability distribution.

BLUP (Best Linear Unbiased Predictor) is commonly used for predicting random effects. Among linear predictors satisfying E[\hat u_j - u_j] = 0 (prediction unbiasedness), BLUP minimizes mean squared prediction error. It derives from Henderson's (1950) mixed model equations.

For the Gaussian family, the BLUP of u_j takes the form of a shrinkage estimator:

\hat{u}_j = \frac{n_j}{n_j + \sigma_e^2 / \sigma_u^2} \times \bar{r}_j

where n_j is the group size and \bar{r}_j is the mean residual for group j (the part not explained by fixed effects).

The coefficient n_j / (n_j + \sigma_e^2/\sigma_u^2) ranges from 0 to 1, approaching 1 as group size increases:

  • Large groups: Enough data to trust the group-specific estimate, so use it nearly as-is
  • Small groups: Limited data, so pull the estimate toward the overall mean (zero)

This is also called "borrowing strength." Data-sparse groups borrow information from other groups to stabilize their estimates. Using group-specific estimates directly would have high variance; using only the overall mean would ignore group characteristics. BLUP optimally balances this bias-variance tradeoff.
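
The shrinkage factor n_j / (n_j + \sigma_e^2/\sigma_u^2) is easy to tabulate. With made-up variance components (\sigma_u^2 = 2, \sigma_e^2 = 8, so the ratio is 4) and a raw group-mean residual of 10:

```python
# Illustrative variance components, not estimates from real data.
sigma_u2, sigma_e2 = 2.0, 8.0
ratio = sigma_e2 / sigma_u2        # = 4.0
mean_resid = 10.0                  # raw group-mean residual

# Shrunken BLUP for group sizes 1, 4, and 100.
shrunk = {n_j: n_j / (n_j + ratio) * mean_resid for n_j in (1, 4, 100)}
print(shrunk)   # 2.0, 5.0, and about 9.6: large groups are barely shrunk
```

A group with a single observation keeps only 20% of its raw residual, while a group of 100 keeps over 96%.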

ICC (Intraclass Correlation Coefficient)

ICC measures the proportion of total variance attributable to between-group differences:

\text{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}

ICC = 0 means no group differences — all variability is at the individual level. ICC = 1 means no within-group differences — all variability is between groups.

When ICC is small, ignoring group structure and using a GLM produces nearly identical results. However, the impact depends not only on ICC but also on group size. The design effect \text{DEFF} = 1 + (\bar n - 1) \times \text{ICC}, where \bar n is the average group size, provides a rough measure of standard-error inflation. When ICC is large, a mixed model is necessary.

Non-Gaussian Families

For non-Gaussian families, \sigma_e^2 is not directly defined, so a latent-variable variance is used instead. For example, with Binomial + Logit link, the latent continuous variable follows a logistic distribution with variance \pi^2/3.

Family + Link    | Latent-Scale Residual Variance
Binomial + Logit | \pi^2/3 \approx 3.29
Poisson + Log    | 1 (approximation from the log-normal-Poisson mixture model)
Gamma + Log      | \phi (estimated dispersion parameter)
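
For example, the latent-scale ICC of a logistic random-intercept model substitutes \pi^2/3 for \sigma_e^2 (the variance value here is illustrative):

```python
import math

sigma_u2 = 1.0                                   # illustrative between-group variance
icc_latent = sigma_u2 / (sigma_u2 + math.pi**2 / 3)
print(round(icc_latent, 3))                      # about 0.233
```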

MIDAS Implementation

MIDAS uses different algorithms depending on the distribution family.

Gaussian (LMM): Profile REML. The variance component ratio \theta^2 = \sigma_u^2/\sigma_e^2 is optimized via Nelder-Mead, with \beta and \sigma_e^2 computed in closed form for each \theta.

Non-Gaussian (Binomial, Poisson, Gamma, etc.): Laplace approximation following Bates et al. (2015). An outer Nelder-Mead loop optimizes the relative covariance parameter \theta, while an inner PIRLS loop simultaneously estimates (\beta, u). The dispersion parameter \phi is fixed at 1 for Poisson/Binomial and estimated from the profiled deviance (D_p/n) for Gamma.

Fixed-effect initial values come from a GLM fit (without random effects). Predictors are internally standardized, and coefficients are transformed back to the original scale after estimation.
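
The standardize-then-back-transform step can be sketched for a single predictor (names are illustrative, not the actual MIDAS internals): fit on z = (x - \bar x)/s, then the slope divides by s and the intercept absorbs the shift.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50.0, 10.0, size=(500, 1))          # raw predictor
y = 2.0 + 0.3 * x[:, 0] + rng.normal(0, 0.1, 500)  # low noise, for clarity

mu, sd = x.mean(axis=0), x.std(axis=0)
z = (x - mu) / sd                                  # standardized predictor
Z = np.column_stack([np.ones(len(z)), z[:, 0]])
b_scaled, *_ = np.linalg.lstsq(Z, y, rcond=None)   # fit on scaled data

# Back-transform: y = a + b*z = (a - b*mu/sd) + (b/sd)*x
slope = b_scaled[1] / sd[0]
intercept = b_scaled[0] - b_scaled[1] * mu[0] / sd[0]
print(round(intercept, 2), round(slope, 2))        # recovers about 2.0 and 0.3
```

Standardizing improves the conditioning of the optimization without changing the model being fit.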

Random Slopes and Crossed Random Effects

The random intercept model only captures group-level differences in baseline. When the effect of a predictor also varies by group, a random slope model is needed:

g(\mu_i) = x_i'\beta + u_{0j} + u_{1j} x_{1i}

where u_{0j} is the random intercept and u_{1j} is the random slope, jointly following a multivariate normal distribution.

When observations belong to multiple grouping variables (e.g., students belong to both a school and a region), crossed random effects can be used.

MIDAS currently supports random intercept models only.

References

  • Henderson, C. R. (1950). Estimation of genetic parameters (Abstract). Annals of Mathematical Statistics, 21, 309-310.
  • Gelman, A. (2005). Analysis of variance—why it is more important than ever. The Annals of Statistics, 33(1), 1-53.
  • Hodges, J. S. (2014). Richly Parameterized Linear Models: Additive, Time Series, and Spatial Models Using Random Effects. Chapman & Hall/CRC.
  • Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48.