GLMM Fundamentals
This page covers the statistical theory behind the GLMM tab. See that page for usage instructions.
Hierarchical Data and the Independence Problem
Most statistical models assume observations are independent. In practice, data often has a hierarchical structure where observations are grouped:
- Students nested within schools
- Patients nested within hospitals
- Repeated measurements within subjects
Observations in the same group share group-level factors (school policies, hospital resources, individual physiology) and tend to be more similar to each other. Analyzing such data with GLM while ignoring group structure is a model misspecification: the data-generating process has hierarchical structure that the model fails to represent.
This misspecification causes several problems:
- Underestimated standard errors: Within-group correlation means the effective sample size is smaller than the nominal $n$. Standard errors computed under independence do not reflect this, producing confidence intervals that are too narrow
- Possible coefficient bias: When group-level confounders exist, fixed effect estimates become biased. For example, if school policy influences both study hours and test scores, a regression ignoring school differences cannot correctly estimate the effect of study hours. This is an omitted variable bias problem; random intercepts partially absorb group-level confounding but do not fully resolve it
- Underestimated prediction uncertainty: Prediction intervals that ignore group-level variation understate actual variability
- Loss of variance structure information: Without separating between-group and within-group variability, the data structure is not properly captured
Mixed models (also called multilevel models) directly address the standard error underestimation and loss of variance structure by explicitly modeling within-group correlation as random effects.
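The standard-error problem can be seen in a minimal simulation (a sketch with assumed variance values, not MIDAS code): for clustered data, the naive standard error of the mean computed under independence is several times smaller than the true sampling variability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, group_size = 20, 25
sigma_b, sigma_e = 1.0, 1.0   # assumed between- and within-group SDs

naive_ses, means = [], []
for _ in range(2000):
    b = rng.normal(0.0, sigma_b, n_groups)                  # group effects b_j
    eps = rng.normal(0.0, sigma_e, (n_groups, group_size))  # individual noise
    y = (b[:, None] + eps).ravel()
    means.append(y.mean())
    naive_ses.append(y.std(ddof=1) / np.sqrt(y.size))       # SE assuming independence

print(f"naive SE (independence): {np.mean(naive_ses):.3f}")
print(f"empirical SE of mean:    {np.std(means, ddof=1):.3f}")
```

With these values the empirical sampling variability of the mean is several times the naive SE, which is exactly the "too narrow confidence intervals" problem above.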
Fixed Effects and Random Effects
The "mixed" in mixed model refers to containing both fixed and random effects.
Fixed effects are the systematic relationships the researcher wants to estimate. For example, "how many points does test score increase per additional hour of study" is a fixed effect.
Random effects are a modeling device for representing unobserved group-level variability within the model. Each school's average score differs due to policies, student demographics, and other factors. This variability is not generated by a random mechanism — it stems from concrete causes — but measuring and including all those factors as predictors is rarely practical. Random effects approximate this "unobserved group difference" using a probability distribution (Hodges, 2014).
The assumption is not a claim that school differences are generated from a normal distribution. It is a modeling assumption that enables estimation of the overall magnitude of group variability and produces shrinkage estimates (see below).
Under this interpretation, the fixed/random distinction is not about whether groups were "randomly sampled" but about how to model group differences — a design choice (Gelman, 2005). When the overall pattern of group variability is of interest, random effects are appropriate. When the effect of a specific group is the target, fixed effects are more suitable.
Random Intercept Model
The simplest mixed model is the random intercept model, which gives each group its own baseline:

$$g(\mu_{ij}) = \mathbf{x}_{ij}^\top \boldsymbol{\beta} + b_j, \qquad b_j \sim N(0, \sigma_b^2)$$

where $g$ is the link function, $\mathbf{x}_{ij}^\top \boldsymbol{\beta}$ is the fixed-effect linear predictor, and $b_j$ is the random intercept for group $j$.

Intuitively, there is a single regression line shared by all groups ($\mathbf{x}^\top \boldsymbol{\beta}$), and each group's line is shifted up or down by $b_j$. The normal distribution assumption is a tool for summarizing the magnitude of these shifts with a single parameter $\sigma_b^2$. The actual value of each $b_j$ is inferred from the data (see Estimation and Prediction).

For the Gaussian family, this is a linear mixed model (LMM):

$$y_{ij} = \mathbf{x}_{ij}^\top \boldsymbol{\beta} + b_j + \varepsilon_{ij}, \qquad b_j \sim N(0, \sigma_b^2), \quad \varepsilon_{ij} \sim N(0, \sigma_e^2)$$

$\sigma_b^2$ is the between-group variance; $\sigma_e^2$ is the within-group (individual-level) variance.
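The two variance components can be recovered empirically in a quick simulation (assumed parameter values; an illustrative sketch, not the MIDAS estimator): after removing the fixed part, the variance of group means reflects $\sigma_b^2$ and the within-group variance reflects $\sigma_e^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 2.0, 0.5        # assumed fixed effects
sigma_b, sigma_e = 1.5, 1.0    # assumed between/within SDs
n_groups, m = 200, 50

b = rng.normal(0.0, sigma_b, n_groups)         # b_j ~ N(0, sigma_b^2)
x = rng.normal(0.0, 1.0, (n_groups, m))
eps = rng.normal(0.0, sigma_e, (n_groups, m))  # eps_ij ~ N(0, sigma_e^2)
y = beta0 + beta1 * x + b[:, None] + eps

# Remove the fixed part; what is left is b_j + eps_ij
resid = y - (beta0 + beta1 * x)
between = resid.mean(axis=1).var(ddof=1)   # ~ sigma_b^2 + sigma_e^2 / m
within = resid.var(axis=1, ddof=1).mean()  # ~ sigma_e^2
print(f"between-group: {between:.3f}  within-group: {within:.3f}")
```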
Parameter Estimation
Estimation in mixed models is more complex than in standard regression because fixed effects and variance components ($\sigma_b^2$, $\sigma_e^2$) must be estimated simultaneously.
REML (Restricted Maximum Likelihood)
Maximum likelihood (ML) tends to underestimate variance components because the likelihood does not account for degrees of freedom consumed by estimating $\boldsymbol{\beta}$ (the same reason sample variance uses $n - 1$ rather than $n$ as divisor).

REML corrects this bias by maximizing the likelihood of residuals after projecting out $\boldsymbol{\beta}$.
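The divisor analogy can be demonstrated directly with a toy numpy sketch: the ML-style variance estimator (divisor $n$) is biased downward by the factor $(n-1)/n$, while the $n-1$ divisor is unbiased.

```python
import numpy as np

rng = np.random.default_rng(2)
true_var, n = 4.0, 5   # assumed true variance and (small) sample size
samples = rng.normal(0.0, np.sqrt(true_var), (100_000, n))

ml_est = samples.var(axis=1, ddof=0).mean()     # divisor n: biased low
reml_like = samples.var(axis=1, ddof=1).mean()  # divisor n-1: unbiased
print(f"ML-style: {ml_est:.3f}  REML-style: {reml_like:.3f}  true: {true_var}")
```

With $n = 5$ the ML-style estimate averages about $4 \cdot 4/5 = 3.2$, the same systematic shortfall that REML removes for variance components.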
AIC/BIC Limitations
Since the REML projection depends on the fixed-effect structure, REML-based AIC/BIC cannot be compared across models with different fixed effects. To compare models with different fixed effects, refit using ML and compare ML-based AIC/BIC.
Laplace Approximation and PIRLS
For Gaussian families, the random-effect integral is analytically tractable, so REML suffices. For non-Gaussian families (Binomial, Poisson, Gamma, etc.), integrating out the random effects:

$$L(\boldsymbol{\beta}, \sigma_b^2) = \prod_j \int \Big[ \prod_i f(y_{ij} \mid \boldsymbol{\beta}, b_j) \Big] \, \phi(b_j \mid 0, \sigma_b^2)\, db_j$$

has no closed-form solution ($\phi$ is the normal density).
The Laplace approximation replaces this integral with a second-order expansion around the mode of the integrand. MIDAS implements the algorithm described in Bates et al. (2015). The Laplace approximation may lose accuracy when group sizes are small or when events are rare in Binomial data.
- Outer loop: Optimizes the relative covariance parameter $\theta$ via Nelder-Mead
- Inner loop: PIRLS (Penalized IRLS) simultaneously estimates $(\boldsymbol{\beta}, \mathbf{u})$

PIRLS extends GLM's IRLS with a random-effect penalty. The penalized deviance (following Bates et al., 2015 notation, where $\theta$ is the relative covariance parameter and $\mathbf{b} = \Lambda_\theta \mathbf{u}$):

$$d(\boldsymbol{\beta}, \mathbf{u} \mid \mathbf{y}) = \sum_{i,j} d_{ij}(y_{ij}, \mu_{ij}) + \|\mathbf{u}\|^2$$

The first term measures fit to data; the second penalizes random-effect values that deviate too far from the normal prior. This penalty pulls estimates for data-sparse groups toward the overall mean (see shrinkage below).
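As a one-dimensional sketch of the Laplace idea (toy values, not the MIDAS implementation): approximate a single group's Poisson-normal integral by finding the mode of the integrand with Newton steps and replacing the integrand with a Gaussian there, then compare against brute-force quadrature.

```python
import math
import numpy as np

# One group's marginal likelihood contribution for a Poisson GLMM:
#   integral over b of  Poisson(y | mu = exp(beta + b)) * N(b | 0, sigma_b^2) db
y, beta, sigma_b = 3, 1.0, 0.8   # toy values (assumed)

def log_integrand(b):
    mu = np.exp(beta + b)
    log_pois = y * (beta + b) - mu - math.log(math.factorial(y))
    log_norm = -0.5 * b**2 / sigma_b**2 - 0.5 * math.log(2 * math.pi * sigma_b**2)
    return log_pois + log_norm

# "Exact" value by dense trapezoidal quadrature
grid = np.linspace(-6.0, 6.0, 20001)
vals = np.exp(log_integrand(grid))
exact = float(np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(grid)))

# Laplace: Newton iteration to the mode, then a Gaussian approximation there
b_hat = 0.0
for _ in range(50):
    mu = np.exp(beta + b_hat)
    grad = y - mu - b_hat / sigma_b**2   # d/db log integrand
    hess = -mu - 1.0 / sigma_b**2        # d^2/db^2 (always negative here)
    b_hat -= grad / hess
laplace = math.exp(log_integrand(b_hat)) * math.sqrt(2 * math.pi / -hess)

print(f"quadrature: {exact:.6f}  laplace: {laplace:.6f}")
```

For this smooth, unimodal integrand the two values agree closely; the accuracy degrades in the small-group and rare-event settings noted above, where the integrand is less Gaussian.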
Estimation and Prediction
Fixed effects are unknown constants, so they are "estimated." Random effects are treated as random variables within the model, so formally they are "predicted."
The term "prediction" for $b_j$ arises from the mathematical structure of the model. Because a probability distribution is assumed for $b_j$, inferring its value combines two sources of information: the observed data and the assumed distribution ($N(0, \sigma_b^2)$). For $\boldsymbol{\beta}$, only the finite data contributes uncertainty. For $b_j$, the model-assumed distribution also acts as an information source. This combination of two sources produces the shrinkage behavior described below.

The "prediction" terminology does not presuppose that $b_j$ was truly generated randomly. It is a formal distinction that accompanies the modeling decision to represent group differences with a probability distribution.
BLUP (Best Linear Unbiased Predictor) is commonly used for predicting random effects. Among linear predictors satisfying $E[\hat{b}_j] = E[b_j]$ (prediction unbiasedness), BLUP minimizes mean squared prediction error. It derives from Henderson's (1950) mixed model equations.
For Gaussian, the BLUP of $b_j$ takes the form of a shrinkage estimator:

$$\hat{b}_j = \frac{n_j \sigma_b^2}{n_j \sigma_b^2 + \sigma_e^2}\, \bar{r}_j$$

where $n_j$ is the group size and $\bar{r}_j$ is the mean residual for group $j$ (the part not explained by fixed effects).

The shrinkage coefficient $n_j \sigma_b^2 / (n_j \sigma_b^2 + \sigma_e^2)$ ranges from 0 to 1, approaching 1 as group size increases:
- Large groups: Enough data to trust the group-specific estimate, so use it nearly as-is
- Small groups: Limited data, so pull the estimate toward the overall mean (zero)
This is also called "borrowing strength." Data-sparse groups borrow information from other groups to stabilize their estimates. Using group-specific estimates directly would have high variance; using only the overall mean would ignore group characteristics. BLUP optimally balances this bias-variance tradeoff.
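Plugging assumed variance components into the shrinkage coefficient $n_j \sigma_b^2 / (n_j \sigma_b^2 + \sigma_e^2)$ shows how it grows with group size (illustrative values only):

```python
sigma_b2, sigma_e2 = 2.0, 4.0   # assumed variance components
r_bar = 1.0                     # a group's mean residual, r_bar_j

for n_j in (1, 5, 25, 100):
    lam = n_j * sigma_b2 / (n_j * sigma_b2 + sigma_e2)   # shrinkage coefficient
    print(f"n_j={n_j:3d}  coefficient={lam:.3f}  BLUP={lam * r_bar:.3f}")
```

A group of size 1 keeps only a third of its residual here, while a group of size 100 keeps nearly all of it, which is the "borrowing strength" behavior in numbers.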
ICC (Intraclass Correlation Coefficient)
ICC measures the proportion of total variance attributable to between-group differences:

$$\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}$$
ICC = 0 means no group differences — all variability is at the individual level. ICC = 1 means no within-group differences — all variability is between groups.
When ICC is small, ignoring group structure and using GLM produces nearly identical results. However, the impact depends not only on ICC but also on group size. The design effect $\mathrm{DEFF} = 1 + (m - 1)\,\mathrm{ICC}$ (where $m$ is the average group size) provides a rough measure of how much sampling variance is inflated; its square root is the corresponding standard error inflation. When ICC is large, a mixed model is necessary.
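A quick calculation with assumed values, using the common design-effect formula $1 + (m - 1)\,\mathrm{ICC}$, shows how a modest ICC still matters when groups are large:

```python
import math

sigma_b2, sigma_e2 = 1.0, 4.0            # assumed variance components
icc = sigma_b2 / (sigma_b2 + sigma_e2)   # intraclass correlation = 0.2
m = 30                                    # assumed average group size
deff = 1 + (m - 1) * icc                  # design effect (variance inflation)
se_inflation = math.sqrt(deff)            # naive SEs are too small by this factor
print(f"ICC={icc:.2f}  DEFF={deff:.1f}  SE inflation={se_inflation:.2f}")
```

Even with ICC = 0.2, groups of 30 inflate the variance roughly sevenfold, so independence-based standard errors would be about 2.6 times too small.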
Non-Gaussian Families
For non-Gaussian families, $\sigma_e^2$ is not directly defined, so a latent-variable variance is used instead. For example, with Binomial + Logit link, the latent continuous variable follows a logistic distribution with variance $\pi^2/3 \approx 3.29$.
| Family + Link | Latent-Scale Residual Variance |
|---|---|
| Binomial + Logit | $\pi^2/3 \approx 3.29$ |
| Poisson + Log | $\ln(1 + 1/\hat{\lambda})$, with $\hat{\lambda}$ the fitted mean count (approximation from the log-normal-Poisson mixture model) |
| Gamma + Log | $\hat{\phi}$ (estimated dispersion parameter) |
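For example, the latent-scale ICC under Binomial + Logit substitutes $\pi^2/3$ for the residual variance (assuming $\sigma_b^2 = 1$ for illustration):

```python
import math

sigma_b2 = 1.0                 # assumed random-intercept variance
latent_resid = math.pi**2 / 3  # logistic latent residual variance
icc_latent = sigma_b2 / (sigma_b2 + latent_resid)
print(f"latent residual variance={latent_resid:.3f}  latent-scale ICC={icc_latent:.3f}")
```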
MIDAS Implementation
MIDAS uses different algorithms depending on the distribution family.
Gaussian (LMM): Profile REML. The variance component ratio $\theta = \sigma_b / \sigma_e$ is optimized via Nelder-Mead, with $\boldsymbol{\beta}$ and $\sigma_e^2$ computed in closed form for each $\theta$.

Non-Gaussian (Binomial, Poisson, Gamma, etc.): Laplace approximation following Bates et al. (2015). An outer Nelder-Mead loop optimizes the relative covariance parameter $\theta$, while an inner PIRLS loop simultaneously estimates $(\boldsymbol{\beta}, \mathbf{u})$. The dispersion parameter $\phi$ is fixed at 1 for Poisson/Binomial and estimated from the profiled deviance for Gamma.
Fixed-effect initial values come from a GLM fit (without random effects). Predictors are internally scaled (standardized) and back-transformed after estimation.
Random Slopes and Crossed Random Effects
The random intercept model only captures group-level differences in baseline. When the effect of a predictor also varies by group, a random slope model is needed:

$$g(\mu_{ij}) = \mathbf{x}_{ij}^\top \boldsymbol{\beta} + b_{0j} + b_{1j} x_{ij}$$

where $b_{0j}$ is the random intercept and $b_{1j}$ is the random slope, jointly following a multivariate normal distribution.
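A small simulation (assumed fixed effects and intercept/slope covariance matrix; purely illustrative) shows what a random slope implies: per-group regression slopes scatter around the fixed slope with spread governed by the random-slope variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n_groups, m = 100, 40
beta0, beta1 = 1.0, 0.5                  # assumed fixed effects
cov = np.array([[0.5, 0.1],              # intercept variance, covariance
                [0.1, 0.3]])             # covariance, slope variance
u = rng.multivariate_normal([0.0, 0.0], cov, n_groups)   # (b0_j, b1_j) pairs

x = rng.normal(0.0, 1.0, (n_groups, m))
noise = rng.normal(0.0, 1.0, (n_groups, m))
y = (beta0 + u[:, :1]) + (beta1 + u[:, 1:]) * x + noise

# Per-group OLS slopes scatter around beta1; their variance reflects the
# random-slope variance (0.3) plus per-group estimation noise (~1/m here)
slopes = np.array([np.polyfit(x[j], y[j], 1)[0] for j in range(n_groups)])
print(f"mean slope={slopes.mean():.3f}  slope variance={slopes.var(ddof=1):.3f}")
```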
When observations belong to multiple grouping variables (e.g., students belong to both a school and a region), crossed random effects can be used.
MIDAS currently supports random intercept models only.
See also
- Generalized Linear Mixed Model (GLMM) - How to run GLMM analysis and interpret results
- GLM Fundamentals - GLM theory underlying GLMM
- OLS Fundamentals - Mathematical background of linear models
- Glossary - Statistical term definitions
References
- Henderson, C. R. (1950). Estimation of genetic parameters (Abstract). Annals of Mathematical Statistics, 21, 309-310.
- Gelman, A. (2005). Analysis of variance—why it is more important than ever. The Annals of Statistics, 33(1), 1-53.
- Hodges, J. S. (2014). Richly Parameterized Linear Models: Additive, Time Series, and Spatial Models Using Random Effects. Chapman & Hall/CRC.
- Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48.