---
title: GLMM Fundamentals
description: Hierarchical data and mixed models. Fixed vs random effects, random intercept models, REML and Laplace approximation, BLUP shrinkage estimation, and ICC.
priority: 0.5
---

# GLMM Fundamentals {#glmm-fundamentals}

This page covers the statistical theory behind the [GLMM](glmm) tab. See that page for usage instructions.

## Hierarchical Data and the Independence Problem {#hierarchical-data-and-the-independence-problem}

Most statistical models assume observations are independent. In practice, data often has a hierarchical structure where observations are grouped:

- Students nested within schools
- Patients nested within hospitals
- Repeated measurements within subjects

Observations in the same group share group-level factors (school policies, hospital resources, individual physiology) and tend to be more similar to each other. Analyzing such data with [GLM](concepts-glm) while ignoring group structure is a model misspecification: the data-generating process has hierarchical structure that the model fails to represent.

This misspecification has two aspects.

**Wrong covariance structure**: The model assumes zero covariance between observations, but observations within the same group are positively correlated. Ignoring this correlation causes both fixed-effect standard errors and prediction intervals to be underestimated. 95% confidence intervals do not actually achieve 95% coverage. Additionally, between-group and within-group variability cannot be separated, preventing quantitative understanding of the hierarchical structure.

**Possible omitted variable bias**: When group-level confounders exist, fixed effect estimates become biased. For example, if school policy influences both study hours and test scores, a regression ignoring school differences cannot correctly estimate the effect of study hours. Random intercepts model mean-level differences between groups, but if confounders are correlated with predictors, this bias is not resolved. Measured confounders should be included as fixed effects.

Mixed models directly address the covariance structure problem by explicitly modeling within-group correlation as random effects. For omitted variable bias, the confounders must be included as fixed effects.

## Fixed Effects and Random Effects {#fixed-effects-and-random-effects}

The "mixed" in mixed model refers to containing both fixed and random effects.

**Fixed effects** are the systematic relationships the researcher wants to estimate. For example, "how many points does test score increase per additional hour of study" is a fixed effect.

**Random effects** are a modeling device for representing unobserved group-level variability within the model. Each school's average score differs due to policies, student demographics, and other factors. This variability is not generated by a random mechanism — it stems from concrete causes — but measuring and including all those factors as predictors is rarely practical. Random effects approximate this "unobserved group difference" using a probability distribution.

The assumption $u_j \sim N(0, \sigma_u^2)$ is not a claim that school differences are generated from a normal distribution. It is a modeling assumption that enables estimation of the overall magnitude of group variability $\sigma_u^2$ and produces shrinkage estimates (see below).

Under this interpretation, the fixed/random distinction is not about whether groups were "randomly sampled" but about how to model group differences — a design choice ([Gelman, 2005](#ref-gelman-2005)). When the overall pattern of group variability is of interest, random effects are appropriate. When the effect of a specific group is the target, fixed effects are more suitable.

## Random Intercept Model {#random-intercept-model}

The simplest mixed model is the random intercept model, which gives each group its own baseline:

$$
g(\mu_i) = x_i'\beta + u_{j[i]}, \quad u_j \sim N(0, \sigma_u^2)
$$

where $g$ is the [link function](concepts-glm#link-functions), $x_i'\beta$ is the fixed-effect linear predictor, and $u_{j[i]}$ is the random intercept for group $j$.

Intuitively, there is a single regression line shared by all groups ($x_i'\beta$), and each group's line is shifted up or down by $u_j$. The normal distribution $N(0, \sigma_u^2)$ assumption is a tool for summarizing the magnitude of these shifts with a single parameter $\sigma_u^2$. The actual value of each $u_j$ is inferred from the data (see [Estimation and Prediction](#estimation-and-prediction)).

For the Gaussian family, this is a linear mixed model (LMM):

$$
y_i = x_i'\beta + u_{j[i]} + \varepsilon_i, \quad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_i \sim N(0, \sigma_e^2)
$$

$\sigma_u^2$ is the between-group variance; $\sigma_e^2$ is the within-group (individual-level) variance.

## Parameter Estimation {#parameter-estimation}

Estimation in mixed models is more complex than in standard regression because fixed effects $\beta$ and variance components ($\sigma_u^2$, $\sigma_e^2$) must be estimated simultaneously.

### REML (Restricted Maximum Likelihood) {#reml-restricted-maximum-likelihood}

Maximum likelihood (ML) tends to underestimate variance components because the likelihood does not account for degrees of freedom consumed by estimating $\beta$ (the same reason sample variance uses $n-1$ rather than $n$ as divisor).

REML corrects this bias by maximizing the likelihood of residuals after projecting out $\beta$.

### AIC/BIC Limitations {#aicbic-limitations}

Since the REML projection depends on the fixed-effect structure, REML-based AIC/BIC cannot be compared across models with different fixed effects. MIDAS GLMM displays AIC/BIC, but this constraint applies. For Laplace-approximated log-likelihoods, the approximation error propagates into AIC/BIC, so comparisons should be limited to models within the same family and link.

### Laplace Approximation and PIRLS {#laplace-approximation-and-pirls}

For the Gaussian family with the identity link, the random-effect integral is analytically tractable, so REML suffices. For all other combinations (Binomial, Poisson, Gamma, and Gaussian + log), integrating out the random effects:

$$
L(\beta, \sigma_u^2) = \int \prod_i f(y_i \mid \beta, u) \cdot \phi(u; 0, \sigma_u^2) \, du
$$

has no closed-form solution ($\phi$ is the normal density).

The Laplace approximation replaces this integral with a second-order expansion around the mode of the integrand. MIDAS implements the algorithm described in [Bates et al. (2015)](#ref-bates-2015). The Laplace approximation may lose accuracy when group sizes are small or when events are rare in Binomial data.

- **Outer loop**: Optimizes the relative covariance parameter $\theta$ via Nelder-Mead
- **Inner loop**: PIRLS (Penalized IRLS) simultaneously estimates $(\beta, u)$

PIRLS extends [GLM's IRLS](concepts-glm#parameter-estimation-irls) with a random-effect penalty. The penalized deviance (following [Bates et al., 2015](#ref-bates-2015) notation, where $\theta$ is the relative covariance parameter: $\theta = \sigma_u/\sigma_e$ for Gaussian, $\theta = \sigma_u$ for Poisson/Binomial, $\theta = \sigma_u/\sqrt\phi$ for Gamma):

$$
D_p = \sum_i d(y_i, \mu_i) + \sum_j \frac{u_j^2}{\theta^2}
$$

The first term measures fit to data; the second penalizes $u_j$ values that deviate too far from the normal prior. This penalty pulls estimates for data-sparse groups toward the overall mean (see shrinkage below).

## Estimation and Prediction {#estimation-and-prediction}

Fixed effects $\beta$ are unknown constants, so they are "estimated." Random effects $u_j$ are treated as random variables within the model, so formally they are "predicted."

The term "prediction" for $u_j$ arises from the mathematical structure of the model. Because a probability distribution is assumed for $u_j$, inferring its value combines two sources of information: the observed data and the assumed distribution ($N(0, \sigma_u^2)$). For $\beta$, only the finite data contributes uncertainty. For $u_j$, the model-assumed distribution also acts as an information source. This combination of two sources produces the shrinkage behavior described below.

The "prediction" terminology does not presuppose that $u_j$ was truly generated randomly. It is a formal distinction that accompanies the modeling decision to represent group differences with a probability distribution.

BLUP (Best Linear Unbiased Predictor) is commonly used for predicting random effects. BLUP is the linear predictor that minimizes the mean squared prediction error $E[(\hat u_j - u_j)^2]$ subject to the constraint that estimable linear functions of the fixed effects are unbiased. It derives from Henderson's mixed model equations.

For Gaussian, the BLUP of $u_j$ takes the form of a shrinkage estimator. For non-Gaussian families, random effects are estimated as conditional modes (posterior modes). These are obtained as the PIRLS solution and exhibit shrinkage toward the overall mean similarly to Gaussian BLUP, but are not strictly BLUPs.

The Gaussian BLUP formula:

$$
\hat{u}_j = \frac{n_j}{n_j + \sigma_e^2 / \sigma_u^2} \times \bar{r}_j
$$

where $n_j$ is the group size and $\bar{r}_j$ is the mean residual for group $j$ (the part not explained by fixed effects).

The coefficient $n_j / (n_j + \sigma_e^2/\sigma_u^2)$ ranges from 0 to 1, approaching 1 as group size increases:

- **Large groups**: Enough data to trust the group-specific estimate, so use it nearly as-is
- **Small groups**: Limited data, so pull the estimate toward the overall mean (zero)

This is also called "borrowing strength." Data-sparse groups borrow information from other groups to stabilize their estimates. Using group-specific estimates directly would have high variance; using only the overall mean would ignore group characteristics. BLUP optimally balances this bias-variance tradeoff.

## ICC (Intraclass Correlation Coefficient) {#icc-intraclass-correlation-coefficient}

ICC measures the share of unexplained variance — the variance remaining after the fixed effects are accounted for — attributable to between-group differences:

$$
\text{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
$$

In a model with fixed effects $X\beta$, both the numerator and denominator are residual-side quantities after removing the variance explained by the fixed effects, so this is a conditional ICC.

ICC = 0 means no group differences — all variability is at the individual level. ICC = 1 means no within-group differences — all variability is between groups.

When ICC is small, ignoring group structure and using GLM produces nearly identical results. However, the impact depends not only on ICC but also on group size. The design effect $\text{DEFF} = 1 + (\bar n - 1) \times \text{ICC}$ (where $\bar n$ is the average group size) provides a rough measure of variance inflation. This formula is exact when group sizes are equal and approximate when they are unequal. For group means or predictors that vary at the group level, the standard error is inflated by roughly a factor of $\sqrt{\text{DEFF}}$; predictors that vary within groups are affected to a different degree. When ICC is large, a mixed model is necessary.

### Non-Gaussian Families {#non-gaussian-families}

For non-Gaussian families, $\sigma_e^2$ is not directly estimated from data. ICC as a variance decomposition requires that $\sigma_e^2$ is a theoretically derived quantity, not a convention or an approximation.

Binomial models with logit or probit links have a threshold model ([Goldstein et al. 2002](#ref-goldstein-2002)): an unobserved continuous variable $y^* = X\beta + u + e$ generates the binary response as $y = \mathbf{1}(y^* > 0)$. The distribution of $e$ — and hence its variance — follows deductively from the choice of link function. With the logit link, $e$ follows a logistic distribution with variance $\pi^2/3$. With the probit link, $e$ follows $N(0, 1)$ with variance $1$. In this framework, $\sigma_u^2 + \sigma_e^2$ represents the total variance of the latent variable, and ICC is a genuine variance decomposition.

| Family + Link | Residual Variance | Basis |
|--------------|-------------------|-------|
| Binomial + Logit | $\pi^2/3 \approx 3.29$ | Logistic distribution (threshold model) |
| Binomial + Probit | $1$ | Standard normal distribution (threshold model) |

The Binomial ICC is a value on the latent scale and does not equal the correlation of the observed binary responses. Keep this distinction in mind when using it for sample size planning.

MIDAS displays ICC only for Gaussian + identity (where $\sigma_e^2$ is estimated directly via REML) and the two Binomial combinations above. For other family+link combinations, MIDAS does not display ICC:

- **Poisson (all links)**: Poisson distributions have $\operatorname{Var}(Y \mid \mu) = \mu$, so the variance is fully determined by the mean and there is no independent residual variance parameter. The value $1$ sometimes used in the ICC denominator (e.g. in lme4) is a convention — $\sigma_u^2 / (\sigma_u^2 + 1)$ is a monotone transformation of $\sigma_u^2$, not a variance decomposition.
- **Gamma (all links)**: The profiled dispersion $\phi$ is a conditional variance parameter, not a latent-scale residual variance.
- **Gaussian + log**: $\sigma_u^2$ is on the link scale while $\sigma_e^2$ is estimated on the response scale. There is no theoretical residual variance on the link scale to reconcile the two.

## MIDAS Implementation {#midas-implementation}

MIDAS uses different algorithms depending on the combination of distribution family and link function.

**Gaussian + identity (LMM)**: Profile REML. The variance component ratio $\theta^2 = \sigma_u^2/\sigma_e^2$ is optimized via Nelder-Mead, with $\beta$ and $\sigma_e^2$ computed in closed form for each $\theta$.

**All other combinations (Binomial, Poisson, Gamma, and Gaussian + log)**: Laplace approximation following [Bates et al. (2015)](#ref-bates-2015). An outer Nelder-Mead loop optimizes the relative covariance parameter $\theta$, while an inner PIRLS loop simultaneously estimates $(\beta, u)$. The dispersion parameter $\phi$ is fixed at 1 for Poisson/Binomial. For Gamma and Gaussian + log, likelihood evaluations during optimization use $\phi = \sum_i d(y_i, \mu_i)/n$ computed from the conditional deviance, and the reported $\phi$ is estimated as Pearson $\chi^2/(n-p)$.

In the Laplace path, fixed-effect initial values come from a GLM fit (without random effects). Predictors are internally scaled (standardized) and back-transformed after estimation.

### Fixed Effect Inference and the Normal Approximation {#fixed-effect-inference}

Fixed-effect confidence intervals are Wald-based, constructed as $\hat\beta_j \pm z_{1-\alpha/2} \cdot \text{SE}(\hat\beta_j)$. This construction relies on $(\hat\beta_j - \beta_j) / \text{SE}(\hat\beta_j)$ approximately following the standard normal distribution.

In ordinary regression (OLS), the error variance $\sigma^2$ is estimated, so $(\hat\beta_j - \beta_j) / \text{SE}(\hat\beta_j)$ follows a t-distribution with $n - p$ degrees of freedom. In mixed models, there are multiple variance components ($\sigma_u^2$, $\sigma_e^2$), and the corresponding degrees of freedom are not uniquely determined. The normal approximation treats the estimated variance components as known.

In LMM, the normality assumptions on $u_j$ and $\varepsilon_i$ ensure that $(\hat\beta_j - \beta_j) / \text{SE}(\hat\beta_j)$ follows the standard normal exactly when the variance components are known. In non-Gaussian GLMM, the normal distribution is an asymptotic approximation. In both cases, replacing the true variance components with estimates introduces additional uncertainty, making the distribution of $(\hat\beta_j - \beta_j) / \text{SE}(\hat\beta_j)$ heavier-tailed than the standard normal.

When the number of groups is large, the variance component estimates are stable and this approximation is adequate. With few groups, confidence intervals tend to be too narrow.

## Random Slopes and Crossed Random Effects {#random-slopes-and-crossed-random-effects}

The random intercept model only captures group-level differences in baseline. When the effect of a predictor also varies by group, a random slope model is needed:

$$
g(\mu_i) = x_i'\beta + u_{0j} + u_{1j} x_{1i}
$$

where $u_{0j}$ is the random intercept and $u_{1j}$ is the random slope, jointly following a multivariate normal distribution.

When observations belong to multiple grouping variables (e.g., students belong to both a school and a region), crossed random effects can be used.

MIDAS currently supports random intercept models only.

## See also {#see-also}

- **[Generalized Linear Mixed Model (GLMM)](glmm)** - How to run GLMM analysis and interpret results
- **[GLM Fundamentals](concepts-glm)** - GLM theory underlying GLMM
- **[OLS Fundamentals](concepts-regression)** - Mathematical background of linear models
- **[Glossary](glossary)** - Statistical term definitions

## References {#references}

- <span id="ref-gelman-2005">Gelman, A. (2005). Analysis of variance—why it is more important than ever. *The Annals of Statistics*, 33(1), 1-53. https://www.jstor.org/stable/3448650</span>
- <span id="ref-bates-2015">Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. *Journal of Statistical Software*, 67(1), 1-48. https://www.jstatsoft.org/v67/i01/</span>
- <span id="ref-goldstein-2002">Goldstein, H., Browne, W., & Rasbash, J. (2002). Partitioning variation in multilevel models. *Understanding Statistics*, 1(4), 223-231. https://doi.org/10.1207/S15328031US0104_02</span>