GLMM Fundamentals

This page covers the statistical theory behind the GLMM tab. See that page for usage instructions.

Hierarchical Data and the Independence Problem

Most statistical models assume observations are independent. In practice, data often has a hierarchical structure where observations are grouped:

  • Students nested within schools
  • Patients nested within hospitals
  • Repeated measurements within subjects

Observations in the same group share group-level factors (school policies, hospital resources, individual physiology) and tend to be more similar to each other. Analyzing such data with GLM while ignoring group structure is a model misspecification: the data-generating process has hierarchical structure that the model fails to represent.

This misspecification has two aspects.

Wrong covariance structure: The model assumes zero covariance between observations, but observations within the same group are positively correlated. Ignoring this correlation causes both fixed-effect standard errors and prediction intervals to be underestimated. 95% confidence intervals do not actually achieve 95% coverage, and false positive rates in hypothesis tests exceed the nominal significance level. Additionally, between-group and within-group variability cannot be separated, preventing quantitative understanding of the hierarchical structure.

Possible omitted variable bias: When group-level confounders exist, fixed effect estimates become biased. For example, if school policy influences both study hours and test scores, a regression ignoring school differences cannot correctly estimate the effect of study hours. Random intercepts model mean-level differences between groups, but if confounders are correlated with predictors, this bias is not resolved. Measured confounders should be included as fixed effects.

Mixed models directly address the covariance structure problem by explicitly modeling within-group correlation as random effects. For omitted variable bias, the confounders must be included as fixed effects.

Fixed Effects and Random Effects

The "mixed" in mixed model refers to containing both fixed and random effects.

Fixed effects are the systematic relationships the researcher wants to estimate. For example, "how many points does test score increase per additional hour of study" is a fixed effect.

Random effects are a modeling device for representing unobserved group-level variability within the model. Each school's average score differs due to policies, student demographics, and other factors. This variability is not generated by a random mechanism — it stems from concrete causes — but measuring and including all those factors as predictors is rarely practical. Random effects approximate this "unobserved group difference" using a probability distribution.

The assumption $u_j \sim N(0, \sigma_u^2)$ is not a claim that school differences are generated from a normal distribution. It is a modeling assumption that enables estimation of the overall magnitude of group variability $\sigma_u^2$ and produces shrinkage estimates (see below).

Under this interpretation, the fixed/random distinction is not about whether groups were "randomly sampled" but about how to model group differences — a design choice (Gelman, 2005). When the overall pattern of group variability is of interest, random effects are appropriate. When the effect of a specific group is the target, fixed effects are more suitable.

Random Intercept Model

The simplest mixed model is the random intercept model, which gives each group its own baseline:

$$g(\mu_i) = x_i'\beta + u_{j[i]}, \quad u_j \sim N(0, \sigma_u^2)$$

where $g$ is the link function, $x_i'\beta$ is the fixed-effect linear predictor, and $u_{j[i]}$ is the random intercept for the group $j[i]$ to which observation $i$ belongs.

Intuitively, there is a single regression line shared by all groups ($x_i'\beta$), and each group's line is shifted up or down by $u_j$. The $N(0, \sigma_u^2)$ assumption is a tool for summarizing the magnitude of these shifts with a single parameter $\sigma_u^2$. The actual value of each $u_j$ is inferred from the data (see Estimation and Prediction).

For the Gaussian family, this is a linear mixed model (LMM):

$$y_i = x_i'\beta + u_{j[i]} + \varepsilon_i, \quad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_i \sim N(0, \sigma_e^2)$$

$\sigma_u^2$ is the between-group variance; $\sigma_e^2$ is the within-group (individual-level) variance.
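As a concrete illustration, data can be simulated from this model. All parameter values below are hypothetical, chosen only to make the structure visible:

```python
import numpy as np

rng = np.random.default_rng(42)
n_groups, n_per = 30, 10                 # 30 groups, 10 observations each
beta0, beta1 = 50.0, 3.0                 # hypothetical fixed intercept and slope
sigma_u, sigma_e = 2.0, 1.0              # between-group and within-group SDs

group = np.repeat(np.arange(n_groups), n_per)
u = rng.normal(0.0, sigma_u, n_groups)   # one random intercept per group
x = rng.uniform(0.0, 5.0, n_groups * n_per)

# y_i = x_i' beta + u_{j[i]} + eps_i
y = beta0 + beta1 * x + u[group] + rng.normal(0.0, sigma_e, n_groups * n_per)
```

Every observation in group $j$ shares the same draw $u_j$, which is exactly what induces the within-group correlation discussed above.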

Parameter Estimation

Estimation in mixed models is more complex than in standard regression because fixed effects $\beta$ and variance components ($\sigma_u^2$, $\sigma_e^2$) must be estimated simultaneously.

REML (Restricted Maximum Likelihood)

Maximum likelihood (ML) tends to underestimate variance components because the likelihood does not account for degrees of freedom consumed by estimating $\beta$ (the same reason sample variance uses $n-1$ rather than $n$ as divisor).

REML corrects this bias by maximizing the likelihood of residuals after projecting out $\beta$.
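The divisor analogy can be checked numerically: averaging the ML-style (divide by $n$) and REML-style (divide by $n-1$) variance estimates over many simulated small samples exposes the ML bias directly. This is a sketch of the analogy, not MIDAS code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# 100,000 small samples from N(0, 1); the true variance is 1
samples = rng.normal(0.0, 1.0, size=(100_000, n))

ml_mean = samples.var(axis=1, ddof=0).mean()    # divide by n   (ML analogue)
reml_mean = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (REML analogue)

print(ml_mean, reml_mean)  # ml_mean is close to (n-1)/n = 0.8; reml_mean to 1.0
```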

AIC/BIC Limitations

Since the REML projection depends on the fixed-effect structure, REML-based AIC/BIC cannot be compared across models with different fixed effects. To compare models with different fixed effects, refit using ML and compare ML-based AIC/BIC.

Laplace Approximation and PIRLS

For Gaussian families, the random-effect integral is analytically tractable, so REML suffices. For non-Gaussian families (Binomial, Poisson, Gamma, etc.), integrating out the random effects:

$$L(\beta, \sigma_u^2) = \int \prod_i f(y_i \mid \beta, u) \, \phi(u; 0, \sigma_u^2) \, du$$

has no closed-form solution ($\phi$ is the normal density).

The Laplace approximation replaces this integral with a second-order expansion around the mode of the integrand. MIDAS implements the algorithm described in Bates et al. (2015). The Laplace approximation may lose accuracy when group sizes are small or when events are rare in Binomial data.

  • Outer loop: Optimizes the relative covariance parameter $\theta$ via Nelder-Mead
  • Inner loop: PIRLS (Penalized IRLS) simultaneously estimates $(\beta, u)$
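The idea behind the Laplace approximation, matching a Gaussian to the integrand at its mode, can be checked on a one-dimensional toy problem: the marginal likelihood contribution of a single Poisson group. All numbers are hypothetical and this is a sketch, not the MIDAS implementation:

```python
import numpy as np
from scipy import integrate, optimize, stats

# One group of Poisson counts sharing a random intercept u (hypothetical data)
y = np.array([3, 5, 4, 6, 2])
beta, sigma_u = 1.0, 0.5                 # hypothetical fixed effect and random-effect SD

def log_integrand(u):
    lam = np.exp(beta + u)               # Poisson mean for this group
    return stats.poisson.logpmf(y, lam).sum() + stats.norm.logpdf(u, 0.0, sigma_u)

# "Exact" marginal likelihood by numerical quadrature
exact, _ = integrate.quad(lambda u: np.exp(log_integrand(u)), -5.0, 5.0)

# Laplace: locate the mode of the integrand, then use the curvature there
opt = optimize.minimize_scalar(lambda u: -log_integrand(u),
                               bounds=(-5.0, 5.0), method="bounded")
u_hat = opt.x
curv = -y.size * np.exp(beta + u_hat) - 1.0 / sigma_u**2   # d^2/du^2 of log integrand
laplace = np.exp(log_integrand(u_hat)) * np.sqrt(2.0 * np.pi / -curv)

print(exact, laplace)  # the two values agree closely
```

With many groups and correlated random effects the integral is high-dimensional, which is why the mode must itself be found by an inner optimization (PIRLS) at every outer step.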

PIRLS extends GLM's IRLS with a random-effect penalty. The penalized deviance (following Bates et al. (2015) notation, where $\theta$ is the relative covariance parameter: $\theta = \sigma_u/\sigma_e$ for Gaussian, $\theta = \sigma_u$ for Poisson/Binomial, $\theta = \sigma_u/\sqrt{\phi}$ for Gamma):

$$D_p = \sum_i d(y_i, \mu_i) + \sum_j \frac{u_j^2}{\theta^2}$$

The first term measures fit to data; the second penalizes $u_j$ values that deviate too far from the normal prior. This penalty pulls estimates for data-sparse groups toward the overall mean (see shrinkage below).
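For the Gaussian family, where the unit deviance is the squared residual, the penalized deviance and its shrinking effect can be sketched directly. The data, slope, and $\theta$ below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

y = np.array([1.2, 0.9, 1.5, 2.8, 3.1])
x = np.array([1.0, 0.5, 1.3, 2.5, 2.9])
group = np.array([0, 0, 0, 1, 1])
beta, theta = 1.0, 0.8                   # fixed slope and relative covariance parameter

def penalized_deviance(u):
    mu = x * beta + u[group]                 # linear predictor with random intercepts
    fit = np.sum((y - mu) ** 2)              # Gaussian deviance: squared residuals
    penalty = np.sum(u ** 2) / theta ** 2    # normal-prior penalty on u
    return fit + penalty

u_pen = minimize(penalized_deviance, np.zeros(2)).x

# Without the penalty, each u_j would simply be the group's mean residual
resid = y - x * beta
u_raw = np.array([resid[group == j].mean() for j in range(2)])
print(u_pen, u_raw)  # the penalized estimates are pulled toward zero
```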

Estimation and Prediction

Fixed effects $\beta$ are unknown constants, so they are "estimated." Random effects $u_j$ are treated as random variables within the model, so formally they are "predicted."

The term "prediction" for uju_j arises from the mathematical structure of the model. Because a probability distribution is assumed for uju_j, inferring its value combines two sources of information: the observed data and the assumed distribution (N(0,σu2)N(0, \sigma_u^2)). For β\beta, only the finite data contributes uncertainty. For uju_j, the model-assumed distribution also acts as an information source. This combination of two sources produces the shrinkage behavior described below.

The "prediction" terminology does not presuppose that uju_j was truly generated randomly. It is a formal distinction that accompanies the modeling decision to represent group differences with a probability distribution.

BLUP (Best Linear Unbiased Predictor) is commonly used for predicting random effects. BLUP is the linear predictor that minimizes the mean squared prediction error $E[(\hat{u}_j - u_j)^2]$ subject to the constraint that estimable linear functions of the fixed effects are unbiased. It derives from Henderson's mixed model equations.

For Gaussian, the BLUP of $u_j$ takes the form of a shrinkage estimator. For non-Gaussian families, random effects are estimated as conditional modes (posterior modes). These are obtained as the PIRLS solution and exhibit shrinkage toward the overall mean similarly to Gaussian BLUP, but are not strictly BLUPs.

The Gaussian BLUP formula:

$$\hat{u}_j = \frac{n_j}{n_j + \sigma_e^2 / \sigma_u^2} \times \bar{r}_j$$

where $n_j$ is the group size and $\bar{r}_j$ is the mean residual for group $j$ (the part not explained by fixed effects).

The coefficient $n_j / (n_j + \sigma_e^2/\sigma_u^2)$ ranges from 0 to 1, approaching 1 as group size increases:

  • Large groups: Enough data to trust the group-specific estimate, so use it nearly as-is
  • Small groups: Limited data, so pull the estimate toward the overall mean (zero)

This is also called "borrowing strength." Data-sparse groups borrow information from other groups to stabilize their estimates. Using group-specific estimates directly would have high variance; using only the overall mean would ignore group characteristics. BLUP optimally balances this bias-variance tradeoff.
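The shrinkage factor can be tabulated for a few group sizes; the variance components and mean residual below are hypothetical:

```python
import numpy as np

sigma_u2, sigma_e2 = 4.0, 16.0          # hypothetical variance components
ratio = sigma_e2 / sigma_u2             # sigma_e^2 / sigma_u^2 = 4.0

n_j = np.array([2, 10, 100])            # group sizes
shrink = n_j / (n_j + ratio)            # coefficient in the BLUP formula
u_hat = shrink * 5.0                    # BLUP when the mean residual is 5.0

for n, s, u in zip(n_j, shrink, u_hat):
    print(f"n_j={n:3d}  shrinkage={s:.3f}  u_hat={u:.3f}")
```

A group of 2 keeps only a third of its raw residual, while a group of 100 keeps nearly all of it.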

ICC (Intraclass Correlation Coefficient)

ICC measures the proportion of total variance attributable to between-group differences:

$$\text{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}$$

ICC = 0 means no group differences — all variability is at the individual level. ICC = 1 means no within-group differences — all variability is between groups.

When ICC is small, ignoring group structure and using GLM produces nearly identical results. However, the impact depends not only on ICC but also on group size. The design effect $\text{DEFF} = 1 + (\bar{n} - 1) \times \text{ICC}$ (where $\bar{n}$ is the average group size) provides a rough measure of variance inflation. Standard errors are inflated by a factor of $\sqrt{\text{DEFF}}$. When ICC is large, a mixed model is necessary.
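A quick calculation with hypothetical variance components shows how ICC and group size combine:

```python
sigma_u2, sigma_e2 = 2.0, 8.0                 # hypothetical variance components
icc = sigma_u2 / (sigma_u2 + sigma_e2)        # 0.2

n_bar = 20                                     # average group size
deff = 1 + (n_bar - 1) * icc                   # design effect
se_factor = deff ** 0.5                        # SE inflation

print(icc, round(deff, 3), round(se_factor, 2))  # → 0.2 4.8 2.19
```

Even a modest ICC of 0.2 more than doubles the standard errors once groups average 20 observations.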

Non-Gaussian Families

For non-Gaussian families, $\sigma_e^2$ is not directly defined, so a latent-variable variance is used instead. For Binomial + Logit link, the observed binary response is assumed to have a continuous latent variable underlying it. With the Logit link, this latent variable follows a logistic distribution, and its variance $\pi^2/3$ serves as the residual variance.

| Family + Link | Latent-Scale Residual Variance |
| --- | --- |
| Binomial + Logit | $\pi^2/3 \approx 3.29$ |
| Poisson + Log | $1$ (approximation from the log-normal-Poisson mixture model; accuracy decreases when $\mu$ is small) |
| Gamma + Log | $\phi$ (estimated dispersion parameter) |
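These latent-scale variances plug into the ICC formula; for example, a Binomial + Logit model with a hypothetical random-intercept variance of 1.0:

```python
import math

sigma_u2 = 1.0                          # hypothetical random-intercept variance
latent_resid = math.pi ** 2 / 3         # logistic latent-variable variance, about 3.29
icc_latent = sigma_u2 / (sigma_u2 + latent_resid)

print(round(icc_latent, 3))  # → 0.233
```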

MIDAS Implementation

MIDAS uses different algorithms depending on the distribution family.

Gaussian (LMM): Profile REML. The variance component ratio $\theta^2 = \sigma_u^2/\sigma_e^2$ is optimized via Nelder-Mead, with $\beta$ and $\sigma_e^2$ computed in closed form for each $\theta$.

Non-Gaussian (Binomial, Poisson, Gamma, etc.): Laplace approximation following Bates et al. (2015). An outer Nelder-Mead loop optimizes the relative covariance parameter $\theta$, while an inner PIRLS loop simultaneously estimates $(\beta, u)$. The dispersion parameter $\phi$ is fixed at 1 for Poisson/Binomial and estimated from profiled deviance ($D_p/n$) for Gamma.

Fixed-effect initial values come from a GLM fit (without random effects). Predictors are internally scaled (standardized) and back-transformed after estimation.

Fixed Effect Inference and the Normal Approximation

Fixed effects are tested using Wald tests. The test statistic is $z = \hat\beta_j / \text{SE}(\hat\beta_j)$, and p-values and confidence intervals are computed from the standard normal distribution.

In ordinary regression (OLS), the error variance $\sigma^2$ is estimated, so $\hat\beta_j / \text{SE}(\hat\beta_j)$ follows a t-distribution with $n - p$ degrees of freedom. In mixed models, there are multiple variance components ($\sigma_u^2$, $\sigma_e^2$), and the corresponding degrees of freedom are not uniquely determined. The normal approximation treats the estimated variance components as known.

In LMM, the normality assumptions on $u_j$ and $\varepsilon_i$ ensure that $\hat\beta_j / \text{SE}(\hat\beta_j)$ follows the standard normal exactly when the variance components are known. In non-Gaussian GLMM, the normal distribution is an asymptotic approximation. In both cases, replacing the true variance components with estimates introduces additional uncertainty, making the distribution of $\hat\beta_j / \text{SE}(\hat\beta_j)$ heavier-tailed than the standard normal.

When the number of groups is large, the variance component estimates are stable and this approximation is adequate. With few groups, confidence intervals tend to be too narrow, and the null hypothesis is rejected more often than the nominal significance level.
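The Wald computation itself is straightforward; the estimate and standard error below are hypothetical:

```python
from scipy.stats import norm

beta_hat, se = 0.42, 0.15               # hypothetical fixed-effect estimate and SE

z = beta_hat / se                        # Wald statistic
p = 2 * norm.sf(abs(z))                  # two-sided p-value from the standard normal
crit = norm.ppf(0.975)                   # about 1.96
ci = (beta_hat - crit * se, beta_hat + crit * se)

print(round(z, 2), round(p, 4), [round(v, 3) for v in ci])  # → 2.8 0.0051 [0.126, 0.714]
```

With few groups, replacing `norm` with a t-distribution of conservatively chosen degrees of freedom is a common workaround, but MIDAS reports the normal approximation as described above.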

Random Slopes and Crossed Random Effects

The random intercept model only captures group-level differences in baseline. When the effect of a predictor also varies by group, a random slope model is needed:

$$g(\mu_i) = x_i'\beta + u_{0j} + u_{1j} x_{1i}$$

where $u_{0j}$ is the random intercept and $u_{1j}$ is the random slope, jointly following a multivariate normal distribution.
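Simulating from a random slope model makes the joint distribution explicit; the covariance values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
n_groups, n_obs = 20, 200
beta = np.array([2.0, 1.5])                    # fixed intercept and slope

# Joint distribution of (u_0j, u_1j): correlated intercepts and slopes
cov = np.array([[1.0, 0.3],
                [0.3, 0.25]])
u = rng.multivariate_normal([0.0, 0.0], cov, size=n_groups)

x1 = rng.uniform(-1.0, 1.0, n_obs)
group = rng.integers(0, n_groups, n_obs)

# g(mu_i) = beta0 + beta1*x1 + u0_j + u1_j*x1
eta = beta[0] + beta[1] * x1 + u[group, 0] + u[group, 1] * x1
```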

When observations belong to multiple grouping variables (e.g., students belong to both a school and a region), crossed random effects can be used.

MIDAS currently supports random intercept models only.

See also

References

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48.

Gelman, A. (2005). Analysis of variance: why it is more important than ever. The Annals of Statistics, 33(1), 1-53.