---
title: GLM Fundamentals
description: Mathematical foundations of the Generalized Linear Model. Covers exponential family distributions, link functions, canonical link properties, IRLS algorithm, variance functions, and overdispersion.
priority: 0.5
---

# GLM Fundamentals {#glm-fundamentals}

This page covers the statistical theory behind the [GLM](glm) tab. See that page for usage instructions.

## Model Formulation {#model-formulation}

GLM generalizes the normal linear model to the exponential family of distributions (defined in the next section), as introduced by [Nelder & Wedderburn (1972)](#ref-nelder-wedderburn-1972). A GLM is defined by three components:

1. **Distribution family**: The response variable $Y$ follows a distribution in the exponential family
2. **Linear predictor**: $\eta = X\beta$ (a linear combination of explanatory variables)
3. **Link function**: A monotonic function $g$ such that $\eta = g(\mu)$, connecting the linear predictor to the mean $\mu = E[Y]$

OLS is a special case of GLM (Gaussian family with identity link). In this case, IRLS converges in a single iteration to the normal equations solution. Because MIDAS estimates $\phi$ from the data, the Wald statistic follows $t_{n-p}$ exactly, so the Wald test is equivalent to the OLS $t$-test in finite samples. For other family/link combinations, the Wald test is only an asymptotic approximation.

## Exponential Family {#exponential-family}

By restricting distributions to the exponential-family form defined below, GLM expresses the mean-variance relationship uniformly through $b(\theta)$ and derives a common estimation algorithm (IRLS) that applies across families.

A family of distributions is called an exponential family if its density (or mass) function can be written as:

$$
f(y \mid \theta, \phi) = \exp\!\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}
$$

where $\theta$ is the natural (canonical) parameter, $\phi$ is the dispersion parameter, and $b(\theta)$ is the log-partition function. The mean and variance are derived from $b(\theta)$:

- $E[Y] = b'(\theta) = \mu$
- $\operatorname{Var}(Y) = b''(\theta) \cdot a(\phi)$

Rewriting $b''(\theta)$ as a function of $\mu$ rather than $\theta$ gives the variance function $V(\mu)$, so $\operatorname{Var}(Y) = V(\mu) \cdot a(\phi)$. For example, Poisson has $b(\theta) = e^\theta$, giving $b'(\theta) = e^\theta = \mu$ and $b''(\theta) = e^\theta = \mu$, hence $V(\mu) = \mu$.
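The two identities can be checked numerically. The following is a minimal sketch, not MIDAS code: it approximates $b'(\theta)$ and $b''(\theta)$ by central finite differences for the Poisson and Gamma log-partition functions. The helper name `derivatives`, the step size `h`, and the test value `mu = 3.0` are illustrative assumptions.

```python
import math

def derivatives(b, theta, h=1e-4):
    """Central finite-difference estimates of b'(theta) and b''(theta)."""
    first = (b(theta + h) - b(theta - h)) / (2 * h)
    second = (b(theta + h) - 2 * b(theta) + b(theta - h)) / (h * h)
    return first, second

mu = 3.0  # illustrative mean

# Poisson: theta = log(mu), b(theta) = exp(theta), so V(mu) should be mu
poisson_mean, poisson_var = derivatives(math.exp, math.log(mu))

# Gamma: theta = -1/mu, b(theta) = -log(-theta), so V(mu) should be mu^2
gamma_mean, gamma_var = derivatives(lambda t: -math.log(-t), -1.0 / mu)

print(poisson_mean, poisson_var)  # both close to mu
print(gamma_mean, gamma_var)      # close to mu and mu**2
```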

Exponential family parameters for each distribution family:

| Family | $\theta$ (natural parameter) | $a(\phi)$ | $b(\theta)$ | $c(y, \phi)$ |
|--------|------------------------------|-----------|-------------|---------------|
| Gaussian | $\mu$ | $\phi$ | $\theta^2/2$ | $-\dfrac{y^2}{2\phi} - \dfrac{\log(2\pi\phi)}{2}$ |
| Binomial | $\log\!\bigl(\mu/(1-\mu)\bigr)$ | $1/n_i$ | $\log(1+e^\theta)$ | $\log\binom{n_i}{k_i}$ |
| Poisson | $\log\mu$ | $1$ | $e^\theta$ | $-\log(y!)$ |
| Gamma | $-1/\mu$ | $\phi$ | $-\log(-\theta)$ | $(1/\phi - 1)\log y + (1/\phi)\log(1/\phi) - \log\Gamma(1/\phi)$ |
| Negative Binomial | $\log\!\bigl(\mu/(\mu+r)\bigr)$ | $1$ | $-r\log(1-e^\theta)$ | $\log\Gamma(y+r) - \log\Gamma(r) - \log(y!)$ |

- In the Binomial row, $y_i$ is the proportion of successes ($y_i = k_i/n_i$, $0 \le y_i \le 1$), $k_i$ is the number of successes at observation $i$, $n_i$ is the number of trials at observation $i$ (not the sample size $n$ but the per-observation trial count), and $\mu$ is the success probability. When $n_i=1$, it reduces to the Bernoulli distribution
- The Negative Binomial $r$ is displayed as $\theta$ in the MIDAS UI; on this page we use $r$ to avoid confusion with the exponential family natural parameter. The Negative Binomial belongs to the exponential family only when $r$ is known. In MIDAS's automatic estimation mode, $r$ is estimated by maximizing the profile likelihood $L_p(r) = \max_\beta L(\beta, r)$ (with $\beta$ profiled out) in an outer loop (see [GLM usage](glm#negative-binomial-settings)). The standard errors for $\hat\beta$ reported in this mode are computed from the information matrix with $r = \hat r$ treated as known, so uncertainty in $r$ is not reflected
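
As a sanity check on the table rows, the sketch below evaluates the exponential-family form with the tabulated $\theta$, $a(\phi)$, $b(\theta)$, and $c(y, \phi)$, and compares it with the textbook log-probability for Poisson, Binomial, and Negative Binomial. The function name `expfam_logpdf` and all numeric inputs are assumptions chosen for illustration.

```python
import math

def expfam_logpdf(y, theta, a_phi, b, c):
    """log f(y | theta, phi) = (y*theta - b(theta)) / a(phi) + c(y, phi)."""
    return (y * theta - b(theta)) / a_phi + c

# Poisson(mu): textbook log-pmf vs. the table row
mu, y = 2.5, 4
poisson_direct = y * math.log(mu) - mu - math.lgamma(y + 1)
poisson_table = expfam_logpdf(y, math.log(mu), 1.0, math.exp, -math.lgamma(y + 1))

# Binomial(n_i, p) with response y = k_i / n_i: textbook log-pmf vs. the table row
n_i, k_i, p = 10, 3, 0.4
log_choose = math.log(math.comb(n_i, k_i))
binom_direct = log_choose + k_i * math.log(p) + (n_i - k_i) * math.log(1 - p)
binom_table = expfam_logpdf(k_i / n_i, math.log(p / (1 - p)), 1.0 / n_i,
                            lambda t: math.log(1 + math.exp(t)), log_choose)

# Negative Binomial (mean mu_nb, known r): textbook log-pmf vs. the table row
r, mu_nb, y_nb = 2.0, 3.0, 5
c_nb = math.lgamma(y_nb + r) - math.lgamma(r) - math.lgamma(y_nb + 1)
nb_direct = (c_nb + r * math.log(r / (r + mu_nb))
             + y_nb * math.log(mu_nb / (mu_nb + r)))
nb_table = expfam_logpdf(y_nb, math.log(mu_nb / (mu_nb + r)), 1.0,
                         lambda t: -r * math.log(1 - math.exp(t)), c_nb)

print(poisson_direct - poisson_table, binom_direct - binom_table, nb_direct - nb_table)
```

Each printed difference should be zero up to floating-point rounding.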

## Link Functions {#link-functions}

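The link function is a monotonic function $\eta = g(\mu)$ connecting the linear predictor $\eta$ to the expected value $\mu$ of the response. A link function satisfying $g(\mu) = \theta$ (the natural parameter) is called the canonical link. For Gamma, $\theta = -1/\mu$, and the Inverse link $\eta = 1/\mu$ is conventionally treated as canonical because it differs from $\theta$ only in sign.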

| Link Function | Formula | Canonical Link For |
|--------------|---------|-------------------|
| Identity | $\eta = \mu$ | Gaussian |
| Logit | $\eta = \log\!\bigl(\mu / (1 - \mu)\bigr)$ | Binomial |
| Log | $\eta = \log(\mu)$ | Poisson |
| Inverse | $\eta = 1/\mu$ | Gamma |
| Probit | $\eta = \Phi^{-1}(\mu)$ | — |
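
The links in the table can be written as pairs of a function and its inverse. This is an illustrative sketch using only the Python standard library; the dictionary name `LINKS` is an assumption and does not correspond to any MIDAS identifier.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard normal, used for the probit link

# name -> (g(mu), g^{-1}(eta))
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1 / mu, lambda eta: 1 / eta),
    "probit": (_std_normal.inv_cdf, _std_normal.cdf),
}

mu = 0.3  # inside (0, 1), so every link in the table is defined here
round_trip = {name: inv(g(mu)) for name, (g, inv) in LINKS.items()}
print(round_trip)  # every value should recover mu up to rounding
```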

The canonical link has important properties: since $\eta = \theta$, $X'y$ becomes a [sufficient statistic](glossary#sufficient-statistic) for $\beta$, and the [log-likelihood](glossary#likelihood) is concave in $\beta$. When the design matrix $X$ has full rank and the MLE exists, the solution is unique and IRLS converges stably. However, complete separation (when a linear combination of predictors perfectly separates the response) leaves the log-likelihood concave and $X$ full-rank but without a finite maximum, so the MLE does not exist. Binary logistic regression is the canonical example, and similar cases arise in other discrete-response models such as multinomial logit (see the convergence issues section in [GLM usage](glm#convergence-issues)).
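
Complete separation can be seen on a toy example. The sketch below assumes a no-intercept logistic model $\eta_i = \beta x_i$ and hypothetical data in which $y = 1$ exactly when $x > 0$; the function name `bernoulli_loglik` is illustrative.

```python
import math

def bernoulli_loglik(beta, xs, ys):
    """Logistic log-likelihood for the no-intercept model eta_i = beta * x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta * x
        # log(1 + exp(eta)), computed in a numerically stable way
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += y * eta - log1pexp
    return total

# Completely separated toy data: y = 1 exactly when x > 0
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

logliks = [bernoulli_loglik(beta, xs, ys) for beta in (1.0, 5.0, 10.0, 20.0)]
print(logliks)
```

The log-likelihood keeps increasing toward its upper bound of 0 as $\beta$ grows, so no finite $\beta$ attains the maximum.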

Non-canonical links forfeit these properties but may be chosen for easier coefficient interpretation. For example, the canonical link for Gamma is Inverse ($\eta = 1/\mu$), which puts coefficients on a $1/\mu$ scale that is hard to interpret. The Log link, under which $\exp(\beta_j)$ is the multiplicative effect on $\mu$ of a one-unit increase in the $j$-th predictor, is more commonly used in practice.
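
To make the interpretation concrete, consider a one-unit increase in a single predictor $x_j$ with all other predictors held fixed (the subscript $j$ and the intercept $\beta_0$ are notation introduced here for illustration). Under the Log link the effect is a ratio of means, while under the Inverse link it is a difference of reciprocal means:

$$
\frac{\mu(x_j + 1)}{\mu(x_j)} = \frac{\exp\{\beta_0 + \dots + \beta_j (x_j + 1) + \dots\}}{\exp\{\beta_0 + \dots + \beta_j x_j + \dots\}} = e^{\beta_j},
\qquad
\frac{1}{\mu(x_j + 1)} - \frac{1}{\mu(x_j)} = \beta_j
$$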

## Parameter Estimation (IRLS) {#parameter-estimation-irls}

GLM parameters $\beta$ are estimated by [maximum likelihood](glossary#mle). Under regularity conditions (differentiability of the log-likelihood, the true parameter being an interior point of the parameter space, etc.), the estimator is [consistent, asymptotically normal, and asymptotically efficient](glossary#mle). In general no closed-form solution exists, so IRLS (Iteratively Reweighted Least Squares) is used. Gaussian + Identity is an exception: $V(\mu)=1$ and $d\eta/d\mu=1$ make $W=I$ and $z=y$ in the formulas below, so the weights do not depend on the current estimate and IRLS reaches the OLS solution $\hat\beta = (X'X)^{-1}X'y$ in a single iteration from any starting point.

At each iteration, working weights $W$ and an adjusted dependent variable $z$ are computed, then the weighted least squares problem:

$$
\hat\beta^{(t+1)} = (X'W^{(t)}X)^{-1}X'W^{(t)}z^{(t)}
$$

is solved to update $\beta$. $W$ is a diagonal matrix, and its $i$-th diagonal entry $W_{ii}$ and the $i$-th component $z_i$ of $z$ are computed from the current $\hat\mu^{(t)}$ and the link function as:

$$
W_{ii} = \frac{1}{V(\mu_i)\,(d\eta/d\mu)_i^2}, \qquad z_i = \eta_i + (y_i - \mu_i)\,\Bigl(\frac{d\eta}{d\mu}\Bigr)_i
$$

where $V(\mu)$ is the variance function and $d\eta/d\mu$ is the derivative of the link function. For grouped Binomial data, where $a(\phi) = 1/n_i$, $W_{ii}$ is additionally multiplied by the trial count $n_i$. See [Nelder & Wedderburn (1972)](#ref-nelder-wedderburn-1972) for the original formulation of IRLS for GLMs. Iteration stops when the maximum absolute change in coefficients falls below the convergence threshold.
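
The update can be sketched for the simplest non-trivial case: a Poisson model with Log link and two coefficients (intercept and slope), where the $2 \times 2$ weighted normal equations are solved by Cramer's rule. This is an illustration under stated assumptions, not the MIDAS implementation: the function name `irls_poisson_log`, the intercept-only starting value, the tolerance, and the data are all hypothetical.

```python
import math

def irls_poisson_log(xs, ys, tol=1e-8, max_iter=50):
    """Fit log(mu_i) = b0 + b1 * x_i by IRLS; returns (b0, b1, iterations)."""
    # Assumed starting point: the intercept-only fit, mu_i = mean(y)
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for iteration in range(1, max_iter + 1):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            mu = math.exp(eta)
            # Log link: d eta / d mu = 1 / mu; Poisson: V(mu) = mu
            w = mu                    # 1 / (V(mu) * (1 / mu)^2)
            z = eta + (y - mu) / mu   # adjusted dependent variable
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve (X'WX) beta = X'Wz for the 2x2 case by Cramer's rule
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wz * s_wxx - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        # Stopping rule: maximum absolute change in the coefficients
        change = max(abs(new_b0 - b0), abs(new_b1 - b1))
        b0, b1 = new_b0, new_b1
        if change < tol:
            return b0, b1, iteration
    raise RuntimeError("IRLS did not converge")

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor
ys = [1, 2, 2, 5, 7, 12]              # hypothetical counts
b0, b1, n_iter = irls_poisson_log(xs, ys)
print(b0, b1, n_iter)
```

Because Log is the canonical link for Poisson, the fitted means satisfy the score equations $X'(y - \hat\mu) = 0$ at convergence, which gives a simple check on the result.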

With the canonical link, the concavity of the log-likelihood gives stable convergence whenever the MLE exists. Non-canonical links may lead to slower convergence or convergence failure.

## Variance Functions and Overdispersion {#variance-functions-and-overdispersion}

As described in the [Exponential Family](#exponential-family) section, the variance function $V(\mu) = b''(\theta)$ is the second derivative of the log-partition function rewritten in terms of $\mu$. Through the relationship $\operatorname{Var}(Y) = V(\mu) \cdot a(\phi)$, it defines the mean-variance relationship for each family.

| Family | $b''(\theta)$ | $V(\mu)$ | $a(\phi)$ | $\operatorname{Var}(Y)$ |
|--------|---------------|----------|-----------|------------------------|
| Gaussian | $1$ | $1$ | $\phi$ | $\phi$ (= $\sigma^2$) |
| Binomial | $\dfrac{e^\theta}{(1+e^\theta)^2}$ | $\mu(1 - \mu)$ | $1/n_i$ | $\mu(1-\mu)/n_i$ |
| Poisson | $e^\theta$ | $\mu$ | $1$ | $\mu$ |
| Gamma | $1/\theta^2$ | $\mu^2$ | $\phi$ | $\mu^2 \phi$ |
| Negative Binomial | $\dfrac{re^\theta}{(1-e^\theta)^2}$ | $\mu + \mu^2/r$ | $1$ | $\mu + \mu^2/r$ |

Poisson and Binomial assume a dispersion parameter $\phi = 1$. When the actual data variance exceeds this assumption, the condition is called overdispersion. Overdispersion leads to underestimated standard errors and confidence intervals that are too narrow. To diagnose overdispersion, use the Deviance/df ratio shown in the [Deviance Goodness-of-Fit](glm#deviance-goodness-of-fit) section of the GLM Diagnostics tab, opened via **View Diagnostics**. Under the assumption, this ratio should be close to 1, so a value well above 1 suggests overdispersion (a value well below 1 suggests underdispersion instead). Note that the ratio fluctuates more around 1 when $n - p$ is small.
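
The diagnostic ratio can be reproduced by hand. The sketch below assumes the standard Poisson unit deviance $2\,[y \log(y/\mu) - (y - \mu)]$ and computes both Deviance/df and Pearson $\chi^2$/df. The function name `poisson_dispersion_ratios` and the counts are hypothetical, and the fitted means come from an intercept-only model for simplicity.

```python
import math

def poisson_dispersion_ratios(ys, mus, n_params):
    """Return (deviance / df, Pearson chi-square / df) for a Poisson fit."""
    deviance = 0.0
    pearson = 0.0
    for y, mu in zip(ys, mus):
        # Unit deviance 2 * [y * log(y / mu) - (y - mu)], with 0 * log(0) taken as 0
        term = y * math.log(y / mu) if y > 0 else 0.0
        deviance += 2.0 * (term - (y - mu))
        pearson += (y - mu) ** 2 / mu   # V(mu) = mu for Poisson
    df = len(ys) - n_params
    return deviance / df, pearson / df

# Hypothetical counts that scatter widely around a constant mean
ys = [0, 1, 0, 14, 2, 13, 0, 10]
mu_hat = sum(ys) / len(ys)   # intercept-only fit: mu_i = mean(y)
dev_ratio, pearson_ratio = poisson_dispersion_ratios(
    ys, [mu_hat] * len(ys), n_params=1)
print(dev_ratio, pearson_ratio)
```

For this toy data both ratios come out far above 1, the pattern that suggests overdispersion.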

When overdispersion is detected with Poisson data, switching to Negative Binomial adds a $\mu^2/r$ term to the variance, explicitly modeling the extra dispersion. When $r$ is estimated, overdispersion is absorbed into $r$, so $\phi = 1$. When $r$ is fixed, residual overdispersion beyond the fixed $r$ is estimated as $\hat\phi = \text{Pearson }\chi^2/(n-p)$ and reflected in standard errors and confidence intervals.

However, for binary data with $n_i = 1$ (logistic regression), each observation follows Bernoulli$(\mu_i)$, and once the mean $\mu_i$ is fixed the marginal variance $\mu_i(1-\mu_i)$ is determined as well. There is no degree of freedom in the per-observation variance, so there is simply nothing to compare against to say "the data variance exceeds the theoretical variance." This is why Pearson $\chi^2$ and deviance cannot detect overdispersion at the individual level. This does not mean overdispersion is absent — only that it cannot be detected from the same data; extra dispersion arising from clusters or repeated measurements (e.g., treating multiple patients from the same hospital as independent) can still exist and must be handled separately (see the [glossary](glossary#overdispersion)). Classical overdispersion diagnostics and remedies are meaningful only for grouped Binomial data with $n_i > 1$.

For grouped Binomial overdispersion, MIDAS does not currently support quasi-binomial or Beta-Binomial alternatives. If the extra dispersion arises from cluster structure, introducing random effects via [GLMM](concepts-glmm) is an option. When overdispersion is suspected, check the Deviance/df ratio and bear in mind that the reported standard errors and confidence intervals may be underestimated.

## Prediction Interval Methods {#prediction-intervals}

Mathematical background for prediction intervals computed by the [GLM](glm#prediction) prediction feature.

In the formulas below, $\hat\phi$ denotes the estimated dispersion parameter. For Gaussian, this is the residual deviance divided by $n - p$ (for Gaussian, the deviance is identical to Pearson $\chi^2$, so both estimators coincide). For Gamma, this is Pearson $\chi^2$ divided by $n - p$ (the deviance-based estimator is not consistent for Gamma, so Pearson $\chi^2/(n-p)$ is used). For Poisson/Binomial, $\hat\phi = 1$. $h_i = x_\text{new}' (X' \hat W X)^{-1} x_\text{new}$ is the leverage of the prediction point, measuring how far the new observation is from the center of the training data in predictor space.
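
The Pearson-based estimator $\hat\phi = \sum_i (y_i - \hat\mu_i)^2 / V(\hat\mu_i) \,/\, (n - p)$ can be sketched as follows. The function name `pearson_dispersion` and the responses and fitted means are hypothetical; the fitted means are assumed to come from a model with $p = 2$ parameters.

```python
def pearson_dispersion(ys, mus, variance_fn, n_params):
    """phi_hat = sum((y - mu)^2 / V(mu)) / (n - p)."""
    chi2 = sum((y - mu) ** 2 / variance_fn(mu) for y, mu in zip(ys, mus))
    return chi2 / (len(ys) - n_params)

ys = [2.0, 3.5, 5.0, 8.0]    # hypothetical responses
mus = [2.5, 3.0, 5.5, 7.5]   # hypothetical fitted means (2-parameter model)

# Gaussian: V(mu) = 1, so this is the residual sum of squares over n - p
phi_gaussian = pearson_dispersion(ys, mus, lambda mu: 1.0, n_params=2)

# Gamma: V(mu) = mu^2, so residuals are scaled relative to the mean
phi_gamma = pearson_dispersion(ys, mus, lambda mu: mu ** 2, n_params=2)

print(phi_gaussian, phi_gamma)
```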

A plug-in method treats the estimated parameters as if they were the true values and computes the interval from those values. Unlike a confidence interval, it does not account for parameter estimation uncertainty.

Prediction interval computation depends on the family:

- **Gaussian with identity link**: The analytical formula $\hat\mu \pm t_{n-p} \sqrt{\hat\phi(1 + h_i)}$ accounts for both the variance of a new observation ($\hat\phi$) and the estimation uncertainty of the mean ($\hat\phi \cdot h_i$)
- **Gaussian with non-identity link**: A plug-in method is used: $\hat\mu \pm t_{n-p} \sqrt{\hat\phi}$, where $t_{n-p}$ is the $t$-distribution quantile at the selected confidence level. The non-linear link transformation prevents exact incorporation of estimation uncertainty on the $\mu$ scale in closed form. Alternatives such as (a) building the interval on the link scale as $\hat\eta \pm t_{n-p}\sqrt{\hat\phi(1+h_i)}$ and back-transforming via $g^{-1}$, or (b) a first-order delta-method approximation $\operatorname{Var}(\hat\mu) \approx (d\mu/d\eta)^2 \cdot \hat\phi h_i$, both weaken coverage guarantees under non-linear links or in small samples, so MIDAS does not use them. As a result, prediction intervals for this combination are a simplified form that does not reflect estimation uncertainty, meaning predictions at the center of the data and at extrapolation points receive the same interval width
- **Poisson, Binomial, Gamma, Negative Binomial**: A plug-in quantile method is used. The estimated distribution parameters are treated as true values, and distribution quantiles are computed directly:
    - Poisson: quantiles of Poisson$(\hat\mu)$
    - Binomial: quantiles of Binomial$(n_\text{new}, \hat\mu)$ with success probability $\hat\mu$ and trial count $n_\text{new}$ (for grouped Binomial, the value in each prediction-dataset row of the column used as **Trials** in training; $1$ for binary data), divided by $n_\text{new}$ so the interval is reported on the success-proportion scale
    - Gamma: quantiles of the Gamma distribution with mean $\hat\mu$, shape $\alpha = 1/\hat\phi$, and scale $\hat\phi \cdot \hat\mu$
    - Negative Binomial: quantiles of the Negative Binomial distribution with mean $\hat\mu$ and $r$ ($\hat r$ in automatic estimation mode, the specified value in fixed mode)
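
The difference between the analytical Gaussian formula and the plug-in form can be illustrated for an intercept-plus-slope model with unit weights, where the leverage reduces to $1/n + (x_\text{new} - \bar x)^2 / S_{xx}$. This is a sketch under those assumptions: the function names, $\hat\phi$, $\hat\mu$, and the predictor values are hypothetical, and the $t$ quantile is passed in as a number rather than computed.

```python
import math

def leverage_simple(x_new, xs):
    """Leverage of x_new for an intercept-plus-slope model with unit weights."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return 1.0 / n + (x_new - x_bar) ** 2 / s_xx

def gaussian_interval(mu_hat, phi_hat, t_crit, h=0.0):
    """mu_hat +/- t_crit * sqrt(phi_hat * (1 + h)); h = 0 gives the plug-in form."""
    half_width = t_crit * math.sqrt(phi_hat * (1.0 + h))
    return mu_hat - half_width, mu_hat + half_width

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
t_crit = 2.306   # 97.5% quantile of t with n - p = 8 degrees of freedom
phi_hat = 4.0    # hypothetical dispersion estimate
mu_hat = 20.0    # hypothetical fitted mean, reused at both points for comparison

h_center = leverage_simple(5.5, xs)    # at the mean of xs
h_far = leverage_simple(20.0, xs)      # extrapolation point
center = gaussian_interval(mu_hat, phi_hat, t_crit, h_center)
far = gaussian_interval(mu_hat, phi_hat, t_crit, h_far)
plug_in = gaussian_interval(mu_hat, phi_hat, t_crit)  # same width everywhere

print(center, far, plug_in)
```

The analytical interval widens as the prediction point moves away from the center of the data, while the plug-in interval keeps the same width at every point.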

For discrete distributions (Poisson, Binomial, Negative Binomial), each quantile at probability level $p$ is rounded conservatively to the smallest integer $k$ satisfying $P(X \le k) \ge p$ (no randomized intervals). Under the plug-in distribution, the coverage probability therefore meets or exceeds the nominal level. For individual Binomial data ($n_i = 1$), the only quantile candidates are $\{0, 1\}$, limiting the informativeness of the interval.
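
The conservative rounding rule can be sketched for the Poisson case. The example assumes an equal-tailed 95% interval (probability levels 0.025 and 0.975) and a hypothetical fitted mean; both choices, and the function names, are illustrative and not taken from MIDAS.

```python
import math

def poisson_quantile(p, mu):
    """Smallest integer k with P(X <= k) >= p for X ~ Poisson(mu), 0 < p < 1."""
    k = 0
    pmf = math.exp(-mu)   # P(X = 0)
    cdf = pmf
    while cdf < p:
        k += 1
        pmf *= mu / k     # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu); returns 0 for k < 0."""
    return sum(math.exp(-mu) * mu ** j / math.factorial(j) for j in range(k + 1))

mu_hat = 3.2   # hypothetical fitted mean
lower = poisson_quantile(0.025, mu_hat)
upper = poisson_quantile(0.975, mu_hat)
# Probability that Poisson(mu_hat) falls inside [lower, upper]
coverage = poisson_cdf(upper, mu_hat) - poisson_cdf(lower - 1, mu_hat)

print(lower, upper, coverage)
```

For this mean the interval is $[0, 7]$, and its probability under Poisson$(3.2)$ is above 0.95, which illustrates why the rule is conservative when the plug-in distribution is taken as true.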

The plug-in methods do not account for parameter estimation uncertainty, so the actual coverage probability may fall below the stated confidence level, particularly in small samples or for predictions far from the observed data range. When coverage matters in small samples, consider increasing the sample size or using a more computationally expensive method such as bootstrapping (not currently supported in MIDAS).

## See also {#see-also}

- **[Generalized Linear Model (GLM)](glm)** - How to run GLM analysis and interpret results
- **[OLS Fundamentals](concepts-regression)** - Mathematical background of OLS, a special case of GLM
- **[GLMM Fundamentals](concepts-glmm)** - Generalized linear mixed model theory with random effects
- **[Glossary](glossary)** - Statistical term definitions

## References {#references}

- <span id="ref-nelder-wedderburn-1972">Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. *Journal of the Royal Statistical Society: Series A*, 135(3), 370-384. https://www.jstor.org/stable/2344614</span>
- <span id="ref-mccullagh-nelder-1989">McCullagh, P., & Nelder, J. A. (1989). *Generalized Linear Models* (2nd ed.). Chapman and Hall.</span>