Glossary of Statistical Terms

Definitions of statistical terms assumed as prerequisites in concepts pages. Terms appear roughly in alphabetical order, with Fieller's method kept next to the closely related delta method.

Asymptotic normality

The property that the distribution of an estimator converges in distribution to a normal distribution as the sample size $n \to \infty$. Under appropriate normalization,

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N(0, V)$$

The $d$ above the arrow stands for "distribution." $V$ is the asymptotic variance (or the asymptotic covariance matrix when $\hat\theta_n$ is a vector) and depends on the type of estimator. MLEs are asymptotically normal under regularity conditions. Even for OLS (Ordinary Least Squares) without normality assumptions, the central limit theorem ensures that $\sqrt{n}(\hat\beta - \beta)$ converges in distribution to a normal distribution in large samples (OLS Fundamentals).
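
As an illustration (not from the source), a small Monte Carlo sketch: the standardized sample mean of Exponential(1) data should be approximately $N(0, 1)$ at moderate $n$, so roughly 95% of replications fall within $\pm 1.96$.

```python
import math
import random
import statistics

# Illustrative sketch: for Exponential(1) data (mean 1, variance 1), the CLT
# gives sqrt(n) * (sample mean - 1) -> N(0, 1) in distribution.
random.seed(0)

def standardized_mean(n):
    """sqrt(n) * (sample mean - true mean) for n Exponential(1) draws."""
    xs = [random.expovariate(1.0) for _ in range(n)]
    return math.sqrt(n) * (statistics.fmean(xs) - 1.0)

# Many replications of the standardized estimator at n = 500.
draws = [standardized_mean(500) for _ in range(2000)]

# Under the N(0, 1) approximation, about 95% of draws lie in [-1.96, 1.96].
coverage = sum(abs(d) <= 1.96 for d in draws) / len(draws)
```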

Consistency

The property that an estimator $\hat\theta_n$ converges in probability to the true parameter $\theta$ as $n \to \infty$, written $\hat\theta_n \xrightarrow{p} \theta$.

Consistency is a basic requirement for estimators: it guarantees that estimates approach the true value as data accumulate. Consistency alone says nothing about estimation precision at finite sample sizes. The OLS estimator is consistent when $\operatorname{plim}(X'\varepsilon/n) = 0$ ($\operatorname{plim}$ denotes the probability limit) and the probability limit $Q = \operatorname{plim}(X'X/n)$ is nonsingular. Homoscedasticity and uncorrelated errors (required by Gauss-Markov) are not needed for consistency (OLS Fundamentals).
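
A minimal simulation sketch (illustrative data, not from the source): the typical error of the sample mean of Uniform(0, 1) draws shrinks as $n$ grows, as consistency requires.

```python
import random
import statistics

# Illustrative sketch of consistency: the sample mean of Uniform(0, 1) draws
# converges in probability to the true mean 0.5 as n grows, so its typical
# estimation error shrinks with the sample size.
random.seed(1)

def mean_abs_error(n, reps=500):
    """Average |sample mean - 0.5| over many replications of size n."""
    errs = []
    for _ in range(reps):
        xs = [random.random() for _ in range(n)]
        errs.append(abs(statistics.fmean(xs) - 0.5))
    return statistics.fmean(errs)

# The error falls roughly like 1/sqrt(n) across these sample sizes.
errors = {n: mean_abs_error(n) for n in (10, 100, 1000)}
```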

Convergence in distribution

A mode of convergence for a sequence of random variables $X_n$ where the distribution approaches another distribution as $n \to \infty$. Formally, $X_n \xrightarrow{d} X$ if the distribution functions satisfy $F_n(x) \to F(x)$ at every continuity point of $F$.

The $d$ above the arrow stands for "distribution." Convergence in distribution holds as long as the shapes of the distributions of $X_n$ and $X$ approach each other; it does not require the values of $X_n$ and $X$ themselves to be close. In contrast, convergence in probability requires the values themselves to be close: $|X_n - X|$ must become small with high probability. Convergence in probability implies convergence in distribution; the converse holds only when the limit is a constant. Asymptotic normality is defined using this concept.

Convergence in probability

A mode of convergence for a sequence of random variables $X_n$ toward a random variable $X$. For every $\varepsilon > 0$,

$$P(|X_n - X| > \varepsilon) \to 0 \quad (n \to \infty)$$

Written $X_n \xrightarrow{p} X$. The $p$ above the arrow stands for "probability." When $X$ is a constant $c$, this means that the probability that $X_n$ deviates from $c$ by more than $\varepsilon$ vanishes as $n$ grows. Consistency of an estimator is defined as convergence in probability to the true parameter $\theta$ (a constant). The notation $\operatorname{plim} X_n = c$ (probability limit) is equivalent to $X_n \xrightarrow{p} c$.
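
The defining probability can be estimated directly by simulation; a sketch with illustrative choices ($X_n$ = mean of $n$ Uniform(0, 1) draws, $c = 0.5$, $\varepsilon = 0.05$):

```python
import random
import statistics

# Illustrative sketch: estimate P(|X_n - c| > eps) by simulation, where X_n is
# the mean of n Uniform(0, 1) draws and c = 0.5. Convergence in probability
# says this exceedance probability vanishes as n grows.
random.seed(2)
EPS = 0.05

def exceed_prob(n, reps=1000):
    """Fraction of replications with |sample mean - 0.5| > EPS."""
    hits = 0
    for _ in range(reps):
        xs = [random.random() for _ in range(n)]
        if abs(statistics.fmean(xs) - 0.5) > EPS:
            hits += 1
    return hits / reps

probs = [exceed_prob(n) for n in (10, 100, 1000)]  # decreasing toward 0
```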

Delta method

A technique for approximating the variance of a nonlinear function $g(\hat\theta)$ of an estimator. Taking a first-order Taylor expansion of $g$ around the true value $\theta$ gives

$$\operatorname{Var}(g(\hat\theta)) \approx g'(\theta)^2 \operatorname{Var}(\hat\theta)$$

In the multivariate case, use the gradient vector $\nabla g$ and the variance-covariance matrix $\Sigma$: $\nabla g^\top \Sigma \, \nabla g$.

The delta method relies on asymptotic normality and may be inaccurate in small samples. The linearization error also grows when $g$ has high curvature near $\theta$ or when $\hat\theta$ has large variance. When $g'(\theta) = 0$ at the true value, the first-order delta method degenerates and $g(\hat\theta)$ no longer has a normal asymptotic distribution; the second-order delta method (incorporating quadratic terms) is needed instead.
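
A sketch comparing the delta-method variance with simulation, for the illustrative transformation $g(x) = \log x$ applied to the mean of Exponential(1) data (so $\mu = 1$, $\operatorname{Var}(\bar X) = 1/n$, and $g'(\mu) = 1$):

```python
import math
import random
import statistics

# Illustrative delta-method check for g(mean) = log(mean) with Exponential(1)
# data: true mean mu = 1, Var(sample mean) = 1/n, g'(mu) = 1/mu = 1, so the
# first-order delta method predicts Var(log(sample mean)) ~ 1/n.
random.seed(3)
n, reps = 200, 3000

sim = []
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    sim.append(math.log(statistics.fmean(xs)))

simulated_var = statistics.variance(sim)
delta_var = (1.0 / 1.0) ** 2 * (1.0 / n)  # g'(mu)^2 * Var(mean) = 1/n
```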

Fieller's method

A method for constructing a confidence interval for the ratio $\rho = \beta_0 / \beta_1$ of two parameters. It exploits the (asymptotic) bivariate normality of $(\hat\beta_0, \hat\beta_1)$.

Under the hypothesis that the true ratio is $\rho$, we have $\beta_0 - \rho \beta_1 = 0$, so the statistic $\hat\beta_0 - \rho \hat\beta_1$ has mean zero and variance

$$V(\rho) = \operatorname{Var}(\hat\beta_0) - 2\rho \operatorname{Cov}(\hat\beta_0, \hat\beta_1) + \rho^2 \operatorname{Var}(\hat\beta_1).$$

The statistic $(\hat\beta_0 - \rho \hat\beta_1)^2 / V(\rho)$ follows a $\chi^2_1$ distribution when the variances and covariance are known, so the $1 - \alpha$ confidence set is the collection of $\rho$ for which this quantity does not exceed the critical value $c$ (the upper $\alpha$ point of $\chi^2_1$). Rearranging $(\hat\beta_0 - \rho \hat\beta_1)^2 \le c \cdot V(\rho)$ yields a quadratic inequality in $\rho$:

$$A \rho^2 + B \rho + C \le 0,$$

where $A = \hat\beta_1^2 - c \cdot \operatorname{Var}(\hat\beta_1)$, $B = -2\bigl(\hat\beta_0 \hat\beta_1 - c \cdot \operatorname{Cov}(\hat\beta_0, \hat\beta_1)\bigr)$, and $C = \hat\beta_0^2 - c \cdot \operatorname{Var}(\hat\beta_0)$. The sign of $A$ and the discriminant $D = B^2 - 4AC$ determine the shape of the confidence set.

  • If $A > 0$, the set is a finite interval $[\rho_-, \rho_+]$. The condition $A > 0$ is equivalent to $\hat\beta_1^2 / \operatorname{Var}(\hat\beta_1) > c$, i.e., the Wald test for $\beta_1 = 0$ with the same critical value $c$ rejects the null.
  • If $A < 0$ and $D \ge 0$, the set is an unbounded union $(-\infty, \rho_-] \cup [\rho_+, \infty)$. This happens when the Wald test does not reject $\beta_1 = 0$.
  • If $A < 0$ and $D < 0$, the set is the entire real line $\mathbb{R}$, meaning the data carry no information about $\rho$.

Unlike the delta method, which linearizes $g(\hat\beta_0, \hat\beta_1) = \hat\beta_0 / \hat\beta_1$ via a Taylor expansion, Fieller's method avoids linearization. When $\hat\beta_1$ is close to zero, the delta-method approximation breaks down; Fieller's method instead reflects this uncertainty through unbounded or all-real-line confidence sets. It is exact when the estimators are exactly normal and is an approximation under asymptotic normality, as in GLMs. When the variances are estimated from residuals (as in OLS), the original formulation uses $t^2_{n-p}$ (equivalently $F_{1, n-p}$) as the critical value rather than $\chi^2_1$ (Fieller, 1954).
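
The three cases can be coded directly from the quadratic's coefficients. A sketch with made-up estimates and known variances (the function name and inputs are illustrative, not from any particular library):

```python
import math

# Illustrative implementation of Fieller's confidence set for rho = b0 / b1
# from the quadratic A*rho^2 + B*rho + C <= 0 described above.
def fieller_set(b0, b1, var0, var1, cov, c):
    """Return ('interval', lo, hi), ('complement', lo, hi), or ('real_line',)."""
    A = b1 ** 2 - c * var1
    B = -2.0 * (b0 * b1 - c * cov)
    C = b0 ** 2 - c * var0
    D = B ** 2 - 4.0 * A * C
    if A > 0:  # Wald test rejects beta1 = 0; D >= 0 is then automatic
        lo = (-B - math.sqrt(D)) / (2 * A)
        hi = (-B + math.sqrt(D)) / (2 * A)
        return ("interval", lo, hi)
    if D >= 0:  # unbounded union (-inf, lo] U [hi, inf)
        return ("complement", (-B + math.sqrt(D)) / (2 * A),
                (-B - math.sqrt(D)) / (2 * A))
    return ("real_line",)  # the data carry no information about rho

# Hypothetical estimates: b0 = 2, b1 = 1, known variances, chi^2_1 critical
# value 3.84. Since 1^2 / 0.01 > 3.84, the set is a finite interval around 2.
result = fieller_set(2.0, 1.0, 0.04, 0.01, 0.0, 3.84)
```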

Deviance

A measure of model fit based on the log-likelihood difference from the saturated model:

$$D = 2(\ell_{\text{saturated}} - \ell_{\text{model}})$$

The saturated model assigns an individual parameter to each covariate pattern and reproduces the data exactly (its residuals are identically zero). A covariate pattern is a group of observations sharing the same combination of predictor values. For individual-level observations each observation typically forms its own pattern, so the number of parameters equals the number of observations. For data pre-aggregated by covariate pattern (for instance, counts of successes and trials for each pattern), the number of parameters equals the number of patterns. Following McCullagh & Nelder's convention, the quantity $D$ defined above is the scaled deviance, and multiplying it by the dispersion parameter $\phi$ gives the unscaled deviance $D^* = \phi D$ (the quantity commonly called "deviance"). For Poisson and Binomial with $\phi = 1$, the two coincide. For the Gaussian family, the scaled deviance equals the fitted model's residual sum of squares divided by the error variance, $\text{RSS}_\text{model}/\sigma^2$, and the unscaled deviance is $\text{RSS}_\text{model}$ itself. MIDAS reports the unscaled form. Deviance generalizes this relationship to any exponential family distribution, with larger values indicating poorer fit. In GLMMs (Generalized Linear Mixed Models), penalized deviance is used for parameter estimation (GLMM Fundamentals).
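
A sketch for the Poisson family with illustrative counts: because the saturated model sets $\hat\mu_i = y_i$, the definition $D = 2(\ell_{\text{saturated}} - \ell_{\text{model}})$ reduces to the familiar closed form $2\sum_i [y_i \log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)]$.

```python
import math

# Illustrative Poisson deviance: compare the definition via log-likelihoods
# with the closed form. Counts y and fitted means mu are made up (all y > 0,
# avoiding the y*log(y) edge case at zero).
y = [3, 5, 2, 8]
mu = [4.0, 4.5, 3.0, 6.5]

def pois_loglik(ys, mus):
    # log f(y | mu) = y*log(mu) - mu - log(y!), summed over observations
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(ys, mus))

# Definition: D = 2 * (saturated log-likelihood - model log-likelihood),
# where the saturated model fits mu_i = y_i exactly.
dev_from_def = 2 * (pois_loglik(y, y) - pois_loglik(y, mu))

# Closed form: 2 * sum(y*log(y/mu) - (y - mu)); the log(y!) terms cancel.
dev_closed = 2 * sum(yi * math.log(yi / mi) - (yi - mi)
                     for yi, mi in zip(y, mu))
```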

Estimator

A function of data used to infer an unknown parameter. Since data are random variables, an estimator is itself a random variable that takes different values across samples. The specific numerical value obtained by applying an estimator to observed data is called an estimate.

For example, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is an estimator of the population mean $\mu$. The quality of an estimator is evaluated through properties such as consistency, unbiasedness, and asymptotic normality.

Likelihood and log-likelihood

The likelihood is the same formula as the probability density (or mass) function, read as a function of the parameter $\theta$. For a single observation, $L(\theta) = f(y \mid \theta)$; for $n$ independent observations, $L(\theta) = \prod_{i=1}^n f(y_i \mid \theta)$. While probability varies over possible data for a given parameter, likelihood varies over possible parameters for observed data.

The log-likelihood $\ell(\theta) = \log L(\theta)$ converts products of independent observations into sums, making numerical computation more tractable. Because $\log$ is a monotonically increasing function, the $\theta$ that maximizes the likelihood is the same as the $\theta$ that maximizes the log-likelihood.

Parameter estimation in GLMs (Generalized Linear Models) (GLM Fundamentals), Laplace approximation in GLMMs (GLMM Fundamentals), and Cox model partial likelihood (Survival Analysis Fundamentals) are all based on log-likelihood.

Maximum likelihood estimator (MLE)

The parameter value that maximizes the likelihood function: $\hat\theta_{\text{ML}} = \arg\max_\theta L(\theta; y)$.

When the model is correctly specified and regularity conditions (technical conditions on the smoothness of the likelihood function and the parameter space) hold, MLEs possess consistency, asymptotic normality, and asymptotic efficiency: the asymptotic variance of $\sqrt{n}(\hat\theta_n - \theta)$ equals $I(\theta)^{-1}$, where $I(\theta)$ is the Fisher information matrix for a single observation. The Cramér-Rao information inequality guarantees in finite samples that the variance of any regular unbiased estimator is at least $(nI(\theta))^{-1}$, the inverse of the Fisher information aggregated over all $n$ observations; asymptotic efficiency means that this bound is attained as $n \to \infty$, with $\operatorname{Var}(\hat\theta_n)$ of order $I(\theta)^{-1}/n$. In GLMs, the MLE generally has no closed-form solution and is computed numerically via IRLS (Iteratively Reweighted Least Squares) (GLM Fundamentals).
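
A sketch of maximum likelihood for a model where the MLE is known in closed form, the Exponential(rate) family: a simple grid search over the log-likelihood recovers (essentially) the same value as $1/\bar y$. The data and grid are illustrative.

```python
import math
import random

# Illustrative MLE for the Exponential(rate) model, where the MLE has the
# closed form 1 / sample mean. A grid search over the log-likelihood shows
# that maximizing log L finds (essentially) the same value.
random.seed(4)
data = [random.expovariate(2.0) for _ in range(500)]  # true rate = 2
n_obs, total = len(data), sum(data)

def loglik(rate):
    # log L(rate) = n*log(rate) - rate * sum(y_i)
    return n_obs * math.log(rate) - rate * total

grid = [0.01 * k for k in range(1, 1001)]  # candidate rates 0.01 .. 10.00
rate_grid = max(grid, key=loglik)
rate_closed = n_obs / total  # = 1 / sample mean
```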

Overdispersion

A condition where the observed variance in data exceeds the variance assumed by the model. Poisson and Binomial families assume the dispersion parameter $\phi = 1$, but real data often exhibit greater variability.

Overdispersion leads to underestimated standard errors and overly narrow confidence intervals. When overdispersion is detected in Poisson models, switching to Negative Binomial explicitly models the extra variance. For Binomial overdispersion, see GLM Fundamentals. Note that when the Binomial trial count is $n_i = 1$ (Bernoulli, i.e., logistic regression), the marginal variance $\mu_i(1-\mu_i)$ is fully determined by the mean $\mu_i$, so individual-level Bernoulli data cannot reveal overdispersion through Pearson $\chi^2$ or deviance diagnostics. Extra variability stemming from clustering, repeated measures, or unobserved heterogeneity can still arise, but it is handled separately via GLMMs or quasi-likelihood. Classical overdispersion detection and correction is meaningful only for grouped Binomial data with $n_i \ge 2$.
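
A simulation sketch (all choices illustrative): Poisson forces variance = mean, so a variance-to-mean ratio well above 1 in count data signals overdispersion. The counts below come from a Gamma-Poisson mixture (a negative binomial), which genuinely has more variance than a Poisson with the same mean; the Poisson sampler is Knuth's classic method.

```python
import math
import random
import statistics

# Illustrative overdispersion check: compare sample variance to sample mean.
random.seed(5)

def poisson(lam):
    """Knuth's simple Poisson sampler (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# Each observation's Poisson mean is itself Gamma(shape=5, mean=5) distributed,
# so the marginal mean is 5 but the marginal variance is 5 + 5**2/5 = 10.
shape = 5.0
counts = [poisson(random.gammavariate(shape, 5.0 / shape)) for _ in range(4000)]

dispersion = statistics.variance(counts) / statistics.fmean(counts)
# dispersion well above 1 signals overdispersion relative to Poisson
```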

Sufficient statistic

A statistic that retains all information in the data about a parameter $\theta$. Formally, $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$. Equivalently, by the Fisher-Neyman factorization theorem, $T$ is sufficient exactly when the density factors as $f(x \mid \theta) = g(T(x), \theta)\,h(x)$.

Summarizing data through a sufficient statistic loses no information relevant to estimating $\theta$. In GLMs with canonical links, $X'y$ is a sufficient statistic for $\beta$, and the log-likelihood is concave in $\beta$. When the design matrix has full rank, this guarantees uniqueness of the MLE and stable IRLS convergence (GLM Fundamentals).
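
The defining property can be checked by hand in the Bernoulli case (an illustrative example, not from the source): given $T = \sum_i x_i = t$, every sequence with $t$ successes has conditional probability $1/\binom{n}{t}$, whatever the value of $p$.

```python
import math

# Illustrative check of sufficiency for i.i.d. Bernoulli(p): T = sum(x) is
# sufficient, and P(X = x | T = t) = 1 / C(n, t) regardless of p.
x = [1, 0, 1, 1, 0]  # a specific observed sequence (made up)
n, t = len(x), sum(x)

def joint(p):
    """P(X = x) for the specific sequence under Bernoulli(p)."""
    return p ** t * (1 - p) ** (n - t)

def total(p):
    """P(T = t) = C(n, t) * p^t * (1-p)^(n-t)."""
    return math.comb(n, t) * p ** t * (1 - p) ** (n - t)

# The conditional probability is identical for different parameter values:
cond_a = joint(0.3) / total(0.3)
cond_b = joint(0.8) / total(0.8)
# both equal 1 / C(5, 3) = 0.1, free of p
```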

Unbiasedness

The property that the expected value of an estimator equals the true parameter: $E[\hat\theta] = \theta$.

The OLS estimator is unbiased under $E[\varepsilon \mid X] = 0$ (strict exogeneity). This condition means that $\varepsilon$ is uncorrelated with any measurable function of $X$, a strictly stronger condition than linear uncorrelatedness $\operatorname{Cov}(X, \varepsilon) = 0$. Neither $E[\varepsilon] = 0$ (unconditional mean zero) nor $\operatorname{Cov}(X, \varepsilon) = 0$ alone is sufficient for unbiasedness. With the additional assumptions of homoscedasticity and uncorrelated errors ($\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I$), the Gauss-Markov theorem guarantees minimum variance among linear unbiased estimators (BLUE: Best Linear Unbiased Estimator) (OLS Fundamentals). MLEs are generally biased in finite samples.
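
A classic simulation sketch of finite-sample bias (illustrative numbers): averaging estimates over many samples approximates the estimator's expectation, and the $1/(n-1)$ sample variance is unbiased for the true variance while the $1/n$ version is biased downward by the factor $(n-1)/n$.

```python
import random
import statistics

# Illustrative sketch: compare the 1/n and 1/(n-1) variance estimators on
# many small samples from Uniform(0, 1), whose true variance is 1/12.
random.seed(6)
n, reps = 5, 20000
true_var = 1.0 / 12.0  # variance of Uniform(0, 1)

biased, unbiased = [], []
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    m = statistics.fmean(xs)
    ss = sum((xi - m) ** 2 for xi in xs)
    biased.append(ss / n)          # E = (n-1)/n * sigma^2: too small
    unbiased.append(ss / (n - 1))  # E = sigma^2: unbiased

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)
```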

Variance function

In exponential family distributions, the function $V(\mu)$ that determines the mean-variance relationship: $\operatorname{Var}(Y) = V(\mu) \cdot a(\phi)$. $V(\mu)$ is the second derivative of the log-partition function $b(\theta)$, expressed as a function of $\mu$.

The scaling factor $a(\phi)$ takes different forms depending on the family. For Gaussian and Gamma, $a(\phi) = \phi$, the dispersion parameter itself. For Poisson, $a(\phi) = 1$, a constant. For Binomial, $a(\phi) = 1/n_i$, depending on the per-observation number of trials $n_i$. Here $n_i$ is the number of trials in the $i$-th observation and is distinct from the overall sample size $n$. In Poisson and Binomial, $\phi$ is fixed at $1$, leaving no room for scaling through $\phi$.

For Poisson, $V(\mu) = \mu$; for Binomial, $V(\mu) = \mu(1-\mu)$; for Gamma, $V(\mu) = \mu^2$ (GLM Fundamentals).
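
A simulation sketch of $\operatorname{Var}(Y) = V(\mu) \cdot a(\phi)$ for the Binomial family, where the response is a proportion: $V(\mu) = \mu(1-\mu)$ and $a(\phi) = 1/n_i$ give $\operatorname{Var}(Y) = \mu(1-\mu)/n_i$. The values of $\mu$ and $n_i$ are made up for illustration.

```python
import random
import statistics

# Illustrative check: the variance of a Binomial proportion Y = S / n_i should
# match V(mu) * a(phi) = mu * (1 - mu) / n_i.
random.seed(7)
mu, n_i = 0.3, 20

props = []
for _ in range(20000):
    successes = sum(random.random() < mu for _ in range(n_i))
    props.append(successes / n_i)

theoretical = mu * (1 - mu) / n_i       # V(mu) * a(phi)
simulated = statistics.variance(props)  # should be close to theoretical
```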

References