Glossary of Statistical Terms

Definitions of statistical terms assumed as prerequisites in concepts pages. Terms appear roughly in alphabetical order, with Fieller's method kept next to the closely related delta method.

Asymptotic normality

The property that the distribution of an estimator converges in distribution to a normal distribution as the sample size $n \to \infty$. Under appropriate normalization,

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N(0, V)$$

The $d$ above the arrow stands for "distribution." $V$ is the asymptotic variance (or the asymptotic covariance matrix when $\hat\theta_n$ is a vector) and depends on the type of estimator. MLEs are asymptotically normal under regularity conditions. Even for OLS (Ordinary Least Squares) without normality assumptions, the central limit theorem ensures that $\sqrt{n}(\hat\beta - \beta)$ converges in distribution to a normal distribution in large samples (OLS Fundamentals).
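
As an illustration (not from the source), a small Monte Carlo sketch: the standardized sample mean of Exponential(1) data should be approximately $N(0, 1)$ at moderate $n$, so roughly 95% of replications fall within $\pm 1.96$.

```python
import math
import random
import statistics

# Illustrative sketch: for Exponential(1) data (mean 1, variance 1), the CLT
# gives sqrt(n) * (sample mean - 1) -> N(0, 1) in distribution.
random.seed(0)

def standardized_mean(n):
    """sqrt(n) * (sample mean - true mean) for n Exponential(1) draws."""
    xs = [random.expovariate(1.0) for _ in range(n)]
    return math.sqrt(n) * (statistics.fmean(xs) - 1.0)

# Many replications of the standardized estimator at n = 500.
draws = [standardized_mean(500) for _ in range(2000)]

# Under the N(0, 1) approximation, about 95% of draws lie in [-1.96, 1.96].
coverage = sum(abs(d) <= 1.96 for d in draws) / len(draws)
```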

Consistency

The property that an estimator $\hat\theta_n$ converges in probability to the true parameter $\theta$ as $n \to \infty$, written $\hat\theta_n \xrightarrow{p} \theta$.

Consistency is a basic requirement for estimators: it guarantees that estimates approach the true value as data accumulate. Consistency alone says nothing about estimation precision at finite sample sizes. The OLS estimator is consistent when $\operatorname{plim}(X'\varepsilon/n) = 0$ ($\operatorname{plim}$ denotes the probability limit) and the probability limit $Q = \operatorname{plim}(X'X/n)$ is nonsingular. Homoscedasticity and uncorrelated errors (required by Gauss-Markov) are not needed for consistency (OLS Fundamentals).
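
A minimal simulation sketch (illustrative data, not from the source): the typical error of the sample mean of Uniform(0, 1) draws shrinks as $n$ grows, as consistency requires.

```python
import random
import statistics

# Illustrative sketch of consistency: the sample mean of Uniform(0, 1) draws
# converges in probability to the true mean 0.5 as n grows, so its typical
# estimation error shrinks with the sample size.
random.seed(1)

def mean_abs_error(n, reps=500):
    """Average |sample mean - 0.5| over many replications of size n."""
    errs = []
    for _ in range(reps):
        xs = [random.random() for _ in range(n)]
        errs.append(abs(statistics.fmean(xs) - 0.5))
    return statistics.fmean(errs)

# The error falls roughly like 1/sqrt(n) across these sample sizes.
errors = {n: mean_abs_error(n) for n in (10, 100, 1000)}
```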

Convergence in distribution

A mode of convergence for a sequence of random variables $X_n$ where the distribution approaches another distribution as $n \to \infty$. Formally, $X_n \xrightarrow{d} X$ if the distribution functions satisfy $F_n(x) \to F(x)$ at every continuity point of $F$.

The $d$ above the arrow stands for "distribution." Convergence in distribution holds as long as the shapes of the distributions of $X_n$ and $X$ approach each other; it does not require the values of $X_n$ and $X$ themselves to be close. In contrast, convergence in probability requires the values themselves to be close: $|X_n - X|$ must become small with high probability. Convergence in probability implies convergence in distribution; the converse holds only when the limit is a constant. Asymptotic normality is defined using this concept.

Convergence in probability

A mode of convergence for a sequence of random variables $X_n$ toward a random variable $X$. For every $\varepsilon > 0$,

$$P(|X_n - X| > \varepsilon) \to 0 \quad (n \to \infty)$$

Written $X_n \xrightarrow{p} X$. The $p$ above the arrow stands for "probability." When $X$ is a constant $c$, this means that the probability that $X_n$ deviates from $c$ by more than $\varepsilon$ vanishes as $n$ grows. Consistency of an estimator is defined as convergence in probability to the true parameter $\theta$ (a constant). The notation $\operatorname{plim} X_n = c$ (probability limit) is equivalent to $X_n \xrightarrow{p} c$.
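
The defining probability can be estimated directly by simulation; a sketch with illustrative choices ($X_n$ = mean of $n$ Uniform(0, 1) draws, $c = 0.5$, $\varepsilon = 0.05$):

```python
import random
import statistics

# Illustrative sketch: estimate P(|X_n - c| > eps) by simulation, where X_n is
# the mean of n Uniform(0, 1) draws and c = 0.5. Convergence in probability
# says this exceedance probability vanishes as n grows.
random.seed(2)
EPS = 0.05

def exceed_prob(n, reps=1000):
    """Fraction of replications with |sample mean - 0.5| > EPS."""
    hits = 0
    for _ in range(reps):
        xs = [random.random() for _ in range(n)]
        if abs(statistics.fmean(xs) - 0.5) > EPS:
            hits += 1
    return hits / reps

probs = [exceed_prob(n) for n in (10, 100, 1000)]  # decreasing toward 0
```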

Delta method

A technique for approximating the variance of a nonlinear function $g(\hat\theta)$ of an estimator. Taking a first-order Taylor expansion of $g$ around the true value $\theta$ gives

$$\operatorname{Var}(g(\hat\theta)) \approx g'(\theta)^2 \operatorname{Var}(\hat\theta)$$

In the multivariate case, use the gradient vector $\nabla g$ and the variance-covariance matrix $\Sigma$: $\nabla g^\top \Sigma \, \nabla g$.

The delta method relies on asymptotic normality and may be inaccurate in small samples. The linearization error also grows when $g$ has high curvature near $\theta$ or when $\hat\theta$ has large variance. When $g'(\theta) = 0$ at the true value, the first-order delta method degenerates and $g(\hat\theta)$ no longer has a normal asymptotic distribution; the second-order delta method (incorporating quadratic terms) is needed instead.
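
A sketch comparing the delta-method variance with simulation, for the illustrative transformation $g(x) = \log x$ applied to the mean of Exponential(1) data (so $\mu = 1$, $\operatorname{Var}(\bar X) = 1/n$, and $g'(\mu) = 1$):

```python
import math
import random
import statistics

# Illustrative delta-method check for g(mean) = log(mean) with Exponential(1)
# data: true mean mu = 1, Var(sample mean) = 1/n, g'(mu) = 1/mu = 1, so the
# first-order delta method predicts Var(log(sample mean)) ~ 1/n.
random.seed(3)
n, reps = 200, 3000

sim = []
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    sim.append(math.log(statistics.fmean(xs)))

simulated_var = statistics.variance(sim)
delta_var = (1.0 / 1.0) ** 2 * (1.0 / n)  # g'(mu)^2 * Var(mean) = 1/n
```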

Fieller's method

A method for constructing a confidence interval for the ratio $\rho = \beta_0 / \beta_1$ of two parameters. It exploits the (asymptotic) bivariate normality of $(\hat\beta_0, \hat\beta_1)$.

Under the hypothesis that the true ratio is $\rho$, we have $\beta_0 - \rho \beta_1 = 0$, so the statistic $\hat\beta_0 - \rho \hat\beta_1$ has mean zero and variance

$$V(\rho) = \operatorname{Var}(\hat\beta_0) - 2\rho \operatorname{Cov}(\hat\beta_0, \hat\beta_1) + \rho^2 \operatorname{Var}(\hat\beta_1).$$

The statistic $(\hat\beta_0 - \rho \hat\beta_1)^2 / V(\rho)$ follows a $\chi^2_1$ distribution when the variances and covariance are known, so the $1 - \alpha$ confidence set is the collection of $\rho$ for which this quantity does not exceed the critical value $c$ (the upper $\alpha$ point of $\chi^2_1$). Rearranging $(\hat\beta_0 - \rho \hat\beta_1)^2 \le c \cdot V(\rho)$ yields a quadratic inequality in $\rho$:

$$A \rho^2 + B \rho + C \le 0,$$

where $A = \hat\beta_1^2 - c \cdot \operatorname{Var}(\hat\beta_1)$, $B = -2\bigl(\hat\beta_0 \hat\beta_1 - c \cdot \operatorname{Cov}(\hat\beta_0, \hat\beta_1)\bigr)$, and $C = \hat\beta_0^2 - c \cdot \operatorname{Var}(\hat\beta_0)$. The sign of $A$ and the discriminant $D = B^2 - 4AC$ determine the shape of the confidence set.

  • If $A > 0$, the set is a finite interval $[\rho_-, \rho_+]$. The condition $A > 0$ is equivalent to $\hat\beta_1^2 / \operatorname{Var}(\hat\beta_1) > c$, i.e., the Wald test for $\beta_1 = 0$ with the same critical value $c$ rejects the null.
  • If $A < 0$ and $D \ge 0$, the set is an unbounded union $(-\infty, \rho_-] \cup [\rho_+, \infty)$. This happens when the Wald test does not reject $\beta_1 = 0$.
  • If $A < 0$ and $D < 0$, the set is the entire real line $\mathbb{R}$, meaning the data carry no information about $\rho$.

Unlike the delta method, which linearizes $g(\hat\beta_0, \hat\beta_1) = \hat\beta_0 / \hat\beta_1$ via a Taylor expansion, Fieller's method avoids linearization. When $\hat\beta_1$ is close to zero, the delta-method approximation breaks down; Fieller's method instead reflects this uncertainty through unbounded or all-real-line confidence sets. It is exact when the estimators are exactly normal and is an approximation under asymptotic normality, as in GLMs. When the variances are estimated from residuals (as in OLS), the original formulation uses $t^2_{n-p}$ (equivalently $F_{1, n-p}$) as the critical value rather than $\chi^2_1$ (Fieller, 1954).
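
The three cases can be coded directly from the quadratic's coefficients. A sketch with made-up estimates and known variances (the function name and inputs are illustrative, not from any particular library):

```python
import math

# Illustrative implementation of Fieller's confidence set for rho = b0 / b1
# from the quadratic A*rho^2 + B*rho + C <= 0 described above.
def fieller_set(b0, b1, var0, var1, cov, c):
    """Return ('interval', lo, hi), ('complement', lo, hi), or ('real_line',)."""
    A = b1 ** 2 - c * var1
    B = -2.0 * (b0 * b1 - c * cov)
    C = b0 ** 2 - c * var0
    D = B ** 2 - 4.0 * A * C
    if A > 0:  # Wald test rejects beta1 = 0; D >= 0 is then automatic
        lo = (-B - math.sqrt(D)) / (2 * A)
        hi = (-B + math.sqrt(D)) / (2 * A)
        return ("interval", lo, hi)
    if D >= 0:  # unbounded union (-inf, lo] U [hi, inf)
        return ("complement", (-B + math.sqrt(D)) / (2 * A),
                (-B - math.sqrt(D)) / (2 * A))
    return ("real_line",)  # the data carry no information about rho

# Hypothetical estimates: b0 = 2, b1 = 1, known variances, chi^2_1 critical
# value 3.84. Since 1^2 / 0.01 > 3.84, the set is a finite interval around 2.
result = fieller_set(2.0, 1.0, 0.04, 0.01, 0.0, 3.84)
```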

Deviance

A measure of model fit based on the log-likelihood difference from the saturated model:

$$D = 2(\ell_{\text{saturated}} - \ell_{\text{model}})$$

The saturated model assigns an individual parameter to each covariate pattern and reproduces the data exactly (its residuals are identically zero). A covariate pattern is a group of observations sharing the same combination of predictor values. For individual-level observations each observation typically forms its own pattern, so the number of parameters equals the number of observations. For data pre-aggregated by covariate pattern (for instance, counts of successes and trials for each pattern), the number of parameters equals the number of patterns. Following McCullagh & Nelder's convention, the quantity $D$ defined above is the scaled deviance, and multiplying it by the dispersion parameter $\phi$ gives the unscaled deviance $D^* = \phi D$ (the quantity commonly called "deviance"). For Poisson and Binomial with $\phi = 1$, the two coincide. For the Gaussian family, the scaled deviance equals the fitted model's residual sum of squares divided by the error variance, $\text{RSS}_\text{model}/\sigma^2$, and the unscaled deviance is $\text{RSS}_\text{model}$ itself. MIDAS reports the unscaled form. Deviance generalizes this relationship to any exponential family distribution, with larger values indicating poorer fit. In GLMMs (Generalized Linear Mixed Models), penalized deviance is used for parameter estimation (GLMM Fundamentals).
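
A sketch for the Poisson family with illustrative counts: because the saturated model sets $\hat\mu_i = y_i$, the definition $D = 2(\ell_{\text{saturated}} - \ell_{\text{model}})$ reduces to the familiar closed form $2\sum_i [y_i \log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)]$.

```python
import math

# Illustrative Poisson deviance: compare the definition via log-likelihoods
# with the closed form. Counts y and fitted means mu are made up (all y > 0,
# avoiding the y*log(y) edge case at zero).
y = [3, 5, 2, 8]
mu = [4.0, 4.5, 3.0, 6.5]

def pois_loglik(ys, mus):
    # log f(y | mu) = y*log(mu) - mu - log(y!), summed over observations
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(ys, mus))

# Definition: D = 2 * (saturated log-likelihood - model log-likelihood),
# where the saturated model fits mu_i = y_i exactly.
dev_from_def = 2 * (pois_loglik(y, y) - pois_loglik(y, mu))

# Closed form: 2 * sum(y*log(y/mu) - (y - mu)); the log(y!) terms cancel.
dev_closed = 2 * sum(yi * math.log(yi / mi) - (yi - mi)
                     for yi, mi in zip(y, mu))
```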

Estimator

A function of data used to infer an unknown parameter. Since data are random variables, an estimator is itself a random variable that takes different values across samples. The specific numerical value obtained by applying an estimator to observed data is called an estimate.

For example, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is an estimator of the population mean $\mu$. The quality of an estimator is evaluated through properties such as consistency, unbiasedness, and asymptotic normality.

Likelihood and log-likelihood

The likelihood is the same formula as the probability density (or mass) function, read as a function of the parameter $\theta$. For a single observation, $L(\theta) = f(y \mid \theta)$; for $n$ independent observations, $L(\theta) = \prod_{i=1}^n f(y_i \mid \theta)$. While probability varies over possible data for a given parameter, likelihood varies over possible parameters for observed data.

The log-likelihood $\ell(\theta) = \log L(\theta)$ converts products of independent observations into sums, making numerical computation more tractable. Because $\log$ is a monotonically increasing function, the $\theta$ that maximizes the likelihood is the same as the $\theta$ that maximizes the log-likelihood.

Parameter estimation in GLMs (Generalized Linear Models) (GLM Fundamentals), Laplace approximation in GLMMs (GLMM Fundamentals), and Cox model partial likelihood (Survival Analysis Fundamentals) are all based on log-likelihood.

Maximum likelihood estimator (MLE)

The parameter value that maximizes the likelihood function: $\hat\theta_{\text{ML}} = \arg\max_\theta L(\theta; y)$.

When the model is correctly specified and regularity conditions (technical conditions on the smoothness of the likelihood function and the parameter space) hold, MLEs possess consistency, asymptotic normality, and asymptotic efficiency: the asymptotic variance of $\sqrt{n}(\hat\theta_n - \theta)$ equals $I(\theta)^{-1}$, where $I(\theta)$ is the Fisher information matrix for a single observation. The Cramér-Rao information inequality guarantees in finite samples that the variance of any regular unbiased estimator is at least $(nI(\theta))^{-1}$, the inverse of the Fisher information aggregated over all $n$ observations; asymptotic efficiency means that this bound is attained as $n \to \infty$, with $\operatorname{Var}(\hat\theta_n)$ of order $I(\theta)^{-1}/n$. In GLMs, the MLE generally has no closed-form solution and is computed numerically via IRLS (Iteratively Reweighted Least Squares) (GLM Fundamentals).
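
A sketch of maximum likelihood for a model where the MLE is known in closed form, the Exponential(rate) family: a simple grid search over the log-likelihood recovers (essentially) the same value as $1/\bar y$. The data and grid are illustrative.

```python
import math
import random

# Illustrative MLE for the Exponential(rate) model, where the MLE has the
# closed form 1 / sample mean. A grid search over the log-likelihood shows
# that maximizing log L finds (essentially) the same value.
random.seed(4)
data = [random.expovariate(2.0) for _ in range(500)]  # true rate = 2
n_obs, total = len(data), sum(data)

def loglik(rate):
    # log L(rate) = n*log(rate) - rate * sum(y_i)
    return n_obs * math.log(rate) - rate * total

grid = [0.01 * k for k in range(1, 1001)]  # candidate rates 0.01 .. 10.00
rate_grid = max(grid, key=loglik)
rate_closed = n_obs / total  # = 1 / sample mean
```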

Overdispersion

A condition where the observed variance in data exceeds the variance assumed by the model. Poisson and Binomial families assume the dispersion parameter $\phi = 1$, but real data often exhibit greater variability.

Overdispersion leads to underestimated standard errors and overly narrow confidence intervals. When overdispersion is detected in Poisson models, switching to Negative Binomial explicitly models the extra variance. For Binomial overdispersion, see GLM Fundamentals. Note that when the Binomial trial count is $n_i = 1$ (Bernoulli, i.e., logistic regression), the marginal variance $\mu_i(1-\mu_i)$ is fully determined by the mean $\mu_i$, so individual-level Bernoulli data cannot reveal overdispersion through Pearson $\chi^2$ or deviance diagnostics. Extra variability stemming from clustering, repeated measures, or unobserved heterogeneity can still arise, but it is handled separately via GLMMs or quasi-likelihood. Classical overdispersion detection and correction is meaningful only for grouped Binomial data with $n_i \ge 2$.
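
A simulation sketch (all choices illustrative): Poisson forces variance = mean, so a variance-to-mean ratio well above 1 in count data signals overdispersion. The counts below come from a Gamma-Poisson mixture (a negative binomial), which genuinely has more variance than a Poisson with the same mean; the Poisson sampler is Knuth's classic method.

```python
import math
import random
import statistics

# Illustrative overdispersion check: compare sample variance to sample mean.
random.seed(5)

def poisson(lam):
    """Knuth's simple Poisson sampler (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# Each observation's Poisson mean is itself Gamma(shape=5, mean=5) distributed,
# so the marginal mean is 5 but the marginal variance is 5 + 5**2/5 = 10.
shape = 5.0
counts = [poisson(random.gammavariate(shape, 5.0 / shape)) for _ in range(4000)]

dispersion = statistics.variance(counts) / statistics.fmean(counts)
# dispersion well above 1 signals overdispersion relative to Poisson
```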

Sufficient statistic

A statistic that retains all information in the data about a parameter $\theta$. Formally, $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$. Equivalently, by the Fisher-Neyman factorization theorem, $T$ is sufficient exactly when the density factors as $f(x \mid \theta) = g(T(x), \theta)\,h(x)$.

Summarizing data through a sufficient statistic loses no information relevant to estimating $\theta$. In GLMs with canonical links, $X'y$ is a sufficient statistic for $\beta$, and the log-likelihood is concave in $\beta$. When the design matrix has full rank, this guarantees uniqueness of the MLE and stable IRLS convergence (GLM Fundamentals).
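
The defining property can be checked by hand in the Bernoulli case (an illustrative example, not from the source): given $T = \sum_i x_i = t$, every sequence with $t$ successes has conditional probability $1/\binom{n}{t}$, whatever the value of $p$.

```python
import math

# Illustrative check of sufficiency for i.i.d. Bernoulli(p): T = sum(x) is
# sufficient, and P(X = x | T = t) = 1 / C(n, t) regardless of p.
x = [1, 0, 1, 1, 0]  # a specific observed sequence (made up)
n, t = len(x), sum(x)

def joint(p):
    """P(X = x) for the specific sequence under Bernoulli(p)."""
    return p ** t * (1 - p) ** (n - t)

def total(p):
    """P(T = t) = C(n, t) * p^t * (1-p)^(n-t)."""
    return math.comb(n, t) * p ** t * (1 - p) ** (n - t)

# The conditional probability is identical for different parameter values:
cond_a = joint(0.3) / total(0.3)
cond_b = joint(0.8) / total(0.8)
# both equal 1 / C(5, 3) = 0.1, free of p
```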

Unbiasedness

The property that the expected value of an estimator equals the true parameter: $E[\hat\theta] = \theta$.

The OLS estimator is unbiased under $E[\varepsilon \mid X] = 0$ (strict exogeneity). This condition means that $\varepsilon$ is uncorrelated with any measurable function of $X$, a strictly stronger condition than linear uncorrelatedness $\operatorname{Cov}(X, \varepsilon) = 0$. Neither $E[\varepsilon] = 0$ (unconditional mean zero) nor $\operatorname{Cov}(X, \varepsilon) = 0$ alone is sufficient for unbiasedness. With the additional assumptions of homoscedasticity and uncorrelated errors ($\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I$), the Gauss-Markov theorem guarantees minimum variance among linear unbiased estimators (BLUE: Best Linear Unbiased Estimator) (OLS Fundamentals). MLEs are generally biased in finite samples.
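
A classic simulation sketch of finite-sample bias (illustrative numbers): averaging estimates over many samples approximates the estimator's expectation, and the $1/(n-1)$ sample variance is unbiased for the true variance while the $1/n$ version is biased downward by the factor $(n-1)/n$.

```python
import random
import statistics

# Illustrative sketch: compare the 1/n and 1/(n-1) variance estimators on
# many small samples from Uniform(0, 1), whose true variance is 1/12.
random.seed(6)
n, reps = 5, 20000
true_var = 1.0 / 12.0  # variance of Uniform(0, 1)

biased, unbiased = [], []
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    m = statistics.fmean(xs)
    ss = sum((xi - m) ** 2 for xi in xs)
    biased.append(ss / n)          # E = (n-1)/n * sigma^2: too small
    unbiased.append(ss / (n - 1))  # E = sigma^2: unbiased

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)
```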

Variance function

In exponential family distributions, the function $V(\mu)$ that determines the mean-variance relationship: $\operatorname{Var}(Y) = V(\mu) \cdot a(\phi)$. $V(\mu)$ is the second derivative of the log-partition function $b(\theta)$, expressed as a function of $\mu$.

The scaling factor $a(\phi)$ takes different forms depending on the family. For Gaussian and Gamma, $a(\phi) = \phi$, the dispersion parameter itself. For Poisson, $a(\phi) = 1$, a constant. For Binomial, $a(\phi) = 1/n_i$, depending on the per-observation number of trials $n_i$. Here $n_i$ is the number of trials in the $i$-th observation and is distinct from the overall sample size $n$. In Poisson and Binomial, $\phi$ is fixed at $1$, leaving no room for scaling through $\phi$.

For Poisson, $V(\mu) = \mu$; for Binomial, $V(\mu) = \mu(1-\mu)$; for Gamma, $V(\mu) = \mu^2$ (GLM Fundamentals).
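
A simulation sketch of $\operatorname{Var}(Y) = V(\mu) \cdot a(\phi)$ for the Binomial family, where the response is a proportion: $V(\mu) = \mu(1-\mu)$ and $a(\phi) = 1/n_i$ give $\operatorname{Var}(Y) = \mu(1-\mu)/n_i$. The values of $\mu$ and $n_i$ are made up for illustration.

```python
import random
import statistics

# Illustrative check: the variance of a Binomial proportion Y = S / n_i should
# match V(mu) * a(phi) = mu * (1 - mu) / n_i.
random.seed(7)
mu, n_i = 0.3, 20

props = []
for _ in range(20000):
    successes = sum(random.random() < mu for _ in range(n_i))
    props.append(successes / n_i)

theoretical = mu * (1 - mu) / n_i       # V(mu) * a(phi)
simulated = statistics.variance(props)  # should be close to theoretical
```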

References