---
title: Glossary of Statistical Terms
description: Definitions of statistical terms used in MIDAS documentation, including estimator, convergence concepts, consistency, unbiasedness, asymptotic normality, sufficient statistic, MLE, likelihood, deviance, overdispersion, and variance function, as well as visualization methods such as kernel density estimation and LOESS.
priority: 0.5
---

# Glossary of Statistical Terms {#glossary-of-statistical-terms}

Definitions of statistical terms used in MIDAS documentation. Terms are listed in roughly alphabetical order.

## Asymptotic normality {#asymptotic-normality}

The property that the distribution of an [estimator](#estimator) [converges in distribution](#convergence-in-distribution) to a normal distribution as the sample size $n \to \infty$. Under appropriate normalization,

$$
\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N(0, V)
$$

The $d$ above the arrow stands for "distribution." $V$ is the asymptotic variance (or the asymptotic covariance matrix when $\hat\theta_n$ is a vector) and depends on the type of estimator. [MLEs](#mle) possess asymptotic normality under regularity conditions. For OLS (Ordinary Least Squares), normality of the errors is not required: under standard moment conditions, the central limit theorem ensures that $\sqrt{n}(\hat\beta - \beta)$ converges in distribution to a normal distribution as $n \to \infty$ ([OLS Fundamentals](concepts-regression)).

## Consistency {#consistency}

The property that an [estimator](#estimator) $\hat\theta_n$ [converges in probability](#convergence-in-probability) to the true parameter $\theta$ as $n \to \infty$, written $\hat\theta_n \xrightarrow{p} \theta$.

Consistency is a basic requirement for estimators: it guarantees that estimates approach the true value as data accumulate. Consistency alone says nothing about estimation precision at finite sample sizes. The OLS estimator is consistent when $\operatorname{plim}(X'\varepsilon/n) = 0$ ($\operatorname{plim}$ denotes the [probability limit](#convergence-in-probability)) and the probability limit $Q = \operatorname{plim}(X'X/n)$ is nonsingular. Homoscedasticity and uncorrelated errors (required by Gauss-Markov) are not needed for consistency ([OLS Fundamentals](concepts-regression)).

## Convergence in distribution {#convergence-in-distribution}

A mode of convergence for a sequence of random variables $X_n$ where the distribution approaches another distribution as $n \to \infty$. Formally, $X_n \xrightarrow{d} X$ if the distribution functions satisfy $F_n(x) \to F(x)$ at every continuity point of $F$.

The $d$ above the arrow stands for "distribution." Convergence in distribution holds as long as the shapes of the distributions of $X_n$ and $X$ approach each other; it does not require the values of $X_n$ and $X$ themselves to be close. In contrast, [convergence in probability](#convergence-in-probability) requires the values themselves to be close: $|X_n - X|$ must become small with high probability. Convergence in probability implies convergence in distribution; the converse holds only when the limit is a constant. [Asymptotic normality](#asymptotic-normality) is defined using this concept.
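A standard textbook example (not specific to MIDAS) shows the gap between the two modes. Take $X \sim N(0, 1)$ and set $X_n = -X$ for every $n$. By symmetry each $X_n$ has the same distribution as $X$, so convergence in distribution holds trivially, while the values never get close:

$$
X_n = -X \sim N(0, 1) \;\Rightarrow\; X_n \xrightarrow{d} X, \qquad P(|X_n - X| > \varepsilon) = P(2|X| > \varepsilon) \not\to 0 .
$$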

## Convergence in probability {#convergence-in-probability}

A mode of convergence for a sequence of random variables $X_n$ toward a random variable $X$. For every $\varepsilon > 0$,

$$
P(|X_n - X| > \varepsilon) \to 0 \quad (n \to \infty)
$$

Written $X_n \xrightarrow{p} X$. The $p$ above the arrow stands for "probability." When $X$ is a constant $c$, this means that the probability $X_n$ deviates from $c$ by more than $\varepsilon$ vanishes as $n$ grows. [Consistency](#consistency) of an [estimator](#estimator) is defined as convergence in probability to the true parameter $\theta$ (a constant). The notation $\operatorname{plim} X_n = c$ (probability limit) is equivalent to $X_n \xrightarrow{p} c$.
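As a worked illustration, assume i.i.d. observations with mean $\mu$ and finite variance $\sigma^2$ (assumptions added here for the example). Chebyshev's inequality then bounds the probability in the definition for the sample mean $\bar{X}_n$:

$$
P(|\bar{X}_n - \mu| > \varepsilon) \le \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n \varepsilon^2} \to 0 \quad (n \to \infty)
$$

Hence $\bar{X}_n \xrightarrow{p} \mu$, that is, $\operatorname{plim} \bar{X}_n = \mu$.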

## Delta method {#delta-method}

A technique for approximating the variance of a nonlinear function $g(\hat\theta)$ of an estimator. A first-order Taylor expansion of $g$ around the true value $\theta$ gives

$$\operatorname{Var}(g(\hat\theta)) \approx g'(\theta)^2 \operatorname{Var}(\hat\theta)$$

In the multivariate case, use the gradient vector $\nabla g$ and the variance-covariance matrix $\Sigma$: $\nabla g^\top \Sigma \, \nabla g$.

The delta method relies on [asymptotic normality](#asymptotic-normality) and may be inaccurate in small samples. The linearization error also grows when $g$ has high curvature near $\theta$ or when $\hat\theta$ has large variance. When $g'(\theta) = 0$ at the true value, the first-order delta method degenerates and $g(\hat\theta)$ no longer has a normal asymptotic distribution; the second-order delta method (incorporating quadratic terms) is needed instead.
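The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not MIDAS code: the function names and the numeric inputs are hypothetical, and the derivative is evaluated at the estimate $\hat\theta$ (the usual plug-in step) because the true $\theta$ is unknown.

```python
def delta_method_se(g_prime, theta_hat, se_theta):
    """First-order delta-method standard error of g(theta_hat).

    g_prime is the derivative of g; it is evaluated at the estimate
    (plug-in), since the true parameter is unknown in practice.
    """
    return abs(g_prime(theta_hat)) * se_theta


def delta_method_var(grad, cov):
    """Multivariate case: grad' * cov * grad, with plain nested lists."""
    k = len(grad)
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(k) for j in range(k))


# Univariate example: g(theta) = log(theta), so g'(theta) = 1 / theta.
theta_hat, se_theta = 2.0, 0.1          # hypothetical estimate and SE
se_log = delta_method_se(lambda t: 1.0 / t, theta_hat, se_theta)

# Bivariate example: the ratio g(b0, b1) = b0 / b1
# has gradient (1 / b1, -b0 / b1**2).
b0, b1 = 1.0, 2.0                       # hypothetical estimates
cov = [[0.04, 0.01],
       [0.01, 0.09]]                    # hypothetical covariance matrix
grad = [1.0 / b1, -b0 / b1 ** 2]
var_ratio = delta_method_var(grad, cov)
```

The ratio example is the same function that Fieller's method treats without linearization.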

## Fieller's method {#fieller-method}

A method for constructing a confidence interval for the ratio $\rho = \beta_0 / \beta_1$ of two parameters. It exploits the (asymptotic) bivariate normality of $(\hat\beta_0, \hat\beta_1)$.

Under the hypothesis that the true ratio is $\rho$, we have $\beta_0 - \rho \beta_1 = 0$, so the statistic $\hat\beta_0 - \rho \hat\beta_1$ has mean zero and variance

$$
V(\rho) = \operatorname{Var}(\hat\beta_0) - 2\rho \operatorname{Cov}(\hat\beta_0, \hat\beta_1) + \rho^2 \operatorname{Var}(\hat\beta_1).
$$

The statistic $(\hat\beta_0 - \rho \hat\beta_1)^2 / V(\rho)$ follows a $\chi^2_1$ distribution when the variances and covariance are known, so the $1 - \alpha$ confidence set is the collection of $\rho$ for which this quantity does not exceed the critical value $c$ (the upper $\alpha$ point of $\chi^2_1$). Rearranging $(\hat\beta_0 - \rho \hat\beta_1)^2 \le c \cdot V(\rho)$ yields a quadratic inequality in $\rho$:

$$
A \rho^2 + B \rho + C \le 0,
$$

where $A = \hat\beta_1^2 - c \cdot \operatorname{Var}(\hat\beta_1)$, $B = -2\bigl(\hat\beta_0 \hat\beta_1 - c \cdot \operatorname{Cov}(\hat\beta_0, \hat\beta_1)\bigr)$, and $C = \hat\beta_0^2 - c \cdot \operatorname{Var}(\hat\beta_0)$. The sign of $A$ and the discriminant $D = B^2 - 4 A C$ determine the shape of the confidence set.

- If $A > 0$, the set is a finite interval $[\rho_-, \rho_+]$. The condition $A > 0$ is equivalent to $\hat\beta_1^2 / \operatorname{Var}(\hat\beta_1) > c$, i.e., the Wald test for $\beta_1 = 0$ with the same critical value $c$ rejects the null.
- If $A < 0$ and $D \ge 0$, the set is an unbounded union $(-\infty, \rho_-] \cup [\rho_+, \infty)$. This happens when the Wald test does not reject $\beta_1 = 0$.
- If $A < 0$ and $D < 0$, the set is the entire real line $\mathbb{R}$, meaning that the data provide no information about $\rho$ at this confidence level.

Unlike the delta method, which linearizes $g(\hat\beta_0, \hat\beta_1) = \hat\beta_0 / \hat\beta_1$ via a Taylor expansion, Fieller's method avoids linearization. When $\hat\beta_1$ is close to zero, the delta-method approximation breaks down; Fieller's method instead reflects this uncertainty through unbounded or all-real-line confidence sets. It is exact when the estimators are exactly normal and is an approximation under asymptotic normality, as in GLMs. When the variances are estimated from residuals (as in OLS), the original formulation uses $t^2_{n-p}$ (equivalently $F_{1, n-p}$) as the critical value rather than $\chi^2_1$ ([Fieller, 1954](#ref-fieller-1954)).
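The case analysis above can be sketched as follows. This is an illustrative implementation under stated assumptions, not MIDAS code: the function name, the return labels, and the numeric inputs are hypothetical, the critical value is the $\chi^2_1$ one, and the boundary case $A = 0$ is reported rather than solved.

```python
import math


def fieller_confidence_set(b0, b1, var0, var1, cov01, crit):
    """Solve A*rho^2 + B*rho + C <= 0 for the ratio rho = beta0 / beta1.

    crit is the critical value c. Returns (kind, roots), where kind is
    "interval", "union", "real_line", or "boundary", and roots is
    (rho_minus, rho_plus) or None.
    """
    A = b1 ** 2 - crit * var1
    B = -2.0 * (b0 * b1 - crit * cov01)
    C = b0 ** 2 - crit * var0
    D = B ** 2 - 4.0 * A * C
    if A == 0.0:
        # The quadratic degenerates to a linear inequality; not handled here.
        return ("boundary", None)
    if A < 0.0 and D < 0.0:
        return ("real_line", None)
    # For A > 0 and a valid covariance matrix D is nonnegative;
    # max() only guards against rounding noise.
    root = math.sqrt(max(D, 0.0))
    r1 = (-B - root) / (2.0 * A)
    r2 = (-B + root) / (2.0 * A)
    roots = (min(r1, r2), max(r1, r2))
    # A > 0: [rho_minus, rho_plus]; A < 0: (-inf, rho_minus] U [rho_plus, inf)
    return ("interval" if A > 0.0 else "union", roots)


CRIT_95 = 3.841459  # upper 5% point of the chi-square distribution with 1 df

# Hypothetical estimates: beta1 is well separated from zero, so A > 0.
kind, (rho_minus, rho_plus) = fieller_confidence_set(
    2.0, 4.0, 0.25, 0.25, 0.0, CRIT_95
)
```

With these inputs the Wald statistic for $\beta_1 = 0$ is $16 / 0.25 = 64$, far above the critical value, so the set is a finite interval around the point estimate $2/4 = 0.5$.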

## Deviance {#deviance}

A measure of model fit based on the [log-likelihood](#likelihood) difference from the saturated model:

$$
D = 2(\ell_{\text{saturated}} - \ell_{\text{model}})
$$

The saturated model assigns an individual parameter to each covariate pattern and reproduces the data exactly (its residuals are identically zero). A covariate pattern is a group of observations sharing the same combination of predictor values. For individual-level observations each observation typically forms its own pattern, so the number of parameters equals the number of observations. For data pre-aggregated by covariate pattern (for instance, counts of successes and trials for each pattern), the number of parameters equals the number of patterns.

Following McCullagh & Nelder's terminology, the quantity $D$ defined above is the *scaled deviance*, and multiplying it by the dispersion parameter $\phi$ gives the *unscaled deviance* $D^* = \phi D$ (the quantity commonly called "deviance"). Note that McCullagh & Nelder attach the asterisk to the scaled form, the reverse of the symbols used here. For Poisson and Binomial with $\phi = 1$, the two coincide. For the Gaussian family, the scaled deviance equals the fitted model's residual sum of squares divided by the error variance, $\text{RSS}_\text{model}/\sigma^2$, and the unscaled deviance is $\text{RSS}_\text{model}$ itself. MIDAS reports the unscaled form.

Deviance thus generalizes the residual sum of squares to any exponential family distribution, with larger values indicating poorer fit. In GLMMs (Generalized Linear Mixed Models), penalized deviance is used for parameter estimation ([GLMM Fundamentals](concepts-glmm#parameter-estimation)).
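For a concrete case, the Poisson deviance can be computed either from the definition or from its closed form $2\sum_i \{y_i \log(y_i/\mu_i) - (y_i - \mu_i)\}$. The sketch below checks that the two agree; the function names and data are hypothetical, each observation is treated as its own covariate pattern, and $0 \cdot \log 0$ is taken as $0$.

```python
import math


def poisson_loglik(y, mu):
    """Poisson log-likelihood; the term y*log(mu) is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(mi) if yi > 0 else 0.0
        total += log_term - mi - math.lgamma(yi + 1)
    return total


def poisson_deviance(y, mu):
    """Closed form: 2 * sum(y*log(y/mu) - (y - mu))."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


y = [0, 1, 2]            # hypothetical counts
mu = [1.0, 1.0, 1.0]     # hypothetical fitted means

# Definition: the saturated model sets each fitted mean equal to y_i.
d_definition = 2.0 * (poisson_loglik(y, y) - poisson_loglik(y, mu))
d_closed_form = poisson_deviance(y, mu)
```

Because $\phi = 1$ for Poisson, the scaled and unscaled forms coincide here.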

## Estimator {#estimator}

A function of data used to infer an unknown parameter. Since data are random variables, an estimator is itself a random variable that takes different values across samples. The specific numerical value obtained by applying an estimator to observed data is called an estimate.

For example, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is an estimator of the population mean $\mu$. The quality of an estimator is evaluated through properties such as [consistency](#consistency), [unbiasedness](#unbiasedness), and [asymptotic normality](#asymptotic-normality).

## Kernel density estimation {#kde}

A nonparametric method for estimating the probability density function of data. A kernel function (typically a Gaussian kernel) is placed at each data point, and summing the kernels yields a smooth density curve. For $n$ data points $x_1, \ldots, x_n$, the estimated density is

$$
\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
$$

where $K$ is the kernel function and $h$ is the bandwidth. The bandwidth controls the spread of each kernel: larger values produce smoother curves, while smaller values reflect finer structure in the data.

Whereas a histogram depends on where its bin boundaries fall, kernel density estimation produces a smooth distribution directly from the data. MIDAS uses this method for the density curve on histograms (univariate) and the density contours on scatter plots (bivariate) in [Graph Builder](graph-basics).

For the univariate case, the bandwidth is computed automatically by Silverman's rule of thumb, $h = 0.9 \cdot \min(\hat\sigma, \mathrm{IQR}/1.34) \cdot n^{-1/5}$, where $\hat\sigma$ is the square root of the unbiased variance (dividing by $n-1$). This rule is derived under the assumption of a unimodal, approximately normal distribution, so it can oversmooth multimodal distributions and flatten their peaks. Compare the density curve with histograms at different bin counts to check whether it captures the features of the distribution.
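The univariate estimator and bandwidth rule can be sketched with the standard library alone. This is an illustration rather than the MIDAS implementation: the function names and sample are hypothetical, and the quartile convention used for the IQR (`method="inclusive"`) is an assumption, since the text above does not specify one.

```python
import math
import statistics


def silverman_bandwidth(data):
    """h = 0.9 * min(sigma_hat, IQR / 1.34) * n**(-1/5)."""
    n = len(data)
    sigma_hat = statistics.stdev(data)  # square root of the n-1 variance
    # Quartile convention is an assumption; other conventions shift h slightly.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return 0.9 * min(sigma_hat, (q3 - q1) / 1.34) * n ** (-1 / 5)


def gaussian_kde(data, h):
    """Return f_hat(x) = 1/(n*h) * sum K((x - x_i) / h) with a Gaussian K."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))

    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    return f_hat


sample = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 4.0, 4.8]  # hypothetical data
h = silverman_bandwidth(sample)
f_hat = gaussian_kde(sample, h)
density_at_3 = f_hat(3.0)
```

Doubling `h` and re-evaluating `f_hat` on a grid is a quick way to see the smoothing effect described above.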

For the bivariate case, the bandwidth is computed automatically based on Scott's rule, $h = n^{-1/6} \cdot \bar\sigma$. This calculation uses the drawing coordinates after axis scaling, not data units. $\bar\sigma$ is the average of the standard deviations in the X and Y directions measured in drawing coordinates. Scott's rule originally uses a separate bandwidth $h_j = n^{-1/6} \cdot \hat\sigma_j$ for each axis, but MIDAS applies a single bandwidth, computed from the averaged standard deviation $\bar\sigma$, equally in both directions. Differences in units between the two axes are absorbed by the axis scaling. Because the bandwidth is determined in drawing coordinates, however, the estimated contours depend on the aspect ratio and display size of the plot.

## LOESS (locally weighted regression) {#loess}

A nonparametric regression method that represents the overall trend of data as a smooth curve by repeating local weighted regressions. LOESS stands for LOcally Estimated Scatterplot Smoothing.

For each point $x_0$, a first-degree polynomial (local linear regression) is fitted to nearby data points with weights based on distance, giving the predicted value at $x_0$. Repeating this procedure for every point builds the whole curve. The weight function is the tricube function $w(u) = (1 - |u|^3)^3$ for $|u| < 1$, with $w(u) = 0$ for $|u| \ge 1$. Here $u$ is the distance from $x_0$ normalized by the maximum distance within the neighborhood, so points outside the neighborhood receive no weight.

The span is a parameter between 0 and 1 that specifies the fraction of data points used for each prediction. Larger values use more data points in each regression, producing smoother curves. In MIDAS, the Smooth statistic in [Custom Graph](custom-graph#layers---overlaying-layers) lets you select LOESS and adjust the span from 0.1 to 1.0. The default is 0.75, which uses 75% of all data points as the neighborhood of each point. The trend lines on the datetime histogram in [Graph Builder](graph-basics#datetime-histogram), the **Date Distribution** in the [Statistics](basic-statistics) tab, and the [Schoenfeld residual plot](survival-analysis#scaled-schoenfeld-residuals) for Cox regression are also drawn with LOESS, with the span fixed at 0.75.
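The procedure can be sketched as a pure-Python local linear fit with tricube weights. This is a minimal illustration, not the MIDAS implementation: the function names are hypothetical, the neighborhood size is assumed to be the span times $n$ rounded up, and no robustness iterations are performed.

```python
import math


def tricube(u):
    """w(u) = (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, span=0.75):
    """Predict at x0 from a weighted local linear regression."""
    n = len(xs)
    k = min(n, max(2, math.ceil(span * n)))   # assumed rounding rule
    d_max = sorted(abs(x - x0) for x in xs)[k - 1]
    w = [tricube((x - x0) / d_max) if d_max > 0.0 else 0.0 for x in xs]
    sw = sum(w)
    if sw == 0.0:
        # Every neighbor sits exactly at d_max: fall back to a plain mean.
        near = [y for x, y in zip(xs, ys) if abs(x - x0) <= d_max]
        return sum(near) / len(near)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0.0:
        return my                              # no spread in x: local mean
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]               # hypothetical, exactly linear
fitted = [loess_point(x, xs, ys, span=0.75) for x in xs]
```

On exactly linear data a local linear fit reproduces the line, which makes a convenient sanity check before trying noisy data or other spans.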

## Likelihood and log-likelihood {#likelihood}

The likelihood is the same formula as the probability density (or mass) function, read as a function of the parameter $\theta$. For a single observation, $L(\theta) = f(y \mid \theta)$; for $n$ independent observations, $L(\theta) = \prod_{i=1}^n f(y_i \mid \theta)$. While probability varies over possible data for a given parameter, likelihood varies over possible parameters for observed data.

The log-likelihood $\ell(\theta) = \log L(\theta)$ converts products of independent observations into sums, making numerical computation more tractable. Because $\log$ is a monotonically increasing function, the $\theta$ that maximizes the likelihood is the same as the $\theta$ that maximizes the log-likelihood.
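A small Bernoulli example shows both points: the raw likelihood underflows to zero in floating point for a few thousand observations, while the log-likelihood stays finite and has the same maximizer. The function names and data are hypothetical.

```python
import math


def bernoulli_likelihood(p, ys):
    """Product of f(y_i | p) over independent 0/1 observations."""
    out = 1.0
    for y in ys:
        out *= p if y == 1 else 1.0 - p
    return out


def bernoulli_loglik(p, ys):
    """Sum of log f(y_i | p)."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in ys)


ys = [1, 0] * 1000                       # 2000 observations, half successes

likelihood = bernoulli_likelihood(0.5, ys)   # 0.5**2000 underflows to 0.0
loglik = bernoulli_loglik(0.5, ys)           # 2000 * log(0.5), still finite

# Grid search on the log scale; the maximizer is the sample proportion.
grid = [i / 100 for i in range(1, 100)]
p_best = max(grid, key=lambda p: bernoulli_loglik(p, ys))
```

Maximizing `bernoulli_likelihood` over the same grid would be uninformative here, because every candidate evaluates to `0.0`.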

Parameter estimation in GLMs (Generalized Linear Models) ([GLM Fundamentals](concepts-glm#parameter-estimation-irls)), Laplace approximation in GLMMs ([GLMM Fundamentals](concepts-glmm#parameter-estimation)), and Cox model partial likelihood ([Survival Analysis Fundamentals](concepts-survival#partial-likelihood)) are all based on log-likelihood.

## Maximum likelihood estimator (MLE) {#mle}

The parameter value that maximizes the [likelihood](#likelihood) function: $\hat\theta_{\text{ML}} = \arg\max_\theta L(\theta; y)$.

When the model is correctly specified and regularity conditions (technical conditions on the smoothness of the likelihood function and the parameter space) hold, MLEs possess [consistency](#consistency), [asymptotic normality](#asymptotic-normality), and asymptotic efficiency: the asymptotic variance of $\sqrt{n}(\hat\theta_n - \theta)$ equals $I(\theta)^{-1}$, where $I(\theta)$ is the Fisher information matrix for a single observation. The Cramér-Rao information inequality guarantees in finite samples that the variance of any regular unbiased estimator is at least $(nI(\theta))^{-1}$, the inverse of the Fisher information aggregated over all $n$ observations; asymptotic efficiency means that this bound is attained as $n \to \infty$, with $\operatorname{Var}(\hat\theta_n)$ of order $I(\theta)^{-1}/n$. In GLMs, the MLE has no closed-form solution and is computed numerically via IRLS (Iteratively Reweighted Least Squares) ([GLM Fundamentals](concepts-glm#parameter-estimation-irls)).
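As a worked example (a standard textbook case with a closed-form solution, unlike the GLM setting), take $n$ i.i.d. Poisson observations with mean $\lambda$:

$$
\ell(\lambda) = \sum_{i=1}^n \bigl( y_i \log\lambda - \lambda - \log y_i! \bigr), \qquad \frac{\partial \ell}{\partial \lambda} = \frac{1}{\lambda}\sum_{i=1}^n y_i - n = 0 \;\Rightarrow\; \hat\lambda_{\text{ML}} = \bar{y}
$$

The Fisher information for a single observation is $I(\lambda) = 1/\lambda$, so $\sqrt{n}(\hat\lambda_{\text{ML}} - \lambda) \xrightarrow{d} N(0, \lambda)$. In this case $\operatorname{Var}(\bar{y}) = \lambda / n = (nI(\lambda))^{-1}$, so the Cramér-Rao bound is attained exactly at every $n$.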

## Overdispersion {#overdispersion}

A condition where the observed variance in data exceeds the variance assumed by the model. Poisson and Binomial families assume the dispersion parameter $\phi = 1$, but real data often exhibit greater variability.

Overdispersion leads to underestimated standard errors and overly narrow confidence intervals. When overdispersion is detected in Poisson models, switching to Negative Binomial explicitly models the extra variance. For Binomial overdispersion, see [GLM Fundamentals](concepts-glm#variance-functions-and-overdispersion). Note that when the Binomial trial count is $n_i = 1$ (Bernoulli, i.e., logistic regression), the marginal variance $\mu_i(1-\mu_i)$ is fully determined by the mean $\mu_i$, so individual-level Bernoulli data cannot reveal overdispersion through Pearson $\chi^2$ or deviance diagnostics. Extra variability stemming from clustering, repeated measures, or unobserved heterogeneity can still arise, but it is handled separately via GLMMs or quasi-likelihood. Classical overdispersion detection and correction are meaningful only for grouped Binomial data with $n_i \ge 2$.
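One common diagnostic is the Pearson $\chi^2$ statistic divided by the residual degrees of freedom; values well above 1 suggest overdispersion. The sketch below is illustrative only: the function name and data are hypothetical, and it is not a statement of how MIDAS computes or reports the statistic.

```python
def pearson_dispersion(y, mu, variance, n_params):
    """Pearson chi-square / residual df as a rough estimate of phi.

    variance is the variance function V(mu) of the assumed family,
    and n_params is the number of estimated regression coefficients.
    """
    chi2 = sum((yi - mi) ** 2 / variance(mi) for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)


# Hypothetical counts fitted with an intercept-only Poisson model,
# so every fitted mean equals the sample mean and V(mu) = mu.
y = [0, 2, 4, 10]
mu = [4.0, 4.0, 4.0, 4.0]
phi_hat = pearson_dispersion(y, mu, variance=lambda m: m, n_params=1)
```

Here the statistic is $14 / 3 \approx 4.67$, well above the value of 1 that the Poisson family assumes.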

## Sufficient statistic {#sufficient-statistic}

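A statistic that retains all information in the data about a parameter $\theta$. Formally, $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$. By the Fisher-Neyman factorization theorem, this is equivalent to the density factoring as $f(x \mid \theta) = g(T(x), \theta)\, h(x)$, where $h$ does not depend on $\theta$.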

Summarizing data through a sufficient statistic loses no information relevant to estimating $\theta$. In GLMs with canonical links, $X'y$ is a sufficient statistic for $\beta$, and the [log-likelihood](#likelihood) is concave in $\beta$. When the design matrix has full rank, this guarantees uniqueness of the [MLE](#mle) whenever it exists, and stable IRLS convergence ([GLM Fundamentals](concepts-glm#link-functions)).
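A standard worked example: for $n$ independent Bernoulli observations $x_1, \ldots, x_n$ with success probability $p$, the joint mass function depends on the data only through the number of successes,

$$
f(x \mid p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{T(x)}(1-p)^{n - T(x)}, \qquad T(x) = \sum_{i=1}^n x_i
$$

so $T(X)$ is sufficient for $p$: once the number of successes is known, the order in which they occurred carries no further information about $p$.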

## Unbiasedness {#unbiasedness}

The property that the expected value of an [estimator](#estimator) equals the true parameter: $E[\hat\theta] = \theta$.

The OLS estimator is unbiased under $E[\varepsilon \mid X] = 0$ (strict exogeneity). This condition means that $\varepsilon$ is uncorrelated with any measurable function of $X$ — a strictly stronger condition than linear uncorrelatedness $\operatorname{Cov}(X, \varepsilon) = 0$. Neither $E[\varepsilon] = 0$ (unconditional mean zero) nor $\operatorname{Cov}(X, \varepsilon) = 0$ alone is sufficient for unbiasedness. With the additional assumptions of homoscedasticity and uncorrelated errors ($\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I$), the Gauss-Markov theorem guarantees minimum variance among linear unbiased estimators (BLUE: Best Linear Unbiased Estimator) ([OLS Fundamentals](concepts-regression)). [MLEs](#mle) are generally biased in finite samples.

## Variance function {#variance-function}

In exponential family distributions, the function $V(\mu)$ that determines the mean-variance relationship: $\operatorname{Var}(Y) = V(\mu) \cdot a(\phi)$. $V(\mu)$ is the second derivative of the log-partition function $b(\theta)$, expressed as a function of $\mu$.

The scaling factor $a(\phi)$ takes different forms depending on the family. For Gaussian and Gamma, $a(\phi) = \phi$, the dispersion parameter itself. For Poisson, $a(\phi) = 1$, a constant. For Binomial, $a(\phi) = 1/n_i$, depending on the per-observation number of trials $n_i$. Here $n_i$ is the number of trials in the $i$-th observation and is distinct from the overall sample size $n$. In Poisson and Binomial, $\phi$ is fixed at $1$, leaving no room for scaling through $\phi$.

For Gaussian, $V(\mu) = 1$; for Poisson, $V(\mu) = \mu$; for Binomial, $V(\mu) = \mu(1-\mu)$; for Gamma, $V(\mu) = \mu^2$ ([GLM Fundamentals](concepts-glm#variance-functions-and-overdispersion)).
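These forms follow from differentiating $b(\theta)$ twice, using $\mu = b'(\theta)$. The expressions for $b(\theta)$ below are the standard canonical forms and are stated here as background, with the Binomial response taken as a proportion:

$$
\begin{aligned}
\text{Poisson:} \quad & b(\theta) = e^{\theta}, & \mu &= e^{\theta}, & b''(\theta) &= e^{\theta} = \mu \\
\text{Binomial:} \quad & b(\theta) = \log(1 + e^{\theta}), & \mu &= \frac{e^{\theta}}{1 + e^{\theta}}, & b''(\theta) &= \mu(1 - \mu) \\
\text{Gamma:} \quad & b(\theta) = -\log(-\theta), & \mu &= -\frac{1}{\theta}, & b''(\theta) &= \frac{1}{\theta^2} = \mu^2
\end{aligned}
$$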

## References {#references}

- <span id="ref-fieller-1954">Fieller, E. C. (1954). Some problems in interval estimation. *Journal of the Royal Statistical Society: Series B*, 16(2), 175-185. https://www.jstor.org/stable/2984043</span>