---
title: OLS Fundamentals
description: Mathematical foundations of Ordinary Least Squares. Covers normal equations, Gauss-Markov theorem, standardized residuals, Cook's Distance, and VIF.
priority: 0.5
---

# OLS Fundamentals {#ols-fundamentals}

This page covers the statistical theory behind the [Linear Regression](linear-regression) tab. See that page for usage instructions.

## Model Formulation {#model-formulation}

The linear regression model is formulated as:

$$
Y = X\beta + \varepsilon
$$

where $Y$ is the $n \times 1$ response vector, $X$ is the $n \times p$ design matrix (predictors plus an intercept column), $\beta$ is the $p \times 1$ coefficient vector, and $\varepsilon$ is the $n \times 1$ error vector.

The OLS estimator minimizes the residual sum of squares $\|Y - X\beta\|^2$. Setting the gradient to zero gives the normal equations $X'X\hat\beta = X'Y$, which have a unique solution when $X$ has full column rank:

$$
\hat\beta = (X'X)^{-1}X'Y
$$
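
As a minimal sketch of the estimator above, the following NumPy snippet solves the normal equations on hypothetical toy data (the arrays `x` and `y` are made up for illustration and are not taken from the application). It solves the linear system rather than forming $(X'X)^{-1}$ explicitly, and cross-checks the result against `np.linalg.lstsq`, which minimizes $\|Y - X\beta\|^2$ directly.

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2*x exactly, so OLS should recover (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix with an intercept column (n x p with p = 2).
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X'X) beta = X'y without forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that minimizes ||y - X beta||^2 directly;
# it is usually the more numerically stable choice.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat, beta_lstsq)
```

Both routes agree when $X$ has full column rank. With nearly collinear predictors, the least-squares solver is typically the safer option.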

The properties of this estimator depend on the assumptions placed on $\varepsilon$. [Consistency](glossary#consistency) is an asymptotic property (convergence as $n \to \infty$) and [unbiasedness](glossary#unbiasedness) is a finite-sample property ($E[\hat\beta] = \beta$); neither implies the other. For example, [maximum likelihood estimators](glossary#mle) are consistent under regularity conditions but are often biased in finite samples.

**Consistency**: Under $\operatorname{plim}(X'\varepsilon/n) = 0$ (where $\operatorname{plim}$ denotes the [probability limit](glossary#convergence-in-probability)) and $\operatorname{plim}(X'X/n)$ nonsingular, the OLS estimator is consistent. Homoscedasticity and uncorrelated errors are not required. However, when predictors are correlated with the error term, for example through omission of relevant variables or measurement error in the predictors, this condition fails and the OLS estimator loses consistency.

**Unbiasedness**: Under $E[\varepsilon \mid X] = 0$, the OLS estimator is unbiased. Homoscedasticity is not required.
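
The claim that unbiasedness does not need homoscedasticity can be checked with a small Monte Carlo sketch. Everything here is an illustrative assumption: the true coefficients, the fixed design, the error standard deviation `0.2 + 0.8 * x`, and the number of replications.

```python
import numpy as np

rng = np.random.default_rng(0)

n, reps = 50, 2000
beta_true = np.array([1.0, 2.0])  # hypothetical true coefficients

# Fixed design: intercept plus one predictor on [0, 1].
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])

estimates = np.empty((reps, 2))
for r in range(reps):
    # Heteroscedastic errors: the standard deviation grows with x,
    # but E[eps | X] = 0 still holds.
    eps = rng.normal(0.0, 0.2 + 0.8 * x)
    y = X @ beta_true + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

# Average of the OLS estimates across replications.
mean_estimate = estimates.mean(axis=0)
print(mean_estimate)
```

The average estimate should land close to `beta_true` even though the error variance is not constant. The simulation illustrates the result but does not prove it.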

**BLUE**: Under $E[\varepsilon \mid X] = 0$ and $\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I$ (homoscedastic and uncorrelated), the Gauss-Markov theorem guarantees the OLS estimator has minimum variance among linear unbiased estimators (Best Linear Unbiased Estimator). Here, a linear estimator is one that can be written as $CY$, where $C$ is a matrix that depends only on $X$.

**Normality-based inference**: Under $\varepsilon \sim N(0, \sigma^2 I)$, $\hat\beta$ is exactly normally distributed in finite samples (conditional on $X$), and the $t$-based $100(1-\alpha)\%$ confidence interval $\hat\beta \pm t_{\alpha/2,\, n-p} \times \operatorname{SE}(\hat\beta)$ has exact coverage. Without normality, if the errors have finite variance, the central limit theorem ensures asymptotic normality of $\hat\beta$ in large samples, and the coverage of the confidence interval approaches the nominal level. The required sample size depends on the true error distribution, so no universal threshold applies. If the residual Q-Q plot shows strong skewness or heavy tails, the asymptotic approximation becomes less reliable (see [residual diagnostic plots](linear-regression#residual-diagnostics)).
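
The $t$-based interval can be sketched as follows, using the standard result $\operatorname{Var}(\hat\beta \mid X) = \sigma^2 (X'X)^{-1}$ to obtain the standard errors. The data are hypothetical. The critical value `t_crit` is hard-coded from a standard $t$ table for $n - p = 3$ degrees of freedom and $\alpha = 0.05$; if SciPy is available, `scipy.stats.t.ppf(0.975, n - p)` returns the same quantity.

```python
import numpy as np

# Hypothetical toy data, not taken from the application.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Unbiased estimate of the error variance: RSS / (n - p).
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)

# Standard errors: square roots of the diagonal of sigma^2 (X'X)^{-1}.
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# Upper 2.5% point of the t distribution with 3 degrees of freedom
# (assumption: alpha = 0.05, value copied from a t table).
t_crit = 3.182
ci_lower = beta_hat - t_crit * se
ci_upper = beta_hat + t_crit * se

print(np.column_stack([beta_hat, se, ci_lower, ci_upper]))
```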

OLS is a special case of [GLM](concepts-glm) (Gaussian family with identity link).

## Standardized Residuals and Diagnostic Statistics {#standardized-residuals-and-diagnostic-statistics}

Residual diagnostics in OLS use the internally studentized residual $r_i^*$:

$$
r_i^* = \frac{e_i}{\hat\sigma\sqrt{1 - h_i}}
$$

where $e_i = y_i - \hat y_i$ is the residual, $\hat\sigma = \sqrt{\text{RSS}/(n - p)}$ is the error standard deviation estimated from all observations, and $h_i = H_{ii}$ is the $i$-th diagonal element of the hat matrix $H = X(X'X)^{-1}X'$ (the leverage of observation $i$). Here $p$ is the number of columns in the design matrix $X$, including the intercept. Since $H$ is symmetric and idempotent (an orthogonal projection matrix), $0 \le h_i \le 1$. For models with an intercept, $h_i \ge 1/n$. Leverage measures how far an observation's predictor values are from those of the other observations. Since $\operatorname{tr}(H) = p$, the average leverage is $p/n$, and $2p/n$ is a conventional threshold for high leverage.
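
A minimal NumPy sketch of these quantities, on hypothetical data in which the last observation sits far from the others in $x$. It forms the full hat matrix for clarity. This is fine for small $n$, but it is not necessarily how the application computes leverage.

```python
import numpy as np

# Hypothetical toy data; the last x value is far from the rest.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([1.0, 3.2, 4.9, 7.1, 15.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Hat matrix H = X (X'X)^{-1} X' and its diagonal (the leverages).
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Residuals e = (I - H) y and the error standard deviation estimate.
e = y - H @ y
sigma_hat = np.sqrt(e @ e / (n - p))

# Internally studentized residuals.
r_star = e / (sigma_hat * np.sqrt(1.0 - h))

# Flag observations above the conventional 2p/n threshold.
high_leverage = h > 2 * p / n

print(h, r_star, high_leverage)
```

In this example only the last observation exceeds $2p/n = 0.8$, and the leverages sum to $p = 2$, as the trace identity requires.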

Cook's Distance combines residual magnitude and leverage into a single influence measure:

$$
D_i = \frac{r_i^{*2}}{p} \cdot \frac{h_i}{1 - h_i}
$$
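
The closed form above can be sketched in NumPy and cross-checked against the leave-one-out definition $D_i = (\hat\beta - \hat\beta_{(i)})' X'X (\hat\beta - \hat\beta_{(i)}) / (p\hat\sigma^2)$, where $\hat\beta_{(i)}$ is the estimate with observation $i$ removed. The data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy data; the last point has high leverage.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([1.0, 3.2, 4.9, 7.1, 15.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
h = np.diag(X @ np.linalg.inv(XtX) @ X.T)
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p)

# Closed form: D_i = (r_i^2 / p) * h_i / (1 - h_i).
r_star = e / np.sqrt(sigma2_hat * (1.0 - h))
cooks_d = r_star**2 / p * h / (1.0 - h)

# Cross-check: refit without observation i and measure the shift in beta.
cooks_d_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    diff = beta_hat - beta_i
    cooks_d_loo[i] = diff @ XtX @ diff / (p * sigma2_hat)

print(cooks_d)
```

The closed form avoids the $n$ refits, which is why it is the usual way to compute the statistic.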

## Multicollinearity and VIF {#multicollinearity-and-vif}

When predictors are highly correlated, $(X'X)$ approaches singularity and coefficient estimates become unstable.

The Variance Inflation Factor, $\text{VIF}_j = 1 / (1 - R_j^2)$, is computed from $R_j^2$, the R-squared obtained by regressing $X_j$ on all other predictors. A high $R_j^2$ means most of the variation in $X_j$ is already explained by the other variables, leaving little unique information. VIF tells you by what factor the variance of $\hat\beta_j$ is inflated as a result. For example, VIF = 5 means the standard error of $\hat\beta_j$ is $\sqrt{5} \approx 2.2$ times larger than it would be if $X_j$ were uncorrelated with the other predictors. $\hat\beta_j$ itself remains unbiased, but the inflated variance widens the confidence interval.
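
The following is a minimal sketch of the VIF computation, assuming each auxiliary regression includes an intercept. The helper `vif` and the simulated predictors are hypothetical and are not the application's implementation.

```python
import numpy as np

def vif(Z):
    """Return the VIF of each column of Z (predictors only, no intercept column)."""
    n, k = Z.shape
    out = np.empty(k)
    for j in range(k):
        target = Z[:, j]
        # Regress X_j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)  # strongly correlated with x1
x3 = rng.normal(size=n)                   # roughly independent of both
Z = np.column_stack([x1, x2, x3])

vifs = vif(Z)
print(vifs)
```

The same values appear on the diagonal of the inverse of the predictors' correlation matrix, which gives a quick consistency check: `np.diag(np.linalg.inv(np.corrcoef(Z, rowvar=False)))`.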

## See also {#see-also}

- **[Linear Regression](linear-regression)** - How to run OLS regression and interpret results
- **[GLM Fundamentals](concepts-glm)** - Generalized linear model theory, which includes OLS as a special case
- **[Glossary](glossary)** - Statistical term definitions
