OLS Fundamentals

This page covers the statistical theory behind the Linear Regression tab. See that page for usage instructions.

Model Formulation

The linear regression model is formulated as:

$$Y = X\beta + \varepsilon$$

where $Y$ is the $n \times 1$ response vector, $X$ is the $n \times p$ design matrix (predictors and intercept), $\beta$ is the $p \times 1$ coefficient vector, and $\varepsilon$ is the error term.

The OLS estimator minimizes the residual sum of squares $\|Y - X\beta\|^2$ and is obtained from the normal equations:

$$\hat\beta = (X'X)^{-1}X'Y$$
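As an illustrative sketch (not the app's implementation), the estimator can be computed with NumPy. Forming $(X'X)^{-1}$ explicitly is avoided here: `np.linalg.solve` handles the normal equations, and `np.linalg.lstsq` solves the least-squares problem directly, which is numerically preferable. The simulated data and coefficient values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
# Design matrix with an intercept column plus two random predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent but numerically more stable least-squares solve
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same coefficients on well-conditioned data; the least-squares solve degrades more gracefully as $X'X$ approaches singularity.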

The properties of this estimator depend on the assumptions placed on ε\varepsilon.

Assuming $\operatorname{plim}(X'\varepsilon/n) = 0$: The OLS estimator is consistent. Homoscedasticity and uncorrelated errors are not required.

Further assuming $E[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I$ (homoscedastic and uncorrelated): By the Gauss-Markov theorem, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).

Further assuming $\varepsilon \sim N(0, \sigma^2 I)$ (normality): Exact finite-sample distributions for $t$-tests and $F$-tests are obtained.

Without normality, the central limit theorem ensures the test statistics are asymptotically normal in large samples. The required sample size depends on the true error distribution, so no universal threshold applies.
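To make the inference machinery concrete: under the homoscedastic assumptions, $\operatorname{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}$, estimated by plugging in $\hat\sigma^2 = \text{RSS}/(n-p)$, and each $t$-statistic is $\hat\beta_j / \text{SE}(\hat\beta_j)$. A minimal sketch with NumPy and SciPy (the simulated data are an assumption of the example, not the app's code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)             # RSS / (n - p)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated Var(beta_hat)
se = np.sqrt(np.diag(cov_beta))                  # standard errors
t_stats = beta_hat / se
# Two-sided p-values; exact under normal errors, asymptotic otherwise
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)
```

With normal errors these $p$-values are exact for any $n$; without normality they rely on the large-sample approximation described above.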

OLS is a special case of the generalized linear model (GLM): the Gaussian family with an identity link.

Standardized Residuals and Diagnostic Statistics

Residual diagnostics in OLS use the internally studentized residual $r_i^*$:

$$r_i^* = \frac{e_i}{\hat\sigma\sqrt{1 - h_i}}$$

where $e_i = y_i - \hat y_i$ is the residual, $\hat\sigma = \sqrt{\text{RSS}/(n - p)}$ is the error standard deviation estimated from all observations, and $h_i = \operatorname{diag}(H)_i$ is the $i$-th diagonal element of the hat matrix $H = X(X'X)^{-1}X'$ (the leverage). $p$ is the number of columns in the design matrix $X$, including the intercept. Leverage measures how far an observation's predictor values are from the others. Since $\operatorname{tr}(H) = p$, the average leverage is $p/n$, and $2p/n$ is the conventional threshold for high leverage.
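These quantities fall out of a few matrix operations. The following sketch (simulated data assumed for illustration) computes the leverages and internally studentized residuals, and checks the $\operatorname{tr}(H) = p$ identity:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix H = X (X'X)^{-1} X'
h = np.diag(H)                             # leverages h_i
e = y - H @ y                              # residuals e = (I - H) y
sigma_hat = np.sqrt(e @ e / (n - p))       # sqrt(RSS / (n - p))
r_star = e / (sigma_hat * np.sqrt(1 - h))  # internally studentized residuals

high_leverage = h > 2 * p / n              # conventional threshold
```

For large $n$, forming the full $n \times n$ hat matrix is wasteful; the diagonal alone can be computed as the row-wise sums of $(XQ) \circ X$ for a suitable factorization, but the dense version above keeps the formulas visible.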

Cook's Distance combines residual magnitude and leverage into a single influence measure:

$$D_i = \frac{r_i^{*2}}{p} \cdot \frac{h_i}{1 - h_i}$$
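As a self-contained sketch (simulated data assumed), Cook's distance can be computed from the studentized residuals and leverages, and cross-checked against the equivalent direct form $D_i = e_i^2 h_i / \big(p\,\hat\sigma^2 (1 - h_i)^2\big)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
sigma2_hat = e @ e / (n - p)
r_star2 = e**2 / (sigma2_hat * (1 - h))            # squared studentized residual
cooks_d = (r_star2 / p) * (h / (1 - h))            # D_i as in the formula above

# Equivalent direct form in terms of raw residuals
cooks_direct = e**2 * h / (p * sigma2_hat * (1 - h)**2)
```

The two forms agree term by term, which is a useful sanity check when implementing diagnostics by hand.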

Multicollinearity and VIF

When predictors are highly correlated, $X'X$ approaches singularity and coefficient estimates become unstable.

The Variance Inflation Factor $\text{VIF}_j = 1/(1 - R_j^2)$ is computed from $R_j^2$, the R-squared obtained by regressing $X_j$ on all other predictors. A high $R_j^2$ means most of the variation in $X_j$ is already explained by other variables, leaving little unique information. VIF tells you how many times the variance of $\hat\beta_j$ is inflated as a result. For example, $\text{VIF}_j = 5$ means the standard error of $\hat\beta_j$ is $\sqrt{5} \approx 2.2$ times wider than it would be with uncorrelated predictors. $\hat\beta_j$ itself remains unbiased, but its estimate becomes less precise and the confidence interval widens.
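The auxiliary-regression definition translates directly into code. The sketch below (a hypothetical `vif` helper, not the app's implementation) regresses each predictor on the remaining ones and applies $1/(1 - R_j^2)$; the simulated data include a nearly collinear pair to show the inflation:

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictor columns only, no intercept).

    Each auxiliary regression of X_j on the remaining predictors
    (plus an intercept) yields R_j^2; VIF_j = 1 / (1 - R_j^2).
    """
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        tss = (xj - xj.mean()) @ (xj - xj.mean())
        r2 = 1 - resid @ resid / tss
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                    # independent predictor
X = np.column_stack([x1, x2, x3])
```

Here `vif(X)` is large for the collinear pair `x1`, `x2` and near 1 for the independent `x3`.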

See also