Missing Data Mechanisms

Many analysis tabs in MIDAS (Linear Regression, GLM, GLMM, ANOVA, DoE, Survival Analysis, Random Forest) automatically exclude rows containing missing values before running the analysis. This approach is called listwise deletion or complete-case analysis. Whether listwise deletion produces valid estimates depends on the mechanism that generated the missing data.

This page describes Rubin's (1976) classification of missing data mechanisms and their relationship to listwise deletion. For instructions on each analysis tab, see the individual pages.

Classification of Missing Data Mechanisms

Missing data mechanisms are classified into three categories based on what the probability of missingness depends on.

Let $Y = (Y_\text{obs}, Y_\text{mis})$ denote the complete data (observed and missing parts combined), and let $M$ be the missingness indicator ($M_i = 1$ if $Y_i$ is missing, $M_i = 0$ otherwise). The missing data mechanism is characterized by the conditional distribution $P(M \mid Y_\text{obs}, Y_\text{mis})$.

MCAR (Missing Completely at Random)

$$P(M \mid Y_\text{obs}, Y_\text{mis}) = P(M)$$

The probability of a value being missing does not depend on any observed or unobserved data. The missingness pattern is unrelated to the data values.

Example: A measurement instrument malfunctions randomly, regardless of the value being measured, causing some readings to go unrecorded.

Under MCAR, complete cases (rows with no missing values) constitute a random subsample of the full dataset. MCAR is a special case of MAR (since $P(M) = P(M \mid Y_\text{obs})$ holds trivially).
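A small simulation can make the random-subsample claim concrete. The following is a minimal sketch in plain Python (standard library only) and is not part of MIDAS; the distribution, the sample size, and the 30% missingness rate are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical instrument readings: 20,000 draws from Normal(mean=50, sd=10).
readings = [random.gauss(50, 10) for _ in range(20_000)]

# MCAR: every reading is lost with the same probability (30%),
# regardless of the value that would have been recorded.
observed = [y for y in readings if random.random() >= 0.30]

full_mean = statistics.fmean(readings)
cc_mean = statistics.fmean(observed)

print(f"rows kept:          {len(observed)} of {len(readings)}")
print(f"full-data mean:     {full_mean:.2f}")
print(f"complete-case mean: {cc_mean:.2f}")
```

With a sample this large, the two means should differ only by sampling noise, even though the complete-case mean rests on fewer observations.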

MAR (Missing at Random)

$$P(M \mid Y_\text{obs}, Y_\text{mis}) = P(M \mid Y_\text{obs})$$

The probability of missingness depends on observed data but not on the missing values themselves. Despite the word "random" in the name, missingness is not unconditionally random: it becomes random only after conditioning on the observed data.

Example: In a survey, older respondents are more likely to leave the income question blank. Age is fully observed for all respondents, and within each age group, missingness is unrelated to the actual income value. This is MAR.
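The survey example can be simulated to show both halves of the MAR definition: the overall complete-case mean drifts, while within an age group the observed incomes still behave like a random subsample. This is a sketch under invented assumptions (two age groups, the income distributions, and the missingness rates are all hypothetical), written in plain Python.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Each row: (older, income, income_missing). All numbers are hypothetical.
rows = []
for _ in range(20_000):
    older = random.random() < 0.5                   # age group, observed for everyone
    income = random.gauss(60 if older else 40, 10)  # assumed: older group earns more
    p_missing = 0.6 if older else 0.1               # missingness depends on age only
    rows.append((older, income, random.random() < p_missing))

# Overall means: all rows versus rows with income observed.
full_mean = statistics.fmean(inc for _, inc, _ in rows)
cc_mean = statistics.fmean(inc for _, inc, miss in rows if not miss)

# Within the older group, observed incomes are still a random subsample.
older_all = statistics.fmean(inc for old, inc, _ in rows if old)
older_obs = statistics.fmean(inc for old, inc, miss in rows if old and not miss)

print(f"overall mean, all rows:        {full_mean:.2f}")
print(f"overall mean, complete cases:  {cc_mean:.2f}")
print(f"older group, all rows:         {older_all:.2f}")
print(f"older group, complete cases:   {older_obs:.2f}")
```

Because the older group is under-represented among complete cases and is assumed to earn more, the overall complete-case mean falls below the full-data mean, while the within-group comparison stays close.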

MNAR (Missing Not at Random)

$$P(M \mid Y_\text{obs}, Y_\text{mis}) \neq P(M \mid Y_\text{obs})$$

Even after conditioning on observed data, the probability of missingness depends on the missing values themselves.

Example: In a clinical trial, patients with more severe symptoms are more likely to drop out. The severity (the outcome that would have been measured) is itself the cause of missingness. This is MNAR. In this case, remaining patients are skewed toward milder cases, so complete-case analysis underestimates the average severity.
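The dropout example can be sketched the same way. The severity scale, the threshold of 55, and the dropout probabilities below are hypothetical values chosen only to make the effect visible; the code uses the Python standard library.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical severity scores that would have been measured at the end of the trial.
severity = [random.gauss(50, 10) for _ in range(20_000)]

def dropout_probability(score):
    """Assumed MNAR rule: severe cases (score > 55) drop out far more often."""
    return 0.8 if score > 55 else 0.1

# A patient remains in the data only if they do not drop out.
remaining = [s for s in severity if random.random() >= dropout_probability(s)]

full_mean = statistics.fmean(severity)
cc_mean = statistics.fmean(remaining)

print(f"mean severity, all patients:       {full_mean:.2f}")
print(f"mean severity, remaining patients: {cc_mean:.2f}")
```

Here the missingness rule reads the very value that goes missing, so no observed variable in the sketch can be used to correct the complete-case mean.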

Listwise Deletion and MCAR

Listwise deletion excludes every row that has a missing value in any variable used by the analysis, and estimates are computed from complete cases only.
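The rule can be illustrated with a few lines of Python. This is not the MIDAS implementation; the function name `listwise_delete`, the column names, and the use of `None` to mark a missing value are assumptions made for the sketch.

```python
# Five hypothetical rows; None marks a missing value.
rows = [
    {"x1": 1.0, "x2": 5.0, "y": 10.0},
    {"x1": 2.0, "x2": None, "y": 12.0},
    {"x1": None, "x2": 7.0, "y": 15.0},
    {"x1": 4.0, "x2": 8.0, "y": None},
    {"x1": 5.0, "x2": 9.0, "y": 20.0},
]

def listwise_delete(rows, variables):
    """Keep only rows in which every listed variable is present (not None)."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

# The set of complete cases depends on which variables the analysis uses.
complete_all = listwise_delete(rows, ["x1", "x2", "y"])
complete_x1_y = listwise_delete(rows, ["x1", "y"])

print(len(complete_all))   # 2
print(len(complete_x1_y))  # 3
```

The second call keeps one more row because a missing `x2` is irrelevant to an analysis that does not use `x2`.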

Under MCAR: Complete cases are a random subsample of the full data, so estimators are unbiased and standard errors are correctly computed. However, discarding incomplete rows reduces efficiency (standard errors are larger than they would be using all available data).

Under MAR: Listwise deletion generally compromises sample representativeness. An exception arises in regression models when predictors are fully observed and missingness in the response depends only on the predictor values: complete cases then form a random subsample conditional on the predictors, so regression coefficients remain unbiased. Outside this special case, listwise deletion under MAR can introduce bias in addition to efficiency loss.
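The regression exception can be checked with a short simulation. The sketch below assumes a true model $y = 2 + 3x + \varepsilon$ and a missingness rule for $y$ that depends only on $x$; these values, and the helper `ols`, are illustrative and not taken from MIDAS.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

def ols(xs, ys):
    """Simple least-squares fit of y on x; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Assumed true model: y = 2 + 3x + noise, with x fully observed.
xs = [random.uniform(0, 10) for _ in range(20_000)]
ys = [2.0 + 3.0 * x + random.gauss(0, 2) for x in xs]

# MAR in the response: y is missing with probability 0.7 when x > 5, else 0.1.
kept = [(x, y) for x, y in zip(xs, ys)
        if random.random() >= (0.7 if x > 5 else 0.1)]
kept_x = [x for x, _ in kept]
kept_y = [y for _, y in kept]

intercept, slope = ols(kept_x, kept_y)
full_mean_y = statistics.fmean(ys)
cc_mean_y = statistics.fmean(kept_y)

print(f"complete-case fit: intercept={intercept:.2f}, slope={slope:.2f}")
print(f"mean of y, all rows:       {full_mean_y:.2f}")
print(f"mean of y, complete cases: {cc_mean_y:.2f}")
```

The fitted coefficients should stay close to the assumed values of 2 and 3, while the unconditional mean of $y$ computed from complete cases is pulled downward, because rows with large $x$ are removed more often.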

Under MNAR: Complete cases are no longer representative of the full population, and estimates are biased. The direction and magnitude of the bias depend on the specific structure of the missing data mechanism.

MIDAS currently supports only listwise deletion and does not provide alternative approaches such as multiple imputation (MI) or full information maximum likelihood (FIML). If MAR or MNAR is suspected, take this limitation into account when interpreting results.

MIDAS tabs that perform listwise deletion include Linear Regression, GLM, GLMM, ANOVA, DoE, Survival Analysis, and Random Forest.

Testability of MCAR

The missing data mechanism cannot be determined from data alone. The distinction between MAR and MNAR depends on $Y_\text{mis}$, which is unobserved by definition, so that distinction cannot be tested. Departures from MCAR that appear as associations between missingness and observed values are detectable from data, but the absence of such associations does not prove MCAR.
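One simple way to look for an association between missingness and an observed variable is to compare that variable between rows where another variable is missing and rows where it is present. The sketch below uses Welch's t statistic for this comparison; it is one possible diagnostic, not a formal MCAR test and not a MIDAS feature, and the two scenarios and their missingness rates are invented for illustration.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va / len(a) + vb / len(b))

n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]  # fully observed covariate
y = [random.gauss(0, 1) for _ in range(n)]  # independent of x by construction

# Scenario A: missingness in y depends on the observed x (visible departure from MCAR).
miss_a = [random.random() < (0.6 if xi > 0 else 0.2) for xi in x]
# Scenario B: missingness in y depends on y itself (MNAR); x carries no trace of it.
miss_b = [random.random() < (0.6 if yi > 0 else 0.2) for yi in y]

def t_for(miss):
    """Compare observed x between rows where y is missing and rows where it is present."""
    x_when_missing = [xi for xi, m in zip(x, miss) if m]
    x_when_present = [xi for xi, m in zip(x, miss) if not m]
    return welch_t(x_when_missing, x_when_present)

t_a = t_for(miss_a)
t_b = t_for(miss_b)

print(f"scenario A (depends on observed x): t = {t_a:.1f}")
print(f"scenario B (depends on missing y):  t = {t_b:.1f}")
```

Scenario A produces a large statistic, so the departure from MCAR is detectable. Scenario B is MNAR by construction, yet the comparison on $x$ shows nothing unusual, which is why a clean diagnostic cannot be read as evidence for MCAR.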

Assessing the missing data mechanism therefore requires knowledge of the data-generating process. Before the analysis, consider the conditions under which measurements were taken and the reasons data may have gone missing.

References

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://www.jstor.org/stable/2335739