Hypothesis Testing Fundamentals

This page covers the statistical theory behind the Two-Sample Test and Paired Test tabs; see those pages for usage instructions.

Null and Alternative Hypotheses

Hypothesis testing is a procedure that sets up a "no effect" hypothesis as $H_0$ and evaluates whether the data are inconsistent with it.

  • Null hypothesis $H_0$: The "no effect" hypothesis set up as the target of rejection. For example, set $H_0$ to "the two population means are equal"
  • Alternative hypothesis $H_1$: The hypothesis adopted when $H_0$ is rejected. For example, set $H_1$ to "the two population means are not equal"

When the data are very unlikely under $H_0$, we reject $H_0$ and adopt $H_1$. When $H_0$ is not rejected, the conclusion is "insufficient evidence to reject $H_0$," not "$H_0$ is true."

p-value

The p-value is the probability of observing results as extreme as (or more extreme than) the observed data, assuming $H_0$ and all model assumptions underlying the test (distributional form, independence, etc.) are true.

"Extreme" is measured by a single number computed from the data called the test statistic. The test statistic summarizes how far the data deviate from H0H_0, and is defined for each type of test (e.g., the t statistic for Welch's t-test, the U statistic for the Mann-Whitney U test). Since the distribution of the test statistic under H0H_0 is known, the p-value is computed from where the observed statistic falls in that distribution.

A smaller p-value means the observed result is less likely under $H_0$. A significance level $\alpha$ (typically 0.05) is set in advance; if $p < \alpha$, $H_0$ is rejected.

What the p-value does NOT represent:

  • Not the probability that $H_0$ is true (the p-value is a probability of data, not of hypotheses)
  • Not the size of the effect (large samples can yield small p-values for trivial differences)
  • Not the probability of replication
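For a test statistic whose null distribution is standard normal, the two-sided p-value follows directly from the tail probability. A minimal Python sketch (illustrative, not MIDAS's internal code):

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic.

    P(|Z| >= |z|) under H0, using the complementary error function:
    1 - Phi(x) = 0.5 * erfc(x / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))

print(two_sided_p_from_z(1.96))  # very close to 0.05
```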

Type I and Type II Errors

|  | $H_0$ actually true | $H_0$ actually false |
|---|---|---|
| Reject $H_0$ | Type I error (false positive) | Correct decision |
| Do not reject $H_0$ | Correct decision | Type II error (false negative) |

  • Type I error: Concluding a difference exists when there is none. Its probability is controlled by $\alpha$
  • Type II error: Failing to detect a real difference. Its probability is $\beta$; $1 - \beta$ is the power

Making $\alpha$ stricter reduces Type I errors but increases Type II errors.
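The trade-off can be checked by simulation: when $H_0$ is true, the rejection rate settles at whatever $\alpha$ the critical value encodes. A stdlib sketch using a two-sided z-test with known unit variance (an illustrative setup, not MIDAS's implementation):

```python
import math
import random

random.seed(12345)

def z_test_rejects(n: int, z_crit: float) -> bool:
    """Draw one null dataset (two N(0,1) groups) and test H0: means equal."""
    g1 = [random.gauss(0.0, 1.0) for _ in range(n)]
    g2 = [random.gauss(0.0, 1.0) for _ in range(n)]
    diff = sum(g1) / n - sum(g2) / n
    z = diff / math.sqrt(2.0 / n)  # known variance 1 in each group
    return abs(z) > z_crit

# Critical value 1.96 corresponds to alpha = 0.05 (two-sided)
sims = 10_000
type1_rate = sum(z_test_rejects(20, 1.96) for _ in range(sims)) / sims
print(type1_rate)  # close to 0.05
```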

Statistical Significance and Practical Importance

A small p-value does not imply a large effect. With a sufficiently large sample, even trivially small differences can reach statistical significance. Conversely, with a small sample, practically meaningful differences may go undetected.

The p-value answers "should we reject $H_0$?" but not "how large is the difference?" or "does this difference matter in practice?" Answering those questions requires directly examining the magnitude and precision of the estimated effect.

Confidence Interval Interpretation

In many fields, the estimate and its confidence interval are the most direct way to convey effect magnitude and precision.

In regression analysis, the coefficient $\hat\beta$ is the effect size itself -- "$Y$ changes by $\hat\beta$ per unit increase in $X$" -- and the confidence interval conveys the precision of that estimate. A narrow interval indicates high precision; a wide interval indicates limited information from the data.

For t-tests, the confidence interval for the mean difference tells "how large the difference is," providing more information than the p-value alone. If the interval excludes zero, the conclusion is the same as rejecting $H_0$, but the interval width also reveals the plausible range of the difference.
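A Welch-style confidence interval for the mean difference can be sketched as follows (assuming SciPy for the t quantile; the data are made up for illustration):

```python
import math
from scipy import stats

def welch_ci(x, y, conf=0.95):
    """Welch confidence interval for the difference in means, mean(x) - mean(y)."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((xi - m1) ** 2 for xi in x) / (n1 - 1)  # unbiased variances
    v2 = sum((yi - m2) ** 2 for yi in y) / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)                # SE of the difference
    # Welch-Satterthwaite degrees of freedom
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    tcrit = stats.t.ppf(0.5 + conf / 2, df)
    diff = m1 - m2
    return diff - tcrit * se, diff + tcrit * se

x = [5.1, 4.8, 5.6, 5.3, 4.9, 5.2]
y = [4.4, 4.7, 4.1, 4.6, 4.3, 4.5]
lo, hi = welch_ci(x, y)
print(lo, hi)  # here the interval excludes zero: same conclusion as rejecting H0
```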

Standardized Effect Size (Cohen's d)

In psychology and education, standardized effect sizes are widely used to compare effect magnitudes across variables measured on different scales. For t-tests, Cohen's d (mean difference divided by the pooled standard deviation) is standard. Cohen (1988) proposed benchmarks of small (0.2), medium (0.5), and large (0.8) as tentative guidelines when no other basis exists; appropriate values vary by field and context.
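A minimal computation of Cohen's d with the pooled standard deviation (illustrative data):

```python
import math
import statistics

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = (
        (n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)
    ) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)

print(cohens_d([3, 4, 5, 6, 7], [1, 2, 3, 4, 5]))  # about 1.26, "large" by Cohen's benchmarks
```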

In regression analysis, model-specific coefficients such as odds ratios and hazard ratios directly represent effect magnitude, so a separate standardized effect size is typically unnecessary.

Rank-Biserial r (Effect Size)

For rank-based nonparametric tests, rank-biserial r serves as the effect size measure. It ranges from $-1$ to $+1$.

For the Mann-Whitney U test, rank-biserial r is computed from the difference in mean ranks between the two groups. A value of $+1$ means every observation in one group ranks above every observation in the other; $0$ means the rank distributions overlap completely.

For the Wilcoxon signed-rank test, rank-biserial r is computed from the difference between the sum of positive ranks ($W^+$) and the sum of negative ranks ($W^-$), normalized by the total rank sum. A value of $+1$ means all differences are positive; $-1$ means all are negative.

There are no universal benchmarks for rank-biserial r comparable to Cohen's d guidelines. Interpret the value in the context of your data and research question.
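Both variants can be computed directly from the test statistics. The formulas below are one common definition ($r = 2U_1/(n_1 n_2) - 1$ for Mann-Whitney, $(W^+ - W^-)/(W^+ + W^-)$ for Wilcoxon); MIDAS's exact computation may differ in detail:

```python
def rank_biserial_u(u1: float, n1: int, n2: int) -> float:
    """Rank-biserial r from the Mann-Whitney U statistic for group 1.

    r = 2*U1/(n1*n2) - 1: U1 = n1*n2 gives +1 (complete separation),
    U1 = n1*n2/2 gives 0 (complete overlap).
    """
    return 2.0 * u1 / (n1 * n2) - 1.0

def rank_biserial_w(w_plus: float, w_minus: float) -> float:
    """Rank-biserial r for the Wilcoxon signed-rank test: (W+ - W-) / (W+ + W-)."""
    return (w_plus - w_minus) / (w_plus + w_minus)

print(rank_biserial_u(9, 3, 3))  # 1.0: every group-1 value ranks above group 2
print(rank_biserial_w(8, 2))     # 0.6: positive ranks dominate
```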

Sample Size and Power

Power is the probability of correctly detecting a real effect ($1 - \beta$). It depends on:

  • Sample size: Larger samples yield higher power
  • Effect magnitude: Larger effects are easier to detect
  • Significance level $\alpha$: Larger $\alpha$ increases power (but also false positives)
  • Data variability: Less variability yields higher power
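These dependencies can be seen in a Monte Carlo sketch: estimate power by simulating many experiments at a given sample size and effect, using a two-sided z-test with known unit variance (illustrative, not a MIDAS feature):

```python
import math
import random

random.seed(7)

def simulated_power(n: int, effect: float, sims: int = 5000) -> float:
    """Monte Carlo power of a two-sided z-test (known unit variance),
    illustrating how sample size and effect magnitude drive power."""
    rejections = 0
    for _ in range(sims):
        g1 = [random.gauss(effect, 1.0) for _ in range(n)]
        g2 = [random.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(g1) / n - sum(g2) / n) / math.sqrt(2.0 / n)
        rejections += abs(z) > 1.96  # alpha = 0.05, two-sided
    return rejections / sims

print(simulated_power(30, 0.5))   # roughly 0.5 for this configuration
print(simulated_power(120, 0.5))  # larger n -> higher power
```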

Welch's t-test

The independent two-sample t-test in MIDAS is Welch's (1947) t-test. It does not assume equal variances between groups.

Problem Setting

Given samples from two independent groups, determine whether the population means differ.

  • Set $H_0$: $\mu_1 = \mu_2$ (the two population means are equal)
  • Set $H_1$: $\mu_1 \neq \mu_2$ (for a two-sided test)

Test Statistic

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{X}_i$, $s_i^2$, and $n_i$ are the sample mean, unbiased variance, and sample size for group $i$. The denominator is the standard error of the mean difference, treating each group's variance independently.

Degrees of freedom are computed using the Welch-Satterthwaite approximation:

$$\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

$\nu$ is generally non-integer. $H_0$ is rejected when the $t$ statistic is sufficiently extreme under the $t$ distribution with $\nu$ degrees of freedom.
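The two formulas can be checked numerically against SciPy's Welch implementation (illustrative data; `scipy.stats.ttest_ind` with `equal_var=False`):

```python
import math
from scipy import stats

x = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5]
y = [10.2, 11.5, 10.9, 11.1, 10.4]

n1, n2 = len(x), len(y)
m1, m2 = sum(x) / n1, sum(y) / n2
v1 = sum((xi - m1) ** 2 for xi in x) / (n1 - 1)  # unbiased variance, group 1
v2 = sum((yi - m2) ** 2 for yi in y) / (n2 - 1)  # unbiased variance, group 2

# Welch t statistic: mean difference over its standard error
t_stat = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Welch-Satterthwaite degrees of freedom (generally non-integer)
nu = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

res = stats.ttest_ind(x, y, equal_var=False)  # SciPy's Welch t-test
print(t_stat, nu, res.statistic, res.pvalue)
```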

Rank-Based Tests (Nonparametric Tests)

Rank-based tests replace the original data values with their ranks and perform inference on the ranks. Because ranks depend only on the ordering of values, these tests do not require assumptions about the shape of the population distribution.

How Rank Tests Work

  1. Pool all observations and assign ranks from smallest to largest
  2. When tied values occur, assign the average of the ranks they would occupy
  3. Compute a test statistic from the ranks
  4. Determine the p-value from the known distribution of the test statistic under $H_0$
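Steps 1 and 2 can be sketched in pure Python (equivalent in spirit to `scipy.stats.rankdata` with `method='average'`):

```python
def average_ranks(values):
    """Assign ranks from smallest to largest, giving tied values the
    average of the ranks they would occupy (steps 1-2 above)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0  # ranks are 1-based: positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

print(average_ranks([3, 1, 2, 2]))  # [4.0, 1.0, 2.5, 2.5]
```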

MIDAS uses the normal approximation with tie correction and continuity correction to compute p-values. The approximation is accurate when each group has at least 10 observations. For smaller samples, the approximation becomes less reliable and p-values should be interpreted with caution. MIDAS does not compute exact p-values.

Mann-Whitney U Test

The Mann-Whitney (1947) U test compares two independent groups.

Null hypothesis: The two samples come from the same distribution.

Pool all $n_1 + n_2$ observations and rank them. Let $R_1$ be the sum of ranks for group 1. The U statistic for group 1 is:

$$U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}$$

$U_1$ counts the number of times an observation from group 1 ranks above an observation from group 2. Similarly, $U_2 = n_1 n_2 - U_1$. MIDAS reports $U_1$.

Under $H_0$, the expected value and variance of $U$ are known, and the z-score for the normal approximation is:

$$z = \frac{U - \mu_U}{\sigma_U}$$

where $\mu_U = n_1 n_2 / 2$. The variance $\sigma_U^2$ includes a tie correction term that adjusts for groups of tied ranks.
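The U computation can be illustrated by hand for a small tie-free example:

```python
def mann_whitney_u1(x, y):
    """U statistic for group 1: rank the pooled sample, sum group-1 ranks,
    then subtract n1*(n1+1)/2. Assumes no tied values (illustrative sketch)."""
    pooled = sorted(x + y)
    r1 = sum(pooled.index(v) + 1 for v in x)  # 1-based rank of each x value
    n1 = len(x)
    return r1 - n1 * (n1 + 1) / 2

x = [1.2, 3.4, 5.6]
y = [2.1, 4.3]
u1 = mann_whitney_u1(x, y)
print(u1)  # 3.0: three of the six (x, y) pairs have the x value larger
```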

Wilcoxon Signed-Rank Test

The Wilcoxon (1945) signed-rank test compares paired measurements.

Null hypothesis: The distribution of differences is symmetric about zero.

  1. Compute pairwise differences $d_i = x_{1i} - x_{2i}$
  2. Exclude pairs where $d_i = 0$ (zero differences carry no information about direction)
  3. Rank the absolute differences $|d_i|$
  4. Sum the ranks of positive differences ($W^+$) and negative differences ($W^-$)

If the distribution of differences is symmetric about zero, $W^+$ and $W^-$ should be roughly equal. A large discrepancy provides evidence against $H_0$.

Under $H_0$, the expected value and variance of $W^+$ are known, and the z-score for the normal approximation is:

$$z = \frac{W^+ - \mu_W}{\sigma_W}$$

where $\mu_W = n'(n' + 1)/4$ and $n'$ is the number of non-zero differences. The variance $\sigma_W^2$ includes a tie correction term.
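Steps 1-4 can be sketched for a small tie-free example (illustrative data):

```python
def signed_rank_sums(x1, x2):
    """W+ and W- for the Wilcoxon signed-rank test (steps 1-4 above;
    assumes no ties among the absolute differences in this sketch)."""
    d = [a - b for a, b in zip(x1, x2) if a != b]  # drop zero differences
    abs_sorted = sorted(abs(v) for v in d)
    w_plus = sum(abs_sorted.index(abs(v)) + 1 for v in d if v > 0)
    w_minus = sum(abs_sorted.index(abs(v)) + 1 for v in d if v < 0)
    return w_plus, w_minus

before = [10, 12, 9, 14]
after_ = [9, 14, 6, 10]
print(signed_rank_sums(before, after_))  # (8, 2)
```

Note that $W^+ + W^-$ always equals the total rank sum $n'(n' + 1)/2$, so either sum determines the other.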

Choosing Between Parametric and Nonparametric Tests

Choose the test method based on the characteristics of your data before examining the data:

  • Ordinal data: Use nonparametric tests. Rank-based tests are the natural choice for data measured on an ordinal scale, where differences between values are not meaningful
  • Continuous data expected to be approximately normal: The t-test has greater power than rank-based tests under normality. When the normality assumption is reasonable, the t-test provides more precise inference
  • Continuous data with known non-normal characteristics: When the nature of the variable suggests non-normality beforehand (e.g., income data with a heavy right tail, bounded variables, or discrete counts), choose a nonparametric test before examining the data

Switching to a nonparametric test after seeing the normality test result is a form of pre-testing -- see Multiple Testing for why this is problematic.

Normality Assumption and Diagnostics

The t-test relies on the sample mean following a normal distribution. This holds when the population is normally distributed. Even when the population is not normal, the Central Limit Theorem ensures the distribution of the sample mean approaches normality as the sample size grows. This makes the t-test robust to non-normality with large samples.

For paired t-tests, the test is internally a one-sample t-test on the differences $d_i = x_{1i} - x_{2i}$, so the normality assumption applies to the population of differences, not to each group individually. The test remains valid when the differences are approximately normal, even if the individual groups are not. The MIDAS diagnostics panel displays Q-Q plots, histograms, and Shapiro-Wilk tests for the differences in addition to the per-group diagnostics.

MIDAS displays diagnostics (Q-Q plots, Shapiro-Wilk (1965) normality test) after running a test. These are not a gate for switching methods -- they help you judge how much to trust the results.

  • Large sample, normality questionable: The CLT ensures the normal approximation for the sample mean works well, so the t-test results are generally reliable. The Shapiro-Wilk test flags even trivial deviations in large samples, so a significant result does not necessarily indicate a problem
  • Small sample, normality questionable: The t-test results become less reliable. Check the Q-Q plot for the degree of skewness or heavy tails, and interpret the results with appropriate caution

For nonparametric tests, normality is not an assumption. MIDAS still displays diagnostics to help you understand the shape of your data distributions.

Switching to a nonparametric test after seeing the normality test result is a form of pre-testing -- a multiple testing procedure that causes the overall Type I error rate to deviate from the nominal level. If the nature of your data suggests non-normality beforehand (e.g., income data with a heavy right tail), choose a nonparametric test before examining the data.

Multiple Testing

When you run multiple tests on the same data, the probability of incorrectly rejecting at least one true null hypothesis (the familywise error rate) increases. Even if each test uses $\alpha = 0.05$, running $m$ independent tests gives a familywise error rate of $1 - (1 - \alpha)^m$. For $m = 5$ this is about 23%; for $m = 10$, about 40%.
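The familywise error rate formula is easy to evaluate; the sketch below also shows the effect of a Bonferroni-adjusted per-test level $\alpha/m$:

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """FWER for m independent tests, each at level alpha: 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m

print(familywise_error_rate(0.05, 5))        # about 0.23
print(familywise_error_rate(0.05, 10))       # about 0.40
# Bonferroni: test each hypothesis at alpha/m to keep the FWER near alpha
print(familywise_error_rate(0.05 / 10, 10))  # about 0.049
```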

Typical situations where multiple testing is a concern:

  • Running Log-rank tests or t-tests repeatedly with different group variables
  • Testing the same hypothesis across multiple outcomes
  • Switching test methods after examining a normality test result (pre-testing)

Patterns discovered by exploring data are hypothesis generation, not hypothesis testing. To test a hypothesis that emerged from exploration, collect independent data. Testing on the same data that generated the hypothesis risks "discovering" chance patterns.

When new data collection is not feasible, report the results as exploratory analysis. Familywise error rate corrections (such as Bonferroni) can lend support to exploratory findings, but they are not a substitute for confirmatory testing.

References

  • Welch, B. L. (1947). The generalization of "Student's" problem when several different population variances are involved. Biometrika, 34(1-2), 28-35.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3-4), 591-611.
  • Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50-60.
  • Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80-83.