Hypothesis Testing Fundamentals

This page covers the statistical theory behind the Two-Sample Test and Paired Test tabs; see those pages for usage instructions.

Null and Alternative Hypotheses

Hypothesis testing is a procedure that sets up a "no effect" hypothesis as $H_0$ and evaluates whether the data are inconsistent with it.

  • Null hypothesis $H_0$: The "no effect" hypothesis set up as the target of rejection. For example, set $H_0$ to "the two population means are equal"
  • Alternative hypothesis $H_1$: The hypothesis adopted when $H_0$ is rejected. For example, set $H_1$ to "the two population means are not equal"

When the data are very unlikely under $H_0$, we reject $H_0$ and adopt $H_1$. When $H_0$ is not rejected, the conclusion is "insufficient evidence to reject $H_0$," not "$H_0$ is true."

p-value

The p-value is the probability of observing results as extreme as (or more extreme than) the observed data, assuming $H_0$ and all model assumptions underlying the test (distributional form, independence, etc.) are true.

"Extreme" is measured by a single number computed from the data called the test statistic. The test statistic summarizes how far the data deviate from H0H_0, and is defined for each type of test (e.g., the t statistic for Welch's t-test, the U statistic for the Mann-Whitney U test). Since the distribution of the test statistic under H0H_0 is known, the p-value is computed from where the observed statistic falls in that distribution.

A smaller p-value means the observed result is less likely under $H_0$. A significance level $\alpha$ (typically 0.05) is set in advance; if $p < \alpha$, $H_0$ is rejected.

What the p-value does NOT represent:

  • Not the probability that $H_0$ is true (the p-value is a probability of data, not of hypotheses)
  • Not the size of the effect (large samples can yield small p-values for trivial differences)
  • Not the probability of replication
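For a test statistic whose null distribution is standard normal, the two-sided p-value follows directly from the tail probability. A minimal Python sketch (illustrative, not MIDAS's internal code):

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic.

    P(|Z| >= |z|) under H0, using the complementary error function:
    1 - Phi(x) = 0.5 * erfc(x / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))

print(two_sided_p_from_z(1.96))  # very close to 0.05
```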

Type I and Type II Errors

|  | $H_0$ actually true | $H_0$ actually false |
|---|---|---|
| Reject $H_0$ | Type I error (false positive) | Correct decision |
| Do not reject $H_0$ | Correct decision | Type II error (false negative) |

  • Type I error: Concluding a difference exists when there is none. Its probability is controlled by $\alpha$
  • Type II error: Failing to detect a real difference. Its probability is $\beta$; $1 - \beta$ is the power

Making $\alpha$ stricter reduces Type I errors but increases Type II errors.
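The trade-off can be checked by simulation: when $H_0$ is true, the rejection rate settles at whatever $\alpha$ the critical value encodes. A stdlib sketch using a two-sided z-test with known unit variance (an illustrative setup, not MIDAS's implementation):

```python
import math
import random

random.seed(12345)

def z_test_rejects(n: int, z_crit: float) -> bool:
    """Draw one null dataset (two N(0,1) groups) and test H0: means equal."""
    g1 = [random.gauss(0.0, 1.0) for _ in range(n)]
    g2 = [random.gauss(0.0, 1.0) for _ in range(n)]
    diff = sum(g1) / n - sum(g2) / n
    z = diff / math.sqrt(2.0 / n)  # known variance 1 in each group
    return abs(z) > z_crit

# Critical value 1.96 corresponds to alpha = 0.05 (two-sided)
sims = 10_000
type1_rate = sum(z_test_rejects(20, 1.96) for _ in range(sims)) / sims
print(type1_rate)  # close to 0.05
```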

Statistical Significance and Practical Importance

A small p-value does not imply a large effect. With a sufficiently large sample, even trivially small differences can reach statistical significance. Conversely, with a small sample, practically meaningful differences may go undetected.

The p-value answers "should we reject $H_0$?" but not "how large is the difference?" or "does this difference matter in practice?" Answering those questions requires directly examining the magnitude and precision of the estimated effect.

Confidence Interval Interpretation

In many fields, the estimate and its confidence interval are the most direct way to convey effect magnitude and precision.

In regression analysis, the coefficient $\hat\beta$ is the effect size itself -- "$Y$ changes by $\hat\beta$ per unit increase in $X$" -- and the confidence interval conveys the precision of that estimate. A narrow interval indicates high precision; a wide interval indicates limited information from the data.

For t-tests, the confidence interval for the mean difference tells "how large the difference is," providing more information than the p-value alone. If the interval excludes zero, the conclusion is the same as rejecting $H_0$, but the interval width also reveals the plausible range of the difference.
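A Welch-style confidence interval for the mean difference can be sketched as follows (assuming SciPy for the t quantile; the data are made up for illustration):

```python
import math
from scipy import stats

def welch_ci(x, y, conf=0.95):
    """Welch confidence interval for the difference in means, mean(x) - mean(y)."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((xi - m1) ** 2 for xi in x) / (n1 - 1)  # unbiased variances
    v2 = sum((yi - m2) ** 2 for yi in y) / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)                # SE of the difference
    # Welch-Satterthwaite degrees of freedom
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    tcrit = stats.t.ppf(0.5 + conf / 2, df)
    diff = m1 - m2
    return diff - tcrit * se, diff + tcrit * se

x = [5.1, 4.8, 5.6, 5.3, 4.9, 5.2]
y = [4.4, 4.7, 4.1, 4.6, 4.3, 4.5]
lo, hi = welch_ci(x, y)
print(lo, hi)  # here the interval excludes zero: same conclusion as rejecting H0
```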

Standardized Effect Size (Cohen's d)

In psychology and education, standardized effect sizes are widely used to compare effect magnitudes across variables measured on different scales. For t-tests, Cohen's d (mean difference divided by the pooled standard deviation) is standard. Cohen (1988) proposed benchmarks of small (0.2), medium (0.5), and large (0.8) as tentative guidelines when no other basis exists; appropriate values vary by field and context.
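A minimal computation of Cohen's d with the pooled standard deviation (illustrative data):

```python
import math
import statistics

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = (
        (n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)
    ) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)

print(cohens_d([3, 4, 5, 6, 7], [1, 2, 3, 4, 5]))  # about 1.26, "large" by Cohen's benchmarks
```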

In regression analysis, model-specific coefficients such as odds ratios and hazard ratios directly represent effect magnitude, so a separate standardized effect size is typically unnecessary.

Rank-Biserial r (Effect Size)

For rank-based nonparametric tests, rank-biserial r serves as the effect size measure. It ranges from $-1$ to $+1$.

For the Mann-Whitney U test, rank-biserial r is computed from the difference in mean ranks between the two groups. A value of $+1$ means every observation in one group ranks above every observation in the other; $0$ means the rank distributions overlap completely.

For the Wilcoxon signed-rank test, rank-biserial r is computed from the difference between the sum of positive ranks ($W^+$) and the sum of negative ranks ($W^-$), normalized by the total rank sum. A value of $+1$ means all differences are positive; $-1$ means all are negative.

There are no universal benchmarks for rank-biserial r comparable to Cohen's d guidelines. Interpret the value in the context of your data and research question.
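Both variants can be computed directly from the test statistics. The formulas below are one common definition ($r = 2U_1/(n_1 n_2) - 1$ for Mann-Whitney, $(W^+ - W^-)/(W^+ + W^-)$ for Wilcoxon); MIDAS's exact computation may differ in detail:

```python
def rank_biserial_u(u1: float, n1: int, n2: int) -> float:
    """Rank-biserial r from the Mann-Whitney U statistic for group 1.

    r = 2*U1/(n1*n2) - 1: U1 = n1*n2 gives +1 (complete separation),
    U1 = n1*n2/2 gives 0 (complete overlap).
    """
    return 2.0 * u1 / (n1 * n2) - 1.0

def rank_biserial_w(w_plus: float, w_minus: float) -> float:
    """Rank-biserial r for the Wilcoxon signed-rank test: (W+ - W-) / (W+ + W-)."""
    return (w_plus - w_minus) / (w_plus + w_minus)

print(rank_biserial_u(9, 3, 3))  # 1.0: every group-1 value ranks above group 2
print(rank_biserial_w(8, 2))     # 0.6: positive ranks dominate
```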

Sample Size and Power

Power is the probability of correctly detecting a real effect ($1 - \beta$). It depends on:

  • Sample size: Larger samples yield higher power
  • Effect magnitude: Larger effects are easier to detect
  • Significance level $\alpha$: Larger $\alpha$ increases power (but also false positives)
  • Data variability: Less variability yields higher power
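These dependencies can be seen in a Monte Carlo sketch: estimate power by simulating many experiments at a given sample size and effect, using a two-sided z-test with known unit variance (illustrative, not a MIDAS feature):

```python
import math
import random

random.seed(7)

def simulated_power(n: int, effect: float, sims: int = 5000) -> float:
    """Monte Carlo power of a two-sided z-test (known unit variance),
    illustrating how sample size and effect magnitude drive power."""
    rejections = 0
    for _ in range(sims):
        g1 = [random.gauss(effect, 1.0) for _ in range(n)]
        g2 = [random.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(g1) / n - sum(g2) / n) / math.sqrt(2.0 / n)
        rejections += abs(z) > 1.96  # alpha = 0.05, two-sided
    return rejections / sims

print(simulated_power(30, 0.5))   # roughly 0.5 for this configuration
print(simulated_power(120, 0.5))  # larger n -> higher power
```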

Welch's t-test

The independent two-sample t-test in MIDAS is Welch's (1947) t-test. It does not assume equal variances between groups.

Problem Setting

Given samples from two independent groups, determine whether the population means differ.

  • Set $H_0$: $\mu_1 = \mu_2$ (the two population means are equal)
  • Set $H_1$: $\mu_1 \neq \mu_2$ (for a two-sided test)

Test Statistic

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{X}_i$, $s_i^2$, and $n_i$ are the sample mean, unbiased variance, and sample size for group $i$. The denominator is the standard error of the mean difference, treating each group's variance independently.

Degrees of freedom are computed using the Welch-Satterthwaite approximation:

$$\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

$\nu$ is generally non-integer. $H_0$ is rejected when the $t$ statistic is sufficiently extreme under the $t$ distribution with $\nu$ degrees of freedom.
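The two formulas can be checked numerically against SciPy's Welch implementation (illustrative data; `scipy.stats.ttest_ind` with `equal_var=False`):

```python
import math
from scipy import stats

x = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5]
y = [10.2, 11.5, 10.9, 11.1, 10.4]

n1, n2 = len(x), len(y)
m1, m2 = sum(x) / n1, sum(y) / n2
v1 = sum((xi - m1) ** 2 for xi in x) / (n1 - 1)  # unbiased variance, group 1
v2 = sum((yi - m2) ** 2 for yi in y) / (n2 - 1)  # unbiased variance, group 2

# Welch t statistic: mean difference over its standard error
t_stat = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Welch-Satterthwaite degrees of freedom (generally non-integer)
nu = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

res = stats.ttest_ind(x, y, equal_var=False)  # SciPy's Welch t-test
print(t_stat, nu, res.statistic, res.pvalue)
```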

Rank-Based Tests (Nonparametric Tests)

Rank-based tests replace the original data values with their ranks and perform inference on the ranks. Because ranks depend only on the ordering of values, these tests do not require assumptions about the shape of the population distribution.

How Rank Tests Work

  1. Pool all observations and assign ranks from smallest to largest
  2. When tied values occur, assign the average of the ranks they would occupy
  3. Compute a test statistic from the ranks
  4. Determine the p-value from the known distribution of the test statistic under $H_0$
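Steps 1 and 2 can be sketched in pure Python (equivalent in spirit to `scipy.stats.rankdata` with `method='average'`):

```python
def average_ranks(values):
    """Assign ranks from smallest to largest, giving tied values the
    average of the ranks they would occupy (steps 1-2 above)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0  # ranks are 1-based: positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

print(average_ranks([3, 1, 2, 2]))  # [4.0, 1.0, 2.5, 2.5]
```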

MIDAS uses the normal approximation with tie correction and continuity correction to compute p-values. The approximation is accurate when each group has at least 10 observations. For smaller samples, the approximation becomes less reliable and p-values should be interpreted with caution. MIDAS does not compute exact p-values.

Mann-Whitney U Test

The Mann-Whitney (1947) U test compares two independent groups.

Null hypothesis: The two samples come from the same distribution.

Pool all $n_1 + n_2$ observations and rank them. Let $R_1$ be the sum of ranks for group 1. The U statistic for group 1 is:

$$U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}$$

$U_1$ counts the number of times an observation from group 1 ranks above an observation from group 2. Similarly, $U_2 = n_1 n_2 - U_1$. MIDAS reports $U_1$.

Under $H_0$, the expected value and variance of $U$ are known, and the z-score for the normal approximation is:

$$z = \frac{U - \mu_U}{\sigma_U}$$

where $\mu_U = n_1 n_2 / 2$. The variance $\sigma_U^2$ includes a tie correction term that adjusts for groups of tied ranks.
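The U computation can be illustrated by hand for a small tie-free example:

```python
def mann_whitney_u1(x, y):
    """U statistic for group 1: rank the pooled sample, sum group-1 ranks,
    then subtract n1*(n1+1)/2. Assumes no tied values (illustrative sketch)."""
    pooled = sorted(x + y)
    r1 = sum(pooled.index(v) + 1 for v in x)  # 1-based rank of each x value
    n1 = len(x)
    return r1 - n1 * (n1 + 1) / 2

x = [1.2, 3.4, 5.6]
y = [2.1, 4.3]
u1 = mann_whitney_u1(x, y)
print(u1)  # 3.0: three of the six (x, y) pairs have the x value larger
```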

Wilcoxon Signed-Rank Test

The Wilcoxon (1945) signed-rank test compares paired measurements.

Null hypothesis: The distribution of differences is symmetric about zero.

  1. Compute pairwise differences $d_i = x_{1i} - x_{2i}$
  2. Exclude pairs where $d_i = 0$ (zero differences carry no information about direction)
  3. Rank the absolute differences $|d_i|$
  4. Sum the ranks of positive differences ($W^+$) and negative differences ($W^-$)

If the distribution of differences is symmetric about zero, $W^+$ and $W^-$ should be roughly equal. A large discrepancy provides evidence against $H_0$.

Under $H_0$, the expected value and variance of $W^+$ are known, and the z-score for the normal approximation is:

$$z = \frac{W^+ - \mu_W}{\sigma_W}$$

where $\mu_W = n'(n' + 1)/4$ and $n'$ is the number of non-zero differences. The variance $\sigma_W^2$ includes a tie correction term.
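Steps 1-4 can be sketched for a small tie-free example (illustrative data):

```python
def signed_rank_sums(x1, x2):
    """W+ and W- for the Wilcoxon signed-rank test (steps 1-4 above;
    assumes no ties among the absolute differences in this sketch)."""
    d = [a - b for a, b in zip(x1, x2) if a != b]  # drop zero differences
    abs_sorted = sorted(abs(v) for v in d)
    w_plus = sum(abs_sorted.index(abs(v)) + 1 for v in d if v > 0)
    w_minus = sum(abs_sorted.index(abs(v)) + 1 for v in d if v < 0)
    return w_plus, w_minus

before = [10, 12, 9, 14]
after_ = [9, 14, 6, 10]
print(signed_rank_sums(before, after_))  # (8, 2)
```

Note that $W^+ + W^-$ always equals the total rank sum $n'(n' + 1)/2$, so either sum determines the other.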

Choosing Between Parametric and Nonparametric Tests

Choose the test method based on the characteristics of your data before examining the data:

  • Ordinal data: Use nonparametric tests. Rank-based tests are the natural choice for data measured on an ordinal scale, where differences between values are not meaningful
  • Continuous data expected to be approximately normal: The t-test has greater power than rank-based tests under normality. When the normality assumption is reasonable, the t-test provides more precise inference
  • Continuous data with known non-normal characteristics: When the nature of the variable suggests non-normality beforehand (e.g., income data with a heavy right tail, bounded variables, or discrete counts), choose a nonparametric test before examining the data

Switching to a nonparametric test after seeing the normality test result is a form of pre-testing -- see Multiple Testing for why this is problematic.

Normality Assumption and Diagnostics

The t-test relies on the sample mean following a normal distribution. This holds when the population is normally distributed. Even when the population is not normal, the Central Limit Theorem ensures the distribution of the sample mean approaches normality as the sample size grows. This makes the t-test robust to non-normality with large samples.

For paired t-tests, the test is internally a one-sample t-test on the differences $d_i = x_{1i} - x_{2i}$, so the normality assumption applies to the population of differences, not to each group individually. The test remains valid when the differences are approximately normal, even if the individual groups are not. The MIDAS diagnostics panel displays Q-Q plots, histograms, and Shapiro-Wilk tests for the differences in addition to the per-group diagnostics.

MIDAS displays diagnostics (Q-Q plots, Shapiro-Wilk (1965) normality test) after running a test. These are not a gate for switching methods -- they help you judge how much to trust the results.

  • Large sample, normality questionable: The CLT ensures the normal approximation for the sample mean works well, so the t-test results are generally reliable. The Shapiro-Wilk test flags even trivial deviations in large samples, so a significant result does not necessarily indicate a problem
  • Small sample, normality questionable: The t-test results become less reliable. Check the Q-Q plot for the degree of skewness or heavy tails, and interpret the results with appropriate caution

For nonparametric tests, normality is not an assumption. MIDAS still displays diagnostics to help you understand the shape of your data distributions.

Switching to a nonparametric test after seeing the normality test result is a form of pre-testing -- a multiple testing procedure that causes the overall Type I error rate to deviate from the nominal level. If the nature of your data suggests non-normality beforehand (e.g., income data with a heavy right tail), choose a nonparametric test before examining the data.

Multiple Testing

When you run multiple tests on the same data, the probability of incorrectly rejecting at least one true null hypothesis (the familywise error rate) increases. Even if each test uses $\alpha = 0.05$, running $m$ independent tests gives a familywise error rate of $1 - (1 - \alpha)^m$. For $m = 5$ this is about 23%; for $m = 10$, about 40%.
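The familywise error rate formula is easy to evaluate; the sketch below also shows the effect of a Bonferroni-adjusted per-test level $\alpha/m$:

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """FWER for m independent tests, each at level alpha: 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m

print(familywise_error_rate(0.05, 5))        # about 0.23
print(familywise_error_rate(0.05, 10))       # about 0.40
# Bonferroni: test each hypothesis at alpha/m to keep the FWER near alpha
print(familywise_error_rate(0.05 / 10, 10))  # about 0.049
```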

Typical situations where multiple testing is a concern:

  • Running Log-rank tests or t-tests repeatedly with different group variables
  • Testing the same hypothesis across multiple outcomes
  • Switching test methods after examining a normality test result (pre-testing)

Patterns discovered by exploring data are hypothesis generation, not hypothesis testing. To test a hypothesis that emerged from exploration, collect independent data. Testing on the same data that generated the hypothesis risks "discovering" chance patterns.

When new data collection is not feasible, report the results as exploratory analysis. Familywise error rate corrections (such as Bonferroni) can lend support to exploratory findings, but they are not a substitute for confirmatory testing.

References

  • Welch, B. L. (1947). The generalization of "Student's" problem when several different population variances are involved. Biometrika, 34(1-2), 28-35.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3-4), 591-611.
  • Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50-60.
  • Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80-83.