The Analogy
The p-value answers: “If H₀ were true, how likely would it be to see evidence this extreme or more extreme?” If p = 0.03, then assuming H₀ holds, there is only a 3% chance of observing a result at least this extreme. If p < 0.05 (the conventional threshold), we call the result statistically significant — unlikely to be due to chance alone.
Key insight: A p-value of 0.03 does NOT mean “3% chance H₀ is true.” It means “3% chance of seeing this data if H₀ were true.” This subtle distinction trips up even experienced researchers. Also, statistical significance ≠ practical significance — a tiny improvement can be “significant” with enough data.
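The 5% threshold has a concrete operational meaning: if H₀ is true and you run the test anyway, you will still cross p < 0.05 about 5% of the time. A small simulation makes this visible (the sample size, means, and spread below are illustrative choices, not values from the example above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate many experiments where H0 is TRUE: both "models" draw
# scores from the SAME distribution, so any difference is noise.
n_experiments = 10_000
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0.85, 0.02, size=5)
    b = rng.normal(0.85, 0.02, size=5)  # same mean — H0 holds
    _, p = stats.ttest_rel(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of experiments come out "significant" even though
# there is no real difference — exactly the false-positive rate
# the 0.05 threshold promises.
print(false_positives / n_experiments)
```

This is why a single p < 0.05 result is evidence, not proof: run twenty null experiments and one will likely look “significant” by chance.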
Worked Example
```python
from scipy import stats

# Compare two models' accuracy over five paired test-set runs
model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.82, 0.83, 0.81, 0.84, 0.82]

# Paired t-test: is the difference real?
t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)

print(p_value)  # ≈ 0.006 < 0.05
# → Statistically significant: a gap this consistent would be
#   unlikely if the two models performed equally well on average.
```
Rule of thumb: p < 0.05 = significant (reject H₀). p < 0.01 = highly significant. p < 0.001 = very highly significant. But always consider effect size too — a 0.1% improvement might be “significant” but not worth deploying.
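To act on that last point, report an effect size alongside the p-value. For paired data, a common sketch is Cohen's d computed on the per-run differences (using the same illustrative scores as above):

```python
import numpy as np

model_a = np.array([0.85, 0.87, 0.84, 0.86, 0.88])
model_b = np.array([0.82, 0.83, 0.81, 0.84, 0.82])
diff = model_a - model_b

# Cohen's d for paired samples: mean difference divided by the
# standard deviation of the differences (sample sd, ddof=1).
d = diff.mean() / diff.std(ddof=1)

print(round(diff.mean(), 3))  # mean improvement ≈ 0.036 accuracy
print(round(d, 2))            # d ≈ 2.37 — large by the usual benchmarks
```

Here both the raw improvement (≈3.6 accuracy points) and the standardized effect are substantial, so significance and practical relevance agree; with a 0.1% improvement, the same p-value could coexist with a negligible effect size.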