Moving to a World Beyond "p < 0.05" (2019)
Genetic variation and heterogeneous effects
- Several comments use omega-3 / FADS gene variants as an example where a treatment is vital for a small subgroup but appears useless on average.
- They argue that in genetically diverse populations, a significance test on group means can miss clinically huge effects confined to small genetic subgroups.
- Some see this mainly as unmodeled effect heterogeneity and missing covariates; others note practical barriers (genotyping, privacy, regulator concerns over “p-hacking” via subgroup selection).
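The omega-3/FADS point can be made concrete with a small simulation (hypothetical numbers, not from the thread): a treatment that strongly benefits only carriers of a minor gene variant looks nearly useless in the population-wide average.

```python
import random

random.seed(2)

# Hypothetical illustration: the treatment helps only carriers of a
# minor gene variant (here 5% of the population), by 2 outcome units.
n = 10_000
carriers = [random.random() < 0.05 for _ in range(n)]
effect = [2.0 if c else 0.0 for c in carriers]
outcome = [e + random.gauss(0, 1) for e in effect]  # effect plus unit-variance noise

overall = sum(outcome) / n
carrier_mean = sum(o for o, c in zip(outcome, carriers) if c) / sum(carriers)

print(f"population mean effect: {overall:.2f}")   # small: ~0.05 * 2.0
print(f"carrier-subgroup mean:  {carrier_mean:.2f}")  # large: ~2.0
```

Unless the analysis is stratified by genotype, the population mean (~0.1) understates the carrier benefit (~2.0) by a factor of twenty.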
Power, distributions, and study design
- Commenters debate whether the “1 in 100” responder problem is just low power vs. a deeper issue with rare subgroups.
- Commenters highlight multimodal and heavily skewed real-world distributions, and criticize routine assumptions of unimodality/normality.
- Others stress that classical tests rely on the sampling distribution of the test statistic, not of the raw data; the CLT often justifies normal approximations, but can fail with heavy tails, small samples, or rare subgroups.
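The low-power side of the “1 in 100 responder” debate can be sketched directly (illustrative parameters, not from the thread): with 100 subjects per arm and 1% responders who each gain ten standard deviations, a two-sample test still rejects the null only rarely.

```python
import math
import random

random.seed(0)

def welch_t_p(x, y):
    """Two-sample Welch t statistic, with a normal approximation
    for the two-sided p-value (adequate at these sample sizes)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    t = (my - mx) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

def one_trial(n=100, responder_rate=0.01, responder_effect=10.0):
    # 1% of treated subjects respond enormously; everyone else gets nothing.
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(0, 1)
               + (responder_effect if random.random() < responder_rate else 0.0)
               for _ in range(n)]
    return welch_t_p(control, treated)

rejections = sum(one_trial() < 0.05 for _ in range(1000))
print(f"p < 0.05 in {rejections} of 1000 simulated trials")
```

The mean shift is only 0.1 sd (and many trials contain zero responders at all), so most trials fail to reach significance even though the effect is life-changing for responders.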
Misuse and limits of p-values / NHST
- Many agree with the article’s core “don’ts”: p-values don’t prove effects exist, don’t prove the null, and don’t measure real-world importance.
- Multiple statisticians in the thread claim misuse is widespread across disciplines, not just among “consumers” of research.
- Clarifications: a p-value is the probability of data at least as extreme as observed, computed assuming the null is true — not the probability that the hypothesis is true. And with large enough N, tiny but practically trivial effects yield tiny p-values.
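The large-N point is easy to verify analytically. Assuming a two-sample z-test of a true standardized effect d with n subjects per arm, the expected statistic is roughly d / sqrt(2/n), so a “trivial” d = 0.01 sails past any threshold once n is large:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

d = 0.01  # a trivial true effect: 1% of a standard deviation
results = {}
for n in (1_000, 100_000, 10_000_000):
    z = d / math.sqrt(2 / n)  # expected two-sample z statistic
    results[n] = two_sided_p(z)
    print(f"n per arm = {n:>10,}  z = {z:6.2f}  p = {results[n]:.3g}")
```

At n = 1,000 the effect is invisible (p ≈ 0.8); at n = 10,000,000 the same trivial effect is “significant” beyond any doubt — which is exactly why a p-value is not a measure of real-world importance.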
Replication crisis, publication bias, and incentives
- Commenters link overreliance on p<0.05 and selective publication of “significant” results to the replication crisis, especially in psychology and biomedicine.
- Point out that p<0.05 still passes 1 in 20 true-null studies; if significance gates publication and most tested hypotheses are false, false positives can dominate the visible literature.
- Note that journals and careers reward “interesting” positive findings; null or inconclusive results are hard to publish.
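The “false positives dominate” arithmetic is a positive-predictive-value calculation. With illustrative (assumed) numbers for the prior fraction of true hypotheses and typical power:

```python
def ppv(prior_true, power, alpha=0.05):
    """Fraction of 'significant' results that are true positives,
    given the prior fraction of true hypotheses, the power of the
    studies, and the significance threshold alpha."""
    true_pos = prior_true * power
    false_pos = (1 - prior_true) * alpha
    return true_pos / (true_pos + false_pos)

# (prior fraction true, power) — hypothetical research regimes
for prior, power in [(0.5, 0.8), (0.1, 0.5), (0.02, 0.2)]:
    print(f"prior={prior:.2f}  power={power:.1f}  ->  PPV={ppv(prior, power):.2f}")
```

When half of tested hypotheses are true and power is high, a significant result is usually real (PPV ≈ 0.94). In a speculative, underpowered regime (2% true, 20% power), fewer than 1 in 12 published “significant” findings is a true effect — and if journals publish only the significant ones, that is the literature readers see.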
Averaging, effect sizes, and interpretation
- Several criticize overuse of averages, arguing that they obscure heterogeneous responses and rare but large effects.
- Emphasis on reporting effect sizes, confidence/credible intervals, subgroup patterns, and raw data when possible.
- Some argue thresholds (including p) are pragmatically useful as rough filters; others see hard cutoffs as fundamentally distorting.
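What “report effect sizes and intervals” means in practice can be sketched with simulated data (assumed true effect of 0.4 sd): Cohen's d plus a normal-approximation 95% CI for the raw mean difference.

```python
import math
import random

random.seed(1)
# Simulated outcomes: group b has an assumed true advantage of 0.4 sd.
a = [random.gauss(0.0, 1.0) for _ in range(80)]
b = [random.gauss(0.4, 1.0) for _ in range(80)]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

diff = mean(b) - mean(a)
pooled_sd = math.sqrt((var(a) + var(b)) / 2)  # equal group sizes
d = diff / pooled_sd                          # Cohen's d, a standardized effect size
se = math.sqrt(var(a) / len(a) + var(b) / len(b))
ci = (diff - 1.96 * se, diff + 1.96 * se)     # normal-approx 95% CI
print(f"difference = {diff:.2f}, Cohen's d = {d:.2f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Unlike a bare p-value, this reports how big the effect is and how uncertain the estimate is — the two things a reader actually needs.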
Alternatives and methodological culture
- Suggestions include Bayesian approaches, causal inference, better power analysis, preregistration, and publishing null results.
- Repeated theme: statistics alone can’t fix institutional incentives or poor study design; the deeper problem is cultural and systemic, not purely mathematical.
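Of the suggested fixes, power analysis is the most mechanical. A standard normal-approximation formula for the per-arm sample size of a two-sample test (the exact t-based answer is slightly larger):

```python
import math

def n_per_arm(d, z_alpha=1.959964, z_beta=0.841621):
    """Approximate n per arm for a two-sample test of standardized
    effect d, two-sided alpha = 0.05, power = 0.80
    (normal-approximation formula: n = 2 * ((z_a + z_b) / d)^2)."""
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Cohen's conventional "large", "medium", "small" effects
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_arm(d)} subjects per arm")
```

Detecting a small effect (d = 0.2) needs roughly 15x the sample of a large one (d = 0.8) — which is why so many studies of small effects are underpowered, feeding the publication-bias dynamic described above.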