Moving to a World Beyond "p < 0.05" (2019)

Genetic variation and heterogeneous effects

  • Several comments use omega-3 / FADS gene variants as an example where a treatment is vital for a small subgroup but appears useless on average.
  • Argue that in genetically diverse populations, tests of group means at p<0.05 can hide clinically large effects that are confined to small genetic subgroups.
  • Some see this mainly as unmodeled effect heterogeneity and missing covariates; others note practical barriers (genotyping, privacy, regulator concerns over “p-hacking” via subgroup selection).
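A minimal simulation (all numbers hypothetical: a 1-in-100 responder rate and a two-standard-deviation benefit for responders) illustrates how a subgroup effect this rare can disappear into a comparison of group means:

```python
import math
import random

random.seed(0)
n = 200                 # hypothetical trial: 200 patients per arm
responder_rate = 0.01   # 1 in 100 carries the relevant gene variant
effect = 2.0            # large benefit, but only for responders

control = [random.gauss(0.0, 1.0) for _ in range(n)]
treated, responders = [], 0
for _ in range(n):
    y = random.gauss(0.0, 1.0)
    if random.random() < responder_rate:
        y += effect
        responders += 1
    treated.append(y)

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Welch-style t statistic with a normal approximation for the p-value
se = math.sqrt(var(treated) / n + var(control) / n)
t = (mean(treated) - mean(control)) / se
p = math.erfc(abs(t) / math.sqrt(2))   # two-sided

print(f"responders in treated arm: {responders}")
print(f"mean difference: {mean(treated) - mean(control):.3f}")
print(f"two-sided p (normal approx.): {p:.3f}")
```

With only one or two responders among 200 patients, the average difference is on the order of 0.01–0.02 standard deviations, so the trial typically reads as a null result even though the treatment is transformative for the responders.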

Power, distributions, and study design

  • Debate whether the “1 in 100” responder problem is just low power vs. a deeper issue with rare subgroups.
  • Commenters highlight multimodal and heavily skewed real-world distributions, and criticize routine assumptions of unimodality/normality.
  • Others stress that classical tests depend on the sampling distribution of the test statistic; the central limit theorem often justifies normal approximations, but convergence can be slow for small samples from skewed or heavy-tailed populations.

Misuse and limits of p-values / NHST

  • Many agree with the article’s core “don’ts”: p-values don’t prove effects exist, don’t prove the null, and don’t measure real-world importance.
  • Multiple statisticians in the thread claim misuse is widespread across disciplines, not just among “consumers” of research.
  • Clarifications: the p-value is the probability of data at least as extreme as observed, computed assuming the null hypothesis is true; it is not the probability that the hypothesis is true. With large enough N, practically trivial effects can yield tiny p-values.
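The large-N point can be sketched directly (illustrative numbers: a true effect of 0.02 standard deviations, far below any plausible threshold of practical importance, with 200,000 observations per group):

```python
import math
import random

random.seed(2)
n = 200_000     # per group
delta = 0.02    # true effect: 0.02 standard deviations, practically negligible

a = [random.gauss(0.0, 1.0) for _ in range(n)]
b = [random.gauss(delta, 1.0) for _ in range(n)]

diff = sum(b) / n - sum(a) / n
z = diff / math.sqrt(2.0 / n)            # both groups have known sd 1 here
p = math.erfc(abs(z) / math.sqrt(2))     # two-sided

print(f"difference in means: {diff:.4f}")
print(f"p-value: {p:.1e}")   # tiny p, yet the effect itself is trivial
```

The p-value measures incompatibility with the null, not importance: at this sample size the expected z statistic is around 6, so "highly significant" and "practically irrelevant" coexist comfortably.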

Replication crisis, publication bias, and incentives

  • Commenters link overreliance on p<0.05 and selective publication of “significant” results to the replication crisis, especially in psychology and biomedicine.
  • Point out that when only “significant” (p<0.05) results get published and most tested hypotheses are false, false positives can dominate the visible literature, even though each individual test caps its false-positive rate at 1 in 20.
  • Note that journals and careers reward “interesting” positive findings; null or inconclusive results are hard to publish.
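The arithmetic behind the false-positive claim is a short Bayes'-rule calculation. Under hypothetical rates (1 in 10 tested hypotheses actually true, studies run at alpha = 0.05 with 20% power, only "significant" results published):

```python
# Hypothetical rates: 1 in 10 tested hypotheses is actually true,
# studies run at alpha = 0.05 with 20% power, and only
# "significant" results reach publication.
prior_true = 0.10
alpha = 0.05
power = 0.20

true_pos = prior_true * power          # 0.02 of all studies run
false_pos = (1 - prior_true) * alpha   # 0.045 of all studies run

# Positive predictive value of a published "significant" finding
ppv = true_pos / (true_pos + false_pos)
print(f"share of published positives that are real: {ppv:.2f}")
```

Under these assumed rates, false positives outnumber true positives more than two to one in the published record; raising power or the prior plausibility of tested hypotheses raises the PPV.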

Averaging, effect sizes, and interpretation

  • Several criticize overuse of averages, arguing that they obscure heterogeneous responses and rare but large effects.
  • Emphasis on reporting effect sizes, confidence or credible intervals, subgroup patterns, and raw data when possible.
  • Some argue thresholds (including p) are pragmatically useful as rough filters; others see hard cutoffs as fundamentally distorting.
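As a sketch of that reporting style (simulated data, hypothetical true effect of 0.3 standard deviations, 50 per group), an estimate with an interval conveys magnitude and uncertainty where a bare p-value does not:

```python
import math
import random

random.seed(3)
n = 50
a = [random.gauss(0.0, 1.0) for _ in range(n)]   # control
b = [random.gauss(0.3, 1.0) for _ in range(n)]   # treated, true effect 0.3 sd

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

diff = mean(b) - mean(a)
se = math.sqrt(var(a) / n + var(b) / n)
lo, hi = diff - 1.96 * se, diff + 1.96 * se   # 95% CI, normal approximation

print(f"estimated effect: {diff:.2f} sd, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The width of the interval makes the study's resolving power visible: a wide interval straddling zero reads as "inconclusive", not as "no effect".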

Alternatives and methodological culture

  • Suggestions include Bayesian approaches, causal inference, better power analysis, preregistration, and publishing null results.
  • Repeated theme: statistics alone can’t fix institutional incentives or poor study design; the deeper problem is cultural and systemic, not purely mathematical.
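Of those suggestions, power analysis is the most mechanical to demonstrate. A standard normal-approximation sample-size formula for a two-arm comparison of means might look like:

```python
import math
from statistics import NormalDist

def n_per_arm(d, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-arm comparison of means,
    detecting a standardized effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # ~0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_arm(0.5))   # "medium" effect: 63 per arm
print(n_per_arm(0.1))   # small effect: 1570 per arm
```

The formula also quantifies the rare-responder problem raised earlier in the thread: a subgroup effect diluted to an average of 0.02 standard deviations would need on the order of 39,000 patients per arm to detect at 80% power.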