Moving to a World Beyond "p < 0.05" (2019)
Genetic variation and heterogeneous effects
- Several comments use omega-3 / FADS gene variants as an example where a treatment is vital for a small subgroup but appears useless on average.
- They argue that in genetically diverse populations, a significance test on group means can miss clinically huge effects confined to small genetic subgroups.
- Some see this mainly as unmodeled effect heterogeneity and missing covariates; others note practical barriers (genotyping, privacy, regulator concerns over “p-hacking” via subgroup selection).
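The omega-3/FADS point can be made concrete with a small simulation (hypothetical numbers, not from the thread): a treatment that strongly benefits only carriers of a minor gene variant looks nearly useless in the population-wide average.

```python
import random

random.seed(2)

# Hypothetical illustration: the treatment helps only carriers of a
# minor gene variant (here 5% of the population), by 2 outcome units.
n = 10_000
carriers = [random.random() < 0.05 for _ in range(n)]
effect = [2.0 if c else 0.0 for c in carriers]
outcome = [e + random.gauss(0, 1) for e in effect]  # effect plus unit-variance noise

overall = sum(outcome) / n
carrier_mean = sum(o for o, c in zip(outcome, carriers) if c) / sum(carriers)

print(f"population mean effect: {overall:.2f}")   # small: ~0.05 * 2.0
print(f"carrier-subgroup mean:  {carrier_mean:.2f}")  # large: ~2.0
```

Unless the analysis is stratified by genotype, the population mean (~0.1) understates the carrier benefit (~2.0) by a factor of twenty.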
Power, distributions, and study design
- Commenters debate whether the “1 in 100” responder problem is just low power vs. a deeper issue with rare subgroups.
- Commenters highlight multimodal and heavily skewed real-world distributions, and criticize routine assumptions of unimodality/normality.
- Others stress that classical tests rely on the sampling distribution of the test statistic, not of the raw data; the CLT often justifies normal approximations, but can fail with heavy tails, small samples, or rare subgroups.
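The low-power side of the “1 in 100 responder” debate can be sketched directly (illustrative parameters, not from the thread): with 100 subjects per arm and 1% responders who each gain ten standard deviations, a two-sample test still rejects the null only rarely.

```python
import math
import random

random.seed(0)

def welch_t_p(x, y):
    """Two-sample Welch t statistic, with a normal approximation
    for the two-sided p-value (adequate at these sample sizes)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    t = (my - mx) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

def one_trial(n=100, responder_rate=0.01, responder_effect=10.0):
    # 1% of treated subjects respond enormously; everyone else gets nothing.
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(0, 1)
               + (responder_effect if random.random() < responder_rate else 0.0)
               for _ in range(n)]
    return welch_t_p(control, treated)

rejections = sum(one_trial() < 0.05 for _ in range(1000))
print(f"p < 0.05 in {rejections} of 1000 simulated trials")
```

The mean shift is only 0.1 sd (and many trials contain zero responders at all), so most trials fail to reach significance even though the effect is life-changing for responders.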
Misuse and limits of p-values / NHST
- Many agree with the article’s core “don’ts”: p-values don’t prove effects exist, don’t prove the null, and don’t measure real-world importance.
- Multiple statisticians in the thread claim misuse is widespread across disciplines, not just among “consumers” of research.
- Clarifications: a p-value is the probability of data at least as extreme as observed, computed assuming the null is true — not the probability that the hypothesis is true. And with large enough N, tiny but practically trivial effects yield tiny p-values.
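The large-N point is easy to verify analytically. Assuming a two-sample z-test of a true standardized effect d with n subjects per arm, the expected statistic is roughly d / sqrt(2/n), so a “trivial” d = 0.01 sails past any threshold once n is large:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

d = 0.01  # a trivial true effect: 1% of a standard deviation
results = {}
for n in (1_000, 100_000, 10_000_000):
    z = d / math.sqrt(2 / n)  # expected two-sample z statistic
    results[n] = two_sided_p(z)
    print(f"n per arm = {n:>10,}  z = {z:6.2f}  p = {results[n]:.3g}")
```

At n = 1,000 the effect is invisible (p ≈ 0.8); at n = 10,000,000 the same trivial effect is “significant” beyond any doubt — which is exactly why a p-value is not a measure of real-world importance.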
Replication crisis, publication bias, and incentives
- Commenters link overreliance on p<0.05 and selective publication of “significant” results to the replication crisis, especially in psychology and biomedicine.
- Point out that p<0.05 still passes 1 in 20 true-null studies; if significance gates publication and most tested hypotheses are false, false positives can dominate the visible literature.
- Note that journals and careers reward “interesting” positive findings; null or inconclusive results are hard to publish.
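The “false positives dominate” arithmetic is a positive-predictive-value calculation. With illustrative (assumed) numbers for the prior fraction of true hypotheses and typical power:

```python
def ppv(prior_true, power, alpha=0.05):
    """Fraction of 'significant' results that are true positives,
    given the prior fraction of true hypotheses, the power of the
    studies, and the significance threshold alpha."""
    true_pos = prior_true * power
    false_pos = (1 - prior_true) * alpha
    return true_pos / (true_pos + false_pos)

# (prior fraction true, power) — hypothetical research regimes
for prior, power in [(0.5, 0.8), (0.1, 0.5), (0.02, 0.2)]:
    print(f"prior={prior:.2f}  power={power:.1f}  ->  PPV={ppv(prior, power):.2f}")
```

When half of tested hypotheses are true and power is high, a significant result is usually real (PPV ≈ 0.94). In a speculative, underpowered regime (2% true, 20% power), fewer than 1 in 12 published “significant” findings is a true effect — and if journals publish only the significant ones, that is the literature readers see.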
Averaging, effect sizes, and interpretation
- Several criticize overuse of averages, arguing that they obscure heterogeneous responses and rare but large effects.
- Emphasis on reporting effect sizes, confidence/credible intervals, subgroup patterns, and raw data when possible.
- Some argue thresholds (including p) are pragmatically useful as rough filters; others see hard cutoffs as fundamentally distorting.
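What “report effect sizes and intervals” means in practice can be sketched with simulated data (assumed true effect of 0.4 sd): Cohen's d plus a normal-approximation 95% CI for the raw mean difference.

```python
import math
import random

random.seed(1)
# Simulated outcomes: group b has an assumed true advantage of 0.4 sd.
a = [random.gauss(0.0, 1.0) for _ in range(80)]
b = [random.gauss(0.4, 1.0) for _ in range(80)]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

diff = mean(b) - mean(a)
pooled_sd = math.sqrt((var(a) + var(b)) / 2)  # equal group sizes
d = diff / pooled_sd                          # Cohen's d, a standardized effect size
se = math.sqrt(var(a) / len(a) + var(b) / len(b))
ci = (diff - 1.96 * se, diff + 1.96 * se)     # normal-approx 95% CI
print(f"difference = {diff:.2f}, Cohen's d = {d:.2f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Unlike a bare p-value, this reports how big the effect is and how uncertain the estimate is — the two things a reader actually needs.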
Alternatives and methodological culture
- Suggestions include Bayesian approaches, causal inference, better power analysis, preregistration, and publishing null results.
- Repeated theme: statistics alone can’t fix institutional incentives or poor study design; the deeper problem is cultural and systemic, not purely mathematical.
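Of the suggested fixes, power analysis is the most mechanical. A standard normal-approximation formula for the per-arm sample size of a two-sample test (the exact t-based answer is slightly larger):

```python
import math

def n_per_arm(d, z_alpha=1.959964, z_beta=0.841621):
    """Approximate n per arm for a two-sample test of standardized
    effect d, two-sided alpha = 0.05, power = 0.80
    (normal-approximation formula: n = 2 * ((z_a + z_b) / d)^2)."""
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Cohen's conventional "large", "medium", "small" effects
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_arm(d)} subjects per arm")
```

Detecting a small effect (d = 0.2) needs roughly 15x the sample of a large one (d = 0.8) — which is why so many studies of small effects are underpowered, feeding the publication-bias dynamic described above.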