P-Hacking in Startups
How common and useful is rigorous A/B testing in startups?
- Early-stage startups often lack enough users for meaningful experiments; many argue early teams should rely on intuition and qualitative feedback and focus on the core product and product–market fit (PMF).
- As products scale (e.g., ~1M MAU), disciplined A/B testing becomes more feasible and impactful.
- Several people report that A/B tests commonly show no significant effect, adding delay and cost; others see them as protection against “HiPPO” decisions (the highest-paid person’s opinion).
- Some recommend using experiments mainly for high-impact changes (e.g., pricing, ranking algorithms), not visual micro-optimizations.
Rigor vs practicality: how “serious” should stats be?
- Strong disagreement over the article’s analogy to medical trials:
  - One camp: business decisions still burn time/money; sloppy inference accumulates bad bets and false confidence.
  - Other camp: software is reversible; over-rigor (waiting weeks for statistical significance, strict corrections) is often worse than occasional false positives.
- Many suggest calibrating rigor to risk: a stricter (lower) p‑value threshold for costly or irreversible changes, a more tolerant one (e.g., p≈0.1) for cheap, reversible tweaks.
- Some argue the “right” startup strategy is to run many underpowered tests, pick the variant that looks best, accept lots of noise, and keep moving (a minimal simulation of that trade-off follows this list).
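As a concrete illustration of that last bullet, here is a minimal simulation sketch under made-up assumptions (a 5% true conversion rate for every variant, 500 users per arm, 10 variants per round): when every variant is actually identical, the best-looking one still clears a naive one-sided 5% test far more than 5% of the time, so “pick the winner and move on” ships a lot of noise.

```python
# Minimal simulation sketch (hypothetical numbers: every variant converts at the
# same 5% rate, 500 users per arm, 10 variants per "round"). It illustrates how
# many underpowered tests plus cherry-picking the winner produce spurious lifts.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.05          # all variants identical; any "winner" is pure noise
n_per_arm = 500           # deliberately underpowered for ~1pp effects
n_variants = 10
n_rounds = 2000

false_wins = 0
for _ in range(n_rounds):
    control = rng.binomial(n_per_arm, true_rate)
    variants = rng.binomial(n_per_arm, true_rate, size=n_variants)
    best = variants.max()
    # Two-proportion z-test of the best-looking variant vs control
    p1, p2 = control / n_per_arm, best / n_per_arm
    pooled = (control + best) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = (p2 - p1) / se if se > 0 else 0.0
    if z > 1.64:          # one-sided "p < 0.05" on the cherry-picked winner
        false_wins += 1

print(f"Rounds where the cherry-picked winner looked significant: "
      f"{false_wins / n_rounds:.1%}")   # far above the nominal 5%
```

Whether that error rate is acceptable is exactly the risk-calibration question raised in the bullets above.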
P‑hacking, pre-registration, and multiple metrics
- Pre‑registration is framed as a commitment device: define one primary metric and analysis plan up front so all other patterns are treated as exploratory, not confirmatory.
- Concern that wandering through many variants/metrics guarantees some spurious “wins”; discussions mention Bonferroni, Benjamini–Hochberg, and “alpha ledgers” as ways to control error rates (a sketch of the first two follows this list).
- Others emphasize organizational drivers of p‑hacking: pressure to “have a win,” vanity metrics, and ignoring long runs of inconclusive tests that imply the UI barely matters.
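To make the corrections named above concrete, here is a minimal sketch of Bonferroni and Benjamini–Hochberg over one batch of hypothetical p‑values (the values and the α = 0.05 level are invented for illustration). Bonferroni controls the family-wise error rate; Benjamini–Hochberg controls the false discovery rate, which is why it typically admits more “wins” from the same batch.

```python
# Minimal sketch of the two corrections mentioned above, applied to a
# hypothetical batch of p-values from one round of metric/variant "wandering".
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i iff p_i <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(pvals)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i with
    p_(i) <= (i / m) * alpha (controls the false discovery rate)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Made-up p-values for illustration only
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print("Bonferroni rejects:        ", bonferroni(pvals).sum(), "of", len(pvals))
print("Benjamini–Hochberg rejects:", benjamini_hochberg(pvals).sum(), "of", len(pvals))
```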
Methodological debates: p‑values, Bayesian approaches, and alternatives
- Several commenters note conceptual errors in the post (miscomputed probabilities, misinterpreted p‑values) and stress that p<0.05 describes the probability of data at least this extreme assuming no effect, not a “5% chance the feature is bad.”
- Multiple voices advocate Bayesian decision-making, multi‑armed bandits, sequential tests, permutation tests, or simply focusing on effect sizes and business relevance rather than significance thresholds (a Beta-Binomial sketch follows this list).
- Some suggest standard designs (ANOVA, contingency tables, power analysis) and user research would be more appropriate than many small, fragmented A/B tests of layouts.
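A minimal sketch of the Bayesian framing several commenters advocate, assuming hypothetical conversion counts and a flat Beta(1, 1) prior: report P(variant beats control), the expected lift, and a credible interval, then decide against thresholds tied to the cost of being wrong rather than against a fixed p‑value.

```python
# Minimal Bayesian sketch: estimate P(variant beats control) and the expected
# lift from Beta posteriors, then decide based on business relevance.
# The conversion counts below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)

control = (120, 2400)   # (conversions, visitors)
variant = (141, 2380)

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each rate
post_a = rng.beta(1 + control[0], 1 + control[1] - control[0], size=100_000)
post_b = rng.beta(1 + variant[0], 1 + variant[1] - variant[0], size=100_000)

lift = post_b - post_a
print(f"P(variant > control)   = {np.mean(lift > 0):.1%}")
print(f"Expected absolute lift = {lift.mean():.4f}")
print(f"95% credible interval  = [{np.quantile(lift, 0.025):.4f}, "
      f"{np.quantile(lift, 0.975):.4f}]")
# A team might ship only if both the win probability and the expected lift
# clear thresholds chosen for the change's cost (cheap/reversible vs not).
```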
Bigger picture: product strategy vs micro-optimization
- Widespread skepticism that layout/pixel-level tweaks matter much for early startups; likened to “rearranging deck chairs on the Titanic.”
- Repeated theme: choose better problems and metrics first; use experimentation to avoid harm and large mistakes, not to overfit trivial UI decisions.