P-Hacking in Startups

How common and useful is rigorous A/B testing in startups?

  • Early-stage startups often lack enough users for meaningful experiments; many argue you should rely on intuition and qualitative feedback and focus on the core product and PMF (the sample-size sketch after this list shows why small user bases are limiting).
  • As products scale (e.g., ~1M MAU), disciplined A/B testing becomes more feasible and impactful.
  • Several people report that A/B tests commonly show no significant effect, adding delay and cost; others see them as protection against the “HiPPO” (highest-paid person’s opinion).
  • Some recommend using experiments mainly for high-impact changes (e.g., pricing, ranking algorithms), not visual micro-optimizations.
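To make the sample-size point concrete, here is a minimal sketch of a standard two-proportion power calculation (normal approximation). The baseline conversion rate, target lift, and thresholds are hypothetical, chosen only to illustrate the scale of traffic a “meaningful” test can require.

```python
from scipy.stats import norm

def users_per_arm(p_baseline, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect p_baseline -> p_variant
    in a two-sided, two-proportion z-test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p_baseline + p_variant) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_baseline * (1 - p_baseline)
                             + p_variant * (1 - p_variant)) ** 0.5) ** 2
    return numerator / (p_baseline - p_variant) ** 2

# Hypothetical: detect a 10% relative lift on a 5% baseline conversion rate
print(round(users_per_arm(0.05, 0.055)))  # roughly 31,000 users per arm
```

At roughly 31,000 users per arm (over 60,000 total), a product with a few thousand monthly users cannot finish such a test in a reasonable time, while a ~1M MAU product can.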

Rigor vs practicality: how “serious” should stats be?

  • Strong disagreement over the article’s analogy to medical trials:
    • One camp: business decisions still burn time/money; sloppy inference accumulates bad bets and false confidence.
    • Other camp: software is reversible; over-rigor (waiting weeks for stat sig, strict corrections) is often worse than occasional false positives.
  • Many suggest calibrating rigor to risk: stricter (lower) p‑value thresholds for costly or irreversible changes, and a looser bar (e.g., p≈0.1) for cheap, reversible tweaks; see the sketch after this list.
  • Some argue the “right” startup strategy is to run many underpowered tests, pick the variant that looks best, accept lots of noise, and keep moving.
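A minimal sketch of what risk-calibrated thresholds could look like in code: a plain two-proportion z-test whose significance level is chosen by how costly the change would be to reverse. The alpha values and risk labels are illustrative assumptions, not a recommendation from the thread.

```python
from scipy.stats import norm

# Illustrative thresholds keyed to how painful a wrong call would be (assumed values)
ALPHA_BY_RISK = {"irreversible": 0.01, "costly": 0.05, "cheap_reversible": 0.10}

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

def ship_decision(conv_a, n_a, conv_b, n_b, risk="cheap_reversible"):
    """Return (ship?, p-value) under the alpha assigned to this risk level."""
    p = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    return p < ALPHA_BY_RISK[risk], p

# Hypothetical data: the same result judged as a cheap tweak vs. an irreversible change
print(ship_decision(500, 10_000, 560, 10_000, risk="cheap_reversible"))  # ships
print(ship_decision(500, 10_000, 560, 10_000, risk="irreversible"))      # does not
```

The same observed lift clears the loose bar for a reversible tweak but not the strict bar for an irreversible one, which is the calibration the comment above describes.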

P‑hacking, pre-registration, and multiple metrics

  • Pre‑registration is framed as a commitment device: define one primary metric and analysis plan up front so all other patterns are treated as exploratory, not confirmatory.
  • Concern that wandering through many variants/metrics guarantees some spurious “wins”; discussions mention Bonferroni, Benjamini–Hochberg, and “alpha ledgers” to control error rates (a Benjamini–Hochberg sketch follows this list).
  • Others emphasize organizational drivers of p‑hacking: pressure to “have a win,” vanity metrics, and ignoring long runs of inconclusive tests that imply the UI barely matters.
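For reference, here is a minimal sketch of the two corrections most often named in the thread, applied to a batch of made-up p-values from scoring one experiment on many metrics. The point is only that, uncorrected, roughly one in twenty noise metrics will look “significant.”

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / m (controls family-wise error rate)."""
    p = np.asarray(p_values)
    return p < alpha / len(p)

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up procedure: controls the false discovery rate at level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha   # (k/m) * alpha for rank k
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()               # largest rank passing its threshold
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values from one experiment scored on ten different metrics
p_vals = [0.003, 0.04, 0.049, 0.11, 0.20, 0.35, 0.48, 0.62, 0.81, 0.95]
print("naive     :", [p < 0.05 for p in p_vals])        # three "wins"
print("BH        :", list(benjamini_hochberg(p_vals)))  # one survives
print("Bonferroni:", list(bonferroni(p_vals)))          # one survives
```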

Methodological debates: p‑values, Bayesian approaches, and alternatives

  • Several commenters note conceptual errors in the post (miscomputed probabilities, misinterpretation of p‑values) and stress that p<0.05 means the data would be this extreme less than 5% of the time if there were no effect, not that there is a “5% chance the feature is bad.”
  • Multiple voices advocate Bayesian decision-making, multi‑armed bandits, sequential tests, permutation tests, or simply focusing on effect sizes and business relevance rather than significance thresholds (a Bayesian sketch follows this list).
  • Some suggest standard designs (ANOVA, contingency tables, power analysis) and user research would be more appropriate than many fragmented A/B tests on layouts.
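As one concrete alternative to threshold-chasing, here is a minimal sketch of the Beta-Binomial approach usually meant by “Bayesian A/B testing”: put a flat prior on each arm’s conversion rate and read off the probability that the variant beats control, plus the expected lift. The conversion counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data: conversions / visitors per arm
a_conv, a_n = 120, 2_400   # control
b_conv, b_n = 145, 2_450   # variant

# Beta(1, 1) prior; posterior over each rate is Beta(1 + conversions, 1 + non-conversions)
post_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=100_000)
post_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=100_000)

prob_b_better = (post_b > post_a).mean()   # P(variant beats control | data)
expected_lift = (post_b - post_a).mean()   # expected absolute lift in conversion rate
print(f"P(B > A) ~ {prob_b_better:.2f}, expected lift ~ {expected_lift:.4f}")
```

The output is a direct decision quantity (how likely the variant is to be better, and by how much) rather than a pass/fail threshold, which is the shift several commenters argue for.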

Bigger picture: product strategy vs micro-optimization

  • Widespread skepticism that layout/pixel-level tweaks matter much for early startups; likened to “rearranging deck chairs on the Titanic.”
  • Repeated theme: choose better problems and metrics first; use experimentation to avoid harm and large mistakes, not to overfit trivial UI decisions.