P-Hacking in Startups
How common and useful is rigorous A/B testing in startups?
- Early-stage startups often lack enough users for meaningful experiments; many argue early teams should rely on intuition and qualitative feedback and focus on the core product and product–market fit (PMF).
- As products scale (e.g., ~1M MAU), disciplined A/B testing becomes more feasible and impactful.
- Several people report that A/B tests commonly show no significant effect, adding delay and cost; others see them as protection against “HiPPO” decisions (the highest-paid person’s opinion).
- Some recommend using experiments mainly for high-impact changes (e.g., pricing, ranking algorithms), not visual micro-optimizations.
Rigor vs practicality: how “serious” should stats be?
- Strong disagreement over the article’s analogy to medical trials:
  - One camp: business decisions still burn time/money; sloppy inference accumulates bad bets and false confidence.
  - Other camp: software is reversible; over-rigor (waiting weeks for statistical significance, strict corrections) is often worse than occasional false positives.
- Many suggest calibrating rigor to risk: a stricter (lower) p‑value threshold for costly or irreversible changes, a more tolerant one (e.g., p≈0.1) for cheap, reversible tweaks.
- Some argue the “right” startup strategy is to run many underpowered tests, pick the variant that looks best, accept lots of noise, and keep moving (a minimal simulation of that trade-off follows this list).
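As a concrete illustration of that last bullet, here is a minimal simulation sketch under made-up assumptions (a 5% true conversion rate for every variant, 500 users per arm, 10 variants per round): when every variant is actually identical, the best-looking one still clears a naive one-sided 5% test far more than 5% of the time, so “pick the winner and move on” ships a lot of noise.

```python
# Minimal simulation sketch (hypothetical numbers: every variant converts at the
# same 5% rate, 500 users per arm, 10 variants per "round"). It illustrates how
# many underpowered tests plus cherry-picking the winner produce spurious lifts.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.05          # all variants identical; any "winner" is pure noise
n_per_arm = 500           # deliberately underpowered for ~1pp effects
n_variants = 10
n_rounds = 2000

false_wins = 0
for _ in range(n_rounds):
    control = rng.binomial(n_per_arm, true_rate)
    variants = rng.binomial(n_per_arm, true_rate, size=n_variants)
    best = variants.max()
    # Two-proportion z-test of the best-looking variant vs control
    p1, p2 = control / n_per_arm, best / n_per_arm
    pooled = (control + best) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = (p2 - p1) / se if se > 0 else 0.0
    if z > 1.64:          # one-sided "p < 0.05" on the cherry-picked winner
        false_wins += 1

print(f"Rounds where the cherry-picked winner looked significant: "
      f"{false_wins / n_rounds:.1%}")   # far above the nominal 5%
```

Whether that error rate is acceptable is exactly the risk-calibration question raised in the bullets above.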
P‑hacking, pre-registration, and multiple metrics
- Pre‑registration is framed as a commitment device: define one primary metric and analysis plan up front so all other patterns are treated as exploratory, not confirmatory.
- Concern that wandering through many variants/metrics guarantees some spurious “wins”; discussions mention Bonferroni, Benjamini–Hochberg, and “alpha ledgers” as ways to control error rates (a sketch of the first two follows this list).
- Others emphasize organizational drivers of p‑hacking: pressure to “have a win,” vanity metrics, and ignoring long runs of inconclusive tests that imply the UI barely matters.
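To make the corrections named above concrete, here is a minimal sketch of Bonferroni and Benjamini–Hochberg over one batch of hypothetical p‑values (the values and the α = 0.05 level are invented for illustration). Bonferroni controls the family-wise error rate; Benjamini–Hochberg controls the false discovery rate, which is why it typically admits more “wins” from the same batch.

```python
# Minimal sketch of the two corrections mentioned above, applied to a
# hypothetical batch of p-values from one round of metric/variant "wandering".
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i iff p_i <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(pvals)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i with
    p_(i) <= (i / m) * alpha (controls the false discovery rate)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Made-up p-values for illustration only
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print("Bonferroni rejects:        ", bonferroni(pvals).sum(), "of", len(pvals))
print("Benjamini–Hochberg rejects:", benjamini_hochberg(pvals).sum(), "of", len(pvals))
```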
Methodological debates: p‑values, Bayesian approaches, and alternatives
- Several commenters note conceptual errors in the post (miscomputed probabilities, misinterpreted p‑values) and stress that p<0.05 describes the probability of data at least this extreme assuming no effect, not a “5% chance the feature is bad.”
- Multiple voices advocate Bayesian decision-making, multi‑armed bandits, sequential tests, permutation tests, or simply focusing on effect sizes and business relevance rather than significance thresholds (a Beta-Binomial sketch follows this list).
- Some suggest standard designs (ANOVA, contingency tables, power analysis) and user research would be more appropriate than many small, fragmented A/B tests of layouts.
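A minimal sketch of the Bayesian framing several commenters advocate, assuming hypothetical conversion counts and a flat Beta(1, 1) prior: report P(variant beats control), the expected lift, and a credible interval, then decide against thresholds tied to the cost of being wrong rather than against a fixed p‑value.

```python
# Minimal Bayesian sketch: estimate P(variant beats control) and the expected
# lift from Beta posteriors, then decide based on business relevance.
# The conversion counts below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)

control = (120, 2400)   # (conversions, visitors)
variant = (141, 2380)

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each rate
post_a = rng.beta(1 + control[0], 1 + control[1] - control[0], size=100_000)
post_b = rng.beta(1 + variant[0], 1 + variant[1] - variant[0], size=100_000)

lift = post_b - post_a
print(f"P(variant > control)   = {np.mean(lift > 0):.1%}")
print(f"Expected absolute lift = {lift.mean():.4f}")
print(f"95% credible interval  = [{np.quantile(lift, 0.025):.4f}, "
      f"{np.quantile(lift, 0.975):.4f}]")
# A team might ship only if both the win probability and the expected lift
# clear thresholds chosen for the change's cost (cheap/reversible vs not).
```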
Bigger picture: product strategy vs micro-optimization
- Widespread skepticism that layout/pixel-level tweaks matter much for early startups; likened to “rearranging deck chairs on the Titanic.”
- Repeated theme: choose better problems and metrics first; use experimentation to avoid harm and large mistakes, not to overfit trivial UI decisions.