Lines of code that beat A/B testing (2012)

Multi-armed bandits vs. A/B testing

  • Multi-armed bandits (MAB) are praised for maximizing reward while an experiment is still running, especially for simple, immediate metrics like clicks.
  • Supporters say MAB “beats” classic A/B by shifting traffic toward better variants earlier and by scaling naturally to many variants.
  • Critics argue the blog post overclaims: the amount of data needed for statistical significance doesn’t shrink, and a simple, well-run A/B test can be equally effective for many real-world needs.
  • Several people note that MAB is best seen as an optimization tool, while A/B testing is better as a learning tool for estimating true effects.
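
To make the "shifting traffic earlier" point concrete, here is a minimal sketch of epsilon-greedy, one common bandit strategy. The class name, the 10% exploration rate, the optimistic default for untried arms, and the simulated conversion rates are all assumptions chosen for illustration, not details taken from the thread.

```python
import random


class EpsilonGreedy:
    """Illustrative epsilon-greedy bandit over a fixed number of variants."""

    def __init__(self, n_arms, epsilon=0.1, rng=None):
        self.epsilon = epsilon              # assumed exploration rate
        self.rng = rng or random.Random()
        self.trials = [0] * n_arms          # times each variant was shown
        self.rewards = [0.0] * n_arms       # total reward per variant

    def estimate(self, arm):
        # Optimistic default so every untried variant gets shown at least once.
        if self.trials[arm] == 0:
            return 1.0
        return self.rewards[arm] / self.trials[arm]

    def choose(self):
        # Explore a random variant with probability epsilon, else exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.trials))
        return max(range(len(self.trials)), key=self.estimate)

    def update(self, arm, reward):
        self.trials[arm] += 1
        self.rewards[arm] += reward


def simulate(true_rates, steps=20000, seed=0):
    """Run the bandit against fixed (static) conversion rates."""
    rng = random.Random(seed)
    bandit = EpsilonGreedy(len(true_rates), epsilon=0.1, rng=rng)
    for _ in range(steps):
        arm = bandit.choose()
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.trials


if __name__ == "__main__":
    # Hypothetical click-through rates for three variants.
    print(simulate([0.05, 0.10, 0.30]))
```

In a run like this most impressions end up on the best variant, which is the optimization benefit supporters describe. The per-variant averages it produces are not a substitute for a significance test, which is the critics' point.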

Implementation & infrastructure complexity

  • The biggest cost is not the algorithm but state management and online feedback loops: extra DB columns, performance concerns, and outcome computation.
  • Simple client-side randomization + logging is often much easier than wiring online reward tracking for MAB.
  • Consistent user assignment (stickiness) complicates both A/B and MAB; hashing, seeding, and feature flags are common tools, with pitfalls around non‑uniformity and ID assumptions.
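
As a sketch of the hashing approach to sticky assignment, the function below derives a variant from a stable user ID and an experiment name. The function name, the key format, and the choice of SHA-256 are assumptions for illustration. It presumes each user has a stable identifier, which is exactly the kind of ID assumption the thread warns about.

```python
import hashlib
from collections import Counter


def assign_variant(user_id, experiment, variants):
    """Deterministically map a user to a variant (hypothetical helper).

    Salting with the experiment name keeps assignments independent across
    experiments. Hashing avoids the skew that `user_id % n` can show when
    IDs are sequential or otherwise patterned.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big")  # first 64 bits as an integer
    return variants[bucket % len(variants)]


if __name__ == "__main__":
    variants = ["control", "treatment"]
    # The same user always lands in the same variant.
    print(assign_variant("user-42", "checkout-button", variants))
    # Rough uniformity check over synthetic IDs.
    counts = Counter(
        assign_variant(f"user-{i}", "checkout-button", variants)
        for i in range(10000)
    )
    print(counts)
```

No per-user state has to be stored for this to work, which is part of why plain A/B assignment is cheaper to operate than a bandit that must also track rewards online.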

Statistics, significance, and traffic constraints

  • Many sites lack enough traffic to reach significance in a reasonable time, especially with more than two variants.
  • Some argue point estimates can be enough to choose a version when costs are similar, even without formal significance.
  • Others stress that if you care about effect size and significance, the article’s approach is insufficient.
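
The traffic constraint can be estimated with the standard normal-approximation formula for comparing two proportions. This is a rough sketch: the function name, the 5% significance level, the 80% power, and the example conversion rates are assumptions, not figures from the thread.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant for a two-sided test.

    Uses n = (z_{alpha/2} + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2,
    the usual normal approximation for two independent proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2)


if __name__ == "__main__":
    # Hypothetical case: detecting a lift from 5% to 6% conversion.
    n = sample_size_per_variant(0.05, 0.06)
    print(f"about {n} visitors per variant")
```

Under these assumptions, detecting a lift from 5% to 6% takes on the order of eight thousand visitors per variant, and every additional variant needs its own allocation. That makes it clear why low-traffic sites struggle, and why some commenters settle for point estimates when the variants cost about the same to ship.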

Dynamic environments & bias risks

  • Standard MAB assumes static reward rates; in e‑commerce, conversions change with time of day, sales, device mix, etc.
  • Time-varying or delayed rewards can cause MAB to lock onto the wrong variant; forgetting factors and more advanced methods exist but add complexity.
  • MAB can amplify biases from bugs, eligibility issues, caching discrepancies, or mis-specified metrics, potentially converging on very bad experiences.
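
A forgetting factor can be sketched as an exponentially discounted average, so that old observations fade and the estimate can follow a changing conversion rate. The class name and the discount value of 0.99 are assumptions. The sketch also discounts only when the arm itself is observed, a simplification of discounted bandit methods that decay every arm on every step.

```python
class DiscountedArm:
    """Reward estimate for one variant with exponential forgetting."""

    def __init__(self, gamma=0.99):
        self.gamma = gamma    # assumed discount; closer to 1.0 forgets more slowly
        self.weight = 0.0     # discounted number of observations
        self.reward = 0.0     # discounted sum of rewards

    def update(self, reward):
        # Shrink the old totals before adding the new observation.
        self.weight = self.gamma * self.weight + 1.0
        self.reward = self.gamma * self.reward + reward

    def estimate(self):
        return self.reward / self.weight if self.weight else 0.0


if __name__ == "__main__":
    # Hypothetical variant that converts every time, then stops converting.
    history = [1.0] * 500 + [0.0] * 500
    arm = DiscountedArm(gamma=0.99)
    for r in history:
        arm.update(r)
    plain_mean = sum(history) / len(history)
    print(f"plain mean: {plain_mean:.3f}, discounted: {arm.estimate():.3f}")
```

The plain average still reports 0.5 long after the variant stopped converting, while the discounted estimate has dropped close to zero. The cost is one more parameter to tune and noisier estimates, which is the added complexity the thread mentions.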

User experience and ethics

  • Constantly changing variants can harm UX, support workflows, and even safety (e.g., UI changes while driving).
  • The drug-trial analogy is debated: control groups “miss out” on benefits but are also protected from unknown harms.

Real-world practice & politics

  • Many organizations use A/B mostly for gradual rollouts, safety checks, and political cover rather than pure optimization.
  • There is widespread concern about “data-driven” rhetoric masking gut-driven or statistically sloppy decisions.