2025-01-09

Lines of code that beat A/B testing (2012)

Multi-armed bandits vs. A/B testing

MAB (multi-armed bandits) are praised for maximizing reward during experiments, especially for simple, immediate metrics like clicks.
Supporters say MAB “beats” classic A/B by shifting traffic toward better variants earlier and generalizing well to many variants.
Critics argue the blog post overclaims: statistical significance requirements don’t change, and simple, well-run A/B can be equally effective for many real-world needs.
Several people note MAB is best seen as an optimization tool; A/B is better as a learning tool to estimate true effects.

Implementation & infrastructure complexity

Biggest cost is not the algorithm but state management and online feedback loops: extra DB columns, performance concerns, outcome computation.
Simple client-side randomization + logging is often much easier than wiring online reward tracking for MAB.
Consistent user assignment (stickiness) complicates both A/B and MAB; hashing, seeding, and feature flags are common tools, with pitfalls around non‑uniformity and ID assumptions.

Statistics, significance, and traffic constraints

Many sites lack enough traffic to reach significance in reasonable time, especially with >2 variants.
Some argue point estimates can be enough to choose a version when costs are similar, even without formal significance.
Others stress that if you care about effect size and significance, the article’s approach is insufficient.

Dynamic environments & bias risks

Standard MAB assumes static reward rates; in e‑commerce, conversions change with time of day, sales, device mix, etc.
Time-varying or delayed rewards can cause MAB to lock onto the wrong variant; forgetting factors and more advanced methods exist but add complexity.
MAB can amplify biases from bugs, eligibility issues, caching discrepancies, or mis-specified metrics, potentially converging on very bad experiences.

User experience and ethics

Constantly changing variants can harm UX, support workflows, and even safety (e.g., UI changes while driving).
Drug-trial analogy is debated: control groups “miss out” on benefits but are also protected from unknown harms.

Real-world practice & politics

Many organizations use A/B mostly for gradual rollouts, safety checks, and political cover rather than pure optimization.
There is widespread concern about “data-driven” rhetoric masking gut-driven or statistically sloppy decisions.

Related topics