Lines of code that beat A/B testing (2012)
Multi-armed bandits vs. A/B testing
- Multi-armed bandits (MAB) are praised for maximizing reward while the experiment is still running, especially for simple, immediate metrics like clicks.
- Supporters say MAB “beats” classic A/B by shifting traffic toward better variants earlier and by scaling well to many variants.
- Critics argue the blog post overclaims: statistical significance requirements don’t change, and simple, well-run A/B can be equally effective for many real-world needs.
- Several people note MAB is best seen as an optimization tool; A/B is better as a learning tool to estimate true effects.
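To make the "shifting traffic toward better variants" point concrete, here is a minimal sketch of an epsilon-greedy bandit, one simple MAB strategy. It is an illustration under stated assumptions, not the blog post's actual code: the class name `EpsilonGreedy`, the default `epsilon=0.1`, and the optimistic treatment of untried variants are all choices made for this example.

```python
import random


class EpsilonGreedy:
    """Illustrative epsilon-greedy bandit over named variants (hypothetical helper)."""

    def __init__(self, variants, epsilon=0.1, rng=None):
        self.epsilon = epsilon  # assumed exploration rate, not taken from the source
        self.rng = rng or random.Random()
        self.trials = {v: 0 for v in variants}
        self.rewards = {v: 0.0 for v in variants}

    def rate(self, variant):
        # Observed reward rate; untried variants get an optimistic 1.0
        # so that each one is shown at least once.
        n = self.trials[variant]
        return self.rewards[variant] / n if n else 1.0

    def choose(self):
        variants = list(self.trials)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(variants)  # explore: pick any variant at random
        return max(variants, key=self.rate)  # exploit: pick the best observed rate

    def record(self, variant, reward):
        # Called once the outcome (e.g. click = 1.0, no click = 0.0) is known.
        self.trials[variant] += 1
        self.rewards[variant] += reward
```

Calling `choose()` on each request and `record()` once the outcome is known sends most traffic to the variant with the best observed rate. Note that the observed rates are point estimates from unevenly sized samples, which is part of why critics treat this as optimization rather than effect estimation.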
Implementation & infrastructure complexity
- The biggest cost is not the algorithm but state management and the online feedback loop: extra DB columns, performance concerns, and computing outcomes as they arrive.
- Simple client-side randomization + logging is often much easier than wiring online reward tracking for MAB.
- Consistent user assignment (stickiness) complicates both A/B and MAB; hashing, seeding, and feature flags are common tools, with pitfalls around non‑uniformity and ID assumptions.
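The hashing approach to sticky assignment can be sketched as follows. This is a minimal example, assuming a stable string user ID is available; the function name `assign_variant` and the `experiment:user_id` key format are hypothetical. Salting the hash with the experiment name keeps assignments independent across experiments, and using a cryptographic hash avoids the non-uniformity that comes from taking raw IDs modulo the variant count.

```python
import hashlib


def assign_variant(user_id, experiment, variants):
    """Deterministically map a user to a variant (illustrative helper).

    Assumes user_id is stable for the lifetime of the experiment.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an integer bucket; SHA-256 output is
    # close to uniform, so the modulo below splits users nearly evenly.
    bucket = int.from_bytes(digest[:8], "big")
    return variants[bucket % len(variants)]
```

Because the result depends only on the inputs, no assignment table is needed: the same user sees the same variant on every request. This works for a fixed A/B split; a bandit that changes allocation over time would need weighted buckets or stored assignments instead.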
Statistics, significance, and traffic constraints
- Many sites lack enough traffic to reach significance in reasonable time, especially with >2 variants.
- Some argue point estimates can be enough to choose a version when costs are similar, even without formal significance.
- Others stress that if you care about effect size and significance, the article’s approach is insufficient.
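The traffic constraint is easy to see with a standard pooled two-proportion z-test, sketched below using only the standard library. The function name `two_proportion_z_test` and the example numbers in the note that follows are made up for illustration; this is one common test, not necessarily the one any commenter had in mind.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With hypothetical rates of 5% versus 7%, 1,000 visitors per variant (`two_proportion_z_test(50, 1000, 70, 1000)`) does not reach the conventional 0.05 threshold, while 10,000 per variant with the same rates does so comfortably. Each extra variant divides the traffic further, which is why low-traffic sites struggle.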
Dynamic environments & bias risks
- Standard MAB assumes static reward rates; in e‑commerce, conversions change with time of day, sales, device mix, etc.
- Time-varying or delayed rewards can cause MAB to lock onto the wrong variant; forgetting factors and more advanced methods exist but add complexity.
- MAB can amplify biases from bugs, eligibility issues, caching discrepancies, or mis-specified metrics, potentially converging on very bad experiences.
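A forgetting factor, as mentioned above, can be sketched as an exponentially discounted reward estimate for a single arm. This is a simplified illustration: the class name `DiscountedArm` and the default `gamma=0.99` are assumptions, and the discount here is applied only when this arm receives an observation, whereas fuller discounted-bandit methods decay every arm at each step.

```python
class DiscountedArm:
    """Reward estimate for one variant with exponential forgetting (illustrative)."""

    def __init__(self, gamma=0.99):
        self.gamma = gamma  # assumed discount; closer to 1.0 means a longer memory
        self.weighted_reward = 0.0
        self.weighted_trials = 0.0

    def record(self, reward):
        # Shrink the old totals before adding the new observation,
        # so recent outcomes dominate the estimate.
        self.weighted_reward = self.gamma * self.weighted_reward + reward
        self.weighted_trials = self.gamma * self.weighted_trials + 1.0

    def rate(self):
        if self.weighted_trials == 0:
            return 0.0
        return self.weighted_reward / self.weighted_trials
```

If a variant converts well during a sale and then stops converting, a plain running average keeps remembering the good period, while the discounted estimate falls toward the new rate within a few hundred observations. The cost is one more parameter to tune, and more noise in the estimate when `gamma` is small.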
User experience and ethics
- Constantly changing variants can harm UX, support workflows, and even safety (e.g., UI changes while driving).
- The drug-trial analogy is debated: control groups “miss out” on benefits but are also protected from unknown harms.
Real-world practice & politics
- Many organizations use A/B mostly for gradual rollouts, safety checks, and political cover rather than pure optimization.
- There is widespread concern about “data-driven” rhetoric masking gut-driven or statistically sloppy decisions.