Multi-Armed Bandits Are Not Smarter A/B Tests

Multi-armed bandits are an adaptive testing method that shifts traffic toward your best-performing variant as the test runs, rather than holding a fixed 50/50 split throughout. The idea is to minimize the cost of running a losing variant. The problem is that teams adopt them as an upgrade to A/B testing, and they're not: they're a different tool that trades statistical validity for short-term efficiency.
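To make "shifts traffic toward your best-performing variant" concrete, here is a minimal sketch of one common bandit policy, Beta-Bernoulli Thompson sampling, using only the Python standard library. The function name, the uniform Beta(1, 1) prior, and the conversion rates in the usage note are illustrative assumptions, not a description of any particular testing tool.

```python
import random

def thompson_sampling(true_rates, n_users, seed=0):
    """Simulate Thompson sampling over variants with the given true conversion rates.

    Returns (pulls, successes): how many users each variant received and
    how many of them converted. All inputs here are hypothetical.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    for _ in range(n_users):
        # Draw a plausible conversion rate for each variant from its Beta
        # posterior (uniform Beta(1, 1) prior), then show the highest draw.
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        arm = max(range(k), key=lambda i: draws[i])
        # Simulate whether this user converts on the variant they saw.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    pulls = [successes[i] + failures[i] for i in range(k)]
    return pulls, successes
```

Calling `thompson_sampling([0.05, 0.15], 5000)` should send most of the 5,000 simulated users to the second variant, where a fixed split would have sent roughly 2,500 to each. That concentration is the efficiency gain, and it is also why the trailing variant ends up with so little data behind its estimate.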

If you're using MABs for product features, checkout flows, or anything you'll iterate on, you're probably getting cleaner-looking results that tell you less than you think.

The Core Tradeoff You're Actually Making

A/B tests assign traffic randomly. That randomness is the whole point. It's what lets you make causal claims. When you can say "I randomly assigned users to this variant, and they converted at a higher rate," you're not just observing a correlation. You've approximated a controlled experiment.

MABs give up that guarantee. Assignment is still random at each step, but the odds depend on how each variant has performed so far, and that feedback loop introduces selection bias. A variant gets most of its traffic late in the test precisely because it performed well early, so the amount of data behind each estimate is tied to the luck that variant had at the start. The "losing" variant's estimate rests on a shrinking, early-skewed sample as traffic drains away.

The statistical consequence: your effect size estimates are biased. A 2019 NeurIPS study found that the direction of the bias depends on its source: adaptive sampling on its own tends to pull a variant's sample mean downward, while picking the apparent winner or stopping once a leader emerges pushes the reported estimate upward. Combined, these effects tend to inflate the winner's apparent lift. You observe a 12% improvement during the test. You deploy. The real lift settles at 5%. The MAB was telling you what was winning in the test context, not what the true treatment effect was.

The more exploitation-heavy your algorithm (a Thompson sampling setup that aggressively routes to the current leader, for instance), the worse the bias. This is fundamental: minimizing regret and producing unbiased estimates are mathematically at odds.
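One way to see the estimation problem is an A/A simulation: both variants share the same true conversion rate, so the true lift is zero and any gap between them is noise. The sketch below compares the average observed gap between the leading and trailing variant under a fixed random 50/50 split and under Thompson sampling. Every parameter (10% conversion rate, 400 users, a 10-user warm-up per variant, 1,000 simulated tests) is an assumption chosen for illustration, and the function names are hypothetical.

```python
import random

def run_test(adaptive, true_rate, n_users, rng, warmup=10):
    """Run one two-variant A/A test; return each variant's observed conversion rate."""
    wins = [0, 0]
    pulls = [0, 0]
    for t in range(n_users):
        if t < 2 * warmup:
            arm = t % 2  # alternate at first so both variants have some data
        elif adaptive:
            # Thompson sampling: draw from each Beta posterior, show the higher draw.
            draws = [rng.betavariate(wins[i] + 1, pulls[i] - wins[i] + 1) for i in range(2)]
            arm = 0 if draws[0] >= draws[1] else 1
        else:
            arm = rng.randrange(2)  # fixed 50/50 random assignment
        pulls[arm] += 1
        if rng.random() < true_rate:  # same true rate for both variants
            wins[arm] += 1
    return [wins[i] / pulls[i] for i in range(2)]

def average_observed_gap(adaptive, n_runs=1000, true_rate=0.10, n_users=400, seed=0):
    """Average (leader's rate - trailer's rate) across many simulated A/A tests."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        rates = run_test(adaptive, true_rate, n_users, rng)
        total += max(rates) - min(rates)
    return total / n_runs

if __name__ == "__main__":
    print("fixed split:", round(average_observed_gap(adaptive=False), 4))
    print("bandit:     ", round(average_observed_gap(adaptive=True), 4))
```

Both numbers come out above zero, since picking the leader after the fact flatters it under any allocation. Under these assumptions the bandit's gap should be the larger of the two, because the trailing variant's estimate rests on fewer, unluckier observations. Treat this as a toy illustration of the mechanism, not a measurement of how large the bias is in a real program.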

When the Bias Doesn't Matter

There are situations where you genuinely don't care about effect size estimates. You just want to deploy whatever works fastest.

Email subject lines are the clearest case. You're sending a one-time campaign. The same subscriber won't see the subject twice. You have no use for a causal effect estimate; you want the most opens. A MAB gets you there with less wasted exposure than a fixed 50/50 split. Same logic applies to push notification copy for short-lived promotions.

The conditions that make MABs appropriate: the decision is one-off (no future iteration cycle), you won't generalize the result to other contexts, and the business cost of running a losing variant is real and time-sensitive. If all three hold, a MAB is probably the right call.
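The three conditions can be written down as a small checklist. This is a sketch of the rule of thumb above, not a substitute for judgment; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DecisionContext:
    one_off_decision: bool       # no future iteration cycle builds on this result
    will_generalize: bool        # the result will be reused in other contexts
    losing_variant_costly: bool  # running a loser has a real, time-sensitive cost

def recommend_method(ctx: DecisionContext) -> str:
    """Return "bandit" only when all three conditions hold, otherwise "ab_test"."""
    if ctx.one_off_decision and not ctx.will_generalize and ctx.losing_variant_costly:
        return "bandit"
    return "ab_test"
```

A one-time email subject line, `DecisionContext(True, False, True)`, comes back as `"bandit"`. A checkout flow you plan to iterate on, `DecisionContext(False, True, True)`, comes back as `"ab_test"`.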

Where MABs Quietly Erode Programs

The damage isn't obvious, which is what makes it insidious.

Teams run MABs, see clean "winners" surface quickly, deploy them, and feel like the program is humming. But over time, effect sizes don't hold post-deployment. Variants that showed 12% lifts in testing settle at 3-4% in production. You can't tell whether that gap is a novelty effect, seasonality, or MAB-induced estimation bias, because the MAB never gave you a clean enough experiment to diagnose from.

More quietly: you stop accumulating institutional knowledge. Good experimentation programs build a knowledge base: what types of changes move the needle, what your users respond to, which segments behave differently. All of that requires accurate effect size estimates across many experiments. MABs corrode that foundation one test at a time.

Optimization vs Learning

The mental model I keep coming back to: A/B tests are for learning, MABs are for optimizing.

Learning means you want to understand the causal relationship between a change and user behavior. You're building a model of your product and your users that will inform future decisions. That requires unbiased estimates.

Optimizing means you have a bounded decision and you want to minimize regret during the process of making it. You're not trying to generalize. You're trying to get the best outcome from this specific situation, for this specific audience, right now.

Spotify Engineering articulated this explicitly in early 2026: they run a separate personalization stack (optimization at scale) and a separate experimentation stack (causal inference). The two have different infrastructure requirements because they answer different questions. Most teams don't have Spotify's scale. That doesn't make the distinction less real.

What This Means in Practice

For product features, landing pages, onboarding flows, checkout steps: run a proper A/B test. Accept the two-week runtime. The unbiased effect estimate is worth more than the regret you save during the test.

For email campaigns, SMS copy, push notification variants, short-lived promotional content: MABs are a reasonable choice. The decision is one-time, the stakes are bounded, and you want the best outcome fast.

I haven't run MABs at massive scale (think 10M+ decisions per day), and I suspect there are contexts where the bias becomes manageable relative to the regret cost. But at the enterprise CRO scale I work at, the teams I've seen suffer most are the ones that replaced their A/B testing program with bandits because someone sold them on "smarter testing." The tests ran faster. The learning stopped.