
Why 41% of CRO Teams Switched to Bayesian A/B Testing (and What They Got Wrong First)

According to a recent Kameleoon study on A/B testing stats, 41.2% of CRO programs now use Bayesian statistical frameworks, up from 18.4% in 2022. That's not a small shift. The share has more than doubled since 2022 among organizations that actually run structured experiments at scale.

I've watched this shift happen, and I have opinions about it. Not all the teams moving to Bayesian are doing it right. Some are doing it for the wrong reasons. And a few are so enamored with the methodology that they've created new problems to replace the old ones.

What Actually Changes When You Go Bayesian

The math underneath Bayesian testing is genuinely different from frequentist methods. But for practitioners, the real change isn't the algorithm. It's the probability statement you get at the end.

Frequentist testing gives you a p-value, which most stakeholders misread as "the probability that our variant is better." It isn't. It's the probability of seeing your data (or more extreme data) if there were no true effect. That's a mouthful that loses most rooms.
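To make that definition concrete, here is a minimal sketch of a two-sided p-value for a pooled two-proportion z-test, using only the Python standard library. The function name and the conversion counts are made up for illustration, and your testing platform may use a different test.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both arms share one conversion rate
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # P(|Z| >= |z|): chance of data at least this extreme if there is no true effect
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 500 of 10,000 convert on control, 560 of 10,000 on the variant
p_value = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p-value: {p_value:.3f}")
```

Note what the number is conditioned on: it assumes no difference and asks how surprising the data would be. It says nothing directly about how likely the variant is to be better.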

Bayesian testing gives you something closer to what stakeholders actually want: "Variant B has an 87% probability of outperforming the control." That sentence makes sense to a product manager. It makes sense to a CMO. It doesn't require a footnote.
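For a binary metric, a statement like that typically comes from comparing the posterior distributions of the two conversion rates. The sketch below is one common way to estimate it, assuming flat Beta(1, 1) priors and Monte Carlo sampling. The function name, counts, and seed are illustrative assumptions, not any vendor's implementation.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under flat Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Each arm's posterior is Beta(1 + conversions, 1 + non-conversions)
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws


# Hypothetical counts: 500 of 10,000 convert on control, 560 of 10,000 on the variant
prob = prob_b_beats_a(500, 10_000, 560, 10_000)
print(f"P(variant beats control) is about {prob:.0%}")
```

The output is the sentence stakeholders want, but it is only as trustworthy as the prior and the stopping rule behind it.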

That readability shift matters more than most statisticians want to admit. When stakeholders understand what they're looking at, they make fewer gut-feel override decisions. I've sat in enough readout calls to know that p=0.04 gets a very different reaction than "94% probability it's a winner."

The Stopping-Early Trap

Here's where a lot of teams stub their toe. Bayesian testing doesn't have the same strict sample size requirements as frequentist methods, and that flexibility gets misread as permission to stop whenever you feel like it.

The thinking goes: "We're at 80% probability of being better. Good enough. Let's ship."

That's not how it works. An 80% probability of being better still means a 1-in-5 chance you're making the wrong call. Across a testing program running 40 tests a year, those misses add up fast: ship ten variants at that bar and you should expect about two of them to be wrong calls. Your "wins" will include a meaningful chunk of false positives, and you won't know which ones they are.
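The arithmetic is worth seeing once. This is a back-of-the-envelope sketch that assumes the posterior probabilities are well calibrated, the tests are independent, and every variant ships right at the threshold (a worst case). The function names are hypothetical.

```python
def expected_wrong_calls(n_shipped, prob_better):
    """Expected number of shipped variants that are not actually better."""
    return n_shipped * (1 - prob_better)


def prob_at_least_one_wrong(n_shipped, prob_better):
    """Chance that at least one shipped variant is a wrong call (independent tests)."""
    return 1 - prob_better ** n_shipped


# Compare the 80% shortcut with the stricter thresholds discussed in this post
for threshold in (0.80, 0.95, 0.97):
    wrong = expected_wrong_calls(40, threshold)
    any_wrong = prob_at_least_one_wrong(40, threshold)
    print(f"threshold {threshold:.0%}: ~{wrong:.1f} wrong calls in 40, "
          f"P(at least one) = {any_wrong:.2f}")
```

Under those assumptions, 40 variants shipped at 80% imply about 8 wrong calls, against about 2 at 95%.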

Bayesian testing lets you stop early when you've reached a threshold you set in advance. The operative phrase is "set in advance." If you decide mid-test that 80% is good enough because that's where the variant happens to be sitting, you're not doing Bayesian testing. You're doing peeking with extra steps.
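A quick simulation shows why peeking at a low bar hurts. The sketch below runs A/A tests, where both arms have the same true rate and any declared "winner" is therefore false. It compares stopping the first time the probability crosses 80% at any look with checking only once at the end. The traffic numbers, the number of looks, and the normal approximation to the posterior probability are all assumptions chosen to keep the example small and fast.

```python
import math
import random


def prob_b_beats_a_approx(conv_a, n_a, conv_b, n_b):
    """Normal approximation to P(rate_B > rate_A) under flat Beta(1, 1) priors."""
    def mean_var(conv, n):
        a, b = 1 + conv, 1 + n - conv
        return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

    mean_a, var_a = mean_var(conv_a, n_a)
    mean_b, var_b = mean_var(conv_b, n_b)
    z = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z


def simulate_aa_tests(n_tests=300, looks=10, visitors_per_look=200,
                      true_rate=0.05, threshold=0.80, seed=7):
    """Return (share of tests 'won' at any look, share 'won' at the final look)."""
    rng = random.Random(seed)
    any_look_wins = 0
    final_look_wins = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        crossed = False
        prob = 0.5
        for _look in range(looks):
            n += visitors_per_look
            # Both arms convert at the same true rate, so there is no real winner
            conv_a += sum(rng.random() < true_rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < true_rate for _ in range(visitors_per_look))
            prob = prob_b_beats_a_approx(conv_a, n, conv_b, n)
            crossed = crossed or prob >= threshold
        any_look_wins += crossed
        final_look_wins += prob >= threshold
    return any_look_wins / n_tests, final_look_wins / n_tests


peek_rate, final_rate = simulate_aa_tests()
print(f"false winners when stopping at any look: {peek_rate:.0%}")
print(f"false winners when checking once at the end: {final_rate:.0%}")
```

Every extra look is another chance for noise to cross the line, so the any-look rate can never be lower than the final-look rate on the same data.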

Most mature Bayesian programs I've seen set their decision threshold at 95% probability or higher. Some go to 97% for tests that touch checkout or pricing. In practice, that's not dramatically different from a frequentist 95% confidence level. The interpretation is cleaner; the bar doesn't have to be lower.

The Prior Problem

Bayesian testing requires a prior: some starting belief about what the true effect might look like. Most practitioners use a flat, uninformative prior that says "I have no idea what the lift will be." That's fine and honest for most tests.

The problem comes when stakeholders start treating priors as a dial they can adjust. I've seen teams set optimistic priors because a similar test "worked before" at a different company, in a different context, on a different audience. That's not prior knowledge. That's wishful thinking baked into the statistical model.

The prior should encode what you genuinely know from past tests on your specific product with your specific users. If you have a solid historical baseline from previous experiments, an informative prior can legitimately reduce the sample you need. If you don't, use the flat prior and accept that you need a proper sample anyway.
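For a conversion rate, the usual way to express a prior is a Beta distribution, which updates by simply adding observed conversions and non-conversions. The sketch below compares a flat Beta(1, 1) prior with an informative one on the same data. The Beta(50, 950) prior (roughly "1,000 past visitors at 5%") and the observed counts are invented for illustration.

```python
import math


def beta_posterior(prior_alpha, prior_beta, conversions, visitors):
    """Posterior mean and standard deviation of a conversion rate (Beta-Binomial)."""
    alpha = prior_alpha + conversions
    beta = prior_beta + visitors - conversions
    mean = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, math.sqrt(variance)


# Same hypothetical data (30 conversions from 500 visitors), two different priors
flat_mean, flat_sd = beta_posterior(1, 1, conversions=30, visitors=500)
informed_mean, informed_sd = beta_posterior(50, 950, conversions=30, visitors=500)

print(f"flat prior:        mean {flat_mean:.4f}, sd {flat_sd:.4f}")
print(f"informative prior: mean {informed_mean:.4f}, sd {informed_sd:.4f}")
```

The informative prior tightens the estimate and pulls it toward the historical rate. That is a benefit when the history is really yours, and a bias when it isn't.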

When Bayesian Actually Makes Sense

I'll use it when:

  • The test is measuring something binary (converted or didn't, signed up or didn't) with a moderate-to-large expected effect
  • Stakeholders are going to be in the room demanding probability statements, not p-values
  • The platform supports it natively (Optimizely has a solid Bayesian engine; Adobe Target's Auto-Target is built on it under the hood)
  • You have enough historical test data to build a sensible prior, if you want to use one

I'll push back on Bayesian when:

  • You're testing for small effects (1-2% lift) where the sample requirements are enormous regardless of method
  • The test has regulatory or legal scrutiny attached, where frequentist p-values are the expected standard
  • The team is new to experimentation and already struggling to explain what a p-value is

That last point is real. Switching a new program to Bayesian before the team has any statistical intuition is like teaching someone to drive on a manual transmission because it's "more nuanced." Start where people can build confidence, then upgrade.

A Practical Guide to Making the Switch

If you're considering moving to Bayesian, a few things I'd tell any team before they flip the switch:

  • Define your decision threshold before the test starts. 95% probability minimum; go to 97% for revenue-critical tests like checkout flow or pricing pages.
  • Be honest about your prior. If you don't have solid historical data from your own product, use an uninformative prior. Don't import priors from competitor case studies or industry benchmarks.
  • Don't confuse "more flexible stopping" with "stop whenever you want." The flexibility is there to let you stop when your threshold is met, not to let you stop when you're bored or when stakeholders are impatient.
  • Pick Bayesian for the interpretability story, not because you think it'll give you winners faster. On similar sample sizes, the win rate doesn't magically improve.
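One way to keep yourself honest on the first and third points is to write the threshold down in a form that can't be quietly edited mid-test. This is a minimal sketch, not a real platform API: the `ExperimentPlan` class, the `decide` function, and the symmetric "keep control" rule are assumptions made for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ExperimentPlan:
    """Decision rule fixed before the test starts (frozen, so it cannot be changed)."""
    name: str
    threshold: float = 0.95  # use 0.97 for checkout or pricing tests


def decide(plan, prob_variant_better):
    """Map the current posterior probability to a pre-agreed action."""
    if prob_variant_better >= plan.threshold:
        return "ship variant"
    if prob_variant_better <= 1 - plan.threshold:
        return "keep control"  # assumed symmetric rule for a clear loser
    return "keep running"


plan = ExperimentPlan(name="checkout-button-copy", threshold=0.97)
print(decide(plan, 0.80))  # an impatient stakeholder does not change the rule

try:
    plan.threshold = 0.80  # attempting to lower the bar mid-test
except FrozenInstanceError:
    print("threshold was set in advance and stays put")
```

The code is trivial on purpose. The value is in agreeing on the numbers before anyone has seen a result.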

The teams in that 41% who made the switch and stayed with it mostly did it because communicating results got easier. That's a legitimate reason. The teams who regret it made the same mistake most experimentation programs make: they changed the statistics without changing the culture around how decisions get made.

The statistics don't fix the culture. That part is still on you.