The Winner's Curse in A/B Testing: Why Your Biggest Lifts Are Probably Exaggerated

I've audited a lot of experimentation programs. The most common red flag isn't a low win rate. It's a suspiciously high one.

If your team is consistently reporting 40%, 50%, or 60%+ win rates with lifts above 20% on your primary metric, something is probably wrong. Not "wrong" in the sense of fraud, but wrong in the statistical sense: you're almost certainly looking at the winner's curse.

What the Winner's Curse Actually Is

The winner's curse is not about bad luck. It's a mathematical consequence of running underpowered tests and then counting only the ones that reach significance.

Here's the mechanism: when a test is underpowered (say, 30% or 40% statistical power instead of the standard 80%), the test usually fails to detect a real effect. Most runs come back null. But occasionally, by chance, the noise in your data pushes the result over the significance threshold. When that happens, the observed lift is almost always an exaggeration of the true effect. The only way a small, underpowered test crosses p = 0.05 is if the random noise happened to line up in the treatment's favor.
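To make the mechanism concrete, here is a minimal simulation sketch. It assumes the observed lift is normally distributed around the true lift, measures everything in standard-error units, and counts a run as a "win" only when it is significant in the treatment's favor. The function name and parameters are illustrative, not taken from any testing platform.

```python
import random
from statistics import NormalDist, mean


def simulate_winners_curse(true_effect_se, n_tests=100_000, alpha=0.05, seed=0):
    """Simulate many identical tests whose true effect is `true_effect_se` standard errors.

    Returns (share of runs that come back as significant wins,
             mean observed effect among those wins divided by the true effect).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    winners = []
    for _ in range(n_tests):
        # Observed effect = true effect + sampling noise (assumed normal, SE = 1).
        observed = rng.gauss(true_effect_se, 1.0)
        if observed > z_crit:
            winners.append(observed)
    win_share = len(winners) / n_tests
    exaggeration = mean(winners) / true_effect_se
    return win_share, exaggeration


# A true effect of 1.44 standard errors corresponds to roughly 30% power.
share, ratio = simulate_winners_curse(1.44)
print(f"significant wins: {share:.0%}, observed / true lift among wins: {ratio:.2f}")
```

Under these assumptions, about 30% of runs reach significance, and the lift those runs report averages close to 1.8 times the true lift. Nothing was done wrong in any single run; the inflation comes purely from looking only at the runs that cleared the threshold.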

So you end up selecting for exaggerated wins. You ship a "35% lift on revenue per visitor." You track what happens post-launch. A month later, revenue per visitor is up maybe 8%. The team is confused. Stakeholder trust bleeds out.

The numbers are specific. At 80% power (the industry standard), statistically significant results overstate the true effect by about 13% on average. That's bad enough. At 50% power, it's 40%. At 20% power, which is where a lot of programs quietly live, it's roughly 125%: the lift you report is, on average, more than double the true one. Most of your "detected" lift is noise that happened to clear a threshold.
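The exaggeration at a given power level can be approximated in closed form. The sketch below assumes a two-sided z-test at alpha = 0.05 with a normally distributed estimate, and it ignores the small chance of a significant result in the wrong direction. `exaggeration_ratio` is a hypothetical helper name, not a function from any library.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def exaggeration_ratio(power, alpha=0.05):
    """Expected (observed / true) effect among results significant in the right direction."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # True effect, in standard-error units, that yields the requested power.
    delta = z_crit + STD_NORMAL.inv_cdf(power)
    # Mean of a unit-variance normal centered at delta, truncated below at z_crit.
    conditional_mean = delta + STD_NORMAL.pdf(z_crit - delta) / power
    return conditional_mean / delta


for power in (0.80, 0.50, 0.20):
    print(f"power {power:.0%}: observed / true = {exaggeration_ratio(power):.2f}")
```

Under this approximation the ratios for 80%, 50%, and 20% power come out at roughly 1.12, 1.41, and 2.25. Other definitions of the exaggeration ratio (for example, ones that use the absolute value of the estimate) shift these numbers slightly.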

The High Win Rate Is the Tell

A well-run large-scale experimentation program (think Amazon or Booking.com) has a true win rate somewhere in the 10-30% range. Ronny Kohavi, who ran experimentation at Microsoft and Amazon, talked about this publicly and it surprised a lot of people. A true win rate at the low end of that range is not a sign of failure. It's a sign of a mature program picking hard problems.

If you're reporting 50%+ wins, you're almost certainly doing one or more of the following:

Running tests too short and calling them when they first cross p = 0.05 (the peeking problem)
Testing extremely easy changes (fixes for obviously broken UX) that inflate the "win" count without teaching you anything strategically useful
Measuring secondary metrics that are noisy and easy to move, not your actual business north star
Systematically ignoring failed tests and only surfacing the ones that "worked"

Any one of these will inflate your apparent win rate. All four together, and I've seen programs that do all four, give you a dashboard full of green numbers that are completely disconnected from actual business outcomes.

What Underpowered Looks Like in Practice

The minimum detectable effect (MDE) calculation is where this usually goes wrong.

In Adobe Target, Optimizely, or any other platform, the MDE is the smallest true lift you'd reliably detect given your sample size, alpha level, and target power. Most teams either skip this step or set it to an unrealistically large value just to make the required sample size look manageable.

Say your landing page converts at 4%, and it would take three weeks to reach 10,000 visitors per variant, which is comfortably enough for 80% power to detect a 1 percentage point lift. That feels long. So the team runs it for 10 days, gets 4,000 visitors per variant, and if they see p < 0.05 they call it a win. At that sample size the test has 80% power only for a lift of about 1.3 percentage points, not 1. If the true effect is 0.8 points, the test will miss it close to 60% of the time. And when it does "detect" something, the observed lift has to be around 0.9 points or more just to clear the threshold, so the number you report overstates the true effect on average.
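The arithmetic behind an example like this can be checked with a few lines of code. This is a sketch using the standard normal approximation for a two-sided test on two conversion rates; the function names are illustrative, and a platform's own calculator may use a slightly different formula.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_proportions(p_control, p_treatment, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in conversion rates."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    se = sqrt(variance / n_per_variant)
    # Ignores the negligible chance of significance in the wrong direction.
    return STD_NORMAL.cdf(abs(p_treatment - p_control) / se - z_crit)


def mde_absolute(p_control, n_per_variant, power=0.80, alpha=0.05):
    """Smallest absolute lift detectable at the given power, found by bisection."""
    low, high = 0.0, 1.0 - p_control
    for _ in range(60):
        mid = (low + high) / 2
        if power_two_proportions(p_control, p_control + mid, n_per_variant, alpha) < power:
            low = mid   # lift too small to reach the target power
        else:
            high = mid  # lift large enough; try a smaller one
    return high


baseline = 0.04
print(f"MDE at 4,000 per variant: {mde_absolute(baseline, 4_000):.4f}")
print(f"power for a 0.8-point lift at 4,000: {power_two_proportions(baseline, 0.048, 4_000):.2f}")
print(f"power for a 1-point lift at 10,000: {power_two_proportions(baseline, 0.05, 10_000):.2f}")
```

Running the same two functions against your own baseline and traffic before launch shows how much power a shortened run gives up.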

I've seen teams at large retail clients running tests with MDEs above 15% on a metric that already sits at a 3-4% baseline. The test can only detect effects that are almost impossible to actually achieve. Any "win" from that setup is an artifact, not an insight.

Three Ways to Fix It

First, calculate power before you launch. Use an MDE calculator (Evan Miller's sample size calculator is free and takes 30 seconds) to confirm your expected traffic and duration can actually detect the effect size you care about. If they can't, either extend the test, limit the scope to a higher-traffic page, or explicitly acknowledge the power gap and treat any result as directional only.
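If you'd rather script the pre-launch check, a rough version looks like the sketch below. It uses the textbook normal-approximation formula for comparing two proportions, so its output can differ somewhat from online calculators that use other formulas. The function names and the traffic figure are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(baseline, absolute_lift, power=0.80, alpha=0.05):
    """Visitors needed per variant to detect `absolute_lift` over `baseline`."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    p1, p2 = baseline, baseline + absolute_lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / absolute_lift ** 2)


def days_needed(n_per_variant, daily_visitors, n_variants=2):
    """Days of traffic needed if `daily_visitors` are split evenly across variants."""
    return ceil(n_per_variant * n_variants / daily_visitors)


# Hypothetical inputs: 4% baseline, 0.5-point target lift, 2,000 visitors per day.
n = required_sample_size(baseline=0.04, absolute_lift=0.005)
print(f"{n} visitors per variant, about {days_needed(n, daily_visitors=2_000)} days")
```

If the duration that comes back is longer than the business will tolerate, that is the moment to change the scope or relabel the test as directional, not after the results are in.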

Second, stop peeking and calling early. This single change fixes a lot of inflated win rates. Either commit to a fixed sample size and don't call the test until you hit it, or move to a sequential testing methodology. Both Amplitude Experiment and GrowthBook support sequential testing natively, and it controls the false positive rate even when you check results frequently.
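To see why peeking matters, the sketch below simulates A/A tests (no true effect at all) where the team checks a fixed p < 0.05 threshold after every batch of data and stops at the first significant result. It models the cumulative z-statistic directly and is only an illustration of the problem; it does not reproduce the sequential methods any particular platform implements.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_looks, n_experiments=20_000, alpha=0.05, seed=0):
    """Share of A/A tests declared significant at any of `n_looks` equally spaced checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_experiments):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each batch contributes an independent standard normal increment,
            # because there is no true difference between the variants.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / sqrt(look)
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team stops and calls the test
    return false_positives / n_experiments


for looks in (1, 5, 10, 20):
    print(f"{looks:>2} looks: false positive rate {false_positive_rate(looks):.1%}")
```

With a single look the rate stays near the nominal 5%. With ten looks it climbs to somewhere around 19% in this setup, which is how a program ends up with "wins" that were never there.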

Third, separate learning tests from decision tests. Not every test needs 80% power. If you're exploring a new hypothesis and the cost of shipping is low, a lower-powered exploratory run is fine. But then you don't call it a "win." You call it a "directional signal, needs confirmation." The language matters for how stakeholders consume the data downstream.

The Culture Part Nobody Wants to Address

The real reason most programs end up underpowered isn't ignorance of statistics. It's organizational pressure. Someone needs a win for the quarterly deck. A test that runs three weeks and comes back null feels like failure.

Building a program that can tolerate a 70-80% null rate requires explicit leadership alignment. You have to re-frame null results as answers. "We ruled out this hypothesis" is genuinely valuable. It's just harder to put on a slide than "+12% RPV."

I haven't solved this at every client. Some orgs are too short-cycle to change. But the ones that do make the shift end up with insights that compound over time, not wins that evaporate a month after launch.