Concurrent A/B Tests: How to Know When Interaction Effects Actually Matter

If you've run experimentation at any scale, you've hit this scenario. You've got three tests live simultaneously: one on the hero headline, one on the checkout CTA, one on the product page layout. The checkout CTA test shows a 12% lift. You ship it. The lift evaporates. Post-ship numbers look nothing like the test.

Your first instinct is novelty effect. But the real culprit might be that the checkout CTA test was running at the same time as the product page layout test, and users who saw both variants behaved differently than those who saw just one.

That's an interaction effect. It's one of the least understood problems in applied experimentation, and it's where a lot of phantom wins actually come from.

What an interaction effect actually is

In statistics, an interaction happens when the effect of one variable changes depending on the level of another. In A/B testing, it means the combined effect of two experiments running on overlapping user populations isn't simply the sum of their individual effects.

A concrete example. You're testing a red CTA button vs. blue on the product page. At the same time, you're testing a simplified checkout flow vs. the current one. These tests overlap: most users who see the product page also go through checkout. If the red button drives higher conversions in the original checkout but not in the simplified one, you've got an interaction. Your CTA test analysis pools all users, so the lift it reports is a blend of the two checkout experiences. It understates the effect for users on the original checkout and overstates it for users on the simplified one, and which number you actually get after shipping depends on which checkout variant wins.
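To make the pooling problem concrete, here is a minimal sketch with made-up numbers. The cell counts are hypothetical and chosen so the red button helps in the original checkout and does nothing in the simplified one.

```python
# Hypothetical counts per cell: (visitors, conversions).
# Keys are (CTA color, checkout variant).
cells = {
    ("blue", "original"):   (10_000, 500),
    ("red",  "original"):   (10_000, 600),
    ("blue", "simplified"): (10_000, 550),
    ("red",  "simplified"): (10_000, 550),
}

def rate(keys):
    """Pooled conversion rate over the given cells."""
    visitors = sum(cells[k][0] for k in keys)
    conversions = sum(cells[k][1] for k in keys)
    return conversions / visitors

def relative_lift(control_keys, treatment_keys):
    """Relative lift of the treatment cells over the control cells."""
    control = rate(control_keys)
    return (rate(treatment_keys) - control) / control

# What the CTA test reports when it ignores the checkout test.
pooled = relative_lift(
    [("blue", "original"), ("blue", "simplified")],
    [("red", "original"), ("red", "simplified")],
)
# What the CTA change is worth under each checkout variant.
original_only = relative_lift([("blue", "original")], [("red", "original")])
simplified_only = relative_lift([("blue", "simplified")], [("red", "simplified")])

print(f"pooled: {pooled:.1%}")
print(f"original checkout only: {original_only:.1%}")
print(f"simplified checkout only: {simplified_only:.1%}")
```

With these invented counts the pooled lift lands between the two conditional lifts, which is exactly the number that stops being true once one checkout variant ships to everyone.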

I've seen this wipe out reported lifts on price-sensitive product pages where two copy tests were pulling the message frame in opposite directions at the same time.

How often do they actually matter?

Honestly, less often than people assume, and more often than anyone bothers to check.

The encouraging part: most user behavior is fairly additive. Changing the headline on one page and the button on a different page usually affects different decision moments. Studies from large tech experimentation teams have consistently found that statistically significant interaction effects are relatively rare when tests touch genuinely separate UI areas or different funnel steps.

But the cases where interactions bite you cluster around a few patterns. Tests on the same page that compete for the same visual attention (two tests simultaneously adding banners, for instance). Tests in the same funnel step where one changes the pre-condition for the other. Tests on pricing or trust signals where the combined message creates a contradiction: a discount banner running alongside a "limited stock" scarcity test. And the sneaky one: a test that changes traffic allocation upstream, which then alters who enters a downstream experiment, producing something that looks a lot like sample ratio mismatch.
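For the upstream-allocation case, one cheap check is a standard sample ratio mismatch test on the downstream experiment's arms. The sketch below is a plain chi-square goodness-of-fit test for a two-arm split using only the standard library. The function name and the example counts are illustrative, not taken from any platform.

```python
import math

def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical downstream arm counts for a test configured as 50/50.
p = srm_p_value(50_000, 51_500)
print(f"SRM p-value: {p:.2e}")
```

A very small p-value says the observed split is unlikely under the configured allocation. It doesn't tell you the upstream test is the cause, but running the same check separately for users from each upstream variant is a reasonable next step.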

The mutual exclusion reflex and why it costs you

When teams discover interaction effects are possible, the instinct is to mutex everything. Put all tests into mutual exclusion groups so no user ever sees more than one. Clean, no contamination.

The cost is real and often underappreciated. Mutual exclusion means you're splitting traffic across tests rather than layering them. If you have three concurrent tests with mutual exclusion and a site getting 100,000 visitors a week, each test sees roughly 33,000 visitors a week. Without exclusion and with orthogonal assignment, each test sees close to 100,000. Either your runtime triples, or at the same runtime your minimum detectable effect grows by a factor of about √3, roughly 1.7x.
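The arithmetic behind that tradeoff can be sketched with the usual normal-approximation formula for a two-arm conversion test. The 5% baseline rate is an assumption for illustration, and the z values correspond to a two-sided alpha of 0.05 and 80% power.

```python
import math

def relative_mde(baseline_rate, n_per_arm, z_alpha=1.96, z_beta=0.84):
    """Approximate relative minimum detectable effect for a two-arm test.

    Normal approximation, equal arm sizes, variance taken at the baseline rate.
    """
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_beta) * se / baseline_rate

weekly_visitors = 100_000
baseline = 0.05  # hypothetical baseline conversion rate

# Layered: each test gets all traffic, split 50/50 between its arms.
layered = relative_mde(baseline, weekly_visitors / 2)
# Mutually exclusive: each of three tests gets a third, then splits 50/50.
exclusive = relative_mde(baseline, weekly_visitors / 3 / 2)

print(f"layered MDE after one week:   {layered:.1%}")
print(f"exclusive MDE after one week: {exclusive:.1%}")
print(f"ratio: {exclusive / layered:.2f}")
```

Because the MDE scales with one over the square root of sample size, the ratio is √3 regardless of the baseline you plug in.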

For most teams, that cost is too high. You're already fighting for traffic. Forcing mutual exclusion everywhere means serializing experiments that probably wouldn't have interacted anyway. You're solving a rare problem by creating a guaranteed one.

The namespace model: a smarter way to isolate

The better framework, which platforms such as Optimizely and Statsig support in some form, is layers and namespaces. The idea: group experiments by the domain they affect, not by whether they might conceivably interact.

A pricing test and a navigation test go into different layers. Tests within the same layer are mutually exclusive. Tests in different layers can overlap freely. Users get at most one treatment per layer but can be in multiple layers simultaneously.
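Here is a minimal sketch of how layer-based assignment can work, using salted hashing. This is not how Optimizely or Statsig implement it. The layer names, experiment names, and bucket ranges are all hypothetical.

```python
import hashlib

def bucket(user_id: str, salt: str, buckets: int = 1000) -> int:
    """Deterministically map a user to a bucket in [0, buckets) for a given salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

# Hypothetical layers. Within a layer, each experiment owns a disjoint
# bucket range, which is what makes same-layer tests mutually exclusive.
LAYERS = {
    "pricing": {
        "annual_discount": range(0, 500),
        "price_anchor": range(500, 1000),
    },
    "navigation": {
        "sticky_nav": range(0, 1000),
    },
}

def assign(user_id: str) -> dict:
    """Return {layer: (experiment, variant)}, at most one experiment per layer."""
    assignments = {}
    for layer, experiments in LAYERS.items():
        # Salting by layer name makes bucketing independent across layers.
        layer_bucket = bucket(user_id, layer)
        for name, owned_buckets in experiments.items():
            if layer_bucket in owned_buckets:
                # A second hash, salted by experiment name, picks the variant.
                variant = "treatment" if bucket(user_id, name, 2) else "control"
                assignments[layer] = (name, variant)
                break
    return assignments

print(assign("user-42"))
```

The design choice that matters is the per-layer salt: a user's position in the pricing layer says nothing about their position in the navigation layer, so cross-layer assignment stays orthogonal.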

This lets you scale concurrent experimentation without destroying statistical power. The practical translation for a team without a custom platform: define experiment "zones" for your site (product page, checkout, navigation, homepage above-the-fold, email). Enforce the rule that any two tests targeting the same zone either run sequentially or split traffic so no user sees both. Tests in different zones can overlap.
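If you're tracking tests in a spreadsheet or a planning doc, the zone rule is easy to check mechanically. The sketch below flags pairs of planned tests that share a zone and overlap in time. The `Experiment` fields, test names, and dates are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations

@dataclass(frozen=True)
class Experiment:
    name: str
    zone: str
    start: date
    end: date  # inclusive

def zone_conflicts(experiments):
    """Return name pairs for experiments that share a zone and overlap in time."""
    conflicts = []
    for a, b in combinations(experiments, 2):
        same_zone = a.zone == b.zone
        overlaps = a.start <= b.end and b.start <= a.end
        if same_zone and overlaps:
            conflicts.append((a.name, b.name))
    return conflicts

# Hypothetical test plan.
planned = [
    Experiment("hero_headline", "homepage", date(2024, 3, 1), date(2024, 3, 14)),
    Experiment("checkout_cta", "checkout", date(2024, 3, 1), date(2024, 3, 21)),
    Experiment("checkout_trust_badge", "checkout", date(2024, 3, 15), date(2024, 3, 28)),
]

print(zone_conflicts(planned))
# [('checkout_cta', 'checkout_trust_badge')]
```

A flagged pair isn't automatically a problem. It's a prompt to either reschedule one test or split traffic between the two.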

What I actually watch for in practice

I don't audit every test pair for potential interactions. That's not realistic. I watch for specific red flags.

Two tests with overlapping conversion metrics and overlapping audiences. If both tests use "purchase completed" as the primary metric and both target all site visitors, I check whether they're affecting the same funnel step.

Shipped wins that vanish within two weeks. If you consistently see lifts that disappear after shipping, add "what was running simultaneously?" to your post-ship retrospective. It's a simple check that catches more than you'd expect.

Tests that change information architecture or navigation. These change the base experience for everything else running at the same time. Run them alone, or put them in their own layer and mutex aggressively within it.

The formal way to detect a specific suspected interaction is a 2x2 factorial design: four cells covering all combinations of the two experiments (control+control, control+treatment, treatment+control, treatment+treatment). You estimate the interaction term, meaning how much the effect of one test changes depending on the variant of the other, and check whether it's distinguishable from zero. The traffic cost is steep, because the interaction estimate is noisier than either main effect at the same sample size. I'd only reach for it if I had a strong prior that two specific tests would interact, typically when both touch the same checkout step or the same pricing element. Optimizely's guidance on concurrent test design covers the tradeoffs here clearly.
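For a conversion metric, one simple way to estimate the interaction from the four cells is a difference-in-differences with a z statistic. This is a sketch under a normal approximation, not a replacement for a proper regression with an interaction term, and the cell counts are hypothetical.

```python
import math

def interaction_z(cells):
    """Difference-in-differences on conversion rates, with a z statistic.

    cells maps (variant_a, variant_b) -> (visitors, conversions),
    where each variant is "control" or "treatment".
    """
    rates, variances = {}, {}
    for key, (n, conversions) in cells.items():
        p = conversions / n
        rates[key] = p
        variances[key] = p * (1 - p) / n

    # Effect of test A when test B is in control vs. in treatment.
    a_effect_b_control = rates[("treatment", "control")] - rates[("control", "control")]
    a_effect_b_treatment = rates[("treatment", "treatment")] - rates[("control", "treatment")]

    interaction = a_effect_b_treatment - a_effect_b_control
    # All four cell variances add, which is why this estimate is noisy.
    se = math.sqrt(sum(variances.values()))
    return interaction, interaction / se

# Hypothetical counts: test A helps only when test B is in control.
cells = {
    ("control", "control"):     (25_000, 1_250),
    ("treatment", "control"):   (25_000, 1_500),
    ("control", "treatment"):   (25_000, 1_375),
    ("treatment", "treatment"): (25_000, 1_375),
}

interaction, z = interaction_z(cells)
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
print(f"interaction: {interaction:+.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

Note that the standard error sums the variance of all four cells, so the interaction is harder to detect than either main effect on the same traffic.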

The actual risk calculus

Interaction effects are real but overdiagnosed as a blanket risk. The right response isn't to serialize all your tests or mutex everything by default. It's to think clearly about which tests share a funnel stage, compete for the same visual attention, or fundamentally change the base conditions for other tests.

Use layers to isolate the real interaction risk. Let tests that don't share a funnel stage run concurrently. Watch for the post-ship evaporation pattern. Treat the 2x2 factorial as a diagnostic you reach for in specific high-stakes situations, not a default operating procedure.

The goal is running more experiments faster. Mutual exclusion everywhere achieves the opposite.