Evals Are Just A/B Testing for Your Agents

If you've spent serious time in experimentation, you already understand LLM evals better than most AI engineers. You just haven't been told yet.

I've been running A/B tests for enterprise teams for years. Last year I started building agents in earnest. And somewhere around the third time an "improved" prompt made things quietly worse in production, I had the realization: the eval problem and the experimentation problem are structurally identical. Teams are reinventing controlled comparison, doing it badly, because nobody told them they'd been here before.

The Match Nobody Is Pointing Out

Here is what an eval actually is.

You have a baseline: your current prompt, model, or agent configuration. You make a change. You want to know whether that change made things better or worse. You need a consistent way to measure "better."

That's an A/B test. Exactly that.

The golden dataset is your holdout test set. The eval judge, human or LLM, is your metric. Offline evals running before a deploy are pre-production checks, the same idea as checking for sample ratio mismatch before flipping a feature flag. Online evals sampling live traffic are post-launch monitoring, the same pattern as watching your primary metric after a winning variant ships to 100%.

The reason eval pipelines fail is identical to why A/B testing programs fail: no rigor around the measurement layer.

Where Eval Pipelines Break Down

The most common mistake I see is evaluating against cherry-picked examples. Teams write 12 test cases covering scenarios the system already handles well, tweak a prompt, all 12 pass, and call it a win. That's like running an A/B test only on the segment where you already expect the treatment to win. You're not measuring anything real.

The fix is the same as in experimentation: hold out a representative sample. Braintrust builds their entire approach around this: define a golden dataset from real production inputs, cover edge cases systematically, and score against consistent criteria. As a starting point, 20 to 50 genuinely representative examples beat 500 hand-crafted ones, though detecting small differences reliably takes more.

The second mistake: no baseline. Teams run one eval, get a score of 74%, and have no idea if that's good or terrible because they've never measured the previous version. In A/B testing we call this "not having a control." You need a reference run on your current production configuration every single time you evaluate a change. Everything is relative to something.
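To make the control-versus-treatment idea concrete, here is a minimal sketch of a paired comparison. It assumes both configurations were scored pass/fail on the same golden examples, in the same order, and it uses an exact sign test on the examples where the two runs disagree (McNemar's exact test). The function name and the returned fields are illustrative, not taken from any eval tool.

```python
from math import comb


def paired_comparison(baseline, candidate):
    """Compare two runs scored pass/fail (True/False) on the same examples."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same non-empty example set")

    n = len(baseline)
    # Examples the candidate fixed, and examples the candidate broke.
    wins = sum(1 for b, c in zip(baseline, candidate) if c and not b)
    losses = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    discordant = wins + losses

    # Exact two-sided sign test on the discordant pairs only.
    if discordant == 0:
        p_value = 1.0
    else:
        k = min(wins, losses)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)

    return {
        "baseline_rate": sum(baseline) / n,
        "candidate_rate": sum(candidate) / n,
        "wins": wins,
        "losses": losses,
        "p_value": p_value,
    }
```

A jump from 60% to 80% on ten examples looks impressive, but with three wins and one loss the test returns a p-value of 0.625. That is the eval equivalent of calling a winner on a tiny sample.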

The LLM-as-Judge Problem

A lot of eval setups now use an LLM to score outputs. GPT-4o rates response quality on a 1-5 scale. Convenient, and it scales better than human review.

But there's a reliability problem that mirrors something I've seen in experimentation: when your measurement tool has variance, your results carry more noise than you think. LLM judges show position bias (preferring whichever response appears first in the prompt), verbosity bias (longer outputs score higher regardless of actual quality), and meaningful inter-run variability. Using the same model family to both generate and judge outputs adds self-preference bias: the judge tends to favor outputs that look like its own.
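One common way to blunt position bias in pairwise judging is to ask the judge twice with the order swapped and only count a preference that survives both orderings. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns the string "first" or "second"; that interface is an assumption for illustration, not any provider's API.

```python
def debiased_preference(judge, prompt, response_a, response_b):
    """Return "a", "b", or "tie" after judging both presentation orders.

    `judge` is a hypothetical callable: judge(prompt, first, second)
    returns "first" or "second".
    """
    # Pass 1: A is shown first.
    pick_1 = "a" if judge(prompt, response_a, response_b) == "first" else "b"
    # Pass 2: B is shown first.
    pick_2 = "b" if judge(prompt, response_b, response_a) == "first" else "a"

    # Only trust a preference that does not depend on position.
    return pick_1 if pick_1 == pick_2 else "tie"
```

A judge that always prefers whatever it sees first comes out as a tie on every pair, which is exactly the signal you want. This does nothing for verbosity bias, which needs its own check.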

The mitigation is the same as in experimentation: calibrate your judge against human ground truth on a small sample. If your LLM judge agrees with human raters only 80% of the time, roughly one score in five is suspect, and some of that agreement would have happened by chance anyway, so a gap of a few points between variants can easily be judge noise. Factor that into your conclusions. Don't make high-stakes shipping decisions off a single scoring pass. Average across multiple runs and be skeptical of narrow score differences.
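A rough sketch of that calibration step: take a small sample that both a human and the judge have labeled, then compute raw agreement alongside Cohen's kappa, which discounts the agreement you would expect by chance. The function name and label values are illustrative assumptions.

```python
from collections import Counter


def judge_agreement(human_labels, judge_labels):
    """Return (raw agreement, Cohen's kappa) for two label sequences."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need the same non-empty set of labeled examples")

    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labeled at random according to its own label frequencies.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        # Degenerate case: both raters always use one identical label,
        # so kappa is formally undefined; report perfect agreement.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)
```

With eight "pass" and two "fail" labels on each side and two disagreements, raw agreement is 0.80 but kappa is only 0.375. When most outputs pass, a judge can agree often without being very informative.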

Offline vs. Online Evals: A Better Frame

Most write-ups treat offline and online evals as two sequential phases. I'd frame them differently.

Offline evals are your pre-experiment validation. They tell you whether a change is safe to ship. They run on a fixed dataset, they're reproducible, and they're fast. Fail a deploy if a critical metric drops below threshold here, the same way you'd block a feature flag flip if your bucketing ratio looks wrong.
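A deploy gate along these lines can be quite small. The sketch below assumes you already have per-metric scores for the baseline and the candidate as plain dictionaries; the metric names, the floors, and the 2-point regression tolerance are placeholders to tune, not recommendations.

```python
def gate(candidate_scores, baseline_scores, floors, max_regression=0.02):
    """Return a list of failure messages; an empty list means safe to ship."""
    failures = []
    for metric, floor in floors.items():
        value = candidate_scores[metric]

        # Absolute floor: the metric must never drop below this level.
        if value < floor:
            failures.append(f"{metric}: {value:.3f} is below floor {floor:.3f}")

        # Relative check: the candidate must not regress against the control.
        drop = baseline_scores[metric] - value
        if drop > max_regression:
            failures.append(f"{metric}: regressed {drop:.3f} vs. baseline")
    return failures
```

In a CI job you would print the failures and exit non-zero when the list is not empty, so the pipeline blocks the deploy rather than just logging a warning.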

Online evals are your in-experiment monitoring. They sample live traffic, catch distribution shift, and feed low-scoring examples back into your golden dataset. This is how you catch cases your offline set didn't cover. It resembles sequential testing with early stopping criteria, except the goal isn't to stop early; it's to catch regressions before they become a crisis.
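The sampling-and-feedback loop might look something like this. It is a sketch under assumptions: `traffic` is an iterable of dictionaries, `score_fn` is whatever judge you use and returns a number between 0 and 1, and the 5% sample rate and 0.5 cutoff are arbitrary starting points.

```python
import random


def sample_for_review(traffic, score_fn, sample_rate=0.05, low_score=0.5, rng=None):
    """Score a random slice of live traffic and return the low scorers."""
    rng = rng or random.Random()
    flagged = []
    for record in traffic:
        # Skip most requests; only a sampled fraction gets scored.
        if rng.random() >= sample_rate:
            continue
        score = score_fn(record)
        if score < low_score:
            # Keep the score with the record so a reviewer can triage it
            # before it is promoted into the golden dataset.
            flagged.append({**record, "online_score": score})
    return flagged
```

Flagged records are candidates, not automatic additions. Having a person confirm the expected output before a record joins the golden set keeps judge errors from leaking into your ground truth.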

If you're only running offline evals, you're over-confident in a static test set. If you're only running online evals, you're flying blind before each ship. You need both, for the same reason you run pre-launch QA and post-launch monitoring in any serious experimentation program.

What Experimentation Rigor Actually Looks Like

Here's how I set this up in practice:

Lock a baseline first. Run your current production configuration against your golden dataset and record the score. That's your control. Every subsequent eval is a treatment variant.
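Size your golden set properly. For a binary metric (correct vs. incorrect), detecting a 10-point difference with 80% power at the usual 5% significance level takes roughly 150 to 200 examples per configuration when baseline accuracy is already high (around 80 to 85%), and closer to 400 when it sits near 50%. Not many, but far more than 10.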
Run the full set before concluding. Don't check results at 20 examples and ship.
Log everything: input, output, score, model version, prompt hash, timestamp. You want to reproduce any eval run six months from now, the same way you'd audit a test's bucketing logic after the fact.
Separate your judge from your generator. Use a different provider or a fine-tuned judge model, not the same system you're evaluating.
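The sizing item in the list above can be sanity-checked with the standard normal-approximation formula for comparing two proportions. This sketch assumes an unpaired comparison and a two-sided test; scoring both configurations on the same examples and analyzing them as pairs usually needs fewer. The function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def examples_per_config(p_baseline, p_candidate, alpha=0.05, power=0.80):
    """Approximate examples needed per configuration to detect the gap."""
    if p_baseline == p_candidate:
        raise ValueError("the two rates must differ")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power

    # Sum of the two binomial variances, one per configuration.
    variance = p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    delta = p_candidate - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)
```

Calling `examples_per_config(0.80, 0.90)` gives a little under 200, while `examples_per_config(0.50, 0.60)` gives nearly 400. The required size depends heavily on where your baseline sits, so plug in your own numbers rather than trusting a rule of thumb.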

The tools that handle this well right now are Braintrust (strong on dataset management and experiment tracking), Promptfoo (CI-native, good for prompt regression), and LangSmith if you're deep in the LangChain ecosystem. No single tool covers everything; in my experience, production teams often end up combining two of them.

Where the Analogy Breaks

Not everything transfers. In A/B testing, you define your primary metric before the test runs. In evals, you often discover what you care about as you build. That's fine, but it means early eval results aren't comparable to later ones if you've changed your scoring rubric. Version your rubrics the same way you version your prompts. They're part of the experiment configuration.
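One way to treat the rubric as part of the experiment configuration is to hash it into every logged result, next to the prompt hash. The sketch below uses only the standard library; the field names and the 12-character fingerprint length are illustrative choices, not a schema from any tool.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(text):
    """Short, stable identifier for a prompt or rubric."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def eval_record(example_id, input_text, output_text, score,
                model_version, prompt, rubric):
    """Build one loggable result that carries its full configuration."""
    return {
        "example_id": example_id,
        "input": input_text,
        "output": output_text,
        "score": score,
        "model_version": model_version,
        "prompt_hash": fingerprint(prompt),
        "rubric_hash": fingerprint(rubric),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def comparable(record_a, record_b):
    """Scores are only comparable when the same rubric produced them."""
    return record_a["rubric_hash"] == record_b["rubric_hash"]
```

Checking `comparable` before you chart two runs on the same axis is a cheap guard against comparing scores that were measured with different yardsticks.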

Also, agents have non-determinism built in. The same input won't always produce the same output. That means you need more eval runs per configuration than you'd intuitively expect, especially for multi-step tasks. Treat it like measuring a noisy metric: more samples, averaged scores, skepticism toward single-run results.
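A small sketch of what "more samples, averaged scores" can mean in practice: run each configuration several times over the golden set, then compare the mean scores against their combined standard error. The two-standard-error cutoff is a rough heuristic assumed here for illustration, not a formal hypothesis test.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Return (mean, standard error) over repeated runs of one config."""
    average = mean(run_scores)
    if len(run_scores) < 2:
        return average, None  # one run says nothing about run-to-run noise
    return average, stdev(run_scores) / sqrt(len(run_scores))


def clearly_different(runs_a, runs_b, z=2.0):
    """True when the gap between means exceeds z combined standard errors."""
    mean_a, se_a = summarize_runs(runs_a)
    mean_b, se_b = summarize_runs(runs_b)
    if se_a is None or se_b is None:
        return False  # refuse to conclude anything from a single run
    return abs(mean_a - mean_b) > z * sqrt(se_a ** 2 + se_b ** 2)
```

Refusing to return True from a single run is deliberate. It encodes the skepticism toward single-run results directly in the tooling instead of leaving it to discipline.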

The instinct that "we evaluated it and it looked fine" is exactly as dangerous as "we ran a 200-session test and it trended positive." Volume and measurement rigor matter. You already knew that. Apply it here.