Gemini 3.5 Flash and the End of 'Use the Biggest Model' for Agents

I've been defaulting to Opus-tier or GPT-5.5 for anything agent-related because that felt like the safe call. Better reasoning, better tool use, better outcomes. Flash-tier models were for batch jobs, summaries, things where you didn't care that much about output quality.

That calculus broke for me after spending time with the Gemini 3.5 Flash benchmarks. The model went GA on May 19 at Google I/O. The number that got my attention: 83.6% on MCP Atlas, a benchmark specifically for multi-step tool orchestration using Model Context Protocol servers. That puts it 8.3 points ahead of GPT-5.5 (75.3%) and 4.5 points ahead of Claude Opus 4.7 on the same eval. "Flash" doesn't mean what it used to.

What MCP Atlas Is Actually Measuring

MCP Atlas tests whether a model can chain together multiple tool calls across MCP servers, recover from partial failures, and complete multi-step tasks without going off-script. It's not a writing or reasoning benchmark. If you're building anything with n8n and MCP, or any orchestration layer where the model is selecting and sequencing tools, this benchmark maps more directly to your real workload than SWE-Bench or MMLU.

The fact that a Flash-tier model leads MCP Atlas outright changes how I think about model selection for agent loops. Speed compounds in agentic systems because a typical run isn't one LLM call, it's dozens. Gemini 3.5 Flash outputs at 156.9 tokens per second on the Gemini API. Across a long task, shorter generation time on every step adds up to a real latency win at the system level, and when your agent calls tools in sequence, the delay between steps matters more than most people account for.
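To put rough numbers on the compounding, here's a back-of-the-envelope sketch. The 156.9 tokens per second figure is the one quoted above. The step count, the tokens generated per step, and the slower comparison rate are assumptions I made up for illustration, not measurements of any model. It also ignores time-to-first-token, network overhead, and tool execution time.

```python
def generation_seconds(steps: int, output_tokens_per_step: int, tokens_per_second: float) -> float:
    """Total time spent generating output tokens across a sequential agent loop."""
    return steps * output_tokens_per_step / tokens_per_second


# Assumed workload: 50 sequential steps, 300 output tokens each (illustrative only).
STEPS = 50
TOKENS_PER_STEP = 300

flash_rate = 156.9          # tokens/sec, the figure quoted in this post
hypothetical_slow_rate = 50.0  # tokens/sec, a made-up comparison point

flash_total = generation_seconds(STEPS, TOKENS_PER_STEP, flash_rate)
slow_total = generation_seconds(STEPS, TOKENS_PER_STEP, hypothetical_slow_rate)

print(f"at {flash_rate} tok/s: {flash_total:.1f} s of generation")              # about 95.6 s
print(f"at {hypothetical_slow_rate} tok/s: {slow_total:.1f} s of generation")  # 300.0 s
```

Because the steps run in sequence, the gap is paid in wall-clock time on every run, not amortized across parallel calls.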

The Part Where It Still Trails

SWE-Bench Pro is where Gemini 3.5 Flash falls short. Claude Opus 4.7 scores 64.3%. Flash comes in at 55.1%. That 9.2-point gap is meaningful if your agent's job is producing code a senior engineer will review and merge. For repo-level changes, refactoring across multiple files, or anything where the output needs to be correct on the first pass, Opus-tier still earns the higher cost.

The story isn't "Flash replaced flagship." These are now genuinely different tools for different jobs, and routing matters.

The Cost Argument Is Now Concrete

Gemini 3.5 Flash prices at $1.50 per million input tokens and $9.00 per million output tokens. Cached input drops to $0.15 per million. If you're running an orchestration agent that makes 50 tool calls per session, and most of those are "choose next step, call tool, parse result," input costs stack up fast. Running that loop on an Opus-tier model at 4-5x the price, for worse numbers on the benchmark that directly tests what you're doing, is hard to justify.

The caching math gets lopsided quickly. Agents with a long system prompt and persistent context will hit the cache constantly. At $0.15 per million cached tokens, high-volume workloads end up much cheaper in practice than the headline pricing suggests.
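Here's a minimal sketch of that caching math, using the three prices quoted above. The session shape (50 calls, an 8,000-token shared prefix, 1,000 fresh input tokens and 200 output tokens per call) is an assumption for illustration. It also simplifies by treating the prefix as cached on every call, including the first, and ignores any cache storage fees.

```python
# Prices quoted in this post, in USD per million tokens.
INPUT_PER_M = 1.50
CACHED_INPUT_PER_M = 0.15
OUTPUT_PER_M = 9.00


def session_cost(calls: int, prefix_tokens: int, fresh_tokens: int,
                 output_tokens: int, cache_prefix: bool) -> float:
    """Estimated USD cost of one agent session.

    prefix_tokens: system prompt plus persistent context resent on every call.
    fresh_tokens:  new input per call (tool results, latest step).
    cache_prefix:  if True, bill the prefix at the cached-input rate.
    """
    prefix_rate = CACHED_INPUT_PER_M if cache_prefix else INPUT_PER_M
    per_call = (
        prefix_tokens * prefix_rate
        + fresh_tokens * INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000
    return calls * per_call


# Assumed session shape (illustrative, not measured).
uncached = session_cost(50, 8_000, 1_000, 200, cache_prefix=False)
cached = session_cost(50, 8_000, 1_000, 200, cache_prefix=True)

print(f"uncached: ${uncached:.3f} per session")  # $0.765
print(f"cached:   ${cached:.3f} per session")    # $0.225
```

Under these assumptions the cached session costs less than a third of the uncached one, and the ratio improves as the shared prefix grows relative to the fresh input.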

I haven't run this at production scale on a stateful long-horizon agent yet, so I can't give you a real-world failure mode distribution. On the benchmark evidence though, you'd need a specific reason not to use Flash for the orchestration layer.

A Routing Architecture That Makes Sense Now

What I'm sketching out: Flash handles orchestration, tool selection, and intermediate reasoning steps. A heavier model like Claude Opus 4.7 or GPT-5.5 (which still leads Terminal-Bench 2.1 for terminal-native agentic coding) handles final synthesis steps that produce code or content going directly to a reviewer or user.
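As a rough sketch of that split, the routing decision can be a single function keyed on the kind of step. Everything named here (`StepKind`, `pick_model`, and the two model labels) is hypothetical: the labels are placeholders, not real API model IDs, and a real router would likely also look at task type and failure history.

```python
from enum import Enum


class StepKind(Enum):
    ORCHESTRATION = "orchestration"
    TOOL_SELECTION = "tool_selection"
    INTERMEDIATE_REASONING = "intermediate_reasoning"
    FINAL_SYNTHESIS = "final_synthesis"


# Placeholder labels; swap in whatever model IDs your provider actually exposes.
FAST_MODEL = "flash-tier"
HEAVY_MODEL = "flagship-tier"


def pick_model(kind: StepKind) -> str:
    """Send only reviewer- or user-facing synthesis to the heavier model."""
    if kind is StepKind.FINAL_SYNTHESIS:
        return HEAVY_MODEL
    return FAST_MODEL


# A toy agent run: many cheap steps, one expensive step at the end.
plan = [
    StepKind.ORCHESTRATION,
    StepKind.TOOL_SELECTION,
    StepKind.INTERMEDIATE_REASONING,
    StepKind.TOOL_SELECTION,
    StepKind.FINAL_SYNTHESIS,
]
for step in plan:
    print(f"{step.value:24s} -> {pick_model(step)}")
```

Keeping the rule this small is deliberate: the fast model stays the default, and the expensive call has to be opted into.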

Model routing and cascades have been a concept for a while. The benchmark gap is now wide enough that the case is concrete, not just theoretical cost-cutting.

One Caveat on Context Window

Flash ships with a 1M-token input context. That's large. But the output limit is 65,536 tokens. For agents that need to produce long structured outputs in a single step, that ceiling matters. Plan around it. Gemini 3.5 Pro is expected to follow with 2M context, but there's no confirmed release date. Don't build around it until it ships.
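One way to plan around the output ceiling is to estimate the size of a step's output before you dispatch it, and split the step if it won't fit. This is a minimal sketch: the 65,536 limit is the figure above, while `chunks_needed` and the token estimates are hypothetical, and estimating output size up front is your agent's problem to solve.

```python
import math

OUTPUT_TOKEN_LIMIT = 65_536  # output ceiling quoted in this post


def chunks_needed(estimated_output_tokens: int, limit: int = OUTPUT_TOKEN_LIMIT) -> int:
    """Number of separate generation steps needed to stay under the output limit."""
    if estimated_output_tokens <= 0:
        return 0
    return math.ceil(estimated_output_tokens / limit)


# Illustrative estimates only.
for estimate in (10_000, 65_536, 65_537, 200_000):
    n = chunks_needed(estimate)
    status = "single step" if n <= 1 else f"split into {n} steps"
    print(f"{estimate:>7} tokens: {status}")
```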