
MiniMax released M3 on June 1, 2026, and it's the first open-weight model to genuinely combine three things at once: frontier-level coding performance, a 1M-token context window, and native multimodal input. The interesting part isn't the feature list. It's the architectural trick that makes long-context inference practical at a fraction of what GPT-5.5 costs.
A New Way to Do Attention at Scale
Standard transformer attention scales quadratically with context length, which is why running a full 1M-token window at inference time is usually too expensive to be useful. MiniMax's answer is MSA (MiniMax Sparse Attention), and the mechanics are worth understanding.
Instead of computing attention over every token in the context, MSA uses a two-stage process. A lightweight index branch first scans incoming tokens and selects which blocks of the KV cache are actually relevant to the query. The main attention layer then processes only those selected blocks. MiniMax's reported numbers at 1M-token context, relative to its previous-generation M2: 9.7x faster prefill, 15.6x faster decode, and roughly 1/20th the per-token compute.
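To make the two-stage idea concrete, here is a minimal sketch in plain Python. It is not MiniMax's implementation: the real index branch is presumably a learned component, and I'm standing in for it with a hypothetical rule (score each block by the query dotted with the block's mean key). The names `select_blocks`, `sparse_attention`, `block_size`, and `top_k` are my own, chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(query, keys, block_size, top_k):
    """Stage 1 (stand-in for the index branch): cheaply score each KV block
    and return the start offsets of the top_k highest-scoring blocks."""
    block_scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        # Hypothetical scoring rule: query dotted with the block's mean key.
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append((dot(query, mean_key), start))
    best = sorted(block_scores, reverse=True)[:top_k]
    return sorted(start for _, start in best)

def sparse_attention(query, keys, values, block_size, top_k):
    """Stage 2: exact softmax attention, restricted to the selected blocks.
    Keys and values are used as-is (uncompressed)."""
    starts = select_blocks(query, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in idx])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
    return out, idx

if __name__ == "__main__":
    # 8 tokens in 4 blocks of 2; only the third block aligns with the query.
    keys = [[0.0, 1.0]] * 4 + [[5.0, 0.0]] * 2 + [[0.0, -1.0]] * 2
    values = [[float(i)] for i in range(8)]
    out, idx = sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1)
    print("attended token indices:", idx)
    print("output:", out)
```

The point of the sketch is the cost structure: stage 1 touches one summary per block, and stage 2 runs full attention over only `top_k * block_size` tokens instead of the whole context. Setting `top_k` to the total number of blocks recovers ordinary dense attention.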
One design choice that stands out: MSA operates on uncompressed KV data, not compressed approximations. That preserves long-context retrieval accuracy better than methods that squash context down first. The tradeoff is higher memory use, because the full KV cache has to stay resident. If you're running this on constrained hardware or in a memory-tight deployment, factor that in before committing.
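For a rough sense of what "higher memory" means, the standard back-of-the-envelope formula for an uncompressed KV cache is 2 (keys and values) × tokens × layers × KV heads × head dimension × bytes per value. The sketch below applies it. The model dimensions in the example are placeholders I made up, not M3's actual configuration, which I don't have; plug in the real values once they are published.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit storage (fp16/bf16).
    """
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value

if __name__ == "__main__":
    # HYPOTHETICAL dimensions for illustration only; not M3's real config.
    size = kv_cache_bytes(
        seq_len=1_000_000,
        num_layers=60,
        num_kv_heads=8,
        head_dim=128,
    )
    print(f"KV cache at 1M tokens: {size / 1e9:.1f} GB per sequence")
```

Whatever the real dimensions turn out to be, the cache grows linearly with context length and with the number of concurrent sequences, so the 1M-token case is the one to size hardware against.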
What the Benchmark Numbers Actually Say
M3 scores 59.0% on SWE-Bench Pro, beating both GPT-5.5 and Gemini 3.1 Pro on that benchmark. That raised eyebrows at launch. TechTimes's June 1 headline read "Frontier Claims, Unverified Benchmarks." Seventeen days later, after the community ran independent evaluations, the outlet updated its headline to "Sparse Attention Architecture Now Verified."
That arc matters. The numbers held under scrutiny. But SWE-Bench Pro is a specific benchmark: it tests the ability to resolve real GitHub issues from software repos. It's meaningful if you're building agentic coding pipelines. It tells you little about broad reasoning, instruction following, or creative tasks. Claiming M3 beats GPT-5.5 across the board based on this one number would be wrong.
The Pricing Math
This is the part that's genuinely hard to dismiss. MiniMax M3 lists at $0.60 per million input tokens and $2.40 per million output tokens. GPT-5.5 is $5 per million input and $30 per million output. Claude Opus 4.7 is $5 input and $25 output.
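On output tokens, M3 is roughly 10x cheaper than Claude Opus 4.7 ($25 vs. $2.40) and 12.5x cheaper than GPT-5.5 ($30 vs. $2.40). For workloads with high output volume (agentic loops, multi-step code generation, long-document summarization), that gap compounds fast. VentureBeat called it "5-10% of the cost", and the math holds on output tokens compared to GPT-5.5, where M3 comes to 8% of the price. On input tokens the gap is narrower: $0.60 against $5 is 12%.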
A catch: the launch window included a 50% discount for new accounts in the first week. The prices I'm quoting are the standard rates that apply once that discount ends. The gap is still large, but it's not the extreme end that some early comparisons, run at the discounted prices, suggested.
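The ratios in this section are easy to check from the list prices quoted above. A small sketch, with the table and the `price_ratio` helper being my own naming:

```python
# USD per million tokens as (input, output), list prices quoted in this article.
PRICES = {
    "MiniMax M3": (0.60, 2.40),
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
}

def price_ratio(other, base="MiniMax M3"):
    """Return (input_ratio, output_ratio): how many times more `other`
    costs than `base` per input token and per output token."""
    base_in, base_out = PRICES[base]
    other_in, other_out = PRICES[other]
    return other_in / base_in, other_out / base_out

if __name__ == "__main__":
    for name in ("GPT-5.5", "Claude Opus 4.7"):
        in_ratio, out_ratio = price_ratio(name)
        print(f"{name}: {in_ratio:.1f}x on input, {out_ratio:.1f}x on output")
    m3_share = PRICES["MiniMax M3"][1] / PRICES["GPT-5.5"][1]
    print(f"M3 output price as a share of GPT-5.5: {m3_share:.0%}")
```

Keeping input and output ratios separate matters because the headline multiplier depends on which side of the bill your workload leans on.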
What It Actually Supports
M3 handles image and video input natively, not via a separate vision module. It also includes built-in desktop computer operation. For agentic workflows that need visual context alongside long text, or that need to interact with desktop apps, having both in a single open-weight model is new.
I haven't stress-tested the 1M-token retrieval on adversarial inputs or measured the multimodal quality against dedicated vision models. The claim about retrieval accuracy from uncompressed KV data is theoretically sound, but theory and practice diverge on long-context tasks often enough that I'd want more community benchmarks before trusting it on critical retrieval paths.
Where This Actually Makes Sense
The cases where M3 earns consideration:
- Inference-heavy agentic pipelines with long context and high output volume. With output roughly 10x cheaper, you can run roughly 10x more eval iterations, parallel agents, or retry loops for the same cost when output tokens dominate the bill. The budget math changes meaningfully.
- Teams that need open weights, whether for fine-tuning, air-gapped deployments, or control over their own stack. M3 is currently the strongest open-weight option that doesn't force a tradeoff between context length, multimodal input, and coding capability.
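To see how the budget math plays out for a specific pipeline, you can price a single run from its token counts and divide your budget by it. The sketch below does that for M3 and GPT-5.5 using the list prices from this article. The workload in the example (200k input tokens and 20k output tokens per run, $100 budget) is a made-up illustration, not a measurement.

```python
# USD per million tokens as (input, output), list prices quoted in this article.
PRICES = {
    "MiniMax M3": (0.60, 2.40),
    "GPT-5.5": (5.00, 30.00),
}

def cost_per_run(model, input_tokens, output_tokens):
    """Dollar cost of one run with the given token counts."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

def runs_within_budget(model, budget_usd, input_tokens, output_tokens):
    """Number of whole runs that fit in the budget."""
    return int(budget_usd // cost_per_run(model, input_tokens, output_tokens))

if __name__ == "__main__":
    # HYPOTHETICAL workload: long context in, moderate output per run.
    workload = dict(input_tokens=200_000, output_tokens=20_000)
    for model in PRICES:
        cost = cost_per_run(model, **workload)
        runs = runs_within_budget(model, 100, **workload)
        print(f"{model}: ${cost:.3f} per run, {runs} runs for $100")
```

For this particular mix the multiplier comes out around 9.5x, between the input-token ratio (about 8.3x) and the output-token ratio (12.5x). The more your runs are dominated by long input context, the closer you sit to the lower figure.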
Where I'd slow down: don't route general-purpose reasoning to it based on the coding benchmark alone. Run evals on your actual task distribution first. Watch the memory ceiling if you're chasing the full 1M-context window in self-hosted setups.
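"Run evals on your actual task distribution" can start very small. Here is a minimal sketch of a pass-rate harness, assuming each model can be wrapped as a function from prompt string to response string. The `Task` type, the `pass_rate` function, and the stub model are all hypothetical scaffolding; in practice you'd replace the stub with a wrapper around whatever client you use to call each model.

```python
from typing import Callable, Iterable, NamedTuple

class Task(NamedTuple):
    prompt: str
    check: Callable[[str], bool]  # returns True if the response is acceptable

def pass_rate(model: Callable[[str], str], tasks: Iterable[Task]) -> float:
    """Fraction of tasks whose response passes its own check."""
    results = [task.check(model(task.prompt)) for task in tasks]
    return sum(results) / len(results) if results else 0.0

if __name__ == "__main__":
    # Toy tasks; swap in samples drawn from your real traffic.
    tasks = [
        Task("2+2", lambda out: out.strip() == "4"),
        Task("capital of France", lambda out: "Paris" in out),
    ]

    def stub_model(prompt: str) -> str:
        # Placeholder standing in for a real model call.
        return {"2+2": "4"}.get(prompt, "unknown")

    print(f"stub pass rate: {pass_rate(stub_model, tasks):.0%}")
```

Running the same task list against M3 and your current model gives you a like-for-like number on your own distribution, which is a better basis for a routing decision than a single coding benchmark.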
MiniMax committed to releasing the full model weights and a technical report within 10 days of launch. If the report is as transparent as the architecture overview suggests, it'll be worth reading closely before you route production traffic.
The fact that community evaluation backed up the launch benchmarks is what pushes this from a press release to something that deserves a test run. That clears a higher bar than most model launches manage.