Published: 2026-04-16
Opus 4.7 Benchmarks: A Half-Step Up, and the Mythos Distillation Theory
Nick Saraev runs through Opus 4.7's benchmarks against 4.6, GPT 5.4, Gemini 3.1 Pro, and Mythos preview, and notices something strange: almost every improvement lands roughly halfway between 4.6 and Mythos preview. His hypothesis: Opus 4.7 is Mythos preview distilled down and deployed on faster hardware, rather than a fundamentally new model.
Source video
"Claude Opus-4.7 Just Dropped, And..." by Nick Saraev — Watch on YouTube →
Key Takeaways
- Opus 4.7 is better than 4.6 on essentially every benchmark, but the step up is consistently about half the distance between 4.6 and Mythos preview. On SWE-bench Pro (the main software engineering benchmark): 53.4% (4.6) → 64.3% (4.7) → ~75% (Mythos), a gain of 10.9 percentage points that lands almost exactly at the midpoint.
- The same halfway pattern appears across multiple benchmarks (see the arithmetic sketch after this list). Nick finds this suspicious: genuine, independent model improvements rarely produce such mathematically clean gaps. It suggests intentional calibration, not emergent performance.
- Nick's Mythos distillation theory: Opus 4.7 is probably Mythos preview "basically just distilled, dummified down a little bit and running on a lot faster and better hardware." A smaller, faster version of the same model rather than a new architecture.
- Agentic terminal coding shows a smaller step up: 65.4% (4.6) → 69.4% (4.7) → 82% (Mythos), only about a quarter of the gap rather than half. Nick thinks this is where the safety concerns from Mythos concentrate: Anthropic is reluctant to give full agentic terminal capability to general users, because this is the attack surface that allowed Mythos to compromise Chrome and multiple operating systems.
- Anthropic's position on Mythos: they've described it as being like "giving kids nuclear weapons" — a model capable enough to autonomously compromise security systems. This is why they're not releasing Mythos directly, and why they're releasing a distilled version that's meaningfully safer on the agentic dimensions.
- A GPT/Spud model is expected within days of Opus 4.7; the competitive cycle is tight enough that significant Anthropic launches are reliably followed by OpenAI responses within a week.
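To make the halfway claim concrete, here is a minimal arithmetic sketch (plain Python, no dependencies) using only the scores quoted above; the Mythos preview figures are approximate, as noted under the table below. It computes where each Opus 4.7 score falls on the line between Opus 4.6 and Mythos preview, so a fraction of 0.5 means the midpoint exactly.

```python
# Where does each Opus 4.7 score fall between Opus 4.6 and Mythos preview?
#   fraction = (score_4.7 - score_4.6) / (score_mythos - score_4.6)
# 0.0 = no gain over 4.6, 0.5 = exactly halfway, 1.0 = matches Mythos.
# Mythos preview figures are approximate, read off Nick's scorecard.
benchmarks = {
    "SWE-bench Pro":           (53.4, 64.3, 75.0),
    "Agentic terminal coding": (65.4, 69.4, 82.0),
}

for name, (v46, v47, mythos) in benchmarks.items():
    fraction = (v47 - v46) / (mythos - v46)
    print(f"{name}: +{v47 - v46:.1f} pts, {fraction:.0%} of the 4.6→Mythos gap")

# Prints:
#   SWE-bench Pro: +10.9 pts, 50% of the 4.6→Mythos gap
#   Agentic terminal coding: +4.0 pts, 24% of the 4.6→Mythos gap
```

On these numbers, SWE-bench Pro lands almost exactly at the midpoint, while agentic terminal coding covers only about a quarter of the gap, consistent with Nick's read that the agentic dimension is exactly where the distilled release was held back.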
Benchmark Comparison
| Benchmark | Opus 4.6 | Opus 4.7 | Mythos Preview |
|---|---|---|---|
| SWE-bench Pro | 53.4% | 64.3% | ~75% |
| SWE-bench Verified | — | +10–11 pts over 4.6 | ~2× the 4.6→4.7 gap |
| Agentic terminal coding | 65.4% | 69.4% | 82% |
Mythos preview figures are approximate, sourced from Nick's benchmark comparison scorecard in the video.
Related on OpenClawDatabase
- Was Opus 4.6 Intentionally Degraded? — Nate Herk's analysis of the quality regression that preceded 4.7
- Claude Opus 4.7 as a 24/7 Trading Agent — practical application of the upgraded model
- OpenClaw Configuration — how to pin and upgrade model versions