Published: 2026-04-16
Opus 4.7 Benchmarks: A Half-Step Up, and the Mythos Distillation Theory
Nick Saraev runs through Opus 4.7's benchmarks against 4.6, GPT 5.4, Gemini 3.1 Pro, and Mythos preview, and notices something strange: almost every improvement lands roughly halfway between 4.6 and Mythos preview. His hypothesis: Opus 4.7 is Mythos preview distilled down and deployed on faster hardware, rather than a fundamentally new model.
Source video
"Claude Opus-4.7 Just Dropped, And..." by Nick Saraev — Watch on YouTube →
Key Takeaways
- Opus 4.7 is better than 4.6 on essentially every benchmark, but the step up is consistently about half the distance between 4.6 and Mythos preview. On SWE-bench Pro (the main software engineering benchmark): 53.4% (4.6) → 64.3% (4.7) → ~75% (Mythos), a gain of 10.9 percentage points that lands almost exactly at the midpoint.
- The same halfway pattern appears across multiple benchmarks (see the arithmetic sketch after this list). Nick finds this suspicious: genuine, independent model improvements rarely produce such mathematically clean gaps. It suggests intentional calibration, not emergent performance.
- Nick's Mythos distillation theory: Opus 4.7 is probably Mythos preview "basically just distilled, dummified down a little bit and running on a lot faster and better hardware." A smaller, faster version of the same model rather than a new architecture.
- Agentic terminal coding shows a smaller step up: 65.4% (4.6) → 69.4% (4.7) → 82% (Mythos), only about a quarter of the gap rather than half. Nick thinks this is where the safety concerns from Mythos concentrate: Anthropic is reluctant to give full agentic terminal capability to general users, because this is the attack surface that allowed Mythos to compromise Chrome and multiple operating systems.
- Anthropic's position on Mythos: they've described it as being like "giving kids nuclear weapons" — a model capable enough to autonomously compromise security systems. This is why they're not releasing Mythos directly, and why they're releasing a distilled version that's meaningfully safer on the agentic dimensions.
- A GPT/Spud model is expected within days of Opus 4.7; the competitive cycle is tight enough that significant Anthropic launches are reliably followed by OpenAI responses within a week.
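To make the halfway claim concrete, here is a minimal arithmetic sketch (plain Python, no dependencies) using only the scores quoted above; the Mythos preview figures are approximate, as noted under the table below. It computes where each Opus 4.7 score falls on the line between Opus 4.6 and Mythos preview, so a fraction of 0.5 means the midpoint exactly.

```python
# Where does each Opus 4.7 score fall between Opus 4.6 and Mythos preview?
#   fraction = (score_4.7 - score_4.6) / (score_mythos - score_4.6)
# 0.0 = no gain over 4.6, 0.5 = exactly halfway, 1.0 = matches Mythos.
# Mythos preview figures are approximate, read off Nick's scorecard.
benchmarks = {
    "SWE-bench Pro":           (53.4, 64.3, 75.0),
    "Agentic terminal coding": (65.4, 69.4, 82.0),
}

for name, (v46, v47, mythos) in benchmarks.items():
    fraction = (v47 - v46) / (mythos - v46)
    print(f"{name}: +{v47 - v46:.1f} pts, {fraction:.0%} of the 4.6→Mythos gap")

# Prints:
#   SWE-bench Pro: +10.9 pts, 50% of the 4.6→Mythos gap
#   Agentic terminal coding: +4.0 pts, 24% of the 4.6→Mythos gap
```

On these numbers, SWE-bench Pro lands almost exactly at the midpoint, while agentic terminal coding covers only about a quarter of the gap, consistent with Nick's read that the agentic dimension is exactly where the distilled release was held back.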
Benchmark Comparison
| Benchmark | Opus 4.6 | Opus 4.7 | Mythos Preview |
|---|---|---|---|
| SWE-bench Pro | 53.4% | 64.3% | ~75% |
| SWE-bench Verified | — | +10–11 pts over 4.6 | ~2× the 4.6→4.7 gap |
| Agentic terminal coding | 65.4% | 69.4% | 82% |
Mythos preview figures are approximate, sourced from Nick's benchmark comparison scorecard in the video.
Related on OpenClawDatabase
- Was Opus 4.6 Intentionally Degraded? — Nate Herk's analysis of the quality regression that preceded 4.7
- Claude Opus 4.7 as a 24/7 Trading Agent — practical application of the upgraded model
- OpenClaw Configuration — how to pin and upgrade model versions