Last updated: 2026-06-17

Python refactoring — Benchmark Sources & Consensus

Rewriting Python code for clarity, performance, or style while keeping the existing tests passing.
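
As a minimal illustration of this kind of task (the function names and test cases below are hypothetical and are not taken from any benchmark listed on this page), the sketch rewrites a loop for clarity and then checks that the behavior is unchanged:

```python
def total_even_squares_before(values):
    # Original style: manual accumulation in an explicit loop.
    result = 0
    for v in values:
        if v % 2 == 0:
            result = result + v * v
    return result


def total_even_squares_after(values):
    # Refactored for clarity: same behavior, written as a generator expression.
    return sum(v * v for v in values if v % 2 == 0)


def check_same_behavior():
    # Stand-in for the "existing tests": both versions must agree on every case.
    cases = [[], [1, 3, 5], [2, 4], [-2, 3, 6], [0]]
    return all(
        total_even_squares_before(c) == total_even_squares_after(c) for c in cases
    )


print(check_same_behavior())
```

The benchmarks below apply the same principle at a larger scale: a change only counts if the automated tests still pass.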

Platforms tracked: Claude Cowork · OpenClaw · Hermes · ChatGPT

Consensus across 7 sources

The 7 sources disagree: Claude models lead Aider Python (Sonnet 4.5, 89.4%), SWE-bench Verified (Opus 4.7, 87.6%), and Ramp's production set (Fable 5, 87.5%), while gpt-5 leads Aider polyglot (88.0%) and GPT-5.5 leads DeepSWE (70%).
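
The split can be made explicit by counting the winner tags attached to the rows under All Sources. The sketch below is one way to do that count, with the tags copied by hand from this page; it is a tally of labels, not a ranking, and the `ROWS` structure is an assumption made for illustration.

```python
from collections import Counter

# (source, winner tag) pairs copied from the rows on this page.
# Rows that carry no winner tag use None.
ROWS = [
    ("Paul Gauthier / Aider community", "cowork"),
    ("marc0.dev", "cowork"),
    ("Aider Leaderboard", "chatgpt"),
    ("Hacker News / GitHub", None),
    ("datacurve.ai", "chatgpt"),
    ("mini-swe-agent.com", None),
    ("Ramp Labs", "cowork"),
]


def tally_winner_tags(rows):
    """Count sources per winner tag; untagged rows are grouped as 'untagged'."""
    return Counter(tag if tag is not None else "untagged" for _, tag in rows)


tally = tally_winner_tags(ROWS)
print(dict(tally))
```

Three rows are tagged cowork, two are tagged chatgpt, and two carry no tag, so no single platform is favored by every source.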

All Sources

We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.

| Source | Date | Finding | Methodology | Quality |
| --- | --- | --- | --- | --- |
| Paul Gauthier / Aider community | 2026-03-28 | Claude Sonnet 4.5 leads code-edit correctness at 89.4% | Code edit correctness on 133 Exercism Python exercises; automated pass/fail · 133 exercises | high · winner: cowork |
| marc0.dev | 2026-04-19 | Claude Opus 4.7 leads SWE-bench Verified at 87.6%; GPT-5.3-Codex second at 85.0%; agent frameworks add 10-20 points over raw model scores | Agent resolves real GitHub issues from verified repos; automated test-pass scoring | high · winner: cowork |
| Aider Leaderboard | 2026-05-17 | gpt-5 leads Aider polyglot at 88.0% across 6 languages on 225 exercises; gpt-5 medium second at 86.7% | Code edit correctness on 225 Exercism exercises across C++, Go, Java, JavaScript, Python, Rust; automated pass/fail · 225 exercises | high · winner: chatgpt |
| Hacker News / GitHub | 2026-05-24 | Three-stage pipeline using Claude Sonnet for generation and GPT-5.5 Codex as adversarial filter resolves 426/438 SWE-bench Verified instances (~97%); median solve time 8 minutes | 3-stage agent loop (recon, craft, audit); adversarial filter rejects weak patches; automated test-pass; 500-instance SWE-bench Verified set | high |
| datacurve.ai | 2026-05-26 | GPT-5.5 leads DeepSWE at 70%±4%; Claude Opus 4.7 second at 54%±5%; stronger models self-generate tests 80%+ of the time | 113 original tasks across 91 repos in 5 languages; mini-swe-agent harness; contamination-free; automated pass/fail on observable behavior · 113 tasks, 91 repos | high · winner: chatgpt |
| mini-swe-agent.com | 2026-05-28 | A 100-line bash-only agent achieves >74% on SWE-bench Verified with Gemini 3 Pro, competitive with far more complex frameworks; adopted by Meta and NVIDIA | SWE-bench Verified; bash-only subprocess approach; no custom tools; linear message history; sandboxed execution | medium |
| Ramp Labs | 2026-06-12 | Contamination-free SWE-bench on real production tasks: Claude Fable 5 leads at 87.5% resolved; Opus 4.7 and GPT-5.5 tie at 83.8% | 80 real production engineering tasks per model; 14 frontier models; automated test-pass · 80 tasks × 14 models | high · winner: cowork |
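
Most of the figures above are simple resolved-over-attempted ratios. The sketch below reproduces two of them from the numbers in the rows: the 426/438 pass rate, using the denominator as reported, and a rough uncertainty for 70% on 113 tasks. The binomial standard error is an assumption made here for illustration; the sources do not state how their ± intervals were computed.

```python
import math


def pass_rate(resolved, attempted):
    """Fraction of attempted instances that were resolved."""
    return resolved / attempted


def binomial_standard_error(rate, n):
    """Standard error of a proportion under a simple binomial model (an assumption)."""
    return math.sqrt(rate * (1.0 - rate) / n)


# Hacker News / GitHub row: 426 of 438 instances resolved.
pipeline_rate = pass_rate(426, 438)
print(f"{pipeline_rate:.1%}")  # 97.3%

# datacurve.ai row: 70% on 113 tasks.
deepswe_se = binomial_standard_error(0.70, 113)
print(f"{deepswe_se:.3f}")  # 0.043
```

Under that assumption, one standard error for the DeepSWE leader is about 4 points, which is close in size to the reported ±4%. With only about a hundred tasks, gaps of a few points between models should be read with caution.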

How we work

OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
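
The layout of /assets/benchmarks.json is not documented on this page, so the following is only a sketch of what one structured entry and a basic sanity check might look like. Every field name, the `validate` helper, and the placeholder URL are hypothetical; the field values are taken from the Ramp Labs row above.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class BenchmarkEntry:
    # Hypothetical field names; the real file may be organized differently.
    source: str
    date: str         # ISO 8601 date, e.g. "2026-06-12"
    finding: str
    methodology: str
    quality: str      # only "high" and "medium" appear on this page
    url: str          # link back to the original publication


def validate(entry):
    """Return a list of problems; an empty list means the entry looks usable."""
    problems = []
    try:
        date.fromisoformat(entry.date)
    except ValueError:
        problems.append("date is not in YYYY-MM-DD form")
    if entry.quality not in {"high", "medium"}:
        problems.append("unexpected quality label")
    if not entry.url.startswith(("http://", "https://")):
        problems.append("missing link to the original publication")
    return problems


entry = BenchmarkEntry(
    source="Ramp Labs",
    date="2026-06-12",
    finding="Claude Fable 5 leads at 87.5% resolved",
    methodology="80 real production engineering tasks per model; automated test-pass",
    quality="high",
    url="https://example.com/placeholder",  # placeholder, not the real link
)

print(json.dumps(asdict(entry), indent=2))
print(validate(entry))
```

Requiring a link on every entry mirrors the rule stated above that each row points back to its original publication.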
