Last updated: 2026-04-19

Python refactoring — Benchmark Sources & Consensus

Rewriting Python code for clarity, performance, or style while keeping the existing tests passing.
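To make the task concrete, here is a minimal, hypothetical illustration of what these benchmarks measure: a function refactored for clarity whose behavior, and therefore its existing tests, must stay unchanged. Both implementations and the sample cart are invented for this example.

```python
def total_price_before(items):
    # Original style: manual index loop
    total = 0
    for i in range(len(items)):
        total = total + items[i]["price"] * items[i]["qty"]
    return total


def total_price_after(items):
    # Refactored: idiomatic generator expression, same behavior
    return sum(item["price"] * item["qty"] for item in items)


# The "existing tests passing" criterion: both versions must agree.
cart = [{"price": 2.5, "qty": 4}, {"price": 1.0, "qty": 3}]
assert total_price_before(cart) == total_price_after(cart) == 13.0
```

A benchmark like Aider's scores the model on whether the refactored version still passes the exercise's automated tests.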

Platforms tracked: Claude Cowork · Openclaw · Hermes · ChatGPT

Consensus across 2 sources

Across 2 sources, Claude-backed agents lead code-editing benchmarks: SWE-bench Verified shows Claude Opus 4.7 at 87.6%, and Aider's polyglot leaderboard puts Claude Sonnet 4.5 near the top for code-edit correctness.

All Sources

We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.

| Source | Date | Finding | Methodology | Quality | Winner |
| --- | --- | --- | --- | --- | --- |
| Paul Gauthier / Aider community | 2026-03-28 | Claude Sonnet 4.5 leads code-edit correctness at 89.4% | Code edit correctness on 133 Exercism Python exercises; automated pass/fail | high | cowork |
| marc0.dev | 2026-04-19 | Claude Opus 4.7 leads SWE-bench Verified at 87.6%; GPT-5.3-Codex second at 85.0%; agent frameworks add 10-20 points over raw model scores | Agent resolves real GitHub issues from verified repos; automated test-pass scoring | high | cowork |

How we work

OpenClawDatabase aggregates and links to published benchmarks. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json.
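The write step described above could be sketched as follows. This is a hypothetical illustration, not the actual aggregator code: the `append_entry` helper and the field names (which mirror this page's table columns) are assumptions, and the real schema of /assets/benchmarks.json may differ.

```python
import json
from pathlib import Path


def append_entry(path, entry):
    """Append one structured benchmark entry to a JSON array on disk."""
    p = Path(path)
    entries = json.loads(p.read_text()) if p.exists() else []
    entries.append(entry)
    p.write_text(json.dumps(entries, indent=2))


# Example entry, taken from the first row of the table above.
append_entry("benchmarks.json", {
    "source": "Paul Gauthier / Aider community",
    "date": "2026-03-28",
    "finding": "Claude Sonnet 4.5 leads code-edit correctness at 89.4%",
    "methodology": "133 Exercism Python exercises; automated pass/fail",
    "quality": "high",
    "winner": "cowork",
})
```

Keeping the file as a flat JSON array keeps the weekly routine idempotent to reason about: each scan simply reads, appends, and rewrites.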

