Python refactoring — Benchmark Sources & Consensus
Rewriting Python code for clarity, performance, or style while keeping the existing tests passing.
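As a concrete illustration of the task (a toy example, not drawn from any benchmark suite): a readability refactor that must leave the existing test green.

```python
# Toy illustration of the task: refactor for clarity while the existing test
# keeps passing. Function names and data are hypothetical.

def total_price_before(items):
    # Original, imperative version.
    total = 0
    for item in items:
        if item["in_stock"]:
            total = total + item["price"] * item["qty"]
    return total

def total_price(items):
    # Refactored version: same behavior, expressed as a generator expression.
    return sum(item["price"] * item["qty"] for item in items if item["in_stock"])

def test_total_price_matches_original():
    items = [
        {"price": 2.5, "qty": 4, "in_stock": True},
        {"price": 9.0, "qty": 1, "in_stock": False},
    ]
    assert total_price(items) == total_price_before(items) == 10.0
```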
Platforms tracked: Claude Cowork · Openclaw · Hermes · ChatGPT
Consensus across 2 sources
Across 2 sources, Claude-backed agents lead code-editing benchmarks: SWE-bench Verified shows Claude Opus 4.7 at 87.6%, and the Aider polyglot leaderboard puts Claude Sonnet 4.5 first for code-edit correctness at 89.4%.
All Sources
We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.
| Source | Date | Finding | Methodology | Quality | Winner |
|---|---|---|---|---|---|
| Paul Gauthier / Aider community | 2026-03-28 | Claude Sonnet 4.5 leads code-edit correctness at 89.4% | Code-edit correctness on 133 Exercism Python exercises; automated pass/fail | high | cowork |
| marc0.dev | 2026-04-19 | Claude Opus 4.7 leads SWE-bench Verified at 87.6%; GPT-5.3-Codex second at 85.0%; agent frameworks add 10-20 points over raw model scores | Agent resolves real GitHub issues from verified repos; automated test-pass scoring | high | cowork |
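The "automated pass/fail" methodology in both rows boils down to running each candidate solution's test suite and counting green runs. A minimal sketch of such a harness follows; this is an assumed structure, and the real Aider and SWE-bench runners are more elaborate.

```python
# Sketch of automated pass/fail scoring over a directory of exercises.
# Assumed layout: one subdirectory per exercise, each with its own tests.
import subprocess
from pathlib import Path

def score_exercises(root: str) -> float:
    """Run each exercise's test suite and return the fraction that passes."""
    exercise_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    passed = 0
    for exercise in exercise_dirs:
        # Exit code 0 means every test in the exercise passed.
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", str(exercise)],
            capture_output=True,
        )
        if result.returncode == 0:
            passed += 1
    return passed / len(exercise_dirs) if exercise_dirs else 0.0

# Example: score_exercises("exercises/") -> 0.894 would be reported as 89.4%.
```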
How we work
OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
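The entries written to /assets/benchmarks.json are one record per finding; the field names below are an assumed schema that mirrors the table columns above, not the file's documented format.

```python
# Assumed shape of one record in /assets/benchmarks.json; the real schema may differ.
import json
from pathlib import Path

entry = {
    "task": "python-refactoring",
    "source": "Paul Gauthier / Aider community",
    "date": "2026-03-28",
    "finding": "Claude Sonnet 4.5 leads code-edit correctness at 89.4%",
    "methodology": "133 Exercism Python exercises; automated pass/fail",
    "quality": "high",
    "winner": "cowork",
    "url": "https://example.com/original-publication",  # placeholder link
}

path = Path("assets/benchmarks.json")
path.parent.mkdir(parents=True, exist_ok=True)
records = json.loads(path.read_text()) if path.exists() else []
records.append(entry)
path.write_text(json.dumps(records, indent=2))
```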