Python refactoring — Benchmark Sources & Consensus
Rewriting Python code for clarity, performance, or style while keeping the existing tests passing.
Platforms tracked: Claude Cowork · OpenClaw · Hermes · ChatGPT
Consensus across 7 sources
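The 7 sources disagree on a leader. Claude models lead Aider Python (Sonnet 4.5, 89.4%), SWE-bench Verified (Opus 4.7, 87.6%) and Ramp's production set (Fable 5, 87.5%); GPT models lead Aider polyglot (gpt-5, 88.0%) and DeepSWE (GPT-5.5, 70%±4%). Two sources report agent-framework results and name no platform leader.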
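The split can be reproduced by tallying which model family leads each benchmark. The following is a minimal sketch that uses only the leader figures reported on this page; the `LEADERS` mapping and the `tally_by_family` helper are illustrative names, not part of any published dataset.

```python
from collections import Counter

# Leader per benchmark as reported on this page: (model family, model, score in %).
LEADERS = {
    "Aider Python": ("Claude", "Sonnet 4.5", 89.4),
    "SWE-bench Verified": ("Claude", "Opus 4.7", 87.6),
    "Ramp production set": ("Claude", "Fable 5", 87.5),
    "Aider polyglot": ("GPT", "gpt-5", 88.0),
    "DeepSWE": ("GPT", "GPT-5.5", 70.0),
}


def tally_by_family(leaders):
    """Count how many benchmarks each model family leads."""
    return Counter(family for family, _model, _score in leaders.values())


tally = tally_by_family(LEADERS)
print(dict(tally))  # {'Claude': 3, 'GPT': 2}
```

A tally like this only shows that no single family leads everywhere; the benchmarks differ in task set and harness, so the counts are not a ranking.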
All Sources
We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.
| Source | Date | Finding | Methodology | Quality | Leader (platform) |
|---|---|---|---|---|---|
| Paul Gauthier / Aider community | 2026-03-28 | Claude Sonnet 4.5 leads code-edit correctness at 89.4% | Code edit correctness on 133 Exercism Python exercises; automated pass/fail · 133 exercises | high | Claude Cowork |
| marc0.dev | 2026-04-19 | Claude Opus 4.7 leads SWE-bench Verified at 87.6%; GPT-5.3-Codex second at 85.0%; agent frameworks add 10-20 points over raw model scores | Agent resolves real GitHub issues from verified repos; automated test-pass scoring | high | Claude Cowork |
| Aider Leaderboard | 2026-05-17 | gpt-5 leads Aider polyglot at 88.0% across 6 languages on 225 exercises; gpt-5 medium second at 86.7% | Code edit correctness on 225 Exercism exercises across C++, Go, Java, JavaScript, Python, Rust; automated pass/fail · 225 exercises | high | ChatGPT |
| Hacker News / GitHub | 2026-05-24 | Three-stage pipeline using Claude Sonnet for generation and GPT-5.5 Codex as adversarial filter resolves 426/438 SWE-bench Verified instances (~97%); median solve time 8 minutes | 3-stage agent loop (recon, craft, audit); adversarial filter rejects weak patches; automated test-pass; 500-instance SWE-bench Verified set | high | |
| datacurve.ai | 2026-05-26 | GPT-5.5 leads DeepSWE at 70%±4%; Claude Opus 4.7 second at 54%±5%; stronger models self-generate tests 80%+ of the time | 113 original tasks across 91 repos in 5 languages; mini-swe-agent harness; contamination-free; automated pass/fail on observable behavior · 113 tasks, 91 repos | high | ChatGPT |
| mini-swe-agent.com | 2026-05-28 | A 100-line bash-only agent achieves >74% on SWE-bench Verified with Gemini 3 Pro, competitive with far more complex frameworks; adopted by Meta and NVIDIA | SWE-bench Verified; bash-only subprocess approach; no custom tools; linear message history; sandboxed execution | medium | |
| Ramp Labs | 2026-06-12 | Contamination-free SWE-bench on real production tasks: Claude Fable 5 leads at 87.5% resolved; Opus 4.7 and GPT-5.5 tie at 83.8%. | 80 real production engineering tasks per model; 14 frontier models; automated test-pass · 80 tasks × 14 models | high | Claude Cowork |
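Several figures in the table can be sanity-checked with simple arithmetic. The sketch below recomputes the pipeline's resolve rate from 426/438 and tests one possible reading of the DeepSWE error bars. It assumes the ± values are one binomial standard error over 113 tasks, which the source does not state, and it assumes the 83.8% Ramp tie corresponds to 67 of 80 tasks. The helper names are illustrative.

```python
import math


def pct(resolved, total):
    """Resolve rate as a percentage."""
    return 100.0 * resolved / total


def binomial_se_pct(rate_pct, n):
    """One binomial standard error, in percentage points, for a pass rate over n tasks."""
    p = rate_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


pipeline_rate = pct(426, 438)         # three-stage pipeline row
gpt_se = binomial_se_pct(70.0, 113)   # DeepSWE, GPT-5.5 (assumed reading of the ±)
opus_se = binomial_se_pct(54.0, 113)  # DeepSWE, Claude Opus 4.7 (assumed reading of the ±)
ramp_tie = pct(67, 80)                # assumption: 83.8% of 80 tasks means 67 resolved

print(f"pipeline: {pipeline_rate:.1f}%")
print(f"DeepSWE standard errors: {gpt_se:.1f} and {opus_se:.1f} points")
print(f"Ramp tie: {ramp_tie:.2f}%")
```

Under these assumptions the standard errors round to 4 and 5 points, which is consistent with the reported 70%±4% and 54%±5% but does not confirm how datacurve.ai derived them.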
How we work
OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
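As an illustration of what a structured entry might look like, here is a minimal sketch that serializes one table row to JSON. The schema of /assets/benchmarks.json is not published on this page, so the `BenchmarkEntry` class and its field names are hypothetical and simply mirror the table columns.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkEntry:
    # Hypothetical fields mirroring the table columns; the real schema is not shown here.
    source: str
    date: str         # ISO 8601 date, as in the Date column
    finding: str
    methodology: str
    quality: str      # "high" or "medium" on this page


def to_json(entries):
    """Serialize entries as a JSON array, one object per source row."""
    return json.dumps([asdict(e) for e in entries], indent=2)


entry = BenchmarkEntry(
    source="Ramp Labs",
    date="2026-06-12",
    finding="Claude Fable 5 leads at 87.5% resolved; Opus 4.7 and GPT-5.5 tie at 83.8%.",
    methodology="80 real production engineering tasks per model; 14 frontier models; automated test-pass",
    quality="high",
)
payload = to_json([entry])
print(payload)
```

Keeping each entry flat and close to the published row makes it straightforward to trace a rendered table line back to its source.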