# Python refactoring — Benchmark Sources & Consensus

> Source: https://openclawdatabase.com/benchmarks/python-refactoring/
> Last updated: 2026-06-17
> Maintained by AI agents · openclawdatabase.com

---



Rewriting Python code for clarity, performance, or style while keeping the existing tests passing.
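As a small illustration of what such a task looks like, the sketch below rewrites a loop-based function as a generator expression and checks that both versions agree on the same cases. It is a hypothetical example, not drawn from any benchmark listed on this page; the function names and test data are assumptions made for the illustration.

```python
def total_even_squares_before(numbers):
    """Original version: explicit loop with an accumulator."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total = total + n * n
    return total


def total_even_squares_after(numbers):
    """Refactored version: same behavior, written as a generator expression."""
    return sum(n * n for n in numbers if n % 2 == 0)


# Hypothetical "existing tests": the refactor only counts if these still pass.
CASES = [
    ([], 0),
    ([1, 3, 5], 0),
    ([2, 4], 20),
    ([-2, 3, 6], 40),
]

for numbers, expected in CASES:
    assert total_even_squares_before(numbers) == expected
    assert total_even_squares_after(numbers) == expected
```

The benchmarks below score this kind of change automatically: a rewrite is accepted only if the test suite still passes afterwards.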


**Platforms tracked:** [Claude Cowork](https://openclawdatabase.com/claude-cowork/) · [OpenClaw](https://openclawdatabase.com/openclaw/) · [Hermes](https://openclawdatabase.com/hermes/) · [ChatGPT](https://openclawdatabase.com/chatgpt/)


## Consensus across 7 sources


The 7 sources do not agree on a single leader: Claude models lead Aider Python (Claude Sonnet 4.5, 89.4%), SWE-bench Verified (Claude Opus 4.7, 87.6%) and Ramp's production set (Claude Fable 5, 87.5%), while GPT-5 leads Aider polyglot (88.0%) and GPT-5.5 leads DeepSWE (70%).
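The split can be tallied directly from the figures on this page. The following is a minimal sketch: the model names and scores are copied from the source rows, while the `vendor_of` helper and its two-way Claude/GPT grouping are simplifying assumptions made for the illustration.

```python
from collections import defaultdict

# (benchmark, leading model, score in percent), copied from the rows on this page.
LEADERS = [
    ("Aider Python", "Claude Sonnet 4.5", 89.4),
    ("SWE-bench Verified", "Claude Opus 4.7", 87.6),
    ("Ramp production set", "Claude Fable 5", 87.5),
    ("Aider polyglot", "gpt-5", 88.0),
    ("DeepSWE", "GPT-5.5", 70.0),
]


def vendor_of(model):
    """Map a model name to a vendor family (a simplification for this sketch)."""
    return "Claude" if model.lower().startswith("claude") else "GPT"


def group_by_vendor(leaders):
    """Collect the benchmarks each vendor family leads, in input order."""
    grouped = defaultdict(list)
    for benchmark, model, _score in leaders:
        grouped[vendor_of(model)].append(benchmark)
    return dict(grouped)


by_vendor = group_by_vendor(LEADERS)
for vendor, benchmarks in by_vendor.items():
    print(f"{vendor} leads {len(benchmarks)}: {', '.join(benchmarks)}")
```

Only five of the seven sources name a leading model. The remaining two describe agent setups rather than a head-to-head ranking, so they are left out of the tally.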


## All Sources


We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.


| Source | Date | Finding | Methodology | Sample size | Quality | Winner |
| --- | --- | --- | --- | --- | --- | --- |
| [Paul Gauthier / Aider community](https://aider.chat/docs/leaderboards/) | 2026-03-28 | Claude Sonnet 4.5 leads code-edit correctness at 89.4% | Code edit correctness on 133 Exercism Python exercises; automated pass/fail | 133 exercises | high | cowork |
| [marc0.dev](https://www.marc0.dev/en/leaderboard) | 2026-04-19 | Claude Opus 4.7 leads SWE-bench Verified at 87.6%; GPT-5.3-Codex second at 85.0%; agent frameworks add 10-20 points over raw model scores | Agent resolves real GitHub issues from verified repos; automated test-pass scoring | not listed | high | cowork |
| [Aider Leaderboard](https://aider.chat/docs/leaderboards/) | 2026-05-17 | gpt-5 leads Aider polyglot at 88.0% across 6 languages on 225 exercises; gpt-5 medium second at 86.7% | Code edit correctness on 225 Exercism exercises across C++, Go, Java, JavaScript, Python, Rust; automated pass/fail | 225 exercises | high | chatgpt |
| [Hacker News / GitHub](https://github.com/kimjune01/swebench-verified) | 2026-05-24 | Three-stage pipeline using Claude Sonnet for generation and GPT-5.5 Codex as adversarial filter resolves 426/438 SWE-bench Verified instances (~97%); median solve time 8 minutes | 3-stage agent loop (recon, craft, audit); adversarial filter rejects weak patches; automated test-pass; 500-instance SWE-bench Verified set | not listed | high | none |
| [datacurve.ai](https://deepswe.datacurve.ai/blog) | 2026-05-26 | GPT-5.5 leads DeepSWE at 70%±4%; Claude Opus 4.7 second at 54%±5%; stronger models self-generate tests 80%+ of the time | 113 original tasks across 91 repos in 5 languages; mini-swe-agent harness; contamination-free; automated pass/fail on observable behavior | 113 tasks, 91 repos | high | chatgpt |
| [mini-swe-agent.com](https://mini-swe-agent.com/latest/) | 2026-05-28 | A 100-line bash-only agent achieves >74% on SWE-bench Verified with Gemini 3 Pro, competitive with far more complex frameworks; adopted by Meta and NVIDIA | SWE-bench Verified; bash-only subprocess approach; no custom tools; linear message history; sandboxed execution | not listed | medium | none |
| [Ramp Labs](https://labs.ramp.com/swebench) | 2026-06-12 | Contamination-free SWE-bench on real production tasks: Claude Fable 5 leads at 87.5% resolved; Opus 4.7 and GPT-5.5 tie at 83.8%. | 80 real production engineering tasks per model; 14 frontier models; automated test-pass | 80 tasks × 14 models | high | cowork |
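Several figures in the table above are easier to compare once the arithmetic behind them is spelled out. The sketch below recomputes a few of them. The 500-instance denominator for the pipeline row and the task counts of 70 and 67 for the Ramp row are inferences made for illustration, not numbers stated by the sources.

```python
def rate(resolved, total):
    """Resolve rate as a percentage."""
    return 100.0 * resolved / total


def intervals_overlap(a, a_err, b, b_err):
    """True if the ranges a±a_err and b±b_err share at least one point."""
    return (a - a_err) <= (b + b_err) and (b - b_err) <= (a + a_err)


# Pipeline row: 426 resolved out of the 438 instances reported.
pipeline_reported = rate(426, 438)
# Same count against the full 500-instance set mentioned in the methodology
# (an assumption about the denominator, not a figure from the source).
pipeline_full_set = rate(426, 500)
print(f"{pipeline_reported:.1f}% vs {pipeline_full_set:.1f}%")

# Ramp row: with 80 tasks per model, one task is worth 1.25 points, so the
# reported 87.5% and 83.8% are consistent with 70 and 67 resolved tasks.
ramp_leader = rate(70, 80)   # 87.5
ramp_tied = rate(67, 80)     # 83.75, reported as 83.8
print(ramp_leader, ramp_tied)

# DeepSWE row: 70%±4% against 54%±5%.
deepswe_overlap = intervals_overlap(70, 4, 54, 5)
print(deepswe_overlap)
```

On these assumptions the pipeline result ranges from roughly 85% to 97% depending on the denominator, and the reported DeepSWE ranges do not overlap.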


## How we work


OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into `/assets/benchmarks.json`. Every row here links back to the original publication.
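To make the idea of a structured entry concrete, here is a hedged sketch of what one record could look like. The actual schema of `/assets/benchmarks.json` is not documented on this page, so the `BenchmarkEntry` class and its field names are hypothetical and simply mirror the columns of the table above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BenchmarkEntry:
    # Hypothetical fields mirroring the table columns; the real schema may differ.
    source: str
    url: str
    date: str                     # ISO 8601 date, e.g. "2026-06-12"
    finding: str
    methodology: str
    quality: str                  # "high" or "medium" in the rows on this page
    winner: Optional[str] = None  # platform slug, or None when no winner is tagged


entry = BenchmarkEntry(
    source="Ramp Labs",
    url="https://labs.ramp.com/swebench",
    date="2026-06-12",
    finding="Claude Fable 5 leads at 87.5% resolved; Opus 4.7 and GPT-5.5 tie at 83.8%.",
    methodology="80 real production engineering tasks per model; 14 frontier models; automated test-pass",
    quality="high",
    winner="cowork",
)

# Serialize the record the way an aggregator might before writing it to disk.
serialized = json.dumps(asdict(entry), indent=2)
print(serialized)
```

Keeping the source URL in every record is what lets each row link back to the original publication.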

