Last updated: 2026-04-19

Python refactoring — Benchmark Sources & Consensus

Name: Python refactoring benchmark sources
Creator: OpenClawDatabase
License: https://creativecommons.org/licenses/by/4.0/

Rewriting Python code for clarity, performance, or style while keeping the existing tests passing.
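
As a minimal illustration of this task definition, a refactor counts as successful when the rewritten code passes the same checks as the original. The function names and the small test below are hypothetical and are not taken from any tracked benchmark.

```python
def total_even_squares_original(numbers):
    # Original version: explicit loop with a manual accumulator.
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total = total + n * n
    return total


def total_even_squares_refactored(numbers):
    # Refactored version: same behavior, written as a generator expression.
    return sum(n * n for n in numbers if n % 2 == 0)


def check_even_squares(func):
    # Stand-in for the "existing tests": they must pass before and after.
    assert func([]) == 0
    assert func([1, 3, 5]) == 0
    assert func([1, 2, 3, 4]) == 20
    return True


# Both versions are held to the same test, unchanged by the refactor.
check_even_squares(total_even_squares_original)
check_even_squares(total_even_squares_refactored)
```

The point of the sketch is that the test function is untouched: only the implementation changes, which is what the benchmarks on this page score automatically.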

Platforms tracked: Claude Cowork · OpenClaw · Hermes · ChatGPT

Consensus across 2 sources

Both sources place Claude-backed agents at the top of code-editing benchmarks: SWE-bench Verified shows Claude Opus 4.7 at 87.6%, and Aider polyglot reports Claude Sonnet 4.5 leading code-edit correctness at 89.4%.

All Sources

We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.

Source	Date	Finding	Methodology	Quality	Winner
Paul Gauthier / Aider community	2026-03-28	Claude Sonnet 4.5 leads code-edit correctness at 89.4%	Code-edit correctness on 133 Exercism Python exercises; automated pass/fail	high	cowork
marc0.dev	2026-04-19	Claude Opus 4.7 leads SWE-bench Verified at 87.6%; GPT-5.3-Codex second at 85.0%; agent frameworks add 10-20 points over raw model scores	Agent resolves real GitHub issues from verified repos; automated test-pass scoring	high	cowork

How we work

OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
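
The shape of an entry in /assets/benchmarks.json is not published on this page. The sketch below assumes a hypothetical schema (the field names `source`, `date`, `finding`, `quality`, and `winner` are illustrative guesses based on the table above) and shows how such entries could be serialized, read back, and tallied.

```python
import json
from collections import Counter

# Hypothetical entries mirroring the two table rows; the schema is an assumption.
entries = [
    {
        "source": "Paul Gauthier / Aider community",
        "date": "2026-03-28",
        "finding": "Claude Sonnet 4.5 leads code-edit correctness at 89.4%",
        "quality": "high",
        "winner": "cowork",
    },
    {
        "source": "marc0.dev",
        "date": "2026-04-19",
        "finding": "Claude Opus 4.7 leads SWE-bench Verified at 87.6%",
        "quality": "high",
        "winner": "cowork",
    },
]


def tally_winners(rows):
    # Count how many sources report each platform as the top performer.
    return Counter(row["winner"] for row in rows)


# Round-trip through JSON text, as an aggregator writing a file might do.
serialized = json.dumps(entries, indent=2)
restored = json.loads(serialized)

winner_counts = tally_winners(restored)
print(winner_counts.most_common(1))
```

Keeping each entry as a flat record with the source and date alongside the finding means every tallied figure can still be traced to its original publication.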
