Last updated: 2026-04-16
AI Agent Benchmarks — Community Comparison Hub
Which agent wins at Python refactoring? Email triage? Tool use? The community has been running these tests for years — we collect every credible comparison in one place, link back to every author, and let you judge. We don't run our own benchmarks; we curate the ecosystem's.
Not sure which agent to try in the first place? Start at the decision guide →
How this page works
- We link, we don't rehost. Every finding points to the original author's work.
- Methodology is shown per source — human-voted, automated tests, token logs, etc. — so you can decide what to trust.
- We don't pick winners. Consensus lines are presented as "across N sources" — never as our own ranking.
- Never deleted. Disputed or superseded sources get marked, not removed.
- Updated weekly by an automated job that watches leaderboards, Reddit, and Hacker News.
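For readers who want to see how the rules above could fit together, here is a minimal sketch in Python. It is not this site's actual code: the `Source` record, its field names, the `status` values, and the `consensus_line` helper are all hypothetical, and the example URLs are placeholders. It shows a per-source methodology label, a status flag that marks a disputed source instead of deleting it, and a tally phrased as "across N sources" with no winner singled out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One community comparison (hypothetical schema, for illustration only)."""
    author: str
    url: str                 # link to the original work; nothing is rehosted
    methodology: str         # e.g. "human-voted", "automated tests", "token logs"
    top_agent: str           # agent this source rated highest on the task
    status: str = "active"   # "active", "disputed", or "superseded"


def consensus_line(sources: list[Source]) -> str:
    """Summarize active sources as a tally, without declaring a winner."""
    # Disputed and superseded sources stay in the list but are not counted.
    active = [s for s in sources if s.status == "active"]
    if not active:
        return "Across 0 sources: no active findings"
    counts = Counter(s.top_agent for s in active)
    # Sort by count (descending), then by name, so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    tally = ", ".join(f"{agent} ({n})" for agent, n in ordered)
    return f"Across {len(active)} sources: {tally}"


# Placeholder data: authors, URLs, and agent names are made up.
sources = [
    Source("author-1", "https://example.com/a", "human-voted", "agent-a"),
    Source("author-2", "https://example.com/b", "automated tests", "agent-b"),
    Source("author-3", "https://example.com/c", "token logs", "agent-a"),
    Source("author-4", "https://example.com/d", "automated tests", "agent-b",
           status="disputed"),
]

print(consensus_line(sources))  # Across 3 sources: agent-a (2), agent-b (1)
```

The disputed entry remains in `sources`, so it can still be displayed with a marker; it is only left out of the tally.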
Live leaderboards
General-purpose rankings updated continuously by their maintainers. Use these for a baseline view before diving into task-specific comparisons.
Task-specific comparisons
Curated community benchmarks grouped by task. Click through for source-by-source breakdowns.