Last updated: 2026-04-16

AI Agent Benchmarks — Community Comparison Hub

Which agent wins at Python refactoring? Email triage? Tool use? The community has been running these tests for years — we collect every credible comparison in one place, link back to every author, and let you judge. We don't run our own benchmarks; we curate the ecosystem's.

Not sure which agent to try in the first place? Start at the decision guide.

How this page works

  • We link, we don't rehost. Every finding points to the original author's work.
  • Methodology is shown per source — human-voted, automated tests, token logs, etc. — so you can decide what to trust.
  • We don't pick winners. Consensus lines are presented as "across N sources" — never as our own ranking.
  • Never deleted. Disputed or superseded sources get marked, not removed.
  • Updated weekly by an automation that watches leaderboards, Reddit, and Hacker News.
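For readers who think in data structures, the rules above can be sketched roughly as follows. This is an illustrative model only, not the code behind this page: the `Source` fields, the `Status` values, the `top_agent` field, and the choice to leave disputed or superseded sources out of the consensus count are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical status values: sources are marked, never deleted.
    ACTIVE = "active"
    DISPUTED = "disputed"
    SUPERSEDED = "superseded"


@dataclass
class Source:
    author: str
    url: str            # link to the original work; nothing is rehosted
    methodology: str    # e.g. "human-voted", "automated tests", "token logs"
    top_agent: str      # agent this source ranks first (hypothetical field)
    status: Status = Status.ACTIVE


def consensus_line(sources: list[Source]) -> str:
    """Report agreement as 'across N sources' without adding a ranking of our own."""
    # Assumption: only active sources count toward the consensus line.
    active = [s for s in sources if s.status is Status.ACTIVE]
    if not active:
        return "No active sources yet."
    counts: dict[str, int] = {}
    for s in active:
        counts[s.top_agent] = counts.get(s.top_agent, 0) + 1
    # Order by count (descending), then by name, so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    parts = [f"{agent} ranked first in {n}" for agent, n in ordered]
    return f"Across {len(active)} sources: " + "; ".join(parts) + "."


# Placeholder data: agent names and URLs are invented for illustration.
sources = [
    Source("alice", "https://example.com/a", "human-voted", "agent-a"),
    Source("bob", "https://example.com/b", "automated tests", "agent-a"),
    Source("carol", "https://example.com/c", "token logs", "agent-b",
           status=Status.DISPUTED),
]
print(consensus_line(sources))
```

The disputed entry stays in the list with its status changed, which mirrors the "marked, not removed" rule. The summary only counts what the linked sources themselves report.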

Live leaderboards

General-purpose rankings updated continuously by their maintainers. Use these for a baseline view before diving into task-specific comparisons.

Task-specific comparisons

Curated community benchmarks grouped by task. Click through for source-by-source breakdowns.

