Last updated: 2026-05-31

AI Agent Benchmarks — Community Comparison Hub

Which agent wins at Python refactoring? Email triage? Tool use? Security? Memory persistence? The community has been running these tests for years — we collect every credible comparison in one place, link back to every author, and let you judge. We don't run our own benchmarks; we curate the ecosystem's.

Not sure which agent to try in the first place? Start at the decision guide →

How this page works

We link, we don't rehost. Every finding points to the original author's work.
Methodology is shown per source — human-voted, automated tests, token logs, etc. — so you can decide what to trust.
We don't pick winners. Consensus lines are presented as "across N sources" — never as our own ranking.
Never deleted. Disputed or superseded sources get marked, not removed.
Updated weekly by an automation that watches leaderboards, Reddit, and Hacker News.
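
To make the rules above concrete, here is a minimal sketch of how a source record and a consensus line might be modeled. It is an illustration only, not this page's actual implementation. The names `Source`, `Status`, `mark`, and `consensus_line`, the `top_agent` field, and the choice to leave marked sources out of the consensus count are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Hypothetical status flags: sources are marked, never removed."""
    ACTIVE = "active"
    DISPUTED = "disputed"
    SUPERSEDED = "superseded"


@dataclass
class Source:
    author: str
    url: str          # link back to the original work; nothing is rehosted
    methodology: str  # e.g. "human-voted", "automated tests", "token logs"
    top_agent: str    # hypothetical field: the agent this source ranked first
    status: Status = Status.ACTIVE


def mark(source: Source, status: Status) -> None:
    """Flag a source as disputed or superseded instead of deleting it."""
    source.status = status


def consensus_line(sources: list[Source]) -> str:
    """Summarize as 'across N sources' rather than as an editorial ranking.

    Assumption: marked sources stay listed but are not counted here.
    """
    active = [s for s in sources if s.status is Status.ACTIVE]
    if not active:
        return "No active sources yet"
    agent, votes = Counter(s.top_agent for s in active).most_common(1)[0]
    return f"Across {len(active)} sources: {agent} ranked first in {votes}"


# Usage with placeholder data (agent names and URLs are made up).
sources = [
    Source("author-a", "https://example.com/a", "human-voted", "AgentX"),
    Source("author-b", "https://example.com/b", "automated tests", "AgentX"),
    Source("author-c", "https://example.com/c", "token logs", "AgentY"),
]
mark(sources[2], Status.DISPUTED)  # still in the list, just flagged
line = consensus_line(sources)
print(line)
```

Keeping the status on the record, rather than filtering the list, means a disputed source remains visible with its methodology and link intact.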

Live leaderboards

General-purpose rankings updated continuously by their maintainers. Use these for a baseline view before diving into task-specific comparisons.


Task-specific comparisons

Curated community benchmarks grouped by task. Click through for source-by-source breakdowns.

Agent memory persistence — An agent's ability to retain, transfer, and recall context across sessions, measured by task success rates before and after memory handoffs between models or across restarts (1 source)
Agent security & vulnerability handling — How well coding agents resist prompt injection, avoid generating vulnerable code, and find real security vulnerabilities — benchmarked against curated attack-class scenarios and CVE datasets (4 sources)
Code generation & program synthesis — Generating or reconstructing working programs from specs, binaries, or natural-language descriptions. These are frontier-difficulty benchmarks where state-of-the-art scores are still well under 10% (1 source)
Cost per task — Total dollar cost to complete a representative agent workflow (2 sources)
Email triage — Sorting, drafting replies to, and flagging incoming email for human review (0 sources)
Local / on-device agents — Running an agent entirely on local hardware (no cloud API calls) (3 sources)
Long-context summarization — Summarizing documents longer than 32K tokens without losing key facts (0 sources)
Python refactoring — Rewriting Python code for clarity, performance, or style with existing tests passing (7 sources)
Tool use / MCP — Ability to select, call, and chain external tools (MCP servers, function calls) correctly (1 source)