Last updated: 2026-05-31
AI Agent Benchmarks — Community Comparison Hub
Which agent wins at Python refactoring? Email triage? Tool use? Security? Memory persistence? The community has been running these tests for years — we collect every credible comparison in one place, link back to every author, and let you judge. We don't run our own benchmarks; we curate the ecosystem's.
Not sure which agent to try in the first place? Start at the decision guide →
How this page works
- We link, we don't rehost. Every finding points to the original author's work.
- Methodology is shown per source — human-voted, automated tests, token logs, etc. — so you can decide what to trust.
- We don't pick winners. Consensus lines are presented as "across N sources" — never as our own ranking.
- Never deleted. Disputed or superseded sources get marked, not removed.
- Updated weekly by an automation that watches leaderboards, Reddit, and Hacker News.
Live leaderboards
General-purpose rankings updated continuously by their maintainers. Use these for a baseline view before diving into task-specific comparisons.
Task-specific comparisons
Curated community benchmarks grouped by task. Click through for source-by-source breakdowns.
- Agent memory persistence — An agent's ability to retain, transfer, and recall context across sessions, measured by task success rates before and after memory handoffs between models or restarts (1 source)
- Agent security & vulnerability handling — How well coding agents resist prompt injection, avoid generating vulnerable code, and find real security vulnerabilities — benchmarked against curated attack-class scenarios and CVE datasets (4 sources)
- Code generation & program synthesis — Generating or reconstructing working programs from specs, binaries, or natural-language descriptions, on frontier-difficulty benchmarks where state-of-the-art scores are still well under 10% (1 source)
- Cost per task — Total dollar cost to complete a representative agent workflow (2 sources)
- Email triage — Sorting, drafting replies to, and flagging incoming email for human review (0 sources)
- Local / on-device agents — Running an agent entirely on local hardware (no cloud API calls) (3 sources)
- Long-context summarization — Summarizing documents longer than 32K tokens without losing key facts (0 sources)
- Python refactoring — Rewriting Python code for clarity, performance, or style with existing tests passing (7 sources)
- Tool use / MCP — Ability to select, call, and chain external tools (MCP servers, function calls) correctly (1 source)