Last updated: 2026-04-18
What is a Benchmark?
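A standardized test that compares agent or LLM performance on a fixed set of tasks. Examples include SWE-bench (resolving real GitHub issues), GAIA (general assistant tasks that require tool use), and HumanEval (Python code generation). Always check the methodology before trusting a ranking: benchmarks are often gamed or overfit to, so a high score may not reflect real-world performance.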
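As a rough illustration of why methodology matters, the sketch below scores two made-up agents on a toy task set and then on held-out tasks. Everything here (the tasks, `score`, `memorizer`, `solver`) is hypothetical and not taken from any real benchmark harness; it only shows how an agent that has memorized the public test set can top a ranking without being better.

```python
from typing import Callable, List, Tuple

Task = Tuple[str, str]  # (prompt, expected answer)

# Hypothetical public benchmark tasks and a held-out set of similar tasks.
PUBLIC_TASKS: List[Task] = [("2 + 2", "4"), ("3 + 5", "8"), ("10 + 1", "11")]
HELD_OUT_TASKS: List[Task] = [("4 + 4", "8"), ("7 + 2", "9")]


def score(agent: Callable[[str], str], tasks: List[Task]) -> float:
    """Return the fraction of tasks the agent answers exactly right."""
    passed = sum(1 for prompt, expected in tasks if agent(prompt) == expected)
    return passed / len(tasks)


def memorizer(prompt: str) -> str:
    """Overfit agent: looks up answers it has seen in the public set."""
    return dict(PUBLIC_TASKS).get(prompt, "")


def solver(prompt: str) -> str:
    """Agent that actually adds, but only handles single-digit operands."""
    left, _, right = prompt.split()
    if len(left) > 1 or len(right) > 1:
        return ""
    return str(int(left) + int(right))


public_scores = {"memorizer": score(memorizer, PUBLIC_TASKS),
                 "solver": score(solver, PUBLIC_TASKS)}
held_out_scores = {"memorizer": score(memorizer, HELD_OUT_TASKS),
                   "solver": score(solver, HELD_OUT_TASKS)}
```

On the public set the memorizer ranks first, while on the held-out set the order flips. Checking whether a benchmark uses held-out or refreshed tasks is one way to spot this kind of inflated ranking.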