Last updated: 2026-04-18

What is a Benchmark?

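A standardized test that compares agent or LLM performance on a fixed set of tasks: SWE-bench for resolving real GitHub issues, GAIA for general-assistant tasks that require tool use, HumanEval for code generation. Always check the methodology before trusting a ranking; scores are often gamed or overfit, for example when test items leak into training data or a system is tuned to the benchmark itself.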
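
As a rough illustration of why methodology matters, here is a minimal sketch of a benchmark harness. Everything in it is hypothetical: the toy tasks, the `run_benchmark` helper, and the two stand-in "systems" are invented for this example and do not reflect how SWE-bench, GAIA, or HumanEval are actually scored. It shows a system that has memorized the public task set scoring perfectly there and failing on held-out tasks.

```python
from typing import Callable, Sequence, Tuple

Task = Tuple[str, str]  # (prompt, expected answer)

# Hypothetical public task set; real benchmarks are far larger and curated.
PUBLIC_TASKS: Sequence[Task] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("reverse 'abc'", "cba"),
]

# Hypothetical held-out tasks the systems have never seen.
HELD_OUT_TASKS: Sequence[Task] = [
    ("3 + 3", "6"),
    ("capital of Japan", "Tokyo"),
]


def run_benchmark(system: Callable[[str], str],
                  tasks: Sequence[Task] = PUBLIC_TASKS) -> float:
    """Return the fraction of tasks the system answers exactly right."""
    if not tasks:
        raise ValueError("benchmark needs at least one task")
    passed = sum(1 for prompt, expected in tasks
                 if system(prompt).strip() == expected)
    return passed / len(tasks)


def memorizer(prompt: str) -> str:
    """Stand-in for an overfit system: it only looks up leaked answers."""
    return dict(PUBLIC_TASKS).get(prompt, "")


def partial_solver(prompt: str) -> str:
    """Stand-in for a weak but honest system: it handles one task only."""
    return "4" if prompt == "2 + 2" else "unknown"


if __name__ == "__main__":
    print("memorizer, public:     ", run_benchmark(memorizer))
    print("memorizer, held-out:   ", run_benchmark(memorizer, HELD_OUT_TASKS))
    print("partial_solver, public:", run_benchmark(partial_solver))
```

On the public tasks alone the memorizer outranks the partial solver, which is why a ranking is only as trustworthy as the evidence that its test set stayed unseen.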
