Last updated: 2026-06-07

Tool use / MCP — Benchmark Sources & Consensus

Ability to select, call, and chain external tools (MCP servers, function calls) correctly.

Platforms tracked: Claude Cowork · OpenClaw · Hermes · ChatGPT

Consensus across 1 source

In the one source tracked so far, poor MCP tool design (excessive data return, tool proliferation) consumed 4.98× more input tokens and took 35 more ReAct loops than a well-designed equivalent with the same functionality, measured across 40 identical test prompts.

All Sources

We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.

Source | Date | Finding | Methodology | Quality
Hacker News | 2026-06-05 | Poor MCP tool design (excessive data return, tool proliferation) consumed 4.98× more input tokens and 35 more ReAct loops vs a well-designed equivalent across 40 identical test prompts. | 40 identical test prompts on 2 MCP implementations with same functionality; measured input tokens, agent ReAct loops, and total time | high
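
As a rough illustration of the methodology in this row, the sketch below shows how an input-token ratio and a ReAct-loop difference could be derived from per-prompt measurements of two implementations. The `RunMetrics` type, the `compare` helper, and the sample numbers are hypothetical. The source reports only aggregate results, and it is an assumption here that both figures are computed over totals across all prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Measurements for one test prompt (hypothetical structure)."""
    input_tokens: int
    react_loops: int


def compare(
    well_designed: list[RunMetrics],
    poorly_designed: list[RunMetrics],
) -> tuple[float, int]:
    """Return (input-token ratio, extra ReAct loops), poor vs. well-designed.

    Assumption: both figures are taken over totals across all prompts.
    """
    if len(well_designed) != len(poorly_designed):
        raise ValueError("both implementations must run the same prompts")

    good_tokens = sum(m.input_tokens for m in well_designed)
    poor_tokens = sum(m.input_tokens for m in poorly_designed)
    if good_tokens == 0:
        raise ValueError("baseline token total must be non-zero")

    good_loops = sum(m.react_loops for m in well_designed)
    poor_loops = sum(m.react_loops for m in poorly_designed)
    return poor_tokens / good_tokens, poor_loops - good_loops


# Illustrative numbers only; these are NOT data from the source.
good = [RunMetrics(1000, 3), RunMetrics(2000, 4)]
poor = [RunMetrics(4000, 5), RunMetrics(8000, 7)]
token_ratio, extra_loops = compare(good, poor)
print(f"{token_ratio:.2f}x input tokens, {extra_loops} more ReAct loops")
```

Note that "N× more" is read here as a plain ratio of totals. If the source instead meant an increase over the baseline, the ratio would be one higher.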

How we work

OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
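
A minimal sketch of what one structured entry might look like when appended to a benchmarks file. The field names mirror the table columns on this page (source, date, finding, methodology, quality). The `BenchmarkEntry` type, the `append_entry` helper, and the assumption that the file holds a flat JSON array are hypothetical and not taken from the actual routine. The example writes to a temporary directory rather than /assets/benchmarks.json.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class BenchmarkEntry:
    """One aggregated benchmark row (hypothetical schema)."""
    source: str
    date: str          # ISO 8601 date of the original publication
    finding: str
    methodology: str
    quality: str


def append_entry(path: Path, entry: BenchmarkEntry) -> int:
    """Append an entry to a JSON array file and return the new entry count."""
    if path.exists():
        entries = json.loads(path.read_text(encoding="utf-8"))
    else:
        entries = []
    entries.append(asdict(entry))
    path.write_text(
        json.dumps(entries, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return len(entries)


entry = BenchmarkEntry(
    source="Hacker News",
    date="2026-06-05",
    finding="Poor MCP tool design consumed 4.98× more input tokens "
            "and 35 more ReAct loops vs a well-designed equivalent.",
    methodology="40 identical test prompts on 2 MCP implementations",
    quality="high",
)

# Write to a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "benchmarks.json"
    count = append_entry(target, entry)
    stored = json.loads(target.read_text(encoding="utf-8"))
```

Keeping each entry flat and keyed by the same columns shown in the table makes it straightforward to render rows directly from the file.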
