Last updated: 2026-05-17

Local / on-device agents — Benchmark Sources & Consensus

Running an agent entirely on local hardware (no cloud API calls).

Platforms tracked: Nemoclaw · Openclaw

Consensus across 3 sources

Across 3 sources, local agent performance varies widely with framework design. Reported coding success rates differ by about 16 points between model-quant and harness combinations on the same hardware (66.2% vs 82.5%), and by 80 points with versus without state machine constraints (20% vs 100%, on a 5-task subset).
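To make "state machine constraints" concrete, here is a minimal sketch of one way an agent's tool space could be restricted by state. It is an illustration only and not the framework from the cited source: the state names, tool names, and transitions are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical states and tool names, chosen for illustration only.
ALLOWED_TOOLS = {
    "explore": {"read_file", "search"},
    "edit": {"read_file", "write_file"},
    "verify": {"run_tests"},
}

# Hypothetical transitions: (current state, tool called) -> next state.
TRANSITIONS = {
    ("explore", "search"): "explore",
    ("explore", "read_file"): "edit",
    ("edit", "read_file"): "edit",
    ("edit", "write_file"): "verify",
    ("verify", "run_tests"): "explore",
}


@dataclass
class ToolStateMachine:
    state: str = "explore"
    history: list = field(default_factory=list)

    def allowed(self) -> set:
        """Tools the model may call in the current state."""
        return ALLOWED_TOOLS[self.state]

    def step(self, tool: str) -> bool:
        """Accept the call and advance if it is legal; otherwise reject it."""
        if tool not in self.allowed():
            return False  # the harness would ask the model to choose again
        self.history.append((self.state, tool))
        self.state = TRANSITIONS[(self.state, tool)]
        return True


sm = ToolStateMachine()
print(sm.step("write_file"))  # rejected: editing is not allowed while exploring
print(sm.step("read_file"), sm.state)
```

The idea is that a small local model only ever sees the few tools that are valid at the current step, rather than the full tool list.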

All Sources

We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.

| Source | Date | Finding | Methodology | Quality |
| --- | --- | --- | --- | --- |
| Hacker News | 2026-03-27 | A $500 local GPU using multi-solution generation with test-feedback filtering achieves performance comparable to Claude Sonnet on coding tasks | Multi-candidate solution sampling with iterative refinement; compared against Claude Sonnet API on coding benchmarks | medium |
| neuralnoise.com | 2026-04-28 | 17 model-quants × 5 harnesses on M3 Max: claude harness 3rd at 66.2%; Qwen3.6-27B+pi tops at 82.5% on 16 SE coding tasks | 17 quants × 5 harnesses × 16 SE tasks on local M3 Max 128GB; automated pass/fail | high |
| Hacker News | 2026-05-17 | State machine constraints improved local models (Gemma, Llama) from 20% to 100% on a 5-task SWE-bench subset; framework integrates with Claude Code and Cursor | Local model comparison on 5-task SWE-bench subset; with vs without tool-space constraints; automated pass/fail | medium |
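The multi-candidate approach in the first row can be sketched roughly as follows. This is a simplified illustration, not the cited project's code: it covers only sampling and test-based filtering (no iterative refinement), and `stub_model`, `run_tests`, and the `add` task are hypothetical stand-ins for a local model and a real test suite.

```python
from typing import Callable, Optional


def best_of_n(
    generate: Callable[[str, int], str],
    passes_tests: Callable[[str], bool],
    prompt: str,
    n: int = 8,
) -> Optional[str]:
    """Sample up to n candidates and return the first one that passes the tests."""
    for attempt in range(n):
        candidate = generate(prompt, attempt)
        if passes_tests(candidate):
            return candidate
    return None  # no candidate survived filtering


def run_tests(source: str) -> bool:
    """Hypothetical test harness: execute the candidate and check one case."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # only safe for trusted or sandboxed code
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


def stub_model(prompt: str, attempt: int) -> str:
    """Stand-in for a local model: cycles through canned candidates."""
    candidates = [
        "def add(a, b): return a - b",  # wrong on purpose
        "def add(a, b): return a + b",
    ]
    return candidates[attempt % len(candidates)]


solution = best_of_n(stub_model, run_tests, "Write add(a, b).", n=4)
print(solution)
```

The trade-off is extra local compute per task in exchange for a higher chance that at least one candidate passes, which is why this pattern suits hardware where inference is cheap.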

How we work

OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
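As a rough illustration of what one structured entry could look like, the sketch below mirrors the columns of the sources table. The actual schema of /assets/benchmarks.json is not documented here, so every field name, the set of quality levels, and the placeholder URL are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

# Assumed levels; the table above only shows "medium" and "high".
QUALITY_LEVELS = {"low", "medium", "high"}


@dataclass
class BenchmarkEntry:
    """Hypothetical entry shape, one per row of the sources table."""
    source: str
    date: str  # ISO 8601, e.g. "2026-04-28"
    finding: str
    methodology: str
    quality: str
    url: str  # link back to the original publication

    def __post_init__(self) -> None:
        date.fromisoformat(self.date)  # raises ValueError if malformed
        if self.quality not in QUALITY_LEVELS:
            raise ValueError(f"unknown quality level: {self.quality!r}")


entry = BenchmarkEntry(
    source="neuralnoise.com",
    date="2026-04-28",
    finding="Qwen3.6-27B+pi tops at 82.5% on 16 SE coding tasks",
    methodology="17 quants × 5 harnesses × 16 SE tasks on local M3 Max 128GB; automated pass/fail",
    quality="high",
    url="https://example.invalid/placeholder",  # placeholder, not the real link
)

serialized = json.dumps(asdict(entry), ensure_ascii=False, indent=2)
print(serialized)
```

Validating the date and quality level when the entry is created keeps malformed rows out of the file before they reach the page.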
