Local / on-device agents — Benchmark Sources & Consensus
Running an agent entirely on local hardware (no cloud API calls).
Platforms tracked: Nemoclaw · Openclaw
Consensus across 2 sources
Local agents show promise: a $500 GPU setup rivals cloud Claude Sonnet on coding tasks, and harness choice on M3 Max hardware swings coding pass rates by more than 16 points.
All Sources
We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.
| Source | Date | Finding | Methodology | Quality |
|---|---|---|---|---|
| Hacker News | 2026-03-27 | A $500 local GPU using multi-solution generation with test-feedback filtering achieves performance comparable to Claude Sonnet on coding tasks | Multi-candidate solution sampling with iterative refinement; compared against Claude Sonnet API on coding benchmarks | medium |
| neuralnoise.com | 2026-04-28 | 17 model-quants × 5 harnesses on M3 Max: claude harness 3rd at 66.2%; Qwen3.6-27B+pi tops at 82.5% on 16 SE coding tasks | 17 quants × 5 harnesses × 16 SE tasks on local M3 Max 128GB; automated pass/fail | high |
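The Hacker News result relies on multi-candidate sampling with test-feedback filtering: generate several solutions, keep only those that pass the task's tests, and feed failures back into the next round. The post does not publish its code, so the sketch below is a minimal illustration; `model`, `run_tests`, and all parameter names are hypothetical.

```python
# Hypothetical sketch of multi-candidate sampling with test-feedback
# filtering. All names are illustrative, not from the original post.

def sample_solutions(task, n, model):
    """Ask the local model for n independent candidate solutions."""
    return [model(task, seed=i) for i in range(n)]

def filter_by_tests(candidates, run_tests):
    """Keep only candidates that pass the task's test suite."""
    return [c for c in candidates if run_tests(c)]

def solve(task, model, run_tests, n=8, rounds=3):
    """Sample, filter by tests, and retry with failure feedback."""
    feedback = ""
    for _ in range(rounds):
        candidates = sample_solutions(task + feedback, n, model)
        passing = filter_by_tests(candidates, run_tests)
        if passing:
            return passing[0]  # first candidate that passes all tests
        feedback = "\n# previous attempts failed the tests"
    return None  # no candidate passed within the round budget
```

The key cost trade-off: a cheap local model sampled N times plus a test harness can substitute for one expensive cloud call, provided the task ships with executable tests.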
How we work
OpenClawDatabase aggregates and links to published benchmarks; we don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json.
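The actual schema of /assets/benchmarks.json is not published here; the snippet below shows one plausible entry shape, with field names chosen to mirror the table columns above. Treat every key as an assumption.

```python
import json

# Hypothetical entry shape for /assets/benchmarks.json; the real schema
# is not published, so all field names here are illustrative.
entry = {
    "task": "local-on-device-agents",
    "source": "neuralnoise.com",
    "date": "2026-04-28",
    "finding": "Qwen3.6-27B+pi tops at 82.5% on 16 SE coding tasks",
    "methodology": "17 quants x 5 harnesses x 16 SE tasks on M3 Max 128GB",
    "quality": "high",
}

# Serialize a list of entries, as a JSON array file would hold them.
serialized = json.dumps([entry], indent=2)
```

Keeping one flat record per source row is what lets each table row link back to its original publication.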