Last updated: 2026-05-03

Local / on-device agents — Benchmark Sources & Consensus

Running an agent entirely on local hardware (no cloud API calls).

Platforms tracked: Nemoclaw · Openclaw

Consensus across 2 sources

Across 2 sources, local agents show promise: a $500 GPU setup rivals cloud Claude Sonnet on coding tasks, and harness selection on M3 Max hardware swings coding pass rates by more than 16 points.

All Sources

We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.

Source: Hacker News
Date: 2026-03-27
Finding: A $500 local GPU using multi-solution generation with test-feedback filtering achieves performance comparable to Claude Sonnet on coding tasks.
Methodology: Multi-candidate solution sampling with iterative refinement; compared against the Claude Sonnet API on coding benchmarks.
Quality: medium

Source: neuralnoise.com
Date: 2026-04-28
Finding: 17 model quants × 5 harnesses on an M3 Max: the claude harness placed 3rd at 66.2%; Qwen3.6-27B + pi topped at 82.5% on 16 SE coding tasks.
Methodology: 17 quants × 5 harnesses × 16 SE tasks on a local M3 Max (128 GB); automated pass/fail.
Quality: high
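The Hacker News finding hinges on multi-candidate sampling with test-feedback filtering: sample several solutions from a cheap local model, keep only those that pass the tests, and feed failure output back into the prompt when everything fails. A minimal sketch of that loop, assuming a local `generate()` callable (e.g. a wrapper around a llama.cpp endpoint); `fake_generate`, `check`, and the prompt format below are hypothetical stand-ins, not the setup the source used.

```python
from typing import Callable, List, Tuple

def sample_and_filter(
    generate: Callable[[str], str],               # local model: prompt -> candidate code
    run_tests: Callable[[str], Tuple[bool, str]], # candidate -> (passed, feedback)
    prompt: str,
    n_candidates: int = 8,
    max_rounds: int = 2,
) -> List[str]:
    """Sample several candidate solutions, keep the ones that pass the
    test suite, and on a fully failed round append the test output to
    the prompt before sampling again."""
    survivors: List[str] = []
    current_prompt = prompt
    for _ in range(max_rounds):
        candidates = [generate(current_prompt) for _ in range(n_candidates)]
        feedback: List[str] = []
        for cand in candidates:
            passed, output = run_tests(cand)
            if passed:
                survivors.append(cand)
            else:
                feedback.append(output)
        if survivors:  # at least one candidate passed: stop early
            break
        current_prompt = (prompt + "\n# Test feedback from failed attempts:\n"
                          + "\n".join(feedback))
    return survivors

# Hypothetical stand-ins for the local model and the test harness.
_canned = iter(["def add(a, b): return a - b",   # buggy candidate
                "def add(a, b): return a + b"])  # correct candidate

def fake_generate(prompt: str) -> str:
    return next(_canned)

def check(code: str) -> Tuple[bool, str]:
    ns: dict = {}
    exec(code, ns)
    if ns["add"](2, 3) == 5:
        return True, ""
    return False, "add(2, 3) != 5"

winners = sample_and_filter(fake_generate, check, "Write add(a, b).",
                            n_candidates=2, max_rounds=1)
print(winners)  # the buggy candidate is filtered out
```

The filtering step is what lets a small local model compete: generation is cheap, and the test suite, not the model, does the quality control.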

How we work

OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
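The entry schema for /assets/benchmarks.json is not published here; the following is a hypothetical illustration, assuming fields that mirror the table columns above (source, date, finding, methodology, quality) plus an assumed task identifier:

```json
{
  "task": "local-on-device-agents",
  "source": "neuralnoise.com",
  "date": "2026-04-28",
  "finding": "17 model quants × 5 harnesses on M3 Max: claude harness 3rd at 66.2%; Qwen3.6-27B+pi tops at 82.5% on 16 SE coding tasks",
  "methodology": "17 quants × 5 harnesses × 16 SE tasks on local M3 Max 128GB; automated pass/fail",
  "quality": "high"
}
```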

