Last updated: 2026-05-31

Agent security & vulnerability handling — Benchmark Sources & Consensus

How well coding agents resist prompt injection, avoid generating vulnerable code, and find real security vulnerabilities — benchmarked against curated attack-class scenarios and CVE datasets.

Platforms tracked: OpenClaw · Claude Cowork · ChatGPT · Kilocode

Consensus across 3 sources

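Results vary by benchmark rather than pointing to a single leader: Claude Code Sonnet leads Codex GPT-5 on coding-agent security scenarios (+11 vs +4), and a Microsoft multi-agent system leads Anthropic's entry on the Mythos cybersecurity benchmark. The third source, a CVE-finding comparison on CWE-Bench, had no full scoring available at retrieval time.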

All Sources

We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.

Source | Date | Finding | Methodology | Quality
Hacker News / GitHub Gist | 2026-05-29 | Claude Code Sonnet outperforms Codex GPT-5 on coding-agent security scenarios (+11 vs +4) across 20 scenarios and 8 attack classes; CMD-INJ defence is Claude-specific. | 20 scenarios across 8 attack classes; automated scoring on AgentToolBench-Code harness. | high; winner: openclaw
trent.ai blog | 2026-05-29 | Claude Code, Codex, Semgrep, CodeQL, and Trent compared on 28 real CVEs from CWE-Bench; full scoring unavailable at retrieval time (HTTP 500). | 28 CWE-Bench CVEs; AI coding agents vs static analysis tools; automated CVE-finding scoring. URL returned HTTP 500, so data is partial. | medium
GeekWire | 2026-05-17 | A Microsoft multi-agent system tops Anthropic's entry on the Mythos cybersecurity benchmark; full methodology unavailable (GeekWire returned 403 at retrieval). | Anthropic Mythos cybersecurity benchmark; multi-agent vs single-agent comparison. GeekWire returned 403, so details are incomplete. | medium

How we work

OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
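
As an illustration of what a "structured entry" could look like, here is a minimal Python sketch. The actual schema of /assets/benchmarks.json is not shown on this page, so everything here is an assumption: the BenchmarkEntry class, its field names (borrowed from the table columns above), the to_json helper, and the quality scale (only "high" and "medium" appear on this page; "low" is a guess).

```python
import json
from dataclasses import dataclass, asdict
from datetime import date

# Hypothetical schema: field names mirror the table columns on this page,
# not the real layout of /assets/benchmarks.json, which is not published here.
ALLOWED_QUALITY = {"high", "medium", "low"}  # "low" is an assumed third level


@dataclass
class BenchmarkEntry:
    source: str
    date: str          # publication date as YYYY-MM-DD
    finding: str
    methodology: str
    quality: str

    def validate(self) -> None:
        """Raise ValueError if the date or quality rating is malformed."""
        date.fromisoformat(self.date)  # raises ValueError on a bad date
        if self.quality not in ALLOWED_QUALITY:
            raise ValueError(f"unknown quality rating: {self.quality!r}")


def to_json(entries: list[BenchmarkEntry]) -> str:
    """Validate every entry, then serialize the list as a JSON array."""
    for entry in entries:
        entry.validate()
    return json.dumps([asdict(e) for e in entries], indent=2, ensure_ascii=False)


# Example row taken from the GeekWire entry in the table above.
example = BenchmarkEntry(
    source="GeekWire",
    date="2026-05-17",
    finding="A Microsoft multi-agent system tops Anthropic's entry "
            "on the Mythos cybersecurity benchmark.",
    methodology="Multi-agent vs single-agent comparison; "
                "details incomplete (403 at retrieval).",
    quality="medium",
)

if __name__ == "__main__":
    print(to_json([example]))
```

Validating before serializing means a malformed date or an unexpected quality label fails loudly instead of being written into the file; whether the real routine does this is not stated in the source.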
