Agent security & vulnerability handling — Benchmark Sources & Consensus
How well coding agents resist prompt injection, avoid generating vulnerable code, and find real security vulnerabilities — benchmarked against curated attack-class scenarios and CVE datasets.
Platforms tracked: OpenClaw · Claude Cowork · ChatGPT · Kilocode
Consensus across 3 sources
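Security results vary by benchmark and attack type: Claude Code Sonnet leads on coding-agent security scenarios, while a Microsoft multi-agent system leads on the Mythos cybersecurity benchmark. The third source, a CVE-finding comparison, had no full scoring available at retrieval time.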
All Sources
We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.
| Source | Date | Finding | Methodology | Quality |
|---|---|---|---|---|
| Hacker News / GitHub Gist | 2026-05-29 | Claude Code Sonnet outperforms Codex GPT-5 on coding-agent security scenarios (+11 vs +4) across 20 scenarios and 8 attack classes; CMD-INJ defence is Claude-specific. | 20 scenarios across 8 attack classes; automated scoring on AgentToolBench-Code harness | high |
| trent.ai blog | 2026-05-29 | Claude Code, Codex, Semgrep, CodeQL, and Trent compared on 28 real CVEs from CWE-Bench; full scoring unavailable at retrieval time (HTTP 500). | 28 CWE-Bench CVEs; AI coding agents vs static analysis tools; automated CVE-finding scoring. URL returned HTTP 500 — partial data. | medium |
| GeekWire | 2026-05-17 | A Microsoft multi-agent system tops Anthropic's entry on the Mythos cybersecurity benchmark; full methodology unavailable (GeekWire returned 403 at retrieval). | Anthropic Mythos cybersecurity benchmark; multi-agent vs single-agent comparison. GeekWire returned 403 — details incomplete. | medium |
How we work
OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
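As a rough illustration of what a "structured entry" could look like, the sketch below models one table row and serializes it to JSON. The schema is an assumption: the field names simply mirror the table columns above, the `low` quality label is hypothetical, and the names `BenchmarkEntry`, `validate`, and `serialize` are invented for this example. The actual layout of /assets/benchmarks.json is not documented here.

```python
import datetime
import json
from dataclasses import asdict, dataclass

# Hypothetical label set; the table above only shows "high" and "medium".
ALLOWED_QUALITY = {"high", "medium", "low"}


@dataclass
class BenchmarkEntry:
    """One aggregated row; field names mirror the table columns (assumed schema)."""
    source: str
    date: str         # ISO 8601 date, e.g. "2026-05-17"
    finding: str
    methodology: str
    quality: str


def validate(entry: BenchmarkEntry) -> BenchmarkEntry:
    """Raise ValueError if the date is malformed or the quality label is unknown."""
    datetime.date.fromisoformat(entry.date)  # raises ValueError on a bad date
    if entry.quality not in ALLOWED_QUALITY:
        raise ValueError(f"unknown quality label: {entry.quality!r}")
    return entry


def serialize(entries: list[BenchmarkEntry]) -> str:
    """Validate every entry and return JSON text an aggregator might write out."""
    return json.dumps([asdict(validate(e)) for e in entries], indent=2)


# Example row taken from the GeekWire entry in the table above.
example = BenchmarkEntry(
    source="GeekWire",
    date="2026-05-17",
    finding="A Microsoft multi-agent system tops Anthropic's entry on the Mythos cybersecurity benchmark.",
    methodology="Multi-agent vs single-agent comparison; details incomplete (403 at retrieval).",
    quality="medium",
)
print(serialize([example]))
```

Validating before serializing means a malformed date or an unexpected quality label fails loudly instead of landing silently in the published file.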