Last updated: 2026-06-07

Agent security & vulnerability handling — Benchmark Sources & Consensus

Name: Agent security & vulnerability handling benchmark sources
Creator: OpenClawDatabase
License: https://creativecommons.org/licenses/by/4.0/

How well coding agents resist prompt injection, avoid generating vulnerable code, and find real security vulnerabilities — benchmarked against curated attack-class scenarios and CVE datasets.

Platforms tracked: OpenClaw · Claude Cowork · ChatGPT · Kilocode

Consensus across 4 sources

Agent security results vary by domain: Claude Code Sonnet leads on coding-agent attack scenarios; a Microsoft multi-agent system leads the Mythos cybersecurity benchmark; and an OWASP-published benchmark reports 92.5% detection of memory-poisoning payloads.

All Sources

We aggregate published benchmarks; we never run our own tests and never pick winners. Each row links back to the original publication.

Source	Date	Finding	Methodology	Quality
Hacker News / GitHub Gist	2026-05-29	Claude Code Sonnet outperforms Codex GPT-5 on coding-agent security scenarios (+11 vs +4) across 20 scenarios and 8 attack classes; the CMD-INJ defence is specific to Claude.	20 scenarios across 8 attack classes; automated scoring on the AgentToolBench-Code harness	high
trent.ai blog	2026-05-29	Claude Code, Codex, Semgrep, CodeQL, and Trent compared on 28 real CVEs from CWE-Bench; full scoring unavailable at retrieval time (HTTP 500).	28 CWE-Bench CVEs; AI coding agents vs static analysis tools; automated CVE-finding scoring. URL returned HTTP 500 — partial data.	medium
GeekWire	2026-05-17	A Microsoft multi-agent system tops Anthropic's entry on the Mythos cybersecurity benchmark; full methodology unavailable (GeekWire returned 403 at retrieval).	Anthropic Mythos cybersecurity benchmark; multi-agent vs single-agent comparison. GeekWire returned 403 — details incomplete.	medium
Hacker News / OWASP	2026-06-01	AgentThreatBench detected 92.5% of 55 memory-poisoning attack payloads across 4 threat categories with 100% precision and 59µs median latency; prompt injection defence achieved 100% detection.	55 real-world attack payloads across 4 categories (prompt injection, key tampering, data leakage, size anomaly); automated Python benchmark script; open-source via OWASP	medium
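
The detection-rate and precision figures in the OWASP row are standard classification metrics. As a minimal sketch of how such figures are usually derived from raw counts (the class name, function names, and counts below are hypothetical and are not taken from the AgentThreatBench script):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Raw outcome counts from a detection benchmark run (hypothetical shape)."""
    true_positives: int   # attack payloads that were flagged
    false_negatives: int  # attack payloads that were missed
    false_positives: int  # benign inputs that were wrongly flagged


def detection_rate(counts: DetectionCounts) -> float:
    """Share of attack payloads that were flagged (also called recall)."""
    total_attacks = counts.true_positives + counts.false_negatives
    return counts.true_positives / total_attacks if total_attacks else 0.0


def precision(counts: DetectionCounts) -> float:
    """Share of flagged inputs that really were attacks."""
    total_flagged = counts.true_positives + counts.false_positives
    return counts.true_positives / total_flagged if total_flagged else 0.0


# Illustrative counts only; these are NOT the published benchmark results.
example = DetectionCounts(true_positives=9, false_negatives=1, false_positives=0)

print(f"detection rate: {detection_rate(example):.1%}")
print(f"precision:      {precision(example):.1%}")
```

A detector can reach 100% precision while still missing payloads, which is why the row reports both numbers.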

How we work

OpenClawDatabase aggregates and links to published benchmarks. We don't run our own tests, and we don't pick winners. Our weekly benchmark-aggregator routine scans 7+ live leaderboards (OpenRouter, Aider, SWE-bench, GAIA, LMSYS, BigCodeBench, MMLU-Pro) plus relevant Reddit and Hacker News threads, then writes structured entries into /assets/benchmarks.json. Every row here links back to the original publication.
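
The schema of /assets/benchmarks.json is not documented on this page. As a rough sketch only, an entry could mirror the columns of the table above; the field names, the allowed quality values, and the helper functions below are all assumptions made for illustration:

```python
import json
from datetime import date
from pathlib import Path

# Assumed quality scale: the table shows only "high" and "medium"; "low" is a guess.
ALLOWED_QUALITY = {"high", "medium", "low"}


def make_entry(source: str, published: str, finding: str,
               methodology: str, quality: str) -> dict:
    """Build one entry whose keys mirror the table columns (hypothetical schema)."""
    date.fromisoformat(published)  # raises ValueError unless YYYY-MM-DD
    if quality not in ALLOWED_QUALITY:
        raise ValueError(f"unknown quality label: {quality!r}")
    return {
        "source": source,
        "date": published,
        "finding": finding,
        "methodology": methodology,
        "quality": quality,
    }


def append_entry(path: Path, entry: dict) -> list:
    """Append an entry to a JSON array file, creating the file if it is missing."""
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    entries.append(entry)
    path.write_text(json.dumps(entries, indent=2, ensure_ascii=False),
                    encoding="utf-8")
    return entries


# Example built from the trent.ai row of the table above.
entry = make_entry(
    source="trent.ai blog",
    published="2026-05-29",
    finding="Claude Code, Codex, Semgrep, CodeQL, and Trent compared on "
            "28 real CVEs from CWE-Bench; full scoring unavailable.",
    methodology="28 CWE-Bench CVEs; AI coding agents vs static analysis tools.",
    quality="medium",
)
print(json.dumps(entry, indent=2))
```

Validating the date and quality label when the entry is built keeps malformed rows out of the file, so later readers of the JSON do not need to re-check them.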
