Last updated: 2026-04-18

AI Agent Security

Agents are powerful because they read widely and act autonomously. That combination is also the root of every real security risk. This is a practical, balanced guide: not fear-mongering, not vendor boosterism. It covers eight deep-dive topics, a 6-platform posture comparison, and a 15-minute hardening checklist you can actually complete.

Platform security posture at a glance

Rough posture rating based on default-deny vs. default-allow, sandbox enforcement, and managed-vs-self-hosted trade-offs. "Medium" is not bad — it means you need to do the work; the defaults won't save you.

Platform | Posture | Security model
OpenClaw | 🟡 Medium | Self-hosted. You own the sandbox boundary. Default-allow on skills unless you configure otherwise.
NemoClaw | 🟡 Medium | Self-hosted like OpenClaw, but with a policy layer (YAML rules) that gates every tool call.
IronClaw | 🟢 Strong | Sandboxed-by-default. Every skill runs in an isolated process with a manifest-declared capability set.
Hermes | 🟡 Medium | Managed cloud service. Anthropic (or the vendor) handles infrastructure; you configure scopes via OAuth.
Claude Cowork | 🟢 Strong | Anthropic-managed. Projects are isolated; system prompts and files stay within your workspace.
ChatGPT | 🟡 Medium | OpenAI-managed. Custom GPTs and Actions run in OpenAI's infrastructure with API calls to third-party services you configure.

Deep-dive topics

Prompt Injection — the #1 agent vulnerability

Malicious content embedded in web pages, emails, or documents tricks your agent into executing attacker instructions. How to recognize it and design around it.

🔴 Critical · Applies to 7 platforms

Skill & Tool Allowlisting — default-deny is not optional

Skills (or tools, MCP servers, Actions) are the agent's hands. Controlling which skills are available — and for which projects — is the single highest-impact security control.
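As a rough illustration of what default-deny means in practice, here is a minimal Python sketch of a per-project allowlist. The `SkillPolicy` class, the project names, and the skill names are all hypothetical; none of the platforms above is known to expose this exact interface. The point is only the shape of the check: a skill is usable when it has been explicitly listed for that project, and for no other reason.

```python
from dataclasses import dataclass, field


@dataclass
class SkillPolicy:
    """Hypothetical per-project skill allowlist. Anything not listed is denied."""

    # Maps a project name to the set of skills that project may use.
    allowed: dict[str, set[str]] = field(default_factory=dict)

    def allow(self, project: str, skill: str) -> None:
        # Grants are always scoped to one project; there is no global grant.
        self.allowed.setdefault(project, set()).add(skill)

    def is_allowed(self, project: str, skill: str) -> bool:
        # Default-deny: unknown projects and unlisted skills both return False.
        return skill in self.allowed.get(project, set())


policy = SkillPolicy()
policy.allow("blog-drafts", "web_search")

print(policy.is_allowed("blog-drafts", "web_search"))  # True
print(policy.is_allowed("blog-drafts", "shell_exec"))  # False
print(policy.is_allowed("billing", "web_search"))      # False
```

Note that the sketch deliberately has no "allow everywhere" method. If a skill is needed in three projects, it is granted three times, which keeps each grant visible and reviewable.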

🟠 High · Applies to 4 platforms

Secrets & Credentials — never in prompts, never in memory

API keys, OAuth tokens, passwords. Where they live, how they leak, and how to rotate them when (not if) they do.
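One way to keep secrets out of prompts, sketched in Python under some assumptions: secrets are read from environment variables at the moment a tool needs them, and any text headed for the model is passed through a redaction step first. The function names and the three patterns are illustrative only. Real secret scanners use much larger rule sets, so treat this as a last-line safety net, not a guarantee.

```python
import os
import re

# Illustrative patterns only; a real scanner would cover many more formats.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),                    # "sk-" prefixed API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                         # AWS access key ID shape
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/-]{20,}=*"),    # bearer tokens
]


def get_secret(name: str) -> str:
    """Read a secret from the environment; fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing secret {name}; set it in the environment")
    return value


def redact(text: str) -> str:
    """Replace anything that looks like a credential before it reaches the model."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def build_prompt(user_text: str) -> str:
    # The prompt never contains get_secret() output; tools fetch secrets themselves.
    return "You are a helpful assistant.\n\n" + redact(user_text)
```

The design choice worth copying is the separation: the code that builds prompts never calls `get_secret`, and the code that calls `get_secret` never builds prompts.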

🟠 High · Applies to 7 platforms

Sandboxing — contain the blast radius

Assume the agent will eventually do something wrong. Sandboxing is how you make that a small mistake instead of a catastrophic one.

🟠 High · Applies to 3 platforms

MCP Server Supply Chain — the new npm attack surface

MCP servers are the agent equivalent of npm packages. Same trust problem, new ecosystem, much less mature tooling.
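One habit that carries over directly from package management is pinning by hash: record the SHA-256 of the exact server artifact you reviewed, and refuse to run anything that does not match. The Python sketch below assumes you have the artifact as a local file and a pinned digest stored somewhere you trust; the function names are hypothetical and this is not part of any MCP tooling.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, pinned_sha256: str) -> bool:
    """Return True only if the file matches the digest recorded at review time."""
    return hmac.compare_digest(sha256_of(path), pinned_sha256.lower())
```

A hash pin does not tell you the server is safe. It only tells you that what you are about to run is byte-for-byte what you reviewed, which is what makes a silent malicious update detectable.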

🟠 High · Applies to 4 platforms

Email & Calendar Scopes — the read-write boundary matters

Giving an agent access to email is the fastest way to unlock high-value use cases — and the fastest way to cause a catastrophe. Scope discipline is the whole game.
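To make scope discipline concrete, here is a small Python sketch that sorts a requested scope list into risk tiers before anything is granted. The scope names (`mail.read`, `mail.draft`, and so on) are invented for illustration and do not correspond to any provider's real OAuth scopes; map them to your provider's actual names before using the idea.

```python
# Hypothetical scope names, grouped by how much damage a mistake could cause.
READ_ONLY = {"mail.read", "calendar.read"}
LOW_RISK_WRITE = {"mail.draft", "mail.label"}
HIGH_RISK = {"mail.send", "mail.delete", "calendar.write"}


def review_scopes(requested: set[str]) -> dict[str, set[str]]:
    """Sort requested scopes into tiers; only read-only scopes pass by default."""
    known = READ_ONLY | LOW_RISK_WRITE | HIGH_RISK
    return {
        "granted": requested & READ_ONLY,
        "needs_justification": requested & LOW_RISK_WRITE,
        "needs_explicit_approval": requested & HIGH_RISK,
        # Anything unrecognized is denied until someone classifies it.
        "unknown_denied": requested - known,
    }


result = review_scopes({"mail.read", "mail.draft", "mail.send", "contacts.export"})
for tier, scopes in result.items():
    print(tier, sorted(scopes))
```

The unknown tier matters as much as the others: a scope nobody has classified is treated as denied, which keeps the review default-deny as providers add new scopes.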

🟠 High · Applies to 4 platforms

Incident Response — what to do when the agent goes wrong

Playbook for the inevitable day your agent does something it shouldn't. Speed matters — the first hour is everything.

🔴 Critical · Applies to 7 platforms

The Agent Security Checklist

The 15-minute hardening pass you should do for every new agent setup. Print it, work through it, sign off.

ℹ️ Baseline · Applies to 7 platforms

The non-negotiables

If you skip everything else, do these four:

  1. Default-deny on skills. Never enable a skill globally. Scope per project.
  2. Draft-only for irreversible actions. Email send, git push, file delete, payments. Always a human confirmation gate.
  3. Secrets in .env, never in prompts. SOUL.md, CLAUDE.md, and system prompts get sent to the model on every turn, so a secret stored there is exposed every time.
  4. Read-only OAuth scopes by default. Grant write access only for the specific action that needs it, and prefer draft/label over send/delete.
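The confirmation gate in item 2 can be sketched in a few lines of Python. This is a minimal illustration, assuming your agent runtime lets you wrap tool calls in your own function; `run_action`, the action names, and the `confirm` callback are hypothetical and not taken from any specific platform.

```python
from typing import Callable

# Hypothetical action names for the irreversible operations listed above.
IRREVERSIBLE = {"email_send", "git_push", "file_delete", "payment"}


def run_action(
    name: str,
    action: Callable[[], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run an action, but hold irreversible ones until a human approves."""
    if name in IRREVERSIBLE and not confirm(name):
        # Nothing has happened yet; the work stays a draft.
        return f"{name}: held as draft, awaiting human approval"
    return action()


# A reviewer who declines: the send never executes.
print(run_action("email_send", lambda: "sent", lambda name: False))
# A reversible action skips the gate entirely.
print(run_action("web_search", lambda: "results", lambda name: False))
```

In an interactive setup, `confirm` could be a function that prompts a person and returns True only on an explicit yes. The important property is that the agent cannot supply the confirmation itself.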

Need the shortest possible version? Go to the 15-minute checklist. Building something new? Start with prompt injection — it's the attack class every agent is exposed to.
