AI Agent Security
Agents are powerful because they read wide and act autonomously. That combination is also the root of every real security risk. This is a practical, balanced guide — not fear-mongering, not vendor boosterism. Eight deep-dive topics, a 6-platform posture comparison, and a 15-minute hardening checklist you can actually complete.
Platform security posture at a glance
Rough posture rating based on default-deny vs. default-allow, sandbox enforcement, and managed-vs-self-hosted trade-offs. "Medium" is not bad — it means you need to do the work; the defaults won't save you.
| Platform | Posture | Security model |
|---|---|---|
| OpenClaw | 🟡 Medium | Self-hosted. You own the sandbox boundary. Default-allow on skills unless you configure otherwise. |
| NemoClaw | 🟡 Medium | Self-hosted like OpenClaw, but with a policy layer (YAML rules) that gates every tool call. |
| IronClaw | 🟢 Strong | Sandboxed-by-default. Every skill runs in an isolated process with a manifest-declared capability set. |
| Hermes | 🟡 Medium | Managed cloud service. The vendor handles infrastructure; you configure scopes via OAuth. |
| Claude Cowork | 🟢 Strong | Anthropic-managed. Projects are isolated; system prompts and files stay within your workspace. |
| ChatGPT | 🟡 Medium | OpenAI-managed. Custom GPTs run on OpenAI's infrastructure; their Actions call the third-party APIs you configure. |
Deep-dive topics
Prompt Injection — the #1 agent vulnerability
Malicious content embedded in web pages, emails, or documents tricks your agent into executing attacker instructions. How to recognize it and design around it.
🔴 Critical · Applies to 7 platforms
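One common way to design around injection is to track provenance. Content the agent fetched from the outside world is fenced as data, and once any of it is in context, side-effecting tool calls need a human. This is a minimal sketch with hypothetical tool names and a simple tag-based fence. It is one mitigation pattern, not a complete defense: models can still follow instructions that sit inside fenced text.

```python
from dataclasses import dataclass, field

# Hypothetical tool classification; replace with your platform's tool names.
READ_ONLY_TOOLS = {"search_docs", "read_file"}
IRREVERSIBLE_TOOLS = {"send_email", "git_push", "delete_file", "make_payment"}


@dataclass
class Context:
    """Tracks whether untrusted content has entered the agent's context window."""
    tainted: bool = False
    sources: list[str] = field(default_factory=list)

    def add_content(self, text: str, source: str, trusted: bool) -> str:
        if trusted:
            return text
        self.tainted = True
        self.sources.append(source)
        # Neutralize a forged closing tag so injected text cannot "break out" of the fence.
        text = text.replace("</untrusted>", "&lt;/untrusted&gt;")
        # Fencing marks the text as data. It reduces, but does not eliminate, injection risk.
        return f"<untrusted source={source!r}>\n{text}\n</untrusted>"


def requires_confirmation(ctx: Context, tool_name: str) -> bool:
    """Decide whether a proposed tool call must wait for a human."""
    if tool_name in IRREVERSIBLE_TOOLS:
        return True  # gated whether or not the context is tainted
    if ctx.tainted and tool_name not in READ_ONLY_TOOLS:
        return True  # after untrusted input, any side effect needs approval
    return False
```

The design choice is to escalate on taint rather than try to detect malicious text: you cannot reliably spot every injection, but you can know which content came from outside.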
Skill & Tool Allowlisting — default-deny is not optional
Skills (or tools, MCP servers, Actions) are the agent's hands. Controlling which skills are available — and for which projects — is the single highest-impact security control.
🟠 High · Applies to 4 platforms
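A default-deny allowlist can be as small as a lookup that returns False for anything not explicitly listed. The project names, skill names, and `ALLOWLIST` structure below are hypothetical; platforms with a policy layer express similar rules in their own format, which this sketch does not attempt to reproduce.

```python
# Hypothetical per-project allowlist. Anything not listed is denied.
ALLOWLIST: dict[str, frozenset[str]] = {
    "blog-drafts": frozenset({"web_search", "read_file"}),
    "inbox-triage": frozenset({"gmail_read", "gmail_label"}),
}


def validate_allowlist(allowlist: dict[str, frozenset[str]]) -> None:
    """Reject wildcard grants, which amount to enabling a skill globally."""
    for project, skills in allowlist.items():
        if "*" in skills:
            raise ValueError(f"{project}: wildcard skill grants are not allowed")


def is_skill_allowed(project: str, skill: str) -> bool:
    """Default-deny: unknown projects and unlisted skills are both refused."""
    return skill in ALLOWLIST.get(project, frozenset())


def filter_tools(project: str, requested: list[str]) -> list[str]:
    """Return only the tools this project may expose to the model."""
    return [t for t in requested if is_skill_allowed(project, t)]


validate_allowlist(ALLOWLIST)  # fail at startup, not at the first tool call
```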
Secrets & Credentials — never in prompts, never in memory
API keys, OAuth tokens, passwords. Where they live, how they leak, and how to rotate them when (not if) they do.
🟠 High · Applies to 7 platforms
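The pattern is to read secrets from the process environment at call time and to check outbound text for raw secret values before it reaches a prompt, a memory file, or a log. This sketch assumes your runtime has already loaded `.env` into the environment (for example with python-dotenv's `load_dotenv()`); the variable name `DEMO_API_KEY` used in testing is hypothetical.

```python
import os


def get_secret(name: str) -> str:
    """Read a secret from the process environment; fail loudly if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing secret {name}; set it in .env, not in a prompt")
    return value


def redact(text: str, secret_names: list[str]) -> str:
    """Replace any loaded secret values with a placeholder."""
    for name in secret_names:
        value = os.environ.get(name)
        if value:  # skip unset or empty values so "" is never replaced
            text = text.replace(value, f"[REDACTED:{name}]")
    return text


def assert_no_secrets(prompt: str, secret_names: list[str]) -> None:
    """Refuse to send a prompt (or write memory) that contains a raw secret."""
    if redact(prompt, secret_names) != prompt:
        raise ValueError("prompt contains a secret value; refusing to send")
```

Exact-match redaction only catches secrets that appear verbatim; encoded or partial copies slip through, which is why rotation still matters.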
Sandboxing — contain the blast radius
Assume the agent will eventually do something wrong. Sandboxing is how you make that a small mistake instead of a catastrophic one.
🟠 High · Applies to 3 platforms
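Even without a container, a few cheap limits shrink the blast radius of an agent-issued shell command: a throwaway working directory, an environment that does not inherit your API keys, and a hard timeout. This is a minimal sketch assuming a POSIX system; `run_untrusted` and `SAFE_ENV` are hypothetical names, and the result is explicitly not a real sandbox.

```python
import subprocess
import tempfile

# Minimal environment for the child process; POSIX paths assumed.
SAFE_ENV = {"PATH": "/usr/bin:/bin"}


def run_untrusted(cmd: list[str], timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run an agent-issued command with a few cheap blast-radius limits.

    NOT a real sandbox: the child can still read files the current user can
    read and can open network connections. Use a container, VM, or OS-level
    sandbox for actual isolation.
    """
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            cmd,
            cwd=scratch,          # relative writes land in a throwaway directory
            env=SAFE_ENV,         # API keys in the parent's env vars are not inherited
            capture_output=True,  # output is returned for review, not streamed
            text=True,
            timeout=timeout_s,    # runaway commands are killed; TimeoutExpired is raised
            check=False,
        )
```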
MCP Server Supply Chain — the new npm attack surface
MCP servers are the agent equivalent of npm packages. Same trust problem, new ecosystem, much less mature tooling.
🟠 High · Applies to 4 platforms
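Until MCP tooling matures, you can borrow npm's lockfile discipline: pin each server to a version you reviewed and refuse to launch anything whose bytes have changed since. The `PINNED_SERVERS` lockfile, the server name, and the artifact path are hypothetical; how you obtain the artifact depends on how each server is distributed.

```python
import hashlib
from pathlib import Path

# Hypothetical lockfile: server name -> (pinned version, SHA-256 of the reviewed artifact).
PINNED_SERVERS: dict[str, tuple[str, str]] = {
    "example-files-mcp": ("1.4.2", "<sha256 recorded when you reviewed it>"),
}


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts are not loaded into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_server(name: str, version: str, artifact: Path) -> None:
    """Refuse to launch an MCP server that is unpinned or has changed since review."""
    if name not in PINNED_SERVERS:
        raise PermissionError(f"{name} is not in the reviewed allowlist")
    pinned_version, pinned_hash = PINNED_SERVERS[name]
    if version != pinned_version:
        raise PermissionError(f"{name}: version {version} != pinned {pinned_version}")
    if sha256_of(artifact) != pinned_hash:
        raise PermissionError(f"{name}: artifact hash does not match the reviewed build")
```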
Email & Calendar Scopes — the read-write boundary matters
Giving an agent access to email is the fastest way to unlock high-value use cases — and the fastest way to cause a catastrophe. Scope discipline is the whole game.
🟠 High · Applies to 4 platforms
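Scope discipline can be enforced in code before an OAuth consent request is ever built: compute the smallest scope set for what the agent actually needs, and make send/delete an explicit opt-in. The scope strings below are hypothetical placeholders. Real providers bundle permissions differently (a "compose" scope may also permit sending), so map them against your provider's scope documentation.

```python
# Hypothetical scope names for illustration only.
SCOPE_FOR_ACTION = {
    "read": "mail.read",
    "label": "mail.label",
    "draft": "mail.draft",
    "send": "mail.send",
    "delete": "mail.delete",
}
DANGEROUS = {"send", "delete"}


def minimal_scopes(actions: set[str], allow_dangerous: bool = False) -> set[str]:
    """Return the smallest scope set covering `actions`, read-only by default."""
    unknown = actions - set(SCOPE_FOR_ACTION)
    if unknown:
        raise ValueError(f"unknown actions: {sorted(unknown)}")
    risky = actions & DANGEROUS
    if risky and not allow_dangerous:
        raise PermissionError(f"{sorted(risky)} need an explicit opt-in; prefer draft/label")
    # Read access is the baseline every mail agent needs.
    return {"mail.read"} | {SCOPE_FOR_ACTION[a] for a in actions}
```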
Incident Response — what to do when the agent goes wrong
Playbook for the inevitable day your agent does something it shouldn't. Speed matters — the first hour is everything.
🔴 Critical · Applies to 7 platforms
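Two things make the first hour tractable: a kill switch the agent loop checks before every tool call, and an append-only log of what the agent did. The file names and the JSON Lines format below are hypothetical choices for illustration; revoking credentials still has to happen at each provider.

```python
import json
import time
from pathlib import Path

# Hypothetical file locations for illustration.
AUDIT_LOG = Path("agent_audit.jsonl")
KILL_SWITCH = Path("AGENT_DISABLED")


def log_action(tool: str, args: dict, log: Path = AUDIT_LOG) -> None:
    """Append one JSON line per tool call, written before the call executes."""
    entry = {"ts": time.time(), "tool": tool, "args": args}
    with log.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def agent_enabled(kill_switch: Path = KILL_SWITCH) -> bool:
    """The agent loop checks this before every tool call; creating the file stops it."""
    return not kill_switch.exists()


def actions_since(seconds: float, log: Path = AUDIT_LOG) -> list[dict]:
    """Reconstruct what the agent did recently, oldest first."""
    if not log.exists():
        return []
    cutoff = time.time() - seconds
    with log.open(encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return [e for e in entries if e["ts"] >= cutoff]
```

In an incident, `KILL_SWITCH.touch()` halts the loop, and `actions_since(3600)` gives the last hour of tool calls to check against what you then revoke and clean up.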
The Agent Security Checklist
The 15-minute hardening pass you should do for every new agent setup. Print it, work through it, sign off.
ℹ️ Baseline · Applies to 7 platforms
The non-negotiables
If you skip everything else, do these four:
- Default-deny on skills. Never enable a skill globally. Scope per project.
- Draft-only for irreversible actions. Email send, git push, file delete, payments. Always a human confirmation gate.
- Secrets in .env, never in prompts. SOUL.md, CLAUDE.md, and system prompts get sent to the model on every turn.
- Read-only OAuth scopes by default. Grant write access only for the specific action that needs it, and prefer draft/label over send/delete.
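The draft-only rule reduces to one wrapper around tool execution: irreversible calls go through a confirmation callback, and a refusal leaves the action held as a draft instead of failing silently. The tool names and the `execute`/`confirm` callables are hypothetical hooks into whatever agent runtime you use.

```python
from typing import Callable

# Hypothetical tool names; irreversible ones never run without a human "yes".
IRREVERSIBLE = {"send_email", "git_push", "delete_file", "make_payment"}


def gated_call(
    tool: str,
    args: dict,
    execute: Callable[[str, dict], object],
    confirm: Callable[[str, dict], bool],
) -> object:
    """Run `tool` only if it is reversible or a human explicitly approves it."""
    if tool in IRREVERSIBLE and not confirm(tool, args):
        # Nothing was executed; the proposed action is returned for later review.
        return {"status": "held_as_draft", "tool": tool, "args": args}
    return execute(tool, args)
```

In a terminal, `confirm` could be `lambda t, a: input(f"Allow {t}? [y/N] ").strip().lower() == "y"`; in a chat UI it would be an approval button. Defaulting to "no" means a missed prompt never sends anything.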
Need the shortest possible version? Go to the 15-minute checklist. Building something new? Start with prompt injection — it's the attack class every agent is exposed to.