Your AI Agent's Memory, Identity, and Skills Are All Attack Vectors

OpenClaw has full access to your filesystem, your email, your payment processor. It reads a MEMORY.md file to recall your preferences, a SOUL.md file to know who it’s supposed to be, and a collection of skill scripts to know what it can do. If an attacker poisons any one of those files, they own your agent.

That’s the central finding of “Your Agent, Their Asset,” a paper published April 6 by researchers Zijun Wang, Haoqin Tu, and colleagues. They ran 12 attack scenarios against a live OpenClaw instance using four frontier models — Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 — and the results are damning.

The CIK Taxonomy: Three Ways to Compromise an Agent

The researchers introduce the CIK framework, which maps an AI agent’s persistent state into three attack surfaces:

Capability — the executable skills and tool documentation the agent uses. Poisoning a skill script embeds hidden payloads that execute outside the LLM’s reasoning loop entirely. The model never sees the malicious code because it runs in a subprocess.

Identity — the agent’s persona, values, and behavioral rules. Inject a trust anchor into the owner profile, and the agent starts treating attacker-controlled destinations as authorized backup locations. It doesn’t question the redirect because its own identity file says the destination is legitimate.

Knowledge — the agent’s long-term memory and learned facts. Fabricate a refund history in the memory file, and the agent treats unauthorized batch transactions as routine. It has “memories” of doing this before.

The baseline attack success rate across all four models was 24.6%. Poison any single CIK dimension, and that number jumps to between 64% and 74%.

The Numbers, Model by Model

The results split along predictable lines — and one surprise.

Sonnet 4.5 was the most vulnerable when poisoned. Its baseline attack success rate was 26.7%, but knowledge poisoning drove it to 89.2%. Identity attacks hit 85.4%. Capability attacks reached 88.5%. A compliant, helpful model becomes a compliant, helpful attack vector.

Gemini 3.1 Pro showed a similar pattern: 36.7% baseline, rising to 83.3% under knowledge poisoning.

GPT-5.4 was more resistant to identity and capability attacks (63.1% and 57.7%) but still folded under knowledge poisoning at 80.8%.

Opus 4.6 performed best. Its baseline was 10.0%, and even under the worst-case knowledge poisoning, attack success reached 44.2%. Still nearly half the time — but meaningfully better than the competition. The researchers attribute this to stronger instruction-following that makes it harder to override with injected context.

Defenses That Create New Problems

The team tested several defensive strategies on Sonnet 4.5. The most effective — capability-level defense with file protection — reduced attacks from 92.6% to 16.8%. That sounds good until you see the other side of the ledger: legitimate file updates also dropped from 100% to 13.2%.

The researchers call this “a fundamental evolution–safety tradeoff.” Lock down the files that make an agent useful, and you also lock down the mechanism that lets it learn and adapt. An agent that can’t update its own memory isn’t much of an agent.

Context-level defenses worked better for knowledge and identity attacks, cutting success rates roughly in half. But capability attacks — where malicious code runs outside the model’s reasoning loop — bypassed these defenses almost entirely. The strongest defensive strategy still allowed a 63.8% success rate under capability-targeted attacks.

The Architectural Problem Nobody Wants to Admit

This paper arrives one week after the ClawSafety study showed that models passing safety benchmarks become exploitable as agents. The two papers tell the same story from different angles.

ClawSafety found the problem is deployment context: give a safe model agency, and it becomes unsafe. This paper finds the problem is persistent state: give an agent memory, identity, and skills, and each becomes an attack surface.

The researchers’ recommended mitigations — code signing, sandboxed execution, mandatory human approval for skill updates — are architectural changes, not model improvements. They require redesigning how agents work, not fine-tuning the models that power them.

“The same persistent files that enable evolution are also the attack surface,” the authors write. The features that make OpenClaw useful — personalization, memory, extensible skills — are structurally identical to the features that make it exploitable.

Why This Should Worry You

OpenClaw has over 346,000 GitHub stars. It’s the most widely deployed personal AI agent. Its architecture — persistent memory files, identity configuration, skill scripts — isn’t unique. It’s the template that every personal AI agent follows.

Every agent framework that stores memories, maintains an identity, and runs extensible tools faces this same tradeoff. The attack surface isn’t a bug in OpenClaw. It’s a property of the architecture that makes AI agents useful.

Making AI agents safe may require making them less capable. That’s a trade-off the industry has shown zero willingness to accept.