The safety guardrails protecting the AI models used by hundreds of millions of people can be defeated with a single line of code. No gradient access required. No optimization. No elaborate prompt engineering. Just six words injected into an API call.
Researchers Asen Dotsinski and Panagiotis Eustratiadis introduced the technique — called “sockpuppeting” — in a January 2026 arxiv paper. Trend Micro’s TrendAI Research team then validated it at scale, testing 14 attack variants across 11 commercial and open-source models, publishing their findings on April 10. The results confirm that a foundational assumption of LLM safety — that models control their own first words — is wrong.
How It Works
Every major LLM API includes a feature called “assistant prefill.” It lets developers set the opening tokens of a model’s response, typically for formatting purposes: starting a response with JSON syntax, or a specific heading, or a numbered list.
Sockpuppeting exploits this feature by injecting an acceptance phrase — something like “Sure, here is how to do it:\n” — as the model’s first output. The model, trained to maintain self-consistency, continues the compliant response instead of triggering its safety refusal.
The technique works because safety training concentrates disproportionately in the model’s opening tokens. Research has repeatedly shown that the decision to refuse happens in the first few tokens of generation. Skip those tokens by pre-filling them with an agreeable prefix, and the safety training never activates.
It’s the AI equivalent of putting words in someone’s mouth — hence “sockpuppeting.”
The Damage Report
Trend Micro tested 420 probes per model across four strategic categories: basic acceptance prefixes, multi-turn persona setups, instruction overrides with prefix injection, and diverse formatting prefixes. The results:
| Model | Provider | Accepts Prefill | Attack Success Rate |
|---|---|---|---|
| Gemini 2.5 Flash | Google Vertex AI | Yes | 15.7% |
| Claude 4 Sonnet | Anthropic Vertex AI | Yes | 8.3% |
| Qwen3-32B | Self-hosted | Yes | 3.3% |
| Gemma 3 4B | Self-hosted | Yes | 3.1% |
| GPT-4o | Microsoft Azure | Yes | 1.4% |
| Qwen3-30B-Instruct | Self-hosted | Yes | 0.7% |
| GPT-4o-mini | Microsoft Azure | Yes | 0.5% |
| Claude 4.6 Opus | Anthropic Vertex AI | No | 0% |
| DeepSeek-R1 | AWS Bedrock | No | 0% |
| Llama-3.1-8B | AWS Bedrock | No | 0% |
Every model that accepted the prefill was at least partially vulnerable. No exceptions.
The original arxiv paper reported even higher success rates: 95% against Qwen-8B and 77% against Llama-3.1-8B in per-prompt comparisons — up to 80% higher than GCG, the previous state-of-the-art automated attack that requires gradient access and expensive computation.
Multi-turn persona variants proved most effective overall. These set up the model as an “unrestricted research assistant” across several messages, then injected the compliance prefix. Claude 4 Sonnet and GPT-4o were most susceptible to this strategy.
Real Exploitation, Not Theory
During testing, Gemini 2.5 Flash generated functional cross-site scripting (XSS) payloads. Models leaked complete system prompt configurations, including internal feature flags and logging settings, when sockpuppeting was combined with prompt leakage optimization techniques.
Google’s flagship model showed the highest vulnerability at 15.7% — meaning roughly one in six carefully crafted attacks succeeded. For an attacker running automated probes, those are workable odds.
The Fix Is Simple. Most Providers Haven’t Applied It.
Three providers have already blocked the attack at the API layer: OpenAI, AWS Bedrock, and Anthropic (for Claude 4.6 Opus). The fix is straightforward — reject any API request where the final message role is “assistant” rather than “user.” It eliminates the entire attack surface with zero impact on legitimate usage.
But Google Vertex AI still accepts prefills for Gemini 2.5 Flash. Anthropic’s own Vertex AI deployment of Claude 4 Sonnet still accepts them. And every self-hosted model running on Ollama, vLLM, or TGI is vulnerable by default, because open-source inference servers don’t enforce message ordering.
The instruct-tuned versus base-model gap matters too. Qwen3-32B base showed nearly five times the vulnerability of the Qwen3-30B-Instruct variant, suggesting that instruction tuning provides some resilience — but not immunity.
Why This Should Worry You
Sockpuppeting isn’t sophisticated. That’s the point. GCG requires gradient access and GPU time. DAN prompts require creative social engineering. Sockpuppeting requires knowing that an API parameter exists and typing six words.
The attack exploits a legitimate developer feature that exists across nearly every model provider. It’s been public since January. Trend Micro mapped it to both the OWASP Top 10 for LLM Applications and the MITRE ATLAS framework. And the majority of deployments remain unpatched.
The broader problem is architectural. LLM safety training is a thin veneer applied after the fact. It teaches the model to refuse in its first few tokens, then trusts the model to follow through. Sockpuppeting demonstrates that if you control those first tokens — through a feature the API gives you for free — the safety training is irrelevant.
Every new model release touts improved safety benchmarks. But those benchmarks test the model’s own refusal behavior. They don’t test what happens when someone else controls the refusal.
The fix at the API layer is trivial. That three major providers took months to implement it — and several still haven’t — tells you everything about how seriously the industry takes security when it isn’t a marketing bullet point.