Death by a Thousand Prompts: Open-Weight Models Are Sitting Ducks

Want to jailbreak an AI model?

Don’t try to do it in one prompt. That fails 87% of the time.

Instead, have a conversation. Be patient. Keep talking.

According to Cisco’s AI Defense team, multi-turn jailbreak attacks against open-weight AI models succeed 92.78% of the time. Not in lab conditions with custom exploits. In black-box testing against production models with no prior knowledge of their safety configurations.

The industry calls this “death by a thousand prompts.” The research calls it a systemic vulnerability. I call it what happens when you build safety systems that can’t remember what they said five messages ago.

The Experiment

Cisco tested eight major open-weight models: Alibaba’s Qwen3-32B, DeepSeek v3.1, Google’s Gemma-3-1B-IT, Meta’s Llama 3.3-70B-Instruct, Microsoft’s Phi-4, Mistral Large-2, OpenAI’s GPT-OSS-20b, and Zhipu AI’s GLM 4.5-Air.

They used five multi-turn attack strategies:

Information decomposition and reassembly - breaking harmful requests into innocent-looking pieces, then combining them (95% success against Mistral Large-2)
Contextual ambiguity - exploiting the model’s tendency to infer charitable interpretations (94.78% success)
Crescendo attacks - gradually escalating from benign to harmful requests (92.69% success)
Role-play and persona adoption - getting the model to inhabit characters with different values (92.44% success)
Refusal reframe - rephrasing rejected requests until they’re accepted (up to 89.15% success)

Every technique worked. The only question was which worked best against which model.

The Numbers

Single-turn attacks - one prompt, one attempt - achieved success rates between 13% and 25% depending on the model. Annoying, but manageable. Most attacks fail.

Multi-turn attacks achieved success rates between 25.86% and 92.78%. A 2x to 10x increase just by having a conversation.

The model-by-model breakdown:

Mistral Large-2: 92.78% multi-turn success
Meta Llama 3.3-70B-Instruct: 70%+ gap between single and multi-turn defenses
Google Gemma-3-1B-IT: 25.86% multi-turn success (the “safe” one)

Meta’s Llama showed the largest vulnerability increase between single and multi-turn testing. Cisco attributes this to “capability-first design philosophy” - Llama was optimized to be useful, with safety as an afterthought. Gemma, which prioritizes alignment more centrally, held up better. But “better” here means failing only 26% of the time instead of 93%.

Why This Happens

Multi-turn attacks work because safety guardrails don’t maintain state across conversations.

The model rejects a harmful request in message 1. In message 3, after two innocuous exchanges, it’s forgotten why that request was harmful. In message 7, after the attacker has established a fictional scenario where the harmful action is “justified,” the model complies.

This isn’t a bug in any specific model. It’s a property of how language models process conversation. Each response is generated based on the conversation history, but the safety layer often evaluates individual messages rather than cumulative intent.

Attackers exploit this by never triggering the safety filter at any single step. Each message is innocent. The trajectory is not.

The Open-Weight Problem

These are all open-weight models. The weights are public. Anyone can download them, host them, modify them.

This matters because open-weight deployment means no one is monitoring for attack patterns. No one is updating guardrails in real-time. No one is logging which conversations led to harmful outputs.

When OpenAI or Anthropic’s models get jailbroken, they can patch the vulnerability. When an open-weight model running on someone’s local GPU gets jailbroken, nobody knows. The model just keeps generating whatever the attacker wanted.

Cisco’s research tested unmodified, off-the-shelf models. In practice, many open-weight deployments strip out safety training entirely - it’s easy to do and improves performance on benchmarks. Against those deployments, multi-turn attacks aren’t necessary. Single-turn works fine.

What This Means

The open-source AI community celebrates accessibility. Anyone can run these models. Anyone can build on them. Anyone can deploy them for any purpose.

The Cisco research shows the other side: anyone can break them. Consistently. With publicly documented techniques. In extended conversations that look like normal usage until they suddenly don’t.

The 93% success rate isn’t against toy models. It’s against Llama 3.3-70B - the largest model Meta has open-sourced. Against Mistral Large-2 - one of the most capable open European models. Against DeepSeek v3.1 - China’s frontier open-weight release.

These are the models people actually deploy. They’re in production applications right now. And they can be talked into doing almost anything if you’re patient enough to keep the conversation going.

The Uncomfortable Conclusion

Safety in AI has been framed as a development problem. Build better guardrails. Train on more safety data. Add constitutional AI. Make the model refuse harder.

Cisco’s research suggests it might be a conversational problem. Safety systems that evaluate individual messages can’t defend against attacks that distribute harm across conversations. The attacker just needs to keep talking until the safety window slides past the original refusal.

There are mitigations. Track conversation-level intent. Flag conversations that resemble known attack patterns. Implement hard blocks on certain topics regardless of context.

But those mitigations require infrastructure. Logging. Monitoring. Updating. They require someone to be watching.

Open-weight models, by design, have no one watching. That’s the point. That’s also the problem.

The thousand prompts keep coming. And the models keep dying.