Push an AI Hard Enough and Its Ethics Collapse

Safety benchmarks test whether an AI will refuse a dangerous request. They almost never test what happens when someone keeps pushing.

A team from Polytechnique Montréal’s SWAT Laboratory just did exactly that. Their paper, “Adversarial Moral Stress Testing of Large Language Models,” introduces a framework that applies sustained adversarial pressure across multiple conversation rounds — and watches the ethics crumble.

What They Actually Tested

The researchers built a structured stress-testing framework called AMST that hits LLMs with five categories of adversarial pressure: time urgency, emotional distress, moral uncertainty, deception, and conflicts of interest. Unlike standard red-teaming, which fires single-shot prompts, AMST compounds these stressors across multi-turn conversations — mimicking how a real bad actor would wear down defenses.

They tested three models: Meta’s LLaMA-3-8B, OpenAI’s GPT-4o, and DeepSeek-v3.

The Results Are Not Reassuring

Every model degraded under sustained pressure. The researchers measured what they call “ethical drift” — the progressive erosion of alignment behavior as adversarial rounds accumulate. They found that robustness decay follows an approximately linear relationship, getting worse with each round of interaction.

DeepSeek-v3 performed worst, exhibiting the steepest degradation slope and the highest vulnerability to compounded stressors. GPT-4o landed in the middle with moderate stability. LLaMA-3-8B, the smallest model tested, actually showed the most resilience — the lowest mean robustness degradation and the strongest recovery capacity.

That last finding should unsettle anyone who assumes bigger models are safer models.

The researchers identified critical stability thresholds near robustness values of 0.4 and 0.7. Below these points, models undergo what the paper calls “stability transitions” — not gradual decline, but phase changes where alignment behavior breaks down more rapidly. Think of it as the point where an overstressed beam doesn’t just bend — it snaps.

Why This Should Worry You

Standard safety evaluations are like testing a bridge with a single car. AMST is like running rush-hour traffic across it for a week. The problem is that every major LLM deployment — customer service bots, coding assistants, healthcare tools — involves sustained multi-turn interactions. If an adversary knows the pressure points, they can methodically erode a model’s ethical guardrails across a conversation.

The paper demonstrates that “ethical robustness is not uniformly distributed across models but is instead affected by architectural differences.” In plain English: some models are structurally more vulnerable to being worn down, and we have no standard way of knowing which ones.

Current safety benchmarks, which rely on single-round evaluations and aggregate metrics, are blind to this. They can tell you a model will refuse to make a bomb. They cannot tell you what happens after twenty rounds of increasingly creative pressure.

What’s Being Done (And Why It’s Not Enough)

The AMST framework itself is a step forward — it’s the first structured methodology for quantifying cumulative ethical degradation. But it was tested on only three models. It evaluated black-box behavior without access to internal representations. And it identifies the problem without offering a fix.

The AI industry’s current approach to safety is snapshot-based: test once, deploy, and assume the guardrails hold. This research shows that alignment is not a fixed property but a dynamic one, subject to erosion under the right conditions. Until safety evaluations account for sustained adversarial pressure, we’re certifying bridges based on their performance with no traffic.