Newer AI Models Are Becoming Less Safe, Not More

A study of 82,000 harm ratings across eight model releases finds “alignment drift”: GPT-5 and Claude 4.5 are more vulnerable to adversarial attacks than their predecessors.

Each new model release is supposed to be safer than the last. The marketing materials say so. The safety reports document improvements. The benchmarks show progress.

A study of eight major multimodal models across two generations tells a different story: some of the industry’s flagship models are getting worse at resisting adversarial attacks, not better.

The Study

Researchers Casey Ford, Madison Van Doren, and Emily Dix ran the largest longitudinal evaluation of multimodal LLM safety to date. They tested GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus in Phase 1, then evaluated their successors - GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni - in Phase 2.

The methodology:

  • 726 adversarial prompts created by 26 professional red teamers
  • Both text-only and multimodal attack vectors
  • 82,256 human harm ratings
  • Consistent benchmarking across model generations

The question: are newer models better at refusing harmful requests?

The Results

The researchers tracked Attack Success Rate (ASR) - the percentage of adversarial prompts that successfully elicited harmful responses.
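The metric is simple to state. As an illustrative sketch (not the study's actual pipeline — the rating scale, threshold, and numbers below are hypothetical), ASR is the fraction of adversarial prompts rated as having elicited harm, and the drift the researchers describe is the generational change in that fraction:

```python
# Illustrative sketch, not the study's actual pipeline.
# Ratings, threshold, and example numbers are hypothetical.

def attack_success_rate(ratings, harm_threshold=1):
    """ASR = percentage of adversarial prompts whose response
    was rated harmful (rating >= harm_threshold)."""
    harmful = sum(1 for r in ratings if r >= harm_threshold)
    return 100.0 * harmful / len(ratings)

def alignment_drift(asr_old, asr_new):
    """Positive drift means the newer model is MORE vulnerable."""
    return asr_new - asr_old

# Hypothetical ratings: 0 = safe refusal, 1 = harmful response elicited
old_model = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]   # 20% ASR
new_model = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]   # 40% ASR

drift = alignment_drift(attack_success_rate(old_model),
                        attack_success_rate(new_model))
print(f"drift: {drift:+.1f} percentage points")  # drift: +20.0 percentage points
```

On this framing, a positive drift across a model family is exactly the "newer is less safe" result the study reports for the GPT and Claude lines.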

GPT models: ASR increased from GPT-4o to GPT-5. The newer model is more susceptible to adversarial attacks than its predecessor.

Claude models: ASR also increased from Sonnet 3.5 to Sonnet 4.5. Claude maintained the lowest absolute ASR through high refusal rates, but the trend moved in the wrong direction.

Pixtral and Qwen: Modest decreases in ASR - these models actually improved. But Pixtral remained the most vulnerable overall.

The pattern the researchers identified: “clear alignment drift” where more capable models showed degraded safety characteristics.

Why Newer Isn’t Safer

The most obvious explanation: capability-safety trade-off. More capable models are also more capable at following harmful instructions when jailbroken. The same improvements that make GPT-5 better at complex reasoning may make it better at generating harmful content once safety guardrails are bypassed.

But there’s a more uncomfortable possibility: safety training may be getting deprioritized relative to capability improvements. As competition intensifies and release cycles accelerate, the incentive structure favors shipping impressive capabilities over thorough safety hardening.

The researchers found that modality effects also shifted: text-only attacks were more effective in Phase 1, while Phase 2 showed model-specific vulnerability patterns. This suggests safety teams may be playing whack-a-mole - patching known attack vectors while new ones emerge faster than they can be addressed.

The Claude Paradox

Claude models appeared safest in both phases, primarily through aggressive refusal behavior. But here’s the tension: Claude’s ASR still increased from 3.5 to 4.5. And high refusal rates are not cost-free - they make the model less useful for legitimate applications.

The question facing Anthropic: is the strategy sustainable? If safety comes primarily through refusing borderline requests, competitive pressure may eventually force loosening those refusals. The underlying vulnerability to successful attacks is still there, masked by conservatism that may not survive commercial pressures.

What This Means

The standard industry narrative goes: each model release is an improvement. Capabilities get better. Safety gets better. Progress on all fronts.

This study complicates that narrative. Capabilities clearly improved between generations. Safety did not uniformly improve. In some cases, it regressed.

This matters because:

  1. Deployment decisions assume safety improvements - organizations upgrading to newer models expect better safety properties, not worse ones

  2. Regulatory frameworks reference model versions - if older models are actually safer against adversarial attacks, regulatory approaches based on newest-model-is-best assumptions may be backwards

  3. AI development investment may be net-negative - if capability improvements outpace safety improvements, more money spent on AI development could mean more net harm, not less

The Uncomfortable Question

If newer models are more vulnerable to adversarial attacks, what exactly is all that safety research accomplishing?

One possibility: safety teams are running hard to stay in place. Without their work, alignment drift would be even worse. The situation is bad, but it would be worse without intervention.

Another possibility: safety work is increasingly cosmetic. It generates impressive-looking publications and benchmarks while actual adversarial robustness degrades. The theater of safety rather than its substance.

The data doesn’t distinguish between these explanations. What it does show: the simple story of continuous safety improvement is not holding up to empirical scrutiny.

Model releases should include not just capability improvements and safety highlights, but honest longitudinal comparisons showing where safety actually improved, where it degraded, and where the tradeoffs lie. This study suggests we’re not getting that transparency.

The researchers call it alignment drift. A less charitable term: safety regression. Either way, it’s worth tracking as the next generation of models arrives.