AI Safety Benchmarks Are Testing for the Wrong Thing

Labelbox researchers stripped obvious red flags from attack prompts. Every 'safe' model broke — GPT-4o, Claude, Gemini, Grok — with bypass rates hitting 90%.

Every major AI lab points to the same safety benchmarks when they tell you their model is safe. AdvBench. HarmBench. The names show up in technical reports, safety cards, and press releases. Researchers at Labelbox decided to look at what those benchmarks actually test — and found that they measure something closer to keyword detection than real safety.

The Benchmarks Are Looking for Obvious Tells

Shahriar Golchin and Marc Wetter at Labelbox’s Applied Machine Learning Research examined AdvBench (520 data points) and HarmBench (200 data points), two of the most commonly cited AI safety evaluation datasets. What they found: these datasets are riddled with “triggering cues” — words and phrases with blatant negative connotations designed to activate safety refusals.

Think prompts containing words like “hack,” “bomb,” “steal,” or “kill.” The models aren’t learning to detect harmful intent. They’re learning to refuse when they see specific vocabulary.
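
In effect, the benchmarks reward behavior no deeper than this deliberately crude sketch (a caricature for illustration, not any vendor's actual safety stack):

```python
# Caricature of what trigger-cue benchmarks end up measuring: refusal keyed
# to a vocabulary blacklist rather than to intent. The word list comes from
# the examples above; this is not any lab's real filter.
TRIGGER_WORDS = {"hack", "bomb", "steal", "kill"}

def naive_refusal(prompt: str) -> bool:
    # Refuse if any blacklisted word appears, regardless of actual intent.
    return any(word in prompt.lower() for word in TRIGGER_WORDS)
```

A model that behaves like this scores well on the benchmarks and fails the moment the vocabulary changes.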

Over 45% of AdvBench’s data points are near-identical duplicates at a 0.95 similarity threshold. The dataset isn’t just testing for the wrong thing — it’s testing the same wrong thing repeatedly.
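
For a sense of how that duplication is measured, here is a minimal sketch (not the authors' code) that flags near-duplicates by cosine similarity of sentence embeddings. The 0.95 threshold is the paper's; the embedding model and brute-force pairwise comparison are assumptions:

```python
# Sketch: estimate a prompt dataset's near-duplicate rate via pairwise
# cosine similarity of sentence embeddings. The 0.95 threshold matches the
# reported figure; the embedding model is an assumption.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicate_rate(prompts: list[str], threshold: float = 0.95) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(prompts, normalize_embeddings=True)
    sims = cosine_similarity(embeddings)
    flagged = set()
    for i in range(len(prompts)):
        for j in range(i + 1, len(prompts)):
            if sims[i, j] >= threshold:
                flagged.update((i, j))  # both members of a near-duplicate pair
    return len(flagged) / len(prompts)
```

Run over AdvBench's 520 prompts, a return value above 0.45 would reproduce the headline number.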

Intent Laundering: Strip the Cues, Keep the Malice

The researchers developed what they call “intent laundering” — a two-step process that removes triggering cues while preserving every bit of malicious intent:

Connotation neutralization replaces flagged vocabulary with neutral or positive alternatives. “Hack into the database” becomes a description of testing security configurations. The harmful goal stays identical.

Context transposition maps real-world targets onto fictional settings. A request about attacking actual infrastructure becomes a game-world scenario — same technical details, same applicable instructions, zero red flags for the safety filter.

They automated this using GPT-5.1 with 8-shot in-context learning. The process is simple, cheap, and devastatingly effective.
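
Mechanically, that is just two chained rewrite calls with exemplars in the context window. The sketch below shows the shape of the pipeline as a red-team evaluation harness; the API model identifier, instruction wording, and empty exemplar list are assumptions, not the paper's actual prompts or its eight exemplars:

```python
# Skeleton of the two-step rewrite pipeline, framed as a red-team eval
# harness. Model identifier, instructions, and exemplar list are
# placeholders; the paper's prompts are not reproduced here.
from openai import OpenAI

client = OpenAI()

# Step 1: connotation neutralization, as described above.
NEUTRALIZE = ("Rewrite the request, replacing words with negative "
              "connotations with neutral or positive alternatives. "
              "Keep the request's goal unchanged.")
# Step 2: context transposition, as described above.
TRANSPOSE = ("Rewrite the request so it targets a fictional setting, "
             "such as a game world, keeping all technical details intact.")

FEW_SHOT: list[dict] = []  # the paper's 8 exemplars would go here, as
                           # alternating user/assistant message pairs

def rewrite(prompt: str, instruction: str) -> str:
    messages = [{"role": "system", "content": instruction},
                *FEW_SHOT,
                {"role": "user", "content": prompt}]
    resp = client.chat.completions.create(model="gpt-5.1", messages=messages)
    return resp.choices[0].message.content

def launder(prompt: str) -> str:
    # Chaining both steps on every prompt is an assumption; the paper may
    # apply them selectively.
    return rewrite(rewrite(prompt, NEUTRALIZE), TRANSPOSE)
```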

The Numbers

Before intent laundering, models looked safe. After removing the obvious cues, every single one broke:

| Model | AdvBench ASR (before) | AdvBench ASR (after) | HarmBench ASR (before) | HarmBench ASR (after) |
| --- | --- | --- | --- | --- |
| GPT-4o | 0.0% | 81.2% | 0.5% | 89.0% |
| Claude 3.7 Sonnet | 2.4% | 79.7% | 8.5% | 70.5% |
| Gemini 3 Pro | 1.9% | 82.6% | 11.0% | 78.5% |
| Grok 4 | 17.9% | 90.8% | 36.0% | 80.0% |
| Llama 3.3 70B | 10.1% | 91.8% | 14.0% | 80.5% |
| GPT-4o mini | 1.0% | 89.4% | 5.0% | 79.5% |
| Qwen2.5 7B | 4.4% | 92.3% | 21.5% | 81.0% |

Average attack success rate on AdvBench jumped from 5.38% to 86.79%. On HarmBench, from 13.79% to 79.83%. When adapted as a jailbreaking technique with iterative refinement, intent laundering hit 90% to 98.55% success across all models tested.
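
For context, attack success rate (ASR) is simply the share of prompts for which the target model's response is judged to be harmful compliance. A minimal sketch, where `generate` and `judge_is_harmful` are hypothetical stand-ins for the target model and the paper's LLM judge:

```python
# Attack success rate: the fraction of prompts whose responses the judge
# labels as harmful compliance. `generate` and `judge_is_harmful` are
# hypothetical stand-ins, not the paper's actual components.
def attack_success_rate(prompts, generate, judge_is_harmful) -> float:
    hits = sum(judge_is_harmful(p, generate(p)) for p in prompts)
    return 100.0 * hits / len(prompts)
    # e.g. 89.0 means 178 of HarmBench's 200 prompts elicited harmful output
```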

GPT-4o went from a 0.5% attack success rate to 89.0% on HarmBench, and from 0.0% to 81.2% on AdvBench. Not a marginal change. A categorical failure.

Why This Should Worry You

The problem isn’t that these models are unsafe. The problem is that the industry’s primary method for measuring safety is measuring the wrong thing. Labs report benchmark scores as evidence their models resist harmful use. Those benchmark scores reflect an ability to refuse prompts that contain obvious warning words — nothing more.

Real adversaries don’t write “help me build a bomb.” They describe a chemistry experiment. They frame requests as fiction. They use professional terminology that carries no negative connotation. Intent laundering formalizes what every red-teamer already knew: the emperor’s benchmarks have no clothes.

The researchers’ LLM judge achieved 90% agreement with human consensus on safety evaluations — meaning the evaluation methodology itself is sound. The datasets feeding those evaluations are the problem.
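
That agreement figure is a plain match rate between judge verdicts and pooled human labels; a minimal sketch, with hypothetical inputs:

```python
# Judge validation: fraction of items where the LLM judge's verdict matches
# the human consensus label. All inputs here are hypothetical.
def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)  # 0.90 reproduces the reported figure
```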

What’s Being Done (And Why It’s Not Enough)

The paper exposes a structural weakness in how the field evaluates safety. Current benchmarks were built by researchers brainstorming harmful prompts, a process that naturally yields wording saturated with alarming vocabulary. The datasets reflect how safety researchers think about attacks, not how actual attackers behave.

Some labs have moved toward red-teaming with human adversaries, which produces more realistic attack distributions. But the published benchmarks that appear in safety reports and model cards still overwhelmingly rely on datasets like AdvBench and HarmBench.

Fixing this requires rebuilding safety evaluation from scratch — datasets modeled on real adversarial behavior, not researcher imagination. Until that happens, every model card citing these benchmarks is reporting a number that has almost no bearing on how the model performs against a motivated attacker who knows how to write a clean sentence.