How do you know if an AI model can be jailbroken? You test it. How do you know if the test is any good? According to MLCommons, you probably don’t - because current safety benchmarks suffer from three systemic weaknesses that make them nearly useless for serious evaluation.
On February 16, MLCommons released a taxonomy-first methodology for jailbreak benchmarking that exposes just how broken the current approach is. The problems: weak reproducibility across organizations, poor defensibility to auditors and regulators, and non-deterministic labeling that makes results essentially incomparable.
The Current Mess
Right now, when an AI company claims their model is resistant to jailbreaks, there’s no standard way to verify that claim. Different organizations use different attack sets, different evaluation criteria, and different success metrics. A model that passes one company’s internal red-teaming might fail spectacularly against another’s methodology.
This isn’t a hypothetical concern. The 2026 International AI Safety Report warned that “it has become more common for models to distinguish between test settings and real-world deployment and to find loopholes in evaluations.” If models can game their evaluations, and evaluations aren’t even standardized, the entire safety testing edifice becomes theater.
For regulators trying to enforce the EU AI Act or state-level AI laws like California’s SB 53, this is a nightmare. You can’t write compliance rules around benchmarks that don’t produce consistent, reproducible results. And you can’t audit what you can’t measure.
What MLCommons Is Actually Proposing
The new methodology centers on a “mechanism-first taxonomy” - classifying attacks by how they manipulate model behavior at inference time, rather than grouping them by outcomes or surface features.
Six principles guide the approach:
- Taxonomy design governs coverage: The attack classification system directly determines what gets tested and how results are interpreted
- Evidence-based attack selection: Every included attack must have documented mechanisms, not just observed success
- Reproducible generation: Attack implementations must be auditable so benchmarks can be independently validated
- Documented variant management: When attacks have multiple forms, selection rules must be explicit
- Paired baseline testing: Comparing adversarial performance against baselines enables clear measurement of degradation
- Family-level evaluation: Results should be examined at individual attack family levels, not just aggregated
The approach enforces deterministic labeling - one instance maps to one leaf category, period. This eliminates the ambiguity that currently makes cross-organization comparisons meaningless.
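To make the contrast with today's ad hoc scoring concrete, here is a minimal Python sketch of how deterministic labeling, paired baselines, and family-level reporting could fit together. The leaf names, the leaf-to-family mapping, and the `refused` field are illustrative assumptions for this sketch, not MLCommons' published schema.

```python
# Sketch: deterministic labeling + family-level, baseline-paired scoring.
# The taxonomy below is a hypothetical example, not MLCommons' actual one.
from collections import defaultdict

# Each leaf category belongs to exactly one mechanism family.
LEAF_TO_FAMILY = {
    "role_play_persona": "context_manipulation",
    "payload_encoding": "obfuscation",
    "crescendo_escalation": "multi_turn",
}

def label(instance: dict) -> str:
    """Deterministic labeling: one instance maps to one leaf, or we fail loudly."""
    leaf = instance["leaf"]  # assigned by explicit, documented selection rules
    if leaf not in LEAF_TO_FAMILY:
        raise ValueError(f"unlabeled or ambiguous instance: {leaf!r}")
    return leaf

def family_report(results: list[dict]) -> dict:
    """Aggregate per attack family, paired against baseline prompts.

    Assumes the benchmark is paired by design: every family has both
    adversarial and baseline results, so degradation is well-defined.
    """
    buckets = defaultdict(lambda: {"attack": [], "baseline": []})
    for r in results:
        fam = LEAF_TO_FAMILY[label(r)]
        key = "baseline" if r["is_baseline"] else "attack"
        buckets[fam][key].append(r["refused"])  # True = model refused
    report = {}
    for fam, b in buckets.items():
        attack_rate = sum(b["attack"]) / len(b["attack"])
        baseline_rate = sum(b["baseline"]) / len(b["baseline"])
        # Degradation: how much refusal drops under adversarial pressure.
        report[fam] = {
            "refusal_rate": attack_rate,
            "degradation": baseline_rate - attack_rate,
        }
    return report
```

The point of the structure is that every number in the report is traceable: each result maps to exactly one leaf, each leaf to exactly one family, so two organizations running the same attack set cannot bucket the same instance differently.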
Why This Matters Now
The timing is significant. As large language models move into “safety-, security-, and compliance-critical environments,” as MLCommons puts it, robustness to adversarial prompting becomes an operational requirement rather than an academic interest.
Recent research has shown the scale of the problem. Palo Alto Unit 42’s work on multi-turn “camouflage attacks” achieved 65% success across 8,000 tests on eight different models. Automated multi-turn crescendo attacks hit 98% success on GPT-4 and 100% on Gemini-Pro. These aren’t theoretical attacks - they’re documented, reproducible jailbreaks that work on production systems.
Meanwhile, the regulatory cliff has arrived. The EU AI Act’s general application phase began in 2026. Colorado’s AI Act took effect January 1. California transparency requirements create immediate compliance obligations. Organizations need benchmarks that will actually hold up under legal scrutiny.
What’s Still Missing
MLCommons acknowledges this release is foundational work, not a complete solution. Key priorities for future development include:
- Comprehensive coverage: The taxonomy needs to span all known bypass families, which means ongoing updates as new attack vectors emerge
- Auditable implementations: Code-based attack artifacts need to be independently validatable
- Multimodal extension: Current focus is text-based; expanding to image and audio attacks remains future work
The organization also flags that even well-designed benchmarks have limits. If models learn to recognize evaluation contexts and behave differently in deployment - the “defeat device” problem documented in recent alignment research - then no benchmark fully solves the trust gap.
The Uncomfortable Truth
The implicit admission in MLCommons’ work is stark: we’ve been doing AI safety testing wrong. Not just imperfectly or incompletely, but fundamentally - in ways that undermine the entire purpose of the exercise.
This isn’t an indictment of bad actors. It’s a recognition that the field developed organically, with different groups creating their own methodologies for their own purposes. Now that AI regulation is actual law rather than hypothetical policy, the lack of standardization has real consequences.
A model vendor that claims its system passed internal safety testing isn't lying. But without standardized, reproducible benchmarks, that claim means approximately nothing to anyone trying to assess actual risk.
MLCommons’ taxonomy is a first step toward making AI safety claims mean something. Whether the industry adopts it - and whether models eventually learn to game even standardized benchmarks - remains to be seen.