AI Labs Grade Their Own Biorisk Homework

Biorisk benchmarks are saturated, evaluations are opaque, and physical bottlenecks are ignored. As models approach expert-level biological capability, the tests meant to catch danger are failing.


AI models are approaching the point where they can meaningfully assist experts in developing novel biological weapons. The median forecast puts that threshold at late 2026 — months from now. You’d expect the testing regime designed to catch this moment to be rigorous, transparent, and built on solid foundations.

It isn’t. Two of the three primary biorisk benchmarks are effectively saturated. Most evaluation details remain undisclosed. And none of the tests measure whether an AI system could help someone actually build something dangerous in a lab, as opposed to answering multiple-choice questions about it.

The Benchmarks Are Maxed Out

Epoch AI researchers Anson Ho and Arden Berg examined how eight major AI labs — Anthropic, OpenAI, Google DeepMind, Meta, xAI, Mistral, Alibaba, and DeepSeek — evaluate their models for biological risk. They found the evaluation landscape dominated by three benchmarks.

SecureBio’s Virology Capabilities Test asks models to troubleshoot complex virology protocols, including questions that probe tacit knowledge normally acquired only through hands-on bench experience. Expert virologists score only 22.1% on questions within their own specialty. FutureHouse’s LAB-Bench tests knowledge of biological sequences, protocols, and research tools. And the Weapons of Mass Destruction Proxy benchmark covers hazardous knowledge in biology and chemistry through multiple-choice questions.

The problem: LAB-Bench and WMDP are “now largely saturated,” with frontier models approaching perfect scores. When a benchmark is saturated, it can no longer distinguish between a model that’s marginally capable and one that could provide meaningful uplift to a bad actor. It’s like using a bathroom scale to weigh trucks.
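
To make “saturated” concrete, here is a rough sketch with made-up numbers (the question count and scores below are assumptions, not figures from LAB-Bench or WMDP): once two models both sit near the ceiling, the gap between them is smaller than ordinary sampling noise on the benchmark.

```python
# Toy illustration with hypothetical numbers: a near-ceiling benchmark
# can no longer separate a marginally capable model from a stronger one.
import math

def ci_halfwidth(accuracy: float, n_questions: int) -> float:
    """Approximate 95% binomial confidence half-width for a benchmark score."""
    return 1.96 * math.sqrt(accuracy * (1 - accuracy) / n_questions)

n_questions = 200                 # assumed benchmark size
model_a, model_b = 0.97, 0.99     # two frontier models near the ceiling

gap = model_b - model_a
noise = ci_halfwidth(model_a, n_questions) + ci_halfwidth(model_b, n_questions)

print(f"score gap: {gap:.3f}")                   # 0.020
print(f"combined sampling noise: ~{noise:.3f}")  # ~0.037
# The observed gap is smaller than the measurement noise, so the
# benchmark tells you nothing about which model poses more risk.
```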

Only the Virology Capabilities Test still has headroom. But it measures textbook troubleshooting, not the ability to guide someone through the physical steps of working with dangerous pathogens. The gap between “knows the right answer” and “can walk you through it” is where the actual risk lives.

What Labs Report (And What They Don’t)

The disparity across labs is stark. Anthropic conducted ten biorisk evaluations for Claude Opus 4. OpenAI reported four for o3 and o4-mini. Google DeepMind provided minimal detail for Gemini 2.5 Pro. xAI, Mistral, Alibaba, and DeepSeek disclosed nothing.

Even among labs that publish results, the details that matter most are missing. Anthropic’s model card specifies a 5x uplift threshold for its highest concern level, but doesn’t clarify what baseline that multiplier applies to. Does 5x mean performance rises from 20% to 100%, or from 0.1% to 0.5%? The evaluation rubric is unpublished. OpenAI’s system card mentions biorisk red teaming but doesn’t specify whether expert red-teamers were involved.
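
Spelled out with assumed baselines (none of these numbers appear in Anthropic’s model card), the same multiplier can describe wildly different absolute risk:

```python
# Hypothetical arithmetic: what a "5x uplift" threshold implies
# depends entirely on the undisclosed baseline it multiplies.
for baseline in (0.20, 0.01, 0.001):      # assumed novice success rates
    uplifted = min(5 * baseline, 1.0)     # capped at 100%
    print(f"baseline {baseline:.1%} -> 5x uplift -> {uplifted:.1%} "
          f"(+{uplifted - baseline:.1%} absolute)")
# baseline 20.0% -> 5x uplift -> 100.0% (+80.0% absolute)
# baseline 1.0% -> 5x uplift -> 5.0% (+4.0% absolute)
# baseline 0.1% -> 5x uplift -> 0.5% (+0.4% absolute)
```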

This opacity matters because the International AI Safety Report 2026 — written by over 100 experts across 29 nations — warned that multiple AI companies released new models in 2025 “with additional safeguards after pre-deployment testing could not rule out the possibility that they could meaningfully help novices develop such weapons.” The companies couldn’t show the models were safe, so they added guardrails and shipped them anyway.

The Physical Bottleneck Nobody Tests For

The most significant gap in biorisk evaluations is what Epoch AI calls “somatic tacit knowledge” — the physical skills you can only learn by doing. Pipetting without contamination, maintaining sterile conditions across multi-day protocols, interpreting ambiguous results under a microscope. These are skills that take months to develop under expert supervision.

No current AI biorisk evaluation tests whether a model could actually guide someone through the physical work. Multiple-choice questions and text-based uplift trials measure knowledge, not capability. A model that scores perfectly on virology MCQs is still useless to someone who can’t maintain a cell culture.

But this defense has an expiration date. AI-enabled laboratory automation is advancing rapidly. Robotic systems that can execute molecular biology protocols are becoming more accessible. When the physical bottleneck narrows — when the question shifts from “can the AI tell you how” to “can the AI run the equipment that does it for you” — the gap between benchmark performance and real-world risk closes fast.

New Research Finds Active Gaps

A paper published April 23 by Michael Richter tested ChatGPT 5.2, Gemini 3 Pro, Claude Opus 4.5, and Meta’s Muse Spark on biological capability tasks. The findings included edge-case prompts where Gemini demonstrated a “seeming lack of contextual awareness” — providing information that could facilitate harm without recognizing the risk. The research documented scenarios involving poison production and extraction techniques accessible through anonymous, logged-out access to AI services.

This is the gap between self-reported safety evaluations and independent testing. Labs test their own models under controlled conditions with carefully designed prompts. Independent researchers test them the way adversaries would — through edge cases, context manipulation, and access paths the lab’s red team didn’t think of.

Why This Should Worry You

The forecasting community’s median estimate puts the arrival of expert-level biological uplift capabilities in late 2026. The Forecasting Research Institute’s survey found that when AI models reach this threshold, expert biologists estimate the annual probability of an AI-enabled biological catastrophe rises fivefold, from 0.3% to 1.5%.

Those numbers sound small until you consider they compound year after year. And they’re based on the assumption that we’ll know when models cross the threshold. If the benchmarks are saturated and the evaluations are opaque, we might not.
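
A back-of-the-envelope calculation makes the compounding concrete. Treating each year as an independent draw at the survey’s annual figures (a simplifying assumption), the cumulative chance of at least one catastrophe grows quickly with the horizon:

```python
# Cumulative risk from a small annual probability, assuming independent years.
for annual_p in (0.003, 0.015):           # 0.3% pre-threshold, 1.5% post-threshold
    for years in (10, 20):
        cumulative = 1 - (1 - annual_p) ** years
        print(f"annual {annual_p:.1%} over {years} years -> "
              f"~{cumulative:.0%} chance of at least one event")
# annual 0.3%: ~3% over 10 years, ~6% over 20 years
# annual 1.5%: ~14% over 10 years, ~26% over 20 years
```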

What’s Being Done (And Why It’s Not Enough)

Anthropic activated ASL-3 security protections for Claude Opus 4, which Epoch AI’s researchers consider “largely justified” given the uncertainties. That’s responsible. But ASL-3 is a single lab’s internal framework, not an industry standard. OpenAI, Google, and Meta each use different criteria, different benchmarks, and different disclosure practices. There is no shared evaluation framework that would let anyone compare biorisk across labs.

The harder evaluations — the ones that would actually measure whether a model can enable bioweapon development — require physical wet-lab studies with expert oversight. They’re expensive, slow, and ethically fraught. Nobody is running them at scale. Instead, the industry relies on multiple-choice tests that frontier models have already maxed out, self-reported evaluations with undisclosed methodologies, and the assumption that physical bottlenecks will hold.

That assumption is a bet. And the stakes are biological.