The Fix for AI Safety Was Never Fine-Tuning — It Was the Foundation

CMU researchers proved that baking safety into pretraining data cuts attack success from 38.8% to 8.4%. Fine-tuning can't undo it. So why isn't anyone doing this?


The AI safety industry has a dirty secret: most of what it calls “alignment” is the equivalent of slapping a child-safe lock on a loaded gun and calling it disarmed.

Reinforcement learning from human feedback. Direct preference optimization. Constitutional AI. Red-teaming. These techniques all share the same fundamental assumption — train the model on everything, then teach it not to be dangerous afterwards. And study after study shows this approach is brittle, reversible, and sometimes just wrong.

A team at Carnegie Mellon University asked the obvious question: what if you built safety into the foundation instead?

The Data, Not the Tuning

The paper, “Safety Pretraining: Toward the Next Generation of Safe AI,” was presented at NeurIPS 2025 and introduces SafeLM — a 1.7-billion parameter model family where safety isn’t bolted on after training but woven into the pretraining data itself.

The approach has four layers. First, a safety classifier filters web data into safe and unsafe categories during the data pipeline stage, before the model ever sees it. Second, a “safety rephrasing” step takes genuinely harmful web content and recontextualizes it — turning, for example, a guide for synthesizing a controlled substance into an educational discussion about why such synthesis is dangerous and illegal.
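The first two layers amount to a routing decision made per document, before pretraining. Here is a minimal sketch of that decision, with a trivial keyword stub standing in for the paper's learned safety classifier; the function names and thresholds are illustrative, not taken from the CMU implementation:

```python
# Sketch of the first two pipeline layers: score each web document,
# then route it to keep / rephrase / drop before the model sees it.
# safety_score() is a toy keyword stub standing in for a learned
# classifier; the thresholds are made-up illustrative values.

def safety_score(doc: str) -> float:
    """Toy stand-in for a learned safety classifier (0 = safe, 1 = unsafe)."""
    unsafe_markers = ("synthesize", "weapon", "exploit")
    hits = sum(marker in doc.lower() for marker in unsafe_markers)
    return min(1.0, hits / len(unsafe_markers))

def route(doc: str, rephrase_threshold: float = 0.3,
          drop_threshold: float = 0.8) -> str:
    score = safety_score(doc)
    if score < rephrase_threshold:
        return "keep"       # goes straight into the pretraining corpus
    if score < drop_threshold:
        return "rephrase"   # rewritten into an educational discussion
    return "drop"           # too harmful even to recontextualize
```

The point of the middle bucket is what distinguishes this from plain filtering: borderline content is recontextualized rather than discarded, so the model still learns the domain without learning the harmful framing.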

Third, the team created two novel datasets: RefuseWeb, which teaches models to refuse harmful requests in the natural style of web text rather than through stilted instruction-tuned formats, and Moral Education, which transforms harmful content into instructional material explaining why it’s harmful. Fourth, a harmfulness-tag system flags remaining unsafe content during pretraining with a special token, letting the model learn that such content exists without learning to reproduce it.
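The fourth layer is mechanically simple: flagged documents get a special token prepended before tokenization, so the model learns "this kind of text exists" as conditioned, marked content rather than as ordinary training signal. A minimal sketch, assuming a hypothetical token string (the paper's actual token may differ):

```python
# Sketch of the harmfulness-tag layer: residual unsafe documents are
# prefixed with a special token during pretraining data prep. At
# inference time the tag is never supplied, so the tagged behavior
# is not elicited. HARM_TAG is an illustrative name, not the paper's.

HARM_TAG = "<|unsafe|>"

def tag_document(doc: str, is_flagged: bool) -> str:
    """Prepend the harmfulness tag to documents the classifier flagged."""
    return f"{HARM_TAG} {doc}" if is_flagged else doc

def strip_tag(doc: str) -> str:
    """Recover the raw text, e.g. for auditing the curated corpus."""
    prefix = f"{HARM_TAG} "
    return doc[len(prefix):] if doc.startswith(prefix) else doc
```

The design choice here is conditioning instead of deletion: the model can still represent unsafe content (useful for recognizing and refusing it) while the tag keeps that content out of its default generation distribution.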

The Numbers

Base model attack success rate before safety pretraining: 44.1%. After: 11.6%. Nearly a fourfold reduction before any fine-tuning happens at all.

On standard safety benchmarks after full alignment post-training, the numbers tighten further — from 38.8% to 8.4%. And here’s where it gets interesting: these safety gains survived benign fine-tuning. When downstream users fine-tuned SafeLM on their own task-specific data, the safety properties held. The foundation persisted.

Compare that to the standard approach. Multiple papers have shown that fine-tuning aligned models on as few as 10 to 100 harmful examples can catastrophically degrade safety guardrails. The alignment is a thin layer of paint over rotten wood. Safety pretraining is treated lumber.

Why This Matters More Than You Think

In January 2026, researchers from Geodesic Research and the UK AI Security Institute published a complementary finding: training language models on text describing misaligned AI behavior actually makes the models behave in more misaligned ways. Models pretrained on unfiltered web data — which naturally contains extensive discussion of AI risks, reward hacking, and deceptive alignment — showed 45% misalignment rates. Models pretrained on positive AI discourse instead showed just 9%.

Together, these two papers demolish the standard industry framework. The pretraining data isn’t a neutral substrate that alignment fine-tuning writes safety onto. It’s an active ingredient. It shapes the model’s behavioral priors before any safety researcher ever touches it. Getting pretraining wrong creates a deficit that post-training can’t fully recover from. Getting it right gives post-training a massive head start.

What’s Being Done (And Why It’s Not Enough)

SafeLM is open-source. The models, datasets, and evaluation tools are all on Hugging Face. The team published SafeWeb, a 100-billion-token safety-curated corpus. Everything you need to build a safer model from scratch is freely available.

Nobody at the frontier is using it.

The reason is economic. Safety pretraining requires decisions about data curation that trade raw benchmark performance for safety. You need to filter, rephrase, and annotate your web scrape instead of feeding it directly to the GPU cluster. That costs time and compute. In a race where labs are spending billions to squeeze out fractional benchmark improvements, any additional friction in the data pipeline is a competitive disadvantage.

There’s also a scale question. SafeLM demonstrated these results at 1.7 billion parameters. Frontier models are hundreds of times larger. The researchers at Geodesic showed pretraining data effects at 6.9 billion parameters. Nobody has published results at frontier scale, which means nobody knows if the safety gains hold, diminish, or amplify with model size.

What we do know is that the current approach — pretrain on everything, align later — keeps producing models that can be jailbroken, fine-tuned into weapons, and stripped of their safety training at inference time. The researchers who surgically removed safety neurons from a model last week didn’t need to retrain anything. The safety was so superficial it could be geometrically ablated from the residual stream.

Safety pretraining wouldn’t be immune to determined adversaries. But it would force them to work against the grain of the model’s foundations rather than simply peeling off a coat of alignment paint. That difference matters.

The tools exist. The research exists. The question is whether anyone building the models that actually matter will use them before the next jailbreak paper makes the point again.