Your AI's Safety Training Can Be Surgically Removed at Runtime

Researchers found the exact neurons responsible for refusing harmful requests — then switched them off. No retraining. No fine-tuning. Just geometry.


Safety training in large language models isn’t bolted on. It’s woven into the weights — distributed across billions of parameters through months of reinforcement learning. That’s the pitch, anyway. A new paper published April 9 on arXiv demonstrates that this woven-in safety occupies a surprisingly small, identifiable subspace in a model’s hidden states. And it can be cut out at inference time with a single mathematical operation.

The technique is called Contextual Representation Ablation, or CRA. Researchers Wenpeng Xing, Moran Fang, Guangtai Wang, Changting Lin, and Meng Han showed that the neurons responsible for safety refusals cluster into a low-rank subspace — a geometric structure in the model’s internal representations. Once identified, those dimensions can be suppressed during generation without touching the model’s weights, retraining it, or even fine-tuning.

The result: a 76% attack success rate on Llama-2-7B-Chat, one of Meta’s safety-tuned models.

How It Works

CRA treats alignment as a signal processing problem. The core insight: when a model encounters a harmful request, certain directions in its hidden state space activate to trigger a refusal. These aren’t random — they form a structured, low-rank subspace that can be mathematically isolated.

The method computes a “Refusal Importance Score” for each dimension using three complementary metrics:

  • Sensitivity — how much does perturbing this dimension change the output? (normalized gradient norms)
  • Salience — is this dimension actually active during refusal? (gradient-activation products to filter dormant neurons)
  • Dominance — does this dimension align with the principal refusal directions? (top-k selection for low-rank filtering)
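A sketch of how these three metrics might combine into a per-dimension score. The function name, the multiplicative combination, and the use of the top principal component for dominance are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def refusal_importance(grads, acts, top_k=64, eps=1e-8):
    """Score each hidden dimension for its role in refusal behavior.

    grads: (n_samples, d) gradients of a refusal objective w.r.t. hidden states
    acts:  (n_samples, d) hidden activations on refusal-triggering prompts
    Illustrative sketch; names and the combination rule are assumptions.
    """
    # Sensitivity: normalized gradient magnitude per dimension
    sensitivity = np.abs(grads).mean(axis=0)
    sensitivity /= sensitivity.max() + eps

    # Salience: gradient-activation product filters dormant neurons
    salience = np.abs(grads * acts).mean(axis=0)
    salience /= salience.max() + eps

    # Dominance: alignment with the principal direction of refusal activations
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dominance = np.abs(vt[0])
    dominance /= dominance.max() + eps

    score = sensitivity * salience * dominance
    # Top-k selection keeps the candidate low-rank refusal subspace
    mask = np.zeros_like(score)
    mask[np.argsort(score)[-top_k:]] = 1.0
    return score, mask
```

The resulting binary mask marks the dimensions to suppress at decode time.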

Once the refusal subspace is identified, CRA applies a soft mask during decoding: each hidden state gets element-wise multiplied by (1 − λ·M), where M marks the refusal dimensions and λ controls suppression intensity. Set λ to 1.0 and the refusal behavior is fully ablated.
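In code, the soft mask is a single element-wise multiply. A minimal sketch (the function name and standalone arrays are illustrative; in a real deployment this would run inside a forward hook on each targeted transformer layer):

```python
import numpy as np

def apply_cra_mask(hidden, mask, lam=1.0):
    """Suppress refusal dimensions in a hidden state during decoding.

    hidden: (d,) or (seq, d) hidden state(s)
    mask:   (d,) binary vector marking refusal dimensions
    lam:    suppression intensity in [0, 1]; lam=1.0 fully ablates
    """
    return hidden * (1.0 - lam * mask)

h = np.array([0.5, -1.2, 2.0, 0.3])
m = np.array([0.0, 1.0, 1.0, 0.0])   # dims 1 and 2 play the "refusal" role
print(apply_cra_mask(h, m, lam=1.0))  # dims 1 and 2 zeroed; dims 0 and 3 unchanged
```

With λ between 0 and 1, the refusal signal is attenuated rather than removed, which is why the paper frames λ as an intensity knob rather than a switch.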

No gradient computation needed at runtime. No adversarial prompt optimization. Just identify the geometry, apply the mask, and the model stops saying no.

The Numbers

Tested on four safety-aligned open-source models using the AdvBench benchmark:

Model                  CRA Success Rate   Best Prior Method   Improvement
Llama-2-7B-Chat        76.0%              68.7% (DSN)         +7.3 points
Mistral-7B-Instruct    70.5%
Guanaco-7B             62.3%

Against PEZ, a gradient-based discrete optimization baseline, CRA achieved a 15.2-fold improvement on Llama-2 (3.3% to 53.0% on cross-dataset evaluation). The technique also outperformed Emulated Disalignment, a logit contrast method that scored 64.0%.

The output quality remained acceptable — self-BLEU around 17.0 and n-gram diversity around 83.0%, meaning the generated harmful content was coherent and varied, not repetitive gibberish.
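For intuition, an n-gram diversity score like the one cited can be computed as the fraction of n-grams that are unique across generations. This is a simplified distinct-n sketch, not the paper's exact metric:

```python
def distinct_ngrams(texts, n=2):
    """Fraction of n-grams that appear only once across all texts.

    Higher means more varied output; values near 83% would indicate
    coherent, non-repetitive generations. Simplified illustration.
    """
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        grams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

Self-BLEU works in the opposite direction: each generation is scored as a BLEU hypothesis against the other generations as references, so a low value (like 17.0) signals low mutual overlap, i.e. diverse output.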

Why This Should Worry You

Three things make CRA different from the usual jailbreak-of-the-week:

It’s structural, not prompt-based. Most jailbreaks exploit the text interface — clever prompts, role-playing, many-shot attacks. CRA operates at the representation level. It doesn’t trick the model into thinking a harmful request is benign. It removes the model’s ability to care.

It requires no retraining. Fine-tuning attacks (like LoRA-based ablation) need compute and access to training infrastructure. CRA works at inference time on any system running an open-weight model. If you can load the model, you can silence its guardrails.

It reveals a geometric vulnerability. That safety behaviors occupy a low-rank subspace means they’re concentrated in a small number of directions. All that RLHF, all that constitutional AI training, all those red-teaming hours — compressed into a handful of dimensions that can be masked with a single matrix operation.

This lands alongside a separate paper from the same week — ACIArena (arXiv 2604.07775) by Hengyu An and colleagues — showing that multi-agent AI systems are vulnerable to “cascading injection” attacks where one compromised agent propagates malicious instructions through a system of cooperating agents. Their evaluation of 1,356 test cases across six multi-agent implementations found that “topology alone proves insufficient for ensuring robustness” and that defensive strategies designed in simplified settings routinely fail in realistic environments.

What’s Being Done (And Why It’s Not Enough)

The authors themselves note the “urgent need for more robust defenses that secure the model’s latent space.” But what would that look like?

Current alignment techniques — RLHF, RLAIF, constitutional AI — all operate during training. They shape the weight space to make refusal the default behavior. CRA shows this shaping produces a concentrated, low-rank signal. A defense would need to either distribute safety behaviors across a higher-dimensional subspace (making them harder to isolate) or add runtime integrity checks on the model’s internal representations.

Neither approach exists today in any deployed system.

The open-weight ecosystem faces the starkest version of this problem. Meta, Mistral, and others distribute models that users run on their own hardware with full access to the inference pipeline. There is no API layer to enforce safety. No monitoring. No guardrails that can’t be geometrically excised by anyone who reads this paper.

Closed-weight providers like OpenAI, Anthropic, and Google control the inference stack, so CRA doesn’t directly apply to their APIs. But the underlying finding — that alignment occupies a small, identifiable subspace — likely holds for their models too. The question is whether anyone inside those companies has already mapped their own refusal geometry, and what they found when they looked.