The safety interventions designed to make AI helpful and harmless?
They might only work in English.
Not “work best” in English. Only work in English. According to a comprehensive analysis of nearly 300 safety publications from 2020 to 2024, the entire field of LLM safety research has been built on an English-centric foundation - and the cracks are showing.
The Language Gap
Zheng-Xin Yong and colleagues at Brown University and Cohere reviewed every LLM safety paper published at major NLP venues over five years. What they found is damning.
English-only safety publications grew from 5 papers in 2020 to 83 in 2024. Meanwhile, Mandarin Chinese - spoken by over a billion people - received about a tenth of the research attention that English did. Every single public safety evaluation dataset includes English. Only two include any other language at all.
But here’s the part that should worry you: 50.6% of English-only papers didn’t even bother to mention that they were studying English. The language was so assumed, so default, that researchers didn’t think to document it.
The Performance Collapse
Raw numbers tell one story. Model performance tells a worse one.
The researchers compiled safety benchmarks across languages. Vicuna achieved an average harmlessness score of 69.32 - reasonable, if unimpressive. But its worst-case performance? 18.4. In Bengali.
ChatGPT’s average harmlessness was 85.56. Solid. Its worst case? 62.6. Also Bengali.
“Strong average performance,” the researchers write, “does not necessarily reflect robustness.” Averaging obscures catastrophic failures. And those failures cluster in exactly the languages where the most people live and the least research happens.
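To see how averaging hides the failure, here is a minimal Python sketch. Only the Bengali worst case (18.4) is a reported figure; the other per-language scores are invented purely for illustration:

```python
# Hypothetical per-language harmlessness scores (higher = safer).
# Only the Bengali value is taken from the reported benchmark;
# the rest are assumed for illustration.
scores = {
    "English": 91.0,
    "German": 84.0,
    "Mandarin": 72.0,
    "Hindi": 55.0,
    "Bengali": 18.4,  # reported worst case for Vicuna
}

average = sum(scores.values()) / len(scores)
worst_language, worst_score = min(scores.items(), key=lambda item: item[1])

print(f"Average harmlessness: {average:.2f}")            # 64.08 - looks tolerable
print(f"Worst case: {worst_score} in {worst_language}")  # 18.4 in Bengali
```

A safety report that publishes only the first number never has to explain the second.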
Why Safety Doesn’t Transfer
A separate study by Shen et al. dug into why this happens. Their finding: the bottleneck is pretraining, not alignment.
Fine-tuning models on high-resource languages improves safety alignment. The same techniques applied to lower-resource languages show “minimal improvement.” The safety doesn’t transfer because the underlying model representations were never linguistically balanced in the first place.
Models generate unsafe responses more frequently when malicious prompts come in lower-resource languages. They also generate more irrelevant responses - they’re not just less safe, they’re less competent. The model learned English first, learned safety in English, and treats everything else as a dialect of English that doesn’t quite work.
The Jailbreak Geography
The multilingual safety survey documents specific attack vectors that exploit this gap:
Code-switching - alternating between languages within a single prompt - jailbreaks multilingual safety guardrails. The model’s safety training expects monolingual inputs. Mixed-language prompts fall through the cracks.
Arabizi - Arabic written in Latin characters - successfully bypasses safety filters that catch standard Arabic script. The model recognizes “dangerous Arabic” but not “dangerous Arabic transliterated into English letters” (a toy sketch of this failure mode follows the list).
Cultural context misses: The word “banana” carries offensive connotations in Southeast Asia when describing someone who has abandoned their cultural identity. English safety training doesn’t know this. The Chinese character 屌 functions as both a severe slur and casual praise depending on context. English safety training can’t distinguish.
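To make the Arabizi bypass concrete, here is a deliberately naive sketch of a script-based keyword filter - a toy, not any vendor’s actual safety layer. The blocked term is the benign Arabic word خطر (“danger”), standing in for genuinely harmful vocabulary:

```python
# A deliberately naive keyword filter that matches Arabic script only.
# "خطر" ("danger") is a benign stand-in for actually harmful terms;
# real safety filters are more sophisticated, but the same gap opens
# whenever filtering assumes one surface form per language.
BLOCKED_TERMS = {"خطر"}

def is_blocked(prompt: str) -> bool:
    """Return True if the prompt contains any blocked term."""
    return any(term in prompt for term in BLOCKED_TERMS)

print(is_blocked("اشرح خطر المادة"))     # True: Arabic script matches
print(is_blocked("explain the 5atar"))   # False: Arabizi ("5" stands for خ) slips through
print(is_blocked("explain the khatar"))  # False: plain romanization slips through too
```

The mismatch generalizes: any filter or classifier tuned to one surface form of a language misses that language’s other surface forms.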
What This Means
The AI safety community has built its safety cases on techniques validated in English. Constitutional AI, RLHF, red-teaming - all developed, tested, and measured in English contexts. When Anthropic tells regulators their models are safe, they’re showing English evaluations. When OpenAI publishes safety benchmarks, they’re English benchmarks.
Some 6.3 billion people speak languages other than English. For most of them, the safety properties the industry advertises simply don’t exist - or exist only in a degraded form that collapses in exactly the situations where safety matters most.
The industry’s response is predictable: more multilingual training, expanded red-teaming, language-specific safety tuning. All necessary. All treating symptoms rather than causes.
Because the fundamental problem isn’t coverage. It’s that the entire field assumed English safety would generalize. It built on that assumption for five years. Nearly 300 papers later, we’re discovering the assumption was wrong.
The Tower Falls
The Tower of Babel was a story about hubris - humanity building toward heaven and being scattered across languages as punishment.
The Babel Problem in AI safety is similar. We built alignment toward safety, validated it in one language, and assumed it would stand everywhere.
It doesn’t. In Bengali, in Arabic, in the languages spoken by most of humanity, the tower falls.
And we’re only now starting to count the bodies at the base.