cs.LG

Why language models fail differently across languages

Max Zhang, Ameen Patel, Sang T. Truong, Sanmi Koyejo

May 16, 2026

Large language models show degraded safety guardrails in non-English languages, but the standard metric (jailbreak success rate) masks the actual causes. This work applies Multi-Group Item Response Theory to 1.9 million safety evaluations across 61 model configurations and 10 languages, isolating four independent factors: model robustness, prompt difficulty, language processing difficulty, and prompt-specific cross-lingual vulnerability. Key finding: 22 model variants are actually more vulnerable in English than in low-resource languages, contradicting the assumption that safety scales with language resources. Prompts in low-resource languages elicit more uncertain responses. Vulnerability clusters in physical-harm categories and less-resourced languages, though translation quality alone does not predict vulnerability; cultural mismatches also matter. The framework achieves an AUC of 0.940 in predicting safe refusals, outperforming simpler baselines and enabling targeted safety improvements.
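To make the four-factor decomposition concrete, here is a minimal sketch of how such factors could combine in an IRT-style model. The additive logistic form, the function name `p_safe_refusal`, the parameter names, and all numeric values are assumptions for illustration; this summary does not state the paper's exact parameterization.

```python
import math


def p_safe_refusal(theta_model, b_prompt, d_language, g_prompt_language):
    """Probability of a safe refusal under an ASSUMED additive logistic form.

    theta_model       -- model robustness (higher = safer)
    b_prompt          -- prompt difficulty (higher = harder to refuse)
    d_language        -- language processing difficulty
    g_prompt_language -- prompt-specific cross-lingual vulnerability
    """
    logit = theta_model - b_prompt - d_language - g_prompt_language
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values: the same model and prompt, evaluated in two languages.
p_reference_language = p_safe_refusal(1.5, 0.5, 0.0, 0.0)
p_harder_language = p_safe_refusal(1.5, 0.5, 0.8, 0.4)

print(f"reference language: {p_reference_language:.3f}")
print(f"harder language:    {p_harder_language:.3f}")
```

Separating the terms this way is what lets a single jailbreak success rate be attributed to the model, the prompt, the language, or their interaction, rather than reported as one aggregate number.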
Published as Why Do Safety Guardrails Degrade Across Languages? arXiv:2605.17173