Why language models fail differently across languages
Max Zhang, Ameen Patel, Sang T. Truong, Sanmi Koyejo
May 16, 2026
Large language models show degraded safety guardrails in non-English languages, but the standard metric, jailbreak success rate, masks the underlying causes. This work applies Multi-Group Item Response Theory to 1.9 million safety evaluations across 61 model configurations and 10 languages, isolating four independent factors: model robustness, prompt difficulty, language processing difficulty, and prompt-specific cross-lingual vulnerability. Key finding: 22 model variants are more vulnerable in English than in low-resource languages, contradicting the assumption that safety scales with language resources. Low-resource languages elicit more uncertain responses. Vulnerability clusters in physical-harm categories and in less-resourced languages, but translation quality alone does not predict vulnerability; cultural mismatches also matter. The framework achieves an AUC of 0.940 in predicting safe refusals, outperforming simpler baselines and enabling targeted safety improvements.
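To make the four-factor decomposition concrete, here is a minimal sketch of how such factors could combine into a refusal probability, and how an AUC over predicted refusals is computed. The summary does not state the paper's functional form, so the additive logistic (Rasch-style) form below is an assumption, and every function and parameter name is hypothetical. The numbers in the usage lines are made up for illustration and are not results from the paper.

```python
import math


def p_safe_refusal(theta_model, b_prompt, d_language, g_prompt_language):
    """Probability of a safe refusal under an ASSUMED additive logistic model.

    theta_model        -- model robustness (higher = safer)
    b_prompt           -- prompt difficulty (higher = harder to refuse)
    d_language         -- language processing difficulty
    g_prompt_language  -- prompt-specific cross-lingual vulnerability
    """
    logit = theta_model - b_prompt - d_language - g_prompt_language
    return 1.0 / (1.0 + math.exp(-logit))


def auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative.

    Ties count as half a win. labels are 1 (safe refusal) or 0 (unsafe).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative values only: the same model and prompt in two languages.
p_easy_language = p_safe_refusal(1.0, 0.2, 0.0, 0.0)
p_hard_language = p_safe_refusal(1.0, 0.2, 0.8, 0.3)
```

Under this assumed form, a language effect and a prompt-by-language effect enter as separate terms, so a drop in refusal probability can be attributed to one or the other rather than folded into a single jailbreak success rate.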