cs.CL

Why twenty years of Arabic NLP taught lessons about people, not language

Wajdi Zaghouani

May 20, 2026

Wajdi Zaghouani reflects on two decades of building Arabic NLP resources and infrastructure, from foundational linguistic datasets to social media analysis tools. He identifies three counterintuitive lessons: dataset creation is fundamentally a social process, communities matter more than individual tasks, and traditional NLP training leaves practitioners unprepared for real-world deployment challenges. Three high-profile failures (a depression detection corpus that never reached clinical use, overextension across shared tasks, and the false assumption that Modern Standard Arabic resources would transfer to dialects) reveal that the hardest problems in serving underserved languages are not technical but social and institutional.
Published as "Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems", arXiv:2605.20786