Building multimodal AI for languages with limited data
Firoj Alam, Shammur Absar Chowdhury, Enamul Hoque Prince
May 16, 2026
Most multimodal language models target English and assume expensive infrastructure, leaving low-resource languages behind. This tutorial covers the emerging space of multilingual multimodal LLMs that handle text, speech, and vision together, with an emphasis on data efficiency and limited compute budgets. Topics include low-cost data creation, adapter stacks for aligning modalities, culture-aware evaluation beyond English, and working examples of compact vision-language models and speech-to-text-to-LLM pipelines. The tutorial is designed for researchers and practitioners building multilingual AI systems.