cs.CL

Building multimodal AI for languages with limited data

Firoj Alam, Shammur Absar Chowdhury, Enamul Hoque Prince

May 16, 2026

Most multimodal language models target English and assume expensive infrastructure, leaving low-resource languages behind. This tutorial covers the emerging space of multilingual multimodal LLMs that handle text, speech, and vision together, with an emphasis on data efficiency and limited compute budgets. Topics include low-cost data creation, adapter stacks for aligning modalities, culture-aware evaluation beyond English, and working examples of compact vision-language models and speech-to-text-to-LLM pipelines. The tutorial is designed for researchers and practitioners building multilingual AI systems.
Published as "Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages" (arXiv:2605.17152).