Building multimodal AI for languages with limited data

Most multimodal language models are built for English and assume expensive infrastructure, leaving low-resource languages behind. This tutorial covers the emerging space of multilingual multimodal LLMs that handle text, speech, and vision together, with an emphasis on data efficiency and limited compute budgets. Topics include low-cost data creation, adapter stacks for aligning modalities, culture-aware evaluation beyond English, and working examples of compact vision-language models and speech-to-text-to-LLM pipelines. The tutorial is designed for researchers and practitioners building multilingual AI systems.