cs.LG

Training multimodal AI with paired data instead of perfectly aligned datasets

Yan Li, Yunlong Deng, Yuewen Sun, Gongxu Luo, Kun Zhang, Guangyi Chen

May 20, 2026

Building multimodal AI systems usually demands costly datasets in which every sample includes every modality. This work shows that the requirement can be relaxed: train on modality pairs (image-text, audio-text) separately and still learn a representation shared across all modalities. The authors prove theoretically when this is possible, then propose a two-stage framework that uses self-reconstruction and contrastive learning to align the latent spaces. They demonstrate the approach by adding 3D point clouds and tactile data to existing vision-language models, achieving competitive cross-modal performance without full joint alignment.
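To make the two ingredients named above concrete, here is a minimal NumPy sketch of a reconstruction term and a contrastive term computed on one batch of paired embeddings. It is a generic illustration, not the paper's actual objective: the symmetric InfoNCE form, the mean-squared-error reconstruction, the temperature of 0.07, and the function names `info_nce` and `reconstruction_loss` are all assumptions chosen for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(z_a, z_b, temperature=0.07):
    # Symmetric InfoNCE: row i of z_a and row i of z_b are a positive pair,
    # and every other row in the batch serves as a negative.
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    idx = np.arange(len(z_a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return float(0.5 * (loss_ab + loss_ba))

def reconstruction_loss(x, x_hat):
    # Mean squared error between an input and its reconstruction.
    return float(np.mean((x - x_hat) ** 2))

# Toy batch: 8 "image" latents and 8 "text" latents of dimension 16.
rng = np.random.default_rng(0)
z_image = rng.normal(size=(8, 16))
z_text = z_image + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
z_text_mismatched = z_text[::-1]                    # pairing deliberately broken

loss_aligned = info_nce(z_image, z_text)
loss_mismatched = info_nce(z_image, z_text_mismatched)
```

In this toy setup the correctly paired batch yields a lower contrastive loss than the mismatched one, which is the signal that pulls two modalities into a common latent space. Under the pairwise setting described above, a loss of this kind would be computed separately for each available modality pair (for example image-text and audio-text), with no batch ever needing all modalities at once.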
Published as "Multimodal LLMs under Pairwise Modalities", arXiv:2605.21059