Training multimodal AI with paired data instead of perfectly aligned datasets

Building multimodal AI systems usually demands costly datasets in which every sample contains every modality. This work shows that requirement can be dropped: train separately on pairs of modalities (for example image-text and audio-text) and still learn a representation shared across all modalities. The researchers establish theoretically when this is possible, then propose a two-stage framework that uses self-reconstruction and contrastive learning to align the latent spaces. They demonstrate the approach by adding 3D point clouds and tactile data to existing vision-language models, achieving competitive cross-modal performance without full joint alignment.
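
To make the two ingredients of the framework concrete, here is a minimal NumPy sketch of a self-reconstruction objective and a contrastive alignment objective. The summary does not say which losses the authors use, so mean squared error and a symmetric InfoNCE loss are assumptions here, and every name (`reconstruction_loss`, `contrastive_loss`, `temperature`) is hypothetical rather than taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def reconstruction_loss(x, encode, decode):
    # Self-reconstruction: a modality should survive a round trip through
    # its own encoder and decoder. MSE is an assumed choice of loss.
    return float(np.mean((decode(encode(x)) - x) ** 2))


def contrastive_loss(z_a, z_b, temperature=0.1):
    # Symmetric InfoNCE (assumed): row i of z_a and row i of z_b are a
    # positive pair; every other row in the batch acts as a negative.
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    a_to_b = -np.mean(np.diag(log_softmax(logits, axis=1)))
    b_to_a = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return float(0.5 * (a_to_b + b_to_a))


# Toy check: correctly paired embeddings should score a lower loss than
# the same embeddings with the pairing scrambled.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = contrastive_loss(z, z)
scrambled = contrastive_loss(z, z[::-1])
```

In a pairwise setup like the one described, a loss of this kind would be computed once per available pair (say image-text, then point cloud-text), with the modality the pairs have in common acting as the anchor that ties the latent spaces together. How the two objectives are scheduled across the two stages is not stated in the summary and is left open here.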