Teaching AI to remember from two viewpoints at once
Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen, Shaofang Quan, Chengzhi Wu, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen
May 18, 2026
EgoExoMem is a benchmark of 2,600 multiple-choice questions that test spatial-temporal reasoning across synchronized first-person (egocentric) and third-person (exocentric) videos. The work also introduces E²-Select, a training-free frame selection method that combines relevance-based budget allocation with sampling techniques to handle temporal misalignment and differences between views. Current large multimodal models reach only 55.3% accuracy, while E²-Select achieves 58.2%. This shows that ego and exo views provide complementary information, but also that cross-view reasoning remains largely unsolved. The analysis reveals systematic conflicts between how questions frame scenarios and how answers are grounded in different viewpoints.
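To make the idea of relevance-based budget allocation followed by sampling more concrete, here is a minimal Python sketch. It is not the paper's E²-Select implementation: the summary does not specify how relevance is scored or which sampling scheme is used, so the function names (`allocate_budget`, `select_frames`), the proportional split by mean relevance, and the "best frame per temporal segment" rule are all assumptions made for illustration. The per-frame relevance scores are assumed to come from some external question-to-frame scorer, and the sketch does not address temporal misalignment between views.

```python
from typing import List, Tuple


def allocate_budget(
    ego_scores: List[float],
    exo_scores: List[float],
    total_budget: int,
    min_per_view: int = 1,
) -> Tuple[int, int]:
    """Split a frame budget between views in proportion to mean relevance.

    Hypothetical rule: the view whose frames look more relevant to the
    question on average receives more of the budget.
    """
    ego_mean = sum(ego_scores) / len(ego_scores)
    exo_mean = sum(exo_scores) / len(exo_scores)
    total = ego_mean + exo_mean
    # Fall back to an even split when no frame has any relevance.
    share = 0.5 if total == 0 else ego_mean / total
    k_ego = round(share * total_budget)
    # Keep at least `min_per_view` frames for each view.
    k_ego = max(min_per_view, min(total_budget - min_per_view, k_ego))
    k_ego = min(k_ego, len(ego_scores))
    k_exo = min(total_budget - k_ego, len(exo_scores))
    return k_ego, k_exo


def select_frames(scores: List[float], k: int) -> List[int]:
    """Pick k frame indices: the most relevant frame in each of k equal segments.

    Segmenting the timeline keeps temporal coverage; taking the best frame
    inside each segment keeps relevance. Ties go to the earliest frame.
    """
    n = len(scores)
    k = min(k, n)
    picked = []
    for i in range(k):
        start = i * n // k
        end = (i + 1) * n // k
        picked.append(max(range(start, end), key=lambda t: scores[t]))
    return picked


if __name__ == "__main__":
    # Made-up relevance scores for 8 synchronized frames per view.
    ego = [0.9, 0.8, 0.1, 0.7, 0.9, 0.2, 0.8, 0.6]
    exo = [0.1, 0.2, 0.3, 0.1, 0.2, 0.1, 0.4, 0.2]
    k_ego, k_exo = allocate_budget(ego, exo, total_budget=4)
    print("ego frames:", select_frames(ego, k_ego))
    print("exo frames:", select_frames(exo, k_exo))
```

In this toy example the egocentric view scores higher on average, so it receives three of the four frames while the exocentric view keeps one. The `min_per_view` floor reflects the summary's point that the two views are complementary, so neither is dropped entirely.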