Teaching AI to remember from two viewpoints at once

Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen, Shaofang Quan, Chengzhi Wu, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen

EgoExoMem is a benchmark of 2,600 multiple-choice questions that test spatial-temporal reasoning across synchronized first-person (egocentric) and third-person (exocentric) videos. The work also introduces E²-Select, a training-free frame selection method that combines relevance-based budget allocation with sampling techniques to handle temporal misalignment and view differences. Current large multimodal models reach only 55.3% accuracy, while E²-Select achieves 58.2%. The gain indicates that ego and exo views provide complementary information, but the cross-view reasoning task remains largely unsolved. The analysis also reveals systematic conflicts between how questions frame scenarios and how answers are grounded in different viewpoints.
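
To make the idea of relevance-based budget allocation concrete, here is a minimal sketch of one way a fixed frame budget could be split between the two views. It is not the authors' E²-Select implementation: the function names, the proportional split rule, and the top-k selection step are all assumptions made for illustration, and the per-frame relevance scores are assumed to come from some external question-to-frame scorer that is not shown.

```python
# Illustrative sketch only; NOT the E²-Select algorithm from the paper.
# All names and the proportional-split rule are hypothetical.

def allocate_budget(ego_scores, exo_scores, total_budget):
    """Split a frame budget between views in proportion to summed relevance."""
    ego_mass = sum(ego_scores)
    exo_mass = sum(exo_scores)
    total = ego_mass + exo_mass
    if total <= 0:
        # No relevance signal at all: fall back to an even split.
        ego_k = total_budget // 2
    else:
        ego_k = round(total_budget * ego_mass / total)
    # Never request more frames than a view has, or than the budget allows.
    ego_k = max(0, min(ego_k, len(ego_scores), total_budget))
    exo_k = min(total_budget - ego_k, len(exo_scores))
    return ego_k, exo_k


def top_k_indices(scores, k):
    """Return the indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_frames(ego_scores, exo_scores, total_budget):
    """Pick frame indices per view under a shared budget (assumed scheme)."""
    ego_k, exo_k = allocate_budget(ego_scores, exo_scores, total_budget)
    return top_k_indices(ego_scores, ego_k), top_k_indices(exo_scores, exo_k)


if __name__ == "__main__":
    # Made-up relevance scores for four frames per view.
    ego = [0.9, 0.1, 0.8, 0.2]
    exo = [0.3, 0.1, 0.2, 0.4]
    print(select_frames(ego, exo, total_budget=4))
```

In this toy setup the view whose frames look more relevant to the question receives the larger share of the budget. The sketch does not model temporal misalignment between views, which the abstract names as one of the problems the actual method addresses.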