A benchmark for robots that learn by exploring, not just watching

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi

Spatial understanding requires agents to act and observe in tandem, deciding which actions will gather relevant information rather than passively receiving views. ESI-Bench, built on OmniGibson and grounded in cognitive science, contains 10 task categories that test embodied spatial reasoning across perception, locomotion, and manipulation. Experiments on state-of-the-art multimodal models show that active exploration outperforms passive observation, yet failures stem primarily from poor action choices rather than weak visual perception. Notably, random multi-view approaches consume more images but add noise, and imperfect 3D representations harm performance relative to 2D baselines. Human studies expose a metacognitive gap: models commit to high-confidence answers regardless of evidence quality, whereas humans seek falsifying viewpoints and revise their beliefs.