Teaching robots to find things they can't see yet

Most vision-language models for robots break down once task objects leave the camera's field of view, which forces the robot into purely reactive behavior. SOMA adds a persistent spatial memory, built from multi-view scans taken with a movable head camera, that lets a robot reason about objects it cannot currently see. On five real-world manipulation tasks, including dual-arm scenarios, SOMA achieves faster target localization and one-shot grasping under partial visibility.
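
To make the idea of a persistent spatial memory concrete, here is a toy sketch in Python. It is not SOMA's implementation: the names `CameraPose`, `SpatialMemory`, `observe`, `locate`, and `in_view` are hypothetical, the world is flattened to 2D, and the 60-degree field of view is an arbitrary assumption. It only illustrates the general pattern of storing detections in a fixed world frame so they stay available after the camera turns away.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CameraPose:
    """Hypothetical head-camera pose: world position plus pan angle (yaw, radians)."""
    x: float
    y: float
    yaw: float


@dataclass
class SpatialMemory:
    """Toy persistent memory: object label -> last known (x, y) in the world frame."""
    objects: dict = field(default_factory=dict)

    def observe(self, pose, label, cam_xy):
        """Record a detection given in the camera frame (x forward, y left)."""
        cx, cy = cam_xy
        # Rotate by the camera yaw and translate by the camera position
        # so the stored point no longer depends on where the camera looks.
        wx = pose.x + cx * math.cos(pose.yaw) - cy * math.sin(pose.yaw)
        wy = pose.y + cx * math.sin(pose.yaw) + cy * math.cos(pose.yaw)
        self.objects[label] = (wx, wy)

    def locate(self, label):
        """Return the remembered world position, or None if never observed."""
        return self.objects.get(label)

    def in_view(self, pose, label, fov=math.radians(60)):
        """Check whether a remembered object lies inside the horizontal field of view."""
        pos = self.locate(label)
        if pos is None:
            return False
        bearing = math.atan2(pos[1] - pose.y, pos[0] - pose.x) - pose.yaw
        # Wrap the angle to [-pi, pi] before comparing against half the FOV.
        bearing = math.atan2(math.sin(bearing), math.cos(bearing))
        return abs(bearing) <= fov / 2


memory = SpatialMemory()
# Scan with the head turned 90 degrees to the left; a cup is seen 1 m ahead.
memory.observe(CameraPose(0.0, 0.0, math.pi / 2), "cup", (1.0, 0.0))

# The head now faces forward, so the cup is out of view but still remembered.
forward = CameraPose(0.0, 0.0, 0.0)
print(memory.in_view(forward, "cup"))
print(memory.locate("cup"))
```

In this sketch the cup stays locatable after the head turns back, which is the property the summary attributes to a persistent memory; a purely reactive system would only have the current frame to work with.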