Finding hidden objects by reasoning about invisible spaces

SceneFunRI is a benchmark for inferring the locations of invisible objects in 3D scenes from task instructions and commonsense knowledge. Built on SceneFun3D and comprising 855 instances, it frames the problem as 2D spatial reasoning: a model must predict where an object should be even though the object is out of view. Gemini 3 Flash, the strongest baseline tested, achieves only 15.20% coordinate accuracy. The authors analyze three prompting strategies: instruction-based, reasoning-based, and spatial elimination. The results show that invisible-region reasoning remains a weak point of current vision-language models and point to the need for better integration of task intent, spatial grounding, and uncertainty estimation.
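
The summary does not say how coordinate accuracy is computed. As a rough illustration only, the sketch below assumes one plausible definition: a prediction counts as correct when the predicted 2D point falls inside a ground-truth region, here simplified to an axis-aligned box. The names `Box` and `coordinate_accuracy` are hypothetical and do not come from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical ground-truth region in image coordinates (assumption)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        # Boundary points are treated as inside the region.
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def coordinate_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[Box],
) -> float:
    """Percentage of predicted points that land inside their target box.

    This is an assumed definition for illustration, not the benchmark's
    official metric.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        1 for (x, y), box in zip(predictions, targets) if box.contains(x, y)
    )
    return 100.0 * hits / len(predictions)


# Toy usage: one point inside its box, one outside.
example_predictions = [(5.0, 5.0), (20.0, 20.0)]
example_targets = [Box(0.0, 0.0, 10.0, 10.0), Box(0.0, 0.0, 10.0, 10.0)]
example_score = coordinate_accuracy(example_predictions, example_targets)
```

If the metric is indeed scored per instance in this way, 15.20% of 855 instances would correspond to about 130 correct predictions.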