A hierarchical framework for understanding scenes through objects, parts, and affordances

Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang, Hao Fei, Shanghang Zhang, Xingyu Chen

Current scene understanding systems recognize objects, ground text, and predict affordances as isolated tasks. This work proposes hierarchical scene parsing, which represents a scene as an explicit scene→object→part→affordance hierarchy with cross-level bindings, and releases SceneParser, a VLM-based parser trained with structural-completion pseudo labels and curriculum learning. SceneParser-Bench, a large-scale benchmark with 110K images, 1.74M part and affordance annotations, and novel metrics for localization and hierarchical completeness, reveals that existing multimodal models struggle with structured parsing. SceneParser achieves stronger performance and provides actionable representations compatible with downstream planning tasks.
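
To make the scene→object→part→affordance hierarchy concrete, the following is a minimal illustrative sketch of how such a nested representation might be encoded. All class names, field names, the bounding-box format, and the `flatten_bindings` helper are assumptions introduced here for illustration; they are not SceneParser's actual output schema or any part of SceneParser-Bench.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class Affordance:
    name: str  # e.g. "grasp"


@dataclass
class Part:
    name: str
    bbox: BBox
    affordances: List[Affordance] = field(default_factory=list)


@dataclass
class SceneObject:
    name: str
    bbox: BBox
    parts: List[Part] = field(default_factory=list)


@dataclass
class Scene:
    description: str
    objects: List[SceneObject] = field(default_factory=list)


def flatten_bindings(scene: Scene) -> List[Tuple[str, str, str]]:
    """Walk the hierarchy top-down and return (object, part, affordance) name triples."""
    return [
        (obj.name, part.name, aff.name)
        for obj in scene.objects
        for part in obj.parts
        for aff in part.affordances
    ]


# Hypothetical example scene: one mug with two parts, each bound to one affordance.
scene = Scene(
    description="kitchen counter",
    objects=[
        SceneObject(
            name="mug",
            bbox=(10, 20, 110, 140),
            parts=[
                Part("handle", (90, 50, 110, 110), [Affordance("grasp")]),
                Part("rim", (10, 20, 90, 35), [Affordance("drink")]),
            ],
        )
    ],
)

triples = flatten_bindings(scene)
```

In this sketch, each level holds its children directly, so a cross-level binding such as "the mug's handle affords grasping" is recovered by a single traversal. A downstream planner could consume the flattened triples, though how SceneParser exposes its output to planners is not specified here.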