How relationships between objects improve the detection of unseen categories

Yi Chen, Yinghao Lu, Zhehao Li, Chenchen Yan, Jiafei Wu, Chong Wang, Jiangbo Qian

Open-vocabulary object detection requires identifying objects from categories never seen during training. Most approaches distill knowledge from vision-language models but ignore how objects relate to each other, in particular their spatial arrangements and interactions. This work adds scene graphs to capture these structured relationships, using a Relation Attention Module to amplify relevant relational cues and a caption-based alignment branch to connect visual relationships with semantic knowledge. On COCO and LVIS, the method achieves higher accuracy on novel categories than comparable approaches.