How to orchestrate multiple vision specialists for complex visual reasoning?
Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan
May 28, 2026
Most vision models are optimized for a single task in isolation and struggle with problems that require visual reasoning and language understanding together. VisHarness trains a lightweight agent to coordinate multiple specialized models (detection, segmentation, and counting experts) rather than retraining everything from scratch. Using a memory-efficient approach to manage multi-turn interactions with the experts, it outperforms general-purpose models and matches task-specific ones on four benchmarks: reasoning segmentation, referring segmentation, small-object detection, and referring counting.
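To make the coordination idea concrete, here is a minimal sketch of an agent loop that routes queries to specialist models over several turns. It is not the VisHarness implementation: the `Orchestrator` class, the stub experts, the fixed plan (standing in for the trained agent's decisions), and the `max_history` truncation (standing in for the memory-efficient mechanism, which the summary does not specify) are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical expert interface: a text query in, a short text result out.
Expert = Callable[[str], str]


@dataclass
class Orchestrator:
    """Toy coordinator that dispatches queries to named experts."""

    experts: Dict[str, Expert]
    max_history: int = 4  # assumption: keep only the most recent turns
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def call(self, name: str, query: str) -> str:
        """Run one turn: invoke an expert and record the exchange."""
        if name not in self.experts:
            raise KeyError(f"unknown expert: {name}")
        result = self.experts[name](query)
        self.history.append((name, query, result))
        # Bounded memory: drop the oldest turns beyond max_history.
        self.history = self.history[-self.max_history:]
        return result

    def run(self, plan: List[Tuple[str, str]]) -> List[str]:
        """Execute a fixed plan of (expert, query) turns in order."""
        return [self.call(name, query) for name, query in plan]


# Stub experts standing in for real detection/segmentation/counting models.
def detect(query: str) -> str:
    return f"boxes for '{query}'"


def segment(query: str) -> str:
    return f"mask for '{query}'"


def count(query: str) -> str:
    return f"count for '{query}'"


orchestrator = Orchestrator(
    experts={"detect": detect, "segment": segment, "count": count},
    max_history=2,
)
outputs = orchestrator.run(
    [("detect", "red cars"), ("segment", "red cars"), ("count", "red cars")]
)
```

In this sketch the plan is hard-coded; in a trained system the agent would presumably choose the next expert and query based on the preceding results.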