How can multiple vision specialists be orchestrated for complex visual reasoning?

Most vision models are optimized for a single task in isolation and struggle with problems that require reasoning and language understanding together. VisHarness trains a lightweight agent to coordinate multiple specialized models (detection, segmentation, and counting experts) rather than retraining everything from scratch. Using a memory-efficient approach to manage multi-turn interactions with the experts, it outperforms general-purpose models and matches task-specific ones on four benchmarks: reasoning segmentation, referring segmentation, small-object detection, and referring counting.
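
To make the coordination idea concrete, here is a minimal sketch of a multi-turn loop in which an agent picks one expert per turn and keeps only a short history of compact summaries. Everything in it is an assumption made for illustration: the expert stubs, `toy_policy`, `run_agent`, and the bounded `deque` history are hypothetical names and choices, not VisHarness's actual interface. In particular, the fixed-size history is a simple stand-in for "memory-efficient"; the summary above does not say how VisHarness manages memory, and the hand-written plan replaces what would be a trained agent.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Optional, Tuple


# Hypothetical expert stubs. Real experts would run detection, segmentation,
# or counting models; these only return a short text summary.
def detect(image: str, query: str) -> str:
    return f"detected 3 boxes for '{query}'"


def segment(image: str, query: str) -> str:
    return f"segmented 1 mask for '{query}'"


def count(image: str, query: str) -> str:
    return f"counted 3 instances of '{query}'"


EXPERTS: Dict[str, Callable[[str, str], str]] = {
    "detect": detect,
    "segment": segment,
    "count": count,
}


@dataclass
class Turn:
    expert: str   # which expert was called
    summary: str  # compact summary of its output (not the raw output)


def toy_policy(task: str, turn: int, history: List[Turn]) -> Optional[str]:
    """Stand-in for the trained agent: return the next expert, or None to stop.

    A learned agent would condition on `history`; this toy version ignores it
    and follows a fixed, hand-written plan per task.
    """
    plans = {
        "referring counting": ["detect", "count"],
        "reasoning segmentation": ["detect", "segment"],
    }
    steps = plans.get(task, [])
    return steps[turn] if turn < len(steps) else None


def run_agent(
    task: str,
    image: str,
    query: str,
    max_turns: int = 4,
    memory_size: int = 2,
) -> Tuple[str, List[Turn]]:
    # Assumed memory strategy: keep only the last `memory_size` summaries.
    history: Deque[Turn] = deque(maxlen=memory_size)
    for turn in range(max_turns):
        name = toy_policy(task, turn, list(history))
        if name is None:
            break
        summary = EXPERTS[name](image, query)
        history.append(Turn(name, summary))
    final = history[-1].summary if history else "no expert called"
    return final, list(history)


if __name__ == "__main__":
    answer, trace = run_agent("referring counting", "img.jpg", "red cars")
    print(answer)
    print([t.expert for t in trace])
```

The design point the sketch tries to show is the separation of roles: the experts stay frozen behind a uniform call signature, and only the small policy that chooses among them would need training.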