cs.CV

How to orchestrate multiple vision specialists for complex visual reasoning?

Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan

May 28, 2026

Most vision models are optimized for a single task in isolation and struggle with problems that require reasoning and language understanding together. VisHarness trains a lightweight agent to coordinate multiple specialized models (detection, segmentation, and counting experts) rather than retraining everything from scratch. Using a memory-efficient approach to manage multi-turn interactions with the experts, it beats general-purpose models and matches task-specific ones on benchmarks covering four tasks: reasoning segmentation, referring segmentation, small-object detection, and referring counting.
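To make the orchestration idea concrete, here is a minimal Python sketch of an agent that routes queries to frozen experts over several turns while keeping only a bounded history. It is an illustration under stated assumptions, not the VisHarness implementation: the expert functions, the `Agent` class, the fixed `plan`, and the `max_memory` truncation rule are all hypothetical stand-ins. In particular, the summary does not say how the memory-efficient mechanism works, and in the paper the agent is trained to pick its next call, whereas this sketch follows a hand-written plan.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical expert interface: (image_id, query) -> short text result.
# Real experts would be frozen detection/segmentation/counting models.
Expert = Callable[[str, str], str]


def detect(image_id: str, query: str) -> str:
    return f"boxes for '{query}'"


def segment(image_id: str, query: str) -> str:
    return f"mask for '{query}'"


def count(image_id: str, query: str) -> str:
    return f"count for '{query}'"


EXPERTS: Dict[str, Expert] = {"detect": detect, "segment": segment, "count": count}


@dataclass
class Turn:
    expert: str
    query: str
    result: str


@dataclass
class Agent:
    experts: Dict[str, Expert]
    # Assumption: bound the context by keeping only the most recent turns.
    max_memory: int = 2
    history: List[Turn] = field(default_factory=list)

    def call(self, expert: str, image_id: str, query: str) -> str:
        """Invoke one expert and record the turn in bounded memory."""
        result = self.experts[expert](image_id, query)
        self.history.append(Turn(expert, query, result))
        self.history = self.history[-self.max_memory:]
        return result

    def run(self, image_id: str, plan: List[Tuple[str, str]]) -> str:
        """Execute a fixed plan of (expert, query) steps; a trained agent
        would choose each step from the history instead."""
        result = ""
        for expert, query in plan:
            result = self.call(expert, image_id, query)
        return result


agent = Agent(experts=EXPERTS, max_memory=2)
answer = agent.run(
    "img_001",
    [("detect", "red cars"), ("segment", "red cars"), ("count", "red cars")],
)
```

The point of the sketch is the division of labor: the experts stay untouched, and only the component that decides which expert to call, and with what query, would be learned.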
Published as "Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning" (arXiv:2605.29894)