Teaching AI to find images by what you want to change about them
Xingtian Pei, Yukun Song, Changwei Wang, Shunpeng Chen, Rongtao Xu, Shibiao Xu
May 21, 2026
Zero-shot compositional image retrieval asks: given a reference photo and text describing the desired changes (e.g., "same person, but smiling"), find the images that match. Existing methods fall short because they either get tunnel vision in a single search space or drift off target when they try to iterate. This paper proposes PDF, a hierarchical multi-agent framework in which different perceptual workers propose candidates and a decision manager then refines the results through a tournament-style voting process, all at test time and without retraining. The authors report state-of-the-art results on the CIRR, CIRCO, and FashionIQ benchmarks and say they will release code.
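To make the "workers propose, manager votes" idea concrete, here is a toy sketch in Python. It is not the paper's implementation: the summary does not say how PDF's workers, manager, or voting rule actually work, so every name here (`pool_candidates`, `vote`, `tournament`, the worker and judge functions, and the hard-coded scores) is a made-up assumption. It only shows one plausible shape for the pattern: several proposers contribute candidates, then a single-elimination bracket decides each pairing by majority vote.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces, not taken from the paper.
Worker = Callable[[str, str], List[str]]   # (reference_id, edit_text) -> candidate ids
Judge = Callable[[str, str, str], float]   # (reference_id, edit_text, candidate_id) -> score


def pool_candidates(workers: List[Worker], reference_id: str, edit_text: str) -> List[str]:
    """Merge every worker's proposals, dropping duplicates, keeping first-seen order."""
    pooled: List[str] = []
    for worker in workers:
        for cand in worker(reference_id, edit_text):
            if cand not in pooled:
                pooled.append(cand)
    return pooled


def vote(judges: List[Judge], reference_id: str, edit_text: str, a: str, b: str) -> str:
    """Return whichever of a or b most judges score higher; ties go to a."""
    votes_a = sum(
        1 for judge in judges
        if judge(reference_id, edit_text, a) >= judge(reference_id, edit_text, b)
    )
    return a if votes_a * 2 >= len(judges) else b


def tournament(candidates: List[str], judges: List[Judge],
               reference_id: str, edit_text: str) -> str:
    """Single-elimination bracket over the candidates; returns the overall winner."""
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        nxt = [
            vote(judges, reference_id, edit_text, current[i], current[i + 1])
            for i in range(0, len(current) - 1, 2)
        ]
        if len(current) % 2 == 1:
            nxt.append(current[-1])  # odd candidate out advances with a bye
        current = nxt
    return current[0]


# Toy stand-ins: real workers and judges would call vision or language models.
def text_worker(reference_id: str, edit_text: str) -> List[str]:
    return ["img_smile_other", "img_smile_same"]


def visual_worker(reference_id: str, edit_text: str) -> List[str]:
    return ["img_same_neutral", "img_smile_same"]


def table_judge(scores: Dict[str, float]) -> Judge:
    """Build a judge that looks scores up in a fixed, invented table."""
    return lambda reference_id, edit_text, cand: scores[cand]


WORKERS = [text_worker, visual_worker]
JUDGES = [
    table_judge({"img_smile_other": 0.1, "img_smile_same": 0.9, "img_same_neutral": 0.95}),
    table_judge({"img_smile_other": 0.9, "img_smile_same": 0.85, "img_same_neutral": 0.1}),
    table_judge({"img_smile_other": 0.4, "img_smile_same": 0.9, "img_same_neutral": 0.5}),
]

candidates = pool_candidates(WORKERS, "ref_001", "same person, but smiling")
winner = tournament(candidates, JUDGES, "ref_001", "same person, but smiling")
print(candidates, winner)
```

In this toy run, the candidate that is both the same person and smiling wins because two of the three invented judges prefer it in each pairing it plays. Pooling proposals from more than one worker is a simple way to avoid relying on a single search space, and pairwise voting avoids having to put every judge's scores on one comparable scale.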