cs.CV

Teaching AI to find images by what you want to change about them

Xingtian Pei, Yukun Song, Changwei Wang, Shunpeng Chen, Rongtao Xu, Shibiao Xu

May 21, 2026

Zero-shot compositional image retrieval asks: given a reference photo and text describing a change (e.g., "same person, but smiling"), find the images that match. Existing methods fail because they either get tunnel vision in a single search space or drift when they try to iterate. This paper proposes PDF, a hierarchical multi-agent framework in which several perceptual workers propose candidates and a decision manager then refines the results through a tournament-style voting process, all at test time and without retraining. The authors report state-of-the-art results on the CIRR, CIRCO, and FashionIQ benchmarks and plan to release code.
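To make the propose-then-vote idea concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the worker functions, the pairwise `judge`, the round-robin voting rule, and every name below are assumptions made purely for illustration. Here, workers are plain functions that return ranked image IDs, and the "manager" ranks the pooled candidates by how many pairwise comparisons each one wins.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical types (assumptions, not from the paper):
# a worker maps a query string to a ranked list of candidate image IDs,
# and a judge picks the preferred candidate out of a pair.
Worker = Callable[[str], List[str]]
Judge = Callable[[str, str], str]


def pool_candidates(workers: List[Worker], query: str, top_k: int) -> List[str]:
    """Merge each worker's top_k proposals, dropping duplicates, keeping first-seen order."""
    pooled: Dict[str, None] = {}
    for worker in workers:
        for image_id in worker(query)[:top_k]:
            pooled.setdefault(image_id, None)
    return list(pooled)


def tournament_rank(candidates: List[str], judge: Judge) -> List[str]:
    """Round-robin vote: every pair is compared once; rank by number of wins."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[judge(a, b)] += 1
    # sorted() is stable, so ties keep their pooled order.
    return sorted(candidates, key=lambda c: -wins[c])


# Toy data standing in for real retrievers and a real judging model.
toy_scores = {"img_a": 0.2, "img_b": 0.9, "img_c": 0.5, "img_d": 0.7}


def worker_1(query: str) -> List[str]:
    return ["img_a", "img_b", "img_c"]


def worker_2(query: str) -> List[str]:
    return ["img_c", "img_d"]


def judge(a: str, b: str) -> str:
    return a if toy_scores[a] >= toy_scores[b] else b


pooled = pool_candidates([worker_1, worker_2], "same person, but smiling", top_k=3)
ranked = tournament_rank(pooled, judge)
print(ranked)
```

In a real system the judge would presumably be a model comparing each candidate against the reference image and the requested change; the fixed score table here only keeps the example deterministic.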
Published as "Matching with Deliberation: Test-Time Evolutionary Hierarchical Multi-Agents for Zero-Shot Compositional Image Retrieval" (arXiv:2605.22478)