Teaching AI to find images by what you want to change about them

Zero-shot compositional image retrieval asks: given a reference photo and text describing changes (e.g., "same person, but smiling"), find matching images. Existing methods fail because they either get tunnel vision in one search space or drift when trying to iterate. This paper proposes PDF, a hierarchical multi-agent framework where different perceptual workers propose candidates, then a decision manager uses a tournament-style voting process to refine results—all at test time, without retraining. The approach hits state-of-the-art on CIRR, CIRCO, and FashionIQ benchmarks and will release code.