cs.CV

Finding images by describing changes, without training

Miaoge Li, Dongsheng Wang, Zening Sun, Jinsen Zhang, Wenhan Luo, Jingcai Guo

May 20, 2026

Retrieving images from a text description of a desired change ("make it sunnier") is difficult because captions generated by language models tend to carry over unwanted details from the original image. This work refines the LLM-generated captions by shifting them in embedding space toward the actual target, then treats image-caption matching as a distribution alignment problem solved with optimal transport. Experiments show the method is effective across multiple composed image retrieval tasks without any model training.
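To make the optimal-transport matching idea concrete, here is a minimal sketch and not the paper's implementation. It assumes each image and each caption is represented as a small set of token embeddings, uses cosine distance as the transport cost, uniform marginals, and entropic (Sinkhorn) regularization. The function names `sinkhorn_plan` and `ot_similarity`, the `reg` and `n_iters` parameters, and all of these modeling choices are assumptions introduced for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    # Entropic-regularized optimal transport via Sinkhorn iterations.
    # cost: (n, m) cost matrix; a: (n,) and b: (m,) are marginals summing to 1.
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    # Transport plan: entry (i, j) is the mass moved from image token i to caption token j.
    return u[:, None] * K * v[None, :]

def ot_similarity(image_tokens, caption_tokens, reg=0.1):
    # Hypothetical scoring function: the negative transport cost between two
    # sets of embeddings, so a higher score means a better match.
    x = l2_normalize(np.asarray(image_tokens, dtype=float))
    y = l2_normalize(np.asarray(caption_tokens, dtype=float))
    cost = 1.0 - x @ y.T  # cosine distance, in [0, 2]
    a = np.full(x.shape[0], 1.0 / x.shape[0])  # uniform mass over image tokens (assumption)
    b = np.full(y.shape[0], 1.0 / y.shape[0])  # uniform mass over caption tokens (assumption)
    plan = sinkhorn_plan(cost, a, b, reg=reg)
    return -float(np.sum(plan * cost))
```

Under these assumptions, retrieval would amount to computing `ot_similarity` between the refined caption's token embeddings and each candidate image's token embeddings, then ranking candidates by score. No parameters are learned, which is consistent with the training-free setting described above.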
Published as "STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval" (arXiv:2605.21261)