Finding images by describing changes, without training
Miaoge Li, Dongsheng Wang, Zening Sun, Jinsen Zhang, Wenhan Luo, Jingcai Guo
May 20, 2026
Composed image retrieval, which searches for a target image using a reference image plus a text description of the desired change ("make it sunnier"), is tricky because captions generated by language models carry over unwanted details from the original image. This work refines LLM-generated captions by adjusting them in embedding space toward the actual target, then treats image-caption matching as a distribution alignment problem solved with optimal transport. Experiments show the method works across multiple composed image retrieval tasks without any model training.
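To make the "distribution alignment solved with optimal transport" idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each caption and each candidate image is represented by a small set of embedding vectors with uniform weights, uses cosine distance as the transport cost, and solves the entropic-regularized problem with Sinkhorn iterations. The function names (`l2_normalize`, `sinkhorn_plan`, `ot_score`) and the parameters `eps` and `n_iters` are illustrative choices and do not come from the paper.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic-regularized optimal transport plan between two uniform distributions.

    cost: (n, m) matrix of pairwise costs.
    eps: regularization strength (assumed value, not from the paper).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on caption-side vectors (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass on image-side vectors (assumption)
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows toward marginal a
        v = b / (K.T @ u)    # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def ot_score(caption_emb, image_emb, eps=0.1, n_iters=200):
    """Transport cost between two embedding sets; lower means a better match."""
    c = l2_normalize(caption_emb)
    g = l2_normalize(image_emb)
    cost = 1.0 - c @ g.T     # cosine distance as the ground cost (assumption)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return float(np.sum(plan * cost))
```

Under these assumptions, retrieval would rank every candidate image by `ot_score(caption_emb, image_emb)` and return the candidates with the lowest cost. No parameters are learned, which is consistent with the training-free setting described above.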