Finding images by describing changes, without training

Composed image retrieval searches for a target image using a reference image plus a text description of the desired change (for example, "make it sunnier"). This is tricky because the captions that large language models (LLMs) generate for the target tend to carry over unwanted details from the original image. This work refines the LLM-generated captions by adjusting them in embedding space toward the actual target, then treats image-caption matching as a distribution alignment problem solved with optimal transport. Experiments show the method works across multiple composed image retrieval tasks without any model training.
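To make the optimal-transport matching step more concrete, here is a minimal sketch, not the authors' implementation. It assumes each caption and each candidate image is already represented as a small set of embedding vectors (for example, token or patch embeddings from a frozen vision-language encoder), uses a cosine cost with uniform weights, and solves the entropy-regularized transport problem with plain Sinkhorn iterations. The function names, the `reg` value, and the uniform weighting are all illustrative assumptions. The caption refinement step is not shown, since the summary does not specify how it is done.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (n, d) and rows of y (m, d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized transport plan between two uniform distributions.

    Assumption: uniform weights on both sides; `reg` and `n_iters` are
    illustrative defaults, not values taken from the paper.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass on each caption vector
    b = np.full(m, 1.0 / m)      # mass on each image vector
    K = np.exp(-cost / reg)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):     # alternate scaling to match both marginals
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def ot_distance(caption_emb, image_emb, reg=0.1):
    """Transport cost between a caption's vectors and an image's vectors."""
    cost = cosine_cost(caption_emb, image_emb)
    plan = sinkhorn_plan(cost, reg)
    return float(np.sum(plan * cost))


def rank_candidates(caption_emb, candidates, reg=0.1):
    """Return candidate indices sorted from best match (lowest cost) to worst."""
    dists = [ot_distance(caption_emb, c, reg) for c in candidates]
    return np.argsort(dists)
```

A retrieval run under these assumptions would call `rank_candidates(caption_emb, gallery)` with the (refined) caption embeddings and a list of per-image embedding arrays, then return the top-ranked images. Because nothing here has trainable parameters, the sketch is consistent with the training-free setting described above.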