cs.CV

Teaching AI to plan and execute complex image edits step by step

Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee

May 14, 2026

Image-editing models struggle with abstract, multi-step instructions such as "make this more vegetarian-friendly." This work proposes a two-part system: a planner that decomposes a complex instruction into atomic edit steps, and an orchestrator that selects which tool to apply, and to which region, at each step. A vision-language model judges the quality of the edits against the original instruction, and the orchestrator uses these judgments as rewards to improve its decisions. Successful trajectories are then fed back to refine the planner. Because planning is coupled directly to reward-driven execution rather than to handcrafted rules or teacher imitation, the system produces more coherent edits on abstract instructions.
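The loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the planner and the vision-language judge are replaced by stubs, the orchestrator picks only a tool (not a region), and learning is a simple epsilon-greedy value update chosen here for brevity. All names (TOOLS, plan, judge, Orchestrator, run) are hypothetical.

```python
import random
from dataclasses import dataclass, field

# Hypothetical tool set; the summary does not list the real tools.
TOOLS = ["inpaint", "recolor", "remove"]


def plan(instruction: str) -> list[str]:
    """Stub planner: a real planner would be a learned model that
    decomposes the instruction into atomic edit steps."""
    return [f"step {i + 1} of '{instruction}'" for i in range(2)]


def judge(instruction: str, step: str, tool: str) -> float:
    """Stub for the vision-language judge. It arbitrarily rewards
    'recolor' so that the toy orchestrator has a signal to learn."""
    return 1.0 if tool == "recolor" else 0.0


@dataclass
class Orchestrator:
    """Toy orchestrator: keeps one value estimate per tool."""
    epsilon: float = 0.2   # exploration rate (assumed)
    lr: float = 0.5        # step size for the value update (assumed)
    values: dict = field(default_factory=lambda: {t: 0.0 for t in TOOLS})

    def choose(self, rng: random.Random) -> str:
        # Explore with probability epsilon, otherwise pick the best tool.
        if rng.random() < self.epsilon:
            return rng.choice(TOOLS)
        return max(TOOLS, key=lambda t: self.values[t])

    def update(self, tool: str, reward: float) -> None:
        # Move the tool's value estimate toward the observed reward.
        self.values[tool] += self.lr * (reward - self.values[tool])


def run(instruction: str, episodes: int = 200, seed: int = 0):
    rng = random.Random(seed)
    orch = Orchestrator()
    successes = []  # trajectories that could later refine the planner
    for _ in range(episodes):
        steps = plan(instruction)
        trajectory, total = [], 0.0
        for step in steps:
            tool = orch.choose(rng)
            reward = judge(instruction, step, tool)
            orch.update(tool, reward)
            trajectory.append((step, tool))
            total += reward
        # Keep only trajectories where every step earned full reward.
        if total == len(steps):
            successes.append(trajectory)
    return orch, successes


if __name__ == "__main__":
    orch, successes = run("make this more vegetarian-friendly")
    print(orch.values)
    print(len(successes), "successful trajectories collected")
```

In this sketch the collected successes list stands in for the data that would refine the planner. How the actual system performs that refinement, and how it represents regions, is not specified in the summary.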
Published as "From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing" (arXiv:2605.15181).