Teaching AI to plan and execute complex image edits step by step
Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee
May 14, 2026
Image editors struggle with abstract, multi-step instructions like "make this more vegetarian-friendly." This work proposes a two-part system: a planner that decomposes a complex task into atomic steps, and an orchestrator that selects which tool to apply, and to which region, at each step. A vision-language model scores the resulting edits against the original instruction, and the orchestrator uses these scores as rewards to improve its decisions. Successful trajectories are then fed back to refine the planner. By coupling planning directly to reward-driven execution rather than relying on handcrafted rules or teacher imitation, the system produces more coherent edits on abstract instructions.
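The loop described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the authors' implementation: the planner is a lookup table, the orchestrator is a table of value estimates with optimistic initialization rather than a trained model, the judge is a stub standing in for a vision-language model, and rewards are given per step for simplicity. All names (`plan`, `Orchestrator`, `toy_judge`, `TOOLS`, the example steps) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

TOOLS = ["inpaint", "remove", "recolor"]  # hypothetical tool names


@dataclass
class Action:
    tool: str
    region: str  # text placeholder for a mask or bounding box


@dataclass
class Trajectory:
    instruction: str
    steps: List[str]
    actions: List[Action]
    reward: float


def plan(instruction: str) -> List[str]:
    """Toy planner: a lookup table stands in for a learned model."""
    table = {
        "make this more vegetarian-friendly": [
            "replace the steak with tofu",   # hypothetical atomic step
            "remove the bacon",              # hypothetical atomic step
        ]
    }
    return table.get(instruction, [instruction])


class Orchestrator:
    """Picks a tool per step and updates value estimates from rewards."""

    def __init__(self, lr: float = 0.5, init: float = 1.0):
        self.values: Dict[Tuple[str, str], float] = {}
        self.lr, self.init = lr, init  # optimistic init drives exploration

    def act(self, step: str) -> Action:
        verb = step.split()[0]
        tool = max(TOOLS, key=lambda t: self.values.get((verb, t), self.init))
        return Action(tool=tool, region=step.split()[-1])

    def update(self, step: str, action: Action, reward: float) -> None:
        key = (step.split()[0], action.tool)
        old = self.values.get(key, self.init)
        self.values[key] = old + self.lr * (reward - old)


def toy_judge(instruction: str, step: str, action: Action) -> float:
    """Stub for the vision-language judge: 1.0 if the tool suits the step."""
    wanted = {"replace": "inpaint", "remove": "remove"}
    return 1.0 if wanted.get(step.split()[0]) == action.tool else 0.0


def run_episode(instruction: str, orch: Orchestrator,
                judge: Callable[[str, str, Action], float],
                buffer: List[Trajectory], threshold: float = 0.99) -> Trajectory:
    steps = plan(instruction)
    actions, rewards = [], []
    for step in steps:
        action = orch.act(step)
        r = judge(instruction, step, action)
        orch.update(step, action, r)  # reward-driven execution
        actions.append(action)
        rewards.append(r)
    traj = Trajectory(instruction, steps, actions, sum(rewards) / len(rewards))
    if traj.reward >= threshold:
        buffer.append(traj)  # successful trajectories would refine the planner
    return traj


orch, buffer = Orchestrator(), []
history = [run_episode("make this more vegetarian-friendly", orch, toy_judge, buffer).reward
           for _ in range(3)]
print(history)  # [0.5, 1.0, 1.0]
```

In this sketch the first episode picks the wrong tool for the removal step, the low reward lowers that tool's estimate, and later episodes succeed and land in `buffer`. A real system would replace the buffer append with a planner update, a step this sketch leaves out.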