Teaching AI to plan and execute complex image edits step by step

Image-editing models struggle with abstract, multi-step instructions like "make this more vegetarian-friendly." This work proposes a two-part system: a planner that decomposes a complex task into atomic steps, and an orchestrator that selects which tool to apply, and to which region, at each step. A vision-language model scores each edit against the original instruction, and the orchestrator uses these scores as rewards to improve its decisions. Successful trajectories then feed back to refine the planner. By coupling planning directly to reward-driven execution rather than relying on handcrafted rules or teacher imitation, the system produces more coherent edits on abstract instructions.
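To make the loop concrete, here is a minimal Python sketch of the plan, orchestrate, judge, and feed-back cycle. It is a toy under stated assumptions, not the system's actual implementation: the tool names, region labels, the lookup-table planner, and the hard-coded judge are all hypothetical stand-ins, and the orchestrator is a simple greedy bandit with optimistic initial values, swapped in for whatever learning method the work actually uses.

```python
from collections import defaultdict
from dataclasses import dataclass, field

TOOLS = ["inpaint", "remove", "recolor"]   # hypothetical tool names
REGIONS = ["plate", "background"]          # hypothetical region labels


def plan(instruction):
    """Stand-in planner: maps an abstract instruction to atomic steps."""
    # A real planner would be a learned model; this lookup is illustrative.
    table = {
        "make this more vegetarian-friendly": [
            "remove the meat",
            "add vegetables",
        ]
    }
    return table.get(instruction, [instruction])


def judge(instruction, step, tool, region):
    """Stand-in for the vision-language judge: returns a reward in [0, 1]."""
    # Hard-coded preferences replace an actual comparison of edited images.
    good = {
        "remove the meat": ("remove", "plate"),
        "add vegetables": ("inpaint", "plate"),
    }
    return 1.0 if good.get(step) == (tool, region) else 0.0


@dataclass
class Orchestrator:
    """Greedy bandit over (tool, region) pairs, one value table per step."""
    # Optimistic initial value 1.0 makes untried actions look attractive,
    # so every action gets tried before a poor one is repeated.
    value: dict = field(default_factory=lambda: defaultdict(lambda: 1.0))
    count: dict = field(default_factory=lambda: defaultdict(int))

    def actions(self):
        return [(t, r) for t in TOOLS for r in REGIONS]

    def choose(self, step):
        # Pick the action with the highest estimated reward for this step.
        return max(self.actions(), key=lambda a: self.value[(step, a)])

    def update(self, step, action, reward):
        # Incremental mean of the rewards observed for (step, action).
        key = (step, action)
        self.count[key] += 1
        self.value[key] += (reward - self.value[key]) / self.count[key]


def run(instruction, episodes=5, threshold=1.0):
    orch = Orchestrator()
    successes = []  # trajectories that would be fed back to the planner
    for _ in range(episodes):
        trajectory = []
        for step in plan(instruction):
            action = orch.choose(step)
            reward = judge(instruction, step, *action)
            orch.update(step, action, reward)
            trajectory.append((step, action, reward))
        mean_reward = sum(r for _, _, r in trajectory) / len(trajectory)
        if mean_reward >= threshold:
            successes.append(trajectory)
    return orch, successes


if __name__ == "__main__":
    orch, successes = run("make this more vegetarian-friendly")
    print(orch.choose("remove the meat"))
    print(len(successes), "successful trajectories collected")
```

In this toy, the orchestrator discards tool and region choices that the judge scores poorly and settles on the ones it rewards. The list of successful trajectories marks where a real system would refine the planner; the sketch only collects them.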