How well can AI actually follow image editing instructions?

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen, Qianwen Ge, Shuo Xing, Mingyang Wu, Xiangbo Gao, Siyuan Yang, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhen Dong, Ming-Hsuan Yang, Zhengzhong Tu

Image editing has moved beyond simple filters to complex, multi-constraint tasks, such as removing an object while preserving its surrounding shadows or adjusting perspective while keeping proportions. CV-Arena benchmarks 21 AI systems on 12,000 high-resolution real images with natural-language instructions spanning 16 task types. The benchmark is built with a dual-track construction pipeline and evaluated with an "Active Elo" system that routes confident judgments to AI validators and ambiguous cases to expert raters. All systems failed at structural control and physical reasoning; a lightweight agentic model using an iterative planning-editing-verification loop showed the most promise.
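
The routing idea behind an "Active Elo" evaluation can be illustrated with a minimal sketch. This is not the authors' implementation: the summary gives no formulas, so the sketch assumes the standard Elo update, a K-factor of 32, a starting rating of 1000, and a fixed confidence threshold of 0.8. All function names, parameters, and the `ask_expert` callback are hypothetical.

```python
from typing import Callable, Dict, Tuple

K_FACTOR = 32.0             # assumed; a common Elo default, not taken from the paper
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for trusting the AI validator


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo probability that system A beats system B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = K_FACTOR) -> Tuple[float, float]:
    """Return updated (A, B) ratings; score_a is 1.0 win, 0.5 tie, 0.0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def resolve_verdict(ai_score_a: float, ai_confidence: float,
                    ask_expert: Callable[[], float],
                    threshold: float = CONFIDENCE_THRESHOLD) -> Tuple[float, str]:
    """Keep the AI validator's verdict when it is confident; otherwise
    defer to an expert rater (hypothetical callback returning score_a)."""
    if ai_confidence >= threshold:
        return ai_score_a, "ai_validator"
    return ask_expert(), "expert_rater"


def rate_pair(ratings: Dict[str, float], system_a: str, system_b: str,
              ai_score_a: float, ai_confidence: float,
              ask_expert: Callable[[], float],
              initial: float = 1000.0) -> str:
    """Resolve one pairwise comparison, update ratings in place,
    and report which route produced the verdict."""
    score_a, route = resolve_verdict(ai_score_a, ai_confidence, ask_expert)
    r_a = ratings.get(system_a, initial)
    r_b = ratings.get(system_b, initial)
    ratings[system_a], ratings[system_b] = elo_update(r_a, r_b, score_a)
    return route


if __name__ == "__main__":
    ratings: Dict[str, float] = {}
    # Confident AI judgment: the expert callback is never invoked.
    print(rate_pair(ratings, "model_x", "model_y", 1.0, 0.95, lambda: 0.0))
    # Ambiguous case: the verdict comes from the expert callback instead.
    print(rate_pair(ratings, "model_x", "model_z", 1.0, 0.40, lambda: 0.5))
    print(ratings)
```

The design point the sketch tries to capture is that the rating update is the same regardless of who supplies the verdict; only the source of the verdict changes with confidence, which keeps expert effort focused on the comparisons the automated judge cannot settle.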