cs.CV

How well can AI actually follow image editing instructions?

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen, Qianwen Ge, Shuo Xing, Mingyang Wu, Xiangbo Gao, Siyuan Yang, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhen Dong, Ming-Hsuan Yang, Zhengzhong Tu

May 30, 2026

Image editing has moved beyond simple filters to complex, multi-constraint tasks, such as removing an object while preserving its shadows or adjusting perspective while keeping proportions. CV-Arena benchmarks 21 AI systems on 12,000 high-resolution real images with natural-language instructions spanning 16 task types. The benchmark is built with a dual-track construction pipeline and scored with an "Active Elo" evaluation system that routes confident judgments to AI validators and ambiguous cases to expert raters. All systems failed at structural control and physical reasoning; a lightweight agentic model using iterative planning-editing-verification showed the most promise.
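To make the routing idea concrete, here is a minimal sketch of how a confidence-gated Elo loop might look. It uses the standard Elo update; the summary above does not give the paper's actual "Active Elo" formulation, so the K-factor, the confidence threshold, and the names `route_comparison` and `elo_update` are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: standard Elo plus an assumed confidence gate.
# None of these constants or function names come from the CV-Arena paper.

K_FACTOR = 32.0               # assumed update step size
CONFIDENCE_THRESHOLD = 0.8    # assumed cutoff for trusting the AI validator


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of system A against system B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = K_FACTOR) -> tuple[float, float]:
    """Return updated (A, B) ratings; score_a is 1.0 win, 0.5 tie, 0.0 loss."""
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def route_comparison(ai_confidence: float,
                     threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Send confident AI judgments to the AI validator, the rest to experts."""
    return "ai_validator" if ai_confidence >= threshold else "expert_rater"


if __name__ == "__main__":
    ratings = {"system_a": 1500.0, "system_b": 1500.0}
    # One pairwise comparison where the AI judge is fairly sure A's edit is better.
    judge = route_comparison(ai_confidence=0.92)
    ratings["system_a"], ratings["system_b"] = elo_update(
        ratings["system_a"], ratings["system_b"], score_a=1.0
    )
    print(judge, ratings)
```

In a scheme like this, the threshold trades expert-rater cost against the risk of propagating wrong AI judgments into the ratings; the real system may use a different criterion entirely.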
Published as "CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences" (arXiv:2606.00931).