One task to train them all: teaching models to edit

Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li

Unified multimodal models usually require separate training for image understanding, generation, and editing, which creates conflicts that hurt performance. This work reframes image editing as a general task that inherently demands both visual understanding and generation. The authors build Uni-Edit-148k by converting visual question-answering data into complex editing instructions that carry reasoning logic, then show that training on this single task alone improves all three capabilities in both BAGEL and Janus-Pro, without auxiliary tricks.