Can one model handle all scene text editing tasks?

Shuyu Wang, Zhile Guan, Hongxiu Chen, Yule Duan, Weiqi Li, Xin Shan, Ronggang Wang, Jian Zhang

TextWand unifies three scene text editing tasks (removal, generation, and replacement) in a single model. Its key innovations are a positional encoding method (ORPE) that anchors text to precise positions and matches its style to reference examples, and a suppression strategy (RAS) that cleanly erases text. The authors also built TextWand-Bench, which they describe as the first broad benchmark for this problem, and report that their model outperforms both open-source and closed-source competitors on text accuracy, layout consistency, and image quality.