Can image editing help vision models reason better?
Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin
May 22, 2026
Multimodal language models struggle with reasoning tasks that need precise visual focus or transformation. ETCHR trains a dedicated image editor to transform images based on the question being asked, rather than on generic editing instructions, and then feeds the clearer result to any vision model. Training has two stages: the editor first imitates correct edit sequences, then is optimized for downstream accuracy. The vision model itself needs no retraining: ETCHR works plug-and-play with Qwen, Gemini, and Kimi models, lifting performance by 4.6–5.5 percentage points on fine-grained perception, chart, logic, jigsaw, and 3D tasks.
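The plug-and-play arrangement described above can be pictured as a thin wrapper: a question-conditioned editor runs first, and an unmodified vision model then answers from the edited image. The following is a minimal sketch under stated assumptions, not the paper's implementation. The names `answer_with_editing`, `toy_editor`, and `toy_model` are hypothetical, and the "image" is a toy list of text rows standing in for real pixels.

```python
from typing import Callable, List

# Hypothetical stand-in types: a real system would pass pixel data, not text rows.
Image = List[str]
Editor = Callable[[Image, str], Image]      # (image, question) -> edited image
VisionModel = Callable[[Image, str], str]   # (image, question) -> answer


def answer_with_editing(image: Image, question: str,
                        editor: Editor, vision_model: VisionModel) -> str:
    """Edit the image for this question, then query the frozen vision model."""
    edited = editor(image, question)
    return vision_model(edited, question)


def toy_editor(image: Image, question: str) -> Image:
    """Toy 'edit': keep only rows that mention a word from the question."""
    words = question.lower().split()
    kept = [row for row in image if any(w in row.lower() for w in words)]
    return kept or image  # fall back to the unedited image if nothing matches


def toy_model(image: Image, question: str) -> str:
    """Toy 'vision model' that can only attend to the first row it sees."""
    return image[0]


image = ["sky: blue", "car: red", "tree: green"]
question = "car colour"

baseline = toy_model(image, question)
with_edit = answer_with_editing(image, question, toy_editor, toy_model)
print(baseline, "|", with_edit)
```

Because the editor and the vision model only meet at the function boundary, swapping in a different vision model means passing a different callable, which is the sense in which no retraining of that model is required.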