cs.CL

Can image editing help vision models reason better?

Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin

May 22, 2026

Multimodal language models struggle with reasoning tasks that require precise visual focus or a transformation of the image. ETCHR trains a dedicated image editor that transforms an image according to the question being asked, rather than a generic editing instruction, and then passes the clearer result to any vision model. Training has two stages: the editor first learns to imitate correct edit sequences, then is optimized for downstream answer accuracy. The downstream model needs no retraining: ETCHR works plug-and-play with Qwen, Gemini, and Kimi models, lifting performance by 4.6–5.5 percentage points on fine-grained perception, charts, logic, jigsaw, and 3D tasks.
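The plug-and-play idea can be illustrated with a minimal sketch: a question-conditioned editor runs first, and an unchanged downstream model answers from the edited image. Everything here is an assumption made for illustration, not ETCHR's actual interface: the `Image` type, the `answer_with_editing` function, and the toy `crop_top_left_editor` and `brightest_pixel_model` stand-ins are all hypothetical, and the real editor is a learned model rather than a hand-written rule.

```python
from typing import Callable

# Hypothetical image type for this sketch: a 2D grid of pixel intensities.
Image = list[list[int]]

# Hypothetical interfaces (assumptions, not ETCHR's API).
Editor = Callable[[Image, str], Image]      # (image, question) -> edited image
VisionModel = Callable[[Image, str], str]   # (image, question) -> answer


def answer_with_editing(image: Image, question: str,
                        editor: Editor, model: VisionModel) -> str:
    """Edit the image conditioned on the question, then query the frozen model."""
    edited = editor(image, question)
    return model(edited, question)


def crop_top_left_editor(image: Image, question: str) -> Image:
    """Toy stand-in for a learned editor: crop to the top-left quadrant
    when the question mentions it, otherwise return the image unchanged."""
    if "top-left" in question:
        height = max(1, len(image) // 2)
        width = max(1, len(image[0]) // 2)
        return [row[:width] for row in image[:height]]
    return image


def brightest_pixel_model(image: Image, question: str) -> str:
    """Toy stand-in for a frozen vision model: it ignores the question
    and reports the largest pixel value it can see."""
    return str(max(max(row) for row in image))
```

In this toy setup the downstream model answers incorrectly on the full image because it cannot focus on a region, while the same unmodified model answers correctly once the editor has cropped the image for it. Swapping in a different `model` callable requires no change to the editor, which is the sense in which the pipeline is plug-and-play.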
Published as "ETCHR: Editing To Clarify and Harness Reasoning" (arXiv:2605.23897)