Can image editing help vision models reason better?

Multimodal language models struggle with reasoning tasks that require precise visual focus or image transformation. ETCHR trains a dedicated image editor that transforms an image according to the question being asked, rather than a generic instruction, and then feeds the clearer result to any vision model. Training has two stages: the editor first learns to imitate correct edit sequences, then is optimized for downstream accuracy. The downstream model needs no retraining: ETCHR works plug-and-play with Qwen, Gemini, and Kimi models, lifting performance by 4.6–5.5 percentage points on fine-grained perception, charts, logic, jigsaw, and 3D tasks.
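
The edit-then-answer flow can be illustrated with a minimal Python sketch. This is not ETCHR's code: `edit_then_answer`, `toy_editor`, and `toy_model` are hypothetical names, the "image" is a small grid of integers, and the editor is a hand-written crop standing in for a trained, question-conditioned editor. It only shows the plug-and-play shape of the pipeline, where the answering model is left untouched and simply receives the edited image.

```python
from typing import Callable, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values

# Hypothetical interfaces (assumptions for illustration, not ETCHR's API).
Editor = Callable[[Image, str], Image]     # (image, question) -> edited image
VisionModel = Callable[[Image, str], str]  # (image, question) -> answer


def edit_then_answer(image: Image, question: str,
                     editor: Editor, model: VisionModel) -> str:
    """Edit the image based on the question, then query the unchanged model."""
    edited = editor(image, question)
    return model(edited, question)


def toy_editor(image: Image, question: str) -> Image:
    """Stand-in editor: crop to the half of the image the question mentions."""
    width = len(image[0])
    if "left" in question:
        return [row[: width // 2] for row in image]
    if "right" in question:
        return [row[width // 2:] for row in image]
    return image  # no region mentioned: leave the image as-is


def toy_model(image: Image, question: str) -> str:
    """Stand-in frozen model: reports the brightest pixel it sees,
    ignoring any region named in the question."""
    return str(max(max(row) for row in image))


image = [[1, 2, 9, 3],
         [4, 5, 6, 7]]
question = "What is the brightest value in the left half?"

without_edit = toy_model(image, question)
with_edit = edit_then_answer(image, question, toy_editor, toy_model)
print(without_edit, with_edit)
```

In this toy setup the model alone is distracted by the bright pixel in the right half, while the same model given the cropped image answers about the left half only. Because the editor and the model interact only through the edited image, either one can be swapped without changing the other.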