cs.CV

Reasoning through images: making text-to-image models think step by step

Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

May 14, 2026

Current text-to-image models generate an entire image in a single pass, so they struggle with complex scenes and semantic precision. CLVR couples visual-language reasoning with pixel-level diffusion generation. It uses step-level verification to prevent hallucinations and a reinforcement learning method (PPRL) to handle the optimization instabilities that arise with long contexts. A novel weight-merging technique (DSWM) cuts inference cost dramatically without retraining. In experiments, CLVR outperforms open-source baselines and approaches proprietary models such as DALL-E on multiple benchmarks, which makes scaling to complex image generation more practical.
Published as "Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning", arXiv:2605.14876