Reasoning through images: making text-to-image models think step by step
Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du
May 14, 2026
Current text-to-image models generate an entire image in a single pass and struggle with complex scenes and semantic precision. CLVR couples visual-language reasoning with pixel-level diffusion generation, using step-level verification to prevent hallucinations and a reinforcement learning method (PPRL) to address the optimization instabilities that arise with long contexts. A weight-merging technique (DSWM) sharply reduces inference cost without retraining. In experiments, CLVR outperforms open-source baselines and approaches proprietary models such as DALL-E on multiple benchmarks, making step-by-step generation of complex images more practical to scale.