cs.CV

Training text-to-image models by fixing the decoder problem

Siyong Jian, Siyuan Li, Luyuan Zhang, Zedong Wang, Xin Jin, Ying Li, Cheng Tan, Huan Wang

May 20, 2026

Standard post-training for discrete autoregressive text-to-image models optimizes only the policy and keeps the decoder frozen, which creates a hidden mismatch: as the policy evolves, the tokens it generates drift away from what the decoder was trained on, so image quality degrades even as reward scores climb. RankE addresses this by alternately refining the policy and the decoder with ranking-based alignment objectives, breaking the fidelity–reward trade-off. On LlamaGen-XL, it improves FID to 15.21 and CLIP score to 33.76 on MS-COCO, and the gains are confirmed across model sizes.
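The alternating scheme described above can be illustrated with a small sketch. This summary does not state the exact form of RankE's objectives, so everything here is an assumption: the loss is a generic Bradley-Terry style pairwise ranking loss, and `pairwise_ranking_loss`, `alternating_schedule`, and their parameters are hypothetical names chosen for illustration, not the authors' implementation.

```python
import math


def pairwise_ranking_loss(score_preferred, score_rejected):
    """Generic pairwise ranking loss (an assumed stand-in, not the paper's loss).

    Computes -log(sigmoid(score_preferred - score_rejected)), which is small
    when the preferred sample is scored above the rejected one.
    """
    margin = score_preferred - score_rejected
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def alternating_schedule(num_rounds, policy_steps, decoder_steps):
    """List which component is updated at each step (hypothetical schedule).

    In each round the policy is updated for `policy_steps` steps while the
    decoder is held fixed, then the decoder is updated for `decoder_steps`
    steps while the policy is held fixed.
    """
    schedule = []
    for _ in range(num_rounds):
        schedule += ["policy"] * policy_steps
        schedule += ["decoder"] * decoder_steps
    return schedule


if __name__ == "__main__":
    # A correctly ordered pair is penalized less than a misordered one.
    print(pairwise_ranking_loss(2.0, 0.0) < pairwise_ranking_loss(0.0, 2.0))
    print(alternating_schedule(num_rounds=2, policy_steps=2, decoder_steps=1))
```

In a real training loop each schedule entry would trigger a gradient step on the named component using a ranking loss over sampled image pairs. The step counts and the pairing strategy are left open here because the summary does not specify them.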
Published as "RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution", arXiv:2605.21195