Training text-to-image models by fixing the decoder problem

Siyong Jian, Siyuan Li, Luyuan Zhang, Zedong Wang, Xin Jin, Ying Li, Cheng Tan, Huan Wang

Standard post-training for discrete autoregressive text-to-image models optimizes only the policy and keeps the decoder frozen. This creates a hidden mismatch: as the policy evolves, the tokens it generates drift away from the distribution the decoder was trained on, so image quality degrades even as reward scores climb. RankE addresses this by alternately refining the policy and the decoder with ranking-based alignment objectives, breaking the fidelity–reward trade-off. On LlamaGen-XL, it improves FID to 15.21 and CLIP score to 33.76 on MS-COCO, and the gains hold across model sizes.
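
The alternating, ranking-based idea can be illustrated with a small sketch. The summary above does not give RankE's actual objectives, so this example substitutes a generic pairwise logistic (Bradley-Terry style) ranking loss, and it assumes a strict one-round-each alternation between policy and decoder. The names `pairwise_ranking_loss` and `alternating_schedule` are hypothetical and chosen only for illustration.

```python
import math


def pairwise_ranking_loss(scores, rewards):
    """Generic pairwise logistic ranking loss (an assumption, not RankE's objective).

    For every pair (i, j) where sample i has a higher reward than sample j,
    penalize the model when its score for i is not above its score for j.
    """
    total, count = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] > rewards[j]:
                margin = scores[i] - scores[j]
                # log(1 + exp(-margin)): small when the ordering is respected
                total += math.log1p(math.exp(-margin))
                count += 1
    return total / count if count else 0.0


def alternating_schedule(num_rounds):
    """Hypothetical schedule: update the policy on even rounds, the decoder on odd ones."""
    return ["policy" if r % 2 == 0 else "decoder" for r in range(num_rounds)]


if __name__ == "__main__":
    rewards = [1.0, 0.0]  # sample 0 is preferred over sample 1
    print(pairwise_ranking_loss([2.0, 0.0], rewards))  # scores agree with the ranking
    print(pairwise_ranking_loss([0.0, 2.0], rewards))  # scores contradict the ranking
    print(alternating_schedule(4))
```

In this sketch the loss is lower when the model's scores agree with the reward ordering, and the schedule makes explicit that neither module stays frozen for the whole run. A real training loop would compute the scores from the policy or decoder being updated in that round.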