Can rewards, rather than fixed rules, improve image generation training?

Diffusion transformers struggle to align their generative features with pretrained visual encoders efficiently during training. VRPO treats representation alignment as a reinforcement process: instead of enforcing a fixed similarity constraint, the model receives adaptive rewards for generation fidelity and semantic coherence. On ImageNet-256, this approach yields a 1.8 FID improvement and a 2.3× speedup over prior alignment methods, with negligible added cost, and it is compatible with existing DiT and SiT architectures.
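
To make the contrast between a fixed similarity constraint and an adaptive, reward-driven one more concrete, here is a toy sketch in plain Python. It is not VRPO's actual objective, which this summary does not specify. The function names, the softmax weighting over rewards, the `temperature` parameter, and the idea that each sample arrives with a single scalar reward (standing in for fidelity and semantic coherence) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def alignment_gaps(gen_feats, enc_feats):
    """Per-sample misalignment: 1 - cosine(generative feature, encoder feature)."""
    return [1.0 - cosine(g, e) for g, e in zip(gen_feats, enc_feats)]


def fixed_alignment_loss(gen_feats, enc_feats, weight=1.0):
    """Fixed constraint: every sample is pulled toward the encoder equally."""
    gaps = alignment_gaps(gen_feats, enc_feats)
    return weight * sum(gaps) / len(gaps)


def adaptive_weights(rewards, temperature=1.0):
    """Hypothetical weighting: softmax over negated rewards, so samples with
    lower reward receive a larger share of the alignment pressure."""
    logits = [-r / temperature for r in rewards]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_alignment_loss(gen_feats, enc_feats, rewards, temperature=1.0):
    """Assumed reward-driven variant: the same per-sample gaps, but weighted
    by adaptive_weights(rewards) instead of a uniform constant."""
    weights = adaptive_weights(rewards, temperature)
    gaps = alignment_gaps(gen_feats, enc_feats)
    return sum(w * g for w, g in zip(weights, gaps))


# Toy batch: sample 0 is badly aligned and has low reward,
# sample 1 is well aligned and has high reward.
gen = [[1.0, 0.0], [0.6, 0.8]]
enc = [[0.0, 1.0], [0.6, 0.8]]
rewards = [0.1, 0.9]  # hypothetical scalar rewards, one per sample

print("fixed:", fixed_alignment_loss(gen, enc))
print("reward-weighted:", reward_weighted_alignment_loss(gen, enc, rewards))
```

With equal rewards the two losses coincide, since the softmax weights become uniform. The variants differ only when rewards vary across samples, which is the sense in which the constraint is adaptive in this sketch.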