Can diffusion generate pixels faster than traditional decoders?

Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren

Text-to-image systems generate in compact latent spaces and then decode back to pixels, but standard decoders are slow and struggle at high resolution. PiD treats decoding as conditional pixel diffusion: it synthesizes details directly in pixel space while a lightweight adapter grounds the process in the latent codes. Distilled down to 4 steps, it decodes 512×512 latents to 2048×2048 pixels in under 1 second on an RTX 5090, matching or beating cascaded super-resolution pipelines at a fraction of the cost.
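
To make the idea of a few-step, latent-conditioned pixel sampler concrete, here is a minimal toy sketch in NumPy. It is not PiD's implementation. The function names (`upsample_nearest`, `toy_velocity`, `decode_few_step`), the Euler update, the nearest-neighbour conditioning, and the closed-form stand-in for the learned network are all assumptions made for illustration. Only the 4 sampling steps and the 4× per-side scale (512 to 2048) come from the text above.

```python
import numpy as np


def upsample_nearest(latent, factor):
    """Repeat each latent cell `factor` times along height and width."""
    return latent.repeat(factor, axis=0).repeat(factor, axis=1)


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a learned denoiser.

    A real model would be a neural network that sees the noisy pixels, the
    timestep, and adapter features from the latent. Here the "prediction" is
    simply the straight-line velocity from x toward the conditioning signal.
    """
    return (cond - x) / max(1.0 - t, 1e-6)


def decode_few_step(latent, factor=4, steps=4, seed=0):
    """Start from pixel-space noise and integrate toward an image in `steps` Euler steps."""
    rng = np.random.default_rng(seed)
    cond = upsample_nearest(latent, factor)   # assumed conditioning: upsampled latent
    x = rng.standard_normal(cond.shape)       # pure noise at t = 0
    ts = np.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * toy_velocity(x, t0, cond)
    return x


# Toy usage: an 8x8x3 "latent" becomes a 32x32x3 "image" in 4 steps.
latent = np.random.default_rng(1).uniform(size=(8, 8, 3))
image = decode_few_step(latent)
print(image.shape)
```

Because the stand-in velocity is exact, the 4 Euler steps land on the upsampled conditioning signal. In a real system the network would instead add high-frequency detail that the latent does not carry, and distillation is what would make so few steps sufficient.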