cs.LG

A smarter video decoder that remembers what it's supposed to look like

Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

May 14, 2026

Latent diffusion video models spend enormous effort conditioning the denoising network but leave the decoder unconditional, a mismatch that quietly erodes sharpness and consistency with the input image. RefDecoder fixes this by injecting a reference frame into the decoder via a lightweight image encoder and cross-attention at each upsampling stage. It plugs into existing systems like Wan 2.1 and VideoVAE+ without retraining, and improves PSNR by up to 2.1 dB on the Inter4K, WebVid, and Large Motion benchmarks while lifting subject consistency, background consistency, and overall quality on VBench I2V. The approach also extends to style transfer and video editing.
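To make the conditioning idea concrete, here is a minimal NumPy sketch of one cross-attention step in which decoder features at a single upsampling stage attend to tokens from an encoded reference frame. It is an illustration under stated assumptions, not the paper's implementation: the function name `reference_cross_attention`, the single attention head, the residual connection, the tensor shapes, and the random weight matrices are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_cross_attention(dec_feats, ref_tokens, w_q, w_k, w_v, w_o):
    """Single-head cross-attention from decoder features to reference tokens.

    dec_feats:  (T, H, W, C) decoder features at one upsampling stage (assumed layout)
    ref_tokens: (N, D) tokens produced by an image encoder from the reference frame
    w_q: (C, d), w_k: (D, d), w_v: (D, d), w_o: (d, C) projection matrices
    Returns the updated features (same shape as dec_feats) and the attention map.
    """
    t, h, w, c = dec_feats.shape
    q = dec_feats.reshape(t * h * w, c) @ w_q        # queries from the video decoder
    k = ref_tokens @ w_k                             # keys from the reference frame
    v = ref_tokens @ w_v                             # values from the reference frame
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T*H*W, N), each row sums to 1
    out = (attn @ v) @ w_o                           # project back to C channels
    # Residual add (an assumption): the reference only refines the decoder features.
    return dec_feats + out.reshape(t, h, w, c), attn


# Toy usage with random data: 4 frames of 8x8 features, 16 reference tokens.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 8, 32))
ref = rng.normal(size=(16, 24))
w_q = rng.normal(size=(32, 16))
w_k = rng.normal(size=(24, 16))
w_v = rng.normal(size=(24, 16))
w_o = rng.normal(size=(16, 32))
updated, attn = reference_cross_attention(feats, ref, w_q, w_k, w_v, w_o)
print(updated.shape, attn.shape)
```

In a decoder with several upsampling stages, a step like this would be repeated at each stage with its own projections, so every resolution can look up detail from the reference frame.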
Published as "RefDecoder: Enhancing Visual Generation with Conditional Video Decoding", arXiv:2605.15196