A smarter video decoder that remembers what it's supposed to look like

Latent diffusion video models spend enormous effort conditioning the denoising network but leave the decoder unconditional, a mismatch that quietly erodes sharpness and consistency with the input image. RefDecoder fixes this by injecting a reference frame into the decoder: a lightweight image encoder extracts features from the frame, and cross-attention at each upsampling stage lets the decoder attend to them. It plugs into existing systems like Wan 2.1 and VideoVAE+ without retraining, and improves PSNR by up to 2.1 dB on the Inter4K, WebVid, and Large Motion benchmarks while lifting subject consistency, background consistency, and overall quality on VBench I2V. The approach also extends to style transfer and video editing.
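
To make the cross-attention idea concrete, here is a minimal NumPy sketch of what one such step could look like at a single upsampling stage. It is an illustration under stated assumptions, not RefDecoder's actual implementation: the function name, the weight matrices, the single attention head, and the tensor shapes are all hypothetical, and the random `ref_tokens` array simply stands in for whatever the image encoder would produce from the reference frame.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_cross_attention(dec_feats, ref_tokens, w_q, w_k, w_v, w_o):
    """Single-head cross-attention from decoder features to reference tokens.

    dec_feats:  (T, H, W, C) decoder activations at one upsampling stage
    ref_tokens: (N, C_ref)   reference-frame features (hypothetical encoder output)
    w_q: (C, D), w_k: (C_ref, D), w_v: (C_ref, D), w_o: (D, C)
    """
    t, h, w, c = dec_feats.shape
    # Every spatio-temporal decoder position becomes one query.
    queries = dec_feats.reshape(t * h * w, c) @ w_q          # (T*H*W, D)
    keys = ref_tokens @ w_k                                   # (N, D)
    values = ref_tokens @ w_v                                 # (N, D)
    # Scaled dot-product attention over the reference tokens.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])       # (T*H*W, N)
    attn = softmax(scores, axis=-1)
    update = (attn @ values) @ w_o                            # (T*H*W, C)
    # Residual connection: the reference only adds a correction.
    return dec_feats + update.reshape(t, h, w, c)


# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
dec_feats = rng.standard_normal((4, 8, 8, 16))   # 4 frames of 8x8 features, 16 channels
ref_tokens = rng.standard_normal((64, 32))       # 64 reference tokens, 32 channels
w_q = rng.standard_normal((16, 24)) * 0.1
w_k = rng.standard_normal((32, 24)) * 0.1
w_v = rng.standard_normal((32, 24)) * 0.1
w_o = rng.standard_normal((24, 16)) * 0.1

out = reference_cross_attention(dec_feats, ref_tokens, w_q, w_k, w_v, w_o)
print(out.shape)  # (4, 8, 8, 16)
```

The residual form is a design choice worth noting: if the output projection `w_o` is all zeros, the block returns the decoder features unchanged, which is one plausible way a conditioning path like this could be attached to an existing decoder without disturbing its behavior at the start.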