Keeping giant image generators faithful while making them smarter
Jinjin Zhang, Xiefan Guo, Di Huang
May 20, 2026
Ultra-high-resolution image synthesis with diffusion models faces a tradeoff: foundation models such as SAM carry useful structural knowledge, but forcing the diffusion model to copy their features directly degrades generation quality. Spatial Gram Alignment sidesteps this by aligning only the internal structure of features (their self-similarities) rather than the features themselves, which preserves the model's original capability while adding structural guidance. The approach applies to both intermediate diffusion layers and VAE latents, and the authors report state-of-the-art text-to-image results at extreme resolutions.
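To make the distinction between aligning features and aligning their self-similarities concrete, here is a minimal NumPy sketch. It is an illustration, not the paper's implementation: the function names `spatial_gram` and `gram_alignment_loss`, the use of cosine similarity, and the mean-squared-error penalty are all assumptions chosen for clarity, and the random arrays merely stand in for diffusion features and foundation-model features.

```python
import numpy as np


def spatial_gram(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine self-similarity between every pair of spatial positions.

    features: (N, C) array of N spatial tokens with C channels each.
    Returns an (N, N) matrix that does not depend on C.
    (Hypothetical helper; cosine normalisation is an assumption.)
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)
    return unit @ unit.T


def gram_alignment_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared difference between the two self-similarity matrices.

    (Assumed loss form; the paper may use a different distance.)
    """
    diff = spatial_gram(student) - spatial_gram(teacher)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)

# Stand-in for foundation-model features: a 4x4 grid (16 tokens), 32 channels.
teacher = rng.normal(size=(16, 32))

# A random orthogonal rotation changes every feature value
# but leaves all pairwise similarities intact.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
student = teacher @ rotation

feature_mse = float(np.mean((student - teacher) ** 2))
gram_loss = gram_alignment_loss(student, teacher)

print(f"direct feature MSE:  {feature_mse:.4f}")
print(f"Gram alignment loss: {gram_loss:.2e}")

# Channel widths need not match, because the Gram matrix is N x N.
narrow_student = rng.normal(size=(16, 8))
print(spatial_gram(narrow_student).shape)
```

In this toy setup the rotated features differ substantially from the teacher's, yet the Gram loss is zero up to floating-point error. That is the sense in which a structure-only objective leaves the model free to keep its own feature values while still matching the spatial relationships of the reference.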