cs.CV

Keeping giant image generators faithful while making them smarter

Jinjin Zhang, Xiefan Guo, Di Huang

May 20, 2026

Ultra-high-resolution image synthesis with diffusion models faces a tradeoff: foundation models (like SAM) can supply useful structural guidance, but directly forcing the generator to copy their features degrades generation quality. Spatial Gram Alignment sidesteps this by aligning only the internal structure of the features (their self-similarities) rather than the features themselves, preserving the model's original capability while adding structural guidance. The approach applies to both intermediate diffusion layers and VAE latents, and the authors report state-of-the-art results on text-to-image synthesis at extreme resolutions.
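
To make the idea of aligning self-similarities concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the paper's implementation: the function names `spatial_gram` and `gram_alignment_loss`, the cosine normalization, and the mean-squared-error objective are all hypothetical choices made for this example, since the summary does not specify the exact formulation.

```python
import numpy as np

def spatial_gram(features: np.ndarray) -> np.ndarray:
    """Cosine self-similarity between all spatial positions.

    features: (N, C) array of N spatial tokens with C channels.
    Returns an (N, N) matrix. The cosine normalization is an
    assumption for this sketch, not taken from the paper.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)
    return unit @ unit.T

def gram_alignment_loss(generator_feats: np.ndarray,
                        foundation_feats: np.ndarray) -> float:
    """Hypothetical loss: MSE between the two self-similarity matrices.

    Only the (N, N) structure is compared, so the two feature sets
    may have different channel widths.
    """
    diff = spatial_gram(generator_feats) - spatial_gram(foundation_feats)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
foundation = rng.normal(size=(16, 32))  # 16 tokens, 32 channels
generator = rng.normal(size=(16, 8))    # same tokens, different width

# An orthogonal rotation changes every feature value but leaves the
# self-similarity structure intact, so the loss is (numerically) zero.
q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
rotated = foundation @ q

loss_rotated = gram_alignment_loss(rotated, foundation)
loss_unrelated = gram_alignment_loss(generator, foundation)
```

The rotated features differ from the originals element by element, yet incur essentially no penalty, whereas a direct feature-matching loss would penalize them heavily. This is the sense in which a Gram-style objective constrains structure while leaving the features themselves free.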
Published as "Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis", arXiv:2605.20808.