Keeping giant image generators faithful while making them smarter

Ultra-high-resolution image synthesis with diffusion models faces a tradeoff: foundation models (like SAM) carry useful knowledge, but directly forcing the diffusion model to copy it degrades generation quality. Spatial Gram Alignment sidesteps this by aligning only the internal structure of features (their self-similarities) rather than the features themselves, preserving the model's original capability while adding structural guidance. The approach works across both intermediate diffusion layers and VAE latents, achieving state-of-the-art results on text-to-image synthesis at extreme resolutions.
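
To make "aligning self-similarities rather than features" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `spatial_gram` and `gram_alignment_loss` are hypothetical, and the use of cosine normalization and a mean-squared-error penalty is an assumption, since the summary above does not specify the exact loss.

```python
import numpy as np

def spatial_gram(features, eps=1e-8):
    """Cosine self-similarity between all spatial positions.

    features: array of shape (N, D), one D-dimensional vector per position.
    Returns an (N, N) matrix; entry (i, j) compares positions i and j.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)
    return unit @ unit.T

def gram_alignment_loss(model_feats, teacher_feats):
    """Mean squared difference between the two self-similarity matrices.

    Assumption: MSE is used here for illustration only.
    The two inputs may have different channel counts, because only the
    (N, N) Gram matrices are compared, never the raw features.
    """
    g_model = spatial_gram(model_feats)
    g_teacher = spatial_gram(teacher_feats)
    return float(np.mean((g_model - g_teacher) ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 32))            # 16 positions, 32 channels

# Rotate the channel space: every feature value changes, but the
# similarity between any two positions stays the same.
q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
rotated = teacher @ q

# Independent features with a different channel count (64 instead of 32).
unrelated = rng.normal(size=(16, 64))

loss_rotated = gram_alignment_loss(rotated, teacher)
loss_unrelated = gram_alignment_loss(unrelated, teacher)
print(loss_rotated, loss_unrelated)
```

In this toy setup the rotated features incur essentially zero loss even though none of their values match the teacher's, while the unrelated features are penalized. That is the property the paragraph describes: the generator's features are free to stay in their own space as long as their spatial structure agrees with the foundation model's.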