Can 2D spatial grids beat attention for vision encoders?

Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

Self-attention dominates vision models, but its cost grows quadratically with the number of tokens and therefore rises steeply with image resolution. C-GSPN replaces it with 2D spatial propagation performed directly on the image grid, avoiding the flattening that weakens layout awareness in token-stream alternatives. A specialized CUDA kernel makes this propagation 40–52× faster than prior work, and distillation from ViT teachers makes foundation-scale training practical. The resulting model matches the ViT baseline with fewer parameters, delivers a 4× speedup at 2K resolution, and gains +2.1% on ADE20K segmentation.
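
To make the contrast with attention concrete, the following is a minimal NumPy sketch of one simple form of 2D propagation on a grid: a top-to-bottom row scan in which each pixel mixes its own input with three neighbours from the row above. The function name `row_scan_propagate`, the three-neighbour connectivity, and the weight layout are illustrative assumptions, not details stated in the summary above, and this is not the C-GSPN CUDA kernel. The point is only that the work per pixel is constant, so the total cost grows linearly with the number of pixels, and the grid is never flattened into a token sequence.

```python
import numpy as np

def row_scan_propagate(x, w):
    """Top-to-bottom linear propagation on a 2D grid (illustrative sketch).

    x: (H, W, C) input features.
    w: (H, W, 3) weights over the three neighbours in the previous row
       (up-left, up, up-right). This layout is an assumption for the demo.
    Returns h with shape (H, W, C), where
        h[0]    = x[0]
        h[i, j] = x[i, j] + sum_k w[i, j, k] * h[i - 1, j + k - 1]
    """
    H, W, C = x.shape
    h = np.zeros_like(x, dtype=np.float64)
    h[0] = x[0]
    for i in range(1, H):
        # Zero-pad the previous row so border pixels also see three neighbours.
        prev = np.pad(h[i - 1], ((1, 1), (0, 0)))  # shape (W + 2, C)
        left, up, right = prev[:-2], prev[1:-1], prev[2:]
        h[i] = (x[i]
                + w[i, :, 0:1] * left
                + w[i, :, 1:2] * up
                + w[i, :, 2:3] * right)
    return h

# A single impulse in the top row spreads downward in a widening cone.
x = np.zeros((4, 5, 1))
x[0, 2, 0] = 1.0
w = np.full((4, 5, 3), 1.0 / 3.0)  # uniform weights, each row sums to 1
h = row_scan_propagate(x, w)
```

Each of the H rows costs a fixed number of multiply-adds per pixel and channel, so the scan is O(H·W·C) overall, whereas full self-attention over the same grid compares every pixel with every other and is O((H·W)²). The loop over rows is sequential, but every pixel within a row is updated independently, which is the kind of parallelism a dedicated GPU kernel could exploit.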