cs.CV

Can 2D spatial grids beat attention for vision encoders?

Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

May 30, 2026

Self-attention dominates vision models, but its cost grows quadratically with the number of image tokens, which rises with resolution. C-GSPN replaces attention with 2D spatial propagation on a grid, avoiding the flattening that weakens layout awareness in token-stream alternatives. A specialized CUDA kernel makes this propagation 40–52× faster than prior work, and distillation from ViT teachers makes foundation-scale training practical. The resulting encoder matches the ViT baseline with fewer parameters, runs 4× faster at 2K resolution, and gains 2.1% on ADE20K segmentation.
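To make the scaling argument concrete, here is a minimal NumPy sketch. It is not the paper's CUDA kernel or its exact recurrence. The 16-pixel patch size, the reading of "2K" as 2048×2048, the three-neighbour connectivity, and the function names `attention_pairs`, `scan_steps`, and `propagate_top_down` are all assumptions made for illustration.

```python
import numpy as np


def attention_pairs(height, width, patch=16):
    """Token pairs compared by full self-attention (quadratic in tokens)."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens


def scan_steps(height, width, patch=16):
    """Per-token updates in one grid scan (linear in tokens)."""
    return (height // patch) * (width // patch)


def propagate_top_down(x, weights):
    """Illustrative top-to-bottom line scan that keeps the 2D layout.

    x:       (H, W) input features.
    weights: (H, W, 3) non-negative weights over the upper-left, upper,
             and upper-right neighbours in the previous row. They are
             normalised to sum to 1 per pixel (a hypothetical choice).
    """
    H, W = x.shape
    w = weights / weights.sum(axis=-1, keepdims=True)
    h = np.zeros((H, W), dtype=float)
    h[0] = x[0]
    for i in range(1, H):
        # Replicate border values so every pixel has three neighbours.
        prev = np.pad(h[i - 1], 1, mode="edge")
        mixed = (w[i, :, 0] * prev[:-2]
                 + w[i, :, 1] * prev[1:-1]
                 + w[i, :, 2] * prev[2:])
        # Each row is updated as a whole, so a row is one parallel step.
        h[i] = x[i] + mixed
    return h


if __name__ == "__main__":
    for side in (512, 1024, 2048):
        print(side, attention_pairs(side, side), scan_steps(side, side))
```

Under these assumptions, doubling the image side multiplies the attention pair count by 16 but the scan step count by only 4. The scan also never flattens the grid into a 1D token stream, so each pixel's update depends on its spatial neighbours in the previous row.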
Published as "Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders", arXiv:2606.00746