cs.CV

Why vision transformers fail at detail work—and how sparse attention fixes it

Linxiang Su

May 22, 2026

Vision transformers excel at image classification but degrade on dense prediction tasks such as segmentation. The cause is that semantic information spreads too broadly across patches during training, a phenomenon called semantic diffusion. Rather than blocking global context, the researchers show that selective token mixing addresses the problem: swapping softmax for entmax-1.5 attention substantially improves segmentation (VOC mIoU 42.8→48.78, ADE20K 19.85→21.97) without losing classification performance. It is a minimal intervention with clear practical gains.
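To make the softmax-to-entmax-1.5 swap concrete, here is a minimal NumPy sketch. It is not the paper's implementation: it uses the standard sort-based closed form for entmax with α = 1.5, and the function names (`entmax15`, `softmax`, `attention`) and the single-head, unbatched attention layout are assumptions chosen for illustration.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Dense baseline: every token receives a strictly positive weight."""
    z = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def entmax15(scores, axis=-1):
    """Entmax with alpha = 1.5 via the exact sort-based threshold search.

    Returns a probability distribution along `axis` in which
    low-scoring entries can be exactly zero.
    """
    z = np.moveaxis(np.asarray(scores, dtype=np.float64), axis, -1)
    # Shift for numerical stability, then scale by (alpha - 1) = 0.5.
    z = (z - z.max(axis=-1, keepdims=True)) / 2.0

    z_sorted = -np.sort(-z, axis=-1)            # descending order
    k = np.arange(1, z.shape[-1] + 1)           # candidate support sizes
    mean = np.cumsum(z_sorted, axis=-1) / k
    mean_sq = np.cumsum(z_sorted ** 2, axis=-1) / k
    ss = k * (mean_sq - mean ** 2)
    delta = np.clip((1.0 - ss) / k, 0.0, None)
    tau = mean - np.sqrt(delta)                 # threshold for each support size

    # The support size is the number of sorted entries above their threshold.
    support = (tau <= z_sorted).sum(axis=-1, keepdims=True)
    tau_star = np.take_along_axis(tau, support - 1, axis=-1)

    p = np.clip(z - tau_star, 0.0, None) ** 2
    return np.moveaxis(p, -1, axis)


def attention(q, k, v, normalizer=softmax):
    """Single-head scaled dot-product attention with a pluggable normalizer."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = normalizer(scores, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    scores = np.array([4.0, 0.0, 0.0])
    print("softmax :", softmax(scores))    # all entries positive
    print("entmax15:", entmax15(scores))   # mass only on the first entry
```

For the scores `[4, 0, 0]`, softmax still assigns a small weight to the two low-scoring tokens, while `entmax15` returns `[1, 0, 0]`. Passing `normalizer=entmax15` to `attention` is the only change needed in this sketch, which mirrors the "minimal intervention" framing: each patch can still attend globally, but mixes only with the tokens it scores highly.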
Published as "Vision Transformers Need Better Token Interaction", arXiv:2605.23868