Why vision transformers fail at detail work—and how sparse attention fixes it

Vision transformers excel at image classification but degrade at dense tasks like segmentation because semantic information spreads too broadly across patches during training—a phenomenon called semantic diffusion. Rather than blocking global context, researchers showed that selective token mixing fixes this: swapping softmax for entmax-1.5 attention substantially improves segmentation (VOC mIoU 42.8→48.78, ADE20K 19.85→21.97) without losing classification performance. A minimal intervention with clear practical gains.