Why projection heads prevent neural networks from collapsing

Self-supervised learning relies on projection heads to prevent dimensional collapse, but the mechanism has remained unclear. This work models projection heads as trainable Riemannian metrics on the representation manifold and proves that smooth nonlinear heads generate negative Hessian eigenvalues at collapsed states, making collapse unstable. Empirical tracking of optimization geometry during training shows that Swish activations exploit this negative curvature to escape collapse, whereas linear and ReLU heads cannot do so without relying on BatchNorm and discrete optimization steps. The analysis explains why deeper nonlinear heads are more effective and why projection heads must be discarded after pretraining: they impose rigid constraints that are incompatible with downstream tasks.
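
One ingredient that separates the three head types is whether the activation contributes any second-order term at all. The sketch below is not the paper's analysis and does not compute the loss Hessian at a collapsed state; it only checks, by central finite differences, that linear and ReLU activations have zero second derivative away from the ReLU kink, while Swish does not. The evaluation point `X0`, the step size `h`, and all function names are assumptions chosen for illustration, and Swish is taken with unit slope parameter (x * sigmoid(x)).

```python
import math


def linear(x: float) -> float:
    return x


def relu(x: float) -> float:
    return max(0.0, x)


def swish(x: float) -> float:
    # Swish (SiLU) with unit slope parameter: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def second_derivative(f, x: float, h: float = 1e-3) -> float:
    # Central finite-difference estimate of f''(x).
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


# Evaluate away from 0 so the ReLU kink does not fall inside the stencil.
X0 = 0.5  # hypothetical pre-activation value, chosen only for illustration
curvature = {
    name: second_derivative(f, X0)
    for name, f in (("linear", linear), ("relu", relu), ("swish", swish))
}

for name, value in curvature.items():
    print(f"{name:>6}: f''({X0}) ~ {value:.4f}")
```

Under these assumptions the linear and ReLU estimates are zero up to floating-point rounding, and the Swish estimate is strictly positive. A piecewise-linear head therefore has no activation curvature to contribute to second-order terms of the loss, which is consistent with the claim that such heads need other mechanisms to leave a collapsed state.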