Why do neural networks learn the features they do?

Sparse autoencoders decompose complex neural representations into interpretable features, but it has been unclear why they select certain features over others. This work derives mathematical constraints on what any optimal sparse autoencoder must learn, without relying on unrealistic assumptions about the data. The theory explains three puzzling behaviors: hierarchical feature splitting, residual structure, and surprising antipodal features. Together, these show how sparsity constraints interact with real data to shape which features get extracted.