q-bio.NC

Why do neural networks learn the features they do?

William Dorrell

June 1, 2026

Sparse autoencoders decompose complex neural representations into interpretable features, but it has been unclear what drives them to select certain features over others. This work derives mathematical constraints on what any optimal sparse autoencoder must learn, without relying on unrealistic assumptions about the data. The theory explains three puzzling behaviors: hierarchical feature splitting, residual structure, and surprising antipodal features. Together, these results reveal how sparsity constraints interact with real data to shape which features are extracted.
Published as "How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations" (arXiv:2606.02385).