stat.ML

Finding hidden feature spaces buried in neural networks

Alexander Modell

May 18, 2026

Neural networks compress meaning into dense vector representations that are hard to interpret. This paper introduces the Manifold Probe, a method for discovering low-dimensional geometric structures (manifolds) that encode specific concepts, such as how Llama 2-7b represents time. The method identifies which directions in the network's activations correspond to features like "year" or "location," then shows that these directions causally affect model outputs by steering the model's predictions about when famous movies were released. The result is a way to make neural network internals more transparent and controllable.
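To make the idea of "finding a direction, then steering along it" concrete, here is a minimal sketch. It is not the paper's Manifold Probe: it swaps in an ordinary least-squares linear probe, and it runs on synthetic activations rather than a real model. The function names (`make_synthetic_activations`, `fit_linear_probe`, `steer`), the parabola-shaped feature curve, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np


def make_synthetic_activations(n=500, dim=64, seed=0):
    """Hypothetical stand-in for model activations (not real Llama 2 data).

    A "year" feature is placed on a 1-D curve (a parabola) that lives in a
    2-D subspace of a dim-dimensional activation space, plus small noise.
    """
    rng = np.random.default_rng(seed)
    year = rng.uniform(1950, 2020, size=n)
    t = (year - 1985) / 35.0  # rescale years to [-1, 1]
    # Orthonormal basis for the 2-D subspace that holds the curve.
    basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    curve = np.stack([t, t**2], axis=1)  # shape (n, 2)
    acts = curve @ basis.T + 0.01 * rng.normal(size=(n, dim))
    return acts, year


def fit_linear_probe(acts, target):
    """Least-squares probe: find w, b such that acts @ w + b approximates target."""
    design = np.hstack([acts, np.ones((acts.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[:-1], coef[-1]


def steer(acts, w, delta):
    """Shift activations along w so that the probe readout moves by delta."""
    return acts + delta * w / (w @ w)


if __name__ == "__main__":
    acts, year = make_synthetic_activations()
    w, b = fit_linear_probe(acts, year)
    before = acts @ w + b
    after = steer(acts, w, delta=10.0) @ w + b
    print("mean readout shift (years):", float(np.mean(after - before)))
```

In this toy setup the steering step only changes the probe's own readout. Showing a causal effect on a real model, as the paper reports for movie release dates, would require writing the shifted activations back into the network and checking its outputs.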
Published as "Probing for Representation Manifolds in Superposition" (arXiv:2605.18537)