Finding hidden feature spaces buried in neural networks

Neural networks compress meaning into dense vector representations that are hard to interpret. This paper introduces the Manifold Probe, which discovers low-dimensional geometric structures (manifolds) encoding specific concepts—like how Llama 2-7b represents time. The method finds which directions in the network's activations correspond to features like "year" or "location," then shows these directions causally affect model outputs by steering predictions about when famous movies were released. This matters because it makes neural network internals more transparent and controllable.