A rigorous metric for deciding when two neural networks are equivalent

ML Nissen Gonzalez, Melwina Albuquerque, Laurence Wroe, Jacob Meyer Cohen, Logan Riggs Smith, Thomas Dooms

Mechanistic interpretability requires confirming that two model components perform the same computation, but existing tools either rely on behavioral tests, which can miss out-of-distribution behavior, or compare raw weights without accounting for symmetries in weight space. Tensor similarity addresses both problems for tensor-based models: it defines a metric that is invariant to weight-space symmetries and captures cross-layer mechanisms through a recursive algorithm. In experiments, it tracks functional training dynamics, including grokking and backdoor insertion, with higher fidelity than prior metrics. For tensor-based models, this reduces similarity verification from an empirical approximation problem to an algebraic one with an exact solution.
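To make "invariant to weight-space symmetries" concrete, here is a minimal sketch for the simplest tensor-based case, assuming a single bilinear layer y = P((Wx) * (Vx)). It is not the paper's recursive cross-layer algorithm. The layer form, all function names (`tensor_similarity`, `permute_hidden`, `rescale_hidden`, etc.), and the choice of cosine similarity on the input-symmetrized third-order tensor are illustrative assumptions. Permuting hidden units, rescaling them, or swapping W with V changes the raw weights but leaves the tensor, and therefore the score, unchanged.

```python
import math
import random
from dataclasses import dataclass

Matrix = list[list[float]]


@dataclass
class Bilinear:
    """Hypothetical bilinear layer y = P @ ((W @ x) * (V @ x)).
    W, V: hidden x d_in; P: d_out x hidden (row-major nested lists)."""
    W: Matrix
    V: Matrix
    P: Matrix

    def __call__(self, x: list[float]) -> list[float]:
        hidden = [sum(w * xi for w, xi in zip(wr, x)) * sum(v * xi for v, xi in zip(vr, x))
                  for wr, vr in zip(self.W, self.V)]
        return [sum(p * z for p, z in zip(pr, hidden)) for pr in self.P]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def gram(a: Matrix, b: Matrix) -> Matrix:
    """Entry (h, g) = dot(row h of a, row g of b), i.e. a @ b.T."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def tensor_inner(a: Bilinear, b: Bilinear) -> float:
    """Inner product of the symmetrized tensors, contracted over hidden units.

    B[k][i][j] = sum_h P[k][h] * W[h][i] * V[h][j]. Only the part symmetric in
    (i, j) affects the output, so (B + B with i, j swapped) / 2 is compared.
    The full d_out x d_in x d_in tensor is never materialized."""
    pp = gram(transpose(a.P), transpose(b.P))  # sum_k P_a[k][h] * P_b[k][g]
    ww, vv = gram(a.W, b.W), gram(a.V, b.V)
    wv, vw = gram(a.W, b.V), gram(a.V, b.W)  # cross terms from the (i, j) swap
    total = sum(pp[h][g] * (ww[h][g] * vv[h][g] + wv[h][g] * vw[h][g])
                for h in range(len(pp)) for g in range(len(pp[0])))
    return 0.5 * total


def tensor_similarity(a: Bilinear, b: Bilinear) -> float:
    """Cosine similarity of the symmetrized tensors (hidden widths may differ)."""
    return tensor_inner(a, b) / math.sqrt(tensor_inner(a, a) * tensor_inner(b, b))


def raw_weight_cosine(a: Bilinear, b: Bilinear) -> float:
    """Naive baseline: cosine of flattened raw weights (same shapes only)."""
    def flat(m: Bilinear) -> list[float]:
        return [x for mat in (m.W, m.V, m.P) for row in mat for x in row]
    fa, fb = flat(a), flat(b)
    dot = sum(x * y for x, y in zip(fa, fb))
    return dot / math.sqrt(sum(x * x for x in fa) * sum(y * y for y in fb))


def permute_hidden(m: Bilinear, perm: list[int]) -> Bilinear:
    """Reorder hidden units; the computed function is unchanged."""
    return Bilinear([m.W[p] for p in perm], [m.V[p] for p in perm],
                    [[row[p] for p in perm] for row in m.P])


def rescale_hidden(m: Bilinear, c: list[float]) -> Bilinear:
    """Scale row h of W by c[h] and column h of P by 1 / c[h]; function unchanged."""
    return Bilinear([[c[h] * w for w in row] for h, row in enumerate(m.W)],
                    [row[:] for row in m.V],
                    [[p / c[h] for h, p in enumerate(row)] for row in m.P])


def random_bilinear(d_in: int, hidden: int, d_out: int, seed: int) -> Bilinear:
    rng = random.Random(seed)

    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    return Bilinear(mat(hidden, d_in), mat(hidden, d_in), mat(d_out, hidden))


if __name__ == "__main__":
    base = random_bilinear(d_in=3, hidden=4, d_out=2, seed=0)
    permuted = permute_hidden(base, [3, 2, 1, 0])
    print(tensor_similarity(base, permuted))  # 1.0 up to float rounding
    print(raw_weight_cosine(base, permuted))  # below 1 despite the identical function
```

The inner product is contracted through the hidden dimension, costing roughly O(h_a · h_b · (d_in + d_out)) operations instead of building the full tensor. This is one plausible way to keep such a metric cheap; the paper's construction, especially across multiple layers, may differ.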