cs.LG

What transformers actually compute vs. what they represent

Ishita Darade, Sushrut Thorat

May 21, 2026

Transformers trained on base-digit extraction (e.g., finding the coefficient of B^D in N's base-B representation) achieve 99.83% accuracy and appear to implement the closed-form algorithm. Linear probes decode intermediate values that match this solution, but causal circuit analysis shows that the model does not actually use them; it routes information through separate pathways that combine late instead. The work demonstrates that what a model represents internally and what it causally computes can diverge sharply, even when an explicit algorithmic ground truth is available.
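For readers unfamiliar with the task, the closed form for the coefficient of B^D in N's base-B representation is floor(N / B^D) mod B. The sketch below is a minimal illustration of that formula, not the paper's code. The function names and the two named intermediates (`power` and `quotient`) are assumptions chosen for illustration; the summary above does not say exactly which intermediate values the probes targeted.

```python
def base_digit(n: int, b: int, d: int) -> int:
    """Return the coefficient of b**d in the base-b representation of n."""
    if n < 0 or b < 2 or d < 0:
        raise ValueError("expected n >= 0, b >= 2, d >= 0")
    power = b ** d         # hypothetical intermediate: B^D
    quotient = n // power  # hypothetical intermediate: floor(N / B^D)
    return quotient % b    # final answer: the digit at position D


def digits_little_endian(n: int, b: int) -> list[int]:
    """Cross-check: digits of n in base b by repeated division, lowest first."""
    if n < 0 or b < 2:
        raise ValueError("expected n >= 0, b >= 2")
    digits = []
    while n > 0:
        n, r = divmod(n, b)
        digits.append(r)
    return digits or [0]


if __name__ == "__main__":
    # 1234 = 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0
    print(base_digit(1234, 10, 2))        # 2
    print(digits_little_endian(1234, 10)) # [4, 3, 2, 1]
```

The closed form and the repeated-division routine give the same digit by different routes, which is a small-scale reminder that agreeing on outputs does not pin down which intermediate quantities a system actually relies on.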
Published as "Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer" (arXiv:2605.22488)