What transformers actually compute vs. what they represent
Ishita Darade, Sushrut Thorat
May 21, 2026
Transformers trained on base-digit extraction (e.g., finding the coefficient of B^D in the base-B representation of N) reach 99.83% accuracy and appear to implement the closed-form algorithm. Linear probes successfully decode intermediate values consistent with that solution, but causal circuit analysis shows the model does not actually use them; instead, it routes information through separate pathways that combine late. The work demonstrates that internal representations and causal computation can diverge sharply, even when an explicit algorithmic ground truth is available.
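For readers unfamiliar with the task, the closed-form solution can be sketched as follows. This is a minimal illustration, assuming the standard formula digit = floor(N / B^D) mod B; the function name `base_digit` and the choice of which intermediate values to expose are hypothetical and are not taken from the paper.

```python
def base_digit(n: int, base: int, position: int) -> int:
    """Return the coefficient of base**position in the base-`base` expansion of n.

    Assumed closed form: floor(n / base**position) mod base.
    """
    if n < 0 or base < 2 or position < 0:
        raise ValueError("expected n >= 0, base >= 2, position >= 0")
    power = base ** position   # intermediate value: B^D
    quotient = n // power      # intermediate value: floor(N / B^D)
    return quotient % base     # final answer: the digit at position D


# 1234 = 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0, so the coefficient of 10^2 is 2.
print(base_digit(1234, 10, 2))  # 2
```

The named intermediates (`power`, `quotient`) are the sort of quantities a linear probe could plausibly be trained to decode; whether these match the values probed in the paper is an assumption here.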