How low-precision transformers achieve real computational power

Prior expressivity results for transformers relied on unrealistic assumptions: hardmax attention, high-precision arithmetic, or expensive architectural modifications. This work shows that standard transformer decoders with softmax attention and rounded activations and weights can compute anything a Turing machine can, provided that depth and width scale logarithmically with context length. The authors first construct hardmax transformers that use Chain-of-Thought (CoT) to simulate Turing machines, then convert them to softmax equivalents without requiring extreme precision. They also analyze a recently proposed summarized CoT approach, showing that its required model size scales logarithmically in space rather than in time. Experiments on Sudoku reasoning tasks indicate that these low-precision results predict learnability better than prior high-precision results. Code is released.
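
To give some intuition for the hardmax-to-softmax step, here is a minimal Python sketch. It is not the authors' construction, only a standard observation: if the best-scoring position leads every other position by a logit gap of about 3 ln n, then softmax over n positions, rounded to roughly log2(n) fractional bits, gives the same output as hardmax. The function names, the choice of gap, and the fixed-point rounding scheme are all assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hardmax(logits):
    """Uniform weight on the maximal positions, zero elsewhere."""
    m = max(logits)
    winners = [1.0 if x == m else 0.0 for x in logits]
    count = sum(winners)
    return [w / count for w in winners]


def round_to_bits(x, bits):
    """Round x to `bits` fractional bits (illustrative fixed-point rounding)."""
    scale = 2 ** bits
    return round(x * scale) / scale


def rounded_softmax_matches_hardmax(n):
    """Check the claim for one context length n (assumed setup, see lead-in)."""
    gap = 3.0 * math.log(n)            # assumed logit gap, grows like log n
    bits = math.ceil(math.log2(n))     # assumed precision, grows like log n
    logits = [gap] + [0.0] * (n - 1)   # one clear winner among n positions
    soft = [round_to_bits(w, bits) for w in softmax(logits)]
    return soft == hardmax(logits)


for n in (16, 256, 4096):
    print(n, rounded_softmax_matches_hardmax(n))
```

In this toy setup both the logit gap and the number of bits grow only logarithmically with the context length n, which is the flavor of scaling the paragraph above describes. The actual conversion in the paper may differ in its details.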