Why language models hit a wall on step-by-step tracking tasks

Decoder-only transformers have a hard mathematical limit on how much state they can track across long reasoning chains. The authors prove this with an Attention Bottleneck Theorem, show that performance decays super-exponentially as chains grow longer, and identify a "Deterministic Horizon" (around 20–30 steps) beyond which pure neural reasoning collapses. Across 8 task domains, including web automation and SQL, hybrid systems that delegate to tools beat pure chain-of-thought by a factor of 2–3. Fine-tuning doesn't close the gap, confirming that it's a core architectural constraint rather than a training issue.
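
To make "delegating to a tool" concrete, here is a minimal sketch of a step-by-step tracking task handed to deterministic code. It is a toy illustration only, not the authors' benchmark or implementation: the cup-swapping task, the function name track_swaps, and the 25-step sequence are all assumptions chosen to sit inside the 20–30 step range mentioned above.

```python
from typing import List, Tuple


def track_swaps(initial: List[str], swaps: List[Tuple[int, int]]) -> List[str]:
    """Apply a sequence of positional swaps and return the final arrangement.

    Hypothetical stand-in for a "tool": the state lives in an ordinary
    Python list, so every step is applied exactly, however long the chain is.
    """
    state = list(initial)  # copy so the caller's list is not mutated
    for i, j in swaps:
        state[i], state[j] = state[j], state[i]
    return state


# Illustrative 25-step chain: cycle through the swaps (0,1), (1,2), (2,0).
swaps = [(k % 3, (k + 1) % 3) for k in range(25)]
final_state = track_swaps(["A", "B", "C"], swaps)
print(final_state)  # ['B', 'A', 'C']
```

In a hybrid setup of this kind, the model would only need to emit the list of swaps; the bookkeeping that a pure chain-of-thought has to carry in its own context is done by the executor instead.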