cs.LG

How category theory unifies attention, diffusion, and self-conditioning

Sridhar Mahadevan

May 26, 2026

This work reframes Transformer layers as weighted Kan extension operators from category theory, unifying standard attention (narrow neighborhoods), sparse geometric variants, and higher-order simplicial structures. The framework also connects to diffusion models and shows how predict-detach self-conditioning (training on model predictions instead of ground truth) exposes noncausal structure without leaking future tokens. Experiments with 12 variants on Penn Treebank, WikiText-2, and WikiText-103 show that the quadratic Kan Extension Transformer (KET) outperforms causal baselines on the larger datasets, but the strongest improvements come from the predict-detach regime itself.
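To make the "neighborhood" view concrete, here is a minimal sketch in plain Python. It assumes that each attention variant can be read as a softmax-weighted aggregation of values over a per-position index set. The function `neighborhood_attention` and the helpers `causal`, `local_window`, and `full` are hypothetical names chosen for illustration; they are not taken from the paper, and the sketch does not implement the paper's categorical construction or its predict-detach training.

```python
import math


def neighborhood_attention(queries, keys, values, neighborhood):
    """Softmax-weighted aggregation of values over a per-position index set.

    `neighborhood(i, n)` returns the positions that position i may aggregate
    over. This is an illustrative assumption, not the paper's definition.
    """
    n = len(queries)
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for i in range(n):
        idx = list(neighborhood(i, n))
        # Scaled dot-product scores against keys inside the neighborhood only.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / scale
            for j in idx
        ]
        # Numerically stable softmax weights.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the values in the neighborhood.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, idx)) / total
            for d in range(dim)
        ])
    return outputs


def causal(i, n):
    """Standard causal attention: positions 0..i."""
    return range(i + 1)


def local_window(i, n, width=2):
    """A sparse variant: only the last `width` positions up to i."""
    return range(max(0, i - width + 1), i + 1)


def full(i, n):
    """Noncausal attention: every position, including future ones."""
    return range(n)
```

Under this reading, swapping the `neighborhood` argument changes which variant is computed while the aggregation rule stays the same. With `causal`, changing a later token leaves the outputs at earlier positions unchanged; with `full`, it does not, which is the kind of future-token leakage a causal language model has to avoid.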
Published as "Kan Extension Transformers: A Categorical Unification of Attention, Diffusion, and Predict-Detach Self-Conditioning" (arXiv:2605.27259).