How category theory unifies attention, diffusion, and self-conditioning

This work reframes Transformer layers as weighted extension operators in category theory, unifying standard attention (narrow neighborhoods), sparse geometric variants, and higher-order simplicial structures. The framework also connects to diffusion models and shows how predict-detach self-conditioning, which trains on the model's own predictions instead of ground truth, exposes noncausal structure without leaking future tokens. Experiments with 12 variants on Penn Treebank, WikiText-2, and WikiText-103 show that quadratic KET outperforms causal baselines on large datasets, but the strongest improvements come from the predict-detach regime itself.
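
To make the "weighted aggregation over a neighborhood" reading of attention concrete, here is a minimal sketch in plain Python. It is not the construction from the work itself: the function names (`neighborhood_attention`, `causal`, `local_window`) and the choice of neighborhoods are illustrative assumptions. The sketch only shows that swapping the neighborhood function changes which positions each token aggregates from, while the softmax weighting stays the same.

```python
import math


def neighborhood_attention(queries, keys, values, neighborhood):
    """Softmax-weighted average of values over each position's neighborhood.

    `neighborhood(i, n)` is a hypothetical helper returning the indices that
    position i may aggregate from in a sequence of length n.
    """
    n = len(queries)
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for i in range(n):
        idx = list(neighborhood(i, n))
        # Scaled dot-product scores against the allowed positions only.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / scale
            for j in idx
        ]
        # Numerically stable softmax weights.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the neighborhood's value vectors.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, idx)) / total
            for d in range(dim)
        ])
    return outputs


def causal(i, n):
    """Assumed neighborhood: every position up to and including i."""
    return range(i + 1)


def local_window(i, n, width=2):
    """Assumed sparse neighborhood: the last `width` positions ending at i."""
    return range(max(0, i - width + 1), i + 1)


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(neighborhood_attention(x, x, x, causal))
    print(neighborhood_attention(x, x, x, local_window))
```

Both neighborhoods here look only at past positions, so neither leaks future tokens. Higher-order (simplicial) structures would need neighborhoods over pairs or larger groups of positions, which this sketch does not attempt.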