Separating forget and remember in fast attention mechanisms

Linear attention speeds up transformers by replacing the attention cache, which grows with sequence length, with a fixed-size memory that updates in constant time per token. The challenge is editing this compressed memory without corrupting what is already stored. Gated DeltaNet-2 decouples two operations, erasing old content and writing new content, by giving each its own learnable per-channel gate, which generalizes prior approaches (Gated DeltaNet and Kimi Delta Attention). In experiments on 100B tokens, it outperforms Mamba-2 and Mamba-3 on language modeling, reasoning, and especially long-context retrieval benchmarks. The code has been released.
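
To make the erase/write split concrete, here is a minimal sketch of one memory update with separate per-channel gates. It is an illustration only, not the paper's actual equations: the function names, the matrix layout (rows indexed by key channel), and the exact update form are assumptions chosen so that setting both gates to the same value recovers a plain delta-rule update.

```python
def decoupled_delta_step(S, k, v, erase, write):
    """One update of a fixed-size memory S (d_k rows, d_v columns).

    Hypothetical illustrative form (an assumption, not taken from the paper):
        S_new[i][j] = S[i][j] - erase[i] * k[i] * old[j] + write[i] * k[i] * v[j]
    where old = k^T S is what the memory currently returns for key k.
    erase and write are per-key-channel gates, expected to lie in [0, 1].
    """
    d_k, d_v = len(S), len(S[0])
    # What the memory currently stores under key k.
    old = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return [
        [S[i][j] - erase[i] * k[i] * old[j] + write[i] * k[i] * v[j]
         for j in range(d_v)]
        for i in range(d_k)
    ]


def read(S, q):
    """Query the memory: returns q^T S, a vector of length d_v."""
    return [sum(q[i] * S[i][j] for i in range(len(S))) for j in range(len(S[0]))]


# Store a value under key [1, 0], then update the same key three different ways.
S0 = [[0, 0], [0, 0]]
k = [1, 0]
S1 = decoupled_delta_step(S0, k, [3, 4], erase=[1, 1], write=[1, 1])

overwrite = decoupled_delta_step(S1, k, [5, 6], erase=[1, 1], write=[1, 1])
accumulate = decoupled_delta_step(S1, k, [5, 6], erase=[0, 0], write=[1, 1])
forget = decoupled_delta_step(S1, k, [5, 6], erase=[1, 1], write=[0, 0])

print(read(overwrite, k))   # [5, 6]: old value erased, new value written
print(read(accumulate, k))  # [8, 10]: nothing erased, new value added on top
print(read(forget, k))      # [0, 0]: old value erased, nothing written
```

With a single shared gate, the last two behaviours cannot be selected independently, which is the limitation the separate gates are meant to remove. In a real model the gates would be produced per token by learned projections rather than set by hand.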