Can you train RNNs without unrolling them through time?

Training RNNs on long sequences hits a wall: backpropagation through time is sequential and gradients vanish or explode. This work proposes Supervised Memory Training, which sidesteps the problem entirely by treating RNN training as supervised learning on one-step memory updates. A Transformer encoder learns what information to preserve across timesteps, then an RNN learns how to update that memory—all trainable in parallel with stable gradients. On language modeling and pixel prediction, SMT outperforms standard BPTT, suggesting RNNs could scale to capture long-range dependencies without the parallelism bottleneck.