Learning from old data while adapting online without forgetting

Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu, Cody Fleming, Soumik Sarkar

Hybrid offline-online reinforcement learning struggles at the transition between a static dataset and a live environment: the policy either drifts away from what it learned offline or forgets that knowledge entirely. COOPO addresses this by alternating between two phases in a loop. It first anchors the policy to the dataset with KL-regularized offline training, then fine-tunes the policy online, then cycles back to the offline phase. This alternation reduces the number of environment interactions needed relative to prior methods while improving final performance on D4RL tasks. The approach is compatible with any offline or online algorithm, which makes it a practical drop-in for practitioners.
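The alternation described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a three-armed bandit instead of a D4RL task, a tabular softmax policy, exact gradients in place of sampled ones, and plain KL minimization toward the dataset's action distribution as a stand-in for the KL-regularized offline objective. All function names, the reward values, and the behavior distribution are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def offline_phase(logits, behavior, steps, lr):
    """Anchor the policy to the dataset by descending KL(behavior || policy)."""
    for _ in range(steps):
        probs = softmax(logits)
        # Gradient of KL(behavior || softmax(logits)) w.r.t. logits is (probs - behavior).
        logits = [l - lr * (p - b) for l, p, b in zip(logits, probs, behavior)]
    return logits


def online_phase(logits, rewards, steps, lr):
    """Fine-tune the policy by ascending the exact expected reward."""
    for _ in range(steps):
        probs = softmax(logits)
        value = sum(p * r for p, r in zip(probs, rewards))
        # Exact policy gradient for a softmax bandit: probs[j] * (rewards[j] - value).
        logits = [l + lr * p * (r - value) for l, p, r in zip(logits, probs, rewards)]
    return logits


def cyclic_training(behavior, rewards, cycles, offline_steps, online_steps, lr):
    """Alternate offline anchoring and online fine-tuning for a fixed number of cycles."""
    logits = [0.0] * len(behavior)
    history = []
    for _ in range(cycles):
        logits = offline_phase(logits, behavior, offline_steps, lr)
        logits = online_phase(logits, rewards, online_steps, lr)
        probs = softmax(logits)
        expected_reward = sum(p * r for p, r in zip(probs, rewards))
        # Track both objectives: task performance and distance from the dataset.
        history.append((expected_reward, kl(behavior, probs)))
    return softmax(logits), history


# Hypothetical setup: the old data favors arm 0, but arm 2 pays best online.
behavior = [0.6, 0.3, 0.1]
rewards = [0.2, 0.5, 0.9]
policy, history = cyclic_training(
    behavior, rewards, cycles=3, offline_steps=20, online_steps=20, lr=0.3
)
for cycle, (reward, divergence) in enumerate(history, start=1):
    print(f"cycle {cycle}: expected reward={reward:.3f}, KL to dataset={divergence:.3f}")
```

In this sketch the two phases pull in opposite directions, so the ratio of `offline_steps` to `online_steps` controls how far the policy may move from the dataset before it is re-anchored. A real implementation would replace both phase functions with the chosen offline and online algorithms.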