Why does offline feedback training beat expensive reinforcement learning for chatbots?
Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu
May 29, 2026
Training language models from user feedback in multi-turn conversations faces a costly trade-off: online RL handles conversation dynamics but requires expensive trajectory generation, while offline supervised fine-tuning is cheap but suffers from distribution shift as the model's behavior drifts away from the training data. DRIFT sidesteps this by converting the RL objective into importance-weighted supervised learning: it samples trajectories offline, computes return-based weights, then trains via standard fine-tuning. Empirically, DRIFT matches or beats RL baselines while retaining the efficiency and simplicity of supervised learning.
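The three steps in the summary (sample offline, weight by return, fine-tune) can be illustrated with a minimal sketch. This is not the paper's implementation: the summary only says the weights are "return-based", so the exponential, self-normalized weighting below, the temperature `beta`, and the function names `return_weights` and `weighted_nll` are all assumptions chosen for illustration.

```python
import math

def return_weights(returns, beta=1.0):
    """Turn per-trajectory returns into normalized weights.

    Assumption: weights are softmax(returns / beta). The source only
    states that weights are return-based, not their exact form.
    """
    top = max(returns)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / beta) for r in returns]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_nll(token_logprobs, weights):
    """Weighted negative log-likelihood over offline trajectories.

    token_logprobs[i] holds the log-probabilities the model assigns to
    the response tokens of trajectory i (hypothetical input format).
    """
    loss = 0.0
    for logprobs, weight in zip(token_logprobs, weights):
        loss += weight * (-sum(logprobs))
    return loss

# Toy data: three offline trajectories with made-up returns.
returns = [1.0, 0.0, -1.0]
token_logprobs = [[-0.1, -0.2], [-0.3, -0.4], [-0.5, -0.6]]

weights = return_weights(returns, beta=1.0)
loss = weighted_nll(token_logprobs, weights)
```

In this sketch, trajectories with higher returns contribute more to the loss, so an ordinary supervised update pushes the model toward them. No new trajectories are generated during training, which is where the cost saving over online RL would come from.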