cs.CL

Why does offline feedback training beat expensive reinforcement learning for chatbots?

Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu

May 29, 2026

Training language models from user feedback in multi-turn conversations faces a costly trade-off: online RL handles conversation dynamics but requires expensive trajectory generation, while offline supervised fine-tuning is cheap but suffers from distribution shift, since the conversations the model encounters drift away from those it was trained on. DRIFT sidesteps this by converting the RL objective into importance-weighted supervised learning: it samples trajectories offline, computes return-based weights, and then trains with standard fine-tuning. Empirically, DRIFT matches or beats RL baselines while retaining the efficiency and simplicity of supervised learning.
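The three steps named above (offline sampling, return-based weighting, weighted fine-tuning) can be illustrated with a toy sketch. This is not the paper's implementation: the summary does not specify how the weights are computed, so the softmax-over-returns weighting, the `temperature` and `gamma` parameters, and all function names below are assumptions chosen for illustration. Each trajectory is reduced to a list of per-turn rewards and a single scalar log-probability standing in for the model's log-likelihood of that conversation.

```python
import math

def discounted_return(rewards, gamma=1.0):
    """Sum of per-turn rewards, discounted by gamma for each later turn."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

def return_weights(returns, temperature=1.0):
    """Assumed weighting: softmax over returns, so that higher-return
    trajectories receive larger weights (weights sum to 1)."""
    m = max(returns)  # subtract the max for numerical stability
    exps = [math.exp((g - m) / temperature) for g in returns]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_nll(log_probs, weights):
    """Weighted negative log-likelihood: an ordinary supervised loss
    in which each trajectory is scaled by its weight."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Step 1: trajectories sampled offline (toy, hand-written values).
trajectories = [
    {"rewards": [0.0, 1.0], "log_prob": -2.0},  # user was satisfied
    {"rewards": [0.0, 0.0], "log_prob": -1.0},  # user was not
]

# Step 2: compute return-based weights once, before training.
returns = [discounted_return(t["rewards"]) for t in trajectories]
weights = return_weights(returns)

# Step 3: the quantity a standard fine-tuning loop would minimize.
loss = weighted_nll([t["log_prob"] for t in trajectories], weights)
```

Because the weights are fixed before training starts, the training loop itself needs no rollouts, which is where the efficiency argument comes from. In a real setup the scalar `log_prob` would be replaced by the summed token log-probabilities produced by the model being fine-tuned.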
Published as "DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization" (arXiv:2605.31455)