Teaching AI Agents to Learn From Their Own Best Moves

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Training LLM agents on long multi-step tasks with reinforcement learning suffers from coarse, trajectory-level rewards that miss finer decision points. SDAR (Self-Distilled Agentic Reinforcement Learning) adds a gated distillation signal on top of RL: a teacher branch with extra context generates token-level guidance, and a sigmoid gate selectively amplifies helpful signals while dampening unreliable negative ones. Tested on Qwen2.5 and Qwen3 models across ALFWorld, WebShop, and Search-QA, SDAR consistently outperforms GRPO alone (+9.4%, +10.2%, +7.0% respectively) and avoids the training instability that plagues naive RL+distillation combinations. The approach is intended for researchers building multi-turn agents where reward sparsity limits learning.
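To make the gating idea concrete, here is a minimal sketch of how a sigmoid gate could be layered on a group-relative (GRPO-style) advantage. It is an illustration under stated assumptions, not SDAR's actual objective. The summary above does not say what the gate takes as input, so the sketch assumes it is the teacher-minus-student log-probability gap on each sampled token. The function names, the `temperature` and `beta` parameters, and the additive combination with the advantage are all hypothetical.

```python
import math
from statistics import mean, pstdev


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each trajectory-level reward
    by the mean and standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gated_token_signal(student_logp, teacher_logp, temperature=1.0):
    """Per-token distillation signal with a sigmoid gate.

    ASSUMPTION (not from the source): the gate input is the log-prob gap
    teacher - student. A positive gap (teacher favors the token) gives a
    gate near 1, so the signal is kept; a negative gap gives a gate near 0,
    so the negative signal is dampened.
    """
    signals = []
    for s, t in zip(student_logp, teacher_logp):
        gap = t - s
        gate = sigmoid(gap / temperature)
        signals.append(gate * gap)
    return signals


def combined_token_weights(advantage, student_logp, teacher_logp, beta=0.5):
    """Hypothetical combination: trajectory-level advantage broadcast to
    every token, plus beta times the gated token-level signal."""
    distill = gated_token_signal(student_logp, teacher_logp)
    return [advantage + beta * d for d in distill]


if __name__ == "__main__":
    # Four sampled trajectories with sparse 0/1 task rewards.
    advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
    # Log-probs of three sampled tokens under the student and teacher branches.
    student = [-2.0, -2.0, -2.0]
    teacher = [-2.0, -0.5, -3.5]
    print(combined_token_weights(advantages[0], student, teacher))
```

With this choice of gate, a token the teacher prefers by 1.5 nats contributes a much larger signal in magnitude than a token the teacher disfavors by the same margin. That is one way to read "amplifies helpful signals while dampening unreliable negative ones"; the method's actual gate may be defined differently.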