Teaching AI Agents to Learn From Their Own Best Moves
Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
May 14, 2026
Reinforcement learning for LLM agents on long multi-step tasks relies on coarse, trajectory-level rewards that miss individual decision points. SDAR (Self-Distilled Agentic Reinforcement Learning) adds a gated distillation signal on top of RL: a teacher branch with extra context generates token-level guidance, and a sigmoid gate selectively amplifies helpful signals while dampening unreliable negative ones. On Qwen2.5 and Qwen3 models evaluated across ALFWorld, WebShop, and Search-QA, SDAR consistently outperforms GRPO alone (+9.4%, +10.2%, and +7.0%, respectively) and avoids the training instability that plagues naive combinations of RL and distillation. The approach is intended for researchers building multi-turn agents where reward sparsity limits learning.
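To make the gating idea concrete, here is a minimal Python sketch of one way a sigmoid gate could reshape a token-level teacher signal before it is added to a trajectory-level advantage. This summary does not give the paper's actual formulation, so everything below is an assumption for illustration: the function name `gated_token_advantages`, the use of the teacher-minus-student log-probability gap as the token signal, and the hypothetical hyperparameters `beta` and `distill_weight`.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_token_advantages(
    trajectory_advantage: float,
    teacher_logprobs: list[float],
    student_logprobs: list[float],
    beta: float = 1.0,            # hypothetical gate sharpness
    distill_weight: float = 0.5,  # hypothetical weight on the distillation term
) -> list[float]:
    """Combine one trajectory-level advantage with gated token-level signals.

    Assumption: the token signal is the teacher/student log-probability gap.
    The gate sigmoid(beta * signal) is above 0.5 for positive signals and
    below 0.5 for negative ones, so positive guidance is passed through more
    strongly than negative guidance of the same magnitude.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")

    advantages = []
    for t_lp, s_lp in zip(teacher_logprobs, student_logprobs):
        signal = t_lp - s_lp              # > 0: teacher favors this token more
        gate = sigmoid(beta * signal)     # soft, sign-dependent weighting
        advantages.append(trajectory_advantage + distill_weight * gate * signal)
    return advantages


# Two tokens with signals of equal magnitude and opposite sign (+1.0 and -1.0).
example = gated_token_advantages(1.0, [-0.5, -2.0], [-1.5, -1.0])
```

In this sketch the positive signal moves its token's advantage further from the trajectory-level value than the negative signal does, which mirrors the asymmetric behavior described above. A real training setup would compute these quantities on batched tensors inside the policy-gradient loss.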