cs.CL

Training agents to learn from their mistakes more efficiently

Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

May 18, 2026

Training LLM agents on long-horizon tasks with reinforcement learning is inefficient because sparse rewards do not indicate which intermediate steps caused a failure. Existing methods either apply feedback to every turn or deliver it on a fixed schedule, so effort is wasted on actions that were already correct. HINT-SD instead analyzes the full trajectory after execution to identify which actions contributed to the failure, then applies corrective feedback only to those targeted spans. On the BFCL v3 and AppWorld benchmarks, the method outperforms dense per-turn feedback baselines while significantly reducing computation per training step. The abstract does not mention code availability.
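
The span-selection idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Turn` type, the `targeted_spans` and `select_feedback_spans` helpers, and the keyword-based `toy_blame` function are all hypothetical names introduced here, and the real hindsight analysis is presumably done by a model rather than a string match. The sketch only shows the control flow: look at a finished trajectory, flag the turns judged to have contributed to the failure, and return contiguous spans so that feedback is applied there rather than on every turn.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    """One agent step: the action taken and what the environment returned."""
    action: str
    observation: str


def targeted_spans(flags: List[bool]) -> List[Tuple[int, int]]:
    """Group consecutive flagged turn indices into inclusive (start, end) spans."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i                      # a new span opens here
        elif not flagged and start is not None:
            spans.append((start, i - 1))   # the span closed on the previous turn
            start = None
    if start is not None:                  # span runs to the end of the trajectory
        spans.append((start, len(flags) - 1))
    return spans


def select_feedback_spans(
    trajectory: List[Turn],
    succeeded: bool,
    blame: Callable[[List[Turn], int], bool],
) -> List[Tuple[int, int]]:
    """Return the spans that should receive corrective feedback.

    `blame` is a stand-in (assumption) for the hindsight analysis: it sees
    the whole finished trajectory and says whether turn `i` contributed
    to the failure. Successful trajectories get no corrective feedback.
    """
    if succeeded:
        return []
    flags = [blame(trajectory, i) for i in range(len(trajectory))]
    return targeted_spans(flags)


def toy_blame(trajectory: List[Turn], i: int) -> bool:
    """Toy heuristic (assumption): blame turns whose observation reports an error."""
    return "error" in trajectory[i].observation.lower()


trajectory = [
    Turn("login(user)", "ok"),
    Turn("list_files()", "ok"),
    Turn("open('report.txt')", "Error: file not found"),
    Turn("open('report.txt')", "Error: file not found"),
    Turn("logout()", "ok"),
]

spans = select_feedback_spans(trajectory, succeeded=False, blame=toy_blame)
print(spans)  # [(2, 3)]
```

In this toy trajectory only two of the five turns fall inside a targeted span, whereas a dense per-turn scheme would process all five.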
Published as "HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents" (arXiv:2605.17873)