Training agents to learn from their mistakes more efficiently

Training LLM agents on long-horizon tasks with reinforcement learning is inefficient because sparse rewards do not indicate which intermediate steps caused a failure. Existing methods apply feedback at every turn or on a fixed schedule, which wastes effort on actions that already succeeded. HINT-SD instead analyzes the full trajectory after execution to identify the actions that contributed to the failure, then applies corrective feedback only to those spans. On the BFCL v3 and AppWorld benchmarks, the method outperforms dense per-turn feedback baselines while significantly reducing computation per training step. The abstract does not mention code availability.
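
To make the idea of targeted feedback concrete, here is a minimal Python sketch of the span-selection step only. It is not the HINT-SD implementation: the summary does not say how failing actions are diagnosed, so the `ok` flag on each turn is a hypothetical stand-in for whatever the post-hoc trajectory analysis produces, and the names `Turn`, `failing_spans`, and `feedback_mask` are invented for illustration. The sketch only shows how a per-turn diagnosis could be turned into spans that receive feedback, compared with a dense baseline that touches every turn.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    action: str  # what the agent did at this turn
    ok: bool     # ASSUMPTION: verdict from a post-hoc trajectory analysis


def failing_spans(turns: List[Turn]) -> List[Tuple[int, int]]:
    """Group consecutive failing turns into (start, end) spans, end exclusive."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, turn in enumerate(turns):
        if not turn.ok and start is None:
            start = i                    # a failing run begins here
        elif turn.ok and start is not None:
            spans.append((start, i))     # the failing run ended at i - 1
            start = None
    if start is not None:                # trajectory ended inside a failing run
        spans.append((start, len(turns)))
    return spans


def feedback_mask(num_turns: int, spans: List[Tuple[int, int]]) -> List[bool]:
    """True for turns that would receive corrective feedback."""
    mask = [False] * num_turns
    for start, end in spans:
        for i in range(start, end):
            mask[i] = True
    return mask


# Hypothetical six-turn trajectory; the actions and verdicts are made up.
turns = [
    Turn("login", True),
    Turn("list_files", True),
    Turn("open_wrong_file", False),
    Turn("edit_wrong_file", False),
    Turn("save", True),
    Turn("submit_without_check", False),
]

spans = failing_spans(turns)
mask = feedback_mask(len(turns), spans)
targeted_turns = sum(mask)   # turns touched by targeted feedback
dense_turns = len(turns)     # turns touched by a dense per-turn baseline
```

In this toy trajectory the targeted variant would attach feedback to three of the six turns, while a dense baseline would process all six. How much computation that saves in a real training step depends on details the summary does not give.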