cs.CL

Learning from past mistakes to make LLMs better at using tools

Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang

May 14, 2026

Language models struggle to use external tools reliably because they must balance reasoning depth against the need for structurally valid outputs. CAST addresses this by analyzing historical execution trajectories to identify two kinds of patterns: complexity profiles (which reasoning strategies work for which task types) and failure profiles (which structural errors are most likely to occur). These patterns are internalized through a fine-grained reward function during reinforcement learning, which lets the model adapt its reasoning depth to each case. On the BFCLv2 and ToolBench benchmarks, CAST improves execution accuracy by up to 5.85 percentage points while reducing average reasoning length by 26%, with particular gains in preventing high-impact structural failures.
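To make the idea of a profile-aware reward concrete, here is a minimal Python sketch. It is not the paper's reward function: the `CaseProfile` class, the `shaped_reward` function, the field names, the error label, and every numeric weight are hypothetical choices made only for illustration. It assumes a complexity profile can be reduced to a reasoning-token budget per task type and a failure profile to a penalty weight per structural error type.

```python
from dataclasses import dataclass, field


@dataclass
class CaseProfile:
    """Hypothetical per-task-type statistics mined from past trajectories."""
    # Complexity profile (assumed form): reasoning length that tended to suffice.
    token_budget: int
    # Failure profile (assumed form): structural error type -> penalty weight.
    failure_weights: dict[str, float] = field(default_factory=dict)


def shaped_reward(
    executed_ok: bool,
    reasoning_tokens: int,
    structural_errors: list[str],
    profile: CaseProfile,
    length_coef: float = 0.2,      # illustrative value, not from the paper
    default_penalty: float = 0.1,  # illustrative value, not from the paper
) -> float:
    """Combine task success, reasoning length, and structural validity."""
    if profile.token_budget <= 0:
        raise ValueError("token_budget must be positive")

    # Base signal: did the tool call execute correctly?
    base = 1.0 if executed_ok else 0.0

    # Penalize only reasoning beyond the budget, capped at length_coef.
    overrun = max(0, reasoning_tokens - profile.token_budget) / profile.token_budget
    length_penalty = length_coef * min(overrun, 1.0)

    # Errors the failure profile marks as high-impact cost more.
    failure_penalty = sum(
        profile.failure_weights.get(err, default_penalty)
        for err in structural_errors
    )
    return base - length_penalty - failure_penalty


if __name__ == "__main__":
    profile = CaseProfile(token_budget=100,
                          failure_weights={"missing_required_arg": 0.5})
    print(shaped_reward(True, 80, [], profile))
    print(shaped_reward(True, 150, [], profile))
    print(shaped_reward(False, 80, ["missing_required_arg"], profile))
```

In this sketch a correct call that stays within its budget keeps the full reward, extra reasoning is penalized only past the budget, and a structural error with a high weight in the failure profile costs more than an unlisted one. This mirrors the trade-off described above without claiming to reproduce it.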
Published as "Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use", arXiv:2605.15041.