Learning from past mistakes to make LLMs better at using tools

Language models struggle to use external tools reliably because they must balance reasoning depth against the need for structurally valid outputs. CAST addresses this by analyzing historical execution trajectories to identify two types of patterns: complexity profiles (which reasoning strategies work for which task types) and failure profiles (which structural errors are most likely to occur). During reinforcement learning, the model learns a fine-grained reward function that internalizes these patterns, enabling it to adapt its reasoning depth to each case. On the BFCLv2 and ToolBench benchmarks, CAST improves execution accuracy by up to 5.85 percentage points while reducing average reasoning length by 26%, with particular gains in preventing high-impact structural failures.
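
To make the idea of a profile-aware reward more concrete, here is a minimal, hand-written sketch of how a reward signal might combine the two kinds of profile. It is an illustration only, not CAST's actual reward: the profile contents, the task-type and error names, the `Trajectory` fields, and the weights are all hypothetical, and the reward here is a fixed formula rather than something learned.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical complexity profile: a target reasoning budget (in tokens)
# per task type. The names and numbers are assumptions for illustration.
COMPLEXITY_PROFILE = {"single_call": 64, "multi_step": 256}

# Hypothetical failure profile: a penalty weight per structural error,
# with higher weights for errors assumed to be more damaging.
FAILURE_PROFILE = {"invalid_json": 1.0, "unknown_tool": 0.8, "missing_argument": 0.5}


@dataclass
class Trajectory:
    task_type: str
    reasoning_tokens: int
    executed_correctly: bool
    structural_errors: List[str] = field(default_factory=list)


def reward(traj: Trajectory, length_weight: float = 0.2) -> float:
    """Correctness minus a capped length penalty minus structural penalties."""
    correctness = 1.0 if traj.executed_correctly else 0.0

    # Penalize only reasoning that exceeds the budget for this task type,
    # and cap the penalty so length never outweighs correctness.
    budget = COMPLEXITY_PROFILE[traj.task_type]
    overshoot = max(0, traj.reasoning_tokens - budget) / budget
    length_penalty = length_weight * min(1.0, overshoot)

    # Unknown error types fall back to a small default weight (an assumption).
    failure_penalty = sum(FAILURE_PROFILE.get(e, 0.25) for e in traj.structural_errors)

    return correctness - length_penalty - failure_penalty


concise = Trajectory("single_call", 64, True)
verbose = Trajectory("single_call", 128, True)
broken = Trajectory("multi_step", 200, False, ["invalid_json"])
print(reward(concise), reward(verbose), reward(broken))
```

In this sketch a correct, within-budget trajectory scores highest, an equally correct but over-long one scores slightly lower, and a trajectory with a heavily weighted structural error scores lowest, which mirrors the trade-off between reasoning depth and structural validity described above.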