Rewarding tool use without reference solutions

Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

Language models can solve complex tasks by calling APIs in sequence, but existing reinforcement learning approaches struggle with multi-step composition. Outcome-based rewards are too sparse, while trajectory-supervised rewards require annotated solutions and penalize valid alternatives. TIER derives dense, interpretable rewards directly from function schemas and runtime execution, checking format validity, schema adherence, execution success, and answer correctness at each step. Because no reference trajectory is needed, any valid execution path can be rewarded, which supports multiple solution strategies and adapts when tool interfaces change. On DepthBench (tasks of 1–6 steps), TIER maintains above 90% accuracy across all depths, while trajectory-supervised rewards collapse beyond four steps. Gains are consistent on the BFCL v3 and NestFUL benchmarks.
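As a rough illustration of the four checks named in the abstract, the sketch below scores a single tool call by format validity, schema adherence, and execution success, then adds a final-answer term over the whole trajectory. It is not the authors' implementation: the tool registry `TOOLS`, the JSON call format, the equal weighting of checks, and the function names `step_reward` and `trajectory_reward` are all assumptions made for this example.

```python
import json
from typing import Any

# Hypothetical tool registry (an assumption for this sketch): each tool
# declares its parameter types and a callable that executes it.
TOOLS = {
    "add": {"params": {"a": (int, float), "b": (int, float)},
            "fn": lambda a, b: a + b},
    "div": {"params": {"a": (int, float), "b": (int, float)},
            "fn": lambda a, b: a / b},
}


def step_reward(raw_call: str, tools: dict = TOOLS) -> tuple[float, Any]:
    """Score one tool call as (fraction of checks passed, execution result)."""
    passed = 0

    # Check 1: format validity. The call must parse as JSON with the
    # expected top-level keys.
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return passed / 3, None
    passed += 1

    # Check 2: schema adherence. The tool must exist and the arguments
    # must match the declared parameter names and types.
    spec = tools.get(name) if isinstance(name, str) else None
    if spec is None or not isinstance(args, dict):
        return passed / 3, None
    if set(args) != set(spec["params"]):
        return passed / 3, None
    if not all(isinstance(args[k], t) for k, t in spec["params"].items()):
        return passed / 3, None
    passed += 1

    # Check 3: execution success. The call must run without raising.
    try:
        result = spec["fn"](**args)
    except Exception:
        return passed / 3, None
    passed += 1
    return passed / 3, result


def trajectory_reward(raw_calls: list[str], expected_answer: Any) -> float:
    """Mean step reward plus 1.0 if the last result matches the answer."""
    if not raw_calls:
        return 0.0
    total, result = 0.0, None
    for raw in raw_calls:
        reward, result = step_reward(raw)
        total += reward
    # Check 4: answer correctness. Only the final answer is compared, so
    # no reference trajectory is required.
    answer_score = 1.0 if result == expected_answer else 0.0
    return total / len(raw_calls) + answer_score
```

In this sketch, a malformed call earns nothing, a well-formed call with a missing argument earns partial credit, and any sequence of valid calls that reaches the expected answer earns the full reward regardless of which path it took. Chaining one step's output into the next step's arguments is omitted for brevity.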