cs.AI

Building better planning tasks to train and test language models

Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou

May 20, 2026

Planning—coordinating goals, constraints, and long-term consequences into executable solutions—remains a weakness for large language models. Researchers created PlanningBench, a framework that generates scalable planning problems from real-world scenarios rather than relying on static benchmarks. The system abstracts workflows into 30+ task types and constraint families, then synthesizes verifiable problems with controlled difficulty. Testing shows current frontier models struggle with coupled constraints, and training on verified PlanningBench data improves performance on unseen planning tasks and general instruction-following.
Published as "PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models" (arXiv:2605.20873).