Building better planning tasks to train and test language models

Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou

Planning, which means coordinating goals, constraints, and long-term consequences into an executable solution, remains a weakness of large language models. The authors introduce PlanningBench, a framework that generates planning problems at scale from real-world scenarios instead of relying on static benchmarks. The system abstracts real workflows into more than 30 task types and constraint families, then synthesizes verifiable problems with controlled difficulty. Evaluation shows that current frontier models struggle when constraints are coupled, and that training on verified PlanningBench data improves performance on both unseen planning tasks and general instruction following.
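To make "verifiable problems with controlled difficulty" concrete, here is a minimal Python sketch of one toy task type: single-worker scheduling with precedence and deadline constraints. It is not PlanningBench's actual generator or API. The names `Problem`, `generate_problem`, `serial_plan`, and `verify`, and the use of task count and precedence count as difficulty knobs, are all assumptions made for illustration.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    durations: dict[str, int]            # task name -> duration in hours
    precedences: list[tuple[str, str]]   # (a, b): a must finish before b starts
    deadline: int                        # every task must finish by this hour


def generate_problem(n_tasks: int, n_precedences: int, seed: int) -> Problem:
    """Hypothetical generator: more tasks and more precedences mean a harder problem."""
    rng = random.Random(seed)
    names = [f"t{i}" for i in range(n_tasks)]
    durations = {name: rng.randint(1, 4) for name in names}
    # Edges only go from a lower index to a higher index, so the graph is acyclic.
    candidates = [
        (names[i], names[j]) for i in range(n_tasks) for j in range(i + 1, n_tasks)
    ]
    precedences = rng.sample(candidates, min(n_precedences, len(candidates)))
    # Running the tasks back to back in index order satisfies every constraint,
    # so a deadline equal to the total duration keeps the problem solvable.
    deadline = sum(durations.values())
    return Problem(durations, precedences, deadline)


def serial_plan(problem: Problem) -> dict[str, int]:
    """Reference solution: run tasks back to back in index order."""
    starts, clock = {}, 0
    for name, duration in problem.durations.items():
        starts[name] = clock
        clock += duration
    return starts


def verify(problem: Problem, starts: dict[str, int]) -> list[str]:
    """Return the violated constraints; an empty list means the plan is valid."""
    if set(starts) != set(problem.durations):
        return ["plan must assign a start time to every task and nothing else"]
    errors = []
    end = {t: starts[t] + problem.durations[t] for t in starts}
    for t in starts:
        if starts[t] < 0:
            errors.append(f"{t} starts before hour 0")
        if end[t] > problem.deadline:
            errors.append(f"{t} finishes after the deadline")
    for a, b in problem.precedences:
        if end[a] > starts[b]:
            errors.append(f"{a} must finish before {b} starts")
    # Single worker: consecutive tasks in start order must not overlap.
    ordered = sorted(starts, key=starts.get)
    for first, second in zip(ordered, ordered[1:]):
        if end[first] > starts[second]:
            errors.append(f"{first} overlaps {second}")
    return errors
```

In this sketch the constraints are coupled in the sense the summary describes: the deadline leaves no slack, so a plan that fixes a precedence violation by delaying a task will break the deadline or the no-overlap rule instead. A model's answer can be parsed into a `starts` mapping and passed to `verify`, which gives an automatic pass/fail signal that could be used for evaluation or for filtering training data.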