When an AI's plan falls apart mid-task: building smarter replanning

Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, Kun Zhan, Sen Su, Chunxiao Liu, Ning Miao

Real-world AI agents need to detect when something unexpected happens and replan on the fly, yet existing benchmarks ignore this challenge. STT-Arena comprises 227 tasks in which spatio-temporal disruptions (such as a target object moving or disappearing) force models to abandon their strategy and adapt. Frontier LLMs score below 40%, failing in three consistent ways: acting on stale information, misidentifying what changed, and skipping verification after replanning. The authors address these failure modes with trajectory refinement plus online RL, producing a 4B-parameter model that beats all tested frontier models on this benchmark.
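To make the three failure modes concrete, here is a minimal toy sketch of an agent loop that guards against each one. It is not STT-Arena's interface or the authors' method: the `World` class, the 1-D track, and the `diff`, `plan_path`, and `run` helpers are all hypothetical names invented for illustration. The loop re-observes before every action (to avoid acting on stale information), diffs the new observation against the last one (to identify what changed), and checks the goal against a fresh observation before declaring success (verification after replanning).

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Hypothetical toy environment: objects sit on a 1-D track and may move or vanish."""
    positions: dict
    # Maps a step number to (object name, new position); None means the object disappears.
    disruptions: dict = field(default_factory=dict)
    step: int = 0

    def observe(self):
        return dict(self.positions)

    def tick(self):
        self.step += 1
        if self.step in self.disruptions:
            name, new_pos = self.disruptions[self.step]
            if new_pos is None:
                self.positions.pop(name, None)
            else:
                self.positions[name] = new_pos


def diff(old, new):
    """Return {name: (old_pos, new_pos)} for every object whose position differs."""
    names = old.keys() | new.keys()
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


def plan_path(start, goal):
    """Hypothetical planner: the cells to visit when walking straight from start to goal."""
    stride = 1 if goal >= start else -1
    return list(range(start + stride, goal + stride, stride))


def run(world, target, start=0, max_steps=50):
    snapshot = world.observe()
    if target not in snapshot:
        return {"status": "target_missing", "position": start, "replans": 0}
    pos, replans = start, 0
    plan = plan_path(pos, snapshot[target])
    for _ in range(max_steps):
        current = world.observe()          # re-observe: never act on a stale snapshot
        changes = diff(snapshot, current)  # identify exactly what changed
        if target in changes:
            if target not in current:
                return {"status": "target_missing", "position": pos, "replans": replans}
            plan = plan_path(pos, current[target])  # replan from the current position
            replans += 1
        snapshot = current
        if not plan:
            # Verify against a fresh observation before declaring success.
            ok = world.observe().get(target) == pos
            return {"status": "done" if ok else "failed", "position": pos, "replans": replans}
        pos = plan.pop(0)
        world.tick()
    return {"status": "timeout", "position": pos, "replans": replans}


# The cup starts at cell 3 and moves to cell 5 after the agent's second move.
moved = run(World({"cup": 3}, disruptions={2: ("cup", 5)}), "cup")
# The cup disappears after the agent's first move.
vanished = run(World({"cup": 3}, disruptions={1: ("cup", None)}), "cup")
```

In the first scenario the agent replans once and still reaches the cup; in the second it stops and reports that the target is missing instead of walking to the old location.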