Catching bad robot moves before they happen

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng

Robot systems built on vision-language models often generate poor actions that either cause task failures or waste computation on world-model rollouts. Pre-VLA screens each candidate action before execution: a lightweight multimodal classifier predicts a safety confidence and an advantage score, and is trained with techniques that handle imbalanced data. On LIBERO benchmarks, it improved success rates by 7.8 percentage points, reduced the number of execution steps, and ran in 184 ms per action, so errors are caught early instead of surfacing as failures during physical execution.
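
To make the pre-execution check concrete, here is a minimal sketch of how a gate driven by the two predicted scores might look. It is an illustration under stated assumptions, not the authors' implementation: the names `Verdict` and `select_action`, the threshold values, and the rule of keeping the highest-advantage action among those that pass are all hypothetical, since the summary does not say how Pre-VLA combines its two scores. The scorer is a lookup table standing in for the multimodal classifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Verdict:
    """Hypothetical container for the two scores named in the summary."""
    safety: float     # predicted probability that the action is safe, in [0, 1]
    advantage: float  # predicted advantage of the action; higher is better


def select_action(
    candidates: Sequence[str],
    score: Callable[[str], Verdict],
    min_safety: float = 0.5,      # assumed threshold, not taken from the source
    min_advantage: float = 0.0,   # assumed threshold, not taken from the source
) -> Optional[Tuple[str, Verdict]]:
    """Return the passing candidate with the highest advantage, or None.

    A candidate is rejected before execution if either score falls below
    its threshold. None signals that every candidate was rejected, so the
    caller can resample instead of executing a poor action.
    """
    best: Optional[Tuple[str, Verdict]] = None
    for action in candidates:
        verdict = score(action)
        if verdict.safety < min_safety or verdict.advantage < min_advantage:
            continue  # screened out before any execution or rollout
        if best is None or verdict.advantage > best[1].advantage:
            best = (action, verdict)
    return best


# Toy scorer standing in for the classifier (values are made up).
TOY_SCORES = {
    "reach": Verdict(safety=0.9, advantage=0.2),
    "push": Verdict(safety=0.3, advantage=0.8),
    "grasp": Verdict(safety=0.8, advantage=0.5),
}

chosen = select_action(["reach", "push", "grasp"], TOY_SCORES.__getitem__)
```

In this toy run, `push` has the highest advantage but is rejected on safety, which is the point of checking both scores before acting. Returning `None` when nothing passes leaves the fallback policy (resample, replan, or stop) to the caller.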