Can one model handle robot vision, language, and control together?

Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

A new class of foundation model, WLA, unifies world modeling from egocentric video, language reasoning, and robot action prediction in a single autoregressive Transformer. The model jointly predicts text subtasks, subgoal images, and control actions, so that world prediction guides action generation without slowing inference. It reaches a 92.94% success rate on RoboTwin2.0 and 56.5% on real-world RMBench. The 2B-parameter model runs at 40 ms per step and can learn from unlabeled cross-embodiment robot video.
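
To make the idea of joint prediction in one autoregressive sequence concrete, here is a minimal sketch. It is not the authors' implementation. The `Segment` type, the token ids, and the text, image, action ordering are all hypothetical choices made for illustration. The sketch shows only that, under a causal mask, action tokens placed last can attend to the predicted subtask text and subgoal image tokens. It also converts the reported 40 ms per step into a control rate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One modality chunk in the unified sequence (hypothetical layout)."""
    modality: str        # "text", "image", or "action"
    tokens: List[int]    # placeholder token ids, not a real vocabulary


def build_sequence(segments: List[Segment]) -> Tuple[List[int], List[str]]:
    """Flatten modality segments into one token stream with per-token labels."""
    tokens: List[int] = []
    labels: List[str] = []
    for seg in segments:
        tokens.extend(seg.tokens)
        labels.extend([seg.modality] * len(seg.tokens))
    return tokens, labels


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def control_rate_hz(latency_ms: float) -> float:
    """Convert per-step latency in milliseconds to steps per second."""
    return 1000.0 / latency_ms


# Assumed ordering: subtask text, then subgoal image, then control action.
step = [
    Segment("text", [11, 12, 13]),
    Segment("image", [201, 202, 203, 204]),
    Segment("action", [901, 902]),
]
tokens, labels = build_sequence(step)
mask = causal_mask(len(tokens))

# The last action token can see every earlier text and image token.
last = len(tokens) - 1
visible = sorted({labels[j] for j in range(last + 1) if mask[last][j]})
print(visible)
print(control_rate_hz(40.0))
```

At the reported 40 ms per step, the conversion gives 25 steps per second. How the model avoids paying for subgoal image generation at inference time is not described in the summary above, so the sketch does not attempt to model it.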