cs.RO

Can one model handle robot vision, language, and control together?

Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

June 4, 2026

A new class of foundation model, the World-Language-Action (WLA) model, unifies world modeling from egocentric video, language reasoning, and robot action prediction in a single autoregressive Transformer. It jointly predicts text subtasks, subgoal images, and control actions, using world prediction to guide action generation without slowing inference. It reaches a 92.94% success rate on RoboTwin2.0 and 56.5% on the real-world RMBench. The 2B-parameter model runs at 40 ms per step and can learn from unlabeled cross-embodiment robot video.
Published as "World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis" (arXiv:2606.05979).