One robot brain for seeing, reasoning, imagining, and acting

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Yong Dai, Jian Tang, Xiaozhu Ju

Most embodied AI systems stitch together separate expert modules for vision, planning, and control. Pelican-Unified 1.0 collapses these into one vision-language model that produces reasoning chains, predicts future video frames, and outputs robot actions in a single forward pass. A shared latent variable ties the video and action generation heads together inside the same denoising process, so gradients from all three loss signals update one representation. The single checkpoint scores 64.7 across eight VLM benchmarks (best at its scale), 66.03 on WorldArena (first place), and 93.5 on RoboTwin (second among action methods), suggesting that unification does not require trading off specialist performance.
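
The claim that several losses update one representation can be illustrated with a toy sketch. It is not the Pelican-Unified 1.0 architecture and performs no denoising. It assumes three linear heads (labeled reasoning, video, and action here as a reading of the "three loss signals") that read one shared latent vector, each with a mean-squared-error loss. All names, sizes, and the step size are hypothetical. The sketch only shows that the gradient reaching the shared latent is the sum of the per-head gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
LATENT_DIM = 8
HEADS = {"reasoning": 5, "video": 16, "action": 4}

# One shared latent vector and one linear head per output stream.
# These are toy stand-ins, not the model's actual heads.
z = rng.normal(size=LATENT_DIM)
weights = {name: rng.normal(size=(dim, LATENT_DIM)) for name, dim in HEADS.items()}
targets = {name: rng.normal(size=dim) for name, dim in HEADS.items()}


def head_loss(name, latent):
    """Mean squared error of one head's prediction from the shared latent."""
    residual = weights[name] @ latent - targets[name]
    return float(np.mean(residual ** 2))


def head_grad(name, latent):
    """Analytic gradient of head_loss with respect to the shared latent."""
    residual = weights[name] @ latent - targets[name]
    return 2.0 * weights[name].T @ residual / residual.size


def total_loss(latent):
    """Sum of the three per-head losses."""
    return sum(head_loss(name, latent) for name in HEADS)


def total_grad(latent):
    """Every head contributes a gradient to the same latent."""
    return sum(head_grad(name, latent) for name in HEADS)


# One small gradient step on the shared latent using the summed gradient.
step = 0.01
loss_before = total_loss(z)
z_updated = z - step * total_grad(z)
loss_after = total_loss(z_updated)
print(f"total loss: {loss_before:.4f} -> {loss_after:.4f}")
```

In this toy setting a single update to `z` lowers the combined loss, because the step follows the sum of all three head gradients rather than any one of them.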