Teaching robots to anticipate their own moves from language

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

Controlling humanoid robots from language commands requires predicting not just immediate actions but also future physical states, such as balance shifts, foot placements, and support transfers. DAJI addresses this by introducing an explicit "joint intent" interface between language and low-level control. It has two components: DAJI-Act distills a future-aware teacher model into a deployable diffusion policy, and DAJI-Flow generates intent sequences autoregressively from language and history. On the HumanML3D motion generation and BABEL action benchmarks, the system demonstrates strong anticipatory learning and handles both single commands and streaming instruction sequences.
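To make "generates intent sequences autoregressively from language and history" concrete, the following is a minimal sketch of an autoregressive rollout loop. It is not DAJI-Flow itself: the names `rollout_intents` and `toy_next_intent`, the `window` parameter, and the representation of an intent as a plain list of floats are all assumptions made for illustration, and the toy step function stands in for whatever learned generator the system actually uses.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: one joint-intent vector per time step.
Intent = List[float]


def rollout_intents(
    command: str,
    history: Sequence[Intent],
    next_intent: Callable[[str, Sequence[Intent]], Intent],
    steps: int,
    window: int = 4,
) -> List[Intent]:
    """Generate `steps` intents, feeding each output back in as history."""
    context = list(history)
    generated: List[Intent] = []
    for _ in range(steps):
        # Condition on the command and only the most recent `window` intents.
        intent = next_intent(command, context[-window:])
        generated.append(intent)
        context.append(intent)  # the new intent becomes part of the history
    return generated


def toy_next_intent(command: str, recent: Sequence[Intent]) -> Intent:
    """Illustrative stand-in for a learned generator (an assumption).

    Moves the last intent halfway toward a target derived from the
    number of words in the command.
    """
    target = float(len(command.split()))
    last = recent[-1] if recent else [0.0]
    return [x + 0.5 * (target - x) for x in last]


if __name__ == "__main__":
    intents = rollout_intents(
        "walk forward slowly", [[0.0]], toy_next_intent, steps=3
    )
    print(intents)
```

In a design like this, the generated intents would be passed on to a low-level controller (the role the text assigns to DAJI-Act). Streaming instructions could be handled by calling the rollout again with a new command while carrying the accumulated intents forward as history.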