Dense language descriptions unlock better robot learning

Bosung Kim, Ruiyi Wang, David Acuna, Jaehun Jung, Alexander Trevithick, Brandon Cui, Yejin Choi, Prithviraj Ammanabrolu

Training robot policies requires expensive demonstrations, but annotating existing video is cheap. DeMiAn generates rich language descriptions for demonstration segments along four dimensions (physical motion, scene, arm pose, and reasoning), then learns an instructor that selects task-appropriate annotations at deployment. Tested on over 1M robot clips and 50K egocentric videos, the approach improves both vision-language-action policies and world models, closing the gap to oracle performance by 3 points on RoboCasa while improving out-of-distribution generalization. The findings suggest that dense language re-annotation is a practical way to scale robot learning without new data collection.
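To make the two-stage idea concrete, here is a minimal Python sketch of a four-dimension segment annotation and a selection step. It is not the authors' implementation. The names `SegmentAnnotation`, `select_annotations`, and `word_overlap_score` are hypothetical, and the word-overlap scorer is only a stand-in for the learned instructor, whose actual form the summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four annotation dimensions named in the summary.
DIMENSIONS = ("physical_motion", "scene", "arm_pose", "reasoning")


@dataclass
class SegmentAnnotation:
    """Hypothetical container: one description per dimension for one segment."""
    physical_motion: str
    scene: str
    arm_pose: str
    reasoning: str

    def as_dict(self) -> Dict[str, str]:
        return {d: getattr(self, d) for d in DIMENSIONS}


def word_overlap_score(task: str, dimension: str, text: str) -> float:
    """Placeholder scorer (assumption): counts distinct words shared with the task.

    A learned instructor would replace this function.
    """
    return float(len(set(task.lower().split()) & set(text.lower().split())))


def select_annotations(
    annotation: SegmentAnnotation,
    task: str,
    score_fn: Callable[[str, str, str], float] = word_overlap_score,
    k: int = 2,
) -> List[str]:
    """Return the k descriptions that score_fn ranks highest for the task."""
    items = annotation.as_dict()
    ranked = sorted(
        DIMENSIONS,
        key=lambda d: score_fn(task, d, items[d]),
        reverse=True,
    )
    return [items[d] for d in ranked[:k]]


# Illustrative, made-up annotation text (not taken from any dataset).
ann = SegmentAnnotation(
    physical_motion="the gripper moves down and closes around the mug",
    scene="a mug sits on the counter next to the sink",
    arm_pose="the arm is extended with the wrist rotated downward",
    reasoning="grasping the handle first makes lifting stable",
)
selected = select_annotations(ann, task="pick up the mug from the counter", k=2)
```

With this toy scorer, the scene and physical-motion descriptions are chosen because they share the most words with the task string. Keeping the scorer as a plain callable means a trained model could be swapped in without changing the selection logic.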