Can robots learn to fold clothes from mixed demonstrations?

Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu

Household robots struggle with deformable objects such as clothing, in part because existing systems train a separate policy for each item type. DeMaVLA combines a vision-language backbone with an efficient action expert (built from pruned transformers and flow matching) and trains on 5,000 hours of real dual-arm demonstrations plus corrective trajectories collected from failed attempts. The result is a single policy that folds different clothing items across varying materials and scenes, validated in both simulation and real-robot experiments.
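To make the flow-matching ingredient concrete, the following is a minimal sketch of a flow-matching training loss and an Euler sampler for action vectors. It is not DeMaVLA's implementation: the straight-line noise-to-action path, the function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `sample_action`), and the unconditioned toy setting are all assumptions for illustration. In a real system the velocity predictor would be a network conditioned on vision-language features.

```python
import random

def interpolate(noise, action, t):
    # Assumed straight-line path from noise (t=0) to the demonstrated action (t=1).
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    # Along a straight-line path the velocity is constant: action - noise.
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(predict_velocity, action, rng):
    # Sample noise and a time t, then regress the predicted velocity at the
    # interpolated point onto the target velocity (mean squared error).
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = interpolate(noise, action, t)
    pred = predict_velocity(x_t, t)
    target = target_velocity(noise, action)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(action)

def sample_action(predict_velocity, dim, rng, steps=10):
    # Start from Gaussian noise and integrate the velocity field with Euler steps.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    # Hypothetical single demonstration and an "oracle" velocity field that
    # always points at it; a trained network would replace this oracle.
    demo = [0.2, -0.5, 1.0]
    oracle = lambda x, t: [(a - xi) / (1.0 - t) for a, xi in zip(demo, x)]
    rng = random.Random(0)
    print(flow_matching_loss(oracle, demo, rng))
    print(sample_action(oracle, len(demo), rng))
```

With the oracle field, the loss is zero up to floating-point error and the sampler lands on the demonstration; with a learned predictor, the same loss would be minimized over many demonstrations and the sampler would generate new actions at inference time.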