Teaching AI to watch people interact with clothes they're trying on

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

Video virtual try-on usually just swaps garments on static poses, missing how people actually interact with clothing. iTryOn targets interactive scenarios, in which subjects grab, adjust, or manipulate their clothes, by combining two kinds of guidance: spatial guidance (3D hand position tracking for precise hand-garment contact) and semantic guidance (captions describing actions at specific moments). Built on a diffusion Transformer, the system handles the complex garment deformations that occur during brief interactive moments and outperforms prior methods on both traditional and interactive benchmarks.
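To make the idea of captions tied to specific moments more concrete, here is a minimal sketch of how time-stamped action captions could be mapped onto video frames before being used as conditioning. Everything in it is an assumption for illustration: the `TimedCaption` layout, the `captions_per_frame` helper, the fallback caption, and the rule that the first matching caption wins are hypothetical and are not taken from iTryOn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedCaption:
    """Hypothetical record: an action description valid for a time window."""
    start_s: float  # inclusive start time, in seconds
    end_s: float    # exclusive end time, in seconds
    text: str


def captions_per_frame(captions, num_frames, fps, default=""):
    """Return one caption per frame index.

    A frame at time t = index / fps gets the text of the first caption whose
    window contains t, or `default` when no caption is active.
    """
    per_frame = []
    for index in range(num_frames):
        t = index / fps
        active = [c.text for c in captions if c.start_s <= t < c.end_s]
        per_frame.append(active[0] if active else default)
    return per_frame


if __name__ == "__main__":
    # Illustrative data only: a short interaction between 0.5 s and 1.0 s.
    caps = [TimedCaption(0.5, 1.0, "she tugs the hem of the jacket")]
    for i, text in enumerate(
        captions_per_frame(caps, num_frames=8, fps=4.0,
                           default="a person wearing a jacket")
    ):
        print(i, text)
```

In a sketch like this, only the frames inside the interaction window receive the action caption, which mirrors the summary's point that interactions occupy brief moments of a longer clip. How the actual system encodes and injects such captions is not described here.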