cs.CV

Teaching AI to watch people interact with clothes they're trying on

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

May 20, 2026

Video virtual try-on usually just swaps garments on static poses, missing how people actually interact with clothing. iTryOn tackles interactive scenarios in which subjects grab, adjust, or manipulate their clothes. It combines two kinds of guidance: spatial guidance, which tracks 3D hand positions for precise hand-garment contact, and semantic guidance, which uses captions describing the actions at specific moments. Built on a diffusion Transformer, the system handles complex garment deformations during brief interactive moments and outperforms prior methods on both traditional and interactive benchmarks.
Published as "iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance" (arXiv:2605.21431).