A video editor that understands what you actually mean

Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, Jiebo Luo

Video editing models work best when users provide complete specifications: exact text descriptions, reference images, and spatial masks. Aurora adds a vision-language model agent that bridges the gap between messy real-world requests and model-ready inputs. The agent interprets ambiguous user instructions, selects or generates the necessary reference images, and produces structured editing plans. Trained on supervised edit-planning data and on preference pairs for tool use, the agent transfers to compatible frozen video editing models. A new benchmark, AgentEdit-Bench, evaluates performance under realistic underspecification. Aurora outperforms instruction-only baselines on both AgentEdit-Bench and existing video editing benchmarks.
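
To make the idea of a structured editing plan concrete, the sketch below shows one way such a plan might be represented and checked for missing inputs. It is only an illustration: the `EditPlan` class, its field names, the `missing_inputs` helper, and the example file path are all hypothetical and are not taken from Aurora.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EditPlan:
    """Hypothetical model-ready plan; fields are illustrative, not Aurora's schema."""
    prompt: str                     # text description passed to the editing model
    reference_image: Optional[str]  # path or ID of a selected or generated reference
    mask_target: Optional[str]      # object or region the spatial mask should cover


def missing_inputs(plan: EditPlan) -> List[str]:
    """Return the inputs an agent would still need to select or generate."""
    missing = []
    if not plan.prompt.strip():
        missing.append("prompt")
    if plan.reference_image is None:
        missing.append("reference_image")
    if plan.mask_target is None:
        missing.append("mask_target")
    return missing


# An underspecified request as a user might phrase it (assumed example).
raw = EditPlan(prompt="make the car look cooler",
               reference_image=None, mask_target=None)

# The same request after a planning step has filled the gaps (assumed example).
resolved = EditPlan(prompt="replace the sedan with a red sports car",
                    reference_image="refs/red_sports_car.png",
                    mask_target="sedan")

print(missing_inputs(raw))
print(missing_inputs(resolved))
```

In this sketch the agent's job corresponds to turning `raw` into `resolved`: once no inputs are missing, the plan could be handed to a frozen editing model unchanged.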