cs.CV

Edit videos and their soundtracks together, not separately

Haojie Zheng, Yixin Yang, Siqi Yang, Shuchen Weng, Boxin Shi

May 18, 2026

Video editing tools typically discard the audio track, leaving you with silent clips or soundtracks that no longer match the footage. InstructAV2AV addresses this by editing audio and video together from a text instruction: if you change the scene, the sound changes too. The team built InsAVE-80K, the first large-scale paired dataset of audio-video edits, and trained a diffusion model with gated attention mechanisms to follow instructions while preserving the original content. The authors report that the model outperforms prior work on speech quality, sound-effect matching, and visual fidelity metrics.
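The summary does not say how the gated attention is wired, so the following is only an illustrative sketch of one common pattern, not the paper's architecture: a cross-attention output is added to the content tokens through a scalar `tanh` gate, so a gate of zero leaves the original content untouched. The function names, the scalar gate, and the toy token values are all assumptions made for this example, and instruction and content tokens are assumed to share the same dimension.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all keys.
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out

def gated_cross_attention(content, instruction, gate):
    # Hypothetical gating: residual update scaled by tanh(gate).
    # gate = 0.0 gives tanh(0) = 0, so the content passes through unchanged.
    attended = cross_attention(content, instruction, instruction)
    g = math.tanh(gate)
    return [[c + g * a for c, a in zip(c_tok, a_tok)]
            for c_tok, a_tok in zip(content, attended)]

# Toy tokens (made up for illustration).
content = [[1.0, 0.0], [0.0, 1.0]]
instruction = [[0.5, 0.5], [1.0, -1.0]]

unchanged = gated_cross_attention(content, instruction, gate=0.0)
edited = gated_cross_attention(content, instruction, gate=1.0)
```

With the gate closed, `unchanged` equals `content`; opening the gate lets the instruction tokens shift the content tokens. This is one way a model could trade off instruction following against content preservation.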
Published as "InstructAV2AV: Instruction-Guided Audio-Video Joint Editing" (arXiv:2605.18467)