Edit videos and their soundtracks together, not separately
Haojie Zheng, Yixin Yang, Siqi Yang, Shuchen Weng, Boxin Shi
May 18, 2026
Most video editing tools strip away audio, leaving you with silent clips or mismatched soundtracks. InstructAV2AV addresses this by editing audio and video together from a text instruction: if you change the scene, the sound changes too. The team built InsAVE-80K, the first large-scale paired dataset of audio-video edits, and trained a diffusion model with gated attention mechanisms to follow instructions while preserving the original content. The authors report that it outperforms prior work on speech quality, sound-effect matching, and visual fidelity metrics.