Edit videos and their soundtracks together, not separately

Most instruction-based video editing tools ignore audio, leaving you with silent clips or soundtracks that no longer match the picture. InstructAV2AV addresses this by editing audio and video jointly from a text instruction: if you change the scene, the sound changes too. The team built InsAVE-80K, described as the first large-scale paired dataset of audio-video edits, and trained a diffusion model with gated attention mechanisms to follow instructions while preserving the original content. The reported results beat prior work on speech quality, sound-effect matching, and visual fidelity metrics.
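
To make the "gated attention" idea concrete, here is a minimal sketch of one common form of gating: a tanh-gated residual around cross-attention. This is an assumption for illustration, not the InstructAV2AV architecture, whose details the summary above does not give. The function names, the single-head attention, and the scalar `gate` parameter are all hypothetical. The point it shows is why gating helps preserve the original content: when the gate is zero, the source tokens pass through unchanged, and the instruction only influences the output as the gate opens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        attended.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return attended


def gated_cross_attention(source_tokens, instruction_tokens, gate):
    """Hypothetical gated residual: source + tanh(gate) * attention.

    With gate == 0 the source tokens are returned unchanged, so the
    conditioning signal can only alter the content as the gate opens.
    Assumes source and instruction tokens share the same dimension.
    """
    attended = cross_attention(
        source_tokens, instruction_tokens, instruction_tokens
    )
    g = math.tanh(gate)
    return [
        [s + g * a for s, a in zip(src, att)]
        for src, att in zip(source_tokens, attended)
    ]


# Toy tokens standing in for audio-video features and an encoded instruction.
source = [[1.0, 0.0], [0.0, 1.0]]
instruction = [[0.5, 0.5], [1.0, -1.0]]

unchanged = gated_cross_attention(source, instruction, gate=0.0)
edited = gated_cross_attention(source, instruction, gate=1.0)
```

In a trained model the gate would typically be a learned parameter rather than a hand-set scalar; it is fixed here only to show the two extremes.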