Generating high-quality audio directly from raw waveforms

Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu, Yuren Cong, Xiaohui Zhang, Fanny Yang, Belinda Zeng

WavFlow generates audio directly in raw waveform space rather than in a compressed latent space, eliminating the intermediate compression step that is standard in modern audio synthesis. The approach reshapes audio into 2D token grids via waveform patchification and applies amplitude lifting to stabilize training under flow matching. The team curated 5 million video-text-audio triplets to train the model end-to-end for semantic alignment and temporal synchronization. Results on the VGGSound and AudioCaps benchmarks match or exceed those of established latent-based methods, suggesting that intermediate compression is not required for high-fidelity synthesis.
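
As a rough illustration of the waveform patchification and flow matching ideas mentioned above, the sketch below splits a 1D waveform into a grid of fixed-length patches and builds one flow-matching training pair. Everything specific here is an assumption made for illustration and is not confirmed by the summary: the function names (`patchify`, `unpatchify`, `flow_matching_pair`), the non-overlapping patch layout, the zero-padding of the tail, and the linear interpolation path between noise and data. Amplitude lifting is left out because its definition is not given here.

```python
import math


def patchify(waveform, patch_size):
    """Split a 1D sequence of samples into non-overlapping patches.

    Returns a list of rows, each of length patch_size. The last row is
    zero-padded if the waveform length is not a multiple of patch_size
    (an assumption; the actual padding scheme is not specified).
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    n_patches = math.ceil(len(waveform) / patch_size)
    pad = n_patches * patch_size - len(waveform)
    padded = list(waveform) + [0.0] * pad
    return [padded[i * patch_size:(i + 1) * patch_size] for i in range(n_patches)]


def unpatchify(patches, length):
    """Flatten the patch grid back to a 1D waveform and drop the padding."""
    flat = [sample for row in patches for sample in row]
    return flat[:length]


def flow_matching_pair(noise, data, t):
    """Build one training pair on a linear path (assumed, not from the source).

    x_t = (1 - t) * noise + t * data is the model input at time t, and
    the regression target is the constant velocity data - noise.
    """
    x_t = [[(1 - t) * n + t * d for n, d in zip(n_row, d_row)]
           for n_row, d_row in zip(noise, data)]
    velocity = [[d - n for n, d in zip(n_row, d_row)]
                for n_row, d_row in zip(noise, data)]
    return x_t, velocity


if __name__ == "__main__":
    wave = [0.1, 0.2, 0.3, 0.4, 0.5]
    grid = patchify(wave, patch_size=2)
    print(grid)
    print(unpatchify(grid, len(wave)))
```

With this layout each row of the grid acts as one token of `patch_size` samples, and `unpatchify` recovers the original waveform exactly, so the reshaping step itself involves no learned compression.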