Can coding agents beat native multimodal models at audio-video tasks?
Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li, Tianyi Zhou
May 30, 2026
Audio and video tasks do not always require models that consume every modality directly. Coding agents with text and image access plus sandboxed tool use matched or beat specialized omnimodal models on multiple audio-video benchmarks by orchestrating code that extracts the relevant signals from transcripts and frames. Adding skill injection and training Qwen models on the OmniCoding dataset further closes the gaps. The code and the new TerminalBench-O benchmark will be released.
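To make the idea of extracting signals with code concrete, here is a minimal sketch of the kind of tool calls such an agent might assemble in its sandbox. It is an illustration, not the paper's pipeline: the function names, file layout, and default sampling values are hypothetical, and it assumes the ffmpeg command-line tool is available. The functions only build the commands; a transcript would come from running a separate speech-recognition tool on the extracted audio, which is not shown.

```python
def frame_extraction_cmd(video: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that samples still frames at `fps` frames per second."""
    # Hypothetical naming scheme: frame_0001.png, frame_0002.png, ...
    pattern = f"{out_dir}/frame_%04d.png"
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", pattern]


def audio_extraction_cmd(video: str, out_wav: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that drops the video stream and writes mono audio."""
    # -vn disables video, -ac 1 downmixes to mono, -ar sets the sample rate.
    return ["ffmpeg", "-i", video, "-vn", "-ac", "1", "-ar", str(sample_rate), out_wav]


def plan_signal_extraction(video: str, work_dir: str) -> dict[str, list[str]]:
    """Return the commands an agent could run to turn a video into frames and audio."""
    return {
        "frames": frame_extraction_cmd(video, work_dir),
        "audio": audio_extraction_cmd(video, f"{work_dir}/audio.wav"),
    }
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` inside the sandbox. The resulting frames can then be read through the agent's image input, and the audio handed to a transcription tool so that its content reaches the agent as text.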