Can coding agents beat native multimodal models at audio-video tasks?

Audio-video tasks don't always need models that consume every modality directly. Coding agents with only text and image access plus sandboxed tool use matched or beat specialized omnimodal models on multiple audio-video benchmarks by writing and running code that extracts the relevant signals from transcripts and frames. Adding skill injection and training Qwen models on the OmniCoding dataset narrows the remaining gaps further. The code and a new benchmark, TerminalBench-O, will be released.
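
To make "extracting relevant signals from transcripts and frames" concrete, here is a minimal Python sketch of the kind of glue code such an agent might write. It is not the authors' pipeline. The Segment type, the helper names, the keyword matching and the sample transcript are all hypothetical, and it assumes some speech-to-text tool in the sandbox has already produced a timestamped transcript.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped transcript line (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def relevant_segments(segments, keywords):
    """Return segments whose text mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [s for s in segments if any(k in s.text.lower() for k in lowered)]


def frame_timestamps(segments, per_segment=2):
    """Pick evenly spaced timestamps strictly inside each segment."""
    stamps = []
    for s in segments:
        step = (s.end - s.start) / (per_segment + 1)
        stamps.extend(round(s.start + step * (i + 1), 2) for i in range(per_segment))
    return stamps


# Made-up transcript, for illustration only.
transcript = [
    Segment(0.0, 6.0, "Welcome to the kitchen."),
    Segment(6.0, 12.0, "First, crack two eggs into the bowl."),
    Segment(12.0, 18.0, "Now whisk until smooth."),
]

# Narrow the audio side to the lines that mention the question's keyword,
# then choose where in the video to grab frames for the image-capable model.
hits = relevant_segments(transcript, ["eggs"])
stamps = frame_timestamps(hits)
print(hits[0].text, stamps)
```

In this sketch the agent would then extract frames at the selected timestamps with whatever video tool its sandbox provides and pass only those frames, along with the matching transcript lines, to the text+image model, so the model never has to consume raw audio or the full video.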