cs.CV

One model to generate digital humans across text, audio, video, and motion

Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang

May 28, 2026

Creating realistic digital humans requires coordinating multiple outputs (speech, facial movement, body motion, and video), which typically demands separate specialized models. Archon replaces these with a single pretrained model spanning seven modalities (including text, audio, motion, and video), trained on 72 diverse tasks. To keep video tokens from overwhelming memory, the authors compress video 4× using semantic reparameterization while preserving fine-grained detail. A stepwise "thinking in modality" approach decomposes difficult cross-modal tasks into clearer intermediate steps. The model matches or outperforms prior methods on talking-head generation, full-body avatars, and motion synthesis.
Published as "Archon: A Unified Multimodal Model for Holistic Digital Human Generation" (arXiv:2605.30311)