One model to generate digital humans across text, audio, video, and motion
Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang
May 28, 2026
Creating realistic digital humans requires coordinating multiple outputs (speech, facial movement, body motion, and video), which typically demands separate specialized models. Archon unifies these in a single pretrained model that spans seven modalities (text, audio, motion, video, and more) and is trained on 72 diverse tasks. To avoid memory overload from video tokens, the authors compress video 4× using semantic reparameterization while keeping fine-grained details intact. A stepwise "thinking in modality" approach decomposes difficult cross-modal tasks into clearer intermediate steps. Archon matches or outperforms prior methods on talking head generation, full-body avatars, and motion synthesis.