One model for multimodal understanding and generation

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

Lance tackles the challenge of building an efficient model that handles understanding, generation, and editing across images and videos without scaling to massive parameter counts. The approach uses a dual-stream mixture-of-experts architecture that operates on shared interleaved multimodal sequences while decoupling the pathways for understanding and generation. A key innovation is modality-aware rotary positional encoding, which reduces interference between different visual token types. The model is trained with staged multi-task learning and adaptive data scheduling to balance semantic comprehension against visual generation quality. Experiments show that Lance outperforms existing open-source unified models on generation tasks while maintaining strong multimodal understanding. Code and project details are available.
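
To make the idea of modality-aware rotary positional encoding more concrete, the following is a minimal sketch and not Lance's actual implementation. The summary above does not specify how positions are assigned, so everything here is an assumption for illustration: the modality names, the offset table `MODALITY_OFFSETS`, and the helper `assign_positions` are hypothetical. The sketch gives each token type its own position range and then applies the standard rotary rotation, so tokens of different types never share a position index.

```python
import math

# Hypothetical per-modality position offsets (an assumption for illustration,
# not taken from the paper). Each token type gets a disjoint position range.
MODALITY_OFFSETS = {"text": 0, "und_image": 1000, "gen_image": 2000}


def assign_positions(modalities, offsets=MODALITY_OFFSETS):
    """Return one position per token: a per-modality counter plus that modality's offset."""
    counters = {}
    positions = []
    for modality in modalities:
        index = counters.get(modality, 0)
        positions.append(offsets[modality] + index)
        counters[modality] = index + 1
    return positions


def rope_rotate(vec, position, base=10000.0):
    """Standard rotary encoding: rotate each (even, odd) pair of vec by position * theta."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # per-pair frequency, as in standard RoPE
        angle = position * theta
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    """Plain dot product, used to inspect attention-style similarity scores."""
    return sum(x * y for x, y in zip(a, b))


# Example: an interleaved sequence with three token types.
sequence = ["text", "und_image", "und_image", "gen_image", "text"]
positions = assign_positions(sequence)
query = rope_rotate([1.0, 0.0, 0.5, -0.5], positions[1])
key = rope_rotate([0.2, 0.3, -0.1, 0.4], positions[3])
score = dot(query, key)
```

Because the rotation depends only on the position index, the score between two tokens of the same type depends on their relative distance, while tokens of different types are separated by the offset gap. Whether Lance uses offsets, separate frequency bands, or another scheme is not stated in the summary.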