A multimodal AI that natively understands and generates 3D models
Zongyuan Yang, Mingjing Yi, Wanli Ma, Chenzhuo Fan, Bocheng Li, Baolin Liu, Yuke Lou, Yingde Song, Yongping Xiong, Zhengdong Guo, Shimu Wang
May 16, 2026
Current methods treat 3D generation as an external task disconnected from language model reasoning, relying on stateless reconstructors that lose context between edits. EVA01 embeds 3D as a native modality within multimodal large language models through a Mixture-of-Transformers design: separate Understanding and Generation experts share a global attention layer that aligns the semantic and geometric feature spaces. The system reports state-of-the-art text-to-3D generation quality and supports multi-turn editing sessions that preserve object identity, a capability not available in frame-by-frame reconstruction pipelines. Project details, and likely code, are available on the project page.
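To make the Mixture-of-Transformers idea concrete, here is a minimal sketch of one layer in plain Python. It is not EVA01's implementation: it follows the general pattern in which each modality has its own projection weights while attention runs over all tokens together. The names `Expert` and `mot_layer`, the toy dimensions, the random weights, and the absence of causal masking, multiple heads, and feed-forward blocks are all assumptions made for illustration; the summary above does not specify these details.

```python
import math
import random


def project(vec, weight):
    """Multiply a row vector by a (dim x dim) weight matrix."""
    return [sum(x * w for x, w in zip(vec, col)) for col in zip(*weight)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical per-modality weights: query, key, value, output."""

    def __init__(self, dim, rng):
        self.wq = rand_matrix(dim, dim, rng)
        self.wk = rand_matrix(dim, dim, rng)
        self.wv = rand_matrix(dim, dim, rng)
        self.wo = rand_matrix(dim, dim, rng)


def mot_layer(tokens, modality, experts):
    """One simplified layer: expert-specific projections, shared attention.

    tokens   -- list of token vectors (all the same length)
    modality -- expert name for each token, same length as tokens
    experts  -- dict mapping expert name to an Expert
    """
    dim = len(tokens[0])
    # Each token is projected by the expert that owns its modality.
    q = [project(t, experts[m].wq) for t, m in zip(tokens, modality)]
    k = [project(t, experts[m].wk) for t, m in zip(tokens, modality)]
    v = [project(t, experts[m].wv) for t, m in zip(tokens, modality)]

    scale = 1.0 / math.sqrt(dim)
    out = []
    for tok, qi, m in zip(tokens, q, modality):
        # Global attention: every token attends to tokens of both modalities.
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        mixed = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)]
        # Output projection is again expert-specific, followed by a residual add.
        projected = project(mixed, experts[m].wo)
        out.append([a + b for a, b in zip(tok, projected)])
    return out


# Toy usage: three "understanding" tokens followed by two "generation" tokens.
rng = random.Random(0)
experts = {"understanding": Expert(4, rng), "generation": Expert(4, rng)}
tokens = rand_matrix(5, 4, rng)
modality = ["understanding"] * 3 + ["generation"] * 2
out = mot_layer(tokens, modality, experts)
```

Because the attention step spans both token groups, changing a generation token also changes the outputs for understanding tokens. That coupling is the mechanism by which a shared attention layer can align two otherwise separate feature spaces.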