cs.CV

A multimodal AI that natively understands and generates 3D models

Zongyuan Yang, Mingjing Yi, Wanli Ma, Chenzhuo Fan, Bocheng Li, Baolin Liu, Yuke Lou, Yingde Song, Yongping Xiong, Zhengdong Guo, Shimu Wang

May 16, 2026

Current methods treat 3D generation as an external task disconnected from language model reasoning, relying on stateless reconstructors that lose context between edits. EVA01 embeds 3D as a native modality within multimodal large language models through a Mixture-of-Transformers design: separate Understanding and Generation experts share a global attention layer, which aligns the semantic and geometric feature spaces. The system achieves state-of-the-art text-to-3D generation quality and enables multi-turn editing sessions that preserve object identity, a capability impossible in frame-by-frame reconstruction pipelines. Further details, and likely the code, are available at the project page.
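To make the "separate experts, shared global attention" idea concrete, here is a minimal sketch in plain Python. It is not EVA01's implementation. It assumes each expert owns its own query, key, value, and output projections, and that attention is computed once over the joint token sequence. The names `Expert`, `mot_layer`, and the modality labels are hypothetical, and the weights are random placeholders.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Expert:
    """Hypothetical per-modality parameters (an assumption, not from the paper)."""

    def __init__(self, rng, dim):
        self.wq = rand_matrix(rng, dim, dim)
        self.wk = rand_matrix(rng, dim, dim)
        self.wv = rand_matrix(rng, dim, dim)
        self.wo = rand_matrix(rng, dim, dim)


def mot_layer(tokens, modalities, experts):
    """One layer: expert-specific projections, attention over ALL tokens."""
    dim = len(tokens[0])
    # Each token is projected by the expert that matches its modality.
    q = [matvec(experts[m].wq, x) for x, m in zip(tokens, modalities)]
    k = [matvec(experts[m].wk, x) for x, m in zip(tokens, modalities)]
    v = [matvec(experts[m].wv, x) for x, m in zip(tokens, modalities)]

    out = []
    for i, (x, m) in enumerate(zip(tokens, modalities)):
        # Global attention: token i attends to every token, regardless of modality.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        mixed = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)]
        # The output projection is again expert-specific; add a residual connection.
        proj = matvec(experts[m].wo, mixed)
        out.append([xi + pi for xi, pi in zip(x, proj)])
    return out


rng = random.Random(0)
DIM = 4
experts = {"understanding": Expert(rng, DIM), "generation": Expert(rng, DIM)}
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
modalities = ["understanding"] * 3 + ["generation"] * 2
output = mot_layer(tokens, modalities, experts)
```

Because attention spans the whole sequence, changing an understanding token also changes the outputs of the generation tokens. In this sketch, that is how information can flow between the two experts even though their parameters stay separate.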
Published as "EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers" (arXiv:2605.16745).