One model for all 3D perception tasks, online or offline

Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua

Most 3D perception systems are built separately for online (real-time) or offline (batch) processing, with different models for different sensors and scales. UniT unifies these settings in one framework built on a Group Autoregressive Transformer, which processes sensor observations in groups of configurable size and predicts 3D point maps without fixed anchors or scales. Adjusting the group size lets the same model run online (single-frame groups) or offline (multi-frame groups), with memory-efficient caching for long sequences and a scale-adaptive loss that learns absolute metric scale. Evaluated on ten benchmarks spanning online perception, offline reconstruction, multi-modal fusion, and long-horizon tasks, UniT achieves top performance on all of them.
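
One way to picture how a single group size could switch between online and offline operation is a group-wise causal attention mask. The sketch below is illustrative only and is not taken from the UniT implementation: the function name `group_causal_mask`, the frame-level granularity, and the assumption that frames attend freely within their own group and causally to earlier groups are all hypothetical.

```python
# Hypothetical sketch of a group-wise causal mask; not the authors' code.
# Assumption: frames in the same group see each other, and every frame
# also sees all frames in earlier groups, but none in later groups.

def group_causal_mask(num_frames: int, group_size: int) -> list[list[bool]]:
    """Return mask where mask[i][j] is True if frame i may attend to frame j."""
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Index of the group each frame belongs to.
    group_of = [i // group_size for i in range(num_frames)]
    return [
        [group_of[j] <= group_of[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Group size 1: each frame sees only itself and the past (online regime).
online = group_causal_mask(4, 1)
# Group size equal to the sequence length: every frame sees every frame
# (offline regime).
offline = group_causal_mask(4, 4)
# Intermediate group size: bidirectional inside each pair, causal across pairs.
mixed = group_causal_mask(4, 2)
```

Under these assumptions, a group size of 1 reduces to an ordinary causal mask and a group size equal to the sequence length reduces to full bidirectional attention, so the two regimes are the endpoints of one parameter.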