How to pick out one object in a crowded scene without manual tracing?
Quynh Phung, Sandesh Ghimire, Minsi Hu, Chung-Chi Tsai, Jia-Bin Huang
May 29, 2026
Existing personalization methods struggle when an image contains multiple objects: they either require manual segmentation masks or fail to separate one object from another. UniVerse needs no masks. It learns to decompose a complex scene into individual concept representations inside diffusion transformers and then recompose them. The reported experiments show that it localizes target objects more accurately and generates higher-fidelity results than prior work.