cs.CV

How to pick out one object in a crowded scene without manual tracing?

Quynh Phung, Sandesh Ghimire, Minsi Hu, Chung-Chi Tsai, Jia-Bin Huang

May 29, 2026

Existing personalization methods struggle when an image contains multiple objects: they either require manual segmentation masks or fail to separate one concept from another. UniVerse removes the need for masks by learning to decompose a complex scene into individual concept representations inside diffusion transformers, then recompose them. In the authors' tests, it localizes target objects more accurately and generates higher-fidelity results than prior work.
Published as UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization (arXiv:2606.00351)