How to pick out one object in a crowded scene without manual tracing?

Existing personalization methods struggle when images contain multiple objects—they need manual segmentation or fail to separate what's what. UniVerse does this without masks by learning to decompose a complex scene into individual concept representations inside diffusion transformers, then recompose them. Tests show it localizes target objects more accurately and generates higher-fidelity results than prior work.