Teaching AI which objects go on top when drawing overlapping scenes

Image generation models conditioned on bounding boxes tend to fail when those boxes overlap, because nothing tells them which object should appear in front. OcclusionFormer addresses this by explicitly modeling Z-order (layering priority): a Diffusion Transformer separates the instances and composites them like stacked layers. The authors also built SA-Z, a dataset with pixel-level occlusion annotations, and added a queried alignment loss to lock each object in place. The reported result is clean, physically plausible overlaps instead of blurred textures.
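To make the "stacked layers" idea concrete, here is a minimal sketch of Z-ordered compositing on a toy grid. It is not OcclusionFormer's implementation, which works inside a Diffusion Transformer. It only illustrates how an explicit layering priority resolves which instance owns each pixel where boxes overlap. The `Instance` class, the `composite` and `visible_area` helpers, and the convention that a larger `z` means closer to the viewer are all assumptions made for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
    z: int  # assumed convention: larger z is closer to the viewer


def composite(instances, width, height):
    """Return a height x width grid naming the visible instance at each pixel."""
    canvas = [[None] * width for _ in range(height)]
    # Painter's algorithm: draw back to front so nearer instances
    # overwrite farther ones wherever their boxes overlap.
    for inst in sorted(instances, key=lambda i: i.z):
        x0, y0, x1, y1 = inst.box
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                canvas[y][x] = inst.name
    return canvas


def visible_area(canvas, name):
    """Count the pixels where the named instance is the one on top."""
    return sum(row.count(name) for row in canvas)


# Two 4x4 boxes that share a 2x2 region; the ball sits in front of the dog.
scene = [
    Instance("dog", (0, 0, 4, 4), z=1),
    Instance("ball", (2, 2, 6, 6), z=2),
]
canvas = composite(scene, width=6, height=6)
```

Swapping the two `z` values hands the shared 2x2 region to the dog instead. That one bit of ordering information is exactly what bounding boxes alone cannot express.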