cs.CV

Teaching AI which objects go on top when drawing overlapping scenes

Ziye Li, Henghui Ding

May 20, 2026

Existing layout-grounded image generation models fail when bounding boxes overlap, because they have no notion of which object should appear in front. OcclusionFormer addresses this by explicitly modeling Z-order (layering priority) with a Diffusion Transformer that separates instances and composites them like stacked layers. The authors also built SA-Z, a dataset with pixel-level occlusion annotations, and added a queried alignment loss to lock each object in place. The result is clean, physically plausible overlaps instead of blurred textures.
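To make the "stacked layers" idea concrete, here is a minimal sketch of Z-order compositing on a character grid. It is not the paper's method: OcclusionFormer works inside a Diffusion Transformer, while this toy uses the classic painter's algorithm (draw back to front). The `Box` class, the `composite` and `visible_area` helpers, and the convention that a larger `z` means closer to the viewer are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical layout entry: a labelled box with a layering priority."""
    label: str
    x0: int
    y0: int
    x1: int  # exclusive right edge
    y1: int  # exclusive bottom edge
    z: int   # assumed convention: larger z is closer to the viewer


def composite(boxes, width, height, background="."):
    """Paint boxes back to front so higher-z boxes overwrite lower-z ones."""
    canvas = [[background] * width for _ in range(height)]
    for box in sorted(boxes, key=lambda b: b.z):
        # Clip each box to the canvas before painting it.
        for y in range(max(box.y0, 0), min(box.y1, height)):
            for x in range(max(box.x0, 0), min(box.x1, width)):
                canvas[y][x] = box.label
    return canvas


def visible_area(canvas, label):
    """Count the cells where a given label survived compositing."""
    return sum(row.count(label) for row in canvas)


# Two overlapping boxes: B sits in front of A.
layout = [
    Box("A", 0, 0, 4, 3, z=0),
    Box("B", 2, 1, 6, 4, z=1),
]
canvas = composite(layout, width=6, height=4)
rows = ["".join(row) for row in canvas]
print("\n".join(rows))
# AAAA..
# AABBBB
# AABBBB
# ..BBBB
```

Swapping the two `z` values puts A in front, so the shared 2x2 region shows A instead of B. The same pair of bounding boxes yields two different images depending on the ordering, which is the ambiguity an explicit Z-order resolves.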
Published as "OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation" (arXiv:2605.21343).