Segmenting and naming any object without a vocabulary

Danyang Li, Tianhao Wu, Bin Li, Zhenyuan Chen, Yang Zhang, Yuxuan Li, Ming-Ming Cheng, Xiang Li

Open-world segmentation must handle an unbounded set of object categories. Foundation models such as SAM segment well but struggle to identify what they have segmented. WOW-Seg bridges this gap with Mask2Token, which converts image masks into visual tokens aligned with vision-language model embeddings, paired with Cascade Attention Masks to separate instances cleanly. The authors also release RR-7K, a region recognition benchmark covering 7,662 categories. On LVIS, WOW-Seg achieves 89.7% semantic similarity and 82.4% semantic IoU, outperforming prior work while using 8× fewer parameters. Code and model weights are released.
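To make the idea of turning a mask into a token concrete, here is a minimal sketch of masked average pooling, a generic way to summarize the features under a mask as a single vector. This is not the authors' Mask2Token design, which the summary above does not specify. The function name `mask_to_token`, the toy feature map, and the pooling choice are all assumptions made for illustration.

```python
from typing import List, Sequence


def mask_to_token(
    features: Sequence[Sequence[Sequence[float]]],
    mask: Sequence[Sequence[int]],
) -> List[float]:
    """Average the feature vectors at positions where the mask is nonzero.

    Hypothetical helper: a stand-in for a mask-to-token step, not the
    paper's actual method.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                for c in range(channels):
                    total[c] += vec[c]
                count += 1
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]


# Toy 2x2 feature map with 2 channels per position (made-up values).
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# Binary mask selecting the top row only.
mask = [
    [1, 1],
    [0, 0],
]

token = mask_to_token(features, mask)
print(token)
```

In a real system the pooled vector would typically be projected into the embedding space of the vision-language model before use. That projection is omitted here because the summary gives no details about it.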