Teaching foundation models to track objects through chaos
Deyi Zhu, Yuji Wang, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou
May 21, 2026
Visual object tracking fails when objects move unpredictably, get occluded, or are surrounded by distractors. SAMOSA adapts the SAM 2 foundation model, which understands video well but does not track objects explicitly, by adding three components: a lightweight motion predictor for nonlinear dynamics, semantic cues to recover from tracking failures, and geometric constraints for stability. On challenging benchmarks such as anti-UAV datasets, it outperforms both state-of-the-art SAM 2 variants and traditional supervised trackers. The code has been released.
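To give a rough sense of why a motion prior helps against distractors, here is a minimal sketch. It is not SAMOSA's actual predictor, which the summary does not describe in detail. It assumes constant acceleration over the last three boxes and blends the overlap with the predicted box into each candidate's confidence score. The names `extrapolate_box`, `select_candidate`, and `motion_weight` are hypothetical and chosen for illustration only.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def extrapolate_box(history: Sequence[Box]) -> Box:
    """Predict the next box from the last three, assuming constant acceleration.

    Illustrative stand-in for a learned motion predictor: quadratic
    extrapolation gives next = 3*b2 - 3*b1 + b0 for each coordinate.
    """
    b0, b1, b2 = history[-3:]
    return tuple(3 * c2 - 3 * c1 + c0 for c0, c1, c2 in zip(b0, b1, b2))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_candidate(
    history: Sequence[Box],
    candidates: Sequence[Box],
    confidences: Sequence[float],
    motion_weight: float = 0.5,  # hypothetical blending weight
) -> int:
    """Return the index of the candidate with the best blended score."""
    predicted = extrapolate_box(history)
    scores = [
        motion_weight * iou(predicted, box) + (1.0 - motion_weight) * conf
        for box, conf in zip(candidates, confidences)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# An accelerating target (x shifts by 1, then by 2) and a more confident distractor.
history = [(0, 0, 10, 10), (1, 0, 11, 10), (3, 0, 13, 10)]
candidates = [(6, 0, 16, 10), (30, 0, 40, 10)]
confidences = [0.6, 0.8]
best = select_candidate(history, candidates, confidences)
```

In this toy case, confidence alone would pick the distractor, while the blended score favors the candidate that lies where the accelerating target is expected to be.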