cs.CV

Teaching foundation models to track objects through chaos

Deyi Zhu, Yuji Wang, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou

May 21, 2026

Visual object tracking fails when objects move unpredictably, become occluded, or are surrounded by distractors. SAMOSA adapts the SAM 2 foundation model, which understands video well but does not track objects explicitly, by adding three components: a lightweight motion predictor for nonlinear dynamics, semantic cues for recovering from tracking failures, and geometric constraints for stability. On challenging benchmarks such as anti-UAV datasets, it outperforms both state-of-the-art SAM 2 variants and traditional supervised trackers. The code has been released.
Published as "Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking", arXiv:2605.22538