Teaching foundation models to track objects through chaos

Visual object tracking fails when objects move unpredictably, get occluded, or are surrounded by distractors. SAMOSA adapts the SAM 2 foundation model—which understands video well but doesn't track explicitly—by adding three layers: a lightweight motion predictor for nonlinear dynamics, semantic cues to recover from tracking failures, and geometric constraints for stability. On challenging benchmarks like anti-UAV datasets, it beats both state-of-the-art SAM 2 variants and traditional supervised trackers, with code released.