Training radar to understand motion without expensive sensor labels

4D radar scene flow, the task of estimating how objects move in a scene, is hard to label, so current methods either produce inaccurate estimates or depend on LiDAR sensors for supervision. This work uses only camera tracking and odometry as weak supervision: it extracts instance masks from off-the-shelf 2D trackers, projects them into 3D radar space for semantic guidance, and uses vehicle motion to handle static regions. On the real-world View-of-Delft (VoD) dataset, the approach outperforms both existing cross-modal supervised methods and fully supervised baselines while being cheaper to train. Code is released.
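
The two supervision signals described above can be illustrated with a minimal NumPy sketch. It is not the released code: the function names, the pinhole camera model, the nearest-pixel mask lookup, and the convention that `T_rel` maps frame-t coordinates to frame-(t+1) coordinates are all assumptions made for illustration. The first function lifts 2D instance ids onto radar points by projecting the points into the image; the second computes the flow that ego-motion alone would induce on static points.

```python
import numpy as np


def lift_instance_ids(points, mask, K, T_cam_radar):
    """Give each radar point the instance id of the pixel it projects to.

    Hypothetical helper, assuming a pinhole camera.
    points:      (N, 3) radar points in the radar frame.
    mask:        (H, W) integer instance mask, 0 = background.
    K:           (3, 3) camera intrinsics.
    T_cam_radar: (4, 4) transform from the radar frame to the camera frame.
    Returns an (N,) int array; -1 marks points outside the camera view.
    """
    homo = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_radar @ homo.T).T[:, :3]

    ids = np.full(len(points), -1, dtype=int)
    in_front = cam[:, 2] > 1e-6  # keep only points in front of the camera

    uvw = (K @ cam[in_front].T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    h, w = mask.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(in_front)[inside]
    ids[idx] = mask[v[inside], u[inside]]
    return ids


def rigid_flow_from_odometry(points, T_rel):
    """Flow that ego-motion alone induces on static points.

    Hypothetical helper. T_rel is assumed to map coordinates expressed in
    the sensor frame at time t to the sensor frame at time t+1.
    """
    homo = np.hstack([points, np.ones((len(points), 1))])
    moved = (T_rel @ homo.T).T[:, :3]
    return moved - points
```

Under these assumptions, points with id 0 (background) could take the odometry-induced flow as a pseudo-label, while points sharing a nonzero id could be grouped as one moving instance. How the actual method combines the two signals is not specified in this summary.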