Finding the surprising moments in long videos without training

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

Most frames in long videos are redundant. Swift Sampling identifies temporally surprising moments, where visual features unexpectedly diverge from their predicted trajectory, using a Taylor expansion in the visual latent space. The method adds only 0.02× computational overhead and outperforms uniform sampling across video QA benchmarks and 10 downstream tasks, with gains of up to +12.5 points on long videos with limited frame budgets.
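
To make the idea of "diverging from a predicted trajectory" concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not specify the expansion order, the visual encoder, the distance measure, or the selection rule. The sketch assumes a second-order Taylor extrapolation built from backward finite differences, an L2 distance as the surprise score, and top-k selection. The names `surprise_scores` and `select_frames` are hypothetical.

```python
import numpy as np


def surprise_scores(features):
    """Score each frame by how far it lands from a Taylor-style prediction.

    features: (T, D) array of per-frame embeddings from any visual encoder.
    Returns a (T,) array; the first three frames get 0 (not enough history).
    Assumption: second-order expansion with a unit time step.
    """
    f = np.asarray(features, dtype=np.float64)
    scores = np.zeros(len(f))
    if len(f) < 4:
        return scores

    prev1, prev2, prev3 = f[2:-1], f[1:-2], f[:-3]  # frames t-1, t-2, t-3
    velocity = prev1 - prev2                        # first backward difference
    acceleration = prev1 - 2.0 * prev2 + prev3      # second backward difference

    # f(t) ~ f(t-1) + f'(t-1) + 0.5 * f''(t-1)
    predicted = prev1 + velocity + 0.5 * acceleration

    # Surprise = L2 distance between the observed and predicted embedding.
    scores[3:] = np.linalg.norm(f[3:] - predicted, axis=1)
    return scores


def select_frames(features, budget):
    """Return the indices of the `budget` most surprising frames, in time order."""
    if budget <= 0:
        return np.array([], dtype=int)
    scores = surprise_scores(features)
    top = np.argsort(scores)[-budget:]  # indices of the largest scores
    return np.sort(top)
```

As a usage example, take features that drift linearly and then jump at one frame: the jump frame and the frames immediately after it receive nonzero scores, because the extrapolation keeps using the pre-jump history for a few steps, while the smooth stretches score zero. Under a small frame budget, `select_frames` would therefore concentrate on the region around the change rather than spreading frames evenly.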