Can AI find all the moments in a video that match one description?

Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

Current video-text models excel at finding one segment per query but fail when asked to locate multiple unrelated moments matching a single description. The authors built the first One-to-Many Temporal Grounding benchmark, with 56k examples, and designed novel reward functions that use chain-of-thought reasoning over video captions. Their approach achieves 43.65% temporal F1, beating commercial models by over 15 percentage points.
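
To make the headline metric concrete, here is a minimal sketch of one way a temporal F1 score could be computed when a single query has several ground-truth moments. The summary does not define the metric, so the duration-based precision and recall used here are an assumption, and the names `merge_intervals`, `overlap_length`, and `temporal_f1` are hypothetical, not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if end <= start:
            continue  # skip empty or inverted intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_length(a: List[Interval], b: List[Interval]) -> float:
    """Total overlapping duration of two merged, sorted interval lists."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def temporal_f1(pred: List[Interval], truth: List[Interval]) -> float:
    """Assumed definition: F1 over covered duration, not the paper's exact metric."""
    pred_m, truth_m = merge_intervals(pred), merge_intervals(truth)
    inter = overlap_length(pred_m, truth_m)
    if inter == 0:
        return 0.0
    precision = inter / sum(e - s for s, e in pred_m)
    recall = inter / sum(e - s for s, e in truth_m)
    return 2 * precision * recall / (precision + recall)


# One query, three true moments; the model finds two of them, imprecisely.
predicted = [(0.0, 10.0), (20.0, 30.0)]
ground_truth = [(5.0, 10.0), (20.0, 25.0), (40.0, 50.0)]
score = temporal_f1(predicted, ground_truth)
```

Under this assumed definition, a model that returns only one of several matching moments is penalized on recall even if that one segment is perfectly localized, which is the failure mode the summary attributes to single-segment models.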