Finding actions in videos without seeing them first

Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Paolo Rota, Yiming Wang, Elisa Ricci

Finding specific actions in raw video is hard when the system has never seen those action types before. Current approaches rely on vision-language models, but these struggle with fine-grained distinctions, such as telling whether an action is actually taking place at a given moment or not. TEGU compensates by layering in rich linguistic context from language models and video captions, giving the system better cues to distinguish subtle action differences. Experiments on THUMOS14 and ActivityNet show that it outperforms existing unsupervised methods without needing any labeled training data.
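
To make the task concrete, here is a minimal sketch of the generic baseline that vision-language approaches build on: score each frame against a text embedding of the action name, then group consecutive high-scoring frames into segments. It is not TEGU itself and leaves out the language-model and caption context described above. The function names, the threshold, and the toy two-dimensional embeddings are all assumptions made for illustration; a real system would get its embeddings from a vision-language model.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def localize(frame_embs, text_emb, threshold=0.5, fps=1.0):
    """Return (start_sec, end_sec) spans where frames match the text.

    Hypothetical helper: a frame counts as "action" when its similarity
    to the text embedding exceeds `threshold` (an assumed value).
    """
    scores = [cosine(f, text_emb) for f in frame_embs]
    segments, start = [], None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a segment opens at this frame
        elif score <= threshold and start is not None:
            segments.append((start / fps, i / fps))  # segment closes
            start = None
    if start is not None:  # video ended while a segment was still open
        segments.append((start / fps, len(scores) / fps))
    return segments


# Toy data standing in for model outputs (assumed, not real embeddings).
text_emb = [1.0, 0.0]
frame_embs = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [1.0, 0.0]]
segments = localize(frame_embs, text_emb, threshold=0.5, fps=1.0)
print(segments)
```

With a single hard threshold on per-frame similarity, any frame that merely looks like the action gets included, which is the kind of fine-grained failure that richer linguistic context is meant to reduce.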