Finding actions in videos without seeing them first
Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Paolo Rota, Yiming Wang, Elisa Ricci
May 21, 2026
Finding specific actions in raw video is hard when the system has never seen those action types before. Current approaches rely on vision-language models, but these struggle with fine-grained distinctions, such as telling whether an action is actually taking place or not. TEGU compensates by layering in rich linguistic context from language models and video captions, which gives the system better cues for separating subtly different actions. In tests on THUMOS14 and ActivityNet, it outperforms existing unsupervised methods without needing any labeled training data.
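To make the vision-language baseline concrete, here is a minimal sketch of how an action can be located from similarity scores alone. It is not TEGU's method. It assumes each frame and the action's text prompt have already been embedded into a shared space by some vision-language model. The names `cosine` and `localize`, the threshold value, and the toy vectors are all hypothetical, chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def localize(frame_embs, text_emb, threshold=0.5):
    """Group consecutive frames whose similarity to the text prompt
    meets the threshold into (start, end) segments, end inclusive.

    Assumption: frame_embs and text_emb come from the same
    vision-language model, so their similarities are comparable.
    """
    segments = []
    start = None
    for i, frame in enumerate(frame_embs):
        active = cosine(frame, text_emb) >= threshold
        if active and start is None:
            start = i                       # a segment opens here
        elif not active and start is not None:
            segments.append((start, i - 1)) # the segment closed on the previous frame
            start = None
    if start is not None:                   # segment runs to the last frame
        segments.append((start, len(frame_embs) - 1))
    return segments


# Toy 2-D embeddings: frames 1, 2 and 4 point the same way as the prompt.
frames = [[1, 0], [0, 1], [0, 1], [1, 0], [0, 1]]
prompt = [0, 1]
segments = localize(frames, prompt)
```

A fixed threshold like this is the part that breaks down on fine-grained cases: frames that merely look related to the action can score nearly as high as frames where it is really happening, which is the weakness the summary above attributes to current vision-language approaches.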