cs.CV

Building searchable memory for understanding month-long videos

Aiden Yiliu Li, Nels Numan, Anthony Steed

May 15, 2026

Long-video understanding fails when the video is compressed into a fixed context window or a latent representation. Visual Agentic Memory (VAM) introduces three components: online indexing, which selects key visual evidence under streaming constraints; hierarchical memory, which organizes that evidence spatially and temporally for fast search; and agentic retrieval, which verifies candidate evidence before answering a question. On OVO-Bench, VAM scores 68.41 (vs. 67.46 for Gemini 3 Flash alone); on 105-hour month-scale videos, it reaches 17.11% accuracy. The system treats memory as an explicit, queryable substrate rather than hidden state. Code is released.
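To make the three-stage idea concrete, here is a toy sketch of how indexing, memory, and verified retrieval could fit together. It is not the authors' implementation: the names `Entry`, `Memory`, `observe`, `search`, and `answer`, the novelty threshold, the day/place bucketing, and the distance-based ranking are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Entry:
    t: float            # seconds since the stream started
    place: str          # hypothetical spatial label, e.g. a room name
    feat: List[float]   # hypothetical frame embedding


def dist(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


@dataclass
class Memory:
    novelty: float = 0.5  # assumed threshold for keeping a frame
    # Hierarchy (assumed): day index -> place -> kept entries
    tree: Dict[int, Dict[str, List[Entry]]] = field(default_factory=dict)
    last: Optional[Entry] = None

    def observe(self, e: Entry) -> bool:
        """Online indexing: drop frames that barely differ from the last kept one."""
        if (self.last is not None and self.last.place == e.place
                and dist(self.last.feat, e.feat) < self.novelty):
            return False
        day = int(e.t // 86400)
        self.tree.setdefault(day, {}).setdefault(e.place, []).append(e)
        self.last = e
        return True

    def search(self, query: List[float], k: int = 3,
               days: Optional[set] = None,
               place: Optional[str] = None) -> List[Entry]:
        """Coarse-to-fine search: filter by day and place, then rank by distance."""
        cands: List[Entry] = []
        for d, places in self.tree.items():
            if days is not None and d not in days:
                continue
            for p, entries in places.items():
                if place is None or p == place:
                    cands.extend(entries)
        return sorted(cands, key=lambda e: dist(e.feat, query))[:k]


def answer(mem: Memory, query: List[float],
           verify: Callable[[Entry], bool], **filters) -> Optional[Entry]:
    """Retrieval with verification: return the first candidate the verifier accepts."""
    for e in mem.search(query, **filters):
        if verify(e):
            return e
    return None


mem = Memory(novelty=0.5)
stream = [
    Entry(0.0, "kitchen", [0.0, 0.0]),
    Entry(1.0, "kitchen", [0.1, 0.0]),     # near-duplicate of the previous frame
    Entry(2.0, "kitchen", [1.0, 1.0]),
    Entry(90000.0, "office", [1.0, 0.9]),  # falls on day 1
]
kept = [mem.observe(e) for e in stream]
hit = answer(mem, [1.0, 1.0], verify=lambda e: e.place == "office")
miss = answer(mem, [1.0, 1.0], verify=lambda e: e.place == "office", days={0})
```

In this sketch the verifier is a plain predicate; in a real system it would presumably be a model call that inspects the candidate frame. The nearest candidate (a kitchen frame) is rejected by the verifier, so the second-ranked office frame is returned, which is the point of verifying before answering.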
Published as "Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval", arXiv:2605.16481