Building searchable memory for understanding month-long videos
Aiden Yiliu Li, Nels Numan, Anthony Steed
May 15, 2026
Long-video understanding breaks down when the video is compressed into a fixed context window or a latent representation. VAM instead combines three components: online indexing, which selects key visual evidence under streaming constraints; hierarchical memory, which organizes that evidence spatially and temporally for fast search; and agentic retrieval, which verifies candidate evidence before answering a question. On OVO-Bench, VAM scores 68.41, compared with 67.46 for Gemini 3 Flash alone, a gain of 0.95 points. On month-scale videos totaling 105 hours, it reaches 17.11% accuracy. The system treats memory as an explicit, queryable substrate rather than as hidden state. The code has been released.
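To make the three-stage design concrete, here is a minimal toy sketch in Python. It is not VAM's implementation: every name in it (`Evidence`, `HierarchicalMemory`, `index_stream`, `answer`) is hypothetical, text captions stand in for visual features, a word-overlap novelty test stands in for key-evidence selection, and a keyword check stands in for the model-based verification step. It only illustrates the flow the summary describes: filter a stream online, store what survives under place and time keys, then verify retrieved candidates before returning them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Evidence:
    t: float       # timestamp in seconds from the start of the video
    place: str     # coarse location label (assumed to be available)
    caption: str   # text stand-in for the visual content of a key frame


class HierarchicalMemory:
    """Toy two-level index: place -> time bucket -> list of evidence."""

    def __init__(self, bucket_seconds=3600.0):
        self.bucket_seconds = bucket_seconds
        self.index = defaultdict(lambda: defaultdict(list))

    def add(self, ev):
        bucket = int(ev.t // self.bucket_seconds)
        self.index[ev.place][bucket].append(ev)

    def search(self, place=None, t_range=None):
        lo, hi = t_range if t_range is not None else (float("-inf"), float("inf"))
        places = [place] if place is not None else list(self.index)
        hits = []
        for p in places:
            for bucket, items in self.index.get(p, {}).items():
                start = bucket * self.bucket_seconds
                # Skip whole buckets that cannot overlap the time range.
                if start + self.bucket_seconds <= lo or start > hi:
                    continue
                hits.extend(e for e in items if lo <= e.t <= hi)
        return sorted(hits, key=lambda e: e.t)


def jaccard(a, b):
    """Word-overlap similarity between two captions, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def index_stream(stream, memory, novelty_threshold=0.5):
    """Single pass over the stream; keep an item only if it looks new
    compared with the last kept item. Returns the number kept."""
    last, kept = None, 0
    for ev in stream:
        is_new = (
            last is None
            or ev.place != last.place
            or jaccard(ev.caption, last.caption) < novelty_threshold
        )
        if is_new:
            memory.add(ev)
            last, kept = ev, kept + 1
    return kept


def answer(memory, keywords, place=None, t_range=None):
    """Retrieve candidates, then keep only those that pass verification.
    The keyword check is a placeholder for a model-based verifier."""
    candidates = memory.search(place=place, t_range=t_range)
    return [
        e for e in candidates
        if all(k.lower() in e.caption.lower() for k in keywords)
    ]


stream = [
    Evidence(10, "kitchen", "person opens the fridge"),
    Evidence(12, "kitchen", "person opens the fridge door"),  # near-duplicate
    Evidence(4000, "kitchen", "person pours coffee into a mug"),
    Evidence(4100, "office", "person types on a laptop"),
    Evidence(90000, "kitchen", "person washes a coffee mug"),
]
memory = HierarchicalMemory()
kept = index_stream(stream, memory)
coffee_early = answer(memory, ["coffee"], place="kitchen", t_range=(0, 7200))
```

In this toy run the near-duplicate frame at 12 seconds is dropped during indexing, and the time-bounded query returns only the evidence at 4000 seconds: the fridge frame is retrieved as a candidate but fails verification, and the bucket holding the frame at 90000 seconds is skipped without being scanned. Bucketing by place and hour is one simple way to keep search cost tied to the size of the queried region rather than to the length of the video.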