Why AI misreads ancient artifacts with modern eyes

Vision-language models (VLMs) struggle with temporal reasoning when interpreting cultural heritage: they apply modern frameworks to historical artifacts in ways that distort their meaning. The authors introduce TAB-VLM, a benchmark of 600 questions about Indian cultural artifacts spanning the prehistoric to modern periods, and find that even GPT-4V and other leading models fail consistently. The gap persists regardless of model size or architecture, which points to a fundamental blind spot in how VLMs are trained, likely due to the underrepresentation of non-Western visual cultures in training data. The benchmark and code are released to help future work improve temporal cognition in multimodal systems.