cs.CV

Testing vision models on questions they can't answer from images alone

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

May 20, 2026

Most visual question-answering datasets test what models can see in an image. WikiVQABench instead measures whether models can combine visual content with external facts from Wikipedia and Wikidata to answer correctly. The authors curated more than 3,500 multiple-choice questions by pairing Wikipedia images with knowledge-based queries, then evaluated 15 models ranging from 256M to 90B parameters. Accuracy ranged from 25% to 76% across models, a spread that suggests the benchmark discriminates well between models on knowledge-intensive reasoning. The dataset and code are publicly released.
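As a rough illustration of how a multiple-choice benchmark of this kind is typically scored, the sketch below computes per-model accuracy and the spread between the best and worst model. It is not the authors' evaluation code: the `Item` fields and the `accuracy` and `spread` helpers are hypothetical names chosen for this example, and the released dataset may use a different schema.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not the released format)."""
    question: str
    choices: list[str]
    answer_index: int  # position of the correct choice in `choices`


def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Fraction of items where the predicted choice index matches the answer."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


def spread(scores: dict[str, float]) -> float:
    """Gap between the highest and lowest accuracy across models."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(scores.values()) - min(scores.values())
```

A wide spread across models of different sizes is what the summary means by the benchmark discriminating between models: if every model scored about the same, the questions would say little about their relative ability.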
Published as "WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata", arXiv:2605.21479