Testing vision models on questions they can't answer from images alone

Most visual question-answering datasets test what models can see in images. WikiVQABench instead measures whether they can combine visual content with external facts from Wikipedia and Wikidata to answer correctly. The authors curated 3,500+ multiple-choice questions by pairing Wikipedia images with knowledge-based queries, then evaluated 15 models ranging from 256M to 90B parameters—finding accuracy gaps from 25% to 76%, showing the benchmark effectively discriminates between models on knowledge-intensive reasoning. Dataset and code are publicly released.