q-bio.NC

Do vision-trained AI models actually match human reading better?

Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu

May 27, 2026

Researchers compared the text-only performance of language models trained with and without vision, measuring how well each aligned with human brain activity (fMRI) and eye movements during natural reading. Multimodal training offered no universal advantage for text processing, but vision-language models (VLMs) did better than text-only LLMs specifically on visually rich sentences. This suggests that visual training shapes only selective aspects of how models represent language, rather than improving text comprehension broadly.
Published as "VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading", arXiv:2605.28818