Do vision-trained AI models actually match human reading better?
Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu
May 27, 2026
Researchers compared the text-only performance of language models trained with and without vision, measuring how closely each aligned with human brain activity (fMRI) and eye movements during natural reading. Multimodal training offered no universal advantage for text processing, but vision-language models (VLMs) performed better specifically on visually rich sentences. This suggests that visual training shapes only selective aspects of how models represent language, rather than text comprehension broadly.