Do vision-trained AI models actually match human reading better?

Researchers compared language models trained with and without vision on text-only input, measuring how closely each aligned with human brain activity (fMRI) and eye movements during natural reading. Multimodal training gave no across-the-board advantage for text processing, but vision-language models (VLMs) did better specifically on visually rich sentences. This suggests that visual training shapes only selective aspects of how models represent language, rather than improving text comprehension broadly.
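
To make the comparison concrete, here is a minimal sketch of how an overall alignment score might be set against a score restricted to visually rich sentences. The summary does not say which metric the researchers used, so Pearson correlation is an assumption here, and `pearson`, `alignment_by_subset`, and the `is_visual` flags are hypothetical names for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def alignment_by_subset(model_scores, human_scores, is_visual):
    """Score alignment over all sentences and over the visually rich subset.

    Assumption: one model-derived value and one human measurement
    (e.g. a reading time or an fMRI response) per sentence, plus a
    boolean flag marking which sentences count as visually rich.
    """
    visual_model = [m for m, v in zip(model_scores, is_visual) if v]
    visual_human = [h for h, v in zip(human_scores, is_visual) if v]
    return {
        "overall": pearson(model_scores, human_scores),
        "visual": pearson(visual_model, visual_human),
    }
```

Running `alignment_by_subset` once for a text-only model and once for a VLM, on the same human measurements, would show whether any gap between the two appears overall or only in the `"visual"` entry, which is the pattern the summary describes.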