Where do vision-language models actually merge images and text?
Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga
June 2, 2026
When you add images to a language model, where do they actually get processed? By probing multiple vision-language architectures, the researchers found that visual features skip the early layers and plug directly into the semantic middle of the LLM, the same layers that process text meaning. Restricting fine-tuning to just these intermediate layers preserved performance on multimodal benchmarks while cutting training time, which suggests that vision-language alignment is highly localized rather than spread across the whole network.
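To make the idea of restricting fine-tuning to intermediate layers concrete, here is a minimal sketch of how such a freeze plan might be expressed. It is not the authors' code: the parameter naming scheme (`layers.<i>.`), the helper `plan_trainable`, and the chosen band of layers are all assumptions made for illustration, and the summary does not say which layers the paper actually selected.

```python
import re

# Assumed naming scheme: transformer block parameters contain "layers.<index>."
LAYER_PATTERN = re.compile(r"\blayers\.(\d+)\.")


def plan_trainable(param_names, first_layer, last_layer):
    """Map each parameter name to True (fine-tune) or False (freeze).

    Only parameters inside blocks first_layer..last_layer (inclusive) are
    marked trainable. Anything without a layer index (embeddings, output
    head) stays frozen. The band itself is a hypothetical choice.
    """
    if first_layer > last_layer:
        raise ValueError("first_layer must not exceed last_layer")
    plan = {}
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        layer = int(match.group(1)) if match else None
        plan[name] = layer is not None and first_layer <= layer <= last_layer
    return plan


# Toy parameter list for a hypothetical 6-block model.
names = ["embed.weight"]
names += [f"layers.{i}.mlp.weight" for i in range(6)]
names += ["lm_head.weight"]

# Train only the middle two blocks (indices 2 and 3); freeze the rest.
plan = plan_trainable(names, first_layer=2, last_layer=3)
trainable = [name for name, train in plan.items() if train]
print(trainable)
```

With a PyTorch model, a plan like this could be applied by looping over `model.named_parameters()` and setting each parameter's `requires_grad` to the planned value, assuming the model's parameter names follow a similar pattern.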