cs.CL

Where do vision-language models actually merge images and text?

Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga

June 2, 2026

When images are added to a language model, where are they actually processed? By probing several vision-language architectures, the researchers found that visual features skip the early layers and plug directly into the semantic middle of the LLM, the same intermediate layers that process text meaning. Restricting fine-tuning to these intermediate layers preserved performance on multimodal benchmarks while reducing training time, which suggests that vision-language alignment is highly localized rather than network-wide.
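To make the idea of restricting fine-tuning to intermediate layers concrete, here is a minimal sketch of how one might pick which parameters to leave trainable. It is not the paper's implementation. The parameter naming scheme (`layers.<index>.`), the helper names `layer_index` and `select_trainable`, and the band of blocks 2 to 3 in a 6-block toy model are all assumptions made for illustration; the summary does not say which layers the authors selected.

```python
import re

# Assumption: decoder blocks show up in parameter names as "layers.<index>."
# (a common convention, but not confirmed for the models in the paper).
_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def layer_index(param_name):
    """Return the block index encoded in a parameter name, or None if absent."""
    match = _LAYER_RE.search(param_name)
    return int(match.group(1)) if match else None


def select_trainable(param_names, first, last):
    """Return names of parameters in blocks first..last (inclusive).

    Everything not returned (embeddings, other blocks, output head)
    is meant to stay frozen.
    """
    selected = []
    for name in param_names:
        idx = layer_index(name)
        if idx is not None and first <= idx <= last:
            selected.append(name)
    return selected


# Toy parameter list: embeddings, six blocks, and an output head.
names = ["embed_tokens.weight"]
names += [f"layers.{i}.mlp.weight" for i in range(6)]
names += ["lm_head.weight"]

# Hypothetical intermediate band: blocks 2 and 3.
trainable = select_trainable(names, first=2, last=3)
print(trainable)  # ['layers.2.mlp.weight', 'layers.3.mlp.weight']
```

In a real training setup, the returned names would decide which parameters keep gradients enabled and which are frozen. A vision encoder may use the same `layers.<index>.` naming, so a real model would likely need an additional prefix filter to target only the language model's blocks.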
Published as "Visual Instruction Tuning Aligns Modalities through Abstraction", arXiv:2606.03871