Where do vision-language models actually merge images and text?

When you add images to a language model, where do they get processed? By probing multiple vision-language architectures, researchers found that visual features skip the early layers and plug directly into the semantic middle of the LLM, the same layers that process text meaning. Restricting fine-tuning to just these intermediate layers preserved performance on multimodal benchmarks while cutting training time, which indicates that vision-language alignment is highly localized rather than network-wide.
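
To make "fine-tune only the intermediate layers" concrete, here is a minimal sketch of how one might decide which parameters stay trainable. It is not the researchers' code. The parameter naming scheme (`layers.<i>.…`), the helper `trainable_mask`, the 32-block model, and the 8 to 23 layer range are all assumptions chosen for illustration; the summary above does not say which layers were used.

```python
import re

# Assumption: parameters of decoder block i are named "...layers.<i>.<rest>".
LAYER_PATTERN = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def trainable_mask(param_names, first_layer, last_layer):
    """Return {name: bool}, True only for blocks in [first_layer, last_layer]."""
    mask = {}
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        # Anything outside a numbered block (embeddings, output head) stays frozen.
        mask[name] = bool(match) and first_layer <= int(match.group(1)) <= last_layer
    return mask


# Hypothetical parameter names for a 32-block language model.
names = [
    "embed_tokens.weight",
    "layers.0.attn.q_proj.weight",
    "layers.10.mlp.up_proj.weight",
    "layers.20.mlp.up_proj.weight",
    "layers.31.attn.q_proj.weight",
    "lm_head.weight",
]

# Hypothetical "semantic middle": blocks 8 through 23.
mask = trainable_mask(names, first_layer=8, last_layer=23)
trainable = [name for name, keep in mask.items() if keep]
print(trainable)
```

In a PyTorch training script, a mask like this could be applied by iterating over `model.named_parameters()` and setting each parameter's `requires_grad` to the corresponding value, so the optimizer only updates the chosen blocks.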