cs.CL

What hidden patterns reveal how vision-language models actually work?

Ravil Mussabayev, Rustam Mussabayev

May 30, 2026

Multimodal large language models (MLLMs) like LLaVA-NeXT and OmniFusion combine images and text, but their internal mechanics remain opaque. MLLM-Microscope measures the linearity, dimensionality, and anisotropy of token embeddings across transformer layers to map how each modality flows through the model. Both architectures show linear behavior, but their image tokens differ: LLaVA-NeXT's image tokens decline in linearity while OmniFusion's stay steady, suggesting that fusion strategy shapes internal structure more than model size does.
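To make the three measurements concrete, here is a minimal sketch of how such per-layer statistics could be computed with NumPy. The definitions are common choices from the representation-analysis literature and are assumptions here, not necessarily the ones the paper uses: anisotropy as the share of variance along the top singular direction, linearity as one minus the normalized residual of a least-squares linear map between two layers, and the participation ratio as a simple stand-in for a dimensionality estimate. The function names and the `(n_tokens, hidden_dim)` array layout are hypothetical.

```python
import numpy as np


def anisotropy(X):
    """Share of variance along the top singular direction (1/d .. 1)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def linearity(X, Y):
    """1 - normalized residual of the best linear map from X to Y.

    X and Y hold the same tokens at two layers, shape (n_tokens, hidden_dim).
    Both are centered and scaled to unit Frobenius norm, so the score lies
    in [0, 1]. It is only informative when n_tokens exceeds hidden_dim;
    otherwise least squares can fit any target exactly.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc)
    Yc = Yc / np.linalg.norm(Yc)
    A, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    residual = np.linalg.norm(Xc @ A - Yc) ** 2
    return float(1.0 - residual)


def participation_ratio(X):
    """Effective number of dimensions carrying variance.

    This is an assumed proxy for dimensionality, not the paper's estimator.
    """
    Xc = X - X.mean(axis=0)
    ev = np.linalg.svd(Xc, compute_uv=False) ** 2
    return float(ev.sum() ** 2 / np.sum(ev ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_a = rng.normal(size=(200, 16))            # synthetic "layer k" embeddings
    layer_b = layer_a @ rng.normal(size=(16, 16))   # exactly linear "layer k+1"
    print("linearity:", linearity(layer_a, layer_b))
    print("anisotropy:", anisotropy(layer_a))
    print("participation ratio:", participation_ratio(layer_a))
```

Under these assumptions, one would extract hidden states for image tokens and text tokens separately, call `linearity` on each pair of adjacent layers, and compare the resulting curves between models.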
Published as "MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models" (arXiv:2606.00909)