What hidden patterns reveal how vision-language models actually work?
Ravil Mussabayev, Rustam Mussabayev
May 30, 2026
Multimodal large language models (MLLMs) such as LLaVA-NeXT and OmniFusion combine images and text, but their internal mechanics remain opaque. MLLM-Microscope measures the linearity, dimensionality, and anisotropy of token embeddings across transformer layers to map how each modality flows through the model. Both architectures show linear behavior, but their image tokens behave differently: in LLaVA-NeXT, image-token linearity declines, while in OmniFusion it stays steady, suggesting that fusion strategy shapes internal structure more than model size does.
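To make the three quantities concrete, here is a minimal NumPy sketch of how they might be computed for one layer's token embeddings. It is an illustration under stated assumptions, not the MLLM-Microscope implementation: the function names are hypothetical, linearity is taken as the fit quality of a least-squares linear map between two consecutive layers, anisotropy as the share of variance along the top singular direction, and dimensionality is approximated with the participation ratio, which is a simple stand-in and may differ from the estimator used in the paper.

```python
import numpy as np


def _center_and_scale(x):
    """Subtract the mean token and scale to unit Frobenius norm."""
    x = x - x.mean(axis=0, keepdims=True)
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x


def linearity_score(x, y):
    """Fit quality of the best linear map from layer x to layer y, at most 1.

    x, y: arrays of shape (n_tokens, dim) holding the same tokens at two
    consecutive layers. Assumes n_tokens > dim; otherwise the fit is
    trivially perfect. (Hypothetical helper, not the paper's code.)
    """
    x = _center_and_scale(x)
    y = _center_and_scale(y)
    a, *_ = np.linalg.lstsq(x, y, rcond=None)  # argmin_A ||xA - y||_F
    residual = np.linalg.norm(x @ a - y) ** 2
    return 1.0 - residual  # ||y||_F == 1 after scaling


def anisotropy(x):
    """Share of total variance along the top singular direction."""
    x = x - x.mean(axis=0, keepdims=True)
    s = np.linalg.svd(x, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def participation_ratio(x):
    """Effective dimensionality proxy: (sum lam)^2 / sum(lam^2).

    Assumption: used here as a simple stand-in for an intrinsic
    dimension estimator.
    """
    x = x - x.mean(axis=0, keepdims=True)
    lam = np.linalg.svd(x, compute_uv=False) ** 2
    return float(lam.sum() ** 2 / np.sum(lam ** 2))


if __name__ == "__main__":
    # Synthetic stand-ins for hidden states; real inputs would be the
    # per-layer embeddings of image or text tokens.
    rng = np.random.default_rng(0)
    layer_k = rng.normal(size=(200, 8))
    layer_k1 = np.tanh(layer_k @ rng.normal(size=(8, 8)))
    print("linearity:", linearity_score(layer_k, layer_k1))
    print("anisotropy:", anisotropy(layer_k1))
    print("participation ratio:", participation_ratio(layer_k1))
```

Computing these per layer, separately for image tokens and text tokens, would give the kind of layer-wise profile the summary describes.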