What hidden patterns reveal how vision-language models actually work?

Multimodal large language models (MLLMs) such as LLaVA-NeXT and OmniFusion combine images and text, but their internal mechanics remain opaque. MLLM-Microscope measures the linearity, dimensionality, and anisotropy of token embeddings across transformer layers to map how each modality flows through the model. Both architectures show linear behavior, but their image tokens behave differently: in LLaVA-NeXT, image-token linearity declines, while in OmniFusion it stays steady, suggesting that fusion strategy shapes internal structure more than model size does.
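
To give a feel for what layer-wise measurements of this kind might involve, here is a minimal Python sketch using NumPy. The definitions are assumptions chosen for illustration and are not necessarily the ones MLLM-Microscope uses: anisotropy is taken as the share of variance along the top singular direction, and linearity as how well a single linear map carries one layer's embeddings onto the next. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def anisotropy(X):
    """Share of variance along the top singular direction (assumed definition).

    X has shape (n_tokens, dim). Values near 1 mean the embeddings lie mostly
    along one direction; values near 1/dim mean they are spread evenly.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def linearity(X, Y):
    """One minus the normalized residual of the best linear map X -> Y.

    X and Y hold embeddings of the same tokens at two consecutive layers.
    Both are centered and scaled to unit Frobenius norm, so the score lies
    in [0, 1]. This is only informative when n_tokens exceeds dim, since
    otherwise least squares can fit any target exactly.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc)
    Yn = Yc / np.linalg.norm(Yc)
    A, *_ = np.linalg.lstsq(Xn, Yn, rcond=None)
    residual = np.linalg.norm(Xn @ A - Yn) ** 2
    return float(1.0 - residual)


# Synthetic stand-ins for real layer activations (hypothetical data).
rng = np.random.default_rng(0)
layer_in = rng.normal(size=(200, 8))
weights = rng.normal(size=(8, 8))

linear_out = layer_in @ weights                 # exactly linear transition
nonlinear_out = np.tanh(3.0 * linear_out)       # saturating, nonlinear transition

print("linearity (linear layer):   ", linearity(layer_in, linear_out))
print("linearity (nonlinear layer):", linearity(layer_in, nonlinear_out))
print("anisotropy (isotropic):     ", anisotropy(layer_in))
print("anisotropy (stretched):     ", anisotropy(layer_in * np.array([10.0] + [1.0] * 7)))
```

With real models, `layer_in` and `linear_out` would be replaced by the hidden states of image or text tokens at consecutive layers, and the scores would be computed separately for each modality to compare their profiles.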