cs.CV

What do vision-language models actually learn inside?

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

May 21, 2026

Vision-language models like CLIP learn powerful representations, but what their embeddings actually encode remains poorly understood. This work introduces CEDAR, a technique that rotates embeddings into a sparse, interpretable space without increasing their dimensionality. The key insight is that entanglement is not fundamental to the representation; it depends on the basis in which the embedding is viewed. After the rotation, individual coordinates correspond to human-understandable concepts such as "dog" or "metal texture". The method achieves better sparsity-reconstruction trade-offs than previous approaches while staying more faithful to the original model's geometry.
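To make the "matter of perspective" idea concrete, here is a toy sketch in NumPy. It is not CEDAR itself: it assumes the rotation is an orthogonal matrix (the summary above does not say so), and it uses a random orthogonal matrix as a stand-in for whatever rotation the method actually learns. The names `random_rotation` and `sparsity` are hypothetical helpers written for this illustration. The sketch shows that the same vectors can look dense in one basis and sparse in another, while their dimensionality and pairwise inner products stay unchanged.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Hypothetical stand-in for a learned rotation: a random orthogonal matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def sparsity(x, tol=1e-6):
    """Fraction of coordinates whose magnitude is below tol."""
    return float(np.mean(np.abs(x) < tol))


dim = 8

# Toy "concept" vectors: one active coordinate each, so sparse by construction.
concepts = np.eye(dim)[:4]

# Assumption: the model's embeddings are a rotated (entangled) view of them.
R = random_rotation(dim)
entangled = concepts @ R.T

# Rotating back recovers the sparse view without changing the dimension.
recovered = entangled @ R

# Pairwise inner products (the geometry) agree in both views.
gram_before = entangled @ entangled.T
gram_after = recovered @ recovered.T

print("shape:", entangled.shape, "->", recovered.shape)
print("sparsity before:", sparsity(entangled))
print("sparsity after: ", sparsity(recovered))
print("geometry preserved:", np.allclose(gram_before, gram_after))
```

In this toy setup the right rotation is known in advance. The hard part, which the paper addresses, is finding a rotation that makes real embeddings sparse and interpretable.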
Published as "Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models", arXiv:2605.22679.