What do vision-language models actually learn inside?

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

Vision-language models such as CLIP learn powerful representations, but what their embeddings actually encode remains poorly understood. This work introduces CEDAR, a technique that rotates embeddings into a sparse, interpretable space without increasing their dimensionality. The key insight is that entanglement is not fundamental to the representation; it is a consequence of the coordinate system in which the embeddings are viewed. After the rotation, individual coordinates correspond to human-understandable concepts such as "dog" or "metal texture", and the method achieves better sparsity-reconstruction trade-offs than previous approaches while staying more faithful to the original model geometry.
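
The claim that entanglement can be a matter of perspective can be illustrated with a toy example. The sketch below is not CEDAR itself: it assumes that "rotation" means an orthogonal change of basis, works in two dimensions, and uses a rotation angle that is known in advance, whereas a real method would have to learn the rotation from data. All names (`rotate`, `sparse_codes`, `entangled`) are hypothetical and chosen only for illustration.

```python
import math

def rotate(v, theta):
    """Apply a 2D rotation by angle theta (an orthogonal map) to vector v."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = v
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def count_nonzero(v, tol=1e-9):
    """Number of coordinates whose magnitude exceeds a small tolerance."""
    return sum(1 for x in v if abs(x) > tol)

# Hypothetical concept-aligned codes: each vector uses a single coordinate.
sparse_codes = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]

# Simulate an "entangled" embedding space by rotating the codes by 45 degrees.
theta = math.pi / 4  # assumed known here; a real method would learn it
entangled = [rotate(v, theta) for v in sparse_codes]

# Rotating back recovers the sparse coordinates without adding dimensions.
recovered = [rotate(v, -theta) for v in entangled]

nnz_entangled = [count_nonzero(v) for v in entangled]
nnz_recovered = [count_nonzero(v) for v in recovered]

# Geometry check: a rotation leaves all pairwise inner products unchanged.
n = len(entangled)
max_dot_gap = max(
    abs(dot(entangled[i], entangled[j]) - dot(recovered[i], recovered[j]))
    for i in range(n)
    for j in range(n)
)

print("nonzeros per vector (entangled):", nnz_entangled)
print("nonzeros per vector (recovered):", nnz_recovered)
print("largest change in any inner product:", max_dot_gap)
```

In this toy setting every entangled vector has two nonzero coordinates and every recovered vector has one, while the vectors stay two-dimensional and their pairwise inner products are unchanged up to floating-point error. That is the sense in which a rotation can expose sparsity without altering the geometry of the embedding space.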