Do AI model labels actually capture concepts across languages?
Sripad Karne
May 29, 2026
Researchers used Serbian digraphia (the same language written in both Latin and Cyrillic scripts) to test whether sparse autoencoder features and their natural-language labels genuinely capture semantic concepts. The features activated by identical content in different languages overlapped substantially (0.57 Jaccard similarity), but auto-interpretation labels failed to track the same concepts across scripts, missing Serbian Cyrillic meanings 4× more often than English ones. The failure rates correlate with how well represented each language and script is in the training data, suggesting that the labels describe surface patterns rather than underlying concepts.
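For readers unfamiliar with the overlap metric, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below is only an illustration of that formula applied to sets of active feature indices; the function name, the variable names, and the feature indices are hypothetical and are not taken from the paper, whose exact procedure for selecting active features is not described here.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of feature indices."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical indices of features active on the same sentence in two scripts.
features_latin = {3, 17, 42, 101, 256}
features_cyrillic = {3, 17, 42, 999}

# 3 shared features out of 6 distinct features overall.
print(jaccard(features_latin, features_cyrillic))  # 0.5
```

A score of 1.0 means the two inputs activate exactly the same features and 0.0 means they share none, so the reported 0.57 indicates that a little over half of the combined active features are shared.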