Do AI model labels actually capture concepts across languages?

Researchers used Serbian digraphia (the same language written in both Latin and Cyrillic scripts) to test whether sparse autoencoder features and their natural-language labels genuinely capture semantic concepts. Features activated by identical content across languages overlapped substantially (0.57 Jaccard similarity), but auto-interpretation labels failed to track the same concepts across scripts, missing Serbian Cyrillic meanings 4× more often than English ones. The failures correlate with how well represented each language and script is in the training data, suggesting that the labels describe surface patterns rather than underlying concepts.
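
For readers unfamiliar with the overlap metric, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below is only an illustration of that formula, not the researchers' code: the function name `jaccard` and the feature IDs are invented for this example, with the sets chosen so that 4 of 7 distinct features are shared, which lands near the reported 0.57.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Return |A ∩ B| / |A ∪ B|; treated as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(union)


# Hypothetical IDs of features active on the same sentence in each script.
# These are made-up values, not data from the study.
latin_features = {3, 17, 42, 101, 256}
cyrillic_features = {3, 17, 42, 101, 512, 980}

overlap = jaccard(latin_features, cyrillic_features)
print(f"{overlap:.2f}")  # 4 shared / 7 total -> 0.57
```

A score of 1.0 would mean the two inputs activate exactly the same features, and 0.0 would mean they share none.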