cs.CL

Do AI model labels actually capture concepts across languages?

Sripad Karne

May 29, 2026

Researchers used Serbian digraphia (the same language written in both Latin and Cyrillic scripts) to test whether sparse autoencoder features and their natural-language labels genuinely capture semantic concepts. Features activated by identical content across languages overlapped substantially (0.57 Jaccard similarity), but auto-interpretation labels failed to track the same concepts across scripts, missing meanings in Serbian Cyrillic 4× more often than in English. The failures correlate with how well each language and script is represented in training data, suggesting that the labels describe surface patterns rather than underlying concepts.
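For readers unfamiliar with the overlap metric, here is a minimal sketch of Jaccard similarity between two sets of active feature indices. The feature indices and variable names are invented for illustration, and the summary does not say how the paper decides which features count as active, so treating each text as a plain set of feature IDs is an assumption.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; returns 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical feature indices (not from the paper): features active on the
# same sentence written in Latin script and in Cyrillic script.
latin_features = {3, 17, 42, 101, 256}
cyrillic_features = {3, 17, 42, 512, 777}

overlap = jaccard(latin_features, cyrillic_features)
print(f"{overlap:.2f}")  # 3 shared features out of 7 distinct ones -> 0.43
```

A score of 1.0 would mean the two versions activate exactly the same features, and 0.0 would mean they share none, so the reported 0.57 sits somewhat above the midpoint of that range.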
Published as "How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings" (arXiv:2606.00356)