cs.CL

Making AI text decisions explainable to human auditors

Tong Wang, Yiqing Xu, Leo Yang Yang

May 20, 2026

Machine learning models often rely on uninterpretable features: directions in embedding space that predict well but carry no meaning a human can check. The authors propose a practical standard under which each feature must pass two tests. First, independent annotators applying the feature's definition must agree (measured by Cohen's κ), which shows that the definition can be applied consistently. Second, the feature must add predictive power beyond merely rewording the target label. They built LFD, which generates candidate features from text pairs, screens them for inter-rater agreement, then selects among them by residual predictive gain. Across ten text-classification tasks, LFD matched strong baselines while producing features that human auditors consistently rated as clearer and less label-leaking. The standard matters outside NLP as well: any high-stakes classifier (hiring, lending, content moderation) needs features that auditors can actually check.
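The agreement test can be illustrated with a minimal sketch of Cohen's κ for two annotators, followed by a simple screening step. This is not the authors' implementation: the function names `cohens_kappa` and `screen_features`, the `min_kappa=0.6` cutoff, and the toy annotations are all assumptions made for illustration, and the summary does not state which threshold LFD uses.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both annotators use one identical label
        # throughout; returning 0.0 is an assumption so that such a feature
        # is screened out rather than kept.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


def screen_features(annotations, min_kappa=0.6):
    """Keep feature names whose annotator pair reaches min_kappa.

    `annotations` maps a feature name to a pair of label lists.
    The 0.6 default is a hypothetical cutoff, not one taken from the paper.
    """
    return [
        name
        for name, (a, b) in annotations.items()
        if cohens_kappa(a, b) >= min_kappa
    ]


# Toy data: two hypothetical candidate features, each annotated twice.
toy_annotations = {
    "mentions_a_deadline": (
        [1, 1, 0, 0, 1, 0, 1, 0, 1, 1],
        [1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    ),
    "sounds_professional": (
        [1, 0, 1, 0, 1, 0, 1, 0],
        [1, 1, 0, 0, 1, 0, 0, 1],
    ),
}
kept = screen_features(toy_annotations)
```

In this toy data the first feature has high agreement and survives the screen, while the second agrees only at chance level and is dropped. The second test described above, residual predictive gain, is not sketched here because the summary does not specify how the gain is computed.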
Published as "Interpretable Discriminative Text Representations via Agreement and Label Disentanglement" (arXiv:2605.20693)