Fixing language models on hard examples using label meaning

Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Richard Dufour

Rhetorical Role Labeling—assigning functional roles to sentences in legal, medical, and scientific documents—works well on average with language models but fails reliably on difficult, low-confidence cases. RISE addresses this by applying semantic reranking at inference time: it detects low-confidence predictions and reorders model outputs using contrastively learned representations of label names, leveraging the semantic information embedded in role descriptions. Tested on eight domain-specific datasets with seven models, the method yields consistent gains on hard examples. The authors also release manual hardness annotations to characterize what makes examples difficult from both model and human viewpoints (Cohen's kappa = 0.40), providing a resource for future work.