Why do fine-tuned AI models suddenly overuse words like 'delve'?

After RLHF training, language models develop systematic word preferences—overusing "delve" or "furthermore"—that diverge from their base versions and human text. This team built an automated metric (Triangulated Preference Shift score) that isolates these shifts by comparing human standards, base models, and fine-tuned versions, eliminating the need for manual curation. Analysis across six model families suggests preference learning pushes models toward formal, prestige-oriented language patterns.