cs.CL

Self-training kills grammar while amplifying filler words

Ming Liu

May 20, 2026

Self-training on model outputs does not simplify language uniformly; it restructures it. Across five models and eleven generations, surface markers (discourse connectives, hedges) proliferate while deep syntactic structures (questions, passives, subjunctives) collapse. The paper formalizes this as the Structural Depth Hypothesis: a feature's decay rate is predicted primarily by how many nested dependencies it requires, not by its initial frequency. A pooled analysis across 85 feature panels reports a correlation of 0.540 for structural depth, versus 0.225 for frequency alone, and human fine-tuning shows no such pattern. The paradox is that aggregate complexity metrics rise even as the underlying clause structure dies, which has direct implications for detecting LLM-generated text and for curating training data.
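To make the depth-versus-frequency comparison concrete, here is a minimal sketch of how two candidate predictors of decay rate could be compared by correlation strength. Everything in it is an assumption for illustration: the summary does not say which correlation statistic the paper uses (Pearson is assumed here), and the six feature values are toy numbers invented for the example, not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values, NOT taken from the paper: one entry per hypothetical feature.
depth = [1, 1, 2, 3, 3, 4]                          # nested dependencies required
initial_freq = [0.30, 0.05, 0.20, 0.10, 0.25, 0.02]  # frequency before self-training
decay_rate = [-0.10, 0.00, 0.15, 0.30, 0.25, 0.45]   # negative means amplification

r_depth = pearson(depth, decay_rate)
r_freq = pearson(initial_freq, decay_rate)

print(f"depth vs decay:     r = {r_depth:.3f}")
print(f"frequency vs decay: r = {r_freq:.3f}")
```

In this toy setup the predictor with the larger absolute correlation is the better single explanation of decay, which is the shape of the comparison behind the 0.540 versus 0.225 figures; the actual analysis pools 85 feature panels and may use a different statistic.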
Published as "Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies" (arXiv:2605.20602)