cs.CL

Multiplying training data by rephrasing instead of repeating

Zichun Yu, Chenyan Xiong

May 18, 2026

Language model pretraining now faces a data bottleneck: there is not enough human-written text to feed today's models. SynPro addresses this by applying two operations, rephrasing and reformatting, to existing organic data, producing diverse versions of it without introducing false information. The generator for each operation is tuned with reinforcement learning to maximize output quality and influence on the model, and both are updated continuously as pretraining progresses. On 400M- and 1.1B-parameter models using only 10% of Chinchilla-optimal tokens, SynPro outperforms naive repetition and, at the 1.1B scale, matches the oracle (training on unique data). The code is released.
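To give a rough sense of scale for the "10% of Chinchilla-optimal tokens" setting, the sketch below works out the token counts for the two model sizes. It assumes the common rule of thumb of about 20 training tokens per parameter for a Chinchilla-optimal budget; the summary does not state the exact ratio the authors use, so the constant `TOKENS_PER_PARAM` and the helper `token_budget` are illustrative assumptions, not values taken from the paper.

```python
# Rough token-budget arithmetic for the data-bound setting described above.
# ASSUMPTION: "Chinchilla-optimal" is taken as ~20 tokens per parameter, a
# widely used rule of thumb; the paper's exact ratio may differ.
TOKENS_PER_PARAM = 20
DATA_PERCENT = 10  # share of the Chinchilla-optimal budget, from the summary


def token_budget(n_params: int) -> tuple[int, int]:
    """Return (chinchilla_optimal_tokens, tokens_at_10_percent)."""
    optimal = n_params * TOKENS_PER_PARAM
    available = optimal * DATA_PERCENT // 100
    return optimal, available


if __name__ == "__main__":
    for name, n_params in [("400M", 400_000_000), ("1.1B", 1_100_000_000)]:
        optimal, available = token_budget(n_params)
        print(f"{name}: optimal ~{optimal / 1e9:.1f}B tokens, "
              f"10% ~{available / 1e9:.1f}B tokens")
```

Under this assumption the 10% budget is on the order of a billion tokens for either model, which illustrates why getting more training signal out of the same organic data matters in this regime.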
Published as "Generating Pretraining Tokens from Organic Data for Data-Bound Scaling", arXiv:2605.17849