Multiplying training data by rephrasing instead of repeating

Language model pretraining now faces a data bottleneck: there is not enough human-written text to train models at their current scale. SynPro addresses this by applying two operations, rephrasing and reformatting, to existing organic data, producing diverse variants without introducing false information. Both generators are tuned with reinforcement learning to maximize quality and model influence, and they are updated continuously as pretraining progresses. On 400M- and 1.1B-parameter models using only 10% of Chinchilla-optimal tokens, SynPro outperforms naive repetition and, at the larger scale, matches oracle performance (training on unique data). Code is released.
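
To give a rough sense of what "10% of Chinchilla-optimal tokens" means at these two model sizes, the arithmetic can be sketched as below. The ratio of about 20 training tokens per parameter is an assumption (a common rule-of-thumb reading of the Chinchilla result), and `organic_token_budget` is a hypothetical helper; the exact budgets used for SynPro are not stated here.

```python
# Assumption: ~20 training tokens per parameter, a common rule-of-thumb
# reading of the Chinchilla result. The text above does not state the
# exact token budgets used, so these figures are only indicative.
TOKENS_PER_PARAM = 20


def organic_token_budget(n_params: float, fraction: float = 0.10) -> float:
    """Token count when only `fraction` of the Chinchilla-optimal budget is used."""
    return n_params * TOKENS_PER_PARAM * fraction


if __name__ == "__main__":
    for name, n_params in [("400M", 400e6), ("1.1B", 1.1e9)]:
        budget = organic_token_budget(n_params)
        print(f"{name}: ~{budget / 1e9:.1f}B tokens")
```

Under this assumption the budgets come to a little under one billion tokens for the smaller model and a little over two billion for the larger one, which is the regime where naive repetition would otherwise be needed to fill out training.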