Can AI write better code by testing itself without human examples?

Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue

Current methods for generating correct code require expensive ground-truth unit tests during training. CoSPlay sidesteps this by having the model generate both code candidates and test cases, then iteratively improve them together through a two-way feedback loop: tests expose weak code, and code behavior exposes bad tests that wrongly pass incorrect solutions. When multiple solutions tie, it picks code from the largest consensus cluster, since correct answers agree while wrong ones diverge. On four benchmarks, this nearly training-free approach lifts accuracy from 22% to 33%, matching or beating models trained with human-written tests.
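
The tie-breaking step can be illustrated with a small Python sketch. This is not the authors' implementation: the summary does not say how clusters are formed, so the sketch assumes candidates are grouped by whether they produce identical outputs on a set of probe inputs. The names `run_safely`, `select_by_consensus`, and `probe_inputs` are hypothetical, and the iterative co-improvement of code and tests is left out.

```python
from collections import defaultdict


def run_safely(fn, x):
    """Run a candidate on one input; a crash becomes part of its behavior."""
    try:
        return repr(fn(x))
    except Exception as exc:
        return f"error:{type(exc).__name__}"


def select_by_consensus(candidates, tests, probe_inputs):
    """Return the index of the chosen candidate.

    Candidates are first ranked by how many model-generated tests they pass.
    Among those tied at the top, they are grouped by their outputs on
    probe_inputs (an assumed clustering rule), and a member of the largest
    group is returned.
    """
    scores = [
        sum(run_safely(fn, x) == repr(expected) for x, expected in tests)
        for fn in candidates
    ]
    best = max(scores)
    tied = [i for i, score in enumerate(scores) if score == best]

    clusters = defaultdict(list)
    for i in tied:
        signature = tuple(run_safely(candidates[i], x) for x in probe_inputs)
        clusters[signature].append(i)

    largest = max(clusters.values(), key=len)
    return largest[0]


# Toy task: absolute value. Two candidates are right, two are wrong.
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,   # wrong for negative inputs
    lambda x: -x,  # wrong for positive inputs
]
# Weak generated tests: no negative inputs, so the identity function passes.
tests = [(3, 3), (0, 0)]
probe_inputs = [-2, 5]

chosen = select_by_consensus(candidates, tests, probe_inputs)
```

In this toy case the first three candidates all pass both tests, so the tests alone cannot separate them. On the probe inputs the two correct candidates behave identically while the identity function stands alone, so the selection falls on one of the correct pair.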