Can search agents improve themselves without external help?

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai, Lingtao Mao

Recent work on search-augmented reasoning agents stacks multiple training tricks: external supervisors, reward models, tree search, and hand-tuned bonuses. Search-E1 asks whether all this complexity is necessary. It replaces the machinery with vanilla GRPO (a policy-gradient method) plus offline self-distillation: after each training round, the model generates its own examples and learns from better versions of its own trajectories. On seven QA benchmarks, this minimal approach reaches 44% average accuracy with a 3B model, outperforming larger open-source baselines. The code will be released soon.
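As a rough illustration of the two ingredients named above, the sketch below computes GRPO-style group-relative advantages (each rollout's reward normalized by the mean and standard deviation of its group) and then applies a hypothetical filter that keeps only above-average rollouts as self-distillation targets. This is not the authors' code: the function names, the use of the population standard deviation, the `eps` constant, and the "advantage > 0" selection rule are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group of rollouts.

    Assumption: population standard deviation plus a small eps to
    avoid division by zero; implementations differ on both details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_distillation_targets(trajectories, rewards):
    """Hypothetical filter: keep rollouts that beat their group average.

    The source only says the model learns from better versions of its
    own trajectories; this selection rule is an illustrative stand-in.
    """
    advantages = group_relative_advantages(rewards)
    return [t for t, a in zip(trajectories, advantages) if a > 0]


# Four rollouts for one question, scored 1.0 (correct) or 0.0 (wrong).
rollouts = ["traj_a", "traj_b", "traj_c", "traj_d"]
scores = [1.0, 0.0, 0.0, 1.0]
kept = select_distillation_targets(rollouts, scores)
```

Note that when every rollout in a group receives the same reward, all advantages are zero, so the group contributes no policy-gradient signal and, under this filter, no distillation targets.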