Can AI agents write their own tools reliably?

Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang, Qizhen Lan, Zhangquan Chen, Zhi Yang, Qianyu Xu, Ronghao Chen, Huacan Wang, Sen Hu

Building effective AI agents requires not just using existing tools but also generating new ones from raw materials. SkillGenBench isolates skill generation as a problem in its own right, testing whether language models can synthesize executable skills from software repositories and long-form documents. The benchmark covers two scenarios: task-specific skills, written after the task is known, and reusable skill libraries, built in advance without knowledge of the tasks they will serve. Early results show that current methods struggle significantly with skill reusability, especially when distilling procedures from documentation.