Can AI coding agents actually train machine learning models?

Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald

No standardized way exists to evaluate whether autonomous AI agents can design, implement, and train ML models independently. The 1GC-7RC benchmark provides seven tasks spanning diverse domains (NLP, vision, graphs, tabular, time series). In each task, the agent may modify only the training code, runs on a single GPU, and must finish within a time budget of 40–120 minutes. The authors tested five proprietary agents (Claude, GPT, Qwen) and two open-source variants, running five trials each. Results show substantial performance variation across agents and tasks, exposing differences in implicit ML knowledge, planning, and time management. The benchmark, code, and artifacts are publicly available on GitHub, with a modular design that supports extension to new tasks and multi-agent studies.
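The evaluation protocol described above (a fixed wall-clock budget per run, repeated trials per agent) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: `run_trial`, `summarize`, and `TrialResult` are hypothetical names introduced here, and the sketch assumes a training run can be launched as a single command and that each task's own evaluation supplies a numeric score.

```python
import statistics
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrialResult:
    finished: bool              # True if the command exited before the budget ran out
    returncode: Optional[int]   # None when the run was cut off by the budget
    elapsed_s: float            # wall-clock seconds actually used


def run_trial(cmd: list[str], budget_s: float) -> TrialResult:
    """Run one training command under a hard wall-clock budget (hypothetical helper)."""
    start = time.monotonic()
    try:
        # subprocess.run kills the child process when the timeout expires.
        proc = subprocess.run(cmd, timeout=budget_s, capture_output=True)
        return TrialResult(True, proc.returncode, time.monotonic() - start)
    except subprocess.TimeoutExpired:
        return TrialResult(False, None, time.monotonic() - start)


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across repeated trials."""
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


if __name__ == "__main__":
    # Stand-in for a training script; a real run would use a budget such as 40 * 60 seconds.
    result = run_trial([sys.executable, "-c", "print('training done')"], budget_s=30.0)
    print(result.finished, result.returncode)
```

Under these assumptions, five calls to `run_trial` per agent and task, followed by `summarize` over the resulting scores, would give the kind of per-agent, per-task spread that the results compare. How the benchmark itself handles runs that exceed the budget is not stated here, so the sketch only records that the run did not finish.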