Multi-agent AI teams beat human creativity on hard problems

Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo, Xing Xie, Nigel Collier, David Stillwell, Luning Sun

This empirical study compares the creative output of multi-agent LLM teams with that of human teams, using 4,541 AI-generated ideas and 341 human ideas across six tasks. LLM teams outscored human teams on creativity (Cohen's d = 1.50) by favoring novelty while maintaining usefulness. An analysis of conversation semantics shows that both groups produce better ideas when the discussion is less coherent around a single theme, but through different mechanisms: LLM teams benefit from efficient semantic exploration along short paths, whereas human teams benefit from smooth conversational flow with frequent topic shifts. Model choice and discussion structure together explain 26.8% of the variance in LLM team creativity. These results point to systematic design levers for improving the creative capabilities of multi-agent systems.
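For readers unfamiliar with the effect size reported above, the following is a minimal sketch of how Cohen's d is commonly computed from two groups of scores. It assumes the pooled-standard-deviation form; the abstract does not state which variant the authors used. The function name `cohens_d` and the toy ratings are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups (pooled-SD form).

    Assumption: pooled sample standard deviation; other variants exist.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy creativity ratings, invented purely for illustration.
llm_team_scores = [2, 4, 6]
human_team_scores = [1, 3, 5]
print(cohens_d(llm_team_scores, human_team_scores))
```

Under this definition, d = 1.50 would mean the LLM-team mean lies one and a half pooled standard deviations above the human-team mean.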