When debate between AI agents breeds groupthink instead of better answers

Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao, Ruiqi Xu, Shuyuan Zheng, Jianbin Qin

When multiple language models debate a problem and reach agreement, it looks like collaborative reasoning, but it might just be herd behavior. The researchers decomposed answer convergence into three mechanisms: random model instability (37%), social conformity (29%), and actual persuasion by reasoning. They found that even nonsensical "reasoning" convinces resistant models 20–39% of the time, and that harmful conformity is predictable from early signals (AUC 0.79). Targeted interventions reduced bad conformity by 13.6 points, but without ground truth, suppressing peer influence backfires, because the system can't tell beneficial agreement from harmful agreement.
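
For readers unfamiliar with the metric, an AUC such as the 0.79 quoted above can be read as the probability that a randomly chosen harmful-conformity case receives a higher risk score than a randomly chosen harmless one. The sketch below computes that quantity directly from its pairwise definition. It is only an illustration: the function name `auc`, the variable names, and the toy scores and labels are all hypothetical and are not taken from the study, whose actual predictor and signals are not described here.

```python
from itertools import product


def auc(scores, labels):
    """Pairwise AUC: chance a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): one risk score per debate,
# computed from early-round signals, and a label marking whether the
# debate later ended in harmful conformity (1) or not (0).
risk_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
harmful = [1, 1, 1, 0, 0, 0]

toy_auc = auc(risk_scores, harmful)
print(f"toy AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to guessing and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of examples, which is fine for a demonstration; a rank-based computation is the usual choice for larger datasets.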