Do AI agents actually think like humans, or just pretend?

Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu

Most LLM benchmarks measure reasoning and planning while ignoring emotional consistency and personality stability. The researchers built HEART-Bench to address this: 11 detailed character profiles, each grounded in Big Five traits and 1,000 autobiographical memories, are tested against 64 psychologically designed scenarios spanning eight dimensions, including adversity, sociality, and deception. The benchmark filters these into 673 multiple-choice questions that measure whether agents behave like coherent humans or merely pattern-match. The results show how far current LLMs fall short of genuine psychological consistency.
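To make the benchmark's shape concrete, here is a minimal sketch of how a profile, a multiple-choice question, and a per-dimension accuracy score might be represented. It is an illustration only, not the authors' code: the class names, field names, the 0-to-1 trait scale, and the accuracy metric are all assumptions, since the summary above does not describe the actual data format or scoring method.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CharacterProfile:
    """Hypothetical container for one character profile."""
    name: str
    big_five: dict[str, float]  # trait name -> score (0-1 scale is an assumption)
    memories: list[str] = field(default_factory=list)  # autobiographical memories


@dataclass
class Question:
    """Hypothetical multiple-choice item tied to a profile and a dimension."""
    profile_name: str
    dimension: str      # e.g. "adversity", "sociality", "deception"
    prompt: str
    options: list[str]
    answer_index: int   # index of the option consistent with the profile


def accuracy_by_dimension(questions: list[Question],
                          predictions: list[int]) -> dict[str, float]:
    """Return the fraction of correctly answered questions per dimension.

    Plain accuracy is an assumed metric, chosen here for simplicity.
    """
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.dimension] += 1
        if predicted == question.answer_index:
            correct[question.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}
```

Breaking the score down by dimension, rather than reporting one overall number, would show whether a model stays in character under, say, adversity but drifts under deception.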