Do language models have a built-in sense of well-being?
Andy Q Han, David J. Chalmers, Pavel Izmailov
May 28, 2026
Researchers trained language models in a maze environment with rewards and punishments, then extracted concept vectors representing success and failure. The vectors generalized well beyond the maze: the punishment vector triggered failure tokens, negative emotions, and refusals even in unrelated tasks, while the reward vector had the opposite effect. Crucially, these welfare-like representations were already present in the models before maze training, suggesting that RL taps into latent structure rather than building it from scratch. The finding bears on how post-training shapes behavior and on AI alignment.
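To make the idea of a concept vector concrete, here is a minimal sketch of one common way such vectors are built and applied: a difference of mean activations, followed by adding a scaled copy of that direction to a hidden state. The summary does not say how the authors extracted or applied their vectors, so the method is an assumption. The activations are synthetic, the names `concept_vector`, `steer`, and `fake_activation` are hypothetical, and treating the punishment vector as the negation of the reward vector is a simplification made only for this illustration.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_vector(pos_acts, neg_acts):
    """Difference of means: points from the negative set toward the positive set."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Activation steering: add alpha * direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Synthetic stand-in for model hidden states (assumption: not real model data).
rng = random.Random(0)
DIM = 8
true_axis = [1.0] + [0.0] * (DIM - 1)  # hypothetical latent "outcome" axis

def fake_activation(sign):
    """Noisy activation shifted along the hypothetical outcome axis."""
    noise = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
    return [n + sign * a for n, a in zip(noise, true_axis)]

reward_acts = [fake_activation(+1) for _ in range(200)]
punish_acts = [fake_activation(-1) for _ in range(200)]

# Assumption: one shared axis, with the punishment direction as its negation.
reward_vec = concept_vector(reward_acts, punish_acts)
punish_vec = [-v for v in reward_vec]

# Steer a neutral hidden state each way, then read out its projection
# onto the reward direction to see which side it landed on.
neutral = [0.0] * DIM
toward_reward = steer(neutral, reward_vec, alpha=1.0)
toward_punish = steer(neutral, punish_vec, alpha=1.0)

score_reward = dot(toward_reward, reward_vec)
score_punish = dot(toward_punish, reward_vec)
print(score_reward > 0, score_punish < 0)
```

In a real model, the hidden states would come from a chosen layer rather than from `fake_activation`, and the effect of steering would be judged from the generated text rather than from a projection score.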