cs.LG

Do language models have a built-in sense of well-being?

Andy Q Han, David J. Chalmers, Pavel Izmailov

May 28, 2026

Researchers trained language models in a maze environment with rewards and punishments, then extracted concept vectors representing success and failure. The vectors' effects extended well beyond the maze: the punishment vector triggered failure tokens, negative emotions, and refusals even on unrelated tasks, while the reward vector did the opposite. Crucially, these welfare-like representations were already present in the models before maze training, suggesting that reinforcement learning (RL) recruits latent structure rather than building it from scratch. The finding bears on how post-training shapes model behavior and on AI alignment.
Published as "How's it going? Reinforcement learning in language models recruits a functional welfare axis" (arXiv:2605.30232)
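
The summary above does not say how the concept vectors were extracted or applied, so the following is only a minimal sketch of one common approach, under stated assumptions: a difference-of-means direction computed from activations on "reward" versus "punishment" examples, then added to hidden states to steer them. The function names (`concept_vector`, `steer`, `projection`), the synthetic activations, and the single-axis setup are all hypothetical and are not taken from the paper.

```python
import numpy as np


def concept_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit-length difference of mean activations (positive minus negative)."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift every hidden state by alpha along a unit direction."""
    return hidden + alpha * direction


def projection(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Scalar coordinate of each hidden state along the direction."""
    return hidden @ direction


# Synthetic stand-in for model activations (assumption: real activations
# would be read from a chosen layer of a language model).
rng = np.random.default_rng(0)
dim = 16
true_axis = np.zeros(dim)
true_axis[0] = 1.0

reward_acts = rng.normal(size=(200, dim)) + 2.0 * true_axis
punish_acts = rng.normal(size=(200, dim)) - 2.0 * true_axis

# Hypothetical "welfare axis": points from punishment toward reward.
welfare_axis = concept_vector(reward_acts, punish_acts)

# Steering unrelated hidden states: positive alpha pushes toward "reward",
# negative alpha pushes toward "punishment".
unrelated = rng.normal(size=(5, dim))
toward_reward = steer(unrelated, welfare_axis, alpha=3.0)
toward_punish = steer(unrelated, welfare_axis, alpha=-3.0)

shift_up = projection(toward_reward, welfare_axis) - projection(unrelated, welfare_axis)
shift_down = projection(toward_punish, welfare_axis) - projection(unrelated, welfare_axis)
```

Because the direction is normalized, `alpha` directly sets how far each hidden state moves along the axis, which makes steering strengths comparable across layers or models in a sketch like this one. Whether the paper uses one axis or separate reward and punishment vectors is not stated in this summary.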