Do language models have a built-in sense of well-being?

Researchers trained language models in a maze environment with rewards and punishments, then extracted concept vectors representing success and failure. These vectors generalized well beyond the maze: the punishment vector triggered failure tokens, negative emotions, and refusals even on unrelated tasks, while the reward vector had the opposite effect. Crucially, these welfare-like representations were already present in the models before maze training, suggesting that reinforcement learning (RL) taps into latent structure rather than building it from scratch. The finding bears on how post-training shapes model behavior and on AI alignment.
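
To make the idea of a concept vector concrete, here is a minimal sketch of one common way such vectors are built and applied: take the difference between the mean activation on "punishment" examples and the mean activation on "reward" examples, then add a scaled copy of that direction to a hidden state. This is an assumption for illustration only; the summary above does not say which extraction or steering method the researchers used, and the toy activations, the dimension `dim`, and the helper names (`concept_vector`, `steer`, `projection`) are all hypothetical.

```python
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(pos_acts, neg_acts):
    """Difference of means: mean(pos_acts) - mean(neg_acts)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction (simple additive steering)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of hidden onto the normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy stand-ins for model activations (hypothetical, not real model data):
# "reward" and "punishment" examples differ along one hidden axis plus noise.
rng = random.Random(0)
dim = 8
axis = [1.0 if i == 0 else 0.0 for i in range(dim)]


def sample(sign):
    return [sign * 2.0 * a + rng.gauss(0.0, 0.5) for a in axis]


reward_acts = [sample(+1) for _ in range(200)]
punish_acts = [sample(-1) for _ in range(200)]

# Direction pointing from "reward" toward "punishment".
punish_vec = concept_vector(punish_acts, reward_acts)

# Steering a neutral state along +punish_vec or -punish_vec moves its
# projection onto that direction up or down, respectively.
neutral = [0.0] * dim
toward_punish = steer(neutral, punish_vec, alpha=1.0)
toward_reward = steer(neutral, punish_vec, alpha=-1.0)
```

In a real model, the activation lists would come from a chosen layer's hidden states on contrasting prompts, and the steered state would be fed back into the forward pass; which layer and which scale `alpha` work best are empirical choices this sketch does not address.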