Why standard AI learning breaks with human-like time preferences

Almost all reinforcement learning (RL) algorithms rely on the Bellman equation, which assumes exponential discounting: the weight on a future reward shrinks by the same constant factor for every unit of delay. Human preferences and survival processes often do not fit that pattern. Their effective discount rate can change with the delay, as in hyperbolic discounting, where the rate falls as the delay grows. The authors show this mismatch is fundamental: the Bellman recursion mathematically requires both exponential decay and time homogeneity. They propose PG-DPO, a new method based on optimal control theory that skips recursion entirely, using Pontryagin's principle and Monte Carlo rollouts instead. On benchmarks with realistic discount rates, PG-DPO stays stable where traditional critic-based methods fail.
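
The contrast between exponential and hyperbolic discounting can be checked numerically. The sketch below is only an illustration, not the authors' code: the function names, the rates rho=0.1 and k=1.0, and the reward sizes are all assumptions chosen for the example. It tests the multiplicative property D(s+t) = D(s)D(t), which a one-step recursion with a fixed discount factor relies on, and then shows the preference reversal that appears when that property fails.

```python
import math


def exponential_discount(t, rho=0.1):
    """Exponential weight exp(-rho * t); rho is an illustrative rate."""
    return math.exp(-rho * t)


def hyperbolic_discount(t, k=1.0):
    """Hyperbolic weight 1 / (1 + k * t); k is an illustrative rate."""
    return 1.0 / (1.0 + k * t)


def is_multiplicative(discount, s, t, tol=1e-12):
    """Check D(s + t) == D(s) * D(t) up to a small tolerance."""
    return abs(discount(s + t) - discount(s) * discount(t)) < tol


def prefers_later(discount, delay, small=100.0, large=110.0, gap=1.0):
    """True if the larger reward at delay + gap beats the smaller one at delay."""
    return discount(delay + gap) * large > discount(delay) * small


if __name__ == "__main__":
    for name, d in [("exponential", exponential_discount),
                    ("hyperbolic", hyperbolic_discount)]:
        print(name,
              "multiplicative:", is_multiplicative(d, 2.0, 3.0),
              "| prefers later reward now:", prefers_later(d, 0.0),
              "| prefers later reward at delay 30:", prefers_later(d, 30.0))
```

Under these assumed values the exponential chooser ranks the two rewards the same way at every delay, while the hyperbolic chooser takes the smaller reward when it is immediate and switches to the larger one when both lie far in the future. That time inconsistency is the kind of behavior a single recursive value function with a constant discount factor cannot represent.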