Why standard AI learning breaks with human-like time preferences
Hojin Ko, Jeonggyu Huh
May 20, 2026
Nearly all reinforcement learning (RL) algorithms are built on the Bellman equation, which assumes exponential discounting: a reward's weight shrinks by the same constant factor at every time step. Human preferences and survival processes often do not follow this pattern; under hyperbolic discounting, for example, the weight falls steeply over short delays and much more slowly over long ones. The authors show that this mismatch is fundamental: the Bellman recursion mathematically requires both exponential discounting and time homogeneity. They propose PG-DPO, a method grounded in optimal control theory that avoids the recursion entirely, relying instead on Pontryagin's principle and Monte Carlo rollouts. On benchmarks with realistic discount rates, PG-DPO remains stable where traditional critic-based methods fail.
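To see why the recursion depends on exponential discounting, a small numerical check may help. The sketch below is not taken from the paper: the function names, the discount parameters (`gamma=0.9`, `k=1.0`), and the two reward options are all illustrative assumptions. It tests the factorization d(t + s) = d(t) · d(s), which is what allows a value at time t to be written in terms of a value at time t + s, and then checks whether a choice between two rewards changes when both are pushed further into the future.

```python
# Illustrative sketch only; names and parameter values are assumptions,
# not the paper's code or experimental settings.

def exponential(t, gamma=0.9):
    """Exponential discount weight: shrinks by the factor gamma each step."""
    return gamma ** t


def hyperbolic(t, k=1.0):
    """Hyperbolic discount weight: 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def is_multiplicative(d, t, s, tol=1e-12):
    """Check the factorization d(t + s) == d(t) * d(s) for one pair (t, s)."""
    return abs(d(t + s) - d(t) * d(s)) < tol


def prefers_later(d, delay, small=(0, 1.0), large=(5, 2.0)):
    """Return True if the larger, later reward is preferred.

    Each option is a (time, reward) pair; `delay` shifts both options
    into the future by the same amount.
    """
    (t_small, r_small), (t_large, r_large) = small, large
    return d(delay + t_large) * r_large > d(delay + t_small) * r_small


if __name__ == "__main__":
    for name, d in [("exponential", exponential), ("hyperbolic", hyperbolic)]:
        print(name)
        print("  factorizes at (3, 4):", is_multiplicative(d, 3, 4))
        print("  prefers later reward now:", prefers_later(d, 0))
        print("  prefers later reward after a delay of 10:", prefers_later(d, 10))
```

With these assumed values, the exponential weight factorizes and the preference is the same at both delays, while the hyperbolic weight fails the factorization and the preference flips from the smaller, sooner reward to the larger, later one. That dependence on absolute time is the kind of behavior a single time-homogeneous Bellman recursion cannot represent.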