cs.LG

Why standard AI learning breaks with human-like time preferences

Hojin Ko, Jeonggyu Huh

May 20, 2026

Almost all reinforcement learning (RL) algorithms rely on the Bellman equation, which assumes exponential discounting: the weight on a reward shrinks by the same constant factor at every time step. Humans and survival processes do not behave that way. They are often better described by hyperbolic discounting, in which the discount rate itself falls as the delay grows, so immediate and distant rewards are traded off differently. The authors show this mismatch is fundamental: the Bellman recursion mathematically requires both exponential discounting and time homogeneity. They propose PG-DPO, a method grounded in optimal control theory that avoids the recursion entirely, using Pontryagin's principle and Monte Carlo rollouts instead. On benchmarks with realistic discount rates, PG-DPO remains stable where traditional critic-based methods fail.
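To make the mismatch concrete, here is a minimal Python sketch comparing the two discounting schemes. It is not taken from the paper: the textbook forms exp(-rate * t) and 1 / (1 + k * t), the parameter values, and all function names are assumptions chosen for illustration. It checks two things: whether the one-step discount ratio is constant (the property that lets a single factor carry a recursion from one step to the next), and whether a choice between a smaller-sooner and a larger-later reward flips as both are pushed into the future.

```python
import math


def exponential_discount(t, rate=0.1):
    """Exponential weight exp(-rate * t); `rate` is an illustrative value."""
    return math.exp(-rate * t)


def hyperbolic_discount(t, k=1.0):
    """Textbook hyperbolic weight 1 / (1 + k * t); `k` is an illustrative value."""
    return 1.0 / (1.0 + k * t)


def one_step_ratios(discount, horizon):
    """Ratio D(t + 1) / D(t) for t = 0 .. horizon - 1."""
    return [discount(t + 1) / discount(t) for t in range(horizon)]


def prefers_sooner(discount, delay, small=10.0, large=15.0, gap=1):
    """True if `small` at time `delay` beats `large` at time `delay + gap`."""
    return small * discount(delay) > large * discount(delay + gap)


exp_ratios = one_step_ratios(exponential_discount, 5)
hyp_ratios = one_step_ratios(hyperbolic_discount, 5)

# Exponential: the ratio is the same at every step, so one constant factor
# relates the value at time t to the value at time t + 1.
print("exponential ratios:", [round(r, 4) for r in exp_ratios])

# Hyperbolic: the ratio drifts toward 1 as t grows, so no single factor works.
print("hyperbolic ratios: ", [round(r, 4) for r in hyp_ratios])

# Preference reversal: the hyperbolic agent takes the small reward when it is
# immediate, but picks the large one when both options are 10 steps away.
for name, d in [("exponential", exponential_discount),
                ("hyperbolic", hyperbolic_discount)]:
    print(name, "prefers sooner at delay 0:", prefers_sooner(d, 0),
          "| at delay 10:", prefers_sooner(d, 10))
```

Under these assumed parameters the exponential agent ranks the two rewards the same way at every delay, while the hyperbolic agent reverses its choice. That time inconsistency is the kind of behavior a stationary, single-factor recursion cannot represent.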
Published as "Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting" (arXiv:2605.20996)