Learning to recommend new items from old decisions
Ren Kishimoto, Tatsuhiro Shimizu, Kazuki Kawamura, Takanori Muroi, Yusuke Narita, Yuki Sasamoto, Kei Tateno, Takuma Udagawa, Yuta Saito
May 18, 2026
Real-world recommendation and search systems continuously introduce new items (e.g., articles, videos) after the logging policy has already collected data. This creates a cold-start problem: existing off-policy learning methods cannot learn to select actions that have no historical feedback. This work introduces PONA (Policy Optimization for Effective New Actions). PONA combines two parts: a new policy gradient estimator, LCPI, that generalizes across action feature dimensions, and a doubly robust component for learning from the logged data. A tunable weight parameter balances the selection of new actions against the exploitation of known-good existing actions. Experiments show that PONA efficiently incorporates new actions while maintaining performance on existing ones, which addresses a practical gap in recommendation and search systems.
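The summary does not give the form of LCPI or of the weighting, so the following is only a minimal sketch built on our own assumptions. It uses a hypothetical linear reward model over action features, `q_hat`, and a standard doubly robust (DR) value estimate, `dr_value`. It also uses a hypothetical mixing weight `lam` that routes probability mass to new actions. The sketch illustrates why a feature-based reward model lets an off-policy estimator score actions with no logged feedback. It is not PONA's or LCPI's actual formula, and every name and input below is illustrative.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple

Vector = Sequence[float]


def q_hat(w: Vector, x: Vector, e: Vector) -> float:
    # Hypothetical reward model q_hat(x, a) = sum_k w_k * x_k * e_{a,k}.
    # It sees an action only through its feature vector e, so it can also
    # score new actions that never appear in the logged data.
    return sum(wk * xk * ek for wk, xk, ek in zip(w, x, e))


def softmax_over(theta: Vector, x: Vector, feats: List[Vector], support: List[int]) -> List[float]:
    # Softmax over the actions listed in `support` (assumed non-empty), scored
    # with the same bilinear form; every other action gets probability zero.
    scores = {a: q_hat(theta, x, feats[a]) for a in support}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {a: math.exp(s - top) for a, s in scores.items()}
    z = sum(exps.values())
    return [exps.get(a, 0.0) / z for a in range(len(feats))]


def mixed_policy(theta: Vector, x: Vector, feats: List[Vector], new_actions: Set[int], lam: float) -> List[float]:
    # Hypothetical weighting (not the paper's): probability lam goes to new
    # actions and 1 - lam to existing ones. Both groups must be non-empty.
    existing = [a for a in range(len(feats)) if a not in new_actions]
    p_old = softmax_over(theta, x, feats, existing)
    p_new = softmax_over(theta, x, feats, sorted(new_actions))
    return [(1.0 - lam) * po + lam * pn for po, pn in zip(p_old, p_new)]


# One logged record: (context x, logged action a_i, reward r_i, pi0(a_i | x)).
Record = Tuple[Vector, int, float, float]


def dr_value(logs: List[Record], pi: Callable[[Vector], List[float]], w: Vector, feats: List[Vector]) -> float:
    # Standard doubly robust estimate of the value of policy pi:
    # (1/n) sum_i [ sum_a pi(a|x_i) q_hat(x_i, a)
    #               + pi(a_i|x_i) / pi0(a_i|x_i) * (r_i - q_hat(x_i, a_i)) ]
    # The direct-method term covers every action, new ones included; the
    # correction term can only use logged (existing) actions.
    total = 0.0
    for x, a_i, r_i, p0 in logs:
        probs = pi(x)
        dm = sum(p * q_hat(w, x, feats[a]) for a, p in enumerate(probs))
        correction = probs[a_i] / p0 * (r_i - q_hat(w, x, feats[a_i]))
        total += dm + correction
    return total / len(logs)


# Toy, made-up inputs: actions 0 and 1 appear in the logs; action 2 is new.
feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
new_actions = {2}
w = [0.5, 0.2]      # hypothetical fitted reward-model weights
theta = [1.0, 1.0]  # hypothetical policy parameters
logs = [([1.0, 0.5], 0, 0.6, 0.5), ([0.2, 1.0], 1, 0.3, 0.5)]

for lam in (0.0, 0.3, 1.0):
    pi = lambda x, lam=lam: mixed_policy(theta, x, feats, new_actions, lam)
    print(f"lam={lam}: DR value estimate = {dr_value(logs, pi, w, feats):.4f}")
```

In this toy setup, `lam = 0` never selects the new action. At `lam = 1`, the DR correction term vanishes, because the policy puts no mass on any logged action. The estimate then rests entirely on the feature-based reward model, which shows why a trade-off weight between new and existing actions matters.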