Learning to recommend new items from old decisions

Ren Kishimoto, Tatsuhiro Shimizu, Kazuki Kawamura, Takanori Muroi, Yusuke Narita, Yuki Sasamoto, Kei Tateno, Takuma Udagawa, Yuta Saito

Real-world recommendation and search systems continuously introduce new items (such as articles and videos) after the logging policy has already collected its data. This creates a cold-start problem: existing off-policy learning methods cannot learn to select actions that have no historical feedback. This work introduces PONA (Policy Optimization for Effective New Actions), which combines a new policy gradient estimator (LCPI) that generalizes across action feature dimensions with a doubly robust component for learning from logged data. A tunable weight parameter balances the selection of new actions against the exploitation of existing actions known to perform well. Experiments show that PONA incorporates new actions efficiently while maintaining performance on existing ones, addressing a practical gap in recommendation and search systems.
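To make the cold-start issue concrete, the following is a minimal sketch of the standard doubly robust (DR) off-policy value estimator, not of PONA or LCPI themselves. The function name `dr_value`, the array layout, and the toy data are all illustrative assumptions. It shows that when a target policy puts its mass on an action the logging policy never chose, the importance-weighted correction vanishes and the estimate rests entirely on the reward model.

```python
import numpy as np

def dr_value(pi_target, actions, rewards, pscores, q_hat):
    """Standard doubly robust estimate of a target policy's value.

    pi_target: (n, k) target policy probabilities per context
    actions:   (n,)   indices of the logged actions
    rewards:   (n,)   observed rewards for the logged actions
    pscores:   (n,)   logging policy probabilities of the logged actions
    q_hat:     (n, k) reward model predictions (hypothetical model output)
    """
    idx = np.arange(len(actions))
    # Model-based term: expected predicted reward under the target policy.
    baseline = (pi_target * q_hat).sum(axis=1)
    # Importance-weighted correction on the logged actions only.
    weights = pi_target[idx, actions] / pscores
    correction = weights * (rewards - q_hat[idx, actions])
    return float(np.mean(baseline + correction))

# Toy log: actions 0 and 1 were logged; action 2 is "new" and never appears.
actions = np.array([0, 1, 0, 1])
rewards = np.array([1.0, 0.0, 1.0, 1.0])
pscores = np.full(4, 0.5)
q_hat = np.full((4, 3), 0.3)

# A policy that always picks the new action gets no signal from the log:
# every importance weight is zero, so the estimate is just the model's guess.
pi_new = np.zeros((4, 3))
pi_new[:, 2] = 1.0
value_new = dr_value(pi_new, actions, rewards, pscores, q_hat)

# A policy matching the logging policy recovers the mean logged reward.
pi_old = np.zeros((4, 3))
pi_old[:, :2] = 0.5
value_old = dr_value(pi_old, actions, rewards, pscores, q_hat)
```

In this toy setting the logged rewards never influence `value_new`, which is the gap that an estimator able to generalize across action features is meant to close.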