Testing AI agents by replaying real-world events chronologically

Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

Current AI benchmarks rarely test whether agents can adapt to genuinely new information arriving over time. FutureSim addresses this gap by replaying real news articles in chronological order and asking agents to predict the outcomes of questions that resolve between January and March 2026, entirely beyond their knowledge cutoffs. Evaluated in their native harnesses, frontier agents show stark performance gaps: the best reaches 25% accuracy, while several score worse on Brier skill score than a baseline that makes no prediction at all. The benchmark is designed to stress-test long-horizon adaptation, memory, search, and reasoning under uncertainty, and the authors present it as an ongoing evaluation framework for the research community.
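
To make the Brier skill score comparison concrete, here is a minimal sketch of the standard textbook definition for binary questions. It is not taken from the benchmark: the function names are hypothetical, and the constant 0.5 reference forecast is an assumed stand-in for the no-prediction baseline, whose exact definition is not given here. A score of 0 matches the reference, and a negative score is worse than it.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_probs: Sequence[float],
) -> float:
    """Standard skill score: 1 - BS / BS_ref (higher is better, 0 = reference)."""
    bs_ref = brier_score(reference_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score is undefined")
    return 1.0 - brier_score(probs, outcomes) / bs_ref


# Illustrative data only (not benchmark results).
outcomes = [1, 0, 0, 1]
# ASSUMPTION: an uninformative 0.5 forecast stands in for "no prediction".
reference = [0.5] * len(outcomes)

calibrated = [0.8, 0.2, 0.3, 0.7]      # leans the right way on every question
overconfident = [0.1, 0.9, 0.8, 0.2]   # confidently wrong on every question

bss_calibrated = brier_skill_score(calibrated, outcomes, reference)
bss_overconfident = brier_skill_score(overconfident, outcomes, reference)
print(f"calibrated: {bss_calibrated:.2f}, overconfident: {bss_overconfident:.2f}")
```

Under these assumptions the overconfident forecaster ends up with a negative skill score, which is how an agent can score worse than one that never commits to a prediction.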