stat.ML

How confident can you be in a bandit algorithm's performance?

Samya Praharaj, Chih-Yu Chang, Koulik Khamaru, Kelly W. Zhang

May 30, 2026

Bandit algorithms choose actions adaptively based on observed rewards, and this adaptive data collection breaks standard statistical confidence intervals. BSI fits a simulator of the bandit environment to the observed data, then uses it to estimate the mean reward under any policy while formally propagating uncertainty. It works under weak exploration assumptions and maintains nominal coverage where off-policy methods fail.
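To make the fit-then-simulate idea concrete, here is a minimal Python sketch. It is not the paper's BSI procedure and carries no coverage guarantee. It assumes Bernoulli rewards, models uncertainty about the environment with a Beta posterior under a uniform prior, and evaluates a hypothetical epsilon-greedy policy. All function names (`fit_simulator`, `simulate_policy`, `epsilon_greedy`, `simulated_interval`) are illustrative and do not come from the paper.

```python
import numpy as np

def fit_simulator(actions, rewards, n_arms):
    """Summarize logged data as per-arm success and pull counts (Bernoulli assumption)."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    counts = np.bincount(actions, minlength=n_arms)
    successes = np.bincount(actions, weights=rewards, minlength=n_arms)
    return successes, counts

def epsilon_greedy(eps):
    """Hypothetical target policy: explore with probability eps, else play the best arm so far."""
    def policy(counts, sums, rng):
        if counts.min() == 0 or rng.random() < eps:
            return int(rng.integers(len(counts)))
        return int(np.argmax(sums / counts))
    return policy

def simulate_policy(policy, arm_means, horizon, rng):
    """Run the policy in a simulated Bernoulli environment and return its average reward."""
    counts = np.zeros(len(arm_means))
    sums = np.zeros(len(arm_means))
    total = 0.0
    for _ in range(horizon):
        arm = policy(counts, sums, rng)
        reward = float(rng.random() < arm_means[arm])
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / horizon

def simulated_interval(successes, counts, policy, horizon, n_draws=200, alpha=0.05, seed=0):
    """Heuristic interval: redraw plausible environments, simulate the policy in each."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_draws):
        # Assumption: Beta(1 + successes, 1 + failures) captures environment uncertainty.
        arm_means = rng.beta(1 + successes, 1 + counts - successes)
        values.append(simulate_policy(policy, arm_means, horizon, rng))
    lower, upper = np.quantile(values, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(values)), float(lower), float(upper)

# Toy logged data from some earlier bandit run (made up for illustration).
logged_actions = [0, 0, 1, 1, 1, 0, 1, 1]
logged_rewards = [0, 1, 1, 1, 0, 0, 1, 1]
successes, counts = fit_simulator(logged_actions, logged_rewards, n_arms=2)
estimate, lower, upper = simulated_interval(successes, counts, epsilon_greedy(0.1), horizon=50)
```

The resulting interval mixes uncertainty about the environment with simulation noise, so it should be read as a rough illustration of propagating uncertainty through a fitted simulator rather than as a calibrated confidence interval.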
Published as "Bandit Simulation for Average Reward Inference" (arXiv:2606.00913).