A more stable way to rank which features actually matter in ML models

SHAP-based feature importance scores can swing dramatically depending on the train-test split or random seed, making model interpretation unreliable. RoSHAP models the full distribution of SHAP scores via bootstrap resampling and kernel density estimation, then collapses that distribution into a single ranking criterion that rewards features for being active, strong, and consistent. The authors prove the aggregated score is asymptotically Gaussian, which cuts the computational cost of distribution estimation. In simulations and real-data experiments, RoSHAP identifies true signal features better than single-run SHAP, and models built on RoSHAP-selected features match full-model predictive performance with substantially fewer predictors, which makes the method useful for practitioners doing feature selection in noisy settings.
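
To make the idea concrete, here is a minimal sketch of bootstrap-based importance ranking. It is not the authors' implementation, and several pieces are assumptions made for illustration: the model is plain least squares, SHAP values use the closed form for a linear model with independent features (coefficient times the feature's deviation from its mean), and the function names `linear_shap_importance`, `bootstrap_importance`, and `robust_score` are hypothetical. The scoring formula is also hypothetical; it only mimics the "active, strong, consistent" ingredients described above. The sketch summarizes each bootstrap distribution by its mean and standard deviation rather than fitting a kernel density estimate.

```python
import numpy as np


def linear_shap_importance(X_train, y_train, X_eval):
    """Mean |SHAP| per feature for a least-squares fit (independent features assumed)."""
    # Fit ordinary least squares with an intercept column, then drop the intercept.
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef = np.linalg.lstsq(A, y_train, rcond=None)[0][1:]
    # Closed-form SHAP value for a linear model with independent features:
    # coef_j * (x_j - mean_j), with the training mean as the background.
    shap_values = coef * (X_eval - X_train.mean(axis=0))
    return np.abs(shap_values).mean(axis=0)


def bootstrap_importance(X, y, n_boot=200, seed=0):
    """Return an (n_boot, n_features) array of importance scores."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        scores[b] = linear_shap_importance(X[idx], y[idx], X)
    return scores


def robust_score(scores, eps=1e-12):
    """Hypothetical criterion, NOT the paper's formula."""
    mean = scores.mean(axis=0)                      # strength
    std = scores.std(axis=0, ddof=1)                # spread across replicates
    # "Active": share of replicates where the feature beats the median feature.
    active = (scores > np.median(scores, axis=1, keepdims=True)).mean(axis=0)
    # Consistency enters by penalizing the coefficient of variation.
    return active * mean / (1.0 + std / (mean + eps))


if __name__ == "__main__":
    # Synthetic data: only the first two of six features carry signal.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(300, 6))
    y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=300)

    scores = bootstrap_importance(X, y)
    ranking = np.argsort(-robust_score(scores))  # best feature first
    print("feature ranking:", ranking)
```

Swapping in a different model or a SHAP library would change only `linear_shap_importance`; the resampling loop and the aggregation step stay the same.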