Why cross-validation ensembles aren't what they seem for uncertainty
Kirscher Tristan, Bujotzek Markus, Kirchhoff Yannick, Rokuss Maximilian, Isensee Fabian, Kahl Kim-Celine, Kovacs Balint, Maier-Hein Klaus
May 18, 2026
When estimating segmentation uncertainty in medical images, researchers often use disagreement across ensemble members as the signal. The catch: calling a 5-fold cross-validation ensemble a "deep ensemble" conflates two different sources of disagreement. Cross-validation members differ in both their training subset and their random seed, so their disagreement mixes data-subset effects with seed variability, while true deep ensembles (same training data, different seeds) isolate seed variability alone. On three datasets across three imaging modalities, deep ensembles are better at detecting failures and calibrating confidence, while cross-validation ensemble disagreement correlates more strongly with inter-rater disagreement. The takeaway: pick the ensemble type by your goal. Use deep ensembles for reliability and cross-validation ensembles for modeling annotation ambiguity.
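To make the distinction concrete, here is a minimal Python sketch of how the two ensemble types assign training data to members, and of one simple disagreement measure. The function names (`deep_ensemble_plan`, `cv_ensemble_plan`, `disagreement_map`), the choice of five members, and the use of per-pixel variance as the uncertainty signal are illustrative assumptions; the summary does not state which disagreement measure or training setup the paper uses.

```python
import numpy as np


def deep_ensemble_plan(n_cases, n_members=5):
    """Deep ensemble: every member trains on ALL cases; only the seed differs."""
    all_cases = np.arange(n_cases)
    return [{"seed": s, "train_idx": all_cases.copy()} for s in range(n_members)]


def cv_ensemble_plan(n_cases, n_folds=5, split_seed=0):
    """Cross-validation ensemble: member k trains on every fold except fold k.

    Members differ in training subset AND seed, so the two effects are mixed.
    """
    rng = np.random.default_rng(split_seed)
    folds = np.array_split(rng.permutation(n_cases), n_folds)
    plans = []
    for k in range(n_folds):
        kept = [fold for j, fold in enumerate(folds) if j != k]
        plans.append({"seed": k, "train_idx": np.sort(np.concatenate(kept))})
    return plans


def disagreement_map(member_probs):
    """Per-pixel variance of foreground probability across members.

    member_probs has shape (n_members, H, W). Variance is an assumed,
    illustrative measure, not necessarily the one used in the paper.
    """
    return np.var(np.asarray(member_probs, dtype=float), axis=0)


if __name__ == "__main__":
    deep = deep_ensemble_plan(n_cases=100)
    cv = cv_ensemble_plan(n_cases=100)
    print("deep ensemble train sizes:", [len(p["train_idx"]) for p in deep])
    print("CV ensemble train sizes:  ", [len(p["train_idx"]) for p in cv])
```

Under this setup, any disagreement among deep-ensemble members can only come from seed-dependent randomness (initialization, data ordering, augmentation), whereas cross-validation members also see different cases, which is the confound the summary describes.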