Why healthcare AI benchmarks fail in the real world

Healthcare AI benchmarks often fail to predict real-world performance because they rest on hidden assumptions about how clinicians will interact with models, and no dataset can reveal those gaps. The authors separate these assumptions into two kinds: task-level assumptions, which can be tested from conversations, and outcome-level assumptions, which require behavioral studies. In their case study, the two kinds contribute roughly equally to the evaluation-deployment gap. They propose BenchmarkCards to document a benchmark's assumptions, and staged evaluation to test those assumptions systematically before deployment.
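
To make the idea of documenting assumptions and testing them in stages more concrete, here is a minimal Python sketch. It is not the authors' BenchmarkCards format: the class names, fields, stage labels, and the two example assumptions are all hypothetical, and the ordering (task-level checks before behavioral studies) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    TASK = "task"        # testable from conversations
    OUTCOME = "outcome"  # requires behavioral studies


@dataclass
class Assumption:
    text: str
    level: Level
    tested: bool = False


@dataclass
class BenchmarkCard:
    """Hypothetical card listing the assumptions a benchmark relies on."""
    benchmark: str
    assumptions: list[Assumption] = field(default_factory=list)

    def untested(self, level: Level) -> list[Assumption]:
        # Assumptions at the given level that have not been checked yet.
        return [a for a in self.assumptions
                if a.level is level and not a.tested]


def next_stage(card: BenchmarkCard) -> str:
    """Return the earliest stage that still has untested assumptions.

    Assumed ordering: task-level checks first, then behavioral studies.
    """
    if card.untested(Level.TASK):
        return "task-level review"
    if card.untested(Level.OUTCOME):
        return "behavioral study"
    return "ready for deployment review"


# Illustrative, invented assumptions for a made-up benchmark.
card = BenchmarkCard(
    benchmark="hypothetical-triage-benchmark",
    assumptions=[
        Assumption("Clinicians ask complete, well-formed questions", Level.TASK),
        Assumption("Clinicians act on the model's advice", Level.OUTCOME),
    ],
)
print(next_stage(card))
```

In this sketch a benchmark only moves to the next stage once every assumption at the current level has been marked as tested, which mirrors the idea of checking assumptions before deployment rather than discovering them afterwards.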