Can foundation models help predict when models fail on new data?

When models encounter data different from training, predicting their performance without labels is hard—existing methods rely only on the failing model itself. FRAP combines predictions from a foundation model and the target model, aligning them via temperature scaling and weighting by confidence to create a better performance proxy. Tests across multiple datasets and architectures show consistent, substantial improvements over baseline estimation methods.