cs.LG

Can one model handle images, audio, text, and molecules equally well?

Julien Lafrance

June 1, 2026

An Equiangular Tight Frame (ETF) preprocessing step, paired with a tabular foundation model, can classify data across very different modalities (images, audio, speech, text, molecules, time-series, and tables) without retraining. On frozen features it stays competitive with specialized models while running orders of magnitude faster than backbone fine-tuning. The practical contribution is guidance on how to calibrate confidence scores so that practitioners can automatically reject uncertain predictions.
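To make the idea of automatically rejecting uncertain predictions concrete, here is a minimal sketch of confidence-threshold rejection. It is not the paper's calibration procedure: the function names `predict_or_reject` and `coverage`, the threshold of 0.8, and the example probabilities are all hypothetical, and the sketch assumes the class probabilities have already been calibrated by some other means.

```python
from typing import List, Optional, Sequence


def predict_or_reject(
    probs: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[Optional[int]]:
    """Return the most likely class per row, or None when confidence is too low.

    `probs` holds one row of class probabilities per example (assumed calibrated).
    `threshold` is a hypothetical cutoff; in practice it would be tuned on held-out data.
    """
    decisions: List[Optional[int]] = []
    for row in probs:
        # Index of the highest-probability class for this example.
        top_class = max(range(len(row)), key=lambda k: row[k])
        # Keep the prediction only if its probability clears the threshold.
        decisions.append(top_class if row[top_class] >= threshold else None)
    return decisions


def coverage(decisions: Sequence[Optional[int]]) -> float:
    """Fraction of examples for which a prediction was kept (not rejected)."""
    return sum(d is not None for d in decisions) / len(decisions)


# Illustrative probabilities, not results from the paper.
probs = [
    [0.95, 0.03, 0.02],  # confident: kept
    [0.40, 0.35, 0.25],  # uncertain: rejected
    [0.10, 0.85, 0.05],  # confident: kept
]
decisions = predict_or_reject(probs, threshold=0.8)
print(decisions, coverage(decisions))
```

Raising the threshold trades coverage for accuracy on the predictions that are kept, which is only meaningful when the confidence scores are well calibrated.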
Published as "When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets, 7 Modalities, and Two Regimes" (arXiv:2606.02106).