Picking the right training data for language models on a budget

Siqi Zeng, Christopher Jung, Rui Li, Zhe Kang, Ming Li, Nima Noorshams, Zhigang Wang, Fuchun Peng, Han Zhao, Xue Feng

When fine-tuning language models on downstream tasks, developers must choose which auxiliary datasets to use under budget constraints. Existing gradient alignment scores measure how much a dataset helps the target task but ignore redundancy: when datasets carry overlapping information, selecting them together wastes budget. This work frames dataset valuation as a subset selection problem and proposes a kernel mean matching approach in gradient space that jointly optimizes for task alignment and dataset diversity. Experiments across multiple post-training settings show consistent improvements over baseline methods with minimal computational overhead. Code is released.
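
To make the idea concrete, here is a minimal sketch of subset selection by kernel mean matching in gradient space. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify the kernel, the optimizer, or how gradients are summarized, so the cosine kernel, the greedy search, the single gradient vector per dataset, and the names `cosine_kernel`, `kmm_objective`, and `greedy_select` are all hypothetical. The objective is the squared distance between the kernel mean of the selected datasets and the target embedding, with the constant target term dropped; its first term penalizes redundancy inside the subset and its second term rewards alignment with the target task.

```python
import numpy as np


def cosine_kernel(A, B):
    """Cosine similarity between rows of A and rows of B (assumed kernel)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T


def kmm_objective(K_cc, k_ct, subset):
    """Kernel mean matching loss for a subset, up to a constant.

    K_cc: candidate-by-candidate kernel matrix.
    k_ct: kernel values between each candidate and the target.
    """
    idx = np.asarray(subset)
    m = len(idx)
    redundancy = K_cc[np.ix_(idx, idx)].sum() / m**2  # pairwise overlap
    alignment = k_ct[idx].sum() / m                   # similarity to target
    return redundancy - 2.0 * alignment


def greedy_select(cand_grads, target_grad, budget):
    """Greedily add the dataset that most reduces the matching loss."""
    K_cc = cosine_kernel(cand_grads, cand_grads)
    k_ct = cosine_kernel(cand_grads, target_grad[None, :])[:, 0]
    selected, remaining = [], list(range(len(cand_grads)))
    for _ in range(budget):
        best = min(remaining,
                   key=lambda j: kmm_objective(K_cc, k_ct, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): datasets 0 and 1 are duplicates that align best
# with the target; dataset 2 aligns slightly less but is complementary.
cand_grads = np.array([[1.0, 0.2],
                       [1.0, 0.2],
                       [0.1, 1.0]])
target_grad = np.array([1.0, 1.0])

selected = greedy_select(cand_grads, target_grad, budget=2)
print(selected)
```

In this toy case an alignment-only ranking would spend the budget of two on the duplicated pair, whereas the matching objective picks one of the duplicates together with the complementary dataset, because the pairwise term makes the second copy expensive.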