Can we synthesize domain data without describing the domain?

Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

Fine-tuning LLMs on domain-specific data works well, but collecting that data is hard, especially when the domain itself resists neat description. DOMINO learns what characterizes a domain by studying reference examples, then generates synthetic data that matches those characteristics. It uses contrastive learning to extract the domain's essence while discarding sample-specific noise. Result: coding models trained on DOMINO-generated data beat instruction-tuned baselines by up to 4.63%.