cs.AI

Can we synthesize domain data without describing the domain?

Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

May 28, 2026

Fine-tuning LLMs on domain-specific data works well, but collecting that data is hard, especially when the domain itself resists neat description. DOMINO learns what characterizes a domain by studying reference examples, then generates synthetic data that matches that pattern. It uses contrastive learning to extract what the examples share as a domain while discarding noise specific to individual samples. Result: coding models trained on DOMINO-generated data beat instruction-tuned baselines by up to 4.63%.
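To make the contrastive-learning idea concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is not DOMINO's actual objective, which this summary does not specify. The function name `info_nce`, the `temperature` value, and the toy vectors are illustrative assumptions. The loss is low when an anchor embedding sits closer to a same-domain (positive) example than to out-of-domain (negative) ones.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a nonzero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (illustrative, not the paper's loss).

    anchor, positive: embeddings assumed to come from the same domain.
    negatives: embeddings assumed to come from other domains.
    temperature: hypothetical default; smaller values sharpen the contrast.
    """
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(a, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among the candidates.
    return log_sum - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    same_domain = [0.9, 0.1]
    other_domains = [[0.0, 1.0], [-1.0, 0.0]]
    print(info_nce(anchor, same_domain, other_domains))
```

Minimizing a loss of this shape pulls same-domain embeddings together and pushes others apart, which is one common way to keep shared structure and suppress per-sample variation.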
Published as "Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning" (arXiv:2605.30039)