Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang
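AI summary: Proposes the DOMINO framework, which uses contrastive disentanglement to learn a minimal sufficient domain representation that guides the generation of domain-aligned synthetic data, improving fine-tuning performance when the domain is defined only implicitly.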
Details
- Comments: Accepted by KDD 2026
Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.
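The abstract reports results as Pass@1 but does not say how the metric is computed. As a rough illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation; whether the paper uses this exact estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generated samples, c: how many of them pass all tests.
    Assumption: this is the commonly used unbiased estimator, not
    necessarily the one used in the paper.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random subset of
    # k samples contains no passing sample; math.comb returns 0 when
    # k > n - c, so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each entry is an (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so a benchmark-level Pass@1 is the mean of that fraction across problems.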