Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery
Tyler H. McCormick
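AI summary: This paper argues that modern machine learning is generically underdetermined in high-dimensional proxy regimes, and it proposes concrete standards for "mechanistic ML" so that LLM-centered workflows genuinely support science rather than simulate it.
Details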
- Comments: Will appear as a position paper in ICML
Modern machine learning (ML) and artificial intelligence (AI) models, especially large language models (LLMs), are increasingly used to generate scientific hypotheses and mechanistic explanations from observational data. This position paper argues that in the high-dimensional proxy regimes where modern ML excels, mechanistic learning is generically underdetermined: many incompatible mechanisms induce essentially the same observational relationships on the support of the data, so predictive success and coherent explanations are insufficient evidence of mechanism discovery. This underdetermination becomes uniquely hazardous with LLMs, which tend to collapse large equivalence classes of explanations into a single fluent narrative. The paper proposes concrete standards for "mechanistic ML" and argues that these norms are necessary if LLM-centered workflows are to support science rather than merely simulate it.