Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
Nhat-Minh Nguyen
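AI summary: Through a case study of a physicist supervising an AI coding agent as it develops a differentiable perturbation theory module, this work examines the reliability of AI agents in scientific software development and finds that supervision design, more than model capability, determines whether the output can be trusted.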
Details
- Comments
- 10 pages, 2 figures, 2 tables, 1 physicist and a few AI agents. Accepted by ICML 2026 AI for Science Workshop. Code and development log are available at this repo: https://github.com/MinhMPA/clax-pt
Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervised an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented 15 supervision events and classified them by intervention level. The agent resolved ten autonomously by iterating against oracle tests, and two more were resolved with the physicist's domain knowledge. The three it could not resolve, all of which evaded oracle detection, share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and it could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory and predicted wrong values at any other cosmology. This fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and that distinguish predictive adequacy from explanatory correctness. Neither capability was exhibited here, and neither is obviously addressed by scaling alone. [Abridged.]
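The first supervision practice (testing away from the fiducial calibration) can be illustrated with a toy sketch. Everything below is hypothetical: `oracle`, `model_unpatched`, `FUDGE`, and the parameter values are stand-ins invented for illustration and are not taken from CLAX-PT, CLASS-PT, or the paper's test suite. The sketch only shows why a correction calibrated at one parameter point can pass an oracle test there and still fail elsewhere.

```python
# Toy illustration only: none of these functions come from CLAX-PT or CLASS-PT.

FIDUCIAL = 0.3  # hypothetical fiducial value of a single cosmological parameter


def oracle(omega_m):
    """Stand-in for a trusted reference code (hypothetical functional form)."""
    return omega_m ** 0.55


def model_unpatched(omega_m):
    """Stand-in for an implementation with the wrong parameter dependence."""
    return omega_m ** 0.5


# A fudge factor calibrated so the patched model matches the oracle
# at the fiducial point, and only there.
FUDGE = oracle(FIDUCIAL) / model_unpatched(FIDUCIAL)


def model_patched(omega_m):
    """Unpatched model multiplied by the calibrated constant."""
    return FUDGE * model_unpatched(omega_m)


def rel_error(omega_m):
    """Relative error of the patched model against the oracle."""
    return abs(model_patched(omega_m) / oracle(omega_m) - 1.0)


def passes(points, tol=1e-3):
    """True if the patched model agrees with the oracle at every point."""
    return all(rel_error(p) < tol for p in points)


fiducial_only = passes([FIDUCIAL])   # oracle test at the calibration point
diverse = passes([0.2, 0.3, 0.4])    # same test at several parameter points
print(fiducial_only, diverse)
```

In this toy setup the fiducial-only check succeeds while the multi-point check fails, because a constant factor cannot repair a wrong functional dependence on the parameter.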