arXivDaily: arXiv daily academic digest, updated Monday to Friday
2511.10619 2026-05-22 cs.LG stat.ML

Algorithm Design and Stronger Guarantees for the Improving Multi-Armed Bandits Problem

Avrim Blum, Marten Garicano, Kavya Ravichandran, Dravyansh Sharma

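AI summary: This paper proposes two new parameterized families of bandit algorithms, bounds the sample complexity of learning a near-optimal algorithm from each family using offline data, and evaluates them empirically on standard hyperparameter tuning benchmarks. The first family includes the optimal randomized algorithm from prior work and achieves stronger guarantees when the arm reward curves satisfy additional concavity-related properties. The second family guarantees best-arm identification on well-behaved instances and reverts to worst-case guarantees on poorly behaved ones.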

Comments 36 pages

Abstract

The improving multi-armed bandits problem is a formal model for allocating effort under uncertainty, motivated by scenarios such as investing research effort into new technologies, performing clinical trials, and hyperparameter selection from learning curves. Each pull of an arm provides reward that increases monotonically with diminishing returns. A growing line of work has designed algorithms for improving bandits, albeit with somewhat pessimistic worst-case guarantees. Indeed, strong lower bounds of $\Omega(k)$ and $\Omega(\sqrt{k})$ on the multiplicative approximation factor relative to the optimal arm are known for deterministic and randomized algorithms, respectively, where $k$ is the number of bandit arms. In this work, we propose two new parameterized families of bandit algorithms and bound the sample complexity of learning the near-optimal algorithm from each family using offline data. We also perform empirical evaluations on standard hyperparameter tuning benchmarks. The first family we define includes the optimal randomized algorithm from prior work. We show that an appropriately chosen algorithm from this family can achieve stronger guarantees, with optimal dependence on $k$, when the arm reward curves satisfy additional properties related to the strength of concavity. Our second family contains algorithms that both guarantee best-arm identification on well-behaved instances and revert to worst-case guarantees on poorly behaved instances.
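
The reward model in this abstract (each pull pays more than the last, with shrinking gains) can be made concrete with a toy instance. The Python sketch below is illustrative only: the saturating-exponential curve, the arm parameters, and the round-robin baseline are assumptions chosen for the example, not the algorithm families proposed in the paper.

```python
import math

def pull_reward(scale, rate, t):
    """Reward of the t-th pull (t = 1, 2, ...) of one arm.

    Hypothetical curve: scale * (1 - exp(-rate * t)) is increasing and
    concave in t, i.e. monotone with diminishing returns.
    """
    return scale * (1.0 - math.exp(-rate * t))

def cumulative_reward(arm, pulls):
    """Total reward from giving `pulls` consecutive pulls to a single arm."""
    scale, rate = arm
    return sum(pull_reward(scale, rate, t) for t in range(1, pulls + 1))

def best_single_arm(arms, horizon):
    """Benchmark: best total from committing to one arm for the whole horizon."""
    return max(cumulative_reward(arm, horizon) for arm in arms)

def round_robin(arms, horizon):
    """Naive baseline: split the horizon evenly across the k arms."""
    share = horizon // len(arms)
    return sum(cumulative_reward(arm, share) for arm in arms)

# Three hypothetical arms as (scale, rate); the slow starter wins in the long run.
ARMS = [(1.0, 0.50), (2.0, 0.05), (0.8, 1.00)]
HORIZON = 90

opt = best_single_arm(ARMS, HORIZON)
rr = round_robin(ARMS, HORIZON)
print(f"best single arm: {opt:.2f}, round robin: {rr:.2f}, ratio: {opt / rr:.2f}")
```

Because every curve is increasing, an even split can never beat committing to the best arm; the gap between the two totals is the kind of multiplicative factor the lower bounds in the abstract refer to.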

2511.04838 2026-05-22 cs.LG math.SP q-bio.MN

SPECTRA: Spectral Domain-Aware Graph Generation for Imbalanced Molecular Property Regression

Brenda Nogueira, Gisela A. Gonzalez-Montiel, Meng Jiang, Nitesh V. Chawla, Nuno Moniz

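AI summary: This paper proposes SPECTRA, which combines a rarity-aware budgeting scheme, target-neighbor graph alignment, and interpolation of Laplacian spectra to improve prediction of relevant but data-scarce molecular property values. It is competitive with state-of-the-art methods in relevant target ranges while requiring roughly 4x less computation time.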

Abstract

Molecular property regression struggles with cases in chemically relevant target ranges that are underrepresented in datasets. Standard average error minimization approaches underperform in these highly relevant cases, and oversampling approaches lead to meaningless molecular representations. In this paper, we propose SPECTRA, a spectral, domain-aware graph generation method designed to improve the prediction of underrepresented but relevant molecular property values. It combines a rarity-aware budgeting scheme to focus generation where data are scarce, target-neighbors graph alignment to establish structural correspondence, and interpolation of Laplacian spectra, node features, and targets. Coupled with a spectral GNN using edge-aware Chebyshev convolutions, SPECTRA proves effective on property prediction benchmarks, performing competitively with leading state-of-the-art methods in relevant target ranges while requiring ~4x less computational time.

2511.02043 2026-05-22 cs.LG cs.PF

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Angélica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali

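AI summary: This paper presents Flashlight, a PyTorch-based compiler framework that automatically generates fused, FlashAttention-style kernels for arbitrary attention programs without static templates or predefined kernel specializations, offering flexibility without sacrificing performance.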

Abstract

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
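
The tiling idea that the abstract attributes to FlashAttention can be illustrated with the well-known online-softmax trick: scores are processed one tile at a time while a running maximum and normalizer keep the result exact. The following pure-Python sketch handles a single query vector and is only an illustration of that general technique; it is not Flashlight's compiler pass, and all function names and data are made up for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def naive_attention(q, keys, values):
    """Reference: softmax(q . K) weighted sum of V, materializing all scores."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    """Same result, visiting keys/values one tile at a time (online softmax)."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    normalizer = 0.0              # sum of exp(score - running_max)
    acc = [0.0] * dim             # unnormalized weighted sum of values
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first tile
        normalizer *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]

# Hypothetical toy inputs: one query, five key/value pairs.
Q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

print(naive_attention(Q, K, V))
print(tiled_attention(Q, K, V, tile=2))
```

The tiled version never holds more than one tile of scores at once, which is what makes fusing the whole computation into a single kernel attractive.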

2510.20814 2026-05-22 cs.CV

SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution

Ritik Shah, Marco F Duarte

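AI summary: This work proposes SpectraMorph, a physics-guided self-supervised fusion framework that achieves hyperspectral super-resolution through a structured latent space. It fuses multispectral and hyperspectral images, produces interpretable intermediates, trains in under a minute, and stays robust even with a single-band multispectral image.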

Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract

Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep-learning-based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (panchromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
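
The unmixing bottleneck described above (spectra rebuilt by linear mixing, then checked against the MSI through the spectral response function) can be sketched in a few lines. This is a hedged toy in pure Python: the endmember values, the SRF matrix, and the squared-error loss are assumptions made for illustration, and the abundances are fixed by hand rather than predicted by an MLP.

```python
def mix(endmembers, abundances):
    """Linear mixing: pixel spectrum = sum_j abundances[j] * endmembers[j]."""
    bands = len(endmembers[0])
    return [sum(a * e[b] for a, e in zip(abundances, endmembers))
            for b in range(bands)]

def apply_srf(srf, spectrum):
    """Project a hyperspectral spectrum onto multispectral bands via the SRF."""
    return [sum(w * s for w, s in zip(row, spectrum)) for row in srf]

def self_supervised_loss(endmembers, srf, abundances, observed_msi):
    """Squared error between the re-projected spectrum and the observed MSI pixel.

    The loss form is an assumption; the abstract only says training goes
    through the MSI sensor's spectral response function.
    """
    projected = apply_srf(srf, mix(endmembers, abundances))
    return sum((p - o) ** 2 for p, o in zip(projected, observed_msi))

# Hypothetical numbers: 2 endmembers over 4 hyperspectral bands.
ENDMEMBERS = [[0.1, 0.2, 0.6, 0.8],
              [0.5, 0.4, 0.3, 0.1]]
# Hypothetical SRF: 2 multispectral bands, each averaging two hyperspectral bands.
SRF = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]

true_abundances = [0.25, 0.75]
spectrum = mix(ENDMEMBERS, true_abundances)
msi_pixel = apply_srf(SRF, spectrum)

print(self_supervised_loss(ENDMEMBERS, SRF, true_abundances, msi_pixel))
print(self_supervised_loss(ENDMEMBERS, SRF, [0.5, 0.5], msi_pixel))
```

No high-resolution hyperspectral ground truth appears in the loss, which is what makes the setup self-supervised.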

2510.08759 2026-05-22 cs.CV cs.RO

Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis

Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Yizhe Zhu, Shiji Xin, Yijian Huang, Boce Hu, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Haojie Huang, Lawson L. S. Wong

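AI summary: This paper introduces the BEAR benchmark, which decomposes embodied tasks into 14 atomic skills for fine-grained evaluation, finds that perceptual capabilities are a major bottleneck behind reasoning failures, and proposes BEAR-Agent, a multimodal conversational agent that substantially improves embodied skill performance.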

Comments Accepted to ICML 2026

Abstract

Understanding the capability bottlenecks of embodied multimodal large language models (MLLMs) is crucial for improving embodied agents. However, existing embodied benchmarks mainly focus on task-level evaluation and fail to provide actionable insights into the underlying causes of model failures. To address this limitation, we introduce BEAR, a benchmark that decomposes embodied tasks into 14 atomic skills for fine-grained skill-level evaluation. BEAR comprises 4,469 interleaved image-video-text samples spanning 14 skills across 6 categories, ranging from low-level perception to high-level planning. We evaluate 20 MLLMs on BEAR under a hierarchical skill-level diagnosis framework and uncover two key findings: (1) perceptual capabilities are major bottlenecks behind reasoning failures, and (2) current models suffer from unstable spatiotemporal modeling that remains largely unexposed in prior benchmarks. Motivated by these findings, we further propose BEAR-Agent, a multimodal conversational agent that augments MLLMs with visual and spatial reasoning tools. BEAR-Agent substantially improves performance across embodied skills, achieving a relative improvement of 17.5% on GPT-5 over the base model on BEAR, while also outperforming strong baselines in both simulation and real-world robotic experiments. Project page: https://bear-official66.github.io/

2510.07962 2026-05-22 cs.CL cs.AI

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

Jingyuan Wang, Yankai Chen, Zhonghang Li, Chao Huang

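AI summary: This paper proposes LightReasoner, a framework that uses the behavioral divergence between a stronger expert model and a weaker amateur model to locate high-value reasoning moments, improving LLM reasoning while reducing resource consumption.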

Comments Updated to ACL 2026 camera-ready version with improved method presentation, expanded related work discussion, additional analyses, and presentation refinements

Abstract

Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner
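
The sampling stage described above looks for steps where the expert and the amateur disagree. One plausible way to score that disagreement is the KL divergence between their next-token distributions, keeping only steps above a threshold. The sketch below assumes exactly that; the abstract does not specify the divergence measure or the selection rule, and the distributions and threshold here are invented for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def critical_steps(expert_dists, amateur_dists, threshold):
    """Indices of steps where the expert diverges strongly from the amateur."""
    return [t for t, (p, q) in enumerate(zip(expert_dists, amateur_dists))
            if kl_divergence(p, q) > threshold]

# Hypothetical next-token distributions over a 3-token vocabulary, one per step.
EXPERT = [[0.70, 0.20, 0.10], [0.90, 0.05, 0.05], [0.34, 0.33, 0.33]]
AMATEUR = [[0.65, 0.25, 0.10], [0.20, 0.40, 0.40], [0.33, 0.34, 0.33]]

selected = critical_steps(EXPERT, AMATEUR, threshold=0.1)
print(selected)
```

Only the middle step is selected: the two models nearly agree elsewhere, so those tokens would carry little training signal under this criterion.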

2510.04280 2026-05-22 cs.LG cs.AI cs.RO

A KL-regularization Framework for Learning to Plan with Adaptive Priors

Álvaro Serra-Gomez, Daniel Jarne Ornia, Dhruva Tirumala, Thomas Moerland

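AI summary: This paper proposes a KL-regularized framework for learning to plan that integrates the planner's action distribution as a prior in policy optimization, improving the sample efficiency and long-term performance of model-based reinforcement learning on high-dimensional continuous control tasks.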

Comments Published at ICML 2026

Abstract

Effective exploration remains a central challenge in model-based reinforcement learning (MBRL), particularly in high-dimensional continuous control tasks where sample efficiency is crucial. A prominent line of recent work leverages learned policies as proposal distributions for Model-Predictive Path Integral (MPPI) planning. Initial approaches update the sampling policy independently of the planner distribution, typically maximizing a learned value function with deterministic policy gradient and entropy regularization. However, because the states encountered during training depend on the MPPI planner, aligning the sampling policy with the planner improves the accuracy of value estimation and long-term performance. To this end, recent methods update the sampling policy by minimizing KL divergence to the planner distribution or by introducing planner-guided regularization into the policy update. In this work, we unify these MPPI-based reinforcement learning methods under a single framework by introducing Policy Optimization-Model Predictive Control (PO-MPC), a family of KL-regularized MBRL methods that integrate the planner's action distribution as a prior in policy optimization. By aligning the learned policy with the planner's behavior, PO-MPC allows more flexibility in the policy updates to trade off return maximization and KL divergence minimization. We clarify how prior approaches emerge as special cases of this family, and we explore previously unstudied variations. Our experiments show that these extended configurations yield significant performance improvements, advancing the state of the art in MPPI-based RL.
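
The trade-off between return maximization and KL divergence to a planner prior has a simple closed form in a discrete-action toy setting: the maximizer of E_pi[Q] - temperature * KL(pi || prior) is proportional to prior(a) * exp(Q(a) / temperature). The Python sketch below illustrates only that standard result. The paper targets continuous control, and its exact objective and KL direction may differ; the Q-values and prior here are hypothetical.

```python
import math

def kl_regularized_policy(q_values, prior, temperature):
    """Maximizer of E_pi[Q] - temperature * KL(pi || prior) over discrete actions."""
    top = max(q_values)  # subtract the max for numerical stability
    unnorm = [p * math.exp((q - top) / temperature)
              for q, p in zip(q_values, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def objective(policy, q_values, prior, temperature):
    """Expected Q minus the temperature-weighted KL to the prior."""
    expected_q = sum(pi * q for pi, q in zip(policy, q_values))
    kl = sum(pi * math.log(pi / p) for pi, p in zip(policy, prior) if pi > 0.0)
    return expected_q - temperature * kl

# Hypothetical learned Q-values and planner action distribution (the prior).
Q_VALUES = [1.0, 2.0, 0.5]
PLANNER_PRIOR = [0.6, 0.3, 0.1]

for temperature in (0.05, 1.0, 100.0):
    policy = kl_regularized_policy(Q_VALUES, PLANNER_PRIOR, temperature)
    print(temperature, [round(p, 3) for p in policy])
```

A small temperature yields a nearly greedy policy on Q, while a large one stays close to the planner; intermediate values interpolate between the two.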

2509.20912 2026-05-22 cs.AI

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

Tianrun Xu, Haoda Jing, Ye Li, Yuquan Wei, Jun Feng, Guanyu Chen, Haichuan Gao, Tianren Zhang, Feng Chen

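AI summary: This paper proposes DeFacto, a framework that integrates three training paradigms (positive, counterfactual, and random-masking) to improve evidence-answer consistency in multimodal language models, and introduces the DeFacto-1.5K benchmark for systematic evaluation.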

Abstract

Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.

2509.17086 2026-05-22 cs.CV

SFN-YOLO: Towards Free-Range Poultry Detection via Scale-aware Fusion Networks

Jie Chen, Yuhong Feng, Tao Dai, Hao Wang, Hongtao Chen, Zhaoxi He, Mingzhe Liu, Jiancong Bai

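AI summary: This paper proposes SFN-YOLO, a poultry detection method that uses scale-aware fusion to improve detection in complex environments, and introduces M-SCOPE, a dataset tailored to free-range conditions. Experiments show the model reaches 80.7% mAP with only 7.2M parameters, 35.1% fewer than the benchmark model, while retaining strong generalization.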

Abstract

Detecting and localizing poultry is essential for advancing smart poultry farming. Despite the progress of detection-centric methods, challenges persist in free-range settings due to multiscale targets, obstructions, and complex or dynamic backgrounds. To tackle these challenges, we introduce an innovative poultry detection approach named SFN-YOLO that utilizes scale-aware fusion. This approach combines detailed local features with broader global context to improve detection in intricate environments. Furthermore, we have developed a new expansive dataset (M-SCOPE) tailored for varied free-range conditions. Comprehensive experiments demonstrate our model achieves an mAP of 80.7% with just 7.2M parameters, which is 35.1% fewer than the benchmark, while retaining strong generalization capability across different domains. The efficient and real-time detection capabilities of SFN-YOLO support automated smart poultry farming.
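
The parameter claim in this abstract implies a baseline size that is not stated directly. Assuming "35.1% fewer" means the model has (1 - 0.351) times the baseline's parameter count, the short Python check below backs out that figure; the function name is made up for this example.

```python
def baseline_params(model_params_m, fraction_fewer):
    """Baseline size (in millions) implied by 'fraction_fewer fewer parameters'."""
    return model_params_m / (1.0 - fraction_fewer)

implied = baseline_params(7.2, 0.351)
print(f"implied baseline: {implied:.2f}M parameters")
```

Under that reading, the benchmark model has roughly 11.1M parameters.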

2509.09088 2026-05-22 cs.LG math.DG math.DS

An entropy formula for the Deep Linear Network

Govind Menon, Tianmin Yu

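AI summary: This paper studies the Riemannian geometry of the deep linear network as a foundation for a thermodynamic description of learning. It analyzes overparametrization through group actions and uses Riemannian submersion from parameter space to the space of observables to define and compute a Boltzmann entropy. The main technical step is an explicit construction, using the theory of Jacobi matrices, of an orthonormal basis for the tangent space of the balanced manifold.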

Comments Final version of accepted paper in SIAM Journal on Mathematical Analysis. Includes fixes of minor typos (especially equations (3.13), (6.35), and (6.36))

Abstract

We study the Riemannian geometry of the Deep Linear Network (DLN) as a foundation for a thermodynamic description of the learning process. The main tools are the use of group actions to analyze overparametrization and the use of Riemannian submersion from the space of parameters to the space of observables. The foliation of the balanced manifold in the parameter space by group orbits is used to define and compute a Boltzmann entropy. We also show that the Riemannian geometry on the space of observables defined in [2] is obtained by Riemannian submersion of the balanced manifold. The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.

2509.06503 2026-05-22 cs.AI q-bio.QM

An AI system to help scientists write expert-level empirical software

Eser Aygün, Anastasiya Belyaeva, Gheorghe Comanici, Marc Coram, Hao Cui, Jake Garrison, Renee Johnston, Anton Kast, Cory Y. McLean, Peter Norgaard, Zahra Shamsi, David Smalling, James Thompson, Subhashini Venugopalan, Brian P. Williams, Chujun He, Sarah Martinson, Martyna Plomecka, Lai Wei, Yuchen Zhou, Qian-Ze Zhu, Matthew Abraham, Erica Brand, Anna Bulanova, Jeffrey A. Cardille, Chris Co, Scott Ellsworth, Grace Joseph, Malcolm Kane, Ryan Krueger, Johan Kartiwa, Dan Liebling, Jan-Matthis Lueckmann, Paul Raccuglia, Xuefei Wang, Katherine Chou, James Manyika, Yossi Matias, John C. Platt, Lizzie Dorfman, Shibl Mourad, Michael P. Brenner

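AI summary: This paper presents Empirical Research Assistance (ERA), a system that uses a large language model and tree search to automatically create high-quality scientific software, speeding up the development of computational experiments.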

Comments 78 pages, 31 figures, 22 tables

Abstract

The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments\cite{hannay2009how}. To address this, we present Empirical Research Assistance (ERA), an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS)\cite{silver2016mastering} to systematically improve the quality metric and intelligently navigate the large space of possible solutions. ERA achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a diverse range of tasks. In bioinformatics, ERA discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, ERA generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. ERA also produced expert-level software for geospatial analysis, neural activity prediction in zebrafish, and numerical solution of integrals, and a novel rule-based construction for time series forecasting. By devising and implementing novel solutions to diverse tasks, ERA represents a significant step towards accelerating scientific progress.

2508.03865 2026-05-22 cs.CL

An Entity Linking Agent for Question Answering

Yajie Luo, Yihong Wu, Muzhi Li, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie

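AI summary: This paper proposes an LLM-based entity linking agent for the short, ambiguous user questions that arise in question answering, and validates it with two experiments.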

Comments 12 pages, 2 figures

Abstract

Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

2507.20268 2026-05-22 cs.LG eess.SP stat.ML

Reliable Wireless Indoor Localization via Cross-Validated Prediction-Powered Calibration

Seonghoon Yoo, Houssem Sifaou, Sangwoo Park, Joonhyuk Kang, Osvaldo Simeone

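AI summary: This paper proposes a method that uses limited calibration data to simultaneously fine-tune a predictor and estimate the bias of synthetic labels, improving the reliability of wireless indoor localization through cross-validated prediction-powered calibration.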

Abstract

Wireless indoor localization using predictive models with received signal strength information (RSSI) requires proper calibration for reliable position estimates. One remedy is to employ synthetic labels produced by a (generally different) predictive model. But fine-tuning an additional predictor, as well as estimating residual bias of the synthetic labels, demands additional data, aggravating calibration data scarcity in wireless environments. This letter proposes an approach that efficiently uses limited calibration data to simultaneously fine-tune a predictor and estimate the bias of synthetic labels, yielding prediction sets with rigorous coverage guarantees. Experiments on a fingerprinting dataset validate the effectiveness of the proposed method.
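
The role of "estimating residual bias of the synthetic labels" can be illustrated with the basic prediction-powered mean estimate: average the synthetic labels on the large unlabeled pool, then correct that average with the mean error measured on the small labeled calibration set. The Python sketch below shows only this basic correction with made-up numbers; the cross-validated scheme and the coverage-guaranteed prediction sets from the paper are not reproduced.

```python
from statistics import fmean

def prediction_powered_mean(unlabeled_preds, labeled_preds, labels):
    """Mean of synthetic labels plus a bias correction from calibration data.

    Returns (corrected_estimate, estimated_bias).
    """
    bias = fmean([y - f for y, f in zip(labels, labeled_preds)])
    return fmean(unlabeled_preds) + bias, bias

# Hypothetical distances in meters.
UNLABELED_PREDS = [2.0, 4.0, 6.0, 8.0]   # synthetic labels on unlabeled inputs
LABELED_PREDS = [3.0, 5.0]               # synthetic labels on calibration inputs
LABELS = [3.5, 5.5]                      # ground truth for the calibration inputs

estimate, bias = prediction_powered_mean(UNLABELED_PREDS, LABELED_PREDS, LABELS)
print(f"estimated bias: {bias:.2f} m, corrected mean: {estimate:.2f} m")
```

The labeled set is used only to measure how far off the synthetic labels are, so even a few calibration points can correct a much larger pool of predictions.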

2507.17640 2026-05-22 cs.CV

Not All Starting Points Are Equal: Pre-trained Priors and Their Outsized Impact on Person Identification

Thomas M. Metz, Matthew Q. Hill, Alice J. O'Toole

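AI summary: This paper studies how pre-training methods affect person re-identification (re-id), finds that pre-trained weights act as a strong prior during domain adaptation, and shows that simple domain adaptation of large vision foundation models achieves SOTA results.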

Abstract

Recent years have seen an explosion of diverse general purpose pre-training methodologies for computer vision. However, the impact that these pre-training methodologies have on person identification tasks (re-id) remains under-explored. We show that under equated domain adaptation pipelines, there is dramatic variance in person identification outcomes using different starting models (architectures and pre-trained weights). We show that a range of intuitive explanations for differing downstream performance on a range of re-id tests are insufficient and propose that pre-trained weights serve as a strong prior to the weights learned during domain adaptation. This framework allows for domain adapted solutions to be viewed as a maximum probability point estimate of the Gibbs posterior with the pre-trained weights acting as a prior. Under this framework, we show that large, pre-trained foundation models with simple domain adaptation achieve SOTA solutions on a range of re-id datasets (Market, PRCC, DeepChange, BTS) with solutions that are very close in the parameter space to the starting parameters. Moreover, we perform ablations on these solutions and show that they can be reached with small transfer sets and with varying transfer datasets but are sensitive to choice of optimizer, weight-decay, and loss function. Ultimately, we propose that the simple approach of direct fine-tuning using large vision foundation models (CLIP, Dino, EVA, AIM, etc.) needs to serve as an important baseline for future work in re-id.

2507.03674 2026-05-22 cs.CL cs.AI

STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh

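AI summary: This paper proposes STRUCTSENSE, a framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust structured information extraction, and demonstrates cross-task generalization across three domains.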

Abstract

Extracting structured information from scientific literature is critical for accelerating discovery, yet Large Language Models (LLMs) often struggle in specialized domains that require expert knowledge and generalize poorly across tasks. We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91-100% accuracy), metadata and resource extraction from scientific papers (86-93% overall), and named entity recognition (NER) from neuroscience literature (58-75% label accuracy across 8,882 entities). On two biomedical NER benchmarks (NCBI Disease and S800 Species), the system achieves ≥90% relaxed recall and 62.5-85.8% strict recall while extracting 1,000-3,600 additional entities beyond gold annotations. The local concept mapping service achieves Hits@1 of 62-82% under strict matching and 68-86% under semantic matching. These results across three domains demonstrate that StructSense generalizes across tasks while maintaining source grounding and provenance transparency.

2506.23808 2026-05-22 cs.CV

Towards Initialization-free Calibrated Bundle Adjustment

Carl Olsson, Amanda Nilsson

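AI summary: This paper proposes an initialization-free calibrated SfM method that exploits known camera calibration by introducing pairwise relative rotation estimates, yielding near-metric reconstructions and more accurate 3D reconstruction.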

Abstract

A recent series of works has shown that initialization-free bundle adjustment (BA) can be achieved using pseudo Object Space Error (pOSE) as a surrogate objective. The initial reconstruction step optimizes an objective in which all terms are projectively invariant, so it cannot incorporate knowledge of the camera calibration. As a result, the solution is only determined up to a projective transformation of the scene, and the process requires more data for successful reconstruction. In contrast, we present a method that is able to use the known camera calibration, thereby producing near-metric solutions, that is, reconstructions that are accurate up to a similarity transformation. To achieve this, we introduce pairwise relative rotation estimates that carry information about camera calibration. These are only invariant to similarity transformations, thus encouraging solutions that preserve metric features of the real scene. Our method can be seen as integrating rotation averaging into the pOSE framework, striving towards initialization-free calibrated structure from motion (SfM). Our experimental evaluation shows that we are able to reliably optimize our objective, achieving convergence to the global minimum with high probability from random starting solutions, resulting in accurate near-metric reconstructions.

2506.19500 2026-05-22 cs.AI cs.CL cs.LG

NaviAgent: Graph-Driven Bilevel Planning for Scalable Tool Orchestration

NaviAgent: 一种基于图的双层规划用于可扩展的工具编排

Yan Jiang, Hao Zhou, Lizhong GU, Tianlong Li, Ruinan Jin, Wanqi Zhou, Ai Han

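AI summary This paper proposes NaviAgent, a graph-driven bilevel planning framework that decouples task planning from tool execution to improve the scalability and robustness of large-scale tool orchestration; experiments show strong results in task success rate and in real-world use.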

Comments Accepted to ICML 2026

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026
AI中文摘要

大型语言模型(LLMs)越来越多地作为功能调用代理,通过调用外部工具来处理超出其静态知识的任务。然而,它们通常逐个调用工具,缺乏对任务结构的整体视图。由于工具之间往往相互依赖,这导致了错误累积和可扩展性差,尤其是在扩展到数百或数千个工具时。为了解决这些限制,我们提出了NaviAgent,一种显式的双层架构,通过基于工具关系的图建模来解耦任务规划与工具执行。在规划层,基于LLM的代理决定是否直接回应、澄清意图或检索并执行独立于工具间复杂度的工具链。在执行层,工具世界导航模型(TWNM)编码工具之间的结构和行为关系,引导代理生成可扩展且鲁棒的调用序列。通过整合真实工具交互的反馈,NaviAgent实现了规划与执行之间的闭环对齐,使代理能够在大规模工具生态系统中实现自适应导航。在API-Bank和ToolBench上的评估显示,任务成功率(TSR)有持续改进,TWNM在复杂任务上平均提升13.1个百分点。进一步在50个真实API跨7个领域的测试中,展示了4.3-12.0个百分点的持续收益,步骤更少且延迟更低,证明了其在真实世界动态下的鲁棒泛化能力。

英文摘要

Large Language Models (LLMs) increasingly act as function-call agents that invoke external tools to tackle tasks beyond their static knowledge. However, they typically invoke tools one at a time without a global view of task structure. As tools often depend on one another, this leads to error accumulation and poor scalability, particularly when scaling to hundreds or thousands of tools. To address these limitations, we propose NaviAgent, an explicit bilevel architecture that decouples task planning from tool execution through graph-based modeling of tool relations. At the planning level, the LLM-based agent decides whether to respond directly, clarify intent, or retrieve and execute a toolchain independent of inter-tool complexity. At the execution level, a Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, steering the agent to compose scalable and robust invocation sequences. Incorporating feedback from real tool interactions, NaviAgent achieves closed-loop alignment between planning and execution, enabling adaptive navigation in large-scale tool ecosystems. Evaluations on API-Bank and ToolBench show consistent improvements in task success rate (TSR), with TWNM yielding an average gain of 13.1 points on complex tasks. Further tests on 50 real APIs across 7 domains show consistent gains of 4.3-12.0 points, with fewer steps and lower latency, demonstrating robust generalization under real-world dynamics.

2506.16659 2026-05-22 cs.LG cs.AI math.OC

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

通过最小化优化器设计实现内存高效的LLM预训练

Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong

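AI summary This paper studies how simple optimizer design changes can bring SGD to state-of-the-art pretraining performance and proposes the SCALE optimizer, which uses less memory than Adam and outperforms existing memory-efficient optimizers across multiple models.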

Comments Accepted at ICML 2026

详情
AI中文摘要

训练大型语言模型(LLMs)依赖于自适应优化器,如Adam,这些优化器引入了额外的操作,并需要比SGD更多的内存来维护一阶和二阶矩量。尽管最近的工作如GaLore、Fira和APOLLO提出了状态压缩的内存高效变体,但一个根本性的问题仍然存在:plain SGD需要哪些最小的修改才能达到最先进的预训练性能?我们通过自底向上的方法系统地研究了这个问题,并识别出两种简单但高度(内存和计算)高效的技巧:(1)列级梯度归一化(沿输出维度归一化梯度),在没有动量的情况下提升SGD性能;(2)仅在输出层应用一阶动量,因为梯度方差最高。结合这两种技术得到SCALE(Stochastic Column-normAlized Last-layer momEntum),一种简单的优化器,用于内存高效的预训练。在多个模型(60M-1B)上,SCALE的内存使用仅为Adam的35-45%,并且在多个模型上表现优于Adam。它还一致优于内存高效的优化器如GaLore、Fira和APOLLO,使其成为在内存限制下的大规模预训练的强大候选者。对于LLaMA 7B,SCALE在困惑度和内存消耗方面都优于最先进的内存高效方法APOLLO和Muon。

英文摘要

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed memory-efficient variants, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in both perplexity and memory consumption.
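
The two techniques named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration and not the released SCALE implementation: it assumes weights of shape (out_features, in_features) so that "along the output dimension" means axis 0, an exponential-moving-average form of momentum, and normalization applied after the momentum update. The function names `column_normalize` and `scale_like_step`, the learning rate, and `beta` are all hypothetical.

```python
import numpy as np

def column_normalize(grad, eps=1e-8):
    """Scale each column of a 2-D gradient to unit L2 norm.

    Assumption: the norm is taken over axis 0 (the output dimension).
    """
    norms = np.linalg.norm(grad, axis=0, keepdims=True)
    return grad / (norms + eps)

def scale_like_step(params, grads, momentum_buf, last_layer, lr=1e-2, beta=0.9):
    """One update: column-normalized SGD, with momentum only on `last_layer`.

    params and grads map layer names to 2-D arrays; params is updated in place.
    Returns the new momentum buffer, the only optimizer state that is kept.
    """
    for name, g in grads.items():
        if name == last_layer:
            # First-order momentum is stored for the output layer only.
            momentum_buf = beta * momentum_buf + (1.0 - beta) * g
            g = momentum_buf
        params[name] -= lr * column_normalize(g)
    return momentum_buf

rng = np.random.default_rng(0)
params = {"hidden": rng.normal(size=(4, 3)), "output": rng.normal(size=(5, 4))}
grads = {name: rng.normal(size=w.shape) for name, w in params.items()}
buf = np.zeros_like(params["output"])
buf = scale_like_step(params, grads, buf, last_layer="output")
```

In this sketch the only persistent state is one buffer the size of the output layer, whereas Adam keeps two buffers for every parameter; that difference is the source of the memory saving the abstract describes.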

2506.14648 2026-05-22 cs.RO cs.AI

SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning

SENIOR: 在基于偏好的强化学习中高效查询选择与偏好引导探索

Hexian Ni, Tao Lu, Haoyuan Hu, Yinghao Cai, Shuo Wang

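AI summary This paper proposes SENIOR, which improves human feedback efficiency and policy learning speed through efficient query selection and preference-guided exploration, addressing the poor feedback and sample efficiency of preference-based reinforcement learning.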

Comments 8 pages, 8 figures, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

详情
AI中文摘要

基于偏好强化学习(PbRL)方法通过学习基于人类偏好的奖励模型来避免奖励工程。然而,较差的反馈和样本效率仍然是阻碍PbRL应用的问题。本文提出了一种新颖的高效查询选择和偏好引导探索方法,称为SENIOR,能够选择有意义且易于比较的行为片段对,以提高人类反馈效率并加速策略学习,通过设计的偏好引导内在奖励。我们的关键思想是双方面的:(1)我们设计了一种基于运动区别的选择方案(MDS)。它通过状态的核密度估计选择具有明显运动和不同方向的片段对,这更任务相关且更易于人类偏好标注;(2)我们提出了一种新颖的偏好引导探索方法(PGE)。它鼓励探索高偏好和低访问状态,并持续引导智能体获取有价值的样本。两种机制的协同作用可以显著加快奖励和策略学习的进度。我们的实验表明,SENIOR在六个复杂的机器人操作任务(从仿真和现实世界)中,既在人类反馈效率又在策略收敛速度上均优于其他五个现有方法。视频可在我们的项目网站上找到:https://2025senior.github.io/

英文摘要

Preference-based Reinforcement Learning (PbRL) methods avoid reward engineering by learning reward models from human preferences. However, poor feedback efficiency and sample efficiency remain problems that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful and easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with the designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We design a Motion-Distinction-based Selection scheme (MDS). It selects segment pairs with apparent motion and different directions through kernel density estimation of states, which are more task-related and easier for humans to label with preferences; (2) We propose a novel preference-guided exploration method (PGE). It encourages exploration towards states with high preference and low visit counts and continuously guides the agent to acquire valuable samples. The synergy between the two mechanisms can significantly accelerate the progress of reward and policy learning. Our experiments show that SENIOR outperforms five other existing methods in both human feedback efficiency and policy convergence speed on six complex robot manipulation tasks in simulation and four in the real world. Videos can be found on our project website: https://2025senior.github.io/

2503.21821 2026-05-22 cs.AI

PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving

PHYSICS:在大学物理问题求解中基准测试基础模型

Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, Arman Cohan

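AI summary This paper presents the PHYSICS benchmark for evaluating university-level physics problem solving; it contains 1297 expert-annotated problems covering six core areas, and its automated evaluation system reveals substantial limitations of leading foundation models.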

详情
Journal ref
Findings of ACL 2025
AI中文摘要

我们介绍了PHYSICS,一个全面的大学物理问题求解基准测试。它包含1297个专家标注的问题,涵盖六个核心领域:经典力学、量子力学、热力学和统计力学、电磁学、原子物理和光学。每个问题都需要高级物理知识和数学推理。我们开发了一个稳健的自动化评估系统,以实现精确且可靠的验证。对领先基础模型的评估揭示了显著的局限性。即使最先进的模型o3-mini也只能达到59.9%的准确率,突显了解决高水平科学问题的重大挑战。通过全面的错误分析、探索多样的提示策略以及基于检索增强生成(RAG)的知识增强,我们识别出关键的改进领域,为未来的发展奠定了基础。

英文摘要

We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements.

2503.00747 2026-05-22 cs.CV cs.RO eess.IV

LFX: Towards Unified Light Field Dense Semantic Segmentation and Salient Object Detection

LFX:迈向统一的光场密集语义分割和显著物体检测

Fei Teng, Lingxin Huang, Buyin Deng, Kai Luo, Boyuan Zheng, Zheng Fang, Hong Zheng, Kunyu Peng, Jiaming Zhang, Yaonan Wang, Kailun Yang

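AI summary This paper proposes the LFX framework, which uses a unified feature modulation space to adapt to multiple light field representations and different perception tasks, achieving state-of-the-art results on three light field benchmarks and clearly outperforming representation-specific methods.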

Comments The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX

详情
AI中文摘要

光场相机在单次曝光内捕获多视角观测。然而,现有研究通常针对特定的LF表示进行优化,导致该领域缺乏统一的学习框架。为弥合这一差距,我们提出了LFX,首个统一的光场感知框架。LFX建立了一个表示不变的特征调制空间,使其能够适应异构的LF表示和多样的感知任务。具体而言,我们提出了Field-of-Parallax Angular Subspace Modeling(FoP-ASM),为每个辅助视图分配独立的角标记,实现视图间的独立建模。同时,共享流形子空间约束和正则化损失强制在视图间保持全局一致的语义调制。在三个LF基准测试中的广泛评估表明,LFX在不同的LF表示上均取得最佳结果,比特定表示方法高出高达12%和20%,在显著物体检测中达到0.029/0.027的MAE,且在语义分割中达到84.37 mIoU。源代码将在https://github.com/FeiT-FeiTeng/LFX上公开。

英文摘要

Light field cameras capture multi-view observations within a single exposure. However, existing studies are typically tailored to specific LF representations, leaving the field without a unified learning framework. To bridge this gap, we present LFX, the first unified framework for LF perception. LFX establishes a representation-invariant feature modulation space, enabling it to adapt to heterogeneous LF representations and diverse perception tasks. Specifically, we propose Field-of-Parallax Angular Subspace Modeling (FoP-ASM), which assigns an independent angular marker to each auxiliary view, enabling view-wise independent modeling. Meanwhile, shared manifold subspace constraints and regularization losses enforce globally consistent semantic modulation across views. Extensive evaluations across three LF benchmarks show that LFX achieves state-of-the-art results across distinct LF representations, outperforming representation-specific methods by up to 12% and 20% with 0.029/0.027 MAE for salient object detection, and achieving 84.37 mIoU for semantic segmentation. The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX.

2502.09487 2026-05-22 cs.CL cs.AI cs.LG

Internal narratives parameterise affective states

内部叙事参数化情感状态

Jakub Onysk, Quentin J. M. Huys

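AI summary This paper studies the relationship between narrative and affective state by quantifying participants' internal narratives through large-language-model representations and their subspaces; it finds that verbal descriptions of symptom-specific thoughts predict standardised depression scores and stresses that preserving the covariance between symptoms is essential for construct validity.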

详情
AI中文摘要

描述我们如何用语言表达感受对于心理评估和干预至关重要,但叙事与情感状态之间的映射仍然理解不足。在两个大规模研究(n=1257)中,我们通过大语言模型表示及其子空间量化了参与者内部叙事的结构和动态,以参数化抑郁状态。在第一项研究中,我们发现对特定症状的描述性思维捕捉了预测标准化、自我报告抑郁评分的细粒度信息。关键的是,我们显示保持症状之间的特定协方差对于构效效度至关重要,这表明高维文本表示镜像了疾病的潜在几何结构。第二项研究探讨了这种关系的时间动态,当参与者与情感叙事互动时。我们发现量化内部叙事的变化导致自我报告的变化,而基线叙事严重性预测了后续情感变化的幅度。通过将情感视为计算状态,我们的结果强调了其核心、治疗相关功能:约束内部叙事的结构并整合上下文以塑造自我报告。

英文摘要

Characterising how we verbalise our feelings is central to psychological assessment and intervention, yet the mapping between narrative and affective state remains poorly understood. Across two large studies (n=1257), we parameterised the structure and dynamics of depressive states by quantifying participants' internal narratives through large-language-model representations and their subspaces. In Study 1, we found verbal descriptions of symptom-specific thoughts captured granular information predictive of standardised, self-reported depression scores. Critically, we show preserving the specific covariance between symptoms is essential for construct validity, suggesting high-dimensional text representations mirror the latent geometry of the disorder. Study 2 probed the temporal dynamics of this relationship as participants engaged with emotional narratives. We found quantified changes in internal narratives led to changes in self-report, while the baseline narrative severity predicted the magnitude of subsequent affective change. By framing affect as a computational state, our results highlight its core, therapeutically pertinent functions: constraining the structure of internal narratives and integrating context to shape self-report.

2501.00677 2026-05-22 cs.LG cs.CV cs.IT cs.NA math.IT math.NA stat.ML

Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery

深度学习鲁棒矩阵补全用于大规模低秩数据恢复

HanQin Cai, Chandra Kundu, Jialin Liu, Wotao Yin

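AI summary This paper proposes a scalable and learnable non-convex method, Learned Robust Matrix Completion (LRMC), for large-scale robust matrix completion problems; the method has low computational complexity and linear convergence, learns its free parameters effectively via deep unfolding to achieve optimal performance, and shows superior empirical performance on synthetic datasets and real applications.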

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(6): 6541-6556, 2026
AI中文摘要

鲁棒矩阵补全(RMC)是一种广泛使用的机器学习工具,同时解决低秩数据分析中的两个关键问题:缺失数据条目和极端异常值。本文提出了一种新颖的可扩展且可学习的非凸方法,称为学得鲁棒矩阵补全(LRMC),用于大规模RMC问题。LRMC具有低计算复杂度和线性收敛性。受所提出定理的启发,LRMC的自由参数可通过深度展开有效学习以达到最佳性能。此外,本文提出了一种灵活的前馈-递归-混合神经网络框架,将深度展开从固定次数迭代扩展到无限次数迭代。通过在合成数据集和实际应用中的广泛实验,验证了LRMC的优越的实验性能,包括视频背景减除、超声成像、面部建模和卫星图像云去除。

英文摘要

Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimal performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from a fixed number of iterations to infinitely many iterations. The superior empirical performance of LRMC is verified with extensive experiments against state-of-the-art methods on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.

2411.02813 2026-05-22 cs.LG

Sparse Orthogonal Parameters Tuning for Continual Learning

稀疏正交参数调优用于持续学习

Kun-Peng Ning, Hai-Jian Ke, Yu-Yang Liu, Jia-Yu Yao, Yong-Hong Tian, Li Yuan

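AI summary This paper proposes a new method called SoTU, which addresses catastrophic forgetting in continual learning through sparse orthogonal parameter tuning and achieves optimal feature representation for streaming data.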

详情
AI中文摘要

基于预训练模型(PTM)的持续学习方法近年来引起了广泛关注,这些方法能够适应连续的下游任务而无需灾难性遗忘。这些方法通常不更新预训练参数,而是使用额外的适配器、提示和分类器。在本文中,我们从新的角度研究了稀疏正交参数对持续学习的益处。我们发现,合并来自多个流任务的模型所学习的稀疏正交性在解决灾难性遗忘方面具有巨大潜力。利用这一见解,我们提出了一种新颖且有效的称为SoTU(稀疏正交参数调优)的方法。我们假设SoTU的有效性在于将多个领域学到的知识转换为正交delta参数的融合。在多样化的CL基准测试中评估了所提出的方法的有效性。值得注意的是,SoTU在不需要复杂分类器设计的情况下实现了流数据的最优特征表示,使其成为一种即插即用的解决方案。

英文摘要

Continual learning methods based on pre-trained models (PTM), which adapt to successive downstream tasks without catastrophic forgetting, have recently gained attention. These methods typically refrain from updating the pre-trained parameters and instead employ additional adapters, prompts, and classifiers. In this paper, we investigate the benefit of sparse orthogonal parameters for continual learning from a novel perspective. We find that merging the sparse orthogonal parameters of models learned from multiple streaming tasks has great potential for addressing catastrophic forgetting. Leveraging this insight, we propose a novel yet effective method called SoTU (Sparse Orthogonal Parameters TUning). We hypothesize that the effectiveness of SoTU lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters. Experimental evaluations on diverse CL benchmarks demonstrate the effectiveness of the proposed approach. Notably, SoTU achieves optimal feature representation for streaming data without requiring complex classifier designs, making it a plug-and-play solution.

2411.02776 2026-05-22 cs.LG stat.AP

Deep learning-based modularized loading protocol for parameter estimation of Bouc-Wen class models

基于深度学习的模块化加载协议用于Bouc-Wen类模型参数估计

Sebin Oh, Junho Song, Taeyong Kim

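AI summary This paper proposes a modularized deep learning-based loading protocol for optimal parameter estimation of Bouc-Wen class models. The protocol has two key components: optimal loading history construction and CNN-based rapid parameter estimation. Each component is decomposed into independent sub-modules tailored to distinct hysteretic behaviors (basic hysteresis, structural degradation, and pinching effect), making the protocol adaptable to diverse hysteresis models. Three independent CNN architectures are developed to capture the path-dependent nature of these behaviors. By training these architectures on diverse loading histories, minimal loading sequences, termed loading history modules, are identified and combined to construct an optimal loading history. The three trained CNN models serve as rapid parameter estimators. Numerical evaluation, including nonlinear time history analysis of a 3-story steel frame and fragility curve construction for a 3-story reinforced concrete frame, shows that the protocol significantly reduces total analysis time while maintaining or improving estimation accuracy. The protocol can be extended to other hysteresis models, suggesting a systematic approach for identifying general hysteresis models.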

详情
Journal ref
Engineering Structures 339, 120458 (2025)
AI中文摘要

本研究提出了一种模块化的深度学习基于加载协议,用于Bouc-Wen(BW)类模型的最佳参数估计。该协议由两个关键组成部分组成:最佳加载历史构建和基于CNN的快速参数估计。每个组成部分被分解为独立的子模块,针对不同的滞回行为——基本滞回、结构退化和咬合效应——使协议能够适应多种滞回模型。开发了三种独立的CNN架构以捕捉这些滞回行为的路径依赖性。通过在多样化的加载历史上训练这些CNN架构,识别出最小的加载序列,称为加载历史模块,然后将其组合以构建最优的加载历史。三种训练好的CNN模型,分别在相应的加载历史模块上训练,用作快速参数估计器。协议的数值评估,包括三栋钢结构框架的非线性时间历史分析和三栋钢筋混凝土框架的脆弱性曲线构建,表明所提出的协议显著减少了总分析时间,同时保持或提高了估计精度。所提出的协议可以扩展到其他滞回模型,表明了一种系统的方法来识别通用滞回模型。

英文摘要

This study proposes a modularized deep learning-based loading protocol for optimal parameter estimation of Bouc-Wen (BW) class models. The protocol consists of two key components: optimal loading history construction and CNN-based rapid parameter estimation. Each component is decomposed into independent sub-modules tailored to distinct hysteretic behaviors (basic hysteresis, structural degradation, and pinching effect), making the protocol adaptable to diverse hysteresis models. Three independent CNN architectures are developed to capture the path-dependent nature of these hysteretic behaviors. By training these CNN architectures on diverse loading histories, minimal loading sequences, termed loading history modules, are identified and then combined to construct an optimal loading history. The three CNN models, trained on the respective loading history modules, serve as rapid parameter estimators. Numerical evaluation of the protocol, including nonlinear time history analysis of a 3-story steel moment frame and fragility curve construction for a 3-story reinforced concrete frame, demonstrates that the proposed protocol significantly reduces total analysis time while maintaining or improving estimation accuracy. The proposed protocol can be extended to other hysteresis models, suggesting a systematic approach for identifying general hysteresis models.

2411.01332 2026-05-22 cs.LG cs.AI

A Mechanistic Explanatory Strategy for XAI

为XAI的解释性策略机制

Marcin Rabiza

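AI summary This paper proposes a mechanistic explanatory strategy that aims to reveal the mechanisms behind the functional organization of deep learning systems through decomposition, localization, and recomposition, thereby strengthening the theoretical foundations and practical application of explainable AI.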

详情
AI中文摘要

尽管在XAI领域取得了显著进展,学者们指出缺乏坚实的理论基础和与更广泛科学解释 discourse 的整合仍是持续存在的问题。为此,新兴研究借鉴了各种科学和科学哲学文献中的解释策略来填补这些空白。本文概述了一种用于解释深度学习系统功能组织的机制性策略,将近期的可解释人工智能发展置于更广泛的哲学背景下。根据机制方法,对不透明AI系统的解释涉及识别驱动决策的机制。对于深度神经网络,这意味着辨别功能相关组件,如神经元、层、电路或激活模式,并通过分解、定位和重组来理解其作用。图像识别和语言模型的证明原理案例研究将这些理论方法与OpenAI和Anthropic的机制可解释性研究相结合。研究结果表明,追求机制性解释可以揭示传统可解释性技术可能忽略的元素,最终促进更彻底的可解释人工智能。

英文摘要

Despite significant advancements in XAI, scholars note a persistent lack of solid conceptual foundations and integration with broader scientific discourse on explanation. In response, emerging research draws on explanatory strategies from various sciences and the philosophy of science literature to fill these gaps. This paper outlines a mechanistic strategy for explaining the functional organization of deep learning systems, situating recent developments in explainable AI within a broader philosophical context. According to the mechanistic approach, the explanation of opaque AI systems involves identifying mechanisms that drive decision making. For deep neural networks, this means discerning functionally relevant components, such as neurons, layers, circuits, or activation patterns, and understanding their roles through decomposition, localization, and recomposition. Proof-of-principle case studies from image recognition and language modeling align these theoretical approaches with mechanistic interpretability research from OpenAI and Anthropic. The findings suggest that pursuing mechanistic explanations can uncover elements that traditional explainability techniques may overlook, ultimately contributing to more thoroughly explainable AI.

2410.04753 2026-05-22 cs.AI cs.CL cs.LG cs.LO

ImProver: Agent-Based Automated Proof Optimization

ImProver:基于代理的自动证明优化

Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

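AI summary This paper studies the problem of automated proof optimization and proposes ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary criteria such as length and readability; experiments show that it can substantially shorten proofs and make them more modular and readable.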

Comments Published as a conference paper at ICLR 2025

详情
AI中文摘要

大型语言模型(LLMs)已被用于在证明助手如Lean中生成数学定理的正式证明。然而,我们通常希望根据不同的下游用途优化正式证明,例如使其符合某种风格、易于阅读、简洁或模块化。适当优化的证明对于学习任务也非常重要,尤其是因为人工撰写的证明可能不适用于此目的。为此,我们研究了一个新的问题:自动证明优化,即重写证明以使其正确并优化任意标准,如长度或可读性。作为自动证明优化的一种初步方法,我们提出了ImProver,这是一个能够重写证明以优化任意用户定义指标的大型语言模型代理。我们发现直接应用LLMs进行证明优化效果有限,并在ImProver中引入了各种改进,例如新颖的链式状态技术中的符号Lean上下文使用,以及错误校正和检索。我们测试了ImProver在重写真实世界中的本科、竞赛和研究级数学定理方面的性能,发现ImProver能够重写证明使其显著更短、更模块化和更易读。

英文摘要

Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.

2403.03920 2026-05-22 cs.AI cs.CL cs.HC

Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts

提升教学质量:利用计算机辅助文本分析从教育资料中生成深入见解

Zewei Tian, Min Sun, Alex Liu, Shawon Sarkar, Jing Liu

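AI summary This paper explores the transformative potential of computer-assisted textual analysis for enhancing instructional quality through in-depth analysis of educational artifacts. Drawing on Richard Elmore's Instructional Core Framework, it examines how AI and machine learning methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement, and it identifies key advantages of AI/ML in teacher coaching, student support, and content development.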

详情
AI中文摘要

本文探讨了计算机辅助文本分析在通过教育资料的深入分析提升教学质量的变革潜力。我们整合Richard Elmore的Instructional Core Framework,以探讨人工智能(AI)和机器学习(ML)方法,特别是自然语言处理(NLP),如何分析教育内容、教师话语和学生回答,以促进教学改进。通过在Instructional Core Framework内的全面回顾和案例研究,我们识别出AI/ML整合在教师指导、学生支持和内容开发中的关键优势。我们揭示出模式,表明AI/ML不仅简化了行政任务,还引入了新的个性化学习路径,为教育工作者提供可操作的反馈,并有助于更深入地理解教学动态。本文强调了将AI/ML技术与教学目标对齐的重要性,以在教育环境中实现其全部潜力,倡导一种平衡的方法,考虑伦理问题、数据质量和人类专业知识的整合。

英文摘要

This paper explores the transformative potential of computer-assisted textual analysis in enhancing instructional quality through in-depth insights from educational artifacts. We integrate Richard Elmore's Instructional Core Framework to examine how artificial intelligence (AI) and machine learning (ML) methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement. Through a comprehensive review and case studies within the Instructional Core Framework, we identify key areas where AI/ML integration offers significant advantages, including teacher coaching, student support, and content development. We unveil patterns indicating that AI/ML not only streamlines administrative tasks but also introduces novel pathways for personalized learning, providing actionable feedback for educators and contributing to a richer understanding of instructional dynamics. This paper emphasizes the importance of aligning AI/ML technologies with pedagogical goals to realize their full potential in educational settings, advocating for a balanced approach that accounts for ethical considerations, data quality, and the integration of human expertise.

2402.11621 2026-05-22 cs.CL

Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection

解码新闻叙述:对大型语言模型在框架检测中的关键分析

Valeria Pastorino, Jasivan A. Sivakumar, Nafise Sadat Moosavi

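AI summary This paper studies the use of large language models for framing detection, analyzing how different models perform under zero-shot, few-shot, and explanation-based prompting. It finds that model performance is sensitive to prompt design and prone to systematic errors on ambiguous cases, and it introduces a new dataset of out-of-domain news headlines to make evaluation more realistic.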

详情
Journal ref
Proceedings of the 3rd Workshop on Natural Language Processing for Political Sciences (PoliticalNLP 2026) @ LREC 2026, pages 17-28
AI中文摘要

随着新闻报道的复杂性和多样性增加,框架分析已成为计算社会科学中的关键但具有挑战性的任务。传统方法,包括手动标注和微调模型,仍然受到高标注成本、领域特定性和不一致泛化能力的限制。基于指令的大型语言模型(LLMs)提供了一个有前景的替代方案,但它们在框架分析中的可靠性尚不充分。本文系统评估了几个LLMs,包括GPT-3.5/4、FLAN-T5和Llama 3,在零样本、少样本和基于解释的提示设置下的表现。聚焦于领域转移和固有的标注模糊性,我们显示模型性能高度敏感于提示设计,并且在模糊案例中容易出现系统性错误。尽管LLMs,特别是GPT-4,表现出更强的跨领域泛化能力,但它们也显示出系统性偏见,最值得注意的是倾向于将情感语言与框架混淆。为了在现实世界的话题多样性下实现原则性评估,我们引入了一种新的跨领域新闻标题数据集。最后,通过分析多个模型在现有框架数据集上的一致性模式,我们证明了跨模型共识为识别争议标注提供了一个有用的信号,为低资源环境下的数据集审计提供了一种实用方法。

英文摘要

The growing complexity and diversity of news coverage have made framing analysis a crucial yet challenging task in computational social science. Traditional approaches, including manual annotation and fine-tuned models, remain limited by high annotation costs, domain specificity, and inconsistent generalisation. Instruction-based large language models (LLMs) offer a promising alternative, yet their reliability for framing analysis remains insufficiently understood. In this paper, we conduct a systematic evaluation of several LLMs, including GPT-3.5/4, FLAN-T5, and Llama 3, across zero-shot, few-shot, and explanation-based prompting settings. Focusing on domain shift and inherent annotation ambiguity, we show that model performance is highly sensitive to prompt design and prone to systematic errors on ambiguous cases. Although LLMs, particularly GPT-4, exhibit stronger cross-domain generalisation, they also display systematic biases, most notably a tendency to conflate emotional language with framing. To enable principled evaluation under real-world topic diversity, we introduce a new dataset of out-of-domain news headlines covering diverse subjects. Finally, by analysing agreement patterns across multiple models on existing framing datasets, we demonstrate that cross-model consensus provides a useful signal for identifying contested annotations, offering a practical approach to dataset auditing in low-resource settings.
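
The dataset-auditing idea in the last sentence can be sketched as a simple agreement check. The abstract does not specify the rule used, so the criterion below is an assumption: an item is flagged when a large share of models agree with each other but disagree with the gold label. The function name `consensus_flags`, the 0.8 threshold, and the frame labels are all hypothetical.

```python
from collections import Counter

def consensus_flags(predictions, gold, min_agreement=0.8):
    """Flag items where models largely agree with each other but not with gold.

    predictions: one list of model labels per item.
    gold: one gold label per item.
    Returns (item index, consensus label, agreement ratio) tuples.
    """
    flagged = []
    for idx, (labels, gold_label) in enumerate(zip(predictions, gold)):
        top_label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        if agreement >= min_agreement and top_label != gold_label:
            flagged.append((idx, top_label, agreement))
    return flagged

# Three hypothetical models labelling three headlines.
predictions = [
    ["economic", "economic", "economic"],  # consensus matches gold
    ["moral", "economic", "moral"],        # models split, below threshold
    ["security", "security", "security"],  # consensus contradicts gold
]
gold = ["economic", "moral", "political"]
flagged = consensus_flags(predictions, gold)
```

Flagged items are candidates for human re-inspection rather than automatic relabelling, since a shared model bias could also produce a confident but wrong consensus.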

2401.00139 2026-05-22 cs.AI cs.CL cs.LG stat.ME

Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning

增强大语言模型中的因果推理:一种用于精确微调的因果归因模型

Hengrui Cai, Shengjie Liu, Rui Song

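AI summary This paper proposes a causal attribution model that improves the interpretability and causal reasoning abilities of large language models through precise fine-tuning, and demonstrates its effectiveness on causal discovery tasks across different domains.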

Comments A Python implementation of our proposed method is available at https://github.com/ncsulsj/Causal_LLM

详情
AI中文摘要

本文介绍了一种因果归因模型,旨在通过精确微调增强大语言模型(LLMs)的可解释性并提高其因果推理能力。尽管LLMs在多种任务中表现出色,但其推理过程往往仍是一个黑箱,限制了有针对性的增强。我们提出了一种新的因果归因模型,利用“do-运算符”构建干预场景,使我们能够系统地量化LLMs因果推理过程中不同组件的贡献。通过在各种领域中进行因果发现任务来评估所提出的归因分数,我们证明了LLMs在因果发现中的有效性严重依赖于提供的上下文和领域特定知识,但也可以利用数值数据进行有限的相关性推理,而非因果性。这促使了所提出的微调LLM用于成对因果发现,有效且正确地利用了知识和数值信息。

英文摘要

This paper introduces a causal attribution model to enhance the interpretability of large language models (LLMs) and improve their causal reasoning abilities via precise fine-tuning. Despite LLMs' proficiency in diverse tasks, their reasoning processes often remain a black box, which restricts targeted enhancement. We propose a novel causal attribution model that utilizes "do-operators" to construct interventional scenarios, allowing us to systematically quantify the contribution of different components in LLMs' causal reasoning process. By assessing the proposed attribution scores through causal discovery tasks across various domains, we demonstrate that LLMs' effectiveness in causal discovery relies heavily on the provided context and domain-specific knowledge, but that LLMs can also use numerical data through limited calculations based on correlation, not causation. This motivates the proposed fine-tuned LLM for pairwise causal discovery, which effectively and correctly leverages both knowledge and numerical information.