arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.11535 2026-06-11 cs.RO New submission

Adversarial Attacks on Learned Policies for Surgical Robotic Tasks

Shutong Jin, Ziyang Chen, Preethi Satish, Paavan Gupta, Florian T. Pokorny, Ken Goldberg

Affiliations: University of California, Berkeley; KTH Royal Institute of Technology

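AI summary: Studies the vulnerability of learned policies in robot-assisted surgery to adversarial attacks and proposes disruptive and steering attacks; experiments show the attacks reduce surgical subtask success rates by 61% on average.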

Abstract

Learning-based policies are being considered to augment the dexterity of human surgeons in robot-assisted surgery. Can the end-to-end mapping from visual observations to robot actions be vulnerable to adversarial attacks, potentially leading to patient injury? In this paper, we present the first study of adversarial threats to learning-based policies in surgical robotics. We investigate two threat modes: (a) disruptive attacks, where imperceptible visual perturbations interrupt policy execution, and (b) steering attacks, where such perturbations steer policy actions toward attacker-specified directions. We formulate three adversarial attack methods, each with increasing access to policy information, and evaluate their impact on two surgical subtasks: debridement and suturing. Our evaluation covers three end-to-end policy architectures: ACT, Diffusion Policy, and Pi0. In addition, we introduce a new class of photometric adversarial attacks that mimic natural visual changes, such as lighting variations, to generate effective yet visually plausible perturbations. Results from 560 physical experiments using phantoms for debridement and suturing suggest that state-of-the-art policies can be significantly disrupted, resulting in an average 61% reduction in surgical subtask success rates. Project page: this https URL

2606.11531 2026-06-11 cs.CL cs.IT New submission

Measuring language complexity from hierarchical reuse of recurring patterns

Junyi Zhou, Rui Liu, Pengyu Liu, Yu Liu

Affiliations: Department of Systems Science, Faculty of Arts and Sciences, Beijing Normal University; International Academic Center of Complex Systems, Beijing Normal University; Department of Chinese Language and Literature, Faculty of Arts and Sciences, Beijing Normal University; Center for Linguistic Sciences, Beijing Normal University; School of Systems Science, Beijing Normal University; Department of Mathematics and Applied Mathematical Sciences, University of Rhode Island; Department of Cell and Molecular Biology, University of Rhode Island

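AI summary: Proposes the ladderpath index, grounded in algorithmic information theory, which measures language complexity through hierarchical reuse of repeated substructures; results on 21 parallel corpora support the equi-complexity and trade-off hypotheses.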

Comments
17 pages, 4 figures

Abstract

We introduce the ladderpath index as a measure of language complexity grounded in algorithmic information theory. It counts the minimum steps needed to reconstruct a sequence through hierarchical reuse of repeated substructures, capturing an exactly computable but constrained form of algorithmic compressibility related to, but distinct from, Kolmogorov complexity. We apply the ladderpath approach to 21 parallel corpora from the Parallel Universal Dependencies dataset. The ladderpath index is approximately invariant across the languages, and varies much less than the corpus length. This is more pronounced when all corpora are mapped to a unified binary representation, providing evidence for the equi-complexity hypothesis from a representation-independent perspective. We also observe trade-offs between character inventory size and corpus length, and between vocabulary-level and corpus-level reconstruction complexity, supporting the trade-off hypothesis that total complexity is conserved and redistributed across linguistic levels. The reusable substructures identified by the ladderpath approach, without any linguistic input, overlap with words and morphological components attested in the natural vocabulary. The hierarchical reuse captured by the ladderpath approach parallels the chunking mechanisms proposed in cognitive science, where the human cognitive system compresses linguistic input into nested, reusable units under shared memory and processing constraints. This connection between cognitive chunking and the ladderpath approach provides a new interpretation for the equi-complexity and trade-off hypotheses, grounding both in the shared cognitive architecture that underlies language processing across human languages.

2606.11525 2026-06-11 cs.RO cs.LG New submission

Learning Object Manipulation from Scratch via Contrastive Interaction

Tongle Shen, Caleb Chuck, Fan Feng, Biwei Huang

Affiliations: UC San Diego; UT Austin

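AI summary: To address the weak performance of contrastive reinforcement learning on interaction-rich manipulation tasks, proposes Interaction-weighted Resampling, which preserves mode boundaries to better represent multi-modal, piecewise nonlinear reachability; it reports a 19.8% average improvement in simulation and raises real-robot air hockey success from 25% to 60%.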

Abstract

Contrastive Reinforcement Learning (CRL) has seen recent success in a wide variety of goal-conditioned robotics tasks by learning structured representations of the dynamics. However, despite its success in locomotion and simpler control domains, CRL often struggles in interaction-rich manipulation. We argue that a key source of this difficulty is object-centric interaction, such as contact or grasping, that induces distinct changes in the underlying dynamic modes. In this work, we formulate manipulation dynamics as a piecewise-smooth Markov process and show that interaction-induced mode changes create piecewise nonlinear reachability structures that are difficult for standard CRL energy functions to represent and plan over. Based on this analysis, we introduce Interaction-weighted Resampling (IWR). IWR performs interaction-aware resampling around phases before, during, and after interactions, encouraging the learned representation to preserve the mode boundaries that determine future reachability to capture multi-modal and piecewise nonlinear reachability. Across interaction-centric environments, including 2D dynamic control, robotic manipulation, and robot air hockey, IWR improves both sample efficiency and overall performance over prior CRL methods, with 19.8% average improvement in simulation. Finally, using a sim-to-real pipeline with policies trained by IWR, we demonstrate the first real-world goal-conditioned robot air hockey agent capable of hitting goals, improving success from 25% to 60%. Project Page: this http URL.
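
The resampling idea can be illustrated with a minimal sketch. This is not the authors' IWR implementation: the binary `contact` flag, the `window` size, and the `boost` factor are hypothetical choices made only to show what sampling more densely before, during, and after an interaction could look like.

```python
import random


def interaction_weights(contact, window=2, boost=5.0):
    """Weight each timestep: steps within `window` of a contact step get `boost`, others 1.0."""
    hits = [t for t, c in enumerate(contact) if c]
    return [boost if any(abs(t - h) <= window for h in hits) else 1.0
            for t in range(len(contact))]


def resample(contact, k, seed=0, **kw):
    """Draw k timestep indices with probability proportional to the interaction weights."""
    weights = interaction_weights(contact, **kw)
    return random.Random(seed).choices(range(len(contact)), weights=weights, k=k)


# Hypothetical 20-step trajectory with a single contact at t = 10.
contact = [0] * 10 + [1] + [0] * 9
indices = resample(contact, k=1000)
near_contact = sum(8 <= t <= 12 for t in indices) / len(indices)
```

With uniform sampling about a quarter of the draws would fall in the five steps around the contact; with these illustrative weights the expected share is 25/40.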

2606.11522 2026-06-11 cs.AI cs.LG New submission

Search Discipline for Long-Horizon Research Agents

Adithya Srinivasan, Devesh Paragiri

Affiliations: North Carolina State University; University of Maryland

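AI summary: Shows that research agents that evaluate candidates on an aggregate metric can invert scientific validity, and proposes an external audit protocol that decides on disaggregated behavior rather than a single score.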

Comments
9 pages, 1 figure

Abstract

Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific validity lives in that disaggregated structure, the aggregate can rank the wrong candidate first. The headline number improves while the structure underneath inverts, so a decision made on the number accepts a candidate that quietly breaks the model. The failure is not domain-specific. It appears wherever a candidate's validity is multi-dimensional but its verifier is a single reduction. We demonstrate the inversion on a fire-model task in the Ecosystem Demography model. The highest-scoring candidate and a slightly lower one are within noise of each other on global score, yet the top-scoring one collapses the protected boreal regions while the other preserves them. What separates them is the per-region behavior, not the headline number. This decision should not be left to the agent that produced the candidates. The agent optimizing the score is the last party likely to catch the score being wrong, and a prompt has no remaining turn once the agent has stopped. We move the decision to an external control loop that audits each candidate on its disaggregated behavior and acts after the agent has decided. It can demote a candidate the agent would have accepted, and it can reopen a run the agent had declared finished. Our contribution is the inversion finding itself, and a search-discipline protocol that decides on reviewable candidate-effect evidence instead of the score.
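
The inversion described in this abstract is easy to reproduce on toy numbers. The sketch below is not the paper's protocol: the region names, scores, and the `max_drop` threshold are invented for illustration. It only shows how a headline mean can rank first a candidate that a per-region audit would demote.

```python
def aggregate(scores):
    """Headline metric: unweighted mean over regions."""
    return sum(scores.values()) / len(scores)


def audit(candidate, baseline, protected, max_drop=0.05):
    """Return the protected regions where the candidate falls more than max_drop below baseline."""
    return [r for r in protected if baseline[r] - candidate[r] > max_drop]


# Hypothetical per-region skill scores (higher is better); not data from the paper.
baseline = {"tropical": 0.60, "temperate": 0.62, "boreal": 0.70}
cand_a = {"tropical": 0.85, "temperate": 0.84, "boreal": 0.30}  # best headline, boreal collapses
cand_b = {"tropical": 0.66, "temperate": 0.65, "boreal": 0.67}  # slightly lower headline
protected = ["boreal"]

# Rank by the aggregate, then let the external audit pick the first candidate that passes.
ranked = sorted({"a": cand_a, "b": cand_b}.items(),
                key=lambda kv: aggregate(kv[1]), reverse=True)
accepted = next(name for name, scores in ranked
                if not audit(scores, baseline, protected))
```

Here candidate `a` wins on the aggregate, but the audit flags its boreal score, so the external loop accepts `b` instead.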

2606.11521 2026-06-11 cs.LG New submission

Counterexample Guided Learning in the Large using Reasoning Agents

Hongyi Liu, Frederic Sala, Thomas Reps, Adithya Murali

Affiliations: University of Wisconsin-Madison

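AI summary: Proposes a counterexample-guided framework for LLM regular-expression induction; verifier feedback and agentic strategies such as reflection and repair loops substantially improve sample efficiency and success on complex tasks.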

Comments
Code, data, and resources are publicly available for research purposes: this https URL

Abstract

LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to perform regular-expression induction, a classical symbolic learning problem where precise mechanisms for feedback exist in the form of counterexamples. In counterexample-guided learning, a learner (LLM) proposes candidate regular expressions from positive/negative-labeled strings, and the teacher (verifier) returns counterexamples showcasing the difference between the candidate and target languages. We identify novel counterexample-guided refinement strategies that enable effective regex learning, such as regularization and symbolic counterexample clusters. We also explore agentic strategies such as reflection and repair loops. Empirically, we find that verifier feedback substantially improves sample efficiency on challenging regex-induction tasks, reducing the number of labeled examples required and enabling learning of complex target expressions where standard prompting fails. For example, on the hardest task groups, our counterexample-guided framework improves success from 3.2% to 38.1% and from 38.9% to 74.1% on two different regex domains. These results suggest that LLMs can benefit from rich feedback beyond treating it as additional data, opening the door for robust verifier-guided methods for LLM-based program synthesis and formal reasoning.
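
The learner and teacher loop can be sketched in a few lines. Two assumptions are made here and neither comes from the paper: the learner is a fixed list of proposals standing in for an LLM, and the verifier checks equivalence only by enumerating strings up to `max_len` over a small alphabet, whereas an exact verifier would compare automata.

```python
import itertools
import re


def counterexample(candidate, target, alphabet="ab", max_len=6):
    """Return the shortest string on which candidate and target disagree, or None."""
    cand, targ = re.compile(candidate), re.compile(target)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(cand.fullmatch(s)) != bool(targ.fullmatch(s)):
                return s
    return None


def learn(proposals, target):
    """Try proposals in order until the bounded verifier finds no counterexample."""
    feedback = []
    for candidate in proposals:
        cex = counterexample(candidate, target)
        if cex is None:
            return candidate, feedback
        feedback.append((candidate, cex))  # an LLM learner would condition on this
    return None, feedback


# Hypothetical run: the target language is (ab)*.
learned, feedback = learn(["a*b*", "(ab)+", "(ab)*"], "(ab)*")
```

In this run the first proposal is rejected by the string `a` and the second by the empty string, after which the third proposal passes the bounded check.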

2606.11518 2026-06-11 cs.LG cs.AI New submission

SirenFNO: Efficient and Full Frequency Learning of Fourier Neural Operators

Pengqing Shi, Jie Yin, Stephen Tierney, Junbin Gao

Affiliations: The University of Sydney

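AI summary: Proposes SirenFNO, which uses sinusoidal representation networks (SIRENs) to learn implicit neural representations for mode-wise kernel parameterization, removing frequency truncation and learning the full spectrum; it improves performance on multiple PDE benchmarks with up to 73 times fewer parameters.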

Comments
9 pages, accepted by IJCAI 2026

Abstract

Fourier neural operators (FNOs) are effective and efficient surrogates for approximating solutions of PDEs and generalize across discretizations. However, owing to the reliance on frequency truncation to maintain learning efficiency of FNOs, empirical studies suggest that FNOs exhibit spectral bias toward low-frequency information, which may hinder the learning capability especially for certain PDEs with strong high-frequency oscillations. To address this limitation, we propose SirenFNO, a novel framework that leverages sinusoidal representation networks (SIRENs) to learn implicit neural representations and performs mode-wise kernel parameterization. Our SIREN parameterization learns a full-grid spectrum with a constant and discretization-independent parameter count, thereby eliminating the need for frequency truncation. We further extend SirenFNO with functional tensor decompositions to enhance parameter and learning efficiency. Empirical results show that our SirenFNO consistently outperforms FNO with approximately $4$ to $15$ times parameter reductions with preserved discretization invariance, and our functional decomposition variants obtain performance improvements with a maximum of $73$ times fewer parameters across multiple PDE benchmarks.

2606.11514 2026-06-11 cs.SD New submission

CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech

Brian Yan, Qingzheng Wang, Matthew Wiesner, Anuj Diwan, Olga Iakovenko, Alexander Polok, Injy Hamed, Shuichiro Shimizu, Iris Emerman, Thomas Hain, David R. Mortensen, Peter Viechnicki, Shinji Watanabe

Affiliations: Carnegie Mellon University; Johns Hopkins University; University of Texas at Austin; University of Sheffield; Brno University of Technology; MBZUAI; Kyoto University

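AI summary: Presents CS-YODAS, a dataset of in-the-wild code-switched speech mined from multilingual YouTube data with a scalable human-in-the-loop pipeline; it totals 313 hours across 7 matrix languages, and the paper analyzes its distribution and language-pair switching patterns.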

Abstract

We present CS-YODAS, a Creative Commons-licensed dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching (CS), or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hours and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: this https URL.

2606.11512 2026-06-11 cs.CL New submission

SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment

Kaiwen Shi, Zheyuan Zhang, Yanfang Ye

Affiliations: University of Notre Dame

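AI summary: Proposes SAGE, a group-level uncertainty target built from an answer-conditioned uncertainty geometry over sampled model responses, and trains verbal uncertainty expressions with the GUPO framework; across several reasoning tasks it improves uncertainty ranking and reduces calibration error and overconfidence.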

Abstract

Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional calibration problem: the appropriate uncertainty target for a prompt should be estimated from repeated model outputs rather than from an isolated response. However, group rollouts alone are insufficient, since the resulting target must provide a useful training signal. Existing targets only partially satisfy this requirement. We propose SAGE, Semantic-Answer Guided Entropy, a group-level uncertainty target that constructs an answer-conditioned uncertainty geometry over sampled responses. SAGE preserves categorical, numeric, and symbolic answer distinctions while maintaining a smooth and scale-preserving calibration signal. We further apply this target through Group-Uncertainty Preference Optimization, or GUPO, an uncertainty-channel training framework that supervises verbal uncertainty expressions rather than the full response. Experiments across factual, mathematical, and multiple-choice reasoning tasks show improved uncertainty ranking, lower calibration error, and reduced overconfidence.
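
For intuition about estimating an uncertainty target from repeated outputs, the simplest group-level target is the entropy of the final answers across a group of samples. The sketch below is that plain baseline, not SAGE: it treats every pair of distinct answers as equally different, which is exactly the kind of distinction (numeric closeness, symbolic equivalence) the abstract says SAGE preserves. The normalization by `log(n)` is an assumption made for illustration.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Normalized Shannon entropy of sampled final answers (0 = all agree, 1 = all differ)."""
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)


# Hypothetical groups of four sampled answers to one prompt.
confident = answer_entropy(["Paris", "paris ", " PARIS", "Paris"])
mixed = answer_entropy(["42", "42", "42", "41"])
scattered = answer_entropy(["a", "b", "c", "d"])
```

A verbal confidence statement could then be supervised against a value like `1 - answer_entropy(...)`, though how the target is mapped to wording is not specified here.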

2606.11508 2026-06-11 cs.LG q-bio.QM New submission

Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction

Yifan Xue, Srimukh Prasad Veccham, Saee Paliwal, Tyler Shimko, Micha Livne

Affiliations: NVIDIA

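AI summary: Proposes a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information, folding reconstruction, contrastive, and chemistry tasks into one probabilistic latent-variable objective; multi-task fine-tuning uses task-specific MLP heads, and improvements over the KERMT baseline are 7.6%, 9.9%, and 9.5% on three datasets.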

Abstract

Accurate prediction of absorption, distribution, metabolism, and excretion (ADME) properties is critical to drug discovery, but remains challenging because ADME endpoints are noisy, interdependent, and often data-limited. We propose a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information machine learning (cMIM). Our method encodes molecular graphs into latent variables, reconstructs SMILES strings from the graph-derived latent codes, and augments the contrastive objective with domain-specific self-supervised chemistry tasks. Rather than treating these tasks as auxiliary regularizers with separately tuned loss weights, we formulate reconstruction, contrastive discrimination, and chemistry-specific supervision as unit-weighted log-probability factors in a single probabilistic latent-variable objective. For fine-tuning, we propose a multi-task GNN readout architecture with task-specific multilayer perceptron heads, preserving shared representation learning while mitigating negative transfer and improving the modeling of heterogeneous, nonlinear task relationships. Across Biogen, ExpansionRX, and ChEMBL-MT, the resulting Contrastive KERMT pretraining improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% respectively (averaged over significantly-improved endpoints). Adding ADME-adjacent molecules to the pretraining corpus further improves transfer, and the contrastive component sharpens chemically meaningful latent neighborhoods.

2606.11507 2026-06-11 cs.CV New submission

SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

Abdalmalek Aburaddaha, Venkatraman Narayanan, Keval Thaker, Samir A. Rawashdeh

Affiliations: University of Michigan-Dearborn

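AI summary: Proposes SceneMiner, a unified camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass; it identifies cross-task interference and resolves it with identity-preserving multi-task fine-tuning, which zero-initializes new sub-modules and freezes the parameters that feed the shared stream.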

Abstract

Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: this https URL
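
The zero-initialization half of the fix can be checked on a toy model. The sketch below is an assumption-laden illustration, not SceneMiner's architecture: a bias-free linear "backbone" feeds a shared stream, a new residual sub-module writes into that stream, and a frozen sibling head reads from it. It shows only the state at insertion time; keeping the stream unchanged during training depends on the freezing rule described in the abstract, which is not modeled here.

```python
def linear(w, x):
    """Bias-free dense layer; w is a list of weight rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def shared_stream(x, backbone_w, new_w=None):
    """Frozen backbone output plus an optional residual contribution from a new sub-module."""
    h = linear(backbone_w, x)
    if new_w is not None:
        h = [hi + di for hi, di in zip(h, linear(new_w, x))]
    return h


# Hypothetical weights and input, chosen only for the demonstration.
x = [0.3, -1.2, 2.0]
backbone_w = [[0.5, -0.1, 0.2], [1.0, 0.4, -0.7]]
sibling_head_w = [[0.9, -0.3]]  # frozen head that reads the shared stream

before = linear(sibling_head_w, shared_stream(x, backbone_w))
zero_init = [[0.0] * 3 for _ in range(2)]
after_zero = linear(sibling_head_w, shared_stream(x, backbone_w, zero_init))
nonzero_init = [[0.01, 0.02, -0.01], [0.03, 0.0, 0.02]]
after_nonzero = linear(sibling_head_w, shared_stream(x, backbone_w, nonzero_init))
```

With the zero-initialized sub-module the sibling head's output is unchanged; with even small nonzero weights it shifts, although the head's own weights never moved.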

2606.11502 2026-06-11 cs.CL cs.AI New submission

When Roleplaying, Do Models Believe What They Say?

Benjamin Sturgeon, David Africa, Sid Black

Affiliations: MATS

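AI summary: Uses linear truth probes to study how role-play affects LLM internal representations; role-play mainly changes outputs rather than internal truth representations, whereas Emergent Misalignment shifts internal representations more strongly.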

Abstract

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.

2606.11499 2026-06-11 cs.CL cs.AI New submission

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

Vedant Badoni, Danqi Chen, Xinyi Wang

Affiliations: Princeton Language and Intelligence; Princeton University

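AI summary: Proposes WebGraphMix, which uses structural centrality scores over the Common Crawl host-level web graph to adjust the ratio of central to peripheral documents in pretraining data without model training or labeled data; in experiments with 400M- and 1B-parameter models, a 1:1 mixture averages 41.4% versus 39.8% for uniform sampling.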

Comments
10 pages

Abstract

The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. A mixture combining both at a 1:1 ratio achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.
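
A small sketch of the select-by-centrality idea follows. The abstract does not say which centrality measure WebGraphMix uses, so PageRank is an assumption here, as are the top-half/bottom-half split, the toy host graph, and the function names.

```python
import random


def pagerank(graph, damping=0.85, iters=100):
    """Power-iteration PageRank over a dict mapping host -> list of hosts it links to."""
    hosts = list(graph)
    rank = {h: 1.0 / len(hosts) for h in hosts}
    for _ in range(iters):
        new = {h: (1.0 - damping) / len(hosts) for h in hosts}
        for h, outs in graph.items():
            targets = outs or hosts  # dangling hosts spread their rank uniformly
            share = damping * rank[h] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


def mix(docs_by_host, rank, n_docs, central_frac=0.5, seed=0):
    """Sample n_docs: central_frac from the top half of hosts by rank, the rest from the bottom half."""
    ordered = sorted(rank, key=rank.get, reverse=True)
    half = len(ordered) // 2
    central = [d for h in ordered[:half] for d in docs_by_host[h]]
    fringe = [d for h in ordered[half:] for d in docs_by_host[h]]
    rng = random.Random(seed)
    k = round(n_docs * central_frac)
    return rng.sample(central, k) + rng.sample(fringe, n_docs - k)


# Hypothetical host graph: three hosts link to hub.org, which links back to a.net.
graph = {"hub.org": ["a.net"], "a.net": ["hub.org"], "b.net": ["hub.org"], "c.net": ["hub.org"]}
rank = pagerank(graph)
docs = {h: [h + "/1", h + "/2"] for h in graph}
sample = mix(docs, rank, n_docs=4)  # central_frac=0.5 corresponds to a 1:1 mixture
```

Because the scores come from link structure alone, nothing in this sketch reads document text, which matches the abstract's point that the signal is largely orthogonal to content-based filters.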

2606.11489 2026-06-11 cs.RO New submission

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

Satyajeet Das, Darren Chiu, Shashank Hegde, Gaurav S. Sukhatme

Affiliations: University of Southern California

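AI summary: Proposes CLAE, an inference-time framework that steers multi-robot behavior by editing the intermediate activations of a frozen policy, with no fine-tuning or retraining; on a multi-quadrotor navigation task it demonstrates velocity control, formation keeping, and surveillance avoidance.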

Abstract

Real-world robots need to adapt their behavior beyond the envelope of their pre-trained policy. Policy finetuning or retraining are options, but they risk catastrophic forgetting, degrading the pretrained policy's base performance. To combat this, we introduce CLAE: Closed-Loop Affine Activation Editing, an inference-time framework for steering the behavior of a frozen policy by editing intermediate activations while keeping the base policy weights and downstream action head untouched. CLAE approaches behavior steering as a closed-loop problem whose outputs edit policy activations that adapt online to the robot state, environment, target behavior, and multi-robot context. It trains a sparse autoencoder over frozen-policy activations, selects behavior-relevant latent features via post-hoc probing, and learns a lightweight RL-based steering policy that applies state-dependent affine edits to selected latents during inference. We validate CLAE on a frozen multi-quadrotor navigation policy trained to perform a single task: navigating robots to a set of goal locations while avoiding obstacles. Through extensive simulations and physical tests, we show that, while the robots navigate to their goal positions, CLAE can (1) steer individual robot behavior by controlling each robot's velocity profile; (2) coordinate multirobot behavior by preserving a desired formation; and (3) produce entirely new behavior in which robots reduce their exposure to surveillance cameras in the environment.
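
The core edit is small enough to write out. The sketch below shows only the affine step on selected latents; in CLAE the scale and shift would come from the learned steering policy and the latents from a sparse autoencoder, both of which are replaced here by hypothetical hand-picked numbers.

```python
def affine_edit(latents, selected, scale, shift):
    """Apply z_i <- scale_i * z_i + shift_i to the selected latent indices; leave the rest untouched."""
    edited = list(latents)
    for i, a, b in zip(selected, scale, shift):
        edited[i] = a * latents[i] + b
    return edited


# Hypothetical sparse-autoencoder latents for one robot at one timestep.
latents = [1.0, 2.0, 3.0]
steered = affine_edit(latents, selected=[1], scale=[0.5], shift=[0.25])
unchanged = affine_edit(latents, selected=[1], scale=[1.0], shift=[0.0])
```

An identity edit (scale 1, shift 0) returns the original latents, so the frozen policy's base behavior is recovered whenever the steering policy chooses not to intervene.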

2606.11480 2026-06-11 cs.LG New submission

Accurate and Resource-Efficient Federated Continual Learning

Jebacyril Arockiaraj, Dhruv Parikh, Jayashree Adivarahan, Rajgopal Kannan, Viktor Prasanna

Affiliations: University of Southern California; DEVCOM Army Research Office

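AI summary: Proposes FedRAN, which replaces gradient updates with compact random-feature statistics, cuts communication with truncated SVD, and handles label scarcity with prototype-based pseudo-labeling; it improves accuracy and sharply reduces resource use on several datasets.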

Comments
Technical Report

Abstract

Federated continual learning (FCL) must learn from distributed task streams under limited resources, such as communication, computation, memory, and label availability. Existing FCL methods often rely on repeated local optimization, replay, and full supervision. Analytic alternatives avoid iterative training and replay, but using high-dimensional random features to improve accuracy requires a second-order feature statistic, the Gram matrix, which has a quadratic communication cost in the random feature size $M$. We propose FedRAN, a resource-aware analytic FCL framework that replaces gradient-based updates with compact random feature statistics. Each client transmits a truncated-SVD summary of its Gram matrix, reducing the dominant second-order upload from quadratic to linear in $M$ for fixed rank. The server performs a two-level QR-SVD subspace merge, spatially across clients and temporally across tasks, and solves a ridge classifier in closed form. FedRAN further supports label scarcity through prototype-based pseudo-labeling. Across CIFAR-100, ImageNet-R, and VTAB datasets, FedRAN improves average accuracy by up to 4.8 percentage points over the strongest baseline, uses 30.6-121.8$\times$ less per-client communication than optimization-based FCL, and is 190.3$\times$ faster on average than gradient-based baselines; with only 20% labels, pseudo-labeling improves average accuracy by up to 6.61 points. These results show that FedRAN enables accurate and resource-efficient FCL under communication, computation, and label constraints. The source code is available at this https URL.
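
The quadratic-to-linear claim can be made concrete by counting uploaded floats. This is a back-of-the-envelope sketch, not FedRAN's actual wire format: it assumes a client sends either the full $M \times M$ Gram matrix or, for a symmetric Gram matrix, $r$ singular vectors of length $M$ plus $r$ singular values. The values of `m` and `rank` are hypothetical and unrelated to the ratios reported in the abstract.

```python
def gram_upload_floats(m):
    """Floats needed to send a full M x M Gram matrix."""
    return m * m


def truncated_svd_upload_floats(m, rank):
    """Floats for a rank-r summary: r singular vectors of length M plus r singular values."""
    return rank * m + rank


# Hypothetical sizes for illustration only.
m, rank = 10_000, 64
full = gram_upload_floats(m)                     # 100_000_000
compact = truncated_svd_upload_floats(m, rank)   # 640_064
```

Doubling `m` quadruples the full upload but adds only `rank * m` floats to the summary, which is the linear-in-$M$ behavior the abstract describes for fixed rank.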

2606.11477 2026-06-11 cs.CV cs.AI 新提交

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

迈向全自动考试评分:基于基础模型的笔迹答案公平性识别

Hartwig Grabowski

发表机构 * Institute for Machine Learning and Analytics (IMLA), Offenburg University(奥芬堡大学机器学习和分析研究所(IMLA))

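AI summary: Uses vision-language foundation models (VLMs) to recognize handwritten answers, reaching 98.4% accuracy on 61 exams (3141 answer positions); a lightweight prompt lowers the false-negative rate to 0.58%, enabling fair, fully automated grading.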

详情
Comments
11 pages, 2 figures, 3 tables
AI中文摘要

手工批改手写试卷既耗时又容易出错,尤其是对于大规模班级,而全数字化考试往往迫使教学局限于封闭式问题格式。一个实用的折中方案是保留纸质、问题导向的任务,但将评估相关的答案以单个大写字母记录在机器可读的表格中。开放的问题是,这种读取能否足够准确,并且最重要的是,足够公平以实现无监督评分。早期的自动化方法仅达到约88%–91%的识别率——太低——并且在最关键的案例上失败:答案写在单元格外、被划掉或草书书写。我们展示了通用视觉-语言基础模型(VLM),它解释页面而非匹配像素模板,弥补了这一差距。在一个包含61份匿名考试(3141个答案位置)的基准测试中,最佳模型达到了98.4%的准确率,远高于之前的基线。关键的是,我们以公平性为中心进行评估:我们区分假阴性(正确答案被标记为错误,对学生不利)和假阳性,并且一个提供参考答案作为上下文的轻量提示将假阴性率降至0.58%。在示例性评分方案下,61份考试中只有3份会被评得更差,所有这些都通过学生自我审查步骤被发现。因此,大规模的全自动、公平性感知考试评分是合理的;我们发布匿名基准以支持可重复性。

英文摘要

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%-91% recognition, which is too low, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
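The fairness distinction can be made concrete with a short sketch. The function name `grading_error_rates` is hypothetical, and the choice of denominators (truly correct answers for false negatives, truly wrong answers for false positives) is an assumption; the abstract does not state how its 0.58% figure is normalized.

```python
def grading_error_rates(truly_correct, graded_correct):
    """Return (false_negative_rate, false_positive_rate) for per-answer grading decisions.

    truly_correct[i]  -- True if answer i is actually correct
    graded_correct[i] -- True if the automated grader marked answer i as correct
    """
    pairs = list(zip(truly_correct, graded_correct))
    # False negative: a correct answer marked wrong (disadvantages the student).
    false_neg = sum(1 for truth, graded in pairs if truth and not graded)
    # False positive: a wrong answer marked correct.
    false_pos = sum(1 for truth, graded in pairs if graded and not truth)
    n_correct = sum(1 for truth, _ in pairs if truth)
    n_wrong = len(pairs) - n_correct
    fn_rate = false_neg / n_correct if n_correct else 0.0
    fp_rate = false_pos / n_wrong if n_wrong else 0.0
    return fn_rate, fp_rate
```

Reporting the two rates separately, rather than a single accuracy figure, is what lets an evaluation weight student-disadvantaging errors more heavily.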

2606.11473 2026-06-11 cs.LG cs.AI stat.ML 新提交

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

CRUMB: 通过分布匹配上下文批处理实现高效先验拟合网络推理

Jamie Heredge, Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Niraj Kumar

发表机构 * Global Technology Applied Research, JPMorganChase(摩根大通全球技术应用研究)

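AI summary: Proposes CRUMB, which clusters the test queries, selects training subsets by minimizing the maximum mean discrepancy, and then runs exact inference, speeding up prior-fitted network inference without retraining and outperforming comparable methods on 51 datasets.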

详情
Comments
26 pages, 13 figures
AI中文摘要

先验拟合网络(PFNs)是一类有前景的表格基础模型,执行上下文学习,其中整个带标签的训练集作为上下文提供,并在单次前向传播中生成测试查询的预测。然而,许多PFN架构中二次缩放的自注意力机制使得对于非常大的训练数据集推理变得不可行。我们提出CRUMB(使用最小化MMD批处理的聚类检索),一个三阶段推理包装器:(i)聚类测试查询,(ii)通过贪心最小化最大均值差异(MMD)为每个聚类选择一个小型、分布匹配的训练子集,(iii)在每个缩减上下文的批次上执行精确的PFN推理。CRUMB是架构无关的,无需重新训练。在51个数据集的TabArena基准测试中,跨三种PFN架构(TabPFNv2、TabICLv1、TabICLv2)评估,我们展示了CRUMB优于类似的最先进的上下文选择策略。我们还展示了CRUMB对协变量漂移具有鲁棒性,因为MMD最小化步骤自然有助于对齐训练上下文分布以匹配当前测试批次分布。

英文摘要

Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.
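Stage (ii) can be sketched as a naive greedy loop. This is an illustrative sketch, not CRUMB's code: the RBF kernel, the `gamma` bandwidth, and the function names are assumptions, and the loop recomputes the full MMD at every step for clarity rather than speed.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Pairwise RBF kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between samples x and y."""
    return (rbf_kernel(x, x, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean()
            + rbf_kernel(y, y, gamma).mean())

def greedy_mmd_subset(train, queries, k, gamma=1.0):
    """Greedily pick k training rows whose empirical distribution best matches `queries`."""
    chosen = []
    remaining = list(range(len(train)))
    for _ in range(k):
        # Add the candidate that gives the smallest MMD^2 for the enlarged subset.
        best = min(remaining, key=lambda i: mmd2(train[chosen + [i]], queries, gamma))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

The returned indices would then form the reduced context for one cluster of test queries; a practical version would update the kernel sums incrementally instead of recomputing them.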

2606.11470 2026-06-11 cs.CL 新提交

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

LLM推理的周期表:推理范式、方法与失败模式的结构化综述

Avinash Anand, Mahisha Ramesh, Avni Mittal, Ashutosh Kumar, Erik Cambria, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah

发表机构 * Singapore Institute of Technology(新加坡理工大学) Nvidia AI Center (SNAIC)(英伟达人工智能中心(SNAIC)) MIDAS Lab, IIIT Delhi(IIIT德里MIDAS实验室) MIDAS Lab, IIT Mandi(IIT曼迪MIDAS实验室) Owl Autonomous Imaging, Inc.(Owl自主成像公司) College of Computing & Data Science, NTU Singapore(新加坡南洋理工大学计算与数据科学学院) NVIDIA AI Technology Centre, Singapore(英伟达新加坡人工智能技术中心) Department of Computer Science and Engineering, IIT Kanpur(IIT坎普尔计算机科学与工程系)

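AI summary: Systematically surveys more than 300 papers, proposes a structured taxonomy of LLM reasoning research covering multiple reasoning paradigms, analyzes methodological trends, and summarizes common limitations and failure modes, aiming to serve as a reference for building more robust, interpretable, and generalizable reasoning systems.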

详情
AI中文摘要

大型语言模型(LLM)在自然语言处理任务中取得了强劲表现,但可靠推理仍是一个开放挑战。尽管现代LLM在结构化推理、多步问题求解和上下文理解方面显示出进展,但其推理行为往往不一致,且对提示策略、任务设计和模型规模敏感。本综述对来自arXiv、Semantic Scholar、Google Scholar、Papers with Code和ACL Anthology的300多篇近期论文进行了系统分析,以考察推理能力如何在LLM中涌现以及它们在何处失败。我们做出三项主要贡献。首先,我们引入了LLM推理研究的结构化分类法,涵盖思维链推理、多跳推理、数学推理、常识推理、视觉与时间推理、代码与算法推理、检索增强推理、工具增强与智能体推理以及基于强化学习的推理。其次,我们分析了这些范式中的方法论趋势,包括提示方法、模型架构、训练目标、奖励建模和评估基准。第三,我们综合了反复出现的局限性和失败模式,例如推理幻觉、脆弱的多步推理、弱的因果抽象以及差的跨域泛化。通过组织快速扩展的文献,本综述提供了LLM推理当前能力和局限性的统一视图。我们还识别了新兴研究方向,包括元推理、自进化推理框架、多模态推理和社会基础推理。总体而言,本工作旨在为未来语言模型中开发更鲁棒、可解释和可泛化的推理系统提供参考。

英文摘要

Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem solving, and contextual understanding, their reasoning behavior is often inconsistent and sensitive to prompting strategies, task design, and model scale. This survey provides a systematic analysis of more than 300 recent papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology to examine how reasoning capabilities emerge in LLMs and where they fail. We make three main contributions. First, we introduce a structured taxonomy of LLM reasoning research, covering Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Second, we analyze methodological trends across these paradigms, including prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. Third, we synthesize recurring limitations and failure modes, such as reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. By organizing a rapidly expanding literature, this survey offers a unified view of the current capabilities and limitations of reasoning in LLMs. We also identify emerging research directions, including meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning. Overall, this work aims to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems in future language models.

2606.11466 2026-06-11 cs.CV 新提交

PT-WNO: Point Transformer with Wavelet Neural Operator for 3D Point Cloud Semantic Segmentation

PT-WNO: 结合小波神经算子的点Transformer用于3D点云语义分割

Nhut Le, Maryam Rahnemoonfar

发表机构 * Lehigh University(里海大学)

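AI summary: To address insufficient global context in point cloud semantic segmentation, proposes PT-WNO, which integrates a learnable Wavelet Neural Operator branch alongside the skip connections to capture multi-scale global spectral context; evaluated on four benchmarks, with reported gains on S3DIS and DALES.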

详情
AI中文摘要

点云语义分割需要同时捕捉细粒度局部几何和广阔全局场景结构的架构。基于Transformer的网络通过聚焦于详细的局部特征聚合表现出强大性能;然而,全局上下文主要通过编码器-解码器阶段之间的跳跃连接传递,我们认为这对于完整的场景理解是不够的。我们假设,用可学习的全局特征提取模块增强跳跃连接,使网络在深入局部细节之前获取场景级知识,从而产生更丰富且更具上下文基础的表示。为此,我们提出了点Transformer与小波神经算子(PT-WNO),它在点云Transformer骨干的跳跃连接旁集成了一个共享的小波神经算子(WNO)分支。在每个编码器-解码器过渡处,点特征被投影到密集的3D体素网格上,WNO通过可学习的小波分解和重建捕获多尺度全局频谱上下文。这些全局特征通过轻量级适配器融合回网络,补充而非替代现有的跳跃连接。在四个大规模3D点云基准上的实验证明了PT-WNO的有效性。在S3DIS(Area 5)上,PT-WNO达到71.59% mIoU,比Point Transformer v3(PTv3)基线高出+1.03个百分点。在DALES上达到81.05% mIoU(比基线高+1.47)。在ScanNet v2上,PT-WNO获得76.19% mIoU,与基线(76.36%)保持竞争力。

英文摘要

Point cloud semantic segmentation requires architectures that capture both fine-grained local geometry and broad global scene structure. Transformer-based networks have demonstrated strong performance by focusing on detailed local feature aggregation; however, global context is conveyed primarily through skip connections across encoder-decoder stages, which we argue is insufficient for full scene understanding. We hypothesize that augmenting skip connections with a learnable global feature extraction module allows the network to acquire scene-level knowledge before descending into local detail, leading to richer and more contextually grounded representations. To this end, we propose Point Transformer with Wavelet Neural Operator (PT-WNO), which integrates a shared Wavelet Neural Operator (WNO) branch alongside the skip connections of a point cloud transformer backbone. At each encoder-decoder transition, point features are projected onto a dense 3D volumetric grid where the WNO captures multi-scale global spectral context through learnable wavelet decomposition and reconstruction. These global features are fused back into the network via lightweight adapters, complementing rather than replacing the existing skip connections. Experiments on four large-scale 3D point cloud benchmarks demonstrate the effectiveness of PT-WNO. On S3DIS (Area 5), PT-WNO achieves 71.59% mIoU, outperforming the Point Transformer v3 (PTv3) baseline by +1.03 points. On DALES it achieves 81.05% mIoU (+1.47 over the baseline). On ScanNet v2, PT-WNO obtains 76.19% mIoU, remaining competitive with the baseline (76.36%).

2606.11464 2026-06-11 cs.RO 新提交

Bridging the sim2real gap in the table tennis robot with a transformer-based ball states predictor

基于Transformer的乒乓球状态预测器弥合仿真到现实的差距

Yin Bi, Christian Conti, Bilan Yang, Alexander Sigrist, Peter Dürr, Naoya Takahashi

发表机构 * Sony AI, Zürich, Switzerland(索尼AI,苏黎世,瑞士) Sony AI, Tokyo, Japan(索尼AI,东京,日本)

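AI summary: Proposes a transformer-based framework for table tennis ball state prediction that uses attention to model long-range temporal dependencies, trained on a large-scale real-world dataset, and introduces the SPAD strategy, which swaps the simulator for the learned predictor at deployment to narrow the sim2real gap without retraining.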

详情
AI中文摘要

机器人乒乓球是动态环境中高速闭环机器人控制的代表性基准,其中准确快速地预测球状态对于可靠规划和控制至关重要。基于物理的方法严重依赖准确的参数识别和精确的初始状态,而基于学习的方法通常难以捕捉长程时间依赖,并且通常在有限或模拟数据上训练。我们提出了一种基于Transformer的乒乓球状态预测框架,利用注意力机制直接从历史观测中建模长程时间相关性,无需依赖显式的飞行或弹跳模型。为了支持鲁棒学习和泛化,我们从不同技能水平的球员和多种球炮配置中收集了大规模真实世界数据集。高容量Transformer架构与广泛真实世界数据的结合实现了准确的长期预测。基于此能力,我们引入了一种即插即用的仿真到现实迁移策略,即部署时交换预测器(SPAD),该策略在部署时用训练好的真实世界预测器替换训练中使用的基于物理的仿真器,从而在不需重新训练的情况下提高策略的仿真到现实迁移能力。我们证明,这种简单的替换有效地缩小了仿真到现实的差距,同时保留了基于仿真训练的效率和可扩展性。

英文摘要

Robotic table tennis is a representative benchmark for high-speed, closed-loop robotic control in dynamic environments, where accurate and fast prediction of ball states is critical for reliable planning and control. Physics-based approaches rely heavily on accurate parameter identification and precise initial state, while learning-based methods often struggle to capture long-range temporal dependencies and are typically trained on limited or simulated data. We propose a transformer-based framework for table tennis ball state prediction that leverages attention mechanisms to model long-range temporal correlations directly from historical observations, without relying on explicit flight or bounce models. To support robust learning and generalization, we collected a large-scale real-world dataset from players of varying skill levels and diverse ball cannon configurations. The combination of a high-capacity transformer architecture and extensive real-world data enables accurate long-horizon forecasting. Building on this capability, we introduce a plug-and-play sim-to-real transfer strategy, Swap Predictor at Deployment (SPAD), which replaces the physics-based simulator used during training with the proposed real-world-trained predictor at deployment, improving the sim-to-real transferability of the policy without requiring retraining. We demonstrate that this simple substitution effectively narrows the sim-to-real gap while preserving the efficiency and scalability of simulation-based training.

2606.11463 2026-06-11 cs.LG cs.AI 新提交

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

基于LSTM的财产保险损失准备金结构性断点检测:气候信息方法

Thomas Mbrice, Shashwat Panigrahi

发表机构 * Stony Brook University(石溪大学)

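AI summary: To address the breakdown of traditional actuarial methods under climate change, proposes using LSTM neural networks to detect structural breaks; on Florida and Louisiana data, hypothesizes a 15-20% improvement in reserve accuracy for catastrophe-exposed years, with theoretical guarantees.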

详情
Comments
15 pages, 0 figures, whitepaper YC
AI中文摘要

准确的损失准备金是保险公司偿付能力的基础,然而加速的气候驱动灾难系统地违反了传统精算方法所依赖的稳定性假设。本文提出一个研究计划,测试长短期记忆(LSTM)神经网络是否能够比链梯法、Bornhuetter-Ferguson法和Cape Cod法更快、更准确地检测和适应这些结构性断点。使用来自佛罗里达州和路易斯安那州超过15年的监管发展三角形数据,并辅以NOAA飓风强度指数和海面温度,我们假设在巨灾暴露年份准备金精度有15-20%的针对性提升,这一阈值基于先前的神经网络准备金文献以及本文发展的形式化收敛结果。除了实证验证,我们还发展了一个理论框架,以概率术语为基础进行LSTM结构性断点检测,并提供形式化的性能保证,以弥补测试期间巨灾事件数量有限的不足。我们记录了研究设计、方法论、预期贡献以及对局限性的坦诚评估。

英文摘要

Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short-Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than Chain Ladder, Bornhuetter-Ferguson, and Cape Cod methods. Using more than 15 years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15-20% in reserve accuracy for catastrophe-exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.

2606.11459 2026-06-11 cs.CL cs.AI cs.LG 新提交

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

APEX: 具有动态数据选择的自动提示工程专家

Fei Wang, Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon

发表机构 * Google(谷歌) UCLA(加州大学洛杉矶分校)

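AI summary: Proposes the APEX framework, which dynamically stratifies data into Easy, Hard, and Mixed tiers and prioritizes high-leverage subsets to make prompt optimization more efficient under a fixed budget, with average gains over the initial prompt of 11.2% (Gemini 2.5 Flash) and 6.8% (Gemma 3 27B) across three benchmarks.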

详情
AI中文摘要

大型语言模型对提示表述高度敏感,需要自动提示优化以释放其全部潜力。尽管进化算法已成为主导范式,但它们面临一个关键瓶颈:数据效率。当前方法将开发数据集视为静态基准,在无信息数据上浪费大量计算预算。在这项工作中,我们引入了APEX(自动提示工程专家),这是一个新颖的框架,它在提示搜索的同时优化数据使用。APEX根据优化谱系将数据集动态分层为易、难和混合三个层级。通过优先考虑混合层级(即识别出LLM性能混合的数据),我们确定了两个高杠杆子集:用于生成信息性变异的可寻址前沿和用于区分候选质量的排名敏感前沿。我们在三个不同的基准上评估APEX:IFBench、SimpleQA Verified和FACTS Grounding。在固定5000次评估调用的预算下,由于其数据效率,APEX在Gemini 2.5 Flash上平均比初始提示高出11.2%,在Gemma 3 27B上高出6.8%,这表明以数据为中心的方法是高效且有效的提示优化的关键。

英文摘要

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.
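The tiering idea can be sketched as follows. The abstract does not give APEX's actual stratification rule, so the rule here is an assumption: an example is Easy if every prompt in the lineage solved it, Hard if none did, and Mixed otherwise. The function name `stratify` and the binary outcome matrix are likewise hypothetical.

```python
def stratify(outcomes):
    """Split example indices into easy / hard / mixed tiers.

    outcomes[p][j] is True if prompt p in the optimization lineage solved example j.
    Assumed rule: easy = solved by all prompts, hard = solved by none, mixed = the rest.
    """
    tiers = {"easy": [], "hard": [], "mixed": []}
    n_prompts = len(outcomes)
    n_examples = len(outcomes[0]) if outcomes else 0
    for j in range(n_examples):
        passes = sum(1 for row in outcomes if row[j])
        if passes == n_prompts:
            tiers["easy"].append(j)
        elif passes == 0:
            tiers["hard"].append(j)
        else:
            tiers["mixed"].append(j)
    return tiers
```

Under this rule only the Mixed tier separates candidate prompts from one another, which is one way to read why the abstract prioritizes it when the evaluation budget is fixed.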

2606.11456 2026-06-11 cs.CL cs.AI cs.CY 新提交

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

社会科学中的AI编码智能体:方法多样,经验一致,解释脆弱

Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci

发表机构 * University of Oxford(牛津大学) University of Zurich(苏黎世大学) Technical University of Munich(慕尼黑工业大学)

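AI summary: Studies the methodological diversity and interpretive vulnerability of LLM agents in scientific analysis; across 20 independent runs, agents match or exceed human diversity at the design layer but are susceptible to prompting at the verdict layer, with bias arising from interpretation rather than estimation.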

详情
AI中文摘要

基于LLM的智能体在科学分析中的部署引发了相互矛盾的担忧:智能体可能减少方法多样性,或者可能放大分析灵活性,使研究者得出动机性结论。我们认为这些担忧针对两个经验上可分离的层面:方法选择的设计层,以及决策规则将估计映射到实质性主张的裁决层。我们通过在著名的移民与社会政策问题上运行20次Claude Code和Codex的独立执行,并以多位分析师的人类基线为基准,对两者进行了测试。在设计层,Codex匹配了人类的方法多样性,而Claude Code产生了近三倍的规格;两个智能体的效应估计与人类共识大致一致,且没有智能体模型与任何人类模型完全匹配。提示诱导的反移民研究者先验重组了每个智能体的方法决策,但与同一数据中有偏见的人类分析师不同,它并未改变总体估计或最终裁决;智能体也没有沿着人类用来偏倚其估计的方法轴重新路由。在裁决层,一个明确的确认性提示将Claude Code的裁决从10%的支持率翻转为90%,同时其系数分布基本保持不变,这是通过规则省略而非规则软化实现的。AI智能体在设计层可以媲美或超越人类的方法多样性,但在裁决层仍然脆弱。在我们的设置中,AI偏差的所在不是估计而是解释。

英文摘要

The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy question against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.

2606.11450 2026-06-11 cs.CV 新提交

Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition

探索自适应掩码重建用于自监督基于骨架的动作识别

Shengkai Sun, Zhiyong Cheng, Zefan Zhang, Jianfeng Dong, Zhihui Li, Meng Wang

发表机构 * Hefei University of Technology(合肥工业大学) Jilin University(吉林大学) Zhejiang Gongshang University(浙江工商大学) University of Science and Technology of China(中国科学技术大学)

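AI summary: Proposes the Adaptive Masked Reconstruction (AMR) framework, which decouples the decoder from the encoder and adds an adaptive guidance module, accelerating pre-training and improving downstream action recognition accuracy, surpassing existing methods on several datasets.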

详情
Comments
Accepted by CVPR 2026. The code is available at this https URL
AI中文摘要

最近,掩码骨架重建模型已成为强大的动作表示学习器,推动了自监督基于骨架的动作识别的重大进展。然而,现有的最先进方法必须预测极其大量的时空块,显著延长了训练时间。此外,通过在重建过程中平等对待所有时空区域,这些模型被分散了注意力,无法学习动作语义背后的关键运动模式。为了解决这些挑战,我们提出了自适应掩码重建(AMR),一个更快更强的预训练框架。我们首先将解码器与编码器解耦,使得能够灵活预测更大的时空块,并大幅降低重建复杂度。鉴于更大的块包含更复杂的信息,这难以预测并因此降低性能,我们相应地引入了一个自适应引导模块。该模块识别高运动信息量的区域,引导模型关注每个块中最具判别力的部分,并减轻重建难度。在NTU RGB+D 60、NTU RGB+D 120和PKU-MMD数据集上的实验表明,AMR不仅显著加速了预训练,还提高了下游识别精度,超越了当前最先进的方法。

英文摘要

Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.

2606.11447 2026-06-11 cs.CL 新提交

AI Coding Agents Can Reproduce Social Science Findings

AI编码智能体能够复现社会科学研究结果

Meysam Alizadeh, Mohsen Mosleh, Fabrizio Gilardi, Atoosa Kasirzadeh, Joshua Tucker

发表机构 * University of Oxford(牛津大学) University of Zurich(苏黎世大学) Carnegie Mellon University(卡内基梅隆大学) New York University(纽约大学)

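AI summary: Builds the SocSci-Repro-Bench benchmark to evaluate how well two frontier coding agents, Claude Code and Codex, reproduce findings across 221 social science tasks; both reproduce a large share of results, Claude Code performs better, and prompt framing can nudge agents toward confirmatory specification search.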

详情
AI中文摘要

近期轶事证据表明,当提供原始数据和代码时,AI编码智能体能够复现已发表的研究结果;然而,在社会科学领域的系统评估仍然有限。现有的评估基准不足,要么规模较小,要么将智能体性能与复现材料本身的问题(如代码无法正确执行)混为一谈。本文介绍了SocSci-Repro-Bench,这是一个包含221项任务的基准测试,涵盖四个学科和13个实质性领域,这些任务基于那些结果要么完全可通过现有材料复现,要么因数据缺失而明显不可复现的研究构建,从而使我们能够隔离智能体的复现能力。评估两个前沿编码智能体Claude Code和Codex,我们发现两者都能复现大部分社会科学研究结果,其中Claude Code显著优于Codex。这些复现率远高于先前报道的通用基于LLM的智能体在类似可复现性基准上的表现。两个智能体在需要识别潜在研究问题的推理任务上也表现强劲,附加分析表明结果并非主要由记忆驱动。提供原始论文PDF与复现材料一起可适度提升性能,但在无法复现的任务上引入了偏差。我们还表明,通过微妙的提示框架,智能体可以被引导向确认性规范搜索。这些发现共同表明,至少某些前沿编码智能体可以作为计算工作流的可靠执行者,同时强调了在AI系统在科学生产中扮演更大角色时,需要仔细的基准测试和提示设计。

英文摘要

Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across the social sciences remains limited. Existing evaluation benchmarks are insufficient: they are either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.

2606.11446 2026-06-11 cs.CV cs.GR 新提交

3D-CBM: A Framework for Concept-Based Interpretability in Generative 3D Modeling

3D-CBM:生成式3D建模中基于概念可解释性的框架

Ahmad Al-Kabbany

发表机构 * Yubree Labs Multimedia Interaction and Communication Lab, Arab Academy for Science and Technology(阿拉伯科学技术学院多媒体交互与通信实验室)

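AI summary: Proposes incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures, mapping inputs into a multi-tiered taxonomy of interpretable primitives and functional attributes to enable semantically steerable 3D generation; experiments report 88.8% concept prediction accuracy and support interactive error correction.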

详情
AI中文摘要

本研究引入了一个将概念瓶颈模型(CBM)融入3D生成架构的框架,以解决深度几何学习中固有的“语义鸿沟”。随着深度模型成为3D内容创建的核心,可解释性从边缘特性转变为医疗和制造等安全关键领域中信任和问责的基本要求。CBM通过约束潜在表示与人类定义的概念对齐,提供了一种内在的可解释性解决方案,但其在非结构化3D数据上的应用仍基本未被探索。我们设计、实现并验证了一个正式的3D-CBM架构,将原始几何输入(包括点云和网格)映射到可解释基元和功能属性的多层级分类中。该框架进一步确定了专门用于基于概念监督的战略性数据集,如PartNet和ShapeNet。来自3D部件操作概念验证实验的结果证明了该框架的有效性,实现了88.8%的概念预测准确率和0.0115的Chamfer距离。关键的是,该模型支持精确的测试时干预,允许交互式纠正结构错误。这项工作为语义可操控的3D生成奠定了基础,并邀请进一步探索协作式人在回路设计系统。

英文摘要

This research introduces a framework for incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures to address the inherent 'semantic gap' in deep geometric learning. As deep models become central to 3D content creation, explainability shifts from a peripheral feature to a fundamental requirement for trust and accountability in safety-critical domains such as healthcare and manufacturing. CBMs provide an intrinsic interpretability solution by constraining latent representations to align with human-defined concepts, yet their application to unstructured 3D data remains largely unexplored. We design, implement, and validate a formal 3D-CBM architecture that maps raw geometric inputs, including point clouds and meshes, into a multi-tiered taxonomy of interpretable primitives and functional attributes. The framework further identifies strategic datasets, such as PartNet and ShapeNet, specialized for concept-based supervision. Experimental results from a 3D part-manipulation proof-of-concept experiment demonstrate the framework's efficacy, achieving a concept prediction accuracy of 88.8% and a Chamfer Distance of 0.0115. Critically, the model enables precise test-time intervention, allowing for the interactive correction of structural errors. This work establishes a foundation for semantically steerable 3D generation and invites further exploration into collaborative human-in-the-loop design systems.
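For readers unfamiliar with the geometric metric quoted above, here is a minimal sketch of one common form of the Chamfer Distance between two point clouds. Several variants exist (squared or unsquared distances, sums or means), and the abstract does not say which one the 0.0115 figure uses, so the choice below (mean of squared nearest-neighbor distances, summed over both directions) is an assumption.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed variant: mean squared nearest-neighbor distance, summed over both directions.
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Nearest neighbor in b for each point of a, and in a for each point of b.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The dense pairwise matrix keeps the sketch short but costs O(NM) memory; larger clouds would typically use a spatial index for the nearest-neighbor queries.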

2606.11445 2026-06-11 cs.AI 新提交

Forecasting Future Behavior as a Learning Task

将未来行为预测作为学习任务

Mosh Levy, Yoav Goldberg, Asa Cooper Stickland

发表机构 * Bar-Ilan University(巴伊兰大学) Allen Institute for AI(艾伦人工智能研究所) UK AI Security Institute(英国人工智能安全研究所)

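AI summary: Treats forecasting AI behavior as a learnable task, training Behavior Forecasters to predict future behavior from a reasoning trajectory without an explanation step; on two tasks they outperform GPT-5.4 and Claude Opus-4.6 used as naive readers.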

详情
AI中文摘要

对AI系统的信任通常基于对其工作原理的解释,人们利用这些解释来预测系统在新输入上的行为。对于大型推理模型(LRM),这条常规路径尤其难以遵循:针对单个token生成的解释方法无法自然推广到长轨迹,而轨迹本身在作为自然语言阅读时往往不忠实。我们提出一种绕过解释步骤的替代方案:将行为预测视为可学习任务,训练行为预测器(Behavior Forecasters)在单个推理轨迹上运行,以做出通常从解释中寻求的相同预测。预测器的训练数据通过查询LRM获得,无需人工标注,其推理在单次前向传播中完成。我们在两个任务上实例化该方法:LRM在重新运行时重复其答案的可能性,以及移除输入部分如何改变其答案。我们在三个不同的推理数据集上对这两个任务进行了评估,发现训练后的行为预测器比作为朴素读者阅读相同轨迹的GPT-5.4和Claude Opus-4.6更准确,而推理成本仅为其一小部分。我们发现,端到端微调骨干网络并从目标LRM初始化对于强性能都是必要的。这些结果表明,推理轨迹携带了关于LRM未来行为的信息,超出了朴素阅读所能传达的范围。

英文摘要

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single-token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.

2606.11440 2026-06-11 cs.AI 新提交

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

INFRAMIND: 基础设施感知的多智能体编排

Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou

发表机构 * University of Central Florida(中佛罗里达大学)

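AI summary: Proposes INFRAMIND, a framework that uses reinforcement learning to feed infrastructure state (queue depth, KV-cache pressure, and so on) into the planning, routing, and scheduling decisions of multi-agent LLM orchestration, balancing quality against latency on shared GPU clusters, with up to 7.6 percentage points higher accuracy and up to 7x lower latency than the prior baseline.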

详情
Comments
Preprint
AI中文摘要

现有的多智能体LLM编排方法,从暴力集成到学习型路由器,基于任务和模型特征选择模型和拓扑。然而,这些方法不考虑服务基础设施的运行时状态。在共享GPU集群上并发负载下,这种基础设施盲区导致系统性的资源利用不足:首选模型积累深度请求队列,而同等能力的替代模型闲置。在多智能体流水线中,每个查询触发多个顺序模型调用,这些延迟会进一步累积到每个下游步骤。弥补这一差距具有挑战性,因为相关基础设施信号(队列深度、KV缓存压力、延迟)是动态且嘈杂的,并且它们必须驱动三个不同的决策:规划、逐步骤路由和调度。我们引入INFRAMIND,一个使整个多智能体堆栈具备基础设施感知的框架。一个基础设施感知的规划器根据实时系统负载和剩余预算调节拓扑和角色选择,在拥塞时偏向简单图,在低负载时偏向丰富图。然后,一个基础设施感知的执行器在每个智能体步骤观察每个模型的队列深度、缓存利用率和响应延迟,以决定调用哪个模型以及推理深度;一个预算感知的调度器进一步重新排序每个模型的队列,使紧急请求优先得到服务。将其建模为分层约束MDP并通过强化学习端到端求解,系统自动学习平衡质量与延迟。在五个基准测试中,INFRAMIND在低负载下相比先前基线准确率提升高达7.6个百分点,延迟降低7倍,在高负载下维持高达99.9%的SLO合规性,而所有基线均降至50%以下。

英文摘要

Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.

2606.11435 2026-06-11 cs.CL 新提交

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

智能体技能评估与进化:框架与基准

Kexin Ding, Yang Zhou, Can Jin, Feng Tong, Mu Zhou, Dimitris N. Metaxas

发表机构 * Rutgers University(罗格斯大学) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

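AI summary: Systematically surveys the paradigm shift in agent skills from isolated creation to automated, evaluation-driven evolution, categorizing four evolution paradigms and analyzing six skill benchmark categories, and identifies coverage gaps and open directions.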

详情
English abstract

The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is this https URL

2606.11431 2026-06-11 cs.LG New submission

Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity

Shira Vansover-Hager, Matan Schliserman, Ofir Schlisselberg, Tomer Koren

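Affiliations: Blavatnik School of Computer Science and AI, Tel Aviv University; Google Research

AI summary: This paper proves that mirror descent (MD) with a non-quadratic regularizer can be exponentially sensitive to initialization on convex smooth objectives, in sharp contrast to gradient descent (GD), and proposes anchor-based Bregman regularization to mitigate the instability.

Details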
English abstract

Mirror Descent (MD) extends Gradient Descent (GD) beyond Euclidean geometry and has recently reappeared as a lens for KL-regularized policy optimization in reinforcement learning and LLM post-training. This raises a basic robustness question, crucial to reproducibility and reliability: how sensitive are MD dynamics to their inputs? We focus on initialization, often itself a pretrained or previously aligned model. Quadratic-regularized MD, including GD and Mahalanobis geometries, is well-known to be stable for convex smooth objectives. We show a sharp contrast: once the regularizer is non-quadratic, MD can be exponentially more sensitive to initialization than GD, even with a well-conditioned regularizer in Euclidean norm. We give a three-dimensional construction with a convex, smooth objective and a strongly convex, smooth, well-conditioned regularizer where an initial $\varepsilon$ perturbation is quickly amplified to $\min\{\text{polylog}^{-1}(1/\varepsilon), \varepsilon e^{\Omega(\eta T)}\}$ after $T$ iterations of MD with step size $\eta$. For canonical KL-regularized MD on the simplex, we show that even linear objectives can amplify an initial $\varepsilon$ perturbation exponentially fast in high-dimensional or near-boundary regimes. Finally, we show that adding a Bregman regularization term toward an anchor point can stabilize the dynamics while largely preserving the optimization guarantees, and that the choice of anchor is crucial: anchoring at the initialization only partially mitigates the instability, whereas anchoring at a fixed point yields a more stable mechanism.
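
The near-boundary claim for KL-regularized MD on the simplex can be checked numerically. The sketch below is an illustration under assumptions, not the paper's construction: it uses a two-coordinate simplex, a linear objective with gradient `(0, 1)`, and the standard exponentiated-gradient form of the KL mirror step. The function names and the values of `delta`, `eps`, `eta`, and `steps` are chosen here for illustration. Unconstrained GD is included only as a reference for how the same perturbation behaves under a quadratic regularizer.

```python
import numpy as np

def kl_mirror_descent(x0, grad, eta, steps):
    """KL-regularized MD on the simplex, written as exponentiated gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x * np.exp(-eta * grad)  # multiplicative mirror step
        x = x / x.sum()              # renormalize onto the simplex
    return x

def gradient_descent(x0, grad, eta, steps):
    """Unconstrained GD on the same linear objective (reference only)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad
    return x

delta, eps = 1e-6, 1e-8   # distance from the boundary, size of the perturbation
eta, steps = 0.1, 100     # eta * steps = 10
grad = np.array([0.0, 1.0])  # linear objective <grad, x>; coordinate 0 is favored

x0 = np.array([delta, 1.0 - delta])
x0_pert = np.array([delta + eps, 1.0 - delta - eps])
init_gap = np.linalg.norm(x0_pert - x0)

md_gap = np.linalg.norm(
    kl_mirror_descent(x0_pert, grad, eta, steps) - kl_mirror_descent(x0, grad, eta, steps)
)
gd_gap = np.linalg.norm(
    gradient_descent(x0_pert, grad, eta, steps) - gradient_descent(x0, grad, eta, steps)
)

md_ratio = md_gap / init_gap  # amplification under KL mirror descent
gd_ratio = gd_gap / init_gap  # amplification under gradient descent
print(f"MD amplification: {md_ratio:.3g}, GD amplification: {gd_ratio:.3g}")
```

With these values the MD gap grows by roughly four orders of magnitude, on the order of $e^{\eta T} = e^{10}$, while the GD gap stays at its initial size because a linear objective shifts both trajectories by the same vector.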

2606.11424 2026-06-11 cs.CL New submission

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

Sai Ashish Somayajula, Marianne Menglin Liu, Chuan Lei, Fjona Parllaku, Daniel Garcia, Rongguang Wang, Syed Fahad Allam Shah, Ankan Bansal, Sujeeth Bharadwaj, Tao Sheng, Sujith Ravi, Dan Roth

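Affiliations: Oracle AI

AI summary: Proposes SOMA-SQL, a framework that automatically resolves multi-source ambiguity in natural-language-to-SQL via synthetic query logs and ambiguity-driven probing, improving average execution accuracy by 13.0% across six benchmarks.

Details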
Comments
34 pages, 1 figure, 7 tables. Preprint
English abstract

Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguity across user questions, database schemas, and model interpretations is a central failure mode in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these neither scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via a targeted synthetic query log and ambiguity-driven probing. SOMA-SQL constructs a synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without a human in the loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.
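
The idea of executing a probing query to gather disambiguation evidence can be sketched with Python's built-in `sqlite3` module. This is a toy illustration under assumptions, not SOMA-SQL's pipeline: the abstract does not specify the ambiguity taxonomy or the probe design, so the `orders` table, its columns, the helper `probe_value_columns`, and the two candidate queries are all hypothetical.

```python
import sqlite3

def probe_value_columns(conn, table, columns, value):
    """Return the columns of `table` in which `value` actually occurs."""
    hits = []
    for col in columns:
        # Identifiers come from a trusted schema list here; the value is bound.
        sql = f'SELECT 1 FROM "{table}" WHERE "{col}" = ? LIMIT 1'
        if conn.execute(sql, (value,)).fetchone() is not None:
            hits.append(col)
    return hits

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, ship_city TEXT, bill_city TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "Paris", "Lyon", 10.0), (2, "Paris", "Nice", 20.0), (3, "Rome", "Lyon", 5.0)],
)

# "Total for orders in Paris" is ambiguous between two city columns.
candidates = {
    "ship_city": "SELECT SUM(total) FROM orders WHERE ship_city = 'Paris'",
    "bill_city": "SELECT SUM(total) FROM orders WHERE bill_city = 'Paris'",
}

# Probe the data: only columns that contain the value keep their candidate.
evidence = probe_value_columns(conn, "orders", list(candidates), "Paris")
surviving = [candidates[col] for col in evidence]
print(evidence, [conn.execute(q).fetchone()[0] for q in surviving])
```

Here the probe finds "Paris" only in `ship_city`, so the `bill_city` candidate is discarded before final selection. A real system would also need to handle the case where several columns match and the probe alone is inconclusive.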