arXivDaily: arXiv daily academic digest, updated Monday through Friday
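2606.16684 2026-06-16 cs.CL New submission

Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis

Jinghan Wang, Gaoliang Peng, Yanjun Chen, Wei Zhang, Wentao Wu, Tianchen Liu

AI summary: Proposes a progressive physics-guided multi-scale vibration signal processing framework that combines an 81-dimensional measurement descriptor, fault-adaptive segmentation, and implicit knowledge encoding, achieving 98.49% diagnostic accuracy on four datasets with a 12.6-fold reduction in computational cost.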

Abstract

Vibration-based bearing fault diagnosis requires resolving three interrelated measurement challenges, including the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to underlying fault physics, and ineffective multi-source measurement information fusion across diagnostic scales. This paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three challenges within a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space enabling real-time fault screening at approximately 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs analytical attention toward fault-relevant waveform regions guided by physics-based priors, without manual feature engineering. Structured fault mechanism knowledge is further encoded implicitly in model parameters during training, enabling autonomous multi-scale measurement fusion without external knowledge dependencies at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework achieves 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.
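The characteristic defect frequencies mentioned above follow from standard bearing kinematics. A minimal sketch, assuming a stationary outer race and rolling elements that do not slip; the geometry values and the names `BearingGeometry` and `defect_frequencies` are illustrative and are not taken from the paper:

```python
import math
from dataclasses import dataclass


@dataclass
class BearingGeometry:
    """Illustrative bearing geometry (hypothetical values, any consistent length unit)."""
    n_balls: int
    ball_diameter: float
    pitch_diameter: float
    contact_angle_rad: float = 0.0


def defect_frequencies(geom, shaft_hz):
    """Textbook defect frequencies in Hz for a stationary outer race."""
    ratio = geom.ball_diameter / geom.pitch_diameter * math.cos(geom.contact_angle_rad)
    ftf = 0.5 * shaft_hz * (1.0 - ratio)                 # cage (fundamental train) frequency
    bpfo = geom.n_balls * ftf                            # ball pass frequency, outer race
    bpfi = 0.5 * geom.n_balls * shaft_hz * (1.0 + ratio) # ball pass frequency, inner race
    bsf = (0.5 * shaft_hz * geom.pitch_diameter / geom.ball_diameter
           * (1.0 - ratio ** 2))                         # ball spin frequency
    return {"FTF": ftf, "BPFO": bpfo, "BPFI": bpfi, "BSF": bsf}


geom = BearingGeometry(n_balls=9, ball_diameter=7.94, pitch_diameter=39.04)
freqs = defect_frequencies(geom, shaft_hz=30.0)
for name, value in freqs.items():
    print(f"{name}: {value:.2f} Hz")
```

A useful sanity check on any implementation is that BPFO and BPFI always sum to the number of rolling elements times the shaft frequency. How the paper builds its 81 dimensions from such frequencies is not stated in the abstract.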

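2606.16663 2026-06-16 cs.LG New submission

Beyond Defensive Reporting: Machine Learning for Active Anti-Money Laundering Control in Insurance

Dara Goldar, Geir Kjetil Ferkingstad Sandve, Martin Jullum

AI summary: Using production data from a Norwegian insurer, the paper trains gradient-boosted decision tree models to detect money-laundering claims, with fraud labels as an auxiliary training signal; under the Budget-Weighted Capture Rate metric, the best model captures nearly two-thirds of laundering cases within the 2-6% of claims reviewed.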

Abstract

Money laundering through insurance claims poses a threat to insurers both through fraudulent payouts and reputational and regulatory risk. Despite this, little research has examined how such laundering can be prevented. This paper examines whether machine learning can help insurers flag suspicious claims before payout, shifting the focus from passive reporting to active prevention. Using production data from a major Norwegian insurer, we train gradient-boosted decision tree models to detect claims later reported to authorities for suspected money laundering. Because fraud and laundering may share behavioural patterns, we also examine whether insurance fraud labels can serve as an auxiliary training signal. We compare different learning setups using the Budget-Weighted Capture Rate, a metric introduced in this paper to measure how many laundering cases are captured when only a small share of claims can be manually reviewed. The results show that incorporating fraud-related investigation labels substantially improves laundering detection. The best-performing model captures nearly two-thirds of laundering cases within the top-ranked 2 to 6 percent of claims selected for investigation. To our knowledge, this is the first empirical study of machine learning for money laundering detection in insurance claims.
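The idea of capturing cases under a limited review budget can be illustrated with a plain capture rate at a fixed budget. This is only a sketch: the abstract does not give the weighting used by the Budget-Weighted Capture Rate, so the function below is not that metric, and the name `capture_rate_at_budget` and the toy data are hypothetical.

```python
def capture_rate_at_budget(scores, labels, budget):
    """Share of all positive cases found in the top `budget` fraction of claims by score.

    Plain capture rate at one budget; NOT the paper's Budget-Weighted Capture Rate.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    total_positive = sum(labels)
    if total_positive == 0:
        return 0.0
    # Number of claims a reviewer can inspect, at least one.
    k = max(1, round(budget * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    captured = sum(label for _, label in ranked[:k])
    return captured / total_positive


# Toy data: 10 claims, 3 of them later reported (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
print(capture_rate_at_budget(scores, labels, 0.2))
print(capture_rate_at_budget(scores, labels, 0.5))
```

Evaluating such a function over a range of small budgets, such as the 2 to 6 percent range quoted above, shows how quickly a ranking model surfaces the reported cases.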

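2606.16656 2026-06-16 cs.LG New submission

Near-Optimal Stochastic Linear Bandits with Delay

Ofir Schlisselberg, Mengxiao Zhang, Yishay Mansour

AI summary: Studies stochastic linear bandits under several delay models, gives near-optimal regret bounds, and shows how the dimension enters the interaction between delay and linear structure.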

Abstract

We study stochastic linear bandits with delayed feedback under several delay models and establish near-optimal regret guarantees. Our results identify when delayed linear bandits exhibit the same qualitative behavior as multi-armed bandits (MAB), and when the linear structure creates fundamentally new challenges. Specifically, (1) for loss-independent delays, where the delay does not depend on the realized loss (but potentially depends on the arm), we show that delays incur only an additive regret penalty. Under stochastic delays, this penalty scales with the expected delay, while under adversarial delays, it scales with the maximum number of outstanding observations. Notably, both delay penalties are dimension-free, improving upon the state-of-the-art results; (2) for loss-dependent delays, we show that linear bandits are substantially harder than MAB: unlike in MAB, we prove matching (up to log factors) upper and lower bounds in linear bandits, whose delay penalty depends on the square root of the dimension; and (3) for the delay-as-payoff model, a special case of loss-dependent delay, we show that the optimal MAB guarantee, which depends only on the delay of the optimal arm, is also unattainable in linear bandits. Together, these results provide a sharp characterization of how delayed feedback interacts with linear generalization.

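2606.16649 2026-06-16 cs.AI New submission

The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies

Christopner Koch, Joshua A. Wellbrock

AI summary: Argues that the near-term value of agentic AI for small and medium-sized companies lies in controlled partial autonomy rather than full autonomy or workforce reduction, and presents an integration framework for improving productivity.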

Comments 10 pages, 15 tables

Abstract

Agentic AI marks a new phase of enterprise automation. Unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi-step tasks, access tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy. For small and medium-sized companies, this creates potential to reduce administrative burden, accelerate routine processes, and improve the use of organizational knowledge. This paper argues that the near-term value of Agentic AI does not lie in full autonomy or workforce reduction, but in controlled partial autonomy for simple and medium-complexity business processes. It proposes an integration framework covering use-case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact. The paper concludes that Agentic AI can become a productivity lever when implemented as a human-centered capability with responsibility and accountability retained by people.

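2606.16624 2026-06-16 cs.AI New submission

MR-GVNO: A Geometry-Aware Variational Physics-Informed Neural Operator for Mindlin-Reissner Plates on Irregular Domains

Siqi Wang, Daobo Sun, Yizheng Wang, Yilong Zhang, Yabin Jin, Xiaoying Zhuang, Timon Rabczuk

AI summary: Proposes MR-GVNO, a geometry-aware variational neural operator that represents irregular geometries with boundary point clouds, fuses multi-physics inputs through cross-attention, and trains without labeled data using a variational physics-informed loss based on the discretized total potential energy, giving fast and accurate predictions for Mindlin-Reissner plate problems.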

Abstract

Plate and shell structures are widely used in engineering, making rapid response prediction under varying geometries, materials, and loads highly desirable. However, conventional finite element methods require repeated modeling and solution, resulting in high computational costs. This study proposes a geometry-aware variational neural operator for Mindlin-Reissner plate problems, termed MR-GVNO. The method uses boundary point clouds to represent irregular geometries and employs separate encoders for spatially varying material fields, pressure loads, and scalar physical parameters. A cross-attention mechanism integrates these inputs with query point information to predict transverse deflections and rotations at arbitrary locations. MR-GVNO is trained without labeled solution data using a variational physics-informed loss derived from the discretized total potential energy. It directly processes irregular point clouds and allows different physical fields to be discretized independently, avoiding interpolation onto a common grid. Numerical experiments on single-hole, double-hole, and L-shaped plates demonstrate accurate response prediction under homogeneous and heterogeneous materials and uniform and random loads. The model also achieves millisecond-level full-field inference and favorable cross-geometry generalization.
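For orientation, the textbook total potential energy of a Mindlin-Reissner plate under a transverse pressure load can be written as below. This is the standard form under one common sign convention, not necessarily the exact functional or discretization used in the paper; the quadrature weights $\omega_i$ in the last line are an assumption added for illustration.

```latex
% Transverse deflection w, rotations \theta_x, \theta_y, pressure load q,
% thickness h, Young's modulus E, Poisson's ratio \nu, shear correction k_s (often 5/6).
\Pi(w,\theta_x,\theta_y)
  = \frac{1}{2}\int_\Omega \boldsymbol{\kappa}^{\mathsf T}\mathbf{D}_b\,\boldsymbol{\kappa}\,\mathrm{d}\Omega
  + \frac{1}{2}\int_\Omega \boldsymbol{\gamma}^{\mathsf T}\mathbf{D}_s\,\boldsymbol{\gamma}\,\mathrm{d}\Omega
  - \int_\Omega q\,w\,\mathrm{d}\Omega,

\boldsymbol{\kappa} =
\begin{bmatrix}
  \partial_x \theta_x \\ \partial_y \theta_y \\ \partial_y \theta_x + \partial_x \theta_y
\end{bmatrix},
\qquad
\boldsymbol{\gamma} =
\begin{bmatrix}
  \partial_x w - \theta_x \\ \partial_y w - \theta_y
\end{bmatrix},

\mathbf{D}_b = \frac{E h^3}{12(1-\nu^2)}
\begin{bmatrix}
  1 & \nu & 0 \\ \nu & 1 & 0 \\ 0 & 0 & \tfrac{1-\nu}{2}
\end{bmatrix},
\qquad
\mathbf{D}_s = k_s\,\frac{E h}{2(1+\nu)}\,\mathbf{I}_2 .

% A discretized energy that could serve as a label-free loss (assumed quadrature):
\Pi \approx \sum_i \omega_i \left(
  \tfrac{1}{2}\boldsymbol{\kappa}_i^{\mathsf T}\mathbf{D}_{b,i}\,\boldsymbol{\kappa}_i
  + \tfrac{1}{2}\boldsymbol{\gamma}_i^{\mathsf T}\mathbf{D}_{s,i}\,\boldsymbol{\gamma}_i
  - q_i\,w_i \right).
```

Minimizing such an energy over the predicted fields drives them toward equilibrium without reference solutions, which is the sense in which a variational loss needs no labeled data.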

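2606.16617 2026-06-16 cs.CL cond-mat.mtrl-sci cs.AI New submission

Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges

Ferdinand M. Schessl

AI summary: Adopts a materials-science framing that treats LLM sycophancy as material failure under pushback loading; using 14 axis measurements across three loading cases (debate, false presuppositions, ethical setting) and 7,800 specimens in total, it shows that the failure mode depends on the loading type and finds differences in cross-judge reliability.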

Comments 12 pages, 3 figures. Code, data, and pre-registrations: https://github.com/FerdinandSchessl/sycophancy-note-companion

Abstract

Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC = 0.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: conversation as test specimen under load, LLM-model as material charge, pushback as progressive load, stance-flip as material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($\sigma = E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to $|r_{rb}| = 0.35$ on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows debate scoring is judge-robust (Cohen's $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$), a caveat single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.
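Cohen's kappa, the cross-judge statistic quoted above, corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch with made-up judge labels (the `flip`/`hold` labels and the function name are illustrative, not the paper's data or code):

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items with categorical labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical stance labels from two judges on eight specimens.
judge_a = ["flip", "hold", "flip", "hold", "hold", "flip", "hold", "hold"]
judge_b = ["flip", "hold", "hold", "hold", "hold", "flip", "hold", "hold"]
kappa = cohens_kappa(judge_a, judge_b)
print(round(kappa, 3))
```

Here the judges agree on 7 of 8 items (0.875), chance agreement is 0.5625, and kappa comes out at 5/7, which shows how a high raw agreement can correspond to a noticeably lower kappa.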

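2606.16612 2026-06-16 cs.SD cs.LG cs.MM New submission

Beyond Artifacts: Towards Generalizable Synthetic Song Detection via Music-Intrinsic Features

Yan Han, Zhibin Wen, Yuan Wang, Shuangrun Shao, Xiaobing Li, Yang Xu, Wei Li

AI summary: Proposes Sofia, a framework that detects synthetic songs from music-intrinsic features (vocal, audio-effect, global structure) via feature-specific experts and an adaptive Mixture-of-Experts module, improving F1 by 18.5 points on the MUSIC8K benchmark with strong robustness.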

Abstract

The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sofia (Synthetic-song detection framework via music features), a flexible framework that models music-intrinsic attributes via feature-specific experts and an adaptive Mixture-of-Experts (MoE) module. By configuring Sofia with representative Vocal, Audio-effect, Global structure features, and their combinations, we present their individual and complementary contributions. To comprehensively evaluate our framework, we further construct MUSIC8K, a challenging benchmark featuring the latest emerging generators and realistic audio perturbations. Experiments show that Sofia learns generator-agnostic representations from music-intrinsic features, improving the F1 score by 18.5 points over the strongest baseline on MUSIC8K-O while maintaining strong robustness.

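2606.16596 2026-06-16 cs.CL New submission

How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups

Wafaa Mohammed, Kata Naszadi, Vlad Niculae

AI summary: Studies extrinsic discourse evaluation of machine translation in static and interactive goal-oriented tasks, finding that high intrinsic translation quality does not guarantee downstream discourse success and that strong systems still produce referential inconsistencies.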

Abstract

Existing machine translation (MT) metrics and discourse-focused evaluations primarily assess translation quality intrinsically, without measuring the downstream consequences of translation errors. In this work, we focus on extrinsic discourse evaluation of machine translation under two distinct regimes: static and interactive. Under the static regime, we propose an entity counting task as a probe of referential consistency in discourse. We show that high intrinsic MT quality does not reliably predict downstream discourse success and strong MT systems still produce referential inconsistencies. For the interactive regime, we study the goal-oriented multi-agent Welfare Diplomacy game as a probe of long-horizon communication and coordination. We find that interaction-specific translation failures impact downstream coordination. Our results highlight goal-oriented environments as a viable framework for discourse-sensitive extrinsic MT evaluation.

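2606.16576 2026-06-16 cs.CL New submission

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

Reef Menaged, Gili Lior, Shauli Ravfogel, Roee Aharoni, Gabriel Stanovsky

AI summary: Proposes agentic automata learning, which uses membership and equivalence queries to evaluate how well LLM agents uncover a hidden deterministic finite automaton; performance drops sharply as DFA size grows, and reasoning models beat non-reasoning models but still show failures in query planning, evidence integration, and hypothesis construction.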

Abstract

We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
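The two query types can be sketched as a small oracle around a hidden DFA. This is an illustrative sketch only: the class names, the counterexample-returning equivalence query, and the query counters are assumptions, not the paper's tool interface. It also assumes both automata are total and share an alphabet.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: tuple     # symbols, e.g. ("a", "b")
    transitions: dict   # (state, symbol) -> state, defined for every pair
    start: int
    accepting: frozenset

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Answers membership and equivalence queries about a hidden target DFA."""

    def __init__(self, target):
        self._target = target
        self.membership_queries = 0
        self.equivalence_queries = 0

    def membership(self, word):
        self.membership_queries += 1
        return self._target.accepts(word)

    def equivalence(self, hypothesis):
        """Return None if the languages match, else a shortest counterexample word."""
        self.equivalence_queries += 1
        start = (self._target.start, hypothesis.start)
        queue, seen = deque([(start, "")]), {start}
        while queue:  # breadth-first search over the product automaton
            (p, q), word = queue.popleft()
            if (p in self._target.accepting) != (q in hypothesis.accepting):
                return word
            for symbol in self._target.alphabet:
                nxt = (self._target.transitions[(p, symbol)],
                       hypothesis.transitions[(q, symbol)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return None


# Hidden target: words over {a, b} with an even number of 'a'.
target = DFA(("a", "b"), {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
             0, frozenset({0}))
# A naive hypothesis that accepts every word.
hypothesis = DFA(("a", "b"), {(0, "a"): 0, (0, "b"): 0}, 0, frozenset({0}))
oracle = Oracle(target)
print(oracle.membership("aa"), oracle.equivalence(hypothesis))
```

Counting queries on the oracle gives one simple way to measure interaction efficiency, the quantity the abstract says the testbed makes measurable.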

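2606.16541 2026-06-16 cs.AI cs.LG New submission

The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

Noor Islam S. Mohammad, Tamim Sheikh

AI summary: Proposes Bidirectional Provability Fingerprinting, a framework that certifies the faithfulness of autoformalized statements by matching forward and backward consequence neighborhoods against natural-language probes, and adds four new components (Counterfactual Probe Generation, the Equivalence Spectrum, Adaptive Probe Budget Allocation, and Faithfulness-Guided Decoding), achieving a high detection rate on the benchmark and reducing drift.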

Abstract

Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by faithfulness: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce Bidirectional Provability Fingerprinting (BPF), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) Counterfactual Probe Generation (CPG), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the Equivalence Spectrum, a continuous faithfulness score that replaces brittle binary verdicts; (iii) Adaptive Probe Budget Allocation (APBA), an information-theoretic budget router; and (iv) Faithfulness-Guided Decoding (FGD), which uses BPF signals as a reward during autoformalization. We prove a drift detection theorem and a PAC-faithfulness result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/\delta)/\varepsilon)$ probes under mild assumptions. We release DriftBench, a benchmark of $2{,}183$ NL/Lean 4 pairs with controlled drift labels across six subfields of mathlib4. BPF + CPG detects $89.6\%$ of drifted formalizations at a $3.0\%$ false-positive rate, against $41.2\%$ for typecheck and $63.3\%$ for LLM-judge baselines, and FGD reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by $47\%$. https://pmlrbd.github.io/BPF/

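2606.16478 2026-06-16 cs.AI New submission

Tensor-Coord: Algebraic Decomposition of Joint Plan Tensors for Conflict-Free Multi-Agent LLM Planning

Mudit Rastogi

AI summary: Proposes Tensor-Coord, a framework that represents the joint plan of multiple agents as a third-order tensor, uses CP and Tucker decompositions to identify coordination structure, computes a coordination complexity measure, and localizes conflicts to reach conflict-free plans.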

Abstract

Large language models (LLMs) remain limited in multi-agent planning because independently generated plans can create coordination failures such as spatial collisions, resource contention, and temporal deadlocks. We introduce Tensor-Coord, a multilinear algebra framework that represents the joint plan of \(N\) agents as a third-order tensor \(T \in \mathbb{R}^{N \times H \times A}\) over agents, timesteps, and actions. Canonical Polyadic (CP) and Tucker decompositions are used to identify latent coordination structure. The minimal \(\epsilon\)-approximate CP rank \(R^*\) defines a computable coordination complexity measure, with \(CC(\Pi) = (R^* - N)/N\). We prove that \(R^* = N\) is necessary and sufficient for plan independence. The residual \(E = T - T_{R^*}\) defines a conflict score over agent pairs, timesteps, and actions, localizing failures without domain-specific rules. Tucker factors provide interpretable agent roles, temporal phases, and action clusters that are converted into natural language constraints for iterative LLM replanning. Experiments on multi-robot delivery tasks across Easy (2 agents, 5x5 grid), Medium (3 agents, 5x5 grid), and Hard (4 agents, 5x5 grid) settings show convergence to conflict-free plans in 100% of 2-agent cases within 1.4 iterations on average, 80% of 3-agent cases within 3.2 iterations, and 60% of 4-agent cases within 4.0 iterations. CP rank scaled approximately linearly as \(R^*(N) = 3.9N + 0.5\), supporting its use as a predictor of coordination complexity.
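To make the tensor shape and the complexity measure concrete, the sketch below builds a joint plan tensor and evaluates the coordination complexity for a given rank. The one-hot encoding, the action set, and the function names are assumptions for illustration; the abstract does not say how tensor entries are defined, and the CP rank itself is taken as an input rather than computed here.

```python
ACTIONS = ("stay", "up", "down", "left", "right")  # hypothetical action vocabulary


def joint_plan_tensor(plans, actions=ACTIONS):
    """Encode per-agent action sequences as an N x H x A nested list (one-hot, assumed)."""
    index = {action: i for i, action in enumerate(actions)}
    horizon = len(plans[0])
    if any(len(plan) != horizon for plan in plans):
        raise ValueError("all agents must plan over the same horizon")
    tensor = [[[0.0] * len(actions) for _ in range(horizon)] for _ in plans]
    for n, plan in enumerate(plans):
        for h, action in enumerate(plan):
            tensor[n][h][index[action]] = 1.0
    return tensor


def coordination_complexity(cp_rank, n_agents):
    """CC = (R* - N) / N, as stated in the abstract; zero when R* equals N."""
    return (cp_rank - n_agents) / n_agents


plans = [["right", "right", "up"],   # agent 0
         ["up", "stay", "left"]]     # agent 1
T = joint_plan_tensor(plans)
print(len(T), len(T[0]), len(T[0][0]))          # N, H, A

# Plugging the reported linear fit R*(N) = 3.9N + 0.5 into CC for N = 2 agents.
fitted_rank = 3.9 * 2 + 0.5
print(round(coordination_complexity(fitted_rank, 2), 2))
```

Under the reported fit, the measure works out to 2.9 + 0.5/N, so it changes only slowly with the number of agents in the tested range.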

2606.16434 2026-06-16 cs.LG cs.AI New submission

Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning

Junting Wen, Dan Li, Qihao Quan, Xiwen Wang, Hang Yang, Zhaohong Meng, Zigui Jiang, Changlin Yang, Tianle Liu, Diego Muñoz-Carpintero, Jian Lou

AI summary: Proposes TC-SOH, a modular service architecture that extracts degradation-relevant representations from raw data through a temporal-contrastive mechanism and a cross-window prediction task for autonomous end-to-end SOH prediction, reducing MAPE and RMSE by factors of 1.91 and 2.13, respectively, on four datasets.

Abstract

Accurate state of health (SOH) estimation is a critical diagnostic service for lithium-ion battery management. However, reliance on labor-intensive manual feature engineering and opaque black-box models hinders scalable industrial deployment. To address this, we introduce TC-SOH: a modular, plug-and-play service architecture for autonomous, end-to-end SOH prediction. TC-SOH employs a temporal-contrastive mechanism and a cross-window prediction pretext task to extract degradation-relevant representations directly from raw operational data. To improve transparency, we connect model efficacy with representation diagnostics: visualization, sensitivity analysis, redundancy analysis, bidirectional probing, future-SOH probing, and temporal shuffling show that learned features overlap with selected expert descriptors while retaining additional SOH-relevant variation, and that ordered temporal context improves subsequent-SOH prediction. Across four public datasets, TC-SOH outperforms the considered physics-informed and data-driven baselines, reducing MAPE by a factor of 1.91 and RMSE by a factor of 2.13.
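The two error metrics quoted above can be computed as follows. The SOH values are invented for illustration, and reading the reported reductions as the ratio of baseline error to model error is an interpretation, not something the abstract states.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of y_true."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def reduction_factor(baseline_error, model_error):
    """How many times smaller the model error is than the baseline error (assumed reading)."""
    return baseline_error / model_error


# Hypothetical SOH trajectories (fraction of nominal capacity).
soh_true = [1.00, 0.95, 0.90, 0.85]
soh_baseline = [0.97, 0.99, 0.86, 0.88]
soh_model = [0.99, 0.96, 0.89, 0.86]

print(reduction_factor(mape(soh_true, soh_baseline), mape(soh_true, soh_model)))
print(reduction_factor(rmse(soh_true, soh_baseline), rmse(soh_true, soh_model)))
```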

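2606.16432 2026-06-16 cs.CL cs.AI New submission

ACCORD: Action-Conditioned Contextual Grounding for Language Agents

Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu, Heng Ji, Hao Peng

AI summary: User instructions are often underspecified because of implicit assumptions about the environment, which causes LLM agents to fail; the proposed ACCORD framework probes for missing information before each action and integrates context from the trajectory, requires no additional training, and substantially improves task completion on AppWorld and AlfWorld.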

Abstract

User instructions are often underspecified because humans rely on implicit assumptions about the surrounding environment. For large language model (LLM) agents operating in information-rich digital and physical environments, these assumptions cannot be inferred from the instruction alone; they must be recovered from the current state of tools, data, interfaces, and observations. Effective execution therefore requires agents to identify missing context, ground it in observed evidence, and carry it forward into subsequent actions. We show that current agents often fail to do so. They act from assumed rather than observed specifics, overlook information they could have gathered, and fail to incorporate evidence that has already been returned. Building on this insight, we propose ACCORD (Action-Conditioned Contextual Grounding), a simple and effective agent framework for adaptive grounding. Before each action, ACCORD actively probes the environment for missing information and integrates relevant context from the agent's trajectory that would otherwise be overlooked. Requiring no additional training or task-success signals, ACCORD improves task-goal completion on AppWorld by up to +20.6 points with GPT-5-mini, from 42.0% to 62.6%, compared to strong baselines. These gains persist with a substantially stronger base model (+10.8 with Claude-4.5-sonnet), an open-weight model (+10.1 with Qwen3.5-27B-FP8), and on the embodied AlfWorld benchmark (+7.4 success rate with GPT-5-mini).

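2606.16371 2026-06-16 cs.LG New submission

CacheMuon: Using Temporal Preconditioning To Approximate Polar Factor

Bishnu Dev, Sushil Bohara, Martin Takáč, Samuel Horváth

AI summary: Proposes CacheMuon, which caches polar factors from earlier optimization steps to cut the cost of the Newton-Schulz iteration in the Muon optimizer, lowering orthogonalization compute while preserving training quality.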

Abstract

Muon is an optimizer that computes updates using the polar factor of the momentum matrix and has shown strong empirical performance across a range of training settings. A key component of Muon is the Newton-Schulz iteration used to compute this polar factor. Although this avoids the cost of an exact singular value decomposition, it remains expensive in practice because it is applied at every optimization step. At the same time, the momentum matrix changes smoothly over training, suggesting strong temporal correlation in the corresponding polar factors. In this paper, we exploit this structure and propose CacheMuon, a temporal preconditioning method that reuses information from previous optimization steps to approximate the polar factor at the current step. This reduces redundant orthogonalization computation across iterations. We analyze CacheMuon as an inexact Muon update, with error controlled by fresh-solver error and cache staleness. Empirically, CacheMuon provides a controllable quality-efficiency frontier: conservative thresholds closely match fresh Muon on language-model and vision training while reducing orthogonalization FLOPs, whereas more aggressive thresholds yield larger arithmetic savings at the cost of modest validation-quality degradation.
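For readers unfamiliar with the Newton-Schulz iteration, the sketch below approximates the polar factor of a small matrix with the classic cubic variant. It is a dependency-free illustration under stated assumptions: the polynomial coefficients, step count, and normalization used by Muon or CacheMuon may differ, the caching scheme is not shown, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_norm(a):
    return math.sqrt(sum(x * x for row in a for x in row))


def newton_schulz_polar(m, steps=30):
    """Approximate the polar factor U V^T of a full-column-rank matrix m.

    Classic cubic iteration X <- 1.5 X - 0.5 X (X^T X), started from m scaled by its
    Frobenius norm so that every singular value lies in (0, 1].
    """
    norm = frobenius_norm(m)
    x = [[value / norm for value in row] for row in m]
    for _ in range(steps):
        x_gram = matmul(x, matmul(transpose(x), x))  # X (X^T X)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_g)]
             for row_x, row_g in zip(x, x_gram)]
    return x


# Small stand-in for a momentum matrix (made-up values).
M = [[3.0, 1.0],
     [1.0, 2.0],
     [0.0, 1.0]]
Q = newton_schulz_polar(M)
print([[round(v, 4) for v in row] for row in matmul(transpose(Q), Q)])
```

Each step pushes every singular value of the iterate toward 1 while leaving the singular vectors unchanged, so the result has orthonormal columns. Two slowly varying inputs have nearby polar factors, which is the temporal correlation the abstract proposes to exploit.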

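2606.16337 2026-06-16 cs.AI cs.HC cs.LG New submission

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

AI summary: Proposes Medical Heuristic Learning (MHL), which uses an LLM-driven workflow to optimize a deterministic, executable decision system and produce interpretable, auditable Python decision rules, matching state-of-the-art methods on medical datasets and supporting small-sample and highly imbalanced settings.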

Abstract

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.

2606.16328 2026-06-16 cs.AI 新提交

AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

AdaSTORM: 通过自适应时空多智能体协作扩展动态图上的LLM推理

Bing Hao, Ruijie Wang, Haodong Qian, Yunlong Chu, Yuhang Liu, Yumeng Lin, Minglai Shao, Jianxin Li

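AI summary: Proposes the AdaSTORM framework, which uses adaptive partitioning and spatio-temporally decoupled multi-agent collaboration to scale dynamic graph reasoning to thousand-node graphs with over 90% accuracy and no external tools.

Details
AI Chinese abstract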

大型语言模型(LLM)在动态图推理中展现出显著潜力,但面临扩展瓶颈:当前模型只能处理数十个节点的图,受限于指数级推理开销和有限的上下文窗口。尽管多智能体系统(MAS)提供了集体推理和拓扑感知编排的能力——这些能力天然适用于图结构任务,但其在动态图上的应用仍未探索。本文提出通过自适应时空多智能体协作扩展动态图上的LLM推理(AdaSTORM),这是一个将大规模动态图推理重构为两个阶段的框架:(i)自适应分区,将大规模动态图划分为与模型推理能力匹配的子区域,同时最小化推理成本;(ii)协作推理,将图分区拓扑与时空解耦的多智能体架构对齐。AdaSTORM是首个专为动态图推理设计的多智能体框架。大量实验表明,AdaSTORM成功突破了扩展瓶颈,将推理扩展到千节点图,在多个大规模动态图设置中准确率超过90%,且无需外部工具,显著优于七个竞争基线。此外,它在现有基准上达到了最先进的准确率,并稳健地泛化到真实世界数据集。源代码可在 https://github.com/irisorchid107/AdaSTORM/ 获取。

English abstract

Large Language Models (LLMs) demonstrate remarkable potential in dynamic graph reasoning, but suffer from a scaling bottleneck: current models can only handle graphs with tens of nodes, constrained by exponential reasoning overhead and finite context windows. While multi-agent systems (MAS) offer collective reasoning and topology-aware orchestration, capabilities naturally suited for graph-structured tasks, their application to dynamic graphs remains unexplored. This paper presents Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration (AdaSTORM), a framework that reformulates large-scale dynamic graph reasoning into two stages: (i) Adaptive Partitioning, partitioning large-scale dynamic graphs into subregions that match the model's reasoning capacity while minimizing inference cost; and (ii) Collaborative Reasoning, aligning graph partition topologies with a spatio-temporal decoupled multi-agent architecture. AdaSTORM is the first multi-agent framework tailored for dynamic graph reasoning. Extensive experiments show that AdaSTORM successfully breaks through the scaling bottleneck, scaling reasoning to thousand-node graphs with over 90% accuracy across several large-scale dynamic graph settings without external tools and significantly outperforming seven competitive baselines. Furthermore, it achieves state-of-the-art accuracy on existing benchmarks and generalizes robustly to real-world datasets. The source code is available at: https://github.com/irisorchid107/AdaSTORM/.

2606.16257 2026-06-16 cs.LG cs.AI New submission

Variance Reduction for Non-Log-Concave Sampling with Applications to Inverse Problems

非对数凹采样的方差缩减及其在逆问题中的应用

M. Berk Sahin, Ahmet Ege Tanriverdi, Behzad Sharif, Abolfazl Hashemi

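AI summary: Addresses the high variance of stochastic gradients when sampling from non-log-concave distributions with a unified analysis of the momentum, STORM, and PAGE variance-reduction estimators, proving improved convergence rates in relative Fisher information and in squared total variation distance, and extending the analysis to inverse problems with score-based generative priors.

Comments Accepted to Uncertainty in Artificial Intelligence (UAI) 2026

Details
AI Chinese abstract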

从具有未归一化密度的高维、非对数凹分布中采样是机器学习中的一个基本挑战,特别是当势能的精确梯度不可用,且必须通过每次迭代固定梯度计算预算下表现出高方差的随机梯度来近似时。尽管诸如带动量的SGD、STORM和PAGE等方差缩减技术已在非凸优化中展现出改进的收敛性质,但它们对非对数凹分布采样的影响仍基本未被探索。在这项工作中,我们首次对这些估计器用于非对数凹分布采样进行了统一分析。我们在$\varepsilon$-相对Fisher信息下建立了改进的非渐近收敛率,并在Poincaré不等式假设下,在平方总变差距离下建立了改进的非渐近收敛率,进一步证明了向目标分布的弱收敛。我们将分析扩展到使用基于得分的生成先验求解逆问题。我们通过实验验证了理论,并证明在每次迭代固定梯度计算预算下,方差缩减技术在两个标准成像应用中一致地提高了样本质量。

English abstract

Sampling from high-dimensional, non-log-concave distributions with unnormalized densities is a fundamental challenge in machine learning, particularly when the exact gradient of the potential is unavailable and must be approximated via stochastic gradients that exhibit high variance under a fixed budget of gradient computations per iteration. Although variance reduction techniques such as SGD with momentum, STORM, and PAGE have demonstrated improved convergence properties in non-convex optimization, their implications for sampling from non-log-concave distributions remain largely unexplored. In this work, we develop the first unified analysis of these estimators for sampling from non-log-concave distributions. We establish improved non-asymptotic convergence rates in $\varepsilon$-relative Fisher information and, under a Poincaré inequality assumption, in squared total variation distance, and further prove weak convergence to the target distribution. We extend our analysis to solving inverse problems with score-based generative priors. We empirically validate our theory and demonstrate that, under a fixed budget of gradient computations per iteration, variance-reduction techniques consistently improve sample quality in two standard imaging applications.
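
To make the idea of a variance-reduced gradient estimator inside a sampler more concrete, here is a minimal sketch of the STORM recursion, d_t = g(x_t; xi_t) + (1 - a)(d_{t-1} - g(x_{t-1}; xi_t)), plugged into an unadjusted Langevin step. The toy double-well potential, the additive noise model, the step size, and all function names are assumptions chosen for illustration; this is not the authors' implementation or experimental setup.

```python
import math
import random

def stochastic_grad(x, xi):
    """Noisy gradient of the toy potential U(x) = (x^2 - 1)^2 (assumed, non-log-concave target)."""
    return 4.0 * x * (x * x - 1.0) + xi

def storm_update(d_prev, x_prev, x_curr, xi, a):
    """STORM recursion: the same noise sample xi is evaluated at both iterates."""
    g_curr = stochastic_grad(x_curr, xi)
    g_prev = stochastic_grad(x_prev, xi)
    return g_curr + (1.0 - a) * (d_prev - g_prev)

def langevin_with_storm(steps=500, step_size=1e-2, a=0.1, noise_std=1.0, seed=0):
    """Unadjusted Langevin iteration driven by the STORM estimate instead of the exact gradient."""
    rng = random.Random(seed)
    x = 0.5
    d = stochastic_grad(x, rng.gauss(0.0, noise_std))  # initial estimate: one plain stochastic gradient
    samples = []
    for _ in range(steps):
        x_next = x - step_size * d + math.sqrt(2.0 * step_size) * rng.gauss(0.0, 1.0)
        xi = rng.gauss(0.0, noise_std)                 # fresh noise sample for this iteration
        d = storm_update(d, x, x_next, xi, a)
        x = x_next
        samples.append(x)
    return samples
```

Setting `a = 1.0` recovers the plain stochastic gradient, while smaller values reuse more of the previous estimate; each iteration costs two stochastic gradient evaluations.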

2606.16226 2026-06-16 cs.LG New submission

Prediction of Runtime Parameters of Parallel Chemistry Applications via Active and Generative Learning

通过主动和生成学习预测并行化学应用的运行时参数

Tanzila Tabassum, Omer Subasi, Ajay Panyala, Epiya Ebiapia, Gerald Baumgartner, Erdal Mutlu, P Sadayappan, Karol Kowalski

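AI summary: Proposes machine-learning approaches based on active and generative learning, combined with gradient boosted regression tree models, to predict the runtime parameters of parallel chemistry computations, reaching a MAPE as low as 0.023 and an R² as high as 99.9% on CCSD computations.

Details
AI Chinese abstract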

在这项工作中,我们开发了两种主要的基于机器学习的方法来预测高度可扩展的并行化学计算的运行时参数。这些方法将主动学习和生成学习与经验确定的梯度提升回归树模型相结合,该模型是从丰富的机器学习模型套件中选出的。当在耦合簇单双激发计算上进行评估时,我们的模型实现了低至0.023的平均绝对百分比误差(MAPE)和高达99.9%的决定系数。此外,当与主动学习相结合以缓解缺乏大量训练数据的问题时,我们的模型在使用原始数据集的20-25%时,MAPE约为0.2。

English abstract

In this work, we develop two main machine-learning-based approaches to predict the runtime parameters of highly scalable parallel chemistry computations. These approaches employ active and generative learning together with empirically determined gradient boosted regression tree models chosen from a rich suite of machine learning models. When evaluated on Coupled-Cluster with Singles and Doubles (CCSD) computations, our models achieve a mean absolute percentage error (MAPE) as low as 0.023 and a coefficient of determination as high as 99.9%. Furthermore, when combined with active learning to mitigate the lack of large amounts of training data, our models achieve a MAPE of about 0.2 with 20-25% of the original dataset.
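
For readers unfamiliar with the two reported metrics, the following sketch shows how MAPE and the coefficient of determination are commonly computed. The function names are illustrative, and the abstract does not state whether its MAPE values are fractions or percentages; the fraction convention used here is an assumption.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.1 means 10%); targets must be non-zero."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect predictor gives a MAPE of 0 and an R² of 1, while always predicting the mean of the targets gives an R² of 0.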

2606.16160 2026-06-16 cs.LG cs.AI cs.HC New submission

A comparative and critical study of EEGNet for fNIRS-driven cognitive load classification

EEGNet在fNIRS驱动的认知负荷分类中的比较与批判性研究

Mehshan Ahmed Khan, Houshyar Asadi, Li Zhang, Mohammad Reza Chalak Qazani, Ghazal Bargshady, Stefanos Gkikas, Christian Arzate, Sam Oladazimi, Zoran Najdovsk, Lei Wei, Chee Peng Lim

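AI summary: Systematically evaluates EEGNet for fNIRS-based cognitive load classification. Overlapping segmentation with small fixed learning rates performs best under random splits, but accuracy drops sharply under subject-independent (SI) evaluation, where non-overlapping segmentation with PCA features gives the best accuracy of 56.11%, suggesting that removing temporal redundancy helps the model learn more robust cross-subject representations.

Details
AI Chinese abstract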

由于时间变异性、受试者间差异以及对预处理选择的敏感性,从功能性近红外光谱(fNIRS)信号中准确分类认知负荷仍然是一个重大挑战。本研究通过系统检查时间分割策略(重叠与非重叠)、窗口长度(10秒、20秒、30秒)、特征提取方法(方差分析(ANOVA)、主成分分析(PCA)、快速独立成分分析(FastICA))、学习率配置(固定和自适应)以及评估协议(随机分割与受试者独立(SI))的影响,对EEGNet在基于fNIRS的认知负荷分类中进行了全面评估。随机分割实验的结果表明,重叠分割结合较小的固定学习率(0.01-0.001)由于时间冗余和血流动力学转变的密集采样而产生了最高的准确率。然而,SI评估显示准确率大幅下降,表明对未见参与者的泛化能力有限。在SI评估下,非重叠分割优于重叠窗口,使用PCA特征、20秒窗口和0.1学习率获得了最佳准确率56.11%。这些发现表明,消除时间冗余有助于模型学习更鲁棒和可泛化的跨个体认知负荷表征。尽管自适应学习率策略提高了训练稳定性,但并未超过最优选择的固定学习率的性能。该研究强调了分割策略和学习率选择在提高模型泛化能力中的关键作用,并指出了开发基于fNIRS的可靠、实时和受试者独立认知负荷分类系统所必需的方法学考虑。

English abstract

Accurately classifying cognitive load from functional near-infrared spectroscopy (fNIRS) signals remains a significant challenge due to temporal variability, inter-subject differences, and sensitivity to preprocessing choices. This study provides a comprehensive evaluation of EEGNet for fNIRS-based cognitive load classification by systematically examining the effects of temporal segmentation strategies (overlapping vs. non-overlapping), window lengths (10 s, 20 s, 30 s), feature extraction methods (Analysis of Variance (ANOVA), Principal Component Analysis (PCA), Fast Independent Component Analysis (FastICA)), learning rate configurations (fixed and adaptive), and evaluation protocols (random split vs. subject-independent (SI)). Results from random-split experiments show that overlapping segmentation, combined with smaller fixed learning rates (0.01-0.001), yields the highest accuracies, owing to temporal redundancy and dense sampling of hemodynamic transitions. However, SI evaluation reveals a substantial drop in accuracy, demonstrating limited generalization to unseen participants. Under SI evaluation, non-overlapping segmentation outperformed overlapping windows, with the best accuracy of 56.11% achieved using PCA features with a 20-second window and a 0.1 learning rate. These findings indicate that eliminating temporal redundancy helps the model learn more robust and generalizable representations of cognitive load across individuals. Although adaptive learning rate strategies improved training stability, they did not surpass the performance of optimally selected fixed learning rates. The study highlights the critical role of segmentation strategy and learning rate selection in improving model generalization and identifies methodological considerations essential for developing reliable, real-time, and SI cognitive load classification systems using fNIRS.
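
The difference between the segmentation strategies and evaluation protocols compared in this abstract can be sketched in a few lines. The helper names below are hypothetical, and using leave-one-subject-out as the subject-independent protocol is an assumption, since the abstract does not say how SI folds were formed.

```python
def segment(signal, window, stride):
    """Cut a 1-D sequence into fixed-length windows; stride < window yields overlapping windows."""
    return [signal[s:s + window] for s in range(0, len(signal) - window + 1, stride)]

def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx) so that no subject contributes to both sets."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

With overlapping windows and a random split, near-duplicate windows from the same recording can land in both training and test sets, which is one plausible reading of why random-split accuracy is higher than SI accuracy here.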

2606.16154 2026-06-16 cs.LG New submission

A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

RLVR稳定性与胜者优势策略优化的梯度视角

Prasanth YSS, Zhichen Ren, Rasa Hosseinzadeh, Ilan Gofman, Yuqi Chen, Zhaoyan Liu, Guangwei Yu, Jesse C. Cresswell, Satya Krishna Gorti

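AI summary: Analyzes the instability of GRPO through token-level gradient dynamics and proposes WAPO, which updates only on positive-advantage completions, improving training stability and matching or outperforming baselines on mathematical reasoning and multi-hop QA tasks.

Details
AI Chinese abstract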

具有可验证奖励的强化学习(RLVR)改进了语言模型的推理能力,但GRPO风格的优化仍然容易崩溃。我们通过令牌级梯度动力学分析这种不稳定性,推导出一个分类法,预测更新如何影响下一个令牌的概率和熵。该分类法表明,稳定性共同取决于当前策略下的优势符号和令牌分布。受此发现启发,我们提出了胜者优势策略优化(WAPO),一种简单的在线裁剪策略梯度目标,仅更新正优势完成。在数学推理和多跳QA基准测试中,WAPO提高了训练稳定性,并在多个模型家族中匹配或超越基线。完整代码可在https://github.com/layer6ai-labs/wapo找到。

English abstract

Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyse this instability through token-level gradient dynamics, deriving a taxonomy that predicts how updates affect next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the advantage sign and token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. Across mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Full code can be found at https://github.com/layer6ai-labs/wapo.
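
The phrase "updates only on positive-advantage completions" can be illustrated with a small sketch. It assumes GRPO-style group-normalized advantages and a PPO-style clipped surrogate evaluated per completion; the function names, the normalization over winners only, and the sequence-level probability ratio are all simplifying assumptions, not the objective defined in the paper or its repository.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def winner_only_surrogate(ratios, advantages, clip_eps=0.2):
    """Average clipped surrogate over completions whose advantage is strictly positive."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        if adv <= 0.0:
            continue  # non-winning completions are dropped from the update in this sketch
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms) if terms else 0.0
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes nothing to the objective.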

2606.16119 2026-06-16 cs.CV New submission

EdgeZSAD: Practical Zero-Shot Anomaly Detection on Edge Devices

EdgeZSAD:边缘设备上的实用零样本异常检测

Taewan Cho, Andrew Jaeyong Choi

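AI summary: Targets edge deployment constraints with a compact zero-shot anomaly detection system built on a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR), reaching an average image AUROC of 91.6 on MVTec-AD and 88.2 on VisA while remaining directly deployable on embedded hardware.

Details
AI Chinese abstract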

工业检测需要零样本异常检测(ZSAD),该检测在边缘部署约束下仍然有效。最近的方法通常依赖ViT-L基础骨干(约3亿参数),这超出了典型嵌入式硬件的内存和算子预算。我们通过EdgeZSAD研究这一场景,这是一个紧凑的参考系统,围绕TinyViT-21M-512骨干、非对称全局-局部读出(EdgeGLR)和可复现的源端训练方案(Real-IAD-DR)构建。我们在源训练、目标未见协议下训练单个检查点,并在六个工业基准上评估。在三次独立运行中,所得模型在MVTec-AD上平均图像AUROC达到91.6,在VisA上达到88.2,同时可直接部署在Jetson Orin Nano Super(TensorRT FP16)和RB5 Gen2(QNN GPU FP16)上。在六个设备重新评分的基准中,图像AUROC漂移保持在0.2点以下,表明导出的图在评估的部署设置中保留了主机端的排序行为。

English abstract

Industrial inspection needs zero-shot anomaly detection (ZSAD) that remains useful under edge deployment constraints. Recent methods often rely on ViT-L foundation backbones (~300M parameters), which exceed the memory and operator budget of typical embedded hardware. We study this regime through EdgeZSAD, a compact reference system built around a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR). We train a single checkpoint in a source-trained, target-unseen protocol and evaluate it across six industrial benchmarks. Across three independent runs, the resulting model reaches an average image AUROC of 91.6 on MVTec-AD and 88.2 on VisA, while remaining directly deployable on Jetson Orin Nano Super (TensorRT FP16) and RB5 Gen2 (QNN GPU FP16). Across the six device-rescored benchmarks, image-AUROC drift stays below 0.2 points, indicating that the exported graph preserves host-side ranking behavior in the evaluated deployment setting.

2606.16112 2026-06-16 cs.LG cs.AI New submission

Scaling Adaptive Depth with Norm-Agnostic Residual Networks

缩放自适应深度:范数无关残差网络

Tomás Figliolia, Beren Millidge

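AI summary: Addresses the growth of the residual-stream norm with depth, which suppresses updates from later layers, by proposing NAG, a norm-agnostic residual architecture that separates magnitude from directional information to preserve each layer's contribution. The formulation also yields an interpretable adaptive depth-skipping mechanism that matches full-depth performance under equal training compute.

Details
AI Chinese abstract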

残差架构在深度学习中无处不在,但它们存在一个微妙的结构性限制:残差流的范数会随深度迅速增长。因此,来自后层的更新相对于累积的残差状态变得很小。这降低了它们对表示的影响,并限制了模型在深度上扩展的益处。为了解决这个问题,我们引入了NAG,一种范数无关的残差架构,它将残差流中的幅度与方向信息分离,在整个深度中保留有意义的层贡献,并防止后层更新被残差范数增长系统地抑制。重要的是,NAG仅引入可忽略数量的额外参数,并依赖于易于内核融合的简单操作,从而在实践中保持训练效率。我们表明,该架构优于基线Transformer,其增益随深度增加而显著增大,从而能够有效训练更深的模型。范数无关的公式还产生了一种可解释的深度混合(MoD)机制,该机制自适应地跳过注意力和MLP层。除了作为训练后的精度-计算权衡外,该机制还可以用作预训练时的扩展策略:在等FLOP训练下,通过减少每token前向传播成本节省的计算量可以再投资于在更多token上训练,同时保持总参数数量和KV缓存预算固定。在我们的实验中,约20%-25%的适度深度混合率在相等训练计算量下匹配全深度基线性能,同时大幅减少执行的层参数数量和前向传播FLOPs。这些结果将深度稀疏性确定为固定计算量训练的新扩展轴,从而能够实现非常深但FLOP高效的模型。

English abstract

Residual architectures are ubiquitous in deep learning, but they suffer from a subtle structural limitation: the norm of the residual stream can grow rapidly with depth. As a result, updates from later layers become small relative to the accumulated residual state. This reduces their impact on the representation and limits the benefits of scaling models in depth. To address this, we introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream, preserving meaningful layer contributions throughout depth and preventing later updates from being systematically suppressed by residual-norm growth. Importantly, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. We show that this architecture outperforms baseline Transformers, with gains that increase substantially as depth grows, enabling effective training of much deeper models. The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy: under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed. In our experiments, moderate Mixture-of-Depths rates of approximately 20%-25% match full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.

2606.16023 2026-06-16 cs.LG New submission

IBAD: Interpretable Behavioral Anomaly Detection on Human Mobility Data

IBAD:人类移动数据上的可解释行为异常检测

Bita Azarijoo, John Krumm, Cyrus Shahabi

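AI summary: Proposes the IBAD framework, which uses LDA to learn interpretable daily mobility templates and a hierarchical self-supervised model to detect individual behavioral anomalies, and validates the transferability and robustness of the templates on real-world and synthetic datasets.

Details
AI Chinese abstract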

人类移动行为看似高度多样化,但个体日常移动的大部分可由少量重复的行为模板解释,如通勤、学校活动、照护、夜生活或差事模式。我们提出 IBAD(可解释行为异常检测),该框架学习可解释的日常移动模板,并将每个个体表示为这些模板混合上的分布。IBAD 不关注特定位置,而是刻画个体在不同地点执行的活动。该方法首先使用潜在狄利克雷分配(LDA)发现全局行为模板,然后采用层次自监督模型从个体的软行为模板中学习正常行为。我们还引入了一个拼接基准,用于在个体历史画像与注入的移动模式之间创建受控的行为不匹配。在真实和合成数据集上的实验表明,日常行为可有效分解为少量可解释的模板。关键的是,我们证明学习到的行为原型在不同地理和人口统计背景下具有可迁移性。此外,IBAD 在所有设置下均保持稳健的竞争性能。为便于复现,代码可在 https://github.com/USC-InfoLab/IBAD 获取。

English abstract

Human mobility appears highly diverse, yet much of a person's daily mobility can be explained by a small set of recurring behavioral templates, such as commuting, school-centered activities, caregiving, nightlife, or errand patterns. We present IBAD (Interpretable Behavioral Anomaly Detection), a framework that learns interpretable daily mobility templates and represents each individual as a distribution over mixtures of these templates. Rather than focusing on specific locations, IBAD characterizes activities that individuals perform across locations. This approach first discovers global behavioral templates using Latent Dirichlet Allocation (LDA), then employs a hierarchical self-supervised model to learn the normal behavior of individuals from their soft behavioral templates. We also introduce a splicing benchmark that creates controlled behavioral mismatches between an individual's historical profile and injected mobility patterns. Experiments on real-world and synthetic datasets show that daily behavior can be effectively decomposed into a small number of interpretable templates. Crucially, we show that the learned behavioral archetypes transfer across distinct geographic and demographic contexts. Furthermore, IBAD maintains robust, competitive performance across all settings. For reproducibility purposes, the code is accessible at https://github.com/USC-InfoLab/IBAD.

2606.15967 2026-06-16 cs.CV New submission

CRIS: Cross-Plane Self-Supervised Isotropic Restoration for Anisotropic Volumetric Imaging Across Modalities

CRIS:跨模态各向异性体积成像的跨平面自监督各向同性恢复

Adi Ahituv, Anat Ilivitzki, Moti Freiman

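AI summary: Proposes CRIS, a cross-plane self-supervised framework that needs no paired isotropic ground truth and performs 3D isotropic restoration as 2D stripe completion on orthogonal reformats, outperforming interpolation and several other methods on MRI and volume electron microscopy.

Comments 22 pages, 8 figures, supplementary material included. Submitted to Medical Image Analysis

Details
AI Chinese abstract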

各向异性体积采集在临床MRI和体积电子显微镜(vEM)中很常见,其中稀疏的跨平面采样产生厚切片或截面,降低了正交重切和下游分析的质量。我们提出CRIS,一种跨平面自监督框架,无需配对各向同性真值即可实现各向同性恢复。CRIS将3D恢复视为各向同性网格正交重切上的2D条带补全:训练时,高分辨率面内切片被合成退化并周期性掩蔽;推理时,空白切片定义各向同性网格,恢复两个正交重切,并通过多视图平均融合预测。我们在两个MRI队列和两个显微镜基准上评估CRIS,各向异性高达8倍。在脑MRI上,CRIS达到32.921±0.436 dB PSNR和0.9631±0.0027 SSIM,优于插值、SMORE4、SIMPLE、SA-INR和ATME,并给出最佳分割一致性(Dice 0.940±0.004,ASSD 0.245±0.014 mm,HD99 1.275±0.061 mm)。在无参考腹部MRI上,CRIS将FID/KID降至48.714/0.023。在vEM上,CRIS优于插值、NIIV和vEMINR,在4倍时达到29.133 dB/0.834 3D PSNR/SSIM,在EPFL 8倍时达到27.123 dB/0.734,在噪声hemibrain数据上达到21.915 dB/0.699。在鲁棒性实验中,一个可变间隙CRIS模型在间隙因子3-7以及冠状、轴向和矢状退化下评估,保持比插值更高的PSNR/SSIM(36.36-31.14 dB和0.977-0.932对比33.07-27.85 dB和0.951-0.853)。这些结果支持CRIS作为一种模态灵活的途径,无需配对各向同性目标或特定配置的重新训练即可实现各向同性恢复。代码可在https://github.com/adi-hatav/CRIS获取。

English abstract

Anisotropic volumetric acquisitions are common in clinical MRI and volume electron microscopy (vEM), where sparse through-plane sampling creates thick slices or sections that degrade orthogonal reformats and downstream analysis. We present CRIS, a cross-plane self-supervised framework for isotropic restoration without paired isotropic ground truth. CRIS casts 3D restoration as 2D stripe completion on orthogonal reformats of an isotropic grid: high-resolution in-plane slices are synthetically degraded and periodically masked for training, while at inference blank slices define the isotropic grid, two orthogonal reformats are restored, and predictions are fused by multi-view averaging. We evaluate CRIS on two MRI cohorts and two microscopy benchmarks up to 8× anisotropy. On brain MRI, CRIS achieves 32.921 ± 0.436 dB PSNR and 0.9631 ± 0.0027 SSIM, outperforming interpolation, SMORE4, SIMPLE, SA-INR, and ATME, and gives the best segmentation consistency (Dice 0.940 ± 0.004, ASSD 0.245 ± 0.014 mm, HD99 1.275 ± 0.061 mm). On reference-free abdominal MRI, CRIS reduces FID/KID to 48.714/0.023. On vEM, CRIS outperforms interpolation, NIIV, and vEMINR, reaching 29.133 dB/0.834 3D PSNR/SSIM at 4×, 27.123 dB/0.734 on EPFL at 8×, and 21.915 dB/0.699 on noisy hemibrain data. In a robustness experiment, one variable-gap CRIS model evaluated across gap factors 3–7 and coronal, axial, and sagittal degradations maintained higher PSNR/SSIM than interpolation (36.36–31.14 dB and 0.977–0.932 vs. 33.07–27.85 dB and 0.951–0.853). These results support CRIS as a modality-flexible route to isotropic restoration without paired isotropic targets or configuration-specific retraining. Code is available at https://github.com/adi-hatav/CRIS.

2606.15940 2026-06-16 cs.LG New submission

Causal-Privacy Audit Workflow for Synthetic and Distilled Data in Dropout Support

辍学支持中合成与蒸馏数据的因果隐私审计工作流

Hanghang Zheng, Xiwei Zhuang, Zhong Wang, Hong Liu, Xiao Chen, Jingwen He, Xia Li

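AI summary: Proposes the CaP-Eval workflow, which audits the predictive utility, causal fidelity, and privacy risk of synthetic student data under a fixed estimand, and finds that DPGNet and distilled data preserve the treatment-effect structure better than the baseline methods.

Details
AI Chinese abstract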

合成和蒸馏的学生数据越来越多地用于实现隐私意识的学习分析,但它们对面向决策的机构支持的适用性仍不确定。在辍学支持中,生成的数据不仅必须保留预测效用或分布相似性,还必须保留用于指导咨询、付款计划援助和奖学金相关决策的财务状况证据。方法:本研究引入了CaP-Eval,一种面向决策的因果隐私审计工作流,用于在固定估计目标、时间感知调整设计、估计器集和经验隐私治理筛选下评估生成的学生数据。该工作流比较了原始数据、蒸馏数据、对抗合成数据、统计合成数据和DPGNet隐私导向生成数据在预测效用、处理效应保真度、对替代估计器的鲁棒性以及局部训练记录邻近性方面的表现。结果:DPGNet和蒸馏数据比对抗和高斯Copula基线更可靠地保留了原始财务状况处理效应结构。DPGNet在epsilon水平上保留了完整的方向和秩一致性;epsilon=10产生了最小的非原始IPW和DML偏差,而epsilon=1和epsilon=5放大了若干财务状况对比。蒸馏数据保持高度忠实,但保留了最强的局部训练记录邻近信号。TabularGNet保留了定性方向但存在中度衰减,高斯Copula压缩了效应幅度。结论:预测效用、隐私导向、经验披露信号和因果保真度存在分歧;生成的学生数据在决策使用前需要对方向、幅度、重叠和发布治理风险进行联合审计。

English abstract

Synthetic and distilled student data are increasingly used to enable privacy-conscious learning analytics, yet their suitability for decision-facing institutional support remains uncertain. In dropout support, generated data must preserve not only predictive utility or distributional resemblance, but also the financial-status evidence used to guide advising, payment-plan assistance, and scholarship-related decisions. Method: This study introduces CaP-Eval, a decision-facing causal-privacy audit workflow for evaluating generated student data under a fixed estimand, timing-aware adjustment design, estimator set, and empirical privacy-governance screen. The workflow compares original, distilled, adversarial synthetic, statistical synthetic, and DPGNet privacy-oriented generated data on predictive utility, treatment-effect fidelity, robustness to alternative estimators, and local training-record proximity. Results: DPGNet and distilled data preserved the original financial-status treatment-effect structure more reliably than the adversarial and Gaussian Copula baselines. DPGNet preserved full direction and rank agreement across epsilon levels; epsilon = 10 produced the smallest non-original IPW and DML deviations, while epsilon = 1 and epsilon = 5 amplified several financial-status contrasts. Distilled data remained highly faithful but retained the strongest local training-record proximity signal. TabularGNet preserved qualitative directions with moderate attenuation, and Gaussian Copula compressed effect magnitudes. Conclusions: Predictive utility, privacy orientation, empirical disclosure signals, and causal fidelity diverged; generated student data require joint audits of direction, magnitude, overlap, and release-governance risk before decision use.

2606.15930 2026-06-16 cs.RO cs.AI New submission

ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation

ControlMap: 用于交通场景仿真的可控高清地图生成

Marwan Farag, Steffen Wäldele, Yu Yao

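AI summary: Proposes a data-driven pipeline based on latent diffusion and ControlNet for controllable HD map generation, supporting spatial guidance, adjustable conditioning strength, and city-level style transfer, and introduces new metrics for adherence to the control signal and similarity to ground-truth maps.

Details
AI Chinese abstract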

仿真是验证自动驾驶系统的核心,但当前流程因高精(HD)地图创建成本高昂而受限于场景多样性不足。扩展HD地图需要昂贵的数据收集和人工处理。此外,现有生成模型缺乏在生成过程中针对特定道路拓扑进行细粒度控制的能力。本文提出一种数据驱动的可控HD地图生成管道,使用潜在扩散和ControlNet进行空间条件控制。据我们所知,我们是首个将空间引导信号注入扩散模型用于HD地图合成的工作。此外,我们的模型支持通过无分类器引导调整条件强度,并通过城市标签条件实现城市级风格迁移。为补充现有指标,我们引入两个新指标来评估对控制信号的遵循程度以及与真实地图的相似性。实验表明,我们的模型生成的HD地图真实且忠实遵循输入道路拓扑,同时准确保留城市特定细节。

English abstract

Simulation is central to validating autonomous driving systems, yet current pipelines are limited by insufficient scenario diversity because high-definition (HD) map creation is costly. Scaling HD maps requires expensive data collection and manual processing. Moreover, existing generative models lack the fine-grained control necessary to target specific road topologies during generation. This paper presents a data-driven pipeline for controllable HD map generation using latent diffusion and ControlNet for spatial conditioning. To our knowledge, we are the first to inject spatial guidance signals into a diffusion model for HD map synthesis. Furthermore, our model supports adjustable conditioning strength through classifier-free guidance and city-level style transfer via city label conditioning. To complement existing metrics, we introduce two novel metrics to evaluate adherence to the control signal and similarity to ground-truth maps. Experiments demonstrate that our model generates realistic HD maps that faithfully follow input road topologies while accurately preserving city-specific details.
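
The adjustable conditioning strength mentioned in the abstract is usually obtained with the standard classifier-free guidance combination of an unconditional and a conditional noise prediction. The sketch below shows that combination on plain lists; the function name is illustrative, and whether ControlMap uses exactly this parameterization of the guidance scale is an assumption.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend noise predictions element-wise: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A scale of 0 ignores the condition, a scale of 1 reproduces the conditional prediction, and larger values push the sample further toward the condition.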

2606.15897 2026-06-16 cs.LG cs.AI stat.ML New submission

Topological Flow Matching

拓扑流匹配

Kacper Wyrwal, İsmail İlkan Ceylan, Alexander Tong

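AI summary: Proposes topological flow matching, which augments the reference process with a Laplacian-derived drift to capture the topology of the underlying domain while keeping flow matching's stable, simulation-free objective, and applies it to structured data such as brain fMRI and ocean currents.

Comments Accepted at ICLR 2026. 26 pages, 24 figures. Code: https://github.com/KacperWyrwal/topological-flow-matching

Details
AI Chinese abstract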

流匹配是一个强大的生成建模框架,因其简单性和强大的经验性能而受到重视。然而,其标准公式将结构化空间上的信号(例如脑图上的fMRI数据)视为欧几里得空间中的点,忽略了其域的丰富拓扑特征。为了解决这个问题,我们引入了拓扑流匹配,这是流匹配的一种拓扑感知泛化。我们将流匹配解释为解决退化薛定谔桥问题的框架,并通过用拉普拉斯导出的漂移增强参考过程来注入拓扑信息。这种原则性修改捕获了底层域的结构,同时保留了流匹配的理想特性:稳定的、无模拟的目标和确定性样本路径。因此,我们的框架可以作为标准流匹配的直接替代品。我们在多样化的结构化数据集上展示了其有效性,包括脑fMRI、洋流、地震事件和交通流。

English abstract

Flow matching is a powerful generative modeling framework, valued for its simplicity and strong empirical performance. However, its standard formulation treats signals on structured spaces, such as fMRI data on brain graphs, as points in Euclidean space, overlooking the rich topological features of their domains. To address this, we introduce topological flow matching, a topology-aware generalization of flow matching. We interpret flow matching as a framework for solving a degenerate Schrödinger bridge problem and inject topological information by augmenting the reference process with a Laplacian-derived drift. This principled modification captures the structure of the underlying domain while preserving the desirable properties of flow matching: a stable, simulation-free objective and deterministic sample paths. As a result, our framework serves as a drop-in replacement for standard flow matching. We demonstrate its effectiveness on diverse structured datasets, including brain fMRIs, ocean currents, seismic events, and traffic flows.
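
As a rough intuition for what a "Laplacian-derived drift" can look like, the sketch below builds the combinatorial graph Laplacian L = D - A and applies the heat-equation-style drift b(x) = -c L x to a signal on the graph nodes. This is only the textbook graph heat drift; the function names and the scalar `c` are illustrative, and the abstract does not say that the paper's drift takes exactly this form.

```python
def graph_laplacian(adjacency):
    """Combinatorial Laplacian L = D - A for a graph without self-loops."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]

def laplacian_drift(laplacian, x, c=1.0):
    """Drift b(x) = -c * L x, which pulls each node's value toward its neighbours' values."""
    return [-c * sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in laplacian]
```

A constant signal has zero drift, so under this assumed form the drift only acts on variation across edges, which is how the graph structure enters the dynamics.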

2606.15872 2026-06-16 cs.CL New submission

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

SciOrch: 学习编排专家大语言模型以解决前沿多模态科学推理任务

Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin

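AI summary: Proposes the SciOrch framework, which trains a lightweight 8B model to orchestrate multiple frontier LLMs, optimized with MCTS and GRPO-style training, and outperforms the strongest single model and the strongest multi-agent baseline on scientific reasoning tasks.

Details
AI Chinese abstract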

前沿科学推理仍然是大语言模型(LLMs)面临的主要挑战,即使是最强大的商业系统也达不到专家级性能。对模型行为的深入分析揭示了单模型评估所隐藏的显著互补性:不同的前沿模型在不同类型的问题上表现出色,没有一个模型能全面覆盖。我们提出了SciOrch,一个训练轻量级8B模型来编排前沿LLMs进行科学推理的框架。编排器分解每个问题,通过API调用将子问题委托给选定的商业模型,并综合最终答案。训练这样的编排器比传统的智能体强化学习更难:每个动作都会触发一次API调用,这在金钱成本和延迟上都代价高昂,使得标准的在线回滚不可行。我们通过基于MCTS的方法解决了这个问题,生成了多样化的编排轨迹,提取了每个节点的单轮样本,并使用GRPO风格的训练优化编排器。在包含SGI-Reasoning和Scientists' First Exam的240个问题测试集上,SciOrch达到了56.66%的平均准确率,比最强的单个商业模型高出3.74%,比最强的多智能体基线高出3.33%。它还在SGI和SFE上都取得了最佳准确率,而API成本不到典型多智能体方法的一半。

English abstract

Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach that produces diverse orchestration trajectories, extracts per-node single-turn samples, and optimizes the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.

2606.15857 2026-06-16 cs.CV New submission

A Dual-Branch Collaborative Framework for Joint Optimization of Underwater Image Enhancement and Object Detection

用于水下图像增强与目标检测联合优化的双分支协作框架

Liyuan Cao, Zheng Liu, Guanghao Liao, Yonghui Yang, Qi Li

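AI summary: Proposes a dual-branch underwater image enhancement framework whose detail enhancement and color restoration branches respectively recover texture details and correct color casts, improving visual quality while balancing detection performance and efficiency and raising YOLOv8 mAP50 by 2.1% on the URPC dataset.

Details
AI Chinese abstract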

由于波长依赖的光吸收和散射,水下图像通常存在颜色失真和细节模糊,这限制了水下目标检测的性能。现有的水下图像增强方法主要关注视觉质量提升,但仍难以平衡增强质量、处理效率和下游检测性能。因此,本文提出一种高效的双分支水下图像增强框架用于目标检测。细节增强分支通过提升亮度和局部对比度来恢复暗区域的纹理细节。颜色恢复分支使用自适应补偿来减少颜色失真并改善色彩层次。通过结合两个分支的互补输出,所提框架为目标检测提供更清晰、信息更丰富的图像。在UIEB和EUVP数据集上,所提方法分别达到2.249和2.576的UIQM分数。当应用于URPC数据集上的YOLOv8检测任务时,与基线相比,所提方法将mAP50提升了2.1%。大量实验表明,我们的方法在复杂水下场景中改善了目标检测,同时平衡了增强质量和处理效率。

English abstract

Due to wavelength-dependent light absorption and scattering, underwater images usually suffer from color distortion and blurred details, which limits underwater object detection performance. Existing underwater image enhancement methods mainly focus on improving visual quality and still struggle to balance enhancement quality, processing efficiency, and downstream detection performance. Therefore, this paper proposes an efficient dual-branch underwater image enhancement framework for object detection. The detail enhancement branch improves brightness and local contrast to recover texture details in dark regions. The color restoration branch uses adaptive compensation to reduce color distortion and improve color gradation. By combining the complementary outputs of the two branches, the proposed framework provides clearer and more informative images for object detection. On the UIEB and EUVP datasets, the proposed method achieves UIQM scores of 2.249 and 2.576, respectively. When applied to the YOLOv8 detection task on the URPC dataset, the proposed method improves mAP50 by 2.1% compared with the baseline. Extensive experiments show that our method improves object detection in complex underwater scenes while balancing enhancement quality and processing efficiency.

2606.15832 2026-06-16 cs.LG math.OC New submission

SILAGE: Memory-Efficient, Full-Gradient-Free Nonconvex Optimization for Nested Finite Sums

SILAGE: 针对嵌套有限和的内存高效、完全无全梯度的非凸优化

Igor Sokolov, Laurent Condat, Peter Richtárik

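AI summary: For nonconvex optimization with a nested double finite-sum structure on large-scale data, proposes the SILAGE algorithm, which exploits the double-sum structure to avoid global full-gradient refreshes, needs only O(n) memory, and comes with a convergence analysis that adapts to across-group and within-group heterogeneity.

Comments 80 pages, 3 algorithms, 4 theorems, 2 corollaries, 11 lemmas, 2 figures, 12 tables

Details
AI Chinese abstract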

大规模数据集上的经验风险最小化自然呈现出嵌套的双有限和结构,其中 $N=nm$ 个总样本被逻辑或物理地划分为 $n$ 个大小为 $m$ 的块(例如,在池化数据孤岛、核外学习或有意分层中)。虽然方差缩减方法对非凸目标实现了最优的 oracle 复杂度,但在此集中式场景中它们遭受严重的扩展瓶颈。递归估计器(如 PAGE)需要定期对所有 $nm$ 个样本进行全局全梯度刷新,这在计算上代价高昂。相反,单循环方法(如 SILVER)避免了此类刷新,但需要不切实际的 $\mathcal{O}(nm)$ 内存来存储每个样本的控制变量。在本文中,我们提出了 SILAGE,一种解决此权衡的方差缩减算法。通过主动利用双和结构,SILAGE 消除了对所有 $nm$ 组件的周期性全局全梯度刷新(每次迭代最多评估一个局部组梯度),同时仅需 $\mathcal{O}(n)$ 内存。此外,我们提供了严格的收敛分析,避免了悲观的最坏情况 Lipschitz 常数。相反,SILAGE 的复杂度通过嵌套的函数相似性(组间异质性 $\delta_1$ 和组内异质性 $\delta_2$)自然地适应底层数据几何。我们的结果在几个实际相关场景中改进了现有的最先进界限。

English abstract

Empirical risk minimization on massive datasets naturally exhibits a nested double finite-sum structure, where $N=nm$ total samples are logically or physically partitioned into $n$ blocks of size $m$ (e.g., in pooled data silos, out-of-core learning, or deliberate stratification). While variance-reduced methods achieve optimal oracle complexities for nonconvex objectives, they suffer from severe scaling bottlenecks in this centralized regime. Recursive estimators, such as PAGE, require periodic global full-gradient refreshes over all $nm$ samples, which are computationally expensive. Conversely, single-loop methods, such as SILVER, avoid such refreshes but require an impractical $\mathcal{O}(nm)$ memory footprint to store a control variate for every sample. In this paper, we propose SILAGE, a variance-reduced algorithm that addresses this trade-off. By actively exploiting the double-sum structure, SILAGE eliminates periodic global full-gradient refreshes over all $nm$ components (evaluating at most one local group gradient per iteration) while requiring only $\mathcal{O}(n)$ memory. Furthermore, we provide a tight convergence analysis that avoids pessimistic worst-case Lipschitz constants. Instead, SILAGE's complexity natively adapts to the underlying data geometry via nested functional similarities: across-group ($\delta_1$) and within-group ($\delta_2$) heterogeneity. Our results improve existing state-of-the-art bounds in several practically relevant regimes.
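
The nested double finite-sum structure described in the abstract can be written out explicitly. The display below is a sketch in assumed notation: the symbols $x$, $d$, $f_i$, and $f_{ij}$ are introduced here for illustration and do not appear in the abstract, which only specifies $N = nm$ samples split into $n$ blocks of size $m$.

```latex
\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\qquad
f_i(x) = \frac{1}{m} \sum_{j=1}^{m} f_{ij}(x)
```

In this notation, storing one control variate per sample $f_{ij}$ costs $\mathcal{O}(nm)$ memory, whereas storing one per block $f_i$ costs $\mathcal{O}(n)$, which is the gap between the two memory footprints quoted in the abstract.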