arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21125 2026-06-02 cs.LG

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

Xixiang He, Qiyao Sun, Ao Cheng, Xingming Li, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu

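Affiliations: University of Science and Technology of China

AI summary: To address advantage collapse in GRPO under reinforcement learning from verifiable rewards, the paper proposes the diagnostic metric ACR and the lightweight extension AVSPO, which injects virtual reward samples to reduce vanishing gradients and improve model reasoning.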

Comments Accepted at the International Conference on Machine Learning (ICML 2026). Project page: https://QingyongHu.github.io/AVSPO

Abstract

Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples, guided by real-time ACR monitoring, to enable learning from homogeneous groups without additional model rollouts. AVSPO reduces advantage collapse by 58-63% relative to GRPO and yields consistent accuracy gains of 4-6 percentage points across all model scales, while maintaining generalization on the evaluated out-of-domain task. Code and datasets are available at https://github.com/hexixiang/Advantage-Collapse-Rate.
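The collapse described in the abstract can be illustrated with a minimal sketch. It assumes the usual GRPO-style advantage (each reward standardized within its group) and treats ACR as the fraction of groups whose rewards are all identical; the paper's exact definition (per batch, with its own threshold) may differ, and the function names and `tol` parameter here are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its group.
    Assumption: population std with a small eps in the denominator."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def advantage_collapse_rate(groups, tol=1e-8):
    """Fraction of groups with (near-)identical rewards, i.e. all advantages ~0.
    Assumption: this per-group ratio stands in for the paper's ACR."""
    collapsed = sum(1 for g in groups if pstdev(g) <= tol)
    return collapsed / len(groups)

groups = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: every advantage is zero
    [0.0, 0.0, 0.0, 0.0],  # all incorrect: every advantage is zero
    [1.0, 0.0, 0.0, 1.0],  # mixed: informative advantages
    [0.0, 1.0, 0.0, 0.0],  # mixed: informative advantages
]
acr = advantage_collapse_rate(groups)
collapsed_adv = group_advantages(groups[0])
mixed_adv = group_advantages(groups[2])
```

In this toy batch half of the groups contribute no gradient signal, which is the situation AVSPO is described as targeting; the virtual-sample injection itself is not sketched here because the abstract does not specify it.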

2605.20854 2026-06-02 cs.LG

Finite-Time Regret Analysis of Retry-Aware Bandits

Bingkui Tong, Junpei Komiyama, Soichiro Nishimori, Paavo Parmas

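Affiliations: Mohamed bin Zayed University of Artificial Intelligence; RIKEN AIP; The University of Tokyo

AI summary: Studies ReMax, a stochastic bandit algorithm for retry-aware objectives such as pass@k and max@k; characterizes its optimal sampling distribution through an expected-improvement balance condition, proves the first sublinear regret bound, and explains why it can be more exploitative than Thompson sampling.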

Comments 38 pages

Abstract

We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.
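To make the $M=2$ objective concrete, here is a rough sketch under stated assumptions: arm values have independent Gaussian posteriors, two arms are drawn i.i.d. from a sampling distribution `p`, and drawing the same arm twice yields a single latent value (so that pair contributes the posterior mean). This reading of the objective and all function names are assumptions, not taken from the paper; the closed form for the expected maximum of two independent Gaussians is standard.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_max(mu_i, var_i, mu_j, var_j):
    """E[max(X, Y)] for independent Gaussians X ~ N(mu_i, var_i), Y ~ N(mu_j, var_j)."""
    theta = math.sqrt(var_i + var_j)
    if theta == 0.0:
        return max(mu_i, mu_j)
    z = (mu_i - mu_j) / theta
    return mu_i * normal_cdf(z) + mu_j * normal_cdf(-z) + theta * normal_pdf(z)

def remax_objective(p, mus, variances):
    """Posterior expected maximum over two virtual draws from p (assumed form)."""
    total = 0.0
    for i, p_i in enumerate(p):
        for j, p_j in enumerate(p):
            if i == j:
                pair_value = mus[i]  # same arm twice: one latent value
            else:
                pair_value = expected_max(mus[i], variances[i], mus[j], variances[j])
            total += p_i * p_j * pair_value
    return total

# Hypothetical posterior: arm 0 looks slightly better, arm 1 is far more uncertain.
mus = [0.5, 0.4]
variances = [0.01, 0.25]
greedy = remax_objective([1.0, 0.0], mus, variances)
mixed = remax_objective([0.5, 0.5], mus, variances)
```

Under these assumptions the mixed distribution scores higher than the greedy one, which shows how the objective can reward placing mass on an uncertain arm; finding the maximizing `p` is a separate optimization step not shown here.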

2605.20301 2026-06-02 cs.CV cs.AI

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

Wenxuan Li, Qin Zou, Shoubing Chen, Chi Chen, Yingyi Yang, Qingxiang Meng

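Affiliations: Tsinghua University

AI summary: Proposes the Co-Fusion4D framework, which uses a current-frame-dominant, history-frame-complementary mechanism and a Dual Attention Fusion module to resolve cross-frame spatiotemporal inconsistency in BEV detectors, reaching 74.9% mAP and 75.6% NDS on nuScenes.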

Abstract

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.

2605.20282 2026-06-02 cs.CV cs.AI

Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

Zhenyu Yu, Yangchen Zeng, Chunlei Meng, Guangzhen Yao, Shuigeng Zhou

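Affiliations: Fudan University; Southeast University; Northeast Normal University

AI summary: Proposes the Mirage framework, whose representation-level diagnostics show that existing vertical federated learning unlearning methods still retain class-structure information after passing output-level certification, and identifies an unlearning trilemma and a class-sample asymmetry.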

Abstract

Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.
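One of the four diagnostics, Centered Kernel Alignment, has a well-known linear form that can be sketched in a few lines. The version below is the standard linear CKA on column-centered feature matrices, written with plain lists for self-containment; whether Mirage uses the linear or a kernel variant is not stated in the abstract, and the toy feature matrices are hypothetical.

```python
import math

def center_columns(X):
    """Subtract the column mean from every entry of a row-major matrix."""
    n = len(X)
    means = [sum(row[k] for row in X) / n for k in range(len(X[0]))]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for matrices with the same number of rows."""
    total = 0.0
    for a in range(len(X[0])):
        for b in range(len(Y[0])):
            dot = sum(x[a] * y[b] for x, y in zip(X, Y))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator

# Hypothetical features for four samples (rows) from two models.
feats = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
rotated = [[-3.0 * y, 3.0 * x] for x, y in feats]  # rotated and rescaled copy
other = [[1.0], [-1.0], [-1.0], [1.0]]             # unrelated 1-D feature

same_structure = linear_cka(feats, rotated)
different_structure = linear_cka(feats, other)
```

Because linear CKA is invariant to rotation and isotropic scaling, the rotated copy scores about 1. In an audit like the one described, the comparison of interest would be the unlearned model's CKA to the original model versus its CKA to the retrained reference.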

2605.01752 2026-06-02 cs.LG

Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions

Youngmin Oh

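Affiliations: KAIST

AI summary: For volatile environments in which post-serving contexts, delayed feedback, and adversarial corruption occur together, proposes the RCDP-UCB algorithm, which predicts post-serving contexts and uses an adaptive weighting strategy to obtain a delay-regime-agnostic regret upper bound, and reveals an additive cost structure between corruption and delay.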

Abstract

We study linear dueling bandits in volatile environments characterized by the simultaneous presence of post-serving contexts, delayed feedback, and adversarial corruption. Feedback is subject to unknown stochastic or adversarial delays and a cumulative corruption budget $\mathcal{C}$. To address these challenges, we propose RCDP-UCB, which integrates a learned approximator that predicts post-serving contexts from pre-serving information. It further employs an adaptive weighting strategy that clips feature vectors to mitigate the impact of corrupted and delayed observations simultaneously. Under standard regularity conditions and a parametric post-serving mapping, we rigorously establish that our algorithm is delay-regime-agnostic, achieving a regret upper bound of $\widetilde{\mathcal{O}}(d(\sqrt{T} + \mathcal{C} + \mathcal{D}))$, where $d$ is the total feature dimension and $\mathcal{D}$ encapsulates the delay complexity. Crucially, our analysis reveals an additive cost structure between corruption and delay, avoiding the multiplicative degradation typical of prior works. We further establish lower bounds that nearly match our upper bounds up to a $\sqrt{d}$ factor for adversarial delays in the absence of post-serving contexts. Code is available at https://github.com/youngmin0oh/rcdp-public.

2605.19575 2026-06-02 cs.CL

A Data-Driven Approach to Idiomaticity Based on Experts' Criteria in Theoretical Linguistics

Elena Mikhalkova, Anastasiya Vishnyakova, Anastasiya Drozdova, Polina Gavin, Aleksander Zhmykhov, Timofey Protasov

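Affiliations: arXiv.org

AI summary: Analyzes the distribution of idiomaticity across 286 multi-word expressions annotated by linguistics experts against 16 criteria from theoretical linguistics, finding that no expression is absolutely idiomatic and that lexical criteria are the most influential.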

Abstract

This article presents a data analysis of 286 multi-word expressions (MWEs) based on 16 lexical, grammatical, and other criteria of idiomaticity described in theoretical books and papers. The MWEs were collected from the same theoretical sources, and a group of linguistics experts annotated them with these categories. The distribution of categories shows that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; and the presence of obsolete words and grammar influences the ability of an MWE to be replaced with one word.

2605.19344 2026-06-02 cs.CL

Retrieval-Augmented Linguistic Calibration

Yi-Fan Yeh, Linwei Tao, Minjing Dong, Tao Huang, Jialin Yu, Philip Torr, Chang Xu

Affiliations: School of Computer Science, University of Sydney; City University of Hong Kong; Shanghai Jiao Tong University; Department of Engineering Science, University of Oxford

AI summary: Proposes the Retrieval-Augmented Linguistic Calibration (RALC) framework, which uses distributional modeling and retrieval-augmented rewriting to improve the faithfulness and calibration of linguistic confidence expressions.

Abstract

Linguistic cues such as "I believe" and "probably" offer an intuitive interface for communicating confidence, yet a generalisable, principled calibration framework for linguistic confidence expressions remains underexplored. In particular, co-occurring linguistic cues, contextual variation, and subjective audience interpretation pose unique challenges. We therefore model linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing interpretation variability that scalar representations discard. Within this distributional framework, we introduce faithfulness as a complementary evaluation dimension and present Faithfulness Divergence (FD), an information-theoretic metric quantifying the surprise induced in audience beliefs upon truth revelation. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language via retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration by up to 66% and 58%, respectively, outperforming black-box and grey-box calibration baselines.

2605.19263 2026-06-02 cs.LG cs.NA math.NA

From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models

Jianan Yang, Yiran Wang, Shuai Li, Fujun Cao, Xuefei Yan, Junmin Liu

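Affiliations: University of Science and Technology of China

AI summary: Proposes CGMPINN, which fits a Gaussian mixture model to the PDE residual distribution to quantify learning difficulty and uses dynamic curriculum learning to move gradually from easy to hard regions, improving PINN convergence and accuracy on problems with strong nonlinearity, sharp gradients, or multiscale features.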

Comments 23 pages, 15 figures

Abstract

Physics-informed neural networks (PINNs) offer a mesh-free framework for solving partial differential equations (PDEs), yet training often suffers from gradient pathologies, spectral bias, and poor convergence, especially for problems with strong nonlinearity, sharp gradients, or multiscale features. We propose the Curriculum-Guided Gaussian Mixture Physics-Informed Neural Network (CGMPINN), which integrates Gaussian mixture modeling with dynamic curriculum learning. Specifically, a GMM is periodically fitted to the PDE residual distribution to quantify spatially varying learning difficulty. A smooth curriculum schedule progressively shifts training focus from easy to harder regions, while precision-based variance modulation suppresses unreliable clusters during early optimization. This dual curriculum is governed by a shared curriculum parameter and can be combined with self-adaptive loss balancing. We further establish theoretical guarantees, including sublinear convergence of the gradient norm for the induced time-varying loss, uniform equivalence between the curriculum-weighted and standard PDE losses, and a generalization bound with an explicit weighting-induced bias characterization. Experiments on six benchmark PDEs spanning elliptic, parabolic, hyperbolic, advection-dominated, and nonlinear reaction-diffusion types show that CGMPINN consistently achieves the lowest relative $L_2$ and maximum absolute errors among all compared methods, reducing relative $L_2$ error by up to 97.8% over the standard PINN at comparable cost. Our code is publicly available at https://github.com/Mathematics-Yang/CGMPINN.

2605.17839 2026-06-02 cs.LG cs.AI

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

Anh B. H. Nguyen, Ba Tho Phan, Viet Cuong Ta

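Affiliations: VNU University of Engineering and Technology

AI summary: Proposes BiKD, a bilevel framework that balances hard and soft losses with adaptive sample-level weights to address the brittleness of knowledge distillation on imbalanced data.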

Comments Accepted to Special Session: Data Science: Foundations and Applications (DSFA), PAKDD 2026

Abstract

Knowledge distillation transfers knowledge from a high-capacity teacher to a compact student using a mixture of hard and soft losses. On imbalanced data, a fixed weighting between hard and soft losses makes the learning process brittle. Recent studies try to reweight these components in long-tailed settings. However, most of these methods do not adapt weights at the sample level and do not take into account the student's behavior during training. To address this, we propose BiKD, a bilevel framework that dynamically balances hard and soft losses for each sample. We employ a weight generation network that produces adaptive per-sample weights, guided by a small balanced validation set. The student is now trained with an unconstrained combination of weighted hard and soft losses, allowing the student to relax both terms. We further propose a multi-step SGD strategy to optimize the weight model more accurately and efficiently. Experiments on long-tailed CIFAR-10/100 show that our approach surpasses recent balanced distillation methods across imbalance factors.
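The per-sample combination of hard and soft losses can be sketched as follows. This is only an illustration of the student-side objective: the hard loss is cross-entropy against the label, the soft loss is a temperature-scaled KL divergence from teacher to student, and the per-sample weights are fixed numbers supplied by hand. In BiKD they would come from the weight generation network trained in the outer level, which is not shown; the function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

def hard_loss(student_logits, label):
    """Cross-entropy between the student prediction and the ground-truth label."""
    return -math.log(softmax(student_logits)[label])

def soft_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)

def weighted_kd_loss(batch, weights, temperature=2.0):
    """Mean of w_hard * hard + w_soft * soft, with one (w_hard, w_soft) pair per sample.
    The weights are unconstrained: they need not sum to one."""
    total = 0.0
    for (s_logits, t_logits, label), (w_hard, w_soft) in zip(batch, weights):
        total += w_hard * hard_loss(s_logits, label)
        total += w_soft * soft_loss(s_logits, t_logits, temperature)
    return total / len(batch)

# Hypothetical two-sample batch: (student logits, teacher logits, label).
batch = [
    ([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], 0),  # head-class sample
    ([0.2, 0.1, 0.0], [-1.0, 0.5, 2.5], 2),   # tail-class sample
]
weights = [(1.0, 0.3), (0.4, 1.5)]  # hand-picked stand-ins for generated weights
loss = weighted_kd_loss(batch, weights)
```

Setting a sample's pair to `(1.0, 0.0)` recovers plain supervised training for that sample, and `(0.0, 1.0)` recovers pure distillation, so a fixed global mixing coefficient is the special case in which every sample shares the same pair.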

2605.05945 2026-06-02 cs.CV cs.CL

MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathore, Pratyush Patnaik, Shubhanshu Khatana, Ekaksh Janweja

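Affiliations: University of California, Berkeley; Stanford University; University of Washington; University of California, Los Angeles; University of California, Santa Barbara

AI summary: Proposes the MobileEgo Anywhere framework, which uses smartphone sensors to capture egocentric trajectories longer than one hour, and releases the open-source processing pipeline STERA, a mobile app, and a 200-hour dataset, validating their usefulness for training vision-language-action models.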

Abstract

Vision-language-action (VLA) models have driven demand for large-scale egocentric datasets, yet the hardware and infrastructure to collect long-horizon data remain inaccessible. Datasets today typically have episodes only a few minutes long, which fails to capture the long-horizon temporal dependencies that complex robotic task execution requires. We present MobileEgo Anywhere, a framework for collecting hour-plus egocentric trajectories on commodity mobile hardware that uses modern smartphone sensors for long-term pose tracking without the hardware barriers of traditional robotics data collection. We release three components: (1) STERA, an open-source video-processing pipeline that converts raw mobile captures into standardized, training-ready formats for VLA and foundation-model research; (2) a free mobile app that lets any user record egocentric activity; and (3) a 200-hour dataset of diverse, long-form egocentric data with persistent state tracking across 584 sessions. We further show this data is a usable training signal: mid-training a VLA on it lowers held-out action-prediction error.

2601.22285 2026-06-02 cs.LG

Demystifying Mergeability: Interpretable Properties to Predict Model Merging Success

Luca Zhou, Bo Zhao, Rose Yu, Emanuele Rodolà

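Affiliations: University of California, Berkeley; Stanford University

AI summary: Using an architecture-agnostic framework and L1-regularized linear optimization, finds that merging success depends on the merging method and the partner tasks, and that gradient alignment is the most fundamental compatibility signal.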

Comments 9 pages of main paper, 3 figures in the main paper, 4 tables in the main paper, many more figures and tables in the appendix

Abstract

Model merging combines knowledge from separately fine-tuned models, yet the factors driving its success remain poorly understood. While recent work treats mergeability as an intrinsic property of the models, we show with an architecture-agnostic framework that it fundamentally depends on both the merging method and the partner tasks. Using L1-regularized linear optimization over a set of interpretable pairwise metrics (e.g., gradient $L_2$ distance), we uncover properties correlating with post-merge normalized accuracy across five merging methods. We find architecture- and method-specific variation in success drivers (64.0% average top-5 metric overlap; 79.3% sign agreement), with certain methods, notably TIES, exhibiting distinct "fingerprints" that diverge from the broader consensus. Crucially, however, gradient alignment metrics consistently emerge as the most fundamental signals of compatibility. These findings provide a diagnostic foundation for understanding mergeability and motivate future merge-aware fine-tuning strategies.
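Two pairwise metrics of the kind mentioned above can be sketched on flattened gradient vectors. The $L_2$ distance is named in the abstract; using cosine similarity as the "gradient alignment" metric is an assumption, as are the function names and the toy gradients.

```python
import math

def l2_distance(grad_a, grad_b):
    """Euclidean distance between two flattened gradient vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(grad_a, grad_b)))

def cosine_alignment(grad_a, grad_b):
    """Cosine similarity: +1 for parallel gradients, -1 for opposing ones."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    return dot / (norm_a * norm_b)

# Hypothetical gradients of three fine-tuning tasks at a shared checkpoint.
grad_task_1 = [0.5, -1.0, 2.0]
grad_task_2 = [1.0, -2.0, 4.0]    # same direction as task 1, larger magnitude
grad_task_3 = [-0.5, 1.0, -2.0]   # opposite direction to task 1

aligned = cosine_alignment(grad_task_1, grad_task_2)
conflicting = cosine_alignment(grad_task_1, grad_task_3)
distance_12 = l2_distance(grad_task_1, grad_task_2)
```

Tasks 1 and 2 are perfectly aligned yet sit at a nonzero $L_2$ distance, which shows why the two metrics can carry different information as features in a regression on post-merge accuracy.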

2411.19093 2026-06-02 cs.CV cs.CY cs.LG

Seeing SDG 6 from space: local-scale monitoring of piped water and sewage system access across Africa using satellite imagery and self-supervised learning

Othmane Echchabi, Aya Lahlou, Nizar Talty, Josh Malcolm Manto, Tongshu Zheng, Ka Leung Lam

Affiliations: Mila – Quebec AI Institute; School of Computer Science, McGill University; Department of Earth and Environmental Engineering, Columbia University; Center for Learning the Earth with Artificial Intelligence and Physics (LEAP); Division of Natural and Applied Sciences, Duke Kunshan University

AI summary: Using Sentinel-2 imagery, Afrobarometer survey data, 30 m population data, and DINO self-supervised Vision Transformer features, this study develops a scalable remote-sensing framework that estimates piped water and sewage system access at approximately 2.56 km resolution. The best model reaches AUROC values of 91.54% and 93.24%, respectively, aligns with WHO/UNICEF JMP statistics ($R^2 = 0.92$ for piped water, $R^2 = 0.72$ for sewage), and reveals fine-grained environmental inequality in a Nigeria case study.

Comments Under Review

Abstract

Access to drinking water and sanitation is essential for health and well-being, yet major disparities remain, especially in data-scarce regions such as Africa. SDG 6 aims for universal access, but current monitoring relies on costly, infrequent, and spatially uneven surveys and censuses with long reporting delays. This study develops a scalable remote-sensing framework to estimate piped water and sewage system access at approximately 2.56 km resolution using Sentinel-2 imagery, Afrobarometer survey responses, 30 m population data, and DINO self-supervised Vision Transformer features. The best model achieves AUROC values of 91.54% for piped water and 93.24% for sewage access. Across 50 African countries, population-weighted estimates strongly align with WHO/UNICEF JMP statistics for piped water ($R^2 = 0.92$) and show meaningful agreement for sewage access ($R^2 = 0.72$). In countries without Afrobarometer coverage, MAEs are 9.5% and 10.7%, with estimates within 15% of JMP values for 121.4 million and 159.7 million people, respectively. A Nigeria case study across 767 Local Government Areas (LGAs) shows that the framework reveals fine-scale environmental inequality. The largest no-access burdens reach 1.155 million people for piped water and 1.452 million for sewage, 7.9 and 8.3 times the median LGA burden, while top-decile no-access thresholds of 0.805 and 0.952 indicate that deprivation is widespread. These findings show that DINO-based satellite models can complement household surveys with low-cost, spatially detailed evidence for SDG 6 monitoring, infrastructure targeting, and environmental equity assessment.
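The step from tile-level predictions to a population-weighted national estimate can be sketched as a weighted mean, together with the mean absolute error used for comparison against reference statistics. This is a minimal sketch assuming each tile carries a population count and a predicted access probability; the function names and numbers are hypothetical and are not results from the paper.

```python
def population_weighted_access(tiles):
    """tiles: list of (population, predicted_access_probability) pairs."""
    total_population = sum(pop for pop, _ in tiles)
    if total_population == 0:
        raise ValueError("total population must be positive")
    return sum(pop * prob for pop, prob in tiles) / total_population

def mean_absolute_error(estimates, references):
    """Average absolute gap between estimates and reference values."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(estimates)

# Hypothetical tiles for one country: a dense served tile, a larger unserved
# tile, and an empty tile that does not affect the weighted estimate.
tiles = [(1000, 0.9), (3000, 0.2), (0, 0.5)]
national_estimate = population_weighted_access(tiles)

# Hypothetical country-level estimates versus reference statistics.
mae = mean_absolute_error([0.4, 0.6], [0.5, 0.5])
```

Weighting by population rather than averaging tiles equally keeps sparsely populated areas from dominating the national figure, which matters when the output is compared against survey-based statistics that count people, not land area.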

2604.23765 2026-06-02 cs.LG cs.NE math.FA

Necessary and sufficient conditions for universality of Kolmogorov-Arnold networks

Vugar Ismailov

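Affiliations: arXiv.org

AI summary: Analyzes the universal approximation property of Kolmogorov-Arnold networks (KANs), showing that universality fails when all edge functions are affine and that adding a single non-affine function restores it, and gives necessary and sufficient conditions for deep KANs and for KANs with exactly two hidden layers.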

Comments 19 pages; two corollaries from Section 6 removed and generalized in arXiv:2605.26550

Abstract

We analyze the universal approximation property of Kolmogorov-Arnold Networks (KANs) in terms of their edge functions. If these functions are all affine, then universality clearly fails. How many non-affine functions are needed, in addition to affine ones, to ensure universality? We show that a single one suffices. More precisely, we prove that deep KANs in which all edge functions are either affine or equal to a fixed continuous function $\sigma$ are dense in $C(K)$ for every compact set $K\subset\mathbb{R}^n$ if and only if $\sigma$ is non-affine. In contrast, for KANs with exactly two hidden layers, universality holds if and only if $\sigma$ is nonpolynomial. We further show that the full class of affine functions is not required; it can be replaced by a finite set without affecting universality. In particular, in the nonpolynomial case, a fixed family of five affine functions suffices when the depth is arbitrary. More generally, for every continuous non-affine function $\sigma$, there exists a finite affine family $A_\sigma$ such that deep KANs with edge functions in $A_\sigma\cup\{\sigma\}$ remain universal. We also prove that KANs with the spline-based edge parameterization introduced by Liu et al. (2024) are universal approximators in the classical sense, even when the spline degree and knot sequence are fixed in advance.
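The remark that universality fails when every edge function is affine can be checked numerically with a small sketch. A KAN layer computes each output as a sum of univariate edge functions applied to the inputs; if all edges are affine, the whole network is affine and therefore satisfies the midpoint identity $f((u+v)/2) = (f(u)+f(v))/2$. Swapping one edge for a non-affine $\sigma$ breaks that identity. The network shapes, coefficients, and the choice `sigma = tanh` below are hypothetical, and the sketch illustrates only the easy direction, not the density theorem.

```python
import math

def kan_layer(x, edge_fns):
    """One KAN layer: output j is the sum over inputs i of edge_fns[j][i](x[i])."""
    return [sum(phi(x_i) for phi, x_i in zip(row, x)) for row in edge_fns]

def kan_forward(x, layers):
    for edge_fns in layers:
        x = kan_layer(x, edge_fns)
    return x

def affine(a, b):
    """Return the affine edge function t -> a * t + b."""
    return lambda t: a * t + b

sigma = math.tanh  # hypothetical choice of the single non-affine edge function

# Two inputs -> two hidden units -> one output.
affine_only = [
    [[affine(2.0, 1.0), affine(-1.0, 0.5)],
     [affine(0.5, 0.0), affine(3.0, -2.0)]],
    [[affine(1.0, 0.0), affine(-0.5, 1.0)]],
]
with_sigma = [
    [[affine(2.0, 1.0), sigma],  # one edge replaced by sigma
     [affine(0.5, 0.0), affine(3.0, -2.0)]],
    [[affine(1.0, 0.0), affine(-0.5, 1.0)]],
]

def midpoint_gap(layers, u, v):
    """|f(midpoint) - average of f(u), f(v)|; zero for any affine f."""
    mid = [(a + b) / 2.0 for a, b in zip(u, v)]
    f_mid = kan_forward(mid, layers)[0]
    f_avg = (kan_forward(u, layers)[0] + kan_forward(v, layers)[0]) / 2.0
    return abs(f_mid - f_avg)

u, v = [0.0, 0.0], [2.0, 2.0]
gap_affine = midpoint_gap(affine_only, u, v)
gap_sigma = midpoint_gap(with_sigma, u, v)
```

An affine-only network can only represent affine functions, so it cannot be dense in $C(K)$; the nonzero gap for `with_sigma` shows that a single non-affine edge already takes the network outside that class.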

2605.18077 2026-06-02 cs.AI cs.LG cs.MA

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

Sangjun Bae, Yisak Park, Sanghyeon Lee, Seungyul Han

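Affiliations: KAIST

AI summary: Proposes the LMAC framework, which uses the reasoning ability of large language models to design a communication protocol that lets all agents reconstruct the underlying state as accurately and uniformly as possible, improving state reconstruction and performance in multi-agent reinforcement learning.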

Comments 9 pages for main, 32 pages for total, Accepted to ICML 2026

Abstract

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.

2605.17921 2026-06-02 cs.CV

An Efficient Streaming Video Understanding Framework with Agentic Control

Jinming Liu, Jianguo Huang, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Zongyu Guo, Bin Li, Wenjun Zeng, Yan Lu, Xin Jin

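Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo, China; Microsoft Research Asia

AI summary: Proposes the R3-Streaming framework, which combines cascaded control (memory compression, response-readiness judgment, compute routing), an age-aware forgetting policy, and target-balanced reinforcement learning (TB-GRPO) to perform streaming video understanding under strict latency budgets, reaching state-of-the-art results while cutting visual token usage by 95-96%.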

Abstract

Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.

2605.17909 2026-06-02 cs.AI cs.LO

Ethical Hyper-Velocity (EHV): A Hardware-Rooted Zero-Trust Runtime Enforcement Architecture for Agentic AI Systems

伦理超高速 (EHV):一种面向代理型AI系统的硬件根零信任运行时强制架构

Riddhi Mohan Sharma

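Affiliations * Senior Member, IEEE

AI summary Proposes the Ethical Hyper-Velocity (EHV) architecture, which combines grammar-constrained decoding, a causal-graph CRDT, trusted execution environments, and OSCAL audit logging to provide hardware-rooted zero-trust runtime enforcement. It embeds the policy enforcement point in the inference pipeline, which lowers governance latency and supports formal verification.

Comments 12 pages, 3 TikZ Figures, 3 Tables

Details
AI Chinese abstract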

随着自主代理系统在受监管的关键基础设施中规模化部署,缺乏针对高频策略更新的机械性、硬件根强制机制构成了基本的安全缺口。我们提出伦理超高速 (EHV),一种面向代理系统的治理感知运行时强制架构,它结合了用于内联策略约束令牌生成的语法约束解码 (GCD)、基于向量时钟排序的因果图CRDT策略同步、可信执行环境 (TEE) 中的硬件证明执行以及OSCAL格式的机器可读审计日志。与引入14-30天策略延迟的事后审计框架(如ISO/IEC 42001、NIST AI RMF)不同,EHV通过治理感知即时 (JIT) 编译器将策略执行点 (PEP) 重新定位到推理管线中。在明确陈述的假设下,该架构降低了强制延迟,提高了可追溯性,并支持有界模型中的安全不变量的形式化验证。我们通过TLA+模型检查证明,在验证的有界运行状态空间(生成1738个状态,324个不同状态,深度8,零违规)中,不合规的代理行为是不可达的。在这些条件下,O(1)运行时强制减少了部署速度与治理完整性之间的传统权衡,将治理延迟从O(天)降至O(1)。EHV的差异化贡献在于将GCD、因果CRDT、TEE证明缓存和有界形式化验证集成到一个单一的、硬件根的强制架构中——这是任何同期系统都未实现的组合。该架构通过儿科肿瘤剂量用例进行演示,适用于包括医疗、金融合规和关键基础设施控制在内的受监管关键基础设施。

English abstract

As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for high-frequency policy updates presents a fundamental safety gap. We present Ethical Hyper-Velocity (EHV), a governance-aware runtime enforcement architecture for agentic systems that combines Grammar-Constrained Decoding (GCD) for inline policy-constrained token generation, Causal Graph CRDT-based policy synchronization with vector-clock ordering, hardware-attested execution in Trusted Execution Environments (TEEs), and OSCAL-formatted machine-readable audit logging. Unlike retrospective auditing frameworks (ISO/IEC 42001, NIST AI RMF) that introduce 14-30 day policy latencies, EHV relocates the Policy Enforcement Point (PEP) into the inference pipeline via a Governance-Aware Just-In-Time (JIT) Compiler. Under explicitly stated assumptions, the architecture reduces enforcement latency, improves traceability, and supports formal verification of safety invariants in a bounded model. We demonstrate via TLA+ model checking that non-compliant agentic actions were unreachable in the verified bounded operating state space (1,738 states generated, 324 distinct, depth 8, zero violations). Under these conditions, O(1) runtime enforcement reduces the traditional trade-off between deployment velocity and governance integrity, targeting Governance Latency from O(days) toward O(1). EHV's differentiating contribution is the integration of GCD, Causal CRDT, TEE attestation caching, and bounded formal verification into a single, hardware-rooted enforcement architecture -- a combination not achieved by any contemporaneous system. The architecture is demonstrated through a pediatric oncology dosage use case, with applicability to regulated critical infrastructures including healthcare, financial compliance, and critical infrastructure control.
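
The abstract mentions vector-clock ordering for policy synchronization but gives no implementation detail. As a minimal sketch of the general technique only (not EHV's code), the Python below compares and merges two vector clocks. The function names `compare`, `merge`, and `should_apply`, and the node IDs, are hypothetical.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node ID -> number of events seen from that node


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a versus b."""
    keys = set(a) | set(b)  # a missing entry counts as 0
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum: the clock that has seen everything a and b have seen."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def should_apply(local: VectorClock, incoming: VectorClock) -> bool:
    """Hypothetical rule: accept a policy update only if it causally follows local state."""
    return compare(local, incoming) == "before"


local_clock = {"node_a": 2}
incoming_clock = {"node_a": 1, "node_b": 1}
relation = compare(local_clock, incoming_clock)
merged_clock = merge(local_clock, incoming_clock)
```

Here the two clocks are concurrent, so a rule like `should_apply` would reject the update; a CRDT would typically resolve such cases with a deterministic merge rather than by dropping one side.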

2605.12969 2026-06-02 cs.LG cs.AI

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

从对比视角重新审视基于可验证奖励的强化学习

Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang

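Affiliations * Beijing Institute of Technology; Qwen Business Unit of Alibaba; The Chinese University of Hong Kong, Shenzhen

AI summary Proposes ConSPO, a contrastive sequence-level policy optimization method that addresses two objective-level limitations of GRPO, likelihood misalignment and score-insensitive credit assignment, and outperforms strong baselines on reasoning tasks.

Details
AI Chinese abstract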

组相对策略优化(GRPO)是目前最广泛采用的RLVR算法之一,用于对大型语言模型进行推理任务的后训练。我们首先证明GRPO存在等价的判别式重新表述,其中策略优化最大化验证的正负rollout之间的期望得分差距。这种重新表述揭示了两个目标层面的局限性:似然错配的替代得分(优化的是基于裁剪比率的得分而非控制生成的序列似然)和得分不敏感的信用分配(rollout级别的信用不反映当前正负rollout之间的得分差距)。为了解决这些局限性,我们提出ConSPO,一种对比序列级策略优化方法,它使用长度归一化的序列对数概率作为rollout得分,并在同一组内对比验证的正rollout与负干扰项。ConSPO优化一个组级别的InfoNCE风格目标,以自适应地增强对分离不佳的正样本和高分负样本的更新,同时结合课程调度的边界,在训练过程中保持分离压力。在多种设置下的实验表明,ConSPO在具有挑战性的推理基准上优于强基线。代码将在论文被接收后发布。

English abstract

Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts. This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores, in which clipped ratio-based scores are optimized rather than the sequence likelihoods that govern generation, and score-insensitive credit assignment, in which rollout-level credit does not reflect the current score gaps between positive and negative rollouts. To address these limitations, we propose ConSPO, a Contrastive Sequence-level Policy Optimization method that uses length-normalized sequence log-probabilities as rollout scores and contrasts verified positive rollouts against negative distractors within the same group. ConSPO optimizes a group-wise InfoNCE-style objective to adaptively strengthen updates for poorly separated positives and high-scoring negatives, together with a curriculum-scheduled margin that preserves separation pressure as training progresses. Experiments across diverse settings show that ConSPO outperforms strong baselines on challenging reasoning benchmarks. Code will be released upon paper acceptance.
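
The two ingredients named in the abstract, length-normalized sequence log-probabilities as rollout scores and a group-wise InfoNCE-style contrast between verified positives and negatives, can be illustrated with a short sketch. The abstract does not give the exact objective, so the form below is an assumption: the `temperature` and `margin` parameters and both function names are hypothetical, and the curriculum schedule for the margin is omitted.

```python
import math
from typing import List, Sequence


def length_normalized_score(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of one rollout."""
    return sum(token_logprobs) / len(token_logprobs)


def group_infonce_loss(
    scores: Sequence[float],
    is_positive: Sequence[bool],
    temperature: float = 1.0,  # assumed hyperparameter
    margin: float = 0.0,       # assumed fixed margin subtracted from the positive score
) -> float:
    """Average InfoNCE-style loss over the positives of one rollout group."""
    positives = [s for s, p in zip(scores, is_positive) if p]
    negatives = [s for s, p in zip(scores, is_positive) if not p]
    if not positives or not negatives:
        return 0.0  # a homogeneous group carries no contrastive signal
    losses: List[float] = []
    for s_pos in positives:
        logits = [(s_pos - margin) / temperature] + [s / temperature for s in negatives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_z - logits[0])  # equals -log softmax(positive)
    return sum(losses) / len(losses)


# A well-separated positive gives a small loss; a poorly separated one gives a large loss.
loss_separated = group_infonce_loss([-0.1, -2.0], [True, False])
loss_confused = group_infonce_loss([-2.0, -0.1], [True, False])
```

Because the loss is a softmax cross-entropy, its gradient is largest for positives that score below strong negatives, which matches the adaptive strengthening the abstract describes.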

2603.05308 2026-06-02 cs.CL cs.AI

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Med-V1:用于零样本和可扩展生物医学证据归因的小型语言模型

Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu

Affiliations * Division of Intramural Research, National Library of Medicine, National Institutes of Health; Department of Computer Science, University of Virginia; Center for Cancer Research, National Cancer Institute, National Institutes of Health; Department of Population Health Sciences, Weill Cornell Medicine; Institute of AI for Digital Health, Weill Cornell Medicine

AI summary Proposes Med-V1, a family of small language models with only 3B parameters trained on high-quality synthetic data. It performs comparably to frontier models such as GPT-5 on biomedical evidence attribution and is used to quantify LLM hallucinations and to identify evidence misattributions in clinical guidelines.

Details
AI Chinese abstract

评估一篇文章是否支持某个断言对于幻觉检测和声明验证至关重要。虽然大型语言模型(LLM)有潜力自动化这一任务,但实现强性能需要如GPT-5这样的前沿模型,而这些模型在规模部署时成本过高。为了高效执行生物医学证据归因,我们提出了Med-V1,一个仅有30亿参数的小语言模型家族。在本研究中新开发的高质量合成数据上训练,Med-V1在统一为验证格式的五个生物医学基准上显著优于其基础模型(+27.0%至+71.3%)。尽管规模较小,Med-V1的性能与GPT-5等前沿LLM相当,并提供高质量的预测解释。我们使用Med-V1进行了首次用例研究,量化了不同引用指令下LLM生成答案中的幻觉。结果表明,格式指令强烈影响引文有效性和幻觉,GPT-5生成更多声明但表现出与GPT-4o相似的幻觉率。此外,我们展示了第二个用例,表明Med-V1可以自动识别临床实践指南中的高风险证据错误归因,揭示了否则难以大规模识别的潜在负面公共卫生影响。总体而言,Med-V1为生物医学证据归因和验证任务的实际应用提供了一种高效、准确的轻量级替代方案。Med-V1可在https://github.com/ncbi-nlp/Med-V1获取。

English abstract

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.

2512.05335 2026-06-02 cs.RO

State-Conditional Adversarial Learning: An Off-Policy Visual Domain Transfer Method for End-to-End Imitation Learning

状态条件对抗学习:一种用于端到端模仿学习的离策略视觉域迁移方法

Yuxiang Liu, Shengfan Cao

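Affiliations * University of California, Berkeley, CA, USA

AI summary For the setting where target-domain data are strictly off-policy, expert-free, and scarce, proposes State-Conditional Adversarial Learning (SCAL), which aligns latent distributions through a discriminator-based estimate of a state-conditional latent KL divergence to achieve robust visual domain transfer.

Details
AI Chinese abstract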

我们在一个现实且具有挑战性的设置中研究端到端模仿学习的视觉域迁移,其中目标域数据严格离策略、无专家且稀缺。我们首先提供理论分析,表明目标域模仿损失可以由源域损失加上源和目标观测模型之间的状态条件潜在KL散度上界。受此结果启发,我们提出状态条件对抗学习(SCAL),一种离策略对抗框架,使用基于判别器的条件KL项估计来对齐基于系统状态的潜在分布。在基于BARC-CARLA模拟器的视觉多样化自动驾驶环境中的实验表明,SCAL实现了鲁棒的迁移和强大的样本效率。

English abstract

We study visual domain transfer for end-to-end imitation learning in a realistic and challenging setting where target-domain data are strictly off-policy, expert-free, and scarce. We first provide a theoretical analysis showing that the target-domain imitation loss can be upper bounded by the source-domain loss plus a state-conditional latent KL divergence between source and target observation models. Guided by this result, we propose State-Conditional Adversarial Learning (SCAL), an off-policy adversarial framework that aligns latent distributions conditioned on system state using a discriminator-based estimator of the conditional KL term. Experiments on visually diverse autonomous driving environments built on the BARC-CARLA simulator demonstrate that SCAL achieves robust transfer and strong sample efficiency.

2510.25799 2026-06-02 cs.CL

LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

LISTEN 你的偏好:面向多目标选择的LLM框架

Adam S. Jovine, Tinghan Ye, Francis Bahk, Jingjing Wang, Matthew Ford, David B. Shmoys, Peter I. Frazier

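Affiliations * Cornell University; Georgia Institute of Technology

AI summary Proposes the LISTEN framework, which uses an LLM as a decision-making agent that iteratively refines an internal preference model (the parametric LISTEN-U or the non-parametric LISTEN-T) to learn a user's implicit preferences from natural language for multi-objective selection.

Comments Accepted at IJCAI-ECAI 2026 (the 35th International Joint Conference on Artificial Intelligence)

Details
AI Chinese abstract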

人类专家通常难以从大量具有多个竞争目标的项目中选出最佳选项,这一过程因难以形式化复杂、隐含的偏好而成为瓶颈。为解决此问题,我们引入LISTEN(基于LLM的迭代选择与自然语言权衡评估),一种基于LLM的代理框架,将LLM视为决策代理,能够迭代优化其内部偏好模型并采取行动(如提出效用或选择候选者),以最大化与用户隐含目标的对齐。为了在上下文窗口和推理成本等LLM约束下运行,我们提出两种迭代算法:LISTEN-U,使用LLM细化参数化效用函数;以及LISTEN-T,一种非参数方法,对小批量解决方案进行锦标赛式选择。在包括航班预订、购物和考试排程等多样化任务上的评估结果显示,当偏好参数化对齐时(我们通过一种新颖的一致性度量来衡量),LISTEN-U表现优异,而LISTEN-T整体性能更稳健。这项工作探索了直接用自然语言引导复杂多目标决策的有前景方向,减轻了传统偏好诱导的认知负担。代码见https://github.com/AdamJovine/LISTEN;数据见https://huggingface.co/datasets/AdamJovine/LISTEN-benchmark。

English abstract

Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN (LLM-based Iterative Selection with Trade-off Evaluation from Natural-language), an agentic LLM-based framework that treats the LLM as a decision-making agent capable of iteratively refining its internal preference model and taking actions (e.g., proposing utilities or selecting candidates) to maximize alignment with a user's implicit goals. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance overall. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation. Code is available at https://github.com/AdamJovine/LISTEN; data is available at https://huggingface.co/datasets/AdamJovine/LISTEN-benchmark.
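
The tournament-style selection attributed to LISTEN-T can be sketched as repeated winner-picking over small batches, which keeps each comparison within a limited context window. This is only an illustration under assumptions: `tournament_select` and `pick_cheapest_tradeoff` are hypothetical names, the scoring rule stands in for an LLM call, and the flight data are made up for the example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tournament_select(
    candidates: Sequence[T],
    choose_best: Callable[[List[T]], T],  # stand-in for an LLM choosing from a small batch
    batch_size: int = 4,
) -> T:
    """Reduce the pool round by round, keeping one winner per batch."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each round shrinks the pool")
    pool = list(candidates)
    while len(pool) > 1:
        pool = [
            choose_best(pool[i:i + batch_size])
            for i in range(0, len(pool), batch_size)
        ]
    return pool[0]


Flight = Tuple[int, float]  # (price in dollars, duration in hours), illustrative only


def pick_cheapest_tradeoff(batch: List[Flight]) -> Flight:
    """Toy chooser: values one hour of travel at 50 dollars."""
    return min(batch, key=lambda f: f[0] + 50 * f[1])


flights: List[Flight] = [
    (320, 5.0), (280, 9.5), (410, 3.0), (300, 6.0), (290, 7.5), (350, 4.0), (500, 2.5),
]
best = tournament_select(flights, pick_cheapest_tradeoff, batch_size=3)
```

With a deterministic chooser like this one the tournament recovers the global optimum. An LLM chooser may be noisy or inconsistent across batches, in which case the winner is not guaranteed to be optimal.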

2605.17034 2026-06-02 cs.LG cs.AI cs.CR

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

面向数据敏感检索增强生成的隐私策略执行护栏

Osama Zafar, Alexander Nemecek, Yiqian Zhang, Wenbiao Li, Debargha Ganguly, Vikash Singh, Vipin Chaudhary, Erman Ayday

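Affiliations * University of California, Berkeley; University of Washington; University of Toronto; University of Texas at Austin; University of California, Los Angeles

AI summary To address contextual data leakage in RAG systems, proposes a privacy policy enforcement framework based on dual one-class density estimators with fused text embeddings, achieving high AUROC and low false-positive rates across medicine, finance, and law.

Details
AI Chinese abstract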

标准的PII过滤器常常遗漏RAG系统中的上下文数据泄露,例如非受管制的属性集群共同识别个人身份。我们引入了一个隐私策略执行(PPE)框架,使用双单类密度估计器与融合文本嵌入,以及针对分布外输入的校准弃权区域。通过跨医学、金融和法律领域的轴分层、多LLM合成数据管道,我们发现传统的高斯混合基线在边界安全压力测试中失败,因为它们关注语言风格而非内容。我们提出的T3+OCSVM检测器,在安全和边界安全数据上训练,实现了0.93+的边界AUROC,同时将误报率降低44-55个百分点,并保持毫秒级延迟。与监督MLP分类器或14B参数LLM评判器相比,我们的框架提供了更优的操作适用性,因为前者具有高弃权率,后者存在延迟和校准问题。该方法为任何合成数据训练的分类器提供了稳健的压力测试标准。

English abstract

Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify individuals. We introduce a Privacy Policy Enforcement (PPE) framework using dual one-class density estimators with fused text embeddings and a calibrated abstain region for out-of-distribution inputs. Using an axis-stratified, multi-LLM synthetic data pipeline across medicine, finance, and law, we found that traditional Gaussian Mixture baselines fail on borderline-safe stress tests by focusing on linguistic register rather than content. Our proposed T3+OCSVM detector, trained on safe and borderline-safe data, achieves a borderline AUROC of 0.93+ while reducing false positives by 44-55 percentage points and maintaining millisecond latency. Compared to supervised MLP classifiers or 14B-parameter LLM judges, our framework offers superior operational suitability, as the former suffers from high abstention rates and the latter from latency and calibration issues. This methodology provides a robust stress-testing standard for any synthetic-data-trained classifier.

2605.16740 2026-06-02 cs.CV

TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation

TRACE:基于证据定位的多视频事件理解与声明生成

Pengyu Yan, Akhil Gorugantu, Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, David Doermann

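Affiliations * University at Buffalo, SUNY; New York University

AI summary Proposes the TRACE framework, which first builds a text-searchable timeline for evidence grounding and then guides a vision-language model to generate claims and cross-video citations, markedly improving factual completeness and attribution accuracy in multi-video event understanding.

Comments Accepted at ACL 2026 Workshop

Details
AI Chinese abstract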

多视频事件理解要求模型能够定位并归因于分布在长且异构的视频语料库中的查询相关证据。现有大型视觉语言模型(LVLMs)在此场景下表现不佳,因为它们很快耗尽上下文预算,难以精确定位证据重要的片段,经常错过密集的信息线索,如广播图形、字幕和记分牌。我们引入TRACE,一个基于证据定位的框架,采用先定位后推理的策略进行多视频事件推理。我们的方法首先使用OCR和物体检测为每个视频构建结构化的、可文本搜索的时间线。然后,一个纯文本LLM进行查询感知的证据定位,在后续视觉推理之前选择相关时刻。检索到的帧及其定位摘要随后用于引导基于LVLM的声明生成和跨视频引用整合。在MAGMaR 2026和WikiVideo上的实验表明,结构化定位显著提升了事实完整性和归因保真度。在MAGMaR验证集上,与未引导的Qwen3-VL-30B基线相比,TRACE将宏平均MiRAGE F1从0.705提升至0.811,引用召回率从0.440大幅提升至0.628。该方法还在官方MAGMaR 2026排行榜上取得了最先进的结果。代码已发布在https://github.com/pengyu965/TRACE。

English abstract

Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision-language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We introduce TRACE, an evidence grounding-guided framework that follows a ground-before-reasoning strategy for multi-video event reasoning. Our approach first builds a structured, text-searchable timeline for each video using OCR and object detection. A text-only LLM then conducts query-aware evidence localization, selecting relevant moments prior to any downstream visual reasoning. The retrieved frames and their grounding summaries are subsequently used to steer LVLM-based claim generation and cross-video citation consolidation. Experiments on MAGMaR 2026 and WikiVideo demonstrate that structured grounding markedly boosts factual completeness and attribution fidelity. On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 to 0.811 compared to an unguided Qwen3-VL-30B baseline, with especially strong improvements in citation recall from 0.440 to 0.628. The method also attains state-of-the-art results on the official MAGMaR 2026 leaderboard. Code is released at https://github.com/pengyu965/TRACE.

2605.11125 2026-06-02 cs.LG

Language Modeling with Hyperspherical Flows

超球面流语言建模

Justin Deschenaux, Caglar Gulcehre

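Affiliations * EPFL; Microsoft AI

AI summary Proposes S-FLM, a continuous-flow language modeling method in a hyperspherical latent space. It generates by rotating vectors along a velocity field learned with cross-entropy, which avoids the overhead of one-hot vectors, substantially improves performance on large-vocabulary reasoning tasks, and narrows the gap to masked diffusion models.

Details
AI Chinese abstract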

离散扩散语言模型作为自回归模型的替代方案发展迅速,其动机在于并行生成能力。然而,为了可处理性,离散扩散模型从因子化分布中采样,其表达能力弱于自回归模型。最近的流语言模型将连续流应用于语言,通过确定性常微分方程将噪声传输到数据,避免了因子化采样。流语言模型操作于独热向量,其维度随词汇表大小缩放,使得流语言模型训练成本高昂。此外,由于所有不同的独热嵌入在 $\ell_2$ 中都是等距的,添加高斯噪声没有明确的语义解释(与图像不同,在图像中高斯噪声逐渐退化结构)。我们引入了 $\mathbb{S}$-FLM,一种在超球面中的潜在流语言模型。$\mathbb{S}$-FLM 通过沿速度场旋转 $\mathbb{S}^{d-1}$ 中的向量来生成序列,该速度场使用交叉熵学习,避免了具体化独热向量的开销。先前的流语言模型在生成困惑度上与自回归模型匹配,但在数学和代码等可验证领域中,高似然的样本不一定正确。$\mathbb{S}$-FLM 在大型词汇推理任务上显著改进了连续流语言模型,并在标准温度采样($T=1$)下缩小了与掩码扩散的差距,而在优化的低温解码($T=0.1$)下仍存在差距。

English abstract

Discrete Diffusion Language Models progressed rapidly as an alternative to autoregressive (AR) models, motivated by their parallel generation abilities. However, for tractability, discrete diffusion models sample from a factorized distribution, which is less expressive than AR. Recent Flow Language Models (FLMs) apply continuous flows to language, transporting noise to data with a deterministic ODE that avoids factorized sampling. FLMs operate on one-hot vectors whose dimension scales with the vocabulary size, making FLMs costly to train. Moreover, since all distinct one-hot embeddings are equidistant in $\ell_2$, adding Gaussian noise does not have a clear semantic interpretation (unlike images, where Gaussian noise progressively degrades structure). We introduce $\mathbb{S}$-FLM, a latent FLM in the hypersphere. $\mathbb{S}$-FLM generates sequences by rotating vectors in $\mathbb{S}^{d-1}$ along a velocity field learned with cross-entropy, avoiding the overhead of materializing one-hot vectors. Previous FLMs match AR in Generative Perplexity (Gen. PPL), but samples with high likelihood are not necessarily correct in verifiable domains such as math and code. $\mathbb{S}$-FLM substantially improves continuous flow language models on large-vocabulary reasoning and closes the gap to masked diffusion under standard-temperature sampling ($T=1$), while a gap remains under optimized low-temperature ($T=0.1$) decoding.
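
To make "rotating vectors in $\mathbb{S}^{d-1}$ along a velocity field" concrete, the sketch below takes one integration step on the unit sphere using the standard exponential map: the velocity is projected onto the tangent space at the current point, and the point is rotated along the resulting great circle. This is a generic geometric illustration, not the paper's sampler; `rotate_step` is a hypothetical name, and the velocity would come from a learned network rather than being hand-specified.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def rotate_step(x: Sequence[float], velocity: Sequence[float], dt: float) -> List[float]:
    """Move a unit vector x along the sphere for time dt under the given velocity."""
    # Keep only the component of the velocity tangent to the sphere at x.
    radial = dot(velocity, x)
    tangent = [v - radial * xi for v, xi in zip(velocity, x)]
    speed = math.sqrt(dot(tangent, tangent))
    if speed == 0.0:
        return list(x)  # purely radial or zero velocity: no rotation
    angle = speed * dt
    # Exponential map on the unit sphere: rotate x toward the tangent direction.
    return [
        math.cos(angle) * xi + math.sin(angle) * ti / speed
        for xi, ti in zip(x, tangent)
    ]


start = [1.0, 0.0, 0.0]
moved = rotate_step(start, [0.0, 1.0, 0.0], dt=math.pi / 2)  # quarter turn toward the y-axis
moved_norm = math.sqrt(dot(moved, moved))
```

Unlike a Euclidean Euler step followed by renormalization, this update stays exactly on the sphere (up to floating-point error) for any step size.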

2604.26283 2026-06-02 cs.CV cs.AI

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

MedSynapse-V:通过潜在记忆演化桥接视觉感知与临床直觉

Chunzheng Zhu, Jiaqi Zeng, Junyu Jiang, Jianxin Lin, Yijun Wang

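Affiliations * Hunan University

AI summary Proposes the MedSynapse-V framework, which simulates how clinical experts draw on experience through latent diagnostic memory evolution. It addresses the quantization loss, long-range information dissipation, and lack of case-adaptive expertise that discrete tokenization causes in medical vision-language models, and significantly outperforms existing methods in diagnostic accuracy.

Comments Medical latent reasoning; Memory evolution

Details
AI Chinese abstract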

高精度医学诊断不仅依赖于静态成像特征,还依赖于专家在图像解读过程中即时调用的隐式诊断记忆。我们指出了医学视觉语言模型中由于离散分词导致的基本认知错位,表现为量化损失、长程信息消散以及缺乏案例自适应专业知识。为弥合这一差距,我们提出了MedSynapse-V,一个用于潜在诊断记忆演化的框架,通过在模型隐藏流中动态合成隐式诊断记忆来模拟临床医生的经验调用。具体而言,它从元查询先验记忆机制开始,其中可学习的探针从解剖先验编码器中检索结构化先验,以生成压缩的隐式记忆。为确保临床保真度,我们引入了因果反事实细化(CCR),利用强化学习和基于区域级特征掩蔽的反事实奖励来量化每个记忆的因果贡献,从而修剪冗余并将潜在表示与诊断逻辑对齐。这一演化过程最终达到内在记忆转换(IMT),一种特权-自主双分支范式,通过全词汇散度对齐将教师分支的诊断模式内化到学生分支中。跨多个数据集的全面实证评估表明,通过将外部专业知识转化为内源参数,我们的方法在诊断准确性上显著优于现有最先进方法,特别是思维链范式。代码可在https://github.com/zhcz328/MedSynapse-V获取。

English abstract

High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose MedSynapse-V, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model's hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that MedSynapse-V, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy. The code is available at https://github.com/zhcz328/MedSynapse-V.

2511.00064 2026-06-02 cs.LG

SPORE: Skeleton Propagation Over Recalibrating Expansions

SPORE: 基于重新校准扩展的骨架传播

Randolph Wiredu-Aidoo

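Affiliations * Randolph Wiredu-Aidoo

AI summary Proposes SPORE, a two-stage density-based clustering algorithm that uses adaptive expansion and boundary propagation to handle heterogeneous densities and ambiguous boundaries, significantly outperforming existing methods on 28 benchmark datasets.

Details
AI Chinese abstract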

许多真实世界的数据集不是线性可分的,这限制了基于质心的聚类方法(如K-means)的有效性。基于密度的聚类方法通过识别具有任意几何结构的聚类来解决这一限制;然而,现有方法存在两个持续的缺点。首先,它们在存在异质局部密度的情况下往往表现不佳,其中单个密度阈值无法充分捕获跨多个密度尺度的聚类。其次,它们通常缺乏由基于质心方法的线性划分机制自然诱导的清晰边界界定。本文介绍了SPORE(基于重新校准扩展的骨架传播),这是一种聚类算法,旨在解决这两个挑战,同时保留基于密度方法的几何灵活性。SPORE分两个阶段运行:自适应聚类扩展阶段,然后是邻近驱动的边界传播阶段,即使在弱密度对比下也能保持判别能力。该方法在28个基准数据集上与已建立的基于密度的基线进行了评估,并以K-means作为参考的基于质心方法。实验结果表明,相对于所有评估的基线(p < 0.01),SPORE实现了显著改善的聚类恢复,同时可以在五次随机搜索评估内识别出性能强劲的配置。

English abstract

Many real-world datasets are not linearly separable, limiting the effectiveness of centroid-based clustering methods such as K-means. Density-based clustering methods address this limitation by identifying clusters with arbitrary geometric structure; however, existing approaches exhibit two persistent shortcomings. First, they often underperform in the presence of heterogeneous local densities, where a single density threshold cannot adequately capture clusters across multiple density scales. Second, they generally lack the clear boundary delineation naturally induced by the linear partitioning mechanism of centroid-based methods. This paper introduces SPORE (Skeleton Propagation Over Recalibrating Expansions), a clustering algorithm designed to address both challenges while preserving the geometric flexibility of density-based approaches. SPORE operates in two stages: an adaptive cluster expansion phase followed by a proximity-driven boundary propagation phase that maintains discriminative capability even under weak density contrast. The proposed method is evaluated on 28 benchmark datasets against established density-based baselines, with K-means included as a reference centroid-based method. Experimental results demonstrate that SPORE achieves significantly improved cluster recovery relative to all evaluated baselines (p < 0.01), while strong-performing configurations can be identified within five random-search evaluations.

2510.12999 2026-06-02 cs.LG stat.ML

AMORE: Adaptive Multi-Output Operator Network for Stiff Chemical Kinetics

AMORE: 自适应多输出算子网络用于刚性化学动力学

Kamaljyoti Nath, Additi Pandey, Bryan T. Susi, Hessam Babaee, George Em Karniadakis

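Affiliations * Division of Applied Mathematics, Brown University; Applied Research Associates, Inc.; Department of Mechanical Engineering and Materials Science, University of Pittsburgh

AI summary To address the high computational cost of time integration for stiff chemical kinetics, proposes the AMORE framework, which uses adaptive loss functions and an invertible map to make multi-output operator learning reliable, and validates it on syngas and GRI-Mech 3.0.

Details
AI Chinese abstract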

刚性系统的时间积分是燃烧、高超声速及其他反应输运系统中计算成本的主要来源。这种刚性会引入远小于其他物理过程的时间尺度,导致显式格式需要极小的步长或隐式方法计算量大。因此,缓解刚性挑战的策略至关重要。虽然神经算子(DeepONet)可以作为刚性动力学的替代模型,但需要可靠的算子学习策略来适当考虑输出变量和样本之间的误差差异。本文开发了AMORE(自适应多输出算子网络),一个包含能够预测多个输出的算子和确保可靠算子学习的自适应损失函数的框架。该算子从给定初始条件预测所有热化学状态。我们提出了两种自适应损失函数,考虑每个状态变量和样本的误差来惩罚损失函数。我们设计了主干网络以自动满足单位分解。为了精确满足质量分数总和为1的约束,我们提出了一个可逆解析映射,将n维物种质量分数向量变换到(n-1)维空间。我们将所提出的自适应损失函数扩展到具有多输出的DeepONet的两步训练中的主干和分支训练。我们还通过预测质量分数上的softmax函数精确实现了另一个质量分数总和为1的约束。我们通过两个示例证明了模型的有效性和适用性:合成气(12个状态)、GRI-Mech 3.0(54个中的24个活跃状态)。所提出的DeepONet将成为未来CFD研究加速湍流燃烧模拟的骨干。AMORE是一个通用框架,本文也将其应用于FNO。

English abstract

Time integration of stiff systems is a primary source of computational cost in combustion, hypersonics, and other reactive transport systems. This stiffness can introduce time scales significantly smaller than those associated with other physical processes, requiring extremely small time steps in explicit schemes or computationally intensive implicit methods. Consequently, strategies to alleviate challenges posed by stiffness are important. While neural operators (DeepONets) can act as surrogates for stiff kinetics, a reliable operator learning strategy is required to appropriately account for differences in error between output variables and samples. Here, we develop AMORE, Adaptive Multi-Output Operator Network, a framework comprising an operator capable of predicting multiple outputs and adaptive loss functions ensuring reliable operator learning. The operator predicts all thermochemical states from given initial conditions. We propose two adaptive loss functions within the framework, considering each state variable's and sample's error to penalize the loss function. We designed the trunk to automatically satisfy Partition of Unity. To enforce unity mass-fraction constraint exactly, we propose an invertible analytical map that transforms the $n$-dimensional species mass-fraction vector into an ($n-1$)-dimensional space. We extend the proposed adaptive loss functions to trunk and branch training in two-step training of DeepONet with multiple outputs. We implemented another unity mass fraction constraint exactly using a softmax function on the predicted mass fraction. We demonstrate efficacy and applicability of our models through two examples: syngas (12 states), GRI-Mech 3.0 (24 active states out of 54). The proposed DeepONet will be a backbone for future CFD studies to accelerate turbulent combustion simulations. AMORE is a general framework, and here, we also demonstrate it for FNO.
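
Two of the constraint mechanisms in the abstract can be illustrated in a few lines. The softmax route is stated directly: applying softmax to raw network outputs yields positive mass fractions that sum to one. The abstract does not specify the invertible analytical map from $n$ to $n-1$ dimensions, so the additive log-ratio transform below is an assumed stand-in chosen only because it has the same shape (invertible, $n \to n-1$, requires strictly positive fractions); the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Map unconstrained outputs to positive values that sum to one."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alr_forward(mass_fractions: Sequence[float]) -> List[float]:
    """Assumed stand-in map: n positive fractions -> n-1 unconstrained log-ratios."""
    reference = mass_fractions[-1]  # last species used as the reference
    return [math.log(y / reference) for y in mass_fractions[:-1]]


def alr_inverse(log_ratios: Sequence[float]) -> List[float]:
    """Inverse map: n-1 log-ratios -> n fractions that sum to one by construction."""
    return softmax(list(log_ratios) + [0.0])


predicted = softmax([2.0, -1.0, 0.5, 0.0])      # e.g. raw outputs for four species
original = [0.2, 0.3, 0.5]
recovered = alr_inverse(alr_forward(original))  # round trip through the reduced space
```

In both cases the constraint holds by construction rather than through a penalty term, so a network trained in the unconstrained space cannot output mass fractions that violate it.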

2605.16451 2026-06-02 cs.LG cs.AI

Physics-Guided Geometric Diffusion for Macro Placement Generation

物理引导的几何扩散用于宏单元布局生成

Jongho Yoon, Jinsung Jeon, Seokhyeong Kang

Affiliations * POSTECH Institute of Artificial Intelligence; KAIST InnoCORE LLM; Seoul National University; Pohang University of Science and Technology

AI summary Proposes the MacroDiff+ framework, which uses a dual-domain denoising architecture and a physics-guided sampling strategy to balance topological connectivity and physical constraints in macro placement, reducing wirelength by 6.1-6.2% on the ISPD2005 MMS benchmarks.

Comments Accepted to IJCAI 2026. 9 pages, 5 figures

Details
AI Chinese abstract

宏单元布局是VLSI物理设计中的关键阶段,从根本上决定了芯片的整体性能。最近的数据驱动布局方法显示出巨大潜力,但它们往往难以处理序列依赖关系,并平衡拓扑连接与物理约束。为弥补这一差距,我们提出了MacroDiff+,一个物理引导的几何扩散框架。具体来说,我们设计了一个双域去噪架构,将异构GNN编码的拓扑连接与Transformer建模的全局几何上下文相结合。此外,我们引入了物理引导采样,一种推理策略,通过显式梯度主动引导生成,以确保统计合理性和物理有效性。在ISPD2005 MMS基准上,MacroDiff+优于最先进的基线,线长减少6.1-6.2%。值得注意的是,在先前方法无法收敛的大规模设计中,它表现出卓越的稳定性和可扩展性。源代码可在https://github.com/jhy00n/MacroDiff-plus获取。

English abstract

Macro placement is a pivotal stage in VLSI physical design, fundamentally determining the overall chip performance. Recent data-driven placement methods have demonstrated significant potential, yet they often struggle to handle sequential dependencies and to balance topological connectivity with physical constraints. To bridge this gap, we propose MacroDiff+, a physics-guided geometric diffusion framework. Specifically, we design a dual-domain denoising architecture that couples topological connectivity encoded by heterogeneous GNNs with global geometric context modeled by a Transformer. Furthermore, we introduce Physics-Guided Sampling, an inference strategy that actively steers the generation using explicit gradients to ensure both statistical plausibility and physical validity. On the ISPD2005 MMS benchmarks, MacroDiff+ outperforms state-of-the-art baselines with a 6.1-6.2% reduction in wirelength. Notably, it exhibits superior stability and scalability on large-scale designs where prior methods fail to converge. The source code is available at https://github.com/jhy00n/MacroDiff-plus.

2605.16446 2026-06-02 cs.LG cs.AI

Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating

避免表格公平半监督学习中的结构失效模式:基于置信门控的在线原始-对偶分配

Hangchuan Liang, Changchun Li

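Affiliations * College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, China

AI summary To address a structural conflict in tabular fair semi-supervised learning, proposes Online Primal-Dual Allocation (OPDA), which dynamically schedules fairness and entropy-based stability penalties to avoid two failure modes, Masking Collapse and Trivial Saturation, and reaches non-degenerate operating points on several benchmarks.

Details
AI Chinese abstract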

半监督学习(SSL)能够在有限标签下进行预测,但高风险表格应用(医疗、信贷、再犯)需要统计公平性保证。通过诊断压力测试,我们识别出表格公平SSL中的结构冲突:在置信门控伪标签下,矩匹配公平正则化器可能触发两种失效模式——掩码崩溃(公平性侵蚀置信度,导致伪标签匮乏)和平凡饱和(漂移至常数预测器)。我们提出在线原始-对偶分配(OPDA),一种在线控制器,利用违规、风险和伪标签健康信号调度公平性和基于熵的稳定性惩罚,从而避免在该诊断机制下为每个数据集选择固定公平权重。在评估的表格基准(Adult、ACSIncome、COMPAS)上,OPDA缓解了静态权重和简单单信号自适应基线中观察到的退化状态。在Adult和COMPAS上,它产生了与经验静态λ前沿竞争的非退化运行点;在ACSIncome上,它保持了效用,同时具有更宽的公平-效用分布。相对于OPDA-lite,完整控制器主要在ACSIncome上将运行点向更高效用偏移,而Adult则突出了两种变体之间的公平-效用权衡。这些结果使OPDA成为表格公平SSL中无需校准的控制器,无需针对每个数据集进行调整即可获得非退化运行点。

English abstract

Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) require statistical fairness guarantees. We identify a structural conflict in tabular fair SSL through a diagnostic stress test: under confidence-gated pseudo-labeling, moment-matching fairness regularizers can trigger two failure modes -- Masking Collapse (fairness erodes confidence, starving pseudo-labels) and Trivial Saturation (drift to constant predictors). We propose Online Primal-Dual Allocation (OPDA), an online controller that schedules fairness and entropy-based stability penalties using violation, risk, and pseudo-label health signals, avoiding per-dataset selection of a fixed fairness weight within this diagnostic regime. On the evaluated tabular benchmarks (Adult, ACSIncome, COMPAS), OPDA mitigates the degenerate regimes observed under static weighting and simple single-signal adaptive baselines. On Adult and COMPAS, it yields non-degenerate operating points competitive with the empirical static-$λ$ frontier; on ACSIncome, it preserves utility with a wider fairness-utility spread. Relative to OPDA-lite, the full controller mainly shifts the operating point toward higher utility on ACSIncome, while Adult highlights the fairness-utility trade-off between the two variants. These results position OPDA as a calibration-free controller for non-degenerate operating points in tabular fair SSL without per-dataset tuning.
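
For readers unfamiliar with primal-dual scheduling, the sketch below shows the textbook projected dual-ascent update for a single penalty weight: the weight grows while a constraint violation exceeds its tolerance and shrinks otherwise. It is a generic illustration, not OPDA itself. The abstract says the actual controller also uses risk and pseudo-label health signals, which are not modeled here, and every name and default value below is an assumption.

```python
def dual_ascent_step(
    weight: float,
    violation: float,          # measured fairness violation on the current batch
    tolerance: float,          # assumed target level for the violation
    step_size: float,          # assumed dual learning rate
    max_weight: float = 10.0,  # assumed cap to keep the penalty bounded
) -> float:
    """One projected dual-ascent update of a penalty weight."""
    updated = weight + step_size * (violation - tolerance)
    return min(max(updated, 0.0), max_weight)  # project back onto [0, max_weight]


# The weight rises while the violation is above tolerance, then decays once it is met.
weight = 0.0
trajectory = []
for observed_violation in [0.30, 0.20, 0.04, 0.00, 0.00]:
    weight = dual_ascent_step(weight, observed_violation, tolerance=0.05, step_size=1.0)
    trajectory.append(weight)
```

A static fairness weight corresponds to skipping this update entirely, which is the per-dataset tuning choice the abstract says OPDA avoids.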

2605.15511 2026-06-02 cs.LG

OgBench: A Framework for Evaluating Graph Neural Networks on Omics Data

OgBench:评估图神经网络在组学数据上的框架

Louisa Cornelis, Johan Mathe, Louis Van Langendonck, Guillermo Bernárdez, Nina Miolane

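Affiliations * UC Santa Barbara; Atmo, Inc.; Universitat Politècnica de Catalunya

AI summary Targeting omics data, which has few samples and many nodes per graph, proposes the OgBench benchmarking platform for evaluating GNNs and finds that commonly used GNNs often fail to outperform simple MLPs and classical baselines.

Comments 42 pages

Details
AI Chinese abstract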

图神经网络(GNN)已成为归纳图级学习的主导框架。然而,大多数基准测试关注的是 $n \gg p$ 的情况,其中图的数量 $n$ 远大于每张图的节点数 $p$。这忽略了诸如组学等生物学领域,这些领域处于相反的 $n \ll p$ 情况,其特点是跨少量患者样本的大规模基因、转录本或蛋白质图。这引发了一个问题:GNN 在低样本、高节点的组学设置中表现如何?我们引入了 OgBench(组学图基准),这是第一个针对组学数据 $n \ll p$ 特征下的图级预测基准平台。我们提供了一个标准化的、端到端的模块化基础设施,从原始组学数据到具有不同结构属性的特征图家族。我们对经典GNN、为大型图和组学应用设计的GNN,以及MLP和机器学习基线进行基准测试,以建立参考性能。我们的结果表明,广泛使用的GNN通常并不优于简单的MLP和经典基线。这些发现挑战了图结构在该领域固有地增加价值的普遍假设,促进了对当前学习范式的批判性重新评估。最终,通过揭示这些局限性,OgBench提供了必要的开源生态系统,使社区能够开发和验证专门为生物图设计的新型架构。代码可在 https://github.com/geometric-intelligence/ogbench 获取。

英文摘要

Graph Neural Networks (GNNs) have become the dominant framework for inductive graph-level learning. Yet most benchmarks focus on the regime $n \gg p$, where the number of graphs $n$ greatly exceeds the number of nodes per graph $p$. This overlooks biological domains such as omics, which operate in the opposite $n \ll p$ regime, characterized by large graphs of genes, transcripts, or proteins across few patient samples. This raises the question: \textit{how do GNNs perform in this low-sample, high-node omics setting?} We introduce \texttt{OgBench} (Omics-Graph Bench), the first benchmarking platform for graph-level prediction in the $n \ll p$ regime characteristic of omics data. We provide a standardized, end-to-end modular infrastructure from raw omics data to families of featured graphs with varied structural properties. We benchmark classical GNNs, as well as GNNs designed for large graphs and omics applications, alongside MLPs and machine learning baselines to establish reference performances. Our results show that widely used GNNs often do not outperform simple MLPs and classical baselines. These findings challenge the prevailing assumption that graph structure inherently adds value in this domain, fostering a critical reassessment of current learning paradigms. Ultimately, by exposing these limitations, OgBench provides the open-source ecosystem necessary for the community to develop and validate novel architectures explicitly tailored for biological graphs. The code is available at https://github.com/geometric-intelligence/ogbench.

2605.09366 2026-06-02 cs.AI

Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration

迈向虚拟神经科学家:基于多智能体协作的自主神经影像分析

Keqi Han, Songlin Zhao, Yao Su, Xiang Li, Yixuan Yuan, Lifang He, Carl Yang

发表机构 * Emory University(埃默里大学) Lehigh University(理海大学) Worcester Polytechnic Institute(伍斯特理工学院) Massachusetts General Hospital(麻省总医院) Harvard University(哈佛大学) Chinese University of Hong Kong(香港中文大学)

AI总结 提出NEXUS多智能体框架,通过代码中心执行和分层验证实现自主神经影像分析,在ADHD-200和ADNI数据集上优于传统工作流。

详情
AI中文摘要

将神经影像数据转化为临床可操作的生物标志物是一个知识密集型和劳动密集型过程。fMRIPrep等标准化工作流提高了鲁棒性和效率,但它们是静态配置的,无法像人类研究人员那样推理下游目标、权衡替代策略或在中间证据与后续决策之间形成闭环。这种闭环适应的缺失常常使领域专家陷入手动试错以调整参数和修复工作流失败的循环,严重限制了临床生物标志物开发的可扩展性。为弥补这一差距,我们引入了NEXUS,一个自主多智能体框架,它将神经影像工作流执行与科学目标理解相结合。与传统的扁平式工具调用智能体不同,NEXUS采用以代码为中心的执行范式,其中专业智能体在可组合的领域特定原语上协作合成和优化可执行程序。这种设计使得鲁棒的、长时程的工作流构建成为可能,并能动态适应运行时观察。此外,我们提出了一个用于自主质量控制的分层验证框架,将队列级指标筛选与智能体视觉检查相结合,以驱动基于证据的工作流修复。在ADHD-200和ADNI上的实验表明,NEXUS在预测性能上优于标准工作流基线,同时展现出复杂的智能体行为,包括策略探索和自适应改进。代码可在 https://github.com/LearningKeqi/Virtual-Neuroscientist-NEXUS 获取。

英文摘要

Transforming neuroimaging data into clinically actionable biomarkers is a knowledge-intensive and labor-intensive process. Standardized workflows such as fMRIPrep have improved robustness and efficiency, but they are statically configured and cannot reason about downstream objectives, deliberate over alternative strategies, or close the loop between intermediate evidence and subsequent decisions in the way a human researcher would. This lack of closed-loop adaptation often leaves domain experts trapped in a cycle of manual trial-and-error to tune parameters and remediate pipeline failures, severely constraining the scalability of clinical biomarker development. To bridge this gap, we introduce NEXUS, an autonomous multi-agent framework that integrates neuroimaging workflow execution with scientific-objective understanding. Unlike conventional flat tool-calling agents, NEXUS adopts a code-centric execution paradigm where specialist agents collaboratively synthesize and optimize executable programs over composable domain-specific primitives. This design enables robust, long-horizon workflow construction that adapts dynamically to runtime observations. Furthermore, we propose a hierarchical verification framework for autonomous quality control, integrating cohort-level metric screening with agentic visual inspection to drive evidence-grounded workflow remediation. Experiments on ADHD-200 and ADNI demonstrate that NEXUS outperforms standard workflow-based baselines in predictive performance while exhibiting sophisticated agentic behaviors, including strategy exploration and adaptive refinement. The code is available at https://github.com/LearningKeqi/Virtual-Neuroscientist-NEXUS.