arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2604.24278 2026-06-09 cs.SD cs.AI Version update

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

Wenbin Huang, Yuhang Qiu, Bohan Li, Yiwei Guo, Jing Peng, Hankun Wang, Xie Chen, Kai Yu

Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China

AI summary: This study proposes RAS, a reliability-oriented metric for evaluating how reliably automatic speech recognition systems transcribe uncertain segments. By introducing an abstention-aware transcription framework with a trade-off parameter calibrated by human preference, it improves transcription reliability while maintaining accuracy.

Comments 5 pages, 4 figures; Accepted at InterSpeech 2026

Abstract

Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.
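To see why Word Error Rate alone cannot reward abstention, the sketch below computes WER for a confidently wrong hypothesis and for one that abstains on the uncertain word. The abstention token `<unk>` and the example sentences are illustrative assumptions; the abstract does not specify how abstention is marked, and this code does not implement RAS itself.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: edit distance divided by the reference length."""
    ref_tokens = reference.split()
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


reference = "turn left at the bank"
confident_wrong = "turn left at the tank"   # misleading substitution
abstaining = "turn left at the <unk>"       # hypothetical abstention marker

wer_wrong = wer(reference, confident_wrong)   # 0.2
wer_abstain = wer(reference, abstaining)      # 0.2
```

Both hypotheses receive the same WER even though only one of them could mislead a reader, which is the gap a reliability-oriented metric is meant to close.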

2602.22243 2026-06-09 cs.RO Version update

SODA-CitrON: Static Object Data Association by Clustering Multi-Modal Sensor Detections Online

Jan Nausner, Kilian Wohlleben, Michael Hubner

AI summary: This paper proposes SODA-CitrON, which performs static object data association by clustering multi-modal sensor detections online while simultaneously estimating positions and maintaining persistent tracks. It outperforms existing methods in F1 score, position RMSE, MOTP, and MOTA.

Comments 8 pages, 5 figures; © 2026 IEEE. Accepted for the 2026 International Conference on Information Fusion (FUSION 2026)

Abstract

The online fusion and tracking of static objects from heterogeneous sensor detections is a fundamental problem in robotics, autonomous systems, and environmental mapping. Although classical data association approaches such as JPDA are well suited for dynamic targets, they are less effective for static objects observed intermittently and with heterogeneous uncertainties, where motion models provide minimal discriminative power with respect to clutter. In this paper, we propose a novel method for static object data association by clustering multi-modal sensor detections online (SODA-CitrON), while simultaneously estimating positions and maintaining persistent tracks for an unknown number of objects. The proposed unsupervised machine learning approach operates in a fully online manner and handles temporally uncorrelated and multi-sensor measurements. Additionally, it has a worst-case loglinear complexity in the number of sensor detections while providing full output explainability. We evaluate the proposed approach in different Monte Carlo simulation scenarios and compare it against state-of-the-art methods, including POM-based filtering, DBSTREAM clustering, and JPDA. The results demonstrate that SODA-CitrON consistently outperforms the compared methods in terms of F1 score, position RMSE, MOTP, and MOTA in the static object mapping scenarios studied.

2604.24474 2026-06-09 cs.LG Version update

Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance

Shiyun Wa, Yifei Wang, Simone Sciabola, Ye Wang

Affiliations: University of California, Berkeley; Stanford University

AI summary: This paper proposes pretrained embedding distance (PED) as an efficient alternative similarity measure for virtual screening and molecular generation, and shows that it captures structural information and measures similarity effectively.

Comments Accepted by ICML 2026 AI4Science (https://openreview.net/forum?id=HbfrCipfNl). Code and data are available

Abstract

Molecular similarity plays a central role in ligand-based drug discovery, such as virtual screening, analog searching, and goal-directed molecular generation. However, traditional similarity measures, ranging from fingerprint-based Tanimoto coefficients to 3D shape overlays, are often computationally expensive at scale or rely on hand-crafted molecular descriptors. Meanwhile, many deep learning approaches to similarity-aware design still depend on similarity-specific supervision or costly data curation, limiting their generality across targets. In this work, we propose pretrained embedding distance (PED) as an effective alternative, computed directly from pretrained molecular models without task-specific training. Experimental results show that PED exhibits distinct correlations with traditional similarity metrics, and performs effectively in both ranking molecules for virtual screening and guiding molecular generation via reward design. These findings suggest that pretrained molecular embeddings capture rich structural information and can serve as a promising and scalable similarity measurement for modern AI-aided drug discovery.
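The contrast between a fingerprint-based Tanimoto coefficient and a distance between embedding vectors can be sketched as follows. The fingerprints, the two-dimensional "embeddings", the molecule names, and the choice of cosine distance are all toy assumptions made for illustration; the abstract does not state which pretrained model or distance function PED uses.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def cosine_distance(u, v):
    """One possible embedding distance: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_distance(query, library):
    """Virtual-screening style ranking: closest embeddings first."""
    return sorted(library, key=lambda name: cosine_distance(query, library[name]))


# Toy fingerprints: 3 shared bits out of 5 distinct bits.
similarity = tanimoto({1, 4, 7, 9}, {1, 4, 8, 9})   # 0.6

# Toy embeddings standing in for the output of a pretrained molecular model.
query_embedding = [1.0, 0.0]
library = {
    "mol_a": [0.9, 0.1],
    "mol_b": [0.0, 1.0],
    "mol_c": [0.5, 0.5],
}
ranking = rank_by_distance(query_embedding, library)
```

No task-specific training appears anywhere in the ranking step, which matches the abstract's description of PED as computed directly from a pretrained model.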

2601.18840 2026-06-09 cs.LG cs.SY eess.SY Version update

Bellman Residual Minimization for Control: Geometry, Stationarity, and Convergence

Donghwan Lee, Hyukjun Yang

Affiliations: School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)

AI summary: This paper studies the geometry, stationarity, and convergence of Bellman residual minimization for control, and examines its theoretical foundations and practical value for policy optimization.

Abstract

Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for the control Bellman residual minimization for policy optimization.
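As a rough illustration of the objective being minimized, the sketch below evaluates the squared Bellman residual for control (with the max over next actions) on a tabular Q-function. The two-state deterministic MDP, its rewards, and the discount factor are made-up assumptions; the code only evaluates the objective and uses value iteration, a dynamic-programming method, to obtain a reference solution.

```python
GAMMA = 0.9
# Hypothetical deterministic MDP: NEXT[s][a] is the next state, REWARD[s][a] the reward.
NEXT = [[0, 1], [0, 1]]
REWARD = [[0.0, 1.0], [2.0, 0.0]]
NUM_STATES, NUM_ACTIONS = 2, 2


def bellman_target(q, s, a):
    """Control Bellman operator: reward plus discounted best next-state value."""
    return REWARD[s][a] + GAMMA * max(q[NEXT[s][a]])


def squared_bellman_residual(q):
    """Sum over all state-action pairs of (T q - q)^2."""
    return sum((bellman_target(q, s, a) - q[s][a]) ** 2
               for s in range(NUM_STATES) for a in range(NUM_ACTIONS))


def value_iteration(num_iters=500):
    """Dynamic-programming baseline: repeatedly apply the Bellman operator."""
    q = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]
    for _ in range(num_iters):
        q = [[bellman_target(q, s, a) for a in range(NUM_ACTIONS)]
             for s in range(NUM_STATES)]
    return q


zero_q = [[0.0, 0.0], [0.0, 0.0]]
residual_at_zero = squared_bellman_residual(zero_q)       # 5.0
residual_at_fixed_point = squared_bellman_residual(value_iteration())
```

The residual vanishes at the fixed point of the Bellman operator, so a residual-minimization method would treat this quantity as its loss; the max inside the target is what makes the control objective non-smooth.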

2604.23435 2026-06-09 cs.CV cs.AI cs.LG Version update

Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis

Azmul A. Irfan, Nur Ahmad Khatim, Alfan Alfian Irfan, Achmad Zaki, Erike A. Suwarsono, Mansur M. Arief

Affiliations: Orthopaedic Department, Faculty of Medicine UIN Syarif Hidayatullah Jakarta; Informatics Engineering, Institut Teknologi Sepuluh Nopember; Information Technology, Universitas Muhammadiyah Yogyakarta; Industrial and Systems Engineering, King Fahd University of Petroleum and Minerals

AI summary: This paper proposes the Knee-xRAI framework, which mimics clinical radiological workflows by combining joint space narrowing (JSN), osteophyte, and subchondral sclerosis features, and uses XGBoost-SHAP and ConvNeXt models to produce explainable KL grades, validating its effectiveness for knee osteoarthritis diagnosis.

Comments 8 pages, 5 figures

Abstract

Grading knee osteoarthritis (KOA) on plain radiographs is poorly reproducible across readers. A single-grade disagreement on the Kellgren-Lawrence (KL) scale can alter surgical management or redirect a patient from conservative therapy to intra-articular injection. Meanwhile, deep learning models that outperform human readers often offer no explanation for their decisions. We present Knee-xRAI, a pipeline that decomposes the grading process by mimicking clinical radiological workflows. It independently measures joint space narrowing (JSN), osteophytes, and subchondral sclerosis, then combines these findings into an explainable KL grade. Specifically, a U-Net++ architecture quantifies JSN via contour segmentation, an SE-ResNet-50 multi-task network grades osteophytes per anatomical site on the OARSI scale, and a hybrid texture-CNN detects binary sclerosis. This pipeline yields a 50-dimensional feature vector evaluated via an XGBoost-SHAP classifier (Path A, audit) and a ConvNeXt hybrid predictor (Path B, deployed). On 8,260 OAI-derived radiographs, the JSN module achieved a Dice score of 0.8909 and an mJSW ICC of 0.8674. Path A reached a QWK of 0.6294 and an AUC of 0.8046, confirming the structured feature vector carries substantial diagnostic signal. Path B achieved a QWK of 0.8436 and an AUC of 0.9017. SHAP analysis identifies JSN as the dominant feature, with osteophytes adding a consistent increment and sclerosis contributing marginally. Removing JSN evidence collapses KL3-KL4 recall while early grades remain intact, aligning with the KL diagnostic criteria. Knee-xRAI grounds every prediction in an auditable chain of measured radiographic findings, providing clinical transparency at the point of care.
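The QWK figures quoted above are quadratic weighted kappa scores, which penalize a disagreement by the squared distance between grades. A minimal sketch of the standard formula follows; the five-grade toy labels are invented for illustration and are unrelated to the paper's data.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa with quadratic weights for ordinal grades 0..num_classes-1."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n   # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


# Toy KL grades (0-4): one reader set, three hypothetical prediction sets.
grades = [0, 1, 2, 3, 4]
kappa_perfect = quadratic_weighted_kappa(grades, [0, 1, 2, 3, 4], 5)
kappa_near_miss = quadratic_weighted_kappa(grades, [0, 1, 2, 3, 3], 5)  # off by one grade
kappa_far_miss = quadratic_weighted_kappa(grades, [0, 1, 2, 3, 0], 5)   # off by four grades
```

Each of the last two prediction sets contains exactly one error, yet the far miss scores much lower, which is why QWK suits ordinal scales such as KL grading.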

2604.23066 2026-06-09 cs.CV Version update

Urban Flood Observations: A hand-labeled training and validation dataset of post-flood inundation

Rohit Mukherjee, Hannah K. Friedrich, Beth Tellman, Ariful Islam, Zhijie Zhang, Jonathan Giezendanner, Upmanu Lall, Venkataraman Lakshmi

Affiliations: Pacific Northwest National Laboratory; University of Arizona; University of Wisconsin–Madison; Utah State University; Massachusetts Institute of Technology; Columbia University; University of Virginia

AI summary: This paper presents the UFO dataset for mapping flood inundation from satellite imagery in complex urban environments. A segmentation model trained on the hand-labeled dataset reaches a mean IoU of 77.3, and the dataset is also used to evaluate two widely used surface water products.

Comments 15 pages, 8 figures

Abstract

Urban flooding affects lives and infrastructure worldwide. Mapping inundation in complex urban environments from satellite imagery remains challenging due to limited spatial resolution, infrequent acquisitions, and cloud cover. We present Urban Flood Observations (UFO), a global, hand-labeled dataset of post-flood inundation in diverse urban settings. UFO comprises 215 image chips (1024 by 1024 pixels) from 14 flood events between 2017 and 2021, derived from 3 m PlanetScope imagery. Each chip is annotated with two classes: 'inundated' (all visible surface water, including floodwater and pre-existing water bodies (permanent or seasonal)) and 'non-inundated'. To demonstrate the dataset's utility, we trained a segmentation model using leave-one-event-out cross-validation, achieving a mean Intersection over Union (IoU) of 77.3. We also used UFO to evaluate two widely used surface water products, the Sentinel-1-based NASA IMPACT model and Google's 10 m Dynamic World water class, which yielded IoUs of 44.1 and 48.1, respectively. UFO is publicly available to support the development and validation of urban inundation mapping methods.
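The evaluation protocol described above rests on two simple pieces: Intersection over Union for the 'inundated' class and a leave-one-event-out split. A minimal sketch of both is given below; the tiny masks, event names, and chip identifiers are hypothetical, and the function returns IoU on a 0 to 1 scale whereas the abstract reports it on a 0 to 100 scale.

```python
def iou(pred_mask, true_mask):
    """Intersection over Union of the positive class for two binary masks."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return intersection / union if union else 1.0


def leave_one_event_out(chips_by_event):
    """Yield (held_out_event, train_chips, test_chips), holding out one flood event at a time."""
    for held_out in chips_by_event:
        train = [chip
                 for event, chips in chips_by_event.items() if event != held_out
                 for chip in chips]
        yield held_out, train, chips_by_event[held_out]


# 1 = inundated, 0 = non-inundated (toy 2x2 chips).
predicted = [[1, 1],
             [0, 0]]
labeled = [[1, 0],
           [1, 0]]
score = iou(predicted, labeled)   # 1 overlapping pixel out of 3 in the union

# Hypothetical chip identifiers grouped by flood event.
chips = {"event_a": ["a1", "a2"], "event_b": ["b1"], "event_c": ["c1", "c2"]}
folds = list(leave_one_event_out(chips))
```

Splitting by event rather than by chip keeps imagery from the same flood out of both the training and test sets, so each fold measures generalization to an unseen event.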

2604.23053 2026-06-09 cs.LG math.OC Version update

ML-Guided Primal Heuristics for Mixed Binary Quadratic Programs

Weimin Huang, Natalie M. Isenberg, Ján Drgoňa, Draguna L Vrabie, Bistra Dilkina

Affiliations: University of Texas at Austin

AI summary: This paper proposes ML-guided primal heuristics for mixed binary quadratic programs, improving solution efficiency and generalization through a new neural network architecture and extended loss functions.

Abstract

Mixed Binary Quadratic Programs (MBQPs) are an important and complex set of problems in combinatorial optimization. As solving large-scale combinatorial optimization problems is challenging, primal heuristics have been developed to quickly identify high-quality solutions within a short amount of time. Recently, a growing body of research has also used machine learning to accelerate solution methods for challenging combinatorial optimization problems. Despite the increasing popularity of these ML-guided methods, a large body of work has focused on Mixed-Integer Linear Programs (MILPs). MBQPs are challenging to solve due to the combinatorial complexity coupled with nonlinearities. This work proposes ML-guided primal heuristics for Mixed Binary Quadratic Programs (MBQPs) by adapting and extending existing work on ML-guided MILP solution prediction to MBQPs. We introduce a new neural network architecture for MBQP solution prediction and a new training data collection procedure. Moreover, we extend existing loss functions in solution prediction and propose to combine contrastive and weighted cross-entropy losses. We evaluate the methods on standard and real-world MBQP benchmarks and show that the developed ML-guided methods significantly outperform existing primal heuristics and state-of-the-art solvers. Furthermore, models trained with our proposed extension with combined losses outperform other ML-based methods adapted from MILPs and improve generalization in cross-regional inference on a real-world wind farm layout optimization problem.

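2604.19845 2026-06-09 cs.AI Version update

Deconstructing Superintelligence: Identity, Self-Modification and Différance

Elija Perrier

Affiliations: Centre for Quantum Software & Information, UTS, Sydney

AI summary: This paper analyzes the relationship between self-modification and superintelligence using an associative operator algebra, shows how non-commutation propagates to self-representation, and argues that strong self-modification may undermine the system's underlying identity.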

Comments Camera-ready version, AGI-2026

Abstract

Self-modification is routinely treated as constitutive of artificial superintelligence (SI), yet modification is a relative action requiring a supplement outside the operation. We formalise this on an associative operator algebra $\mathcal{A}$ with update operator $\hat U$, difference operator $\hat D$, and self-representation operator $\hat R$, identifying the supplement with $\operatorname{Comm}(\hat U)$. A propagation theorem shows $[\hat U,\hat R]$ decomposes through $[\hat U,\hat D]$, so non-commutation propagates to self-representation. The liar paradox is the rank-one case $[\hat T,\Pi_L]=0$, and class $\mathbf{A}$ systems, in which $\hat U$ acts on $\hat D$, reproduce it at system scale, yielding a structure coinciding with Priest's inclosure schema and Derrida's différance. Our results show that the strong self-modification taken to define superintelligence may undermine the persistent identity upon which such systems are premised.
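For readers less familiar with the notation, the block below recalls the standard commutator identities in an associative algebra that make a statement like "decomposes through" plausible. Reading $\operatorname{Comm}(\hat U)$ as the commutant of $\hat U$, and writing $\hat R = \hat X \hat D$ with an auxiliary operator $\hat X$, are assumptions made only for illustration; the abstract does not give the paper's actual definitions or its propagation theorem.

```latex
% Standard definitions; the commutant reading of Comm is an assumption.
\[
  [\hat A, \hat B] = \hat A \hat B - \hat B \hat A,
  \qquad
  \operatorname{Comm}(\hat U) = \{\, \hat X \in \mathcal{A} : [\hat U, \hat X] = 0 \,\}.
\]
% Leibniz rule for commutators in any associative algebra:
\[
  [\hat U, \hat A \hat B] = [\hat U, \hat A]\,\hat B + \hat A\,[\hat U, \hat B].
\]
% Illustrative case (hypothetical factorization R = X D with X in Comm(U)):
\[
  [\hat U, \hat R] = [\hat U, \hat X \hat D]
                   = [\hat U, \hat X]\,\hat D + \hat X\,[\hat U, \hat D]
                   = \hat X\,[\hat U, \hat D].
\]
```

Under that hypothetical factorization, $[\hat U,\hat R]$ can be non-zero only when $[\hat U,\hat D]$ is non-zero, which is one simple way non-commutation with the difference operator could carry over to self-representation.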

2604.22482 2026-06-09 cs.CV cs.GR Version update

Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond

Jing Ou, Zidong Cao, Yinrui Ren, Zhuoxiao Li, Jinjing Zhu, Tongyan Hua, Shuai Zhang, Hui Xiong, Wufan Zhao

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); South China Normal University

AI summary: This paper presents the Holo360D dataset, containing 109,495 panoramas with registered point clouds, meshes, and aligned camera poses. Its continuous trajectories and high-completeness depth maps improve panoramic 3D reconstruction, and the paper establishes a new benchmark and offers effective fine-tuning strategies.

Comments Datasets Link: https://github.com/Jou719/Holo360D

Abstract

While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions. Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available.

2604.22238 2026-06-09 cs.RO Version update

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

Khoa Vo, Sieu Tran, Taisei Hanyu, Yuki Ikebe, Duy Nguyen, Nghi D. Q. Bui, Minh Vu, Anthony Gunderman, Chase Rainwater, Anh Nguyen, Ngan Le

Affiliations: University of Arkansas; Max Planck Research School for Intelligent Systems and the University of Stuttgart; Center of AI Research, VinUniversity; TU Wien; University of Liverpool

AI summary: CodeGraphVLP combines a semantic-graph state with an executable code-based planner to improve vision-language-action execution on non-Markovian long-horizon tasks, lowering planning latency and raising task completion rates.

Abstract

Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.
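The interplay between a persistent semantic graph and a code-based planner might look roughly like the sketch below. The class, the triple format, the task, and the planner logic are all hypothetical illustrations of the idea described in the abstract, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Persistent state stored as (subject, attribute, value) triples."""
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)

    def update(self, observed):
        """Merge new observations; facts about unobserved objects persist."""
        for subj, attr, value in observed:
            self.entities.add(subj)
            # Each attribute holds one value at a time, so drop the stale one.
            self.relations = {r for r in self.relations
                              if not (r[0] == subj and r[1] == attr)}
            self.relations.add((subj, attr, value))

    def holds(self, subj, attr, value):
        return (subj, attr, value) in self.relations


def planner(graph):
    """Hypothetical planner for 'put the cup in the drawer, then close it'.

    Returns a subtask instruction and the objects relevant to that subtask.
    """
    if not graph.holds("cup", "location", "drawer"):
        return "place the cup in the drawer", ["cup", "drawer"]
    if not graph.holds("drawer", "state", "closed"):
        return "close the drawer", ["drawer"]
    return "done", []


graph = SemanticGraph()
graph.update([("cup", "location", "table"), ("drawer", "state", "open")])
step1 = planner(graph)

graph.update([("cup", "location", "drawer")])
step2 = planner(graph)

# The cup is now occluded, so the latest observation mentions only the drawer.
graph.update([("drawer", "state", "closed")])
step3 = planner(graph)
```

The final progress check succeeds even though the cup is no longer visible, because the graph rather than the latest observation carries the history; the list of relevant objects is the kind of output that could drive clutter suppression.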

2604.19755 2026-06-09 cs.AI cs.LG Version update

Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks

Dorothy Torres, Wei Cheng, Ke Hu

Affiliations: School of Science, Technology, Engineering and Mathematics; School of Electrical Engineering and Computer Science

AI summary: This paper proposes an explainable AML triage framework that combines retrieval-augmented evidence bundling, a structured LLM output contract, and counterfactual validation to improve auditability and robustness. Experiments show strong triage performance and evidence support.

Abstract

Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.
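A structured output contract with explicit citations can be checked mechanically before an analyst ever sees the rationale. The sketch below shows one way such a check and a citation-validity score might work; the field names, the recommendation labels, the evidence identifiers, and the exact definition of citation validity are assumptions, since the abstract does not specify them.

```python
ALLOWED_RECOMMENDATIONS = {"escalate", "close"}   # hypothetical label set


def validate_triage_output(output, evidence_bundle):
    """Check a structured triage output against the retrieved evidence bundle.

    Returns a list of contract violations and the share of citations that
    point at evidence actually present in the bundle.
    """
    errors = []
    if output.get("recommendation") not in ALLOWED_RECOMMENDATIONS:
        errors.append("recommendation must be one of the allowed labels")

    cited = list(output.get("supporting", [])) + list(output.get("contradicting", []))
    if not cited:
        errors.append("at least one citation is required")

    unknown = [c for c in cited if c not in evidence_bundle]
    if unknown:
        errors.append("citations not found in evidence bundle: " + ", ".join(unknown))

    citation_validity = (len(cited) - len(unknown)) / len(cited) if cited else 0.0
    return errors, citation_validity


# Hypothetical evidence bundle keyed by evidence identifier.
bundle = {
    "E1": "alert trigger: structuring pattern below reporting threshold",
    "E2": "policy guidance: typology for rapid movement of funds",
    "E3": "customer context: declared income consistent with activity",
}
llm_output = {
    "recommendation": "escalate",
    "supporting": ["E1", "E2"],
    "contradicting": ["E9"],   # cites evidence that was never retrieved
    "missing": ["source of funds documentation"],
}
errors, validity = validate_triage_output(llm_output, bundle)
```

Rejecting or flagging outputs whose citations fail this check is one simple form of the evidence constraint the abstract describes; counterfactual checks would add a second pass that perturbs the bundle and compares the resulting recommendations.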

2604.20689 2026-06-09 cs.RO Version update

FingerEye: Learning Dexterous Manipulation with Continuous Vision-Tactile Sensing

Zhixuan Xu, Yichen Li, Xuanye Wu, Tianyu Qiu, Lin Shao

Affiliations: National University of Singapore; RoboScience; Huazhong University of Science and Technology; South China University of Technology

AI summary: FingerEye improves dexterous robotic manipulation through continuous vision-tactile sensing. By combining visual and tactile feedback, it raises the mean success rate over a wrist-only policy by more than 30 percentage points in both simulation and the real world.

Abstract

Dexterous robotic manipulation requires perception that remains informative from pre-contact approach to contact initiation and post-contact control. We introduce FingerEye, a sensing and learning framework that strengthens robotic dexterity through continuous vision-tactile feedback throughout interaction. On the sensing side, FingerEye integrates binocular RGB cameras with a compliant contact interface to support perception both before and after contact. Before contact, the fingertip cameras provide close-range visual cues and implicit stereo for precise approach and object localization. After contact, marker-tracked deformation of the compliant ring provides a proxy for contact wrench sensing. On the learning side, we build real-and-sim infrastructure for data collection and evaluation, systematically study policy-interface designs for learning with multiple FingerEye sensors, and develop FingerEye Policy, which applies group-structured modality fusion to reduce modality shortcuts and better exploit distributed fingertip feedback. Across seven contact-sensitive task settings, FingerEye improves wrist-only policy by over 30 percentage points in mean success rate in both simulation and the real world.

2604.17406 2026-06-09 cs.AI Version update

EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Cheng Wang, Fengyang Li, Wenkai Jin, Wanxu Liu, Zehao Bing, Bingyang Zheng, Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xianghe Pang, Yaxin Du, Tingjia Miao, Yuzhi Zhang, Ruoxue Liao, Zhaohan Ding, Linfeng Zhang, Yanfeng Wang, Weinan E, Siheng Chen

Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University; SciLand DP Technology

AI summary: Through continuous self-evolution, EvoMaster lets agents iteratively refine hypotheses and accumulate knowledge, enabling efficient scientific discovery across disciplines. It is easy to use and performs strongly on several benchmarks.

Comments 17 pages, 3 figures

Abstract

The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific method is inherently iterative, existing agent frameworks are predominantly static, narrowly scoped, and lack the capacity to learn from trial and error. To bridge this gap, we present EvoMaster, a foundational evolving agent framework engineered specifically for Agentic Science at Scale. Driven by the core principle of continuous self-evolution, EvoMaster empowers agents to iteratively refine hypotheses, self-critique, and progressively accumulate knowledge across experimental cycles, faithfully mirroring human scientific inquiry. Crucially, as a domain-agnostic base harness, EvoMaster is exceptionally easy to scale up -- enabling developers to build and deploy highly capable, self-evolving scientific agents for arbitrary disciplines in approximately 100 lines of code. Built upon EvoMaster, we incubated the SciMaster ecosystem across domains such as machine learning, physics, and general science. Evaluations on four authoritative benchmarks (Humanity's Last Exam, MLE-Bench Lite, BrowseComp, and FrontierScience) demonstrate that EvoMaster achieves state-of-the-art scores of 41.1%, 75.8%, 73.3%, and 53.3%, respectively. It comprehensively outperforms the general-purpose baseline OpenClaw with relative improvements ranging from +159% to +316%, robustly validating its efficacy and generality as the premier foundational framework for the next generation of autonomous scientific discovery. EvoMaster is available at https://github.com/sjtu-sai-agents/EvoMaster.

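2604.18347 2026-06-09 cs.CL cs.AI Version update

Multilingual Training and Evaluation Resources for Vision-Language Models

Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini

Affiliations: Villanova.ai Aithlas

AI summary: This paper presents training and evaluation resources for vision-language models across five European languages, generates high-quality multilingual data through regeneration and translation, and verifies the benefit of multilingual data on non-English benchmarks.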

Abstract

Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, compared with English-only data, in VLM training. Experiments comprising three different models show that training VLMs on multilingual, multimodal examples is consistently beneficial on non-English benchmarks, with positive transfer to English as well.

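2604.18050 2026-06-09 cs.AI cs.LO Version update

The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data

Anthony Bordg

Affiliations: Huawei Lagrange Center

AI summary: This paper proposes a logic-to-topology encoding intended to reveal structural invariants of a model's latent space and, through the duality underlying the Logic of Observation, offers a path toward interpretability for neuro-symbolic AI.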

Comments Company decision as a precautionary measure while a third-party dispute is under review

详情
AI中文摘要

AlphaGeometry在神经符号推理中是一个里程碑,但其架构在符号推导引擎中面临对数线性扩展瓶颈,限制了随着问题复杂性增加的效率。最近的技术报告表明,当前领域特定语言可能与自然语言同构,作为输入表示,互换是性能不变的转换,暗示当前神经指导依赖于表面编码而非结构理解。本文通过提出一种逻辑到拓扑的编码方法来解决这一表示瓶颈,该方法旨在揭示模型潜在空间在输入空间变换下的结构不变性。通过利用观察逻辑,我们利用可观察理论中的可证性与拓扑之间的对偶性,提出一种输入空间的逻辑到拓扑编码器。我们引入了“数据集的拓扑对偶”概念,这是一种连接形式逻辑、拓扑和神经处理的转换。该框架为神经符号AI提供了一种罗塞塔石碑,提供了一条机制可解释的路径,以解释模型如何在复杂发现路径中导航。

英文摘要

AlphaGeometry represents a milestone in neuro-symbolic reasoning, yet its architecture faces a log-linear scaling bottleneck within its symbolic deduction engine that limits its efficiency as problem complexity increases. Recent technical reports suggest that current domain-specific languages may be isomorphic to natural language as input representations: interchanging them acts as a performance-invariant transformation, implying that current neural guidance relies on superficial encodings rather than structural understanding. This paper addresses this representation bottleneck by proposing a logic-to-topology encoding designed to reveal the structural invariants of a model's latent space under a transformation of its input space. By leveraging the Logic of Observation, we utilize the duality between provability in observable theories and topologies to propose a logic-to-topology encoder for the input space. We introduce the concept of the "topological dual of a dataset", a transformation that bridges formal logic, topology, and neural processing. This framework serves as a Rosetta Stone for neuro-symbolic AI, providing a principled pathway for the mechanistic interpretability of how models navigate complex discovery paths.

2604.17324 2026-06-09 cs.LG cs.AI 版本更新

Capacity-Controlled Global Attention for Graph Transformers

具有容量控制的全局注意力用于图变换器

Yang Liu, Dongxin Guo, Tom Zheng, Siu Ming Yiu, Liam Ning, Jikun Wu

发表机构 * Brain Investing Limited The University of Hong Kong(香港大学) Stellaris AI Limited

AI总结 本文提出SigGate-GT,通过在图变换器中引入可学习的sigmoid门来放松全局注意力的守恒约束,从而缓解过平滑、低秩瓶颈和训练不稳定等问题,提升了多个基准测试的性能。

Comments 13 pages, 2 figures, 15 tables

详情
AI中文摘要

全局自注意力推动了现代图变换器,但其核心的softmax操作引入了一个很少被直接考察的结构约束:每个注意力行非负且和为一,因此每个头的输出是值向量的质量守恒凸组合。一个节点永远无法“不关注任何东西”。我们认为这种守恒约束是三个通常被孤立研究的病理的共同根源:节点表示随深度增加而崩溃(过平滑)、每个头输出的低秩瓶颈,以及深层堆叠中的脆弱优化。借鉴sigmoid门控在语言模型中消除类似注意力汇聚点(attention sinks)的方式,我们引入SigGate-GT,一种在GraphGPS框架内对注意力输出应用可学习、按头、输入条件化的sigmoid门的图变换器。该门是一种平滑的、按维度的“音量控制”,可将头输出驱动至零,在不放弃注意力概率解释的前提下放松该约束。通过分析和合成实验,我们证明该门严格增加每个头输出的稳定秩,并将此秩增益与上述三种表现联系起来。在五个分子和长距离基准上,SigGate-GT在ZINC上与先前最佳结果持平(0.059 MAE),在ogbg-molhiv上取得我们所评估的图变换器基线中的最强结果(82.47% ROC-AUC),在ogbg-molpcba和Long-Range Graph Benchmark上具有竞争力,且在所有五个数据集上相对GraphGPS均有统计显著的提升(p < 0.05)。机制分析证实了这一诊断:门控减缓了过平滑(在4-16层中表示多样性平均相对增益30%),防止注意力熵崩溃,并在10倍学习率范围内稳定训练,在OGB上参数开销约为1%,实际运行时间开销低于3%。

英文摘要

Global self-attention drives modern graph transformers, yet the softmax at its core imposes a structural constraint rarely examined directly: every attention row is non-negative and sums to one, so each per-head output is a mass-conserving convex combination of value vectors. A node can never "attend to nothing." We argue this conservation constraint is a single root cause behind three pathologies usually studied in isolation: the collapse of node representations with depth (over-smoothing), a low-rank bottleneck on per-head outputs, and brittle optimization in deep stacks. Drawing on how sigmoid gating removes analogous attention sinks in language models, we introduce SigGate-GT, a graph transformer that applies a learned, per-head, input-conditioned sigmoid gate to the attention output inside the GraphGPS framework. The gate is a smooth, per-dimension "volume control" that can drive head outputs toward zero, relaxing the constraint without abandoning attention's probabilistic interpretation. Analytically and through synthetic experiments, we show the gate strictly increases the stable rank of per-head outputs, and connect this rank gain to all three manifestations. On five molecular and long-range benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE), records the strongest result among the graph-transformer baselines we evaluate on ogbg-molhiv (82.47% ROC-AUC), and is competitive on ogbg-molpcba and the Long-Range Graph Benchmark, with statistically significant gains over GraphGPS on all five datasets (p < 0.05). Mechanism analyses confirm the diagnosis: gating slows over-smoothing (a 30% mean relative gain in representation diversity across 4-16 layers), keeps attention entropy from collapsing, and stabilizes training across a 10x learning-rate range, at about 1% parameter overhead on OGB and under 3% wall-clock cost.
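The conservation constraint and the gate described in this abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the gate parameterization (a linear map `Wg` plus bias `bg` applied to the node features) and all function names are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_head(X, Wq, Wk, Wv, Wg, bg):
    """One attention head with an input-conditioned sigmoid gate on its output.

    X: (n_nodes, d_model); Wq, Wk, Wv, Wg: (d_model, d_head); bg: (d_head,).
    Wg and bg are hypothetical gate parameters, not taken from the paper.
    Returns the attention matrix, the ungated output, and the gated output.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # rows are non-negative and sum to 1
    H = A @ V                                    # convex combination of value vectors
    G = sigmoid(X @ Wg + bg)                     # per-node, per-dimension gate in (0, 1)
    return A, H, G * H                           # the gate can shrink the output toward zero
```

With a strongly negative bias the gated output approaches zero even though every attention row still sums to one, which is the "attend to nothing" behaviour that plain softmax attention cannot express.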

2604.16512 2026-06-09 cs.CV cs.CG cs.GR cs.LG cs.NA math.NA 版本更新

Medial Axis Aware Learning of Signed Distance Functions

面向中轴线的符号距离函数学习

Samuel Weidemaier, Christoph Norden-Smoch, Martin Rumpf

发表机构 * Institute for Numerical Simulation, University of Bonn(数值模拟研究所,波恩大学)

AI总结 本文提出一种新的变分方法,用于计算高精度的全局符号距离函数,通过高阶变分公式考虑梯度的跳跃集,以提高计算精度。

详情
AI中文摘要

我们提出了一种新的变分方法,用于计算给定点云的高精度全局符号距离函数(SDF)。为此,通过高阶变分公式显式考虑SDF梯度的跳跃集,即表面的中轴线,该公式强制在远离此不连续集的方向上沿梯度方向线性增长。Eikonal方程和SDF的零水平集被作为约束条件。为了使该变分问题具有计算可行性,采用了一种相场近似方法,属于Ambrosio-Tortorelli类型。相关的相场函数隐式地描述了中轴线。该方法用于由无向点云表示的表面,使用神经网络近似SDF和相场函数。实验表明,该方法在近场和全局范围内均具有较高的准确性。定量和定性比较表明,所提出的方法具有优势。

英文摘要

We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method's accuracy both in the near field and globally. Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method.

2509.12760 2026-06-09 cs.LG cs.CL 版本更新

Similarity-Distance-Magnitude Activations

相似度-距离-幅度激活函数

Allen Schmaltz

发表机构 * Reexpress AI

AI总结 本文提出SDM激活函数,通过引入相似度和距离意识提升softmax的鲁棒性和可解释性,并通过密集匹配实现基于实例的可解释性。SDM估计器通过数据驱动的CDF分区控制分类准确性,优于现有校准方法。

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 21 pages, 8 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167

详情
AI中文摘要

我们引入了相似度-距离-幅度(SDM)激活函数,这是一种更稳健和可解释的标准softmax激活函数的改进形式,增加了相似度(即正确预测深度匹配到训练)意识和距离到训练分布意识,从而通过密集匹配实现可解释性。我们进一步引入了基于SDM激活的类内经验CDF数据驱动分区的SDM估计器,以控制选择性分类中的类和预测条件下的准确性。当用作预训练语言模型的最终层激活进行选择性分类时,SDM估计器比使用softmax激活的现有校准方法更鲁棒于协变量偏移和分布外输入,同时在分布内数据上保持信息性。

英文摘要

We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.

2505.19662 2026-06-09 cs.AI cs.CV 版本更新

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

FieldWorkArena:面向真实作业任务的代理AI基准测试

Jun Takahashi, Atsunori Moteki, Akiyoshi Uchida, Shoichi Masui, Fan Yang, Kanji Uchino, Yueqi Song, Yonatan Bisk, Graham Neubig, Ikuo Kusajima, Yasuto Watanabe, Hiroyuki Ishida, Koki Nakagawa, Shan Jiang

发表机构 * Fujitsu Limited(富士通株式会社) Fujitsu Research of America(富士通美国研究部) Carnegie Mellon University(卡内基梅隆大学) Master’s Student, The University of Tokyo(东京大学硕士研究生) Agent Research Collective(代理研究集体)

AI总结 本文提出FieldWorkArena,用于评估代理AI在真实制造业和零售环境中的性能,通过现场采集的数据和实地访谈设计任务,验证多模态大语言模型的评估可行性。

Comments 27 pages, 10 figures, 7 tables [ICPR 2026 Accepted] Changes from previous version: added supplemental material

详情
AI中文摘要

本文介绍FieldWorkArena,一个针对真实世界作业任务的代理AI基准测试平台。随着对代理AI的需求增加,此类系统旨在检测和记录安全隐患、程序违规等关键事件。与大多数专注于模拟或数字环境的基准测试不同,我们的工作解决了在真实世界中评估代理的挑战。本文改进了之前的评估函数,以评估代理AI在多样化真实任务中的性能。数据集包含工厂、仓库和零售现场采集的图像和视频。任务通过与现场工人和管理人员的访谈精心设计。评估结果证实,考虑多模态大语言模型(如GPT-4o)特性进行性能评估是可行的。此外,本研究确定了所提新评估方法的有效性和局限性。完整数据集和评估程序可在网站(https://en-documents.research.global.fujitsu.com/fieldworkarena/)上公开获取。

英文摘要

This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With demand for agentic AI increasing, such agents are being built to detect and document safety hazards, procedural violations, and other critical incidents across real-world manufacturing and retail environments. Whereas most agentic AI benchmarks focus on performance in simulated or digital environments, our work addresses the fundamental challenge of evaluating agents in the real world. In this paper, we improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. Our dataset comprises images and videos captured on site in factories, warehouses, and retail stores. Tasks were meticulously developed through interviews with site workers and managers. Evaluation results confirmed that performance evaluation that accounts for the characteristics of multimodal LLMs (MLLMs) such as GPT-4o is feasible. Furthermore, this study identifies both the effectiveness and limitations of the proposed new evaluation methodology. The complete dataset and evaluation program are publicly accessible on the website (https://en-documents.research.global.fujitsu.com/fieldworkarena/).

2604.12277 2026-06-09 cs.LG 版本更新

Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

模型知晓其捷径:部署时的捷径缓解

Jiayi Li, Shijie Tang, Gün Kaynar, Shiyi Du, Carl Kingsford

发表机构 * Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University(卡内基梅隆大学计算机科学学院雷与斯蒂芬妮·莱恩计算生物学系)

AI总结 研究提出在部署时通过无监督梯度归因缓解预训练文本编码器的捷径学习,证明部署时的缓解在信息理论上受训练时缓解的限制,并在情感分类、毒性检测和自然语言推理中取得显著性能提升。

详情
AI中文摘要

预训练文本编码器容易产生捷径学习,依赖于token-标签相关性,一旦在部署时分布偏移就会失效。现有捷径缓解方法主要在训练时操作,假设能获取训练数据、训练动态或捷径注释,这些在部署时难以获得,只有收敛的模型存在。我们证明该模型本身足以在部署时缓解捷径:一个偏置模型内部化了其学习捷径的信号,可通过无监督梯度归因捕捉。我们进一步证明部署时的缓解在信息理论上受训练时缓解的限制。尽管如此,利用这一梯度信号,我们提出的无监督部署时捷径缓解框架Shortcut Guardrail,通过恢复捷径分布偏移下的性能,在情感分类、毒性检测和自然语言推理中达到或超越训练时基线性能。

英文摘要

Pretrained text encoders are prone to shortcut learning, relying on token-label correlations that fail once the distribution shifts in deployment. Existing shortcut mitigation methods mainly operate at training time and assume access to training data, training dynamics, or shortcut annotations, which are rarely available during deployment, where only the converged model remains. We show that this model alone suffices to mitigate shortcuts during deployment: a biased model internalizes a signal of its learned shortcuts that can be captured via unsupervised gradient-based attribution. We further prove that deployment-time mitigation is information-theoretically upper-bounded by training-time mitigation. Nevertheless, exploiting this gradient signal, our proposed unsupervised deployment-time shortcut mitigation framework for pretrained text encoders, Shortcut Guardrail, recovers substantial performance under shortcut distribution shift, matching or outperforming training-time baselines across sentiment classification, toxicity detection, and natural language inference.

2604.10999 2026-06-09 cs.CV 版本更新

TraversalBench: Challenging Paths to Follow for Vision Language Models

TraversalBench: 为视觉语言模型设计的复杂路径挑战测试集

Clara Petrova, Zhuo Chen, Marin Soljačić

发表机构 * Massachusetts Institute of Technology, Department of Physics(麻省理工学院物理系) Massachusetts Institute of Technology, Institute for Data, Systems, and Society(麻省理工学院数据、系统与社会研究所) NSF AI Institute for Artificial Intelligence and Fundamental Interactions(国家科学基金会人工智能与基本相互作用AI研究所)

AI总结 本文提出TraversalBench,一个用于评估视觉语言模型复杂视觉路径跟随能力的受控基准测试集,发现自相交是主要困难来源,揭示了模型在路径感知上的局限性。

详情
AI中文摘要

视觉语言模型(VLMs)在多模态基准测试中表现优异,但其遵循复杂视觉路径的能力尚未充分测试。我们引入TraversalBench,一个用于精确视觉路径遍历的受控基准测试集。每个实例包含一条具有唯一起始标记和标签顶点的连续折线;模型必须从起点到终点恢复顺序序列。该基准测试平衡了自相交次数、曲折度、顶点数量和附近干扰线的影响,同时限制对OCR、世界知识或开放式规划的依赖。我们发现自相交是主要困难来源。一次交叉分析将失败定位到交叉点:性能在第一次交叉前稳定,然后在模型必须解决正确延续时急剧下降。附近干扰因素有较弱但累积的影响,辅助阅读顺序基准揭示了一致的左右偏见。这些结果描述了VLMs如何感知和失败于视觉路径。最后,我们将TraversalBench定位为视觉语言模型持续和精确视觉定位基准测试集的新贡献。代码、基准测试数据和渲染示例可在https://github.com/clarapetrova/traversalbench获取。

英文摘要

Vision-language models (VLMs) perform strongly on multimodal benchmarks, but their ability to follow complex visual paths remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a continuous polyline with a unique start marker and labeled vertices; models must recover the ordered sequence encountered from start to finish. The benchmark balances self-intersection count, tortuosity, vertex count, and nearby confounding lines while limiting reliance on OCR, world knowledge, or open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis localizes failures to crossing points: performance is stable before the first crossing, then drops sharply when the model must resolve the correct continuation. Nearby confounders have weaker but compounding effects, and an auxiliary reading-order benchmark reveals a consistent left-to-right bias. Together, these results characterize how VLMs perceive and fail on visual paths. Finally, we position TraversalBench as a new contribution to the growing line of sustained and precise visual grounding benchmarks for VLMs. Code, benchmark data, and rendered examples are available at https://github.com/clarapetrova/traversalbench.
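Since the abstract identifies self-intersection count as the dominant difficulty factor, a minimal sketch of how such a count could be computed for a polyline may help. The function names are hypothetical and this is not the benchmark's own code (which is linked above); it counts only proper crossings between non-adjacent segments and ignores collinear overlaps and touching endpoints.

```python
from itertools import combinations

def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a single interior point."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)

def count_self_intersections(vertices):
    """Count proper crossings between non-adjacent segments of a polyline."""
    segments = list(zip(vertices, vertices[1:]))
    count = 0
    for (i, s), (j, t) in combinations(enumerate(segments), 2):
        if j - i < 2:  # adjacent segments share a vertex, which is not a crossing
            continue
        if segments_cross(*s, *t):
            count += 1
    return count
```

For example, the polyline `[(0, 0), (2, 2), (2, 0), (0, 2)]` crosses itself once, at `(1, 1)`, so `count_self_intersections` returns 1 for it.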

2604.10628 2026-06-09 cs.SD cs.CL cs.IR 版本更新

BMdataset: A Musicologically Curated LilyPond Dataset

BMdataset:一个音乐学精心编纂的LilyPond数据集

Matteo Spanio, Ilay Guler, Antonio Rodà

发表机构 * Department of Information Engineering , University of Padua(信息工程系,帕多瓦大学) Boston University(波士顿大学)

AI总结 本文提出BMdataset,包含393个LilyPond乐谱,用于音乐理解研究,并引入LilyBERT模型,证明小规模专家编纂数据集在音乐识别任务中优于大规模噪声数据集。

Comments Submitted to SMC2026

详情
AI中文摘要

符号音乐研究几乎仅依赖MIDI数据集,而基于文本的乐谱格式如LilyPond尚未被探索。我们提出了BMdataset,包含393个LilyPond乐谱(2,646个乐章),由专家直接从原巴洛克手稿转录,涵盖作曲家、音乐形式、乐器和乐章属性的元数据。基于此资源,我们引入LilyBERT(权重可在https://huggingface.co/csc-unipd/lilybert获取),一种基于CodeBERT的编码器,通过扩展词汇表加入115个LilyPond特定标记并进行掩码语言模型预训练。在非领域数据集Mutopia上的线性探测显示,尽管其规模较小(约90M tokens),仅在BMdataset上微调的表现优于在完整PDMX数据集(约15B tokens)上的连续预训练,证明小规模专家编纂数据集在音乐理解任务中更有效。结合广泛预训练与领域特定微调获得最佳结果(84.3%作曲家准确率),证实了两种数据制度的互补性。我们发布数据集、分词器和模型,以建立LilyPond的表示学习基准。

英文摘要

Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.

2604.09967 2026-06-09 cs.LG cs.AI 版本更新

Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

Muon²:通过自适应二阶矩预条件提升Muon

Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, Zheng Zhang

发表机构 * University of California at Santa Barbara(加州大学圣巴巴拉分校) University at Albany, SUNY(纽约州立大学奥尔巴尼分校)

AI总结 Muon²通过引入Adam风格的自适应二阶矩预条件改进了Muon的效率与质量,加快了极分解近似的收敛并提升了实际正交化质量,实验表明其在参数规模达13B的预训练任务中表现更优。

Comments Preprint, subject to update

详情
AI中文摘要

Muon已成为一种有前途的大规模基础模型预训练优化器,它通过迭代正交化利用神经网络更新的矩阵结构。然而,Muon的正交化质量取决于所执行的牛顿-舒尔茨(Newton-Schulz,NS)迭代次数,而其计算和通信成本不可忽略,由此带来效率挑战。我们提出Muon²,作为Muon的扩展,通过在正交化之前应用Adam风格的自适应二阶矩预条件来同时提高质量和效率。我们的关键见解是,Muon中极分解近似的核心挑战在于动量矩阵的病态性,而Muon²显著改善了其谱,从而更快地收敛到实践中足够的正交化。我们进一步通过方向对齐来刻画实际正交化质量,在该度量下,Muon²在每个极分解步骤上均显著优于Muon。在GPT、LLaMA和专家混合预训练实验中(参数规模最高达13B),Muon²(及其保留大部分收益的内存高效变体Muon²-F)始终优于Muon及其变体,同时将NS迭代次数减少40%,并在达到相同损失时相比Muon节省多达四分之一的训练时间。

英文摘要

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, the orthogonalization quality of Muon hinges on the number of Newton--Schulz (NS) iterations performed, which poses efficiency challenges due to its non-trivial computation and communication cost. We propose Muon$^2$, an extension of Muon, to improve both quality and efficiency by applying Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, whose spectrum is substantially improved by Muon$^2$, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT, LLaMA, and Mixture-of-Experts pre-training experiments up to 13B parameters, Muon$^2$ (and its memory-efficient variant Muon$^2$-F, which preserves most of its benefits) consistently outperforms Muon and its variants while reducing NS iterations by 40%, and saves up to a quarter of the training time relative to Muon at equal loss.
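The two ingredients named in this abstract, Newton-Schulz orthogonalization and an Adam-style second-moment rescaling applied beforehand, can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: it uses the classic cubic Newton-Schulz iteration rather than whatever tuned polynomial the paper employs, the preconditioner is a plain elementwise division, and all function names are hypothetical.

```python
import numpy as np

def newton_schulz(M, steps):
    """Approximate the orthogonal polar factor of M with the cubic Newton-Schulz iteration."""
    # Frobenius scaling puts every singular value in (0, 1], inside the convergence region.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # maps each singular value s to 1.5*s - 0.5*s**3
    return X

def precondition(momentum, second_moment, eps=1e-8):
    """Adam-style elementwise rescaling (assumed form) applied before orthogonalization."""
    return momentum / (np.sqrt(second_moment) + eps)

def orthogonality_error(X):
    """Frobenius distance between the smaller Gram matrix of X and the identity."""
    gram = X @ X.T if X.shape[0] <= X.shape[1] else X.T @ X
    return np.linalg.norm(gram - np.eye(gram.shape[0]))
```

Small singular values grow by only about 1.5x per step in this iteration, so an ill-conditioned input needs more steps before `orthogonality_error` becomes small; that dependence on conditioning is the motivation the abstract gives for rescaling the momentum matrix first.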

2603.23916 2026-06-09 cs.CV cs.AI 版本更新

DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning

DecepGPT: 基于多文化数据集和鲁棒多模态学习的模式驱动欺骗检测

Jiajian Huang, Dongliang Zhu, Zitong YU, Hui Ma, Jiayu Zhang, Chunmei Zhu, Xiaochun Cao

发表机构 * Great Bay University(大湾区大学) Wuhan University(武汉大学) Sun Yat-sen University(中山大学)

AI总结 本文提出DecepGPT,通过构建包含结构化线索描述和推理链的推理数据集,释放多文化数据集T4-Deception,并提出SICS和DMC模块,实现多模态欺骗检测的鲁棒学习,实验表明其在领域内和跨领域场景中均取得最佳性能。

Comments 17 pages, 11 figures, 12 tables

详情
AI中文摘要

多模态欺骗检测旨在通过分析音频视觉线索来识别欺骗行为,用于刑侦和安全领域。在高风险环境中,调查人员需要可验证的证据将音频视觉线索与最终决策联系起来,并且需要在不同领域和文化背景下可靠地泛化。然而,现有基准仅提供二元标签而无中间推理线索。数据集也较小,场景覆盖有限,导致捷径学习。我们通过三个贡献解决这些问题:首先,我们通过增强现有基准并添加结构化线索级描述和推理链来构建推理数据集,使模型输出可审计报告。其次,我们发布T4-Deception,一个基于统一的『To Tell The Truth』电视格式在四个国家实施的多文化数据集。该数据集包含1695个样本,是目前最大的非实验室欺骗检测数据集。第三,我们提出两个模块,以在小数据条件下实现鲁棒学习。Stabilized Individuality-Commonality Synergy (SICS) 通过结合可学习的全局先验与样本自适应残差,优化多模态表示,随后通过极性感知调整双向校准表示。Distilled Modality Consistency (DMC) 通过知识蒸馏将模态特定预测与融合的多模态预测对齐,以防止单模态捷径学习。在三个已建立的基准和我们新的数据集上的实验表明,我们的方法在领域内和跨领域场景中均取得最佳性能,同时在不同文化背景下表现出优越的迁移能力。数据集和代码将被发布。

英文摘要

Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and code will be released.

2604.08849 2026-06-09 cs.CL cs.AI cs.DB cs.MA cs.SC 版本更新

SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

SatIR:面向临床试验匹配的可扩展、高召回率、基于约束满足的信息检索

Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam

发表机构 * Department of Computer Science, Stanford University(斯坦福大学计算机科学系) Samueli Electrical and Computer Engineering, UCLA(UCLA Samueli电气与计算机工程系) Department of Computer Science and Informatics, Emory University(埃默里大学计算机科学与信息学系) Mayo Clinic(梅奥诊所)

AI总结 SatIR通过将临床试验资格条件和摘要转化为形式约束,结合SMT、关系代数和大语言模型,提升了临床试验匹配的召回率和效率,优于基于相似度的基线方法。

详情
AI中文摘要

许多重要的检索问题不仅仅是语义相似性问题,而是约束满足问题:被检索的项目应与查询在主题上相关,并满足涉及否定、时间条件、数值阈值、例外、本体关系和不完整证据的显式要求。我们在临床试验匹配中研究这一挑战,这是一个高风险的测试平台,其中有用的试验必须既针对患者的医疗需求,又满足复杂的资格标准。我们提出了SatIR,一种用于临床试验匹配的可扩展的基于约束的检索方法。SatIR将试验资格标准和摘要转换为形式约束,然后通过在数据库上执行这些约束来检索患者-试验对。系统结合了可满足性模理论(SMT)、关系代数、医学本体对齐和大语言模型(LLMs):形式方法提供可执行且可检查的匹配,而LLMs将模糊、不完整和隐含的临床信息转换为显式、可控的约束表示。在SIGIR 2016患者-试验集合和源自TREC 2022的TREC-2022-RetrievalSubset基准上,SatIR在考虑资格的检索方面始终优于基于相似度的基线方法。与TrialGPT式检索相比,SatIR在SIGIR 2016上为每名患者多检索出32%至72%的相关且合格的试验,在TREC-2022-RetrievalSubset上合格试验召回率提高到1.8至3.2倍。检索速度快,在3,621个SIGIR试验上每名患者仅需146毫秒。

英文摘要

Many important retrieval problems are not merely problems of semantic similarity, but problems of constraint satisfaction: a retrieved item should be topically relevant to a query and satisfy explicit requirements involving negation, temporal conditions, numeric thresholds, exceptions, ontological relations, and incomplete evidence. We study this challenge in clinical trial matching, a high-stakes test bed where a useful trial must both address a patient's medical needs and satisfy complex eligibility criteria. We propose SatIR, a scalable constraint-based retrieval method for clinical trial matching. SatIR converts trial eligibility criteria and summaries into formal constraints, then retrieves patient--trial pairs by executing these constraints over a database. The system combines Satisfiability Modulo Theories (SMT), relational algebra, medical ontology grounding, and large language models (LLMs): formal methods provide executable and inspectable matching, while LLMs convert ambiguous, incomplete, and implicit clinical information into explicit, controllable constraint representations. Across the SIGIR 2016 patient--trial collection and TREC-2022-RetrievalSubset, a benchmark derived from TREC 2022, SatIR consistently improves eligibility-aware retrieval over similarity-based baselines. Relative to TrialGPT-style retrieval, SatIR retrieves 32%--72% more relevant-and-eligible trials per patient on SIGIR 2016 and achieves $1.8$--$3.2\times$ higher eligible-trial recall on TREC-2022-RetrievalSubset. Retrieval is fast, requiring only 146 milliseconds per patient over 3,621 SIGIR trials.
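The idea of retrieval as constraint execution, including tolerance for incomplete evidence, can be sketched in a few lines. This is a deliberately simplified stand-in: plain Python predicates replace the SMT and relational-algebra machinery the abstract describes, and the `Constraint` class, the field names, and the policy of keeping trials whose constraints are merely unknown are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Constraint:
    attribute: str                  # patient attribute the constraint inspects (hypothetical schema)
    test: Callable[[object], bool]  # predicate over the recorded value

def check(patient: dict, c: Constraint) -> Optional[bool]:
    """Return True or False when the attribute is recorded, None when evidence is missing."""
    value = patient.get(c.attribute)
    return None if value is None else c.test(value)

def eligible_trials(patient: dict, trials: dict) -> list:
    """Keep trials with no violated constraint; unknown results do not exclude a trial."""
    return [trial_id for trial_id, constraints in trials.items()
            if all(check(patient, c) is not False for c in constraints)]
```

Treating a missing attribute as "unknown" rather than "violated" favours recall: a trial is dropped only when some constraint is demonstrably false for the patient.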

2604.08479 2026-06-09 cs.CL 版本更新

AI generates well-liked but templatic empathic responses

AI生成受欢迎但模板化的共情回应

Emma S. Gueorguieva, Hongli Zhan, Jina Suh, Javier Hernandez, Tatiana Lau, Junyi Jessy Li, Desmond C. Ong

发表机构 * Department of Psychology, The University of Texas at Austin(心理学系,德克萨斯大学奥斯汀分校) Department of Linguistics, The University of Texas at Austin(语言学系,德克萨斯大学奥斯汀分校) Department of Computer Science and Engineering, The University of Washington(计算机科学与工程系,华盛顿大学) Microsoft Research(微软研究院) Toyota Research Institute(丰田研究院)

AI总结 研究发现LLM生成的共情回应高度模板化,采用10种共情语言策略,覆盖81-92%的回应内容,而人类写作则更多样。

详情
AI中文摘要

最近的研究显示,越来越多的人转向大型语言模型(LLMs)寻求情感支持,并认为LLM的回应比人类写的更具共情性。我们提出原因:LLM学习并一致部署了一种受欢迎的共情模板。我们开发了10种共情语言“策略”分类,包括验证他人感受和 paraphrasing,并将此分类应用于分析人类和LLM生成共情回应的语言。在两项研究中,比较了3265个AI生成(由六个模型生成)和1290个人类写作的回应,发现LLM回应在话语功能层面高度公式化。我们发现一个模板——一种策略序列——匹配83-90%的LLM回应(在持出样本中为60-83%),当匹配时覆盖81-92%的回应内容。相比之下,人类写作的回应更多样化。我们最后讨论了这对AI生成共情未来的影响。

英文摘要

Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across two studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template -- a structured sequence of tactics -- that matches 83--90% of LLM responses (and 60--83% in a held-out sample), and when a response matches, the template covers 81--92% of it. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.

2604.07848 2026-06-09 cs.LG q-bio.MN 版本更新

Information-Theoretic Requirements for Gradient-Based Task Affinity Estimation in Multi-Task Learning

基于梯度的任务亲和性估计在多任务学习中的信息论要求

Jasper Zhang, Bryan Cheng

发表机构 * Great Neck South High School(Great Neck South 高中)

AI总结 本文探讨了多任务学习中基于梯度的任务亲和性估计的信息论要求,发现任务间的样本重叠度决定梯度对齐是否有意义,并揭示了样本重叠度上的相变特性。

Comments 8 pages, 4 figures. ACM BCB 2026 Short Paper. Accepted at workshop on AI for Accelerated Materials Design, Foundation Models for Science: Real-World Impact and Science-First Design, and Generative and Experimental Perspectives for Biomolecular Design at ICLR 2026

详情
AI中文摘要

多任务学习的结果显著不一致(有时联合训练有显著帮助,有时反而损害性能),但该领域缺乏一个原则性的框架来预测这些结果。我们指出基于梯度的任务分析背后一个基本但未言明的假设:任务必须共享训练实例,梯度冲突才能揭示真实的任务关系。当任务在相同输入上测量时,梯度对齐反映共享的机制结构;当在不相交的输入上测量时,任何表面上的信号都将任务关系与分布偏移混为一谈。我们发现这一样本重叠要求表现出明显的相变:重叠低于30%时,梯度-任务相关性在统计上与噪声无法区分;重叠高于40%时,它们能可靠地恢复已知的生物结构。在多个数据集上的全面验证取得了强相关性,并恢复了生物通路的组织结构。标准基准系统性地违反这一要求(MoleculeNet的重叠低于5%,TDC为8-14%),远低于梯度分析变得有意义的阈值。这为七年来不一致的MTL结果提供了第一个原则性解释。

英文摘要

Multi-task learning shows strikingly inconsistent results -- sometimes joint training helps substantially, sometimes it actively harms performance -- yet the field lacks a principled framework for predicting these outcomes. We identify a fundamental but unstated assumption underlying gradient-based task analysis: tasks must share training instances for gradient conflicts to reveal genuine relationships. When tasks are measured on the same inputs, gradient alignment reflects shared mechanistic structure; when measured on disjoint inputs, any apparent signal conflates task relationships with distributional shift. We discover this sample overlap requirement exhibits a sharp phase transition: below 30% overlap, gradient-task correlations are statistically indistinguishable from noise; above 40%, they reliably recover known biological structure. Comprehensive validation across multiple datasets achieves strong correlations and recovers biological pathway organization. Standard benchmarks systematically violate this requirement -- MoleculeNet operates at <5% overlap, TDC at 8-14% -- far below the threshold where gradient analysis becomes meaningful. This provides the first principled explanation for seven years of inconsistent MTL results.

2604.07421 2026-06-09 cs.LG 版本更新

SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion

SPAMoE:一种频谱感知的混合运算框架用于全波形反演

Zhenyu Wang, Peiyuan Li, Yongxiang Shi, Ruoyu Wu, Chenfei Liao, Lei Zhang

发表机构 * China University of Mining and Technology - Beijing(中国矿业大学(北京)) City University of Hong Kong (Dongguan)(香港城市大学(东莞)) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出SPAMoE框架,通过频谱保护编码器和动态频谱分解路由机制,解决多尺度地质特征的频率纠缠问题,提升全波形反演的效率与稳定性。

详情
AI中文摘要

全波形反演(FWI)对于重建高分辨率地下速度模型至关重要,但计算成本高且问题不明确。尽管深度学习方法有潜力提高效率,但现有卷积神经网络(CNNs)和单范式神经运算(NOs)在处理多尺度地质特征的频率纠缠方面存在根本性困难。为此,我们提出了Spectral-Preserving Adaptive MoE(SPAMoE),一种新的频谱感知框架,用于解决具有复杂多尺度结构的逆问题。我们的方法引入了Spectral-Preserving DINO编码器,强制编码表示的高频到低频能量比的下限,缓解高频崩溃并稳定后续频域建模。此外,我们设计了一种新的频谱分解和路由机制,动态地将频率带分配给由FNO、MNO和LNO组成的专家混合(MoE)集合。在十个OpenFWI子数据集上,实验表明,SPAMoE相对于最佳官方报告的OpenFWI基线,平均MAE减少了44.4%,从而建立了学习驱动的全波形反演的新架构框架。我们的代码和数据可在https://github.com/zhenyuwang12366/SPAMoE获取。

英文摘要

Full-waveform inversion (FWI) is pivotal for reconstructing high-resolution subsurface velocity models but remains computationally intensive and ill-posed. While deep learning approaches promise efficiency, existing Convolutional Neural Networks (CNNs) and single-paradigm Neural Operators (NOs) struggle with one fundamental issue: frequency entanglement of multi-scale geological features. To address this challenge, we propose Spectral-Preserving Adaptive MoE (SPAMoE), a novel spectrum-aware framework for solving inverse problems with complex multi-scale structures. Our approach introduces a Spectral-Preserving DINO Encoder that enforces a lower bound on the high-to-low frequency energy ratio of the encoded representation, mitigating high-frequency collapse and stabilizing subsequent frequency-domain modeling. Furthermore, we design a novel Spectral Decomposition and Routing mechanism that dynamically assigns frequency bands to a Mixture-of-Experts (MoE) ensemble comprising FNO, MNO, and LNO. On the ten OpenFWI sub-datasets, experiments show that SPAMoE reduces the average MAE by 44.4% relative to the best officially reported OpenFWI baseline, thereby establishing a new architectural framework for learning-based full-waveform inversion. Our code and data are available at https://github.com/zhenyuwang12366/SPAMoE
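The "lower bound on the high-to-low frequency energy ratio" mentioned in this abstract can be made concrete with a small NumPy sketch. The radial cutoff of 0.25 cycles per sample, the hinge-style penalty, and both function names are assumptions chosen for illustration; the paper's actual definition and enforcement mechanism may differ.

```python
import numpy as np

def high_low_energy_ratio(x, cutoff=0.25):
    """Ratio of 2D spectral energy above vs. at or below a radial frequency cutoff."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(x))) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(x.shape[0]))  # cycles per sample along rows
    fx = np.fft.fftshift(np.fft.fftfreq(x.shape[1]))  # cycles per sample along columns
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    high = spectrum[radius > cutoff].sum()
    low = spectrum[radius <= cutoff].sum()
    return high / (low + 1e-12)

def ratio_penalty(x, min_ratio, cutoff=0.25):
    """Hinge penalty (assumed form) that is zero once the ratio reaches min_ratio."""
    return max(0.0, min_ratio - high_low_energy_ratio(x, cutoff))
```

A constant feature map has all of its energy at zero frequency and incurs the full penalty, whereas a checkerboard pattern concentrates its energy at the highest frequency and incurs none.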

2604.06210 2026-06-09 cs.CL cs.AI cs.CY cs.LG 版本更新

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

基于价值码本的LLM文化价值对齐的分布式开放式评估

Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Hyunjin Hwang, Roy Ka-Wei Lee, Xing Xie, JinYeong Bak

发表机构 * KAIST(韩国科学技术院)

AI总结 提出DOVE框架,通过率失真变分优化构建价值码本,利用不平衡最优传输度量分布对齐,解决LLM文化价值评估中的构造-组成-上下文挑战。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

随着LLM在全球部署,使其文化价值取向对齐对于安全性和用户参与至关重要。然而,现有基准面临构造-组成-上下文($C^3$)挑战:依赖判别性、多项选择格式,探测的是价值知识而非真实取向,忽视亚文化异质性,且与真实世界的开放式生成不匹配。我们引入DOVE,一个直接比较人类撰写的文本分布与LLM生成输出的分布式评估框架。DOVE利用率失真变分优化目标从10K文档中构建紧凑的价值码本,将文本映射到结构化价值空间以过滤语义噪声。使用不平衡最优传输测量对齐,捕捉文化内分布结构和子群体多样性。在12个LLM上的实验表明,DOVE实现了优越的预测有效性,与下游任务的相关性达到31.56%,同时每个文化仅需500个样本即可保持高可靠性。

英文摘要

As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the Construct-Composition-Context ($C^3$) challenge: they rely on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and are mismatched with real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and subgroup diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.

2601.07994 2026-06-09 cs.CL cs.AI 版本更新

DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

DYCP:基于LLMs的长格式对话动态上下文修剪

Nayoung Choi, Jonathan Zhang, Jinho D. Choi

发表机构 * Computer Science Emory University(计算机科学 埃默里大学)

AI总结 DYCP通过动态识别和检索对话段落,提升长格式对话中LLM的上下文管理效率,实现更精确的上下文选择和推理效率提升。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于长格式对话,其中话题频繁变化。尽管最近的LLMs支持扩展的上下文窗口,但在实践中仍需有效管理对话历史,以应对推理成本和延迟限制。我们提出了DyCP,一种轻量级的上下文管理方法,该方法在LLM外部实现,能够根据当前轮次动态识别和检索相关对话段落,无需离线内存构建。DyCP在不预设话题边界的情况下管理对话上下文,保持对话的顺序性,实现自适应和高效的上下文选择。在三个长格式对话基准(LoCoMo、MT-Bench+和SCM4LLMs)和多个LLM后端上,DyCP在下游生成任务中实现了具有竞争力的答案质量,具有更选择性的上下文使用和改进的推理效率。

英文摘要

Large Language Models (LLMs) increasingly operate over long-form dialogues with frequent topic shifts. While recent LLMs support extended context windows, dialogue history still needs to be managed efficiently in practice because of inference cost and latency constraints. We present DyCP, a lightweight context management method implemented outside the LLM that dynamically identifies and retrieves relevant dialogue segments conditioned on the current turn, without offline memory construction. DyCP manages dialogue context while preserving the sequential nature of dialogue without predefined topic boundaries, enabling adaptive and efficient context selection. Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation, with more selective context usage and improved inference efficiency.