arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.26408 2026-05-29 cs.LG stat.ME stat.ML

Function-Valued Causal Influence in Nonlinear Time Series

Valentina V. Kuskova, Dmitry Zaytsev, Michael Coppedge

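Affiliations Lucy Family Institute for Data & Society, University of Notre Dame, Notre Dame, Indiana, USA; Department of Political Science, University of Notre Dame, Notre Dame, Indiana, USA

AI summary Scalar edge scores, the usual output of nonlinear time-series causal discovery, hide state-dependent functional effects. The paper proposes a framework based on Individual Conditional Expectation that estimates causal response functions directly from Neural Additive Vector Autoregression models, revealing distinct functional behaviors that scalar scores cannot tell apart.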

Comments 26 pages, 6 tables, 8 figures

Abstract

Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the true object learned by nonlinear autoregressive models: a state-dependent function whose effect varies across regimes, magnitudes, and contexts. We formalize function-valued causal influence for additive, contribution-decomposable architectures and show that scalar causal scores constitute a severe information bottleneck, conflating between-state variation with within-state residual noise. Using Neural Additive Vector Autoregression as a representative architecture, we introduce a practical framework based on Individual Conditional Expectation for estimating causal response functions directly from trained models. Through controlled synthetic experiments, we demonstrate that edges with indistinguishable scalar scores can exhibit qualitatively different functional behaviors, including monotonic, thresholded, saturating, and sign-changing effects. An applied case study on democratic development further shows that function-valued analysis reveals regime-specific and asymmetric causal structure systematically missed by score-centric approaches.
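
The point about indistinguishable scalar scores can be illustrated with a minimal sketch. It assumes two hypothetical additive models written by hand (`monotonic_model`, `v_shaped_model`), made-up input rows, and a made-up scalar score (the spread of the averaged response); none of this is the paper's NAVAR implementation or its scoring rule. The V-shaped effect is rescaled on purpose so that both edges receive the same score.

```python
import statistics

def ice_curves(model, rows, feature, grid):
    """Sweep one input over `grid` for every row, holding the other inputs fixed."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            probe = list(row)
            probe[feature] = value
            curve.append(model(probe))
        curves.append(curve)
    return curves

def mean_curve(curves):
    """Average the per-row ICE curves point by point."""
    return [statistics.fmean(points) for points in zip(*curves)]

def scalar_score(curve):
    """Hypothetical scalar edge score: spread of the averaged response."""
    return statistics.pstdev(curve)

GRID = [-1.0, -0.5, 0.0, 0.5, 1.0]
ROWS = [[0.2, -1.0], [0.7, 0.3], [-0.4, 2.0]]  # made-up lagged inputs (x0, x1)

# Rescale the V-shaped effect so that both edges get the same scalar score.
SCALE = statistics.pstdev(GRID) / statistics.pstdev([abs(v) for v in GRID])

def monotonic_model(x):
    return x[0] + 0.5 * x[1]

def v_shaped_model(x):
    return SCALE * abs(x[0]) + 0.5 * x[1]

mono = mean_curve(ice_curves(monotonic_model, ROWS, 0, GRID))
vshape = mean_curve(ice_curves(v_shaped_model, ROWS, 0, GRID))

print("scores:", round(scalar_score(mono), 6), round(scalar_score(vshape), 6))
print("monotonic response:", [round(v, 3) for v in mono])
print("V-shaped response: ", [round(v, 3) for v in vshape])
```

The two printed scores agree, yet one response rises steadily while the other changes direction at zero, which is the kind of difference a single number cannot carry.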

2605.26194 2026-05-29 cs.LG

On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series

Sharmita Dey, Diego Paez-Granados

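Affiliations ETH Zurich; Swiss Paraplegic Research

AI summary Using PathoFM, an encoder-centric Transformer pretrained with three complementary objectives (Local Completion, Temporal Continuity, and Unsupervised In-Context Dynamics), the paper studies how the inductive biases of pretraining objectives affect transfer across task types and subjects, and finds that dynamics-centric mixtures yield the most balanced transferable representations.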

Abstract

Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting). These constraints make foundation-model pretraining appealing but raise an important question: which inductive biases should the pretraining objective impose so that representations transfer across task types and subjects? We study this question in pathological gait analysis for spinal cord injury (SCI) via PathoFM, an encoder-centric transformer pretrained on multivariate gait windows with three complementary objectives: Local Completion (reconstruct contiguous masked spans to enforce local structure), Temporal Continuity (predict a masked mid-horizon continuation from an observed prefix to enforce smoothness and causal consistency), and Unsupervised In-Context Dynamics (support-query reconstruction conditioned on subject exemplar windows via attention). Empirically comparing objective families (grouping/contrastive, dynamics-based, and generative reconstruction), we find that dynamics-centric mixtures produce the most balanced transfer: grouping objectives favor discriminative margins but can degrade magnitude fidelity needed for continuous targets, whereas reconstruction-only objectives preserve waveform structure but may underperform on classification. Overall, combining local reconstruction with temporal continuity, and adding in-context conditioning when exemplar access is realistic, yields robust subject-generalizing representations.
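
The first two objectives differ mainly in which time steps are hidden from the encoder. A minimal sketch, assuming a toy univariate window and hypothetical helper names (`local_completion_mask`, `temporal_continuity_mask`, `apply_mask`); the span position, span length, prefix length, and zero fill value are illustrative choices, not PathoFM's settings:

```python
def local_completion_mask(length, start, span):
    """True marks a hidden time step: one contiguous span inside the window."""
    return [start <= t < start + span for t in range(length)]

def temporal_continuity_mask(length, prefix):
    """True marks a hidden time step: everything after the observed prefix."""
    return [t >= prefix for t in range(length)]

def apply_mask(window, mask, fill=0.0):
    """Replace hidden time steps with a fill value; the model must reconstruct them."""
    return [fill if hidden else value for value, hidden in zip(window, mask)]

window = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up gait signal

local_input = apply_mask(window, local_completion_mask(len(window), start=2, span=3))
continuity_input = apply_mask(window, temporal_continuity_mask(len(window), prefix=3))

print(local_input)       # hidden span has observed context on both sides
print(continuity_input)  # hidden continuation has context only before it
```

In the first case the hidden span can be filled in from both neighbors, while in the second the model only sees the past, which is one way to read the "causal consistency" the abstract attributes to Temporal Continuity.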

2605.26193 2026-05-29 cs.LG cs.AI

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

Qideng Tang, Dai Chaofan, Wubin Ma, Yahui Wu, Haohao Zhou, Tao Zhang, Huan Li, Dalin Zhang

Affiliations National Key Laboratory of Information Systems Engineering, National University of Defense Technology; College of Systems Engineering, National University of Defense Technology; Zhejiang University; Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University

AI summary Proposes CoAD, a framework in which a classification module generates probabilistic soft masks that guide a reconstruction module, combining the complementary strengths of the two paradigms to detect subtle and complex anomalies; it significantly outperforms existing methods on benchmark datasets.

Comments 15 pages, submitted to KDD 2026

Abstract

Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for solving the above problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.

2605.26029 2026-05-29 cs.AI cs.CL

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng

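Affiliations Tsinghua University; University of Illinois Urbana-Champaign; Carnegie Mellon University; University of Chicago; Adobe

AI summary Proposes CausaLab, an environment that uses synthetic laboratory tasks to evaluate both the prediction accuracy of LLM agents in causal discovery and their ability to recover causal mechanisms, and finds a significant gap between the two.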

Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
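
To make the structural metric concrete, here is a minimal sketch of an edge-level $F_1$ between a true and a recovered causal graph. It assumes directed edges are compared as exact (cause, effect) pairs; the function name `edge_f1` and the toy graphs are hypothetical, and the paper's exact definition of "all-edge" $F_1$ may differ.

```python
def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges, each given as a (cause, effect) pair."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)

# Made-up graphs: the recovered graph reverses one edge and invents another.
true_graph = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
recovered_graph = [("A", "B"), ("B", "C"), ("D", "C"), ("A", "E")]

score = edge_f1(true_graph, recovered_graph)
print(score)
```

Here two of four recovered edges are correct and two of four true edges are found, so precision and recall are both 0.5. An agent could still predict the held-out outcome well from the correct half of the graph, which is the gap between task accuracy and structure recovery that the abstract reports.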

2605.25299 2026-05-29 cs.CV cs.LG

A Principled Self-Referenced Early Stopping Approach for Deep Image Prior

Chaoyan Huang, Cheng-Han Huang, Ismail R. Alkhouri, Rongrong Wang

Affiliations Department of Computational Mathematics, Science, & Engineering, Michigan State University; Department of Electrical Engineering and Computer Science, University of Michigan; X Computational Physics Division, Los Alamos National Laboratory; Michigan Institute for Computational Discovery & Engineering, University of Michigan; Mathematical Sciences, Michigan State University

AI summary To address overfitting in Deep Image Prior (DIP), proposes an overfitting detection framework based on constructing pseudo self-referenced images, yielding early stopping methods that do not require estimating the noise level.

Comments 35 pages, 10 figures, 14 tables

Abstract

Recently, Deep Image Prior (DIP) has demonstrated strong capabilities for solving inverse imaging problems (IIPs) by optimizing a randomly initialized convolutional neural network in a training-data-free regime. However, DIP suffers from overfitting to noisy measurements due to network over-parameterization, making early stopping (ES) essential. The most successful ES method tracks fluctuations in the running variance of the network output to detect overfitting. However, in many applications, these fluctuations may appear prematurely, leading to unstable reconstructions. In this paper, we first show that nearly optimal DIP early stopping can be achieved when two independent noisy copies of the degraded image are available. Motivated by this observation, and since obtaining two fully independent copies is infeasible, we propose an overfitting detection framework based on constructing pseudo self-referenced images, resulting in three IIP-specific algorithms. Our approach is further supported by theoretical results on single-reference validation, pseudo-validation estimation, and the impact of shared noise. Across different IIPs, ranging from natural image restoration to medical image reconstruction, and under varying noise levels and noise types, our methods consistently outperform existing DIP early stopping approaches, all without requiring an accurate estimate of the noise level.
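
The two-copy observation can be sketched on a toy 1-D signal. This is not DIP: as a stand-in for a network that fits smooth structure first and noise later, the sketch uses a simple iteration that adds a smoothed residual at each step. The signal, noise level, step count, and every name here are assumptions for illustration; the paper's pseudo self-referenced construction is not reproduced.

```python
import math
import random

def smooth(values):
    """Circular 3-tap smoothing; stands in for the smooth-first bias of DIP."""
    n = len(values)
    return [0.25 * values[i - 1] + 0.5 * values[i] + 0.25 * values[(i + 1) % n]
            for i in range(n)]

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
N, SIGMA, STEPS = 256, 0.5, 200
clean = [math.sin(2 * math.pi * 2 * i / N) for i in range(N)]
copy_fit = [c + rng.gauss(0.0, SIGMA) for c in clean]  # noisy copy used for fitting
copy_val = [c + rng.gauss(0.0, SIGMA) for c in clean]  # independent copy, validation only

estimate = [0.0] * N
val_errors, true_errors = [], []
for _ in range(STEPS):
    residual = [y - x for y, x in zip(copy_fit, estimate)]
    estimate = [x + r for x, r in zip(estimate, smooth(residual))]
    val_errors.append(sq_error(estimate, copy_val))
    true_errors.append(sq_error(estimate, clean))  # unknown in practice; for comparison only

# Stop where the error against the independent copy is smallest.
stop = min(range(STEPS), key=lambda k: val_errors[k])
print("stop at step", stop + 1)
print("true error at stop:", round(true_errors[stop], 2))
print("true error at end: ", round(true_errors[-1], 2))
```

Because the noise in the validation copy is independent of the fitted copy, the validation error differs from the unobservable true error by roughly a constant, so its minimum is a usable stopping signal in this toy setting.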

2605.25297 2026-05-29 cs.CL cs.AI cs.LG

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang

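Affiliations Alibaba Cloud Computing Co. Ltd, Hangzhou, China; School of Computer Science, Fudan University, Shanghai, China; School of Computer Science and Technology, Tongji University, Shanghai, China; Independent Researcher, United States

AI summary Proposes Eureka, a framework that treats feature engineering as an agentic code generation problem. Its three stages (Expert Agent, LLM Feature Factory, and Self-Evolving Alignment Engine) automatically generate executable feature code, significantly improving performance on 7 public benchmarks in healthcare, finance, and social domains and on GPU resource demand prediction at Alibaba Cloud.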

Comments accepted at NeurIPS 2025 Workshop, DASFAA 2026 (International Conference on Database Systems for Advanced Applications)

Journal ref Database Systems for Advanced Applications (DASFAA 2026), Lecture Notes in Computer Science, vol. 16540, pp. 528-540, Springer

Abstract

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.
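
Stage (3) can be pictured with a minimal sketch of a group-relative advantage computation in the style of GRPO. The abstract does not say how the two reward channels are combined, so the weighted sum in `dual_channel_reward`, the weight of 0.5, and the candidate scores are all assumptions; implementations also differ on the exact standard-deviation estimator.

```python
import statistics

def dual_channel_reward(metric_utility, semantic_alignment, weight=0.5):
    """Assumed combination rule: a weighted sum of the two reward channels."""
    return weight * metric_utility + (1.0 - weight) * semantic_alignment

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up (utility, alignment) scores for four code candidates from one prompt.
candidates = [(0.80, 0.9), (0.60, 0.7), (0.75, 0.2), (0.40, 0.6)]
rewards = [dual_channel_reward(u, a) for u, a in candidates]
advantages = group_relative_advantages(rewards)

for (utility, alignment), adv in zip(candidates, advantages):
    print(f"utility={utility:.2f} alignment={alignment:.2f} advantage={adv:+.2f}")
```

The third candidate has high metric utility but poor semantic alignment and ends up with a negative advantage, which shows why a second reward channel can change which generated features are reinforced.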

2605.25059 2026-05-29 cs.CV

VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding

Ruoyu Wang, Yong Liu, Sheng Tao, Yuhang Lin, Yukai Ma

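Affiliations Institute of Cyber-Systems and Control

AI summary Proposes VEOcc, a voxel-based recursive perception-and-assimilation framework whose spatio-temporal-aware online update strategy enables efficient, robust semantic occupancy prediction without initial scale estimation, reaching state-of-the-art performance in both local and embodied settings.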

Abstract

Crucial for autonomous exploration, online 3D occupancy prediction and mapping incrementally constructs dense spatial representations on the fly. However, recent Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, fundamentally limiting their operational efficiency. In this work, we present VEOcc, a voxel-centric framework formulated as a recursive perception-and-assimilation paradigm. By eliminating the need for initial scale estimation, VEOcc enables highly streamlined, open-ended map expansion. Furthermore, to robustly aggregate noisy temporal observations within the discrete voxel space, we propose a Spatio-Temporal-Aware Online Update Strategy. It integrates Cross-Temporal Logit Aggregation (TLA) for temporal consistency, Reliability-Aware Confidence Modulation (RCM) for spatial uncertainty calibration, and Confidence-Driven Incremental State Update (CSU) for robust global state assimilation. Extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate that VEOcc establishes new state-of-the-art performance in both local and embodied settings. Notably, zero-shot evaluations on self-collected video sequences further confirm its robust out-of-distribution generalization capability in completely unseen real-world environments. Ultimately, our framework provides an accurate and highly efficient solution for autonomous exploration. Code and supplementary visualizations are available on our project page: https://wryzju.github.io/VEOcc/.
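
The abstract does not give the formulas behind TLA, RCM, or CSU, so the following is only a generic sketch of confidence-weighted logit accumulation for a single voxel, a common way to fuse noisy per-frame predictions. The function names, class list, logits, and confidence values are all made up and should not be read as VEOcc's update rule.

```python
def update_voxel(state, observation_logits, confidence):
    """Add one frame's class logits to the voxel state, scaled by its confidence."""
    return [s + confidence * o for s, o in zip(state, observation_logits)]

def predicted_class(state):
    """Index of the class with the largest accumulated logit."""
    return max(range(len(state)), key=lambda k: state[k])

CLASSES = ["empty", "wall", "chair"]  # hypothetical label set
state = [0.0, 0.0, 0.0]

# (per-class logits, confidence in [0, 1]) for three successive frames.
observations = [
    ([0.2, 2.0, 0.1], 0.9),
    ([0.1, 1.5, 0.3], 0.8),
    ([0.0, 0.2, 3.0], 0.1),  # strong but unreliable outlier
]
for logits, confidence in observations:
    state = update_voxel(state, logits, confidence)

print(CLASSES[predicted_class(state)], [round(s, 2) for s in state])
```

The low-confidence third frame votes strongly for a different class but barely moves the accumulated state, which is the kind of robustness to noisy temporal observations the abstract describes.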

2605.24846 2026-05-29 cs.LG cs.AI

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Xiangtian Ji, Yuxin Chen, Zhengzhou Cai, Xiang Wang, An Zhang, Tat-Seng Chua

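Affiliations National University of Singapore; Beijing University of Posts and Telecommunications; University of Science and Technology of China

AI summary By analyzing cross-task activation strength, this study finds an extremely sparse set of keystone neurons in large language models whose removal causes model behavior to collapse. Building on this, it proposes a fine-tuning method that updates only the keystone neurons and matches or exceeds full-parameter fine-tuning on task performance while modifying few parameters.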

Abstract

Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insufficiently understood. In this work, we show that across a wide range of open-weight Transformers, a subset of neurons remains consistently highly activated during inference across tasks of multiple capability dimensions. By probing along cross-task activation strength, we isolate an extremely sparse subset whose removal causes a collapse in model behavior; we term these keystone neurons. Our analysis reveals that keystone neurons are a stable and intrinsic neuron subset of the model that is largely established during pretraining. The parameters associated with these neurons are tightly calibrated during the training process, and their precise values are critical for the capabilities of the model. Building on these insights, we propose a supervised fine-tuning approach that updates only keystone neurons, achieving task gains comparable to or even better than full-parameter fine-tuning while better preserving performance in other capability dimensions, despite modifying a much smaller number of parameters.

2605.24399 2026-05-29 cs.AI

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Xuan Wang, Zhongling Xu, Gopi Kannedhara, Joakim Nguyen, Jian Yu, Jinrui Fang, Abdurrahmaan Baghdadi, Tianlong Chen, Awais Naeem, Chandra Krishnan, Edward Castillo, Andrew H. Song, Ankita Shukla, Ying Ding, Nicholas Konz, Hairong Wang

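Affiliations University of Texas at Austin; University of North Carolina at Chapel Hill; Dell Children’s Medical Center; The University of Texas MD Anderson Cancer Center; University of Nevada, Reno

AI summary Proposes ConceptM$^3$oE, a framework that embeds concept formation in concept-guided multimodal mixture-of-experts pathways and uses residual pathways to maintain both performance and interpretability; it outperforms baselines on brain tumor classification and improves performance in limited-data settings.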

Abstract

Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes where morphology alone can be challenging to distinguish, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we utilize residual pathways within each expert to allow task-relevant signals to flow both through the concepts and directly to the final task prediction, so that high performance is maintained alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework delivers performance competitive with unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE increases macro-F1 from 56.41% to 66.70% at small training sizes compared to non-concept-informed baselines, while also showing faster training convergence consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

2605.24140 2026-05-29 cs.AI

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Yuyu Liu, Haotian Xu, Yanan He, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma

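Affiliations Department of Computer Science; Stony Brook University; Department of Applied Mathematics and Statistics; Yale University; Department of Data Science; New Jersey Institute of Technology; Department of Biomedical Informatics

AI summary In multi-step reasoning, single-pass generation is efficient but inaccurate, while tree search is computationally heavy. The paper distills reasoning progress into a hyperbolic geometric signal that guides step-by-step generation, using distance and angle in hyperbolic space to encode solution proximity and branch separation. It trains a lightweight head to project hidden states and fine-tunes an adapter, obtaining consistent gains on multiple benchmarks.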

Abstract

Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.
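
The geometric intuition can be checked numerically. The sketch below assumes the Poincaré ball model with curvature -1, which is one standard model of hyperbolic space; the abstract does not state which model or curvature HyperGuide uses, and the sample points are made up.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u_sq = sum(a * a for a in u)
    norm_v_sq = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff_sq / ((1.0 - norm_u_sq) * (1.0 - norm_v_sq)))

origin = (0.0, 0.0)
near = (0.1, 0.0)         # close to the origin
far = (0.9, 0.0)          # close to the boundary
far_rotated = (0.0, 0.9)  # same radius as `far`, different direction

print("origin -> near:", round(poincare_distance(origin, near), 3))
print("origin -> far: ", round(poincare_distance(origin, far), 3))
print("far -> rotated:", round(poincare_distance(far, far_rotated), 3))
```

Distance from the origin grows without bound as a point approaches the boundary (it equals 2·artanh of the Euclidean norm), and two points at the same radius but different angles end up farther from each other than either is from the origin. That combination is what lets a radial coordinate and an angular coordinate carry separate signals.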

2605.23993 2026-05-29 cs.CV cs.AI cs.LG

Nano World Models: A Minimalist Implementation of Future Video Prediction

Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz

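Affiliations DeepMind

AI summary Introduces Nano World Models, a minimalist codebase for future video prediction built on diffusion forcing. It supports controlled studies of world-model design choices and experimentally analyzes how factors such as prediction parameterization and architecture scale affect video prediction quality.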

Comments Project page: https://simchowitzlabpublic.github.io/nano-world-model/

Abstract

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.
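
Diffusion forcing trains a denoiser on sequences in which each frame carries its own, independently chosen noise level. A minimal sketch of that noising step, assuming toy two-value "frames", a simplified cosine schedule, and hypothetical names (`signal_fraction`, `noise_frames`); it is not taken from the Nano World Models codebase:

```python
import math
import random

NUM_LEVELS = 10

def signal_fraction(level):
    """Simplified cosine schedule: 1.0 at level 0 (clean), about 0.0 at NUM_LEVELS."""
    return math.cos(0.5 * math.pi * level / NUM_LEVELS) ** 2

def noise_frames(frames, levels, rng):
    """Noise each frame according to its own level."""
    noisy = []
    for frame, level in zip(frames, levels):
        keep = signal_fraction(level)
        noisy.append([math.sqrt(keep) * v + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
                      for v in frame])
    return noisy

rng = random.Random(0)
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # made-up latent frames

# Training-style input: an independent noise level for every frame.
levels = [rng.randint(0, NUM_LEVELS) for _ in frames]
noisy = noise_frames(frames, levels, rng)

# Rollout-style input: clean context frames, fully noised future frame.
context_then_future = noise_frames(frames, [0, 0, NUM_LEVELS], rng)
print(levels)
print(context_then_future)
```

Because the model sees arbitrary mixtures of clean and noisy frames during training, the rollout pattern in the last lines (clean past, noisy future) is just one more such mixture, which is what makes autoregressive prediction a special case.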

2605.23657 2026-05-29 cs.CL

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

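Affiliations Singapore Management University; Institute of Trustworthy Embodied AI, Fudan University; Joy Future Academy, JD

AI summary Proposes OpenSkillEval, an automatic evaluation framework that dynamically constructs realistic task instances and collects community skills to systematically evaluate skill-augmented agent systems and the skills themselves. Key findings include that skill availability does not guarantee effective use and that the benefit of skill augmentation depends on the model and agent framework.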

Abstract

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.

2605.23531 2026-05-29 cs.CV

PixIE: Prompted Pixel-Space Low-Light Image Enhancement

Ruirui Lin, Guoxi Huang, David Bull, Nantheera Anantrasirichai

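Affiliations Visual Information Laboratory, University of Bristol, United Kingdom

AI summary Proposes PixIE, a framework that performs pixel-space low-light image enhancement with semantic prompts from a vision foundation model, using cross-scale denoising and DINO-Prompted Pixel Blocks; it improves PSNR and LPIPS on multiple benchmarks.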

Abstract

Low-light images suffer from severe noise, contrast loss, and semantic ambiguity, making enhancement a joint problem of denoising and detail recovery. We propose PixIE, a feed-forward pixel-space LLIE framework semantically prompted by a vision foundation model. PixIE first performs cross-scale denoising to suppress noise and preserve structure, then refines details using DINO-Prompted Pixel Blocks (DPPBs), which inject intermediate DINOv3 features through patch-conditioned, spatially continuous per-pixel modulation. To make pixel-space attention efficient across scales, we introduce Spatial-Channel Compaction (SCC), which jointly reduces the spatial token grid and channel dimension. We further propose Multi-Receptive-Field Pixel Embedding (MRPE) to provide neighborhood-aware pixel representations before semantic prompting, improving robustness to signal-dependent noise beyond point-wise embeddings. Experiments on LLIE benchmarks show that PixIE improves average PSNR by 1.9-15.0% over recent state-of-the-art methods and reduces LPIPS by 8.5-44.4%. Qualitative comparisons further show sharper details and more stable textures, improving both reconstruction fidelity and perceptual quality.

2605.23345 2026-05-29 cs.CV

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

Zizhao Tong, Yeying Jin, Hongfeng Lai, Zeqing Wang, Zhaohu Xing, Kexu Cheng, Haoran Xu, Zhao Pu, Shangwen Zhu, Ruili Feng, Jian Zhao, Yan Zhang, Hao Tang, Ling Shao

Affiliations * UCAS-Terminus AI Lab, University of Chinese Academy of Sciences; Tencent; National University of Singapore; The Hong Kong University of Science and Technology (Guangzhou); University of Waterloo; Shanghai Jiaotong University; Zhongguancun Institute of Artificial Intelligence; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

AI summary Proposes SCOPE, which inserts a conditioning module into each transformer block and reshapes features into per-pixel temporal sequences to separate in-scope action effects in FPS games from global generation, and introduces the cross-game dataset CrossFPS, enabling zero-shot transfer.

Comments Project page: https://z2tong.github.io/SCOPE/. Code is available at https://github.com/z2tong/SCOPE

Abstract

Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.

2605.22924 2026-05-29 cs.LG cs.IR

Building a privacy-preserving Federated Recommender system for mobile devices

Aasheesh Singh

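Affiliations * Département d’informatique et de recherche opérationnelle

AI summary Proposes a two-stage federated recommender pipeline that separates non-sensitive preference data from sensitive on-device context data, delivering personalized recommendations on mobile devices while preserving privacy.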

Comments Masters thesis, Université de Montréal, Department of Computer Science and Operations Research, 2024

Abstract

Serving personalized content on mobile devices has traditionally required pooling sensitive user data on centralized servers, a practice increasingly at odds with modern privacy expectations and geographical regulations. We present a two-stage federated recommendation system pipeline for mobile devices, built around a principled separation between non-sensitive user preference data and sensitive mobile context data that never leaves the device. The first stage runs a collaborative filtering model on non-sensitive app-context data in the cloud to generate a shortlist of relevant items. The second stage re-ranks these candidates on-device using sensitive mobile signals, with only model updates/gradients ever leaving the device. We validate the approach on MovieLens, UCI Human Activity Recognition, and a proprietary pilot dataset, and deliver a production-ready implementation as a Kotlin Multiplatform library deployable on Android and iOS.

2605.22100 2026-05-29 cs.AI

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

Bangbang Zhou, Hangdi Xing, Yifan Chen, Jianjun Xu, Qi Zheng, Feiyu Gao, Zhibo Yang, Shuai Bai, Ming Yan, Jieping Ye, Hongtao Xie

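Affiliations * University of Science and Technology of China; Tongyi Lab, Alibaba Group

AI summary To address the inadequacy of existing benchmarks for real-world scenarios, proposes the MPDocBench-Parse benchmark, comprising 433 multi-page documents (3,246 pages) across 15 document types, with a comprehensive evaluation protocol for content fidelity and logical structure; experiments show that existing models have clear limitations in semantic continuity, visual content parsing, and hierarchical structure recovery.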

Abstract

Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for realistic scenarios. Existing benchmarks either focus on specific tasks or assess only single-page, text-centric settings, making them insufficient for practical multi-page parsing. Moreover, they lack fine-grained evaluation of semantic continuity, hierarchical structure recovery, and visual content preservation. To address these gaps, we propose MPDocBench-Parse, a benchmark for multi-page document parsing in real-world applications. It contains 433 manually annotated documents with 3,246 pages, covering 15 document types in English and Chinese, with diverse layout styles, and supports document-level end-to-end evaluation. We further design a comprehensive protocol for content fidelity and logical structure, covering text, table, and formula recognition, truncated text and table merging, figure extraction, reading order, and heading hierarchy recovery. Experiments show that, while existing models perform well on basic text extraction, they still suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery. MPDocBench-Parse provides a unified foundation for advancing document parsing toward more realistic scenarios.

2605.22082 2026-05-29 cs.RO cs.LG

CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

Wentian Wang, Chutong Wen, Hongxu Ma, Wuhao Wang, Zhexiong Xue, Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun, Jianqiao Zhu

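Affiliations * Synthoid AI

AI summary Proposes the CoRMA framework, which achieves meta-adaptation for force-dominant assembly tasks through a semantic contact context and contrastive learning, without demonstrations or gradient updates, and retains higher real-robot success than FORGE baselines.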

Abstract

We present CoRMA (Contrastive Robotic Motor Adaptation), a context-based meta-adaptation framework that modifies RMA for force-dominant assembly. CoRMA replaces raw simulator-parameter adaptation with a compact 6D simulator-only semantic contact context describing contact onset, lateral engagement, guided transition, contact direction, and jamming. A deployable causal Transformer adapter infers this context online from force, proprioceptive, and action histories using semantic regression and a force-regime contrastive objective. At deployment, oracle context is removed and replaced by the inferred context, enabling within-episode adaptation without demonstrations, privileged inputs, or gradient updates. We evaluate CoRMA on PegInsert, GearMesh, and NutThread in Isaac Lab / Isaac Sim 5.0 and on a real Marvin arm. Compared with FORGE baselines that achieve high simulation success but degrade substantially on hardware, CoRMA retains higher verified real success under controlled target-pose noise. These results support semantic contact inference as a reusable adaptation interface within a related assembly task family, while broader unseen-task generalization and Real2Sim calibration remain future work.

2605.22080 2026-05-29 cs.CV cs.AI

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

Yue Xun, Junyu Liu, Qian Niu, Xinyi Wang, Zheng Yuan, Zirui Li, Zequn Zhang, Bowen Zhao, Shujun Wang, Irene Li, Kan Hatakeyama-Sato, Yusuke Iwasawa, Yutaka Matsuo

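Affiliations * The Hong Kong Polytechnic University; Kyoto University; The University of Tokyo; Hohai University; University of Science and Technology of China; University of Toronto

AI summary Presents JMed48k, a multi-profession Japanese medical licensing benchmark with 48,862 exam questions and 20,142 images; by evaluating 21 models and introducing a paired image-removal audit, it finds that proprietary and open-source models benefit substantially from images, whereas medical-specific models make limited use of visual evidence.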

Abstract

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open-source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
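
The paired image-removal audit can be illustrated with a minimal sketch. The abstract does not name the four answer-transition states or define the net effect, so the state labels below (`kept_correct`, `image_helped`, `image_hurt`, `kept_wrong`) and the definition of the net effect as with-image accuracy minus without-image accuracy, in percentage points, are assumptions for illustration and not the benchmark's own code.

```python
from collections import Counter

# Hypothetical state labels; the paper's own names for the four
# answer-transition states are not given in the abstract.
STATES = {
    (True, True): "kept_correct",    # correct with and without the image
    (True, False): "image_helped",   # correct only when the image is shown
    (False, True): "image_hurt",     # correct only after the image is removed
    (False, False): "kept_wrong",    # wrong in both conditions
}


def audit(pairs):
    """Count transition states over (correct_with_image, correct_without_image) pairs."""
    counts = Counter(STATES[(bool(w), bool(wo))] for w, wo in pairs)
    return {name: counts.get(name, 0) for name in STATES.values()}


def net_image_effect(pairs):
    """Accuracy with images minus accuracy without, in percentage points (assumed definition)."""
    n = len(pairs)
    with_acc = sum(1 for w, _ in pairs if w) / n
    without_acc = sum(1 for _, wo in pairs if wo) / n
    return 100.0 * (with_acc - without_acc)


if __name__ == "__main__":
    # Toy data: 10 image questions answered by one model under both conditions.
    toy = [(True, True)] * 4 + [(True, False)] * 3 + [(False, True)] + [(False, False)] * 2
    print(audit(toy))
    print(net_image_effect(toy))
```

Under this assumed definition the net effect equals the share of `image_helped` questions minus the share of `image_hurt` questions, which is why a model whose correct answers mostly persist after image removal shows a small net effect.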

2605.22069 2026-05-29 cs.CV cs.LG

TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

Hyeseong Kim, Geonhui Son, Deukhee Lee, Dosik Hwang

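Affiliations * Yonsei University; Korea Institute of Science and Technology

AI summary Proposes the TWINGS framework, which uses Thin Plate Splines (TPS) to align backprojected points with triangulated control points, providing a geometrically accurate initialization for 3D Gaussian Splatting and thereby improving detail preservation and color fidelity in sparse-view scene reconstruction.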

Comments Accepted at CVPR 2026, Project page: https://sandokim.github.io/twings/

Abstract

Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.

2605.21739 2026-05-29 cs.AI

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

Kate M. Lubrano, Faisal Sayed, Ankita Rathod, Akshansh, Craver Corbyn Thomas-Smith, Mark E. Whiting, Karina Nguyen

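Affiliations * Pareto Thoughtful; University of Pennsylvania

AI summary Proposes the AttuneBench benchmark, built on 200 genuine multi-turn human-model conversations, to evaluate LLM emotional intelligence across emotion recognition, behavioral classification, preference prediction, and response quality; finds that these capabilities are largely independent and that preference alignment and response quality are more model-discriminating.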

Comments v2: Updated def_18 and def_20 supplemental figures to cover all 11 evaluated models (previously 9). Removed redundant supplemental figures. Corrected select captions (color descriptions, chance baselines, figure-content mismatches). No changes to experimental results, numerical claims, or conclusions

Abstract

Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to human communication, and increasingly important to assess as LLMs assume conversational roles in everyday life. Existing EI benchmarks rely on synthetic prompts, single-turn cases, or third-party annotation. These approaches do not directly measure how models infer and respond to a participant's emotional state over the course of a real conversation. We introduce AttuneBench, a benchmark grounded in 200 genuine multi-turn human-model conversations in which participants conversed with anonymized LLMs and provided turn-by-turn annotations of their emotional state, the model's behavior, and their preferred responses. Across 11 evaluated models, we find that model rankings on emotion recognition, behavioral classification, preference prediction, and judged response quality are largely independent, indicating that emotionally intelligent behavior decomposes into separable capabilities. Preference alignment and response-quality judgments are substantially more model-discriminating than emotion-label accuracy. These results indicate that emotionally intelligent behavior requires predicting what kind of response a specific user wants in context, a distinction that aggregate scoring can obscure and that single-turn or synthetic formats cannot directly capture across turns. AttuneBench provides a framework for assessing each of these capabilities and for diagnosing model-specific strengths and failure modes in emotionally salient conversation.

2605.17286 2026-05-29 cs.CV

HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone

Guanyiman Fu, Jingtao Li, Zihang Cheng, Zhuanfeng Li, Diqi Chen, Yan Xu, Xiangyu Liu, Fengchao Xiong, Jianfeng Lu, Chengrong Chen, Jun Zhou

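Affiliations * Griffith University, Australia; Wuhan University, China; Nanjing University of Science and Technology, China; Huaiyin Normal University, China; Massey University, New Zealand

AI summary To address differing sensor configurations, scarce and inconsistent labels, and limited dataset scale in ground-based hyperspectral imaging, proposes HyperVision, the first ground-based hyperspectral pre-trained backbone, which uses channel-adaptive dynamic embedding, multi-source pseudo-labeling, and cross-modal knowledge distillation to achieve state-of-the-art performance on three downstream tasks.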

Abstract

While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, we develop an unsupervised representation learning framework. Specifically, to address label scarcity and inconsistency, a multi-source pseudo-labeling method is introduced to fuse spatial structures from SAM2 and fine-grained spectral material information from HyperFree. Furthermore, to enrich scene diversity and compensate for limited dataset scale, a cross-modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre-trained RGB vision model to our backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head-only adaptation without adjusting backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_{\mathrm{M}}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available at https://github.com/lronkitty/HyperVision.

2605.15852 2026-05-29 cs.CV

GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

Leyang Chen, Junyi Wu, Zhiteng Li, Yulun Zhang

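Affiliations * Shanghai Jiao Tong University

AI summary Proposes the GHOST framework, which uses the model's own 3D geometry outputs to evict redundant tokens online, cutting the KV cache by nearly half and delivering a 1.75x inference speedup while preserving reconstruction quality.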

Abstract

Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.

2605.15422 2026-05-29 cs.LG

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis

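Affiliations * Amazon Web Services; Google; University of Minnesota

AI summary To address redundant computation on shared prompts in RL training, proposes the DualKV kernel, which eliminates prompt replication through fused CUDA forward/backward kernels and a reworked veRL data pipeline, achieving 1.63-3.82x policy-update speedups.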

Abstract

Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes -- duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N{\geq}16$, $P{\geq}8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once -- a property not yet exploited at the kernel level for training. We propose DualKV, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and backward kernels that iterate over two disjoint KV regions -- shared context and per-sequence response -- in a single kernel launch, and (2) a data-pipeline redesign in veRL that repacks $N(P{+}R)$ tokens into $P{+}NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $\rho = N(P{+}R)/(P{+}NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with 8$\times$H100 GPUs ($N{=}32$, 8K-context), DualKV achieves $1.63$--$2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from $36\%$ to $76\%$. Similar gains hold for DAPO ($2.47\times$ speedup, $77\%$ MFU). At 30B MoE scale on 16$\times$H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
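
The token-reduction factor from the abstract is easy to check numerically. The sketch below evaluates N(P+R)/(P+NR) for a few settings; N=32 and P=8192 are loosely based on the 8K-context configuration mentioned above, while the response lengths R are assumed values chosen for illustration, since the abstract does not state R.

```python
def token_counts(n, p, r):
    """Return (replicated, repacked) token counts for one prompt group.

    replicated: N copies of prompt plus response, i.e. N * (P + R)
    repacked:   one shared prompt plus N responses, i.e. P + N * R
    """
    return n * (p + r), p + n * r


def reduction_factor(n, p, r):
    """rho = N(P + R) / (P + N R), the token reduction from sharing the prompt."""
    replicated, repacked = token_counts(n, p, r)
    return replicated / repacked


if __name__ == "__main__":
    n, p = 32, 8192
    for r in (256, 1024, 4096):  # assumed response lengths, not from the paper
        print(r, token_counts(n, p, r), round(reduction_factor(n, p, r), 2))
```

The factor approaches N when responses are short relative to the prompt and approaches 1 when responses dominate. It counts tokens only, so it should be read as an upper bound on what repacking can save rather than as a predicted wall-clock speedup.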

2605.15219 2026-05-29 cs.AI cs.IT math.IT

NOVA: Fundamental Limits of Knowledge Discovery Through AI

Salman Avestimehr, Ken Duffy, Muriel Médard

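Affiliations * University of Southern California; Northeastern University; Massachusetts Institute of Technology

AI summary Proposes the NOVA framework, which models the "generate, verify, accumulate, retrain" loop as an adaptive sampling process over a knowledge space, identifies the conditions under which accumulated knowledge covers a finite domain along with the associated failure modes, and proves a scaling law for discovery cost tied to a Zipf law.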

Abstract

Can AI systems discover genuinely new knowledge through iterative self-improvement, and if so, at what cost? We introduce the NOVA framework, which models the common "generate, verify, accumulate, retrain" loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good--Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $\alpha>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=\Theta(c_{\mathrm{gen}}D^{\alpha})$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.
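
The shape of the scaling law can be sanity-checked numerically under much simpler assumptions than the paper's: independent draws from a fixed, truncated Zipf distribution, perfect verification, and no retraining. This is not the NOVA model or its proof; the function names, the truncation at 100,000 items, and the two sample sizes are all choices made for this sketch. If n draws yield about D distinct items with n growing like D to the power alpha, the fitted exponent should land close to alpha.

```python
import math


def zipf_probs(alpha, support):
    """Normalized Zipf probabilities p_k proportional to k^(-alpha), k = 1..support."""
    weights = [k ** -alpha for k in range(1, support + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct(probs, n_draws):
    """Expected number of distinct items seen after n_draws i.i.d. samples.

    Item k is seen at least once with probability 1 - (1 - p_k)^n.
    """
    return sum(1.0 - (1.0 - p) ** n_draws for p in probs)


def fitted_exponent(alpha, n_small=100, n_large=10_000, support=100_000):
    """Estimate beta in n ~ D^beta from two sample sizes."""
    probs = zipf_probs(alpha, support)
    d_small = expected_distinct(probs, n_small)
    d_large = expected_distinct(probs, n_large)
    return math.log(n_large / n_small) / math.log(d_large / d_small)


if __name__ == "__main__":
    for alpha in (1.5, 2.0):
        print(alpha, round(fitted_exponent(alpha), 2))
```

With a unit cost per draw, the number of draws plays the role of cumulative generation cost, so the steeper the Zipf tail, the faster the cost per new discovery grows.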

2605.14270 2026-05-29 cs.CV

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

Kanghyun Baek, Jaihyun Lew, Chaehun Shin, Jungbeom Lee, Sungroh Yoon

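Affiliations * Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, South Korea; Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea; Department of Computer Science & Engineering, Korea University, Seoul, South Korea; ISRC, Seoul National University, Seoul, South Korea

AI summary Using linear probing, finds an "omission signal" in text embeddings that represents the absence of target concepts, and proposes Omission Signal Intervention (OSI), which amplifies this signal to actively catalyze generation of the missing concepts, significantly alleviating concept omission on FLUX.1-Dev and SD3.5-Medium.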

Comments Accepted to ICML 2026

Abstract

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic "omission signal" representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.

2605.13841 2026-05-29 cs.SD cs.AI cs.CL cs.LG

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

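Affiliations * ServiceNow

AI summary Proposes the EVA-Bench framework, which comprehensively evaluates the accuracy and experience quality of voice agents through simulated bot-to-bot audio conversations and two composite metrics (EVA-A and EVA-X).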

Comments Work in progress

Abstract

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k--pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $\Delta$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
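
The distinction between peak and reliable capability can be made concrete with a small sketch. The abstract does not say how EVA-Bench computes these quantities, so the code below assumes the simplest empirical definitions over exactly k trials per scenario: pass@k counts a scenario if at least one trial succeeds, and pass^k counts it only if every trial succeeds. The function names and toy data are hypothetical.

```python
def pass_at_k(trials):
    """Fraction of scenarios where at least one of the k trials succeeded (peak capability)."""
    return sum(1 for t in trials if any(t)) / len(trials)


def pass_hat_k(trials):
    """Fraction of scenarios where all k trials succeeded (reliable capability)."""
    return sum(1 for t in trials if all(t)) / len(trials)


def pass_at_1(trials):
    """Mean single-trial success rate, averaged over scenarios."""
    return sum(sum(t) / len(t) for t in trials) / len(trials)


if __name__ == "__main__":
    # Toy results: 4 scenarios, k = 3 trials each (True = task passed).
    results = [
        [True, True, True],
        [True, False, True],
        [False, False, True],
        [False, False, False],
    ]
    print(pass_at_1(results), pass_at_k(results), pass_hat_k(results))
```

Any scenario that succeeds on some trials but not all widens the gap between the two numbers, which is the kind of divergence the abstract reports.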

2605.07707 2026-05-29 cs.AI

Hierarchical Task Network Planning with LLM-Generated Heuristics

Felipe Meneguzzi, Alexandre Buchweitz, Augusto B. Corrêa, Victor Scherer Putrich, André Grahl Pereira

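Affiliations * University of Aberdeen, UK; PUCRS, Brazil; University of Oxford, UK; Saarland University, Germany; Universidade Federal do Rio Grande do Sul, Brazil

AI summary Studies the use of large language models to generate search heuristics for hierarchical task network planning; evaluated with the Pytrich planner on six benchmark domains, the LLM-generated heuristics nearly match the coverage of the best available HTN planner and substantially reduce search effort on 83% of shared problems.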

Comments 9 pages, 3 figures; submitted to NeurIPS 2026

English abstract

HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes higher-level tasks using a method library until only executable actions remain. On one hand, this allows one to introduce domain knowledge that can speed up the search for a solution through the method library. On the other hand, it creates challenges that go beyond those of classical state-space search. While recent research produced a number of heuristics and novel algorithms that speed up HTN planning, these heuristics are not yet as informative as those available in classical planning algorithms. We investigate whether large language models (LLMs) can generate effective search heuristics for HTN planning, extending the methodology of Corrêa, Pereira, and Seipp (2025) from classical to hierarchical planning. Using the Pytrich planner on six standard total-order HTN benchmark domains, we evaluate heuristics generated by nine LLMs under domain-specific prompting and compare them against the TDG and LMCount domain-independent baselines and the PANDA planner. Our results show that LLM-generated heuristics nearly match the coverage of the best available HTN planner, while substantially reducing search effort on 83% of shared problems.
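
The decomposition process described in the abstract can be sketched as a small progression search over a total-order task list. Everything here is a hypothetical toy: the domain (`ACTIONS`, `METHODS`), the `remaining_tasks` heuristic, and the `plan` function are illustrative assumptions and do not reflect Pytrich's API or the heuristics generated in the paper.

```python
import heapq
from itertools import count

# Hypothetical toy domain: action -> (preconditions, add effects, delete effects).
ACTIONS = {
    "walk": (frozenset({"near"}), frozenset({"at_dest"}), frozenset()),
    "call_taxi": (frozenset(), frozenset({"taxi_here"}), frozenset()),
    "ride_taxi": (frozenset({"taxi_here"}), frozenset({"at_dest"}), frozenset({"taxi_here"})),
    "pay": (frozenset({"at_dest", "has_cash"}), frozenset(), frozenset({"has_cash"})),
}

# Method library: compound task -> alternative ordered decompositions.
METHODS = {
    "travel": [["walk"], ["call_taxi", "ride_taxi", "pay"]],
}


def remaining_tasks(tasks):
    """Stand-in heuristic: number of tasks still to process."""
    return len(tasks)


def plan(initial_state, tasks, heuristic=remaining_tasks):
    """Greedy best-first search; returns a list of actions or None."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(heuristic(tasks), next(tie), frozenset(initial_state), tuple(tasks), ())]
    seen = set()
    while frontier:
        _, _, state, todo, done = heapq.heappop(frontier)
        if not todo:
            return list(done)  # only executable actions remain in the plan
        if (state, todo) in seen:
            continue
        seen.add((state, todo))
        head, rest = todo[0], todo[1:]
        if head in ACTIONS:
            pre, add, delete = ACTIONS[head]
            # A primitive task is applied only if its preconditions hold.
            successors = [((state - delete) | add, rest, done + (head,))] if pre <= state else []
        else:
            # A compound task is replaced by each of its method decompositions.
            successors = [(state, tuple(sub) + rest, done) for sub in METHODS.get(head, [])]
        for s, t, d in successors:
            heapq.heappush(frontier, (heuristic(t), next(tie), s, t, d))
    return None


if __name__ == "__main__":
    print(plan({"has_cash"}, ["travel"]))
```

The `heuristic` argument marks the slot where a domain-specific heuristic would be plugged in. The default used here is deliberately naive.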

2605.06322 2026-05-29 cs.LG

SMolLM: Small Language Models Learn Small Molecular Grammar

Akhil Jindal, Harang Ju

Affiliations: Carey Business School, Johns Hopkins University

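AI summary: Presents SMolLM, a small 53K-parameter weight-shared transformer that learns SMILES grammar in a fixed hierarchy and generates molecules with 95% validity on ZINC-250K, outperforming a GPT model with 10 times more parameters.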

Comments 19 pages, 5 figures, 11 tables

English abstract

Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed hierarchy: brackets first, rings second, and valence last, as shown by error classification and linear probing, with ablation isolating the bracket-matching head. Together, these results yield a compact, mechanistically interpretable molecular generator and a testbed for studying iterative computation in formal-language domains.
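
To make the first two constraint classes concrete, here is a minimal sketch of a purely syntactic check for bracket balance and ring-closure pairing in a SMILES string. It is an illustrative assumption about what such an error classifier might look at, not the paper's method, and it deliberately omits valence, which requires a chemistry toolkit. The function name `check_smiles_syntax` is hypothetical.

```python
from typing import Set

DIGITS = "0123456789"


def check_smiles_syntax(smiles: str) -> Set[str]:
    """Return the set of violated constraint classes: 'bracket' and/or 'ring'."""
    errors: Set[str] = set()
    depth = 0            # nesting depth of branch parentheses
    in_atom = False      # True while inside a [...] atom specification
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_atom:
            # Digits inside [...] are isotopes, H counts, or charges, not ring labels.
            if ch == "]":
                in_atom = False
            elif ch == "[":
                errors.add("bracket")
        elif ch == "[":
            in_atom = True
        elif ch == "]":
            errors.add("bracket")
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                errors.add("bracket")
            else:
                depth -= 1
        elif ch == "%":
            # Two-digit ring-closure label such as %12.
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and all(c in DIGITS for c in label):
                open_rings ^= {label}
                i += 2
            else:
                errors.add("ring")
        elif ch in DIGITS:
            open_rings ^= {ch}  # first occurrence opens, second closes
        i += 1
    if depth != 0 or in_atom:
        errors.add("bracket")
    if open_rings:
        errors.add("ring")
    return errors


if __name__ == "__main__":
    for s in ["c1ccccc1", "CC(=O)O", "CC(=O", "C1CCC"]:
        print(s, sorted(check_smiles_syntax(s)))
```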

2605.04916 2026-05-29 cs.AI cs.LG cs.SC

A Foundation Model for Zero-Shot Logical Rule Induction

Yin Jun Phua

Affiliations: Institute of Science Tokyo

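AI summary: Proposes the Neural Rule Inducer (NRI), a pretrained model based on statistical encoding and parallel slot decoding that performs zero-shot logical rule induction and generalizes to new predicates without retraining.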

Comments Camera-ready version accepted at IJCAI 2026, with full appendices

English abstract

Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero-shot rule induction. Rather than encoding literal identities, NRI represents literals using domain-agnostic statistical properties such as class-conditional rates, entropy, and co-occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot-based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T-norm relaxation makes rule execution differentiable, allowing end-to-end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero-shot transfer to real-world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at https://github.com/phuayj/neural-rule-inducer.
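
As a minimal sketch of what a product T-norm relaxation looks like, the snippet below evaluates a rule in disjunctive normal form over truth values in [0, 1]: conjunction becomes a product, and disjunction is its dual, 1 - prod(1 - x). The rule encoding and the function names are illustrative assumptions, not NRI's actual implementation.

```python
from math import prod


def soft_and(values):
    """Product T-norm: conjunction of truth values in [0, 1]."""
    return prod(values)


def soft_or(values):
    """Dual T-conorm (probabilistic sum): disjunction of truth values in [0, 1]."""
    return 1.0 - prod(1.0 - v for v in values)


def eval_dnf(clauses, truth):
    """Evaluate a DNF rule.

    clauses: list of clauses, each a list of (literal_name, negated) pairs.
    truth:   dict mapping literal_name to a truth value in [0, 1].
    """
    clause_values = [
        soft_and((1.0 - truth[name]) if negated else truth[name] for name, negated in clause)
        for clause in clauses
    ]
    return soft_or(clause_values)


if __name__ == "__main__":
    # Hypothetical rule: (a AND NOT b) OR c
    rule = [[("a", False), ("b", True)], [("c", False)]]
    print(eval_dnf(rule, {"a": 0.9, "b": 0.2, "c": 0.5}))
    # Reordering the clauses leaves the value unchanged up to floating-point rounding.
    print(eval_dnf(list(reversed(rule)), {"a": 0.9, "b": 0.2, "c": 0.5}))
```

On crisp inputs (0 or 1) these operators reduce to ordinary Boolean logic. Because the disjunction is a product over clauses, its value does not depend on clause order, which is the permutation invariance the abstract attributes to parallel decoding.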

2605.02116 2026-05-29 cs.LG

Statistical Consistency and Generalization of Contrastive Representation Learning

Yuanfan Li, Xiyuan Wei, Tianbao Yang, Yiming Ying

Affiliations: University of Sydney; Texas A&M University

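AI summary: Develops a unified statistical learning theory showing that the contrastive loss is statistically consistent with optimal ranking, and derives generalization bounds that improve as the number of negative samples grows, explaining the empirical benefits of large negative sets.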

Comments Accepted by ICML 2026

English abstract

Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is statistically consistent with optimal ranking. We further establish a calibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order $O(1/m + 1/\sqrt{n})$ and $O(1/\sqrt{m} + 1/\sqrt{n})$, respectively, where $m$ denotes the number of negative samples and $n$ the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between $m$ and $n$. Extensive experiments on large-scale vision-language models corroborate our theoretical predictions.
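
The abstract does not state the exact loss or retrieval criterion. As a minimal sketch, assuming an InfoNCE-style loss for one anchor with $m$ negatives (with the positive included in the normalizer) and a per-anchor AUC-type score (the fraction of negatives ranked below the positive), the two quantities can be computed as follows. The function names and the temperature default are illustrative assumptions.

```python
import math


def contrastive_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style loss for one anchor: -log softmax of the positive's logit."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_normalizer = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_normalizer - logits[0]


def pairwise_auc(pos_score, neg_scores):
    """Fraction of negatives scored below the positive (ties count as one half)."""
    if not neg_scores:
        raise ValueError("need at least one negative sample")
    wins = sum(1.0 if pos_score > s else 0.5 if pos_score == s else 0.0 for s in neg_scores)
    return wins / len(neg_scores)


if __name__ == "__main__":
    pos, negs = 0.9, [0.1, 0.5, 0.95]  # hypothetical similarity scores
    print(contrastive_loss(pos, negs), pairwise_auc(pos, negs))
```

In this sketch, `len(neg_scores)` plays the role of $m$ for a single anchor, and averaging either quantity over anchors would correspond to the sample size $n$.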