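arXivDaily daily academic digest: synced with the full arXiv data feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.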

2605.26408 2026-05-29 cs.LG stat.ME stat.ML

Function-Valued Causal Influence in Nonlinear Time Series

Valentina V. Kuskova, Dmitry Zaytsev, Michael Coppedge

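Affiliations: Lucy Family Institute for Data & Society, University of Notre Dame, Notre Dame, Indiana, USA; Department of Political Science, University of Notre Dame, Notre Dame, Indiana, USA

AI summary: Scalar scores commonly used in causal discovery for nonlinear time series mask state-dependent functional effects. To address this, the paper proposes a framework based on Individual Conditional Expectation that estimates causal response functions directly from Neural Additive Vector Autoregression models, revealing distinct functional behaviors that scalar scores cannot tell apart.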

Comments 26 pages, 6 tables, 8 figures

Abstract

Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the true object learned by nonlinear autoregressive models: a state-dependent function whose effect varies across regimes, magnitudes, and contexts. We formalize function-valued causal influence for additive, contribution-decomposable architectures and show that scalar causal scores constitute a severe information bottleneck, conflating between-state variation with within-state residual noise. Using Neural Additive Vector Autoregression as a representative architecture, we introduce a practical framework based on Individual Conditional Expectation for estimating causal response functions directly from trained models. Through controlled synthetic experiments, we demonstrate that edges with indistinguishable scalar scores can exhibit qualitatively different functional behaviors, including monotonic, thresholded, saturating, and sign-changing effects. An applied case study on democratic development further shows that function-valued analysis reveals regime-specific and asymmetric causal structure systematically missed by score-centric approaches.
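A minimal sketch of the abstract's central claim, under toy assumptions: in an additive model, the Individual Conditional Expectation curve for one edge is that edge's contribution function evaluated over a grid of source values. The two contribution functions, the grid, and the root-mean-square edge score below are illustrative assumptions, not the paper's estimator or data.

```python
import math

def ice_curve(contribution, grid):
    """Evaluate one edge's learned contribution function over a grid of source values."""
    return [contribution(x) for x in grid]

def scalar_edge_score(curve):
    """Collapse a response curve into one number (root-mean-square; an assumed score)."""
    return math.sqrt(sum(v * v for v in curve) / len(curve))

# Hypothetical grid of lagged source-variable values, symmetric around zero.
grid = [i / 10 for i in range(-20, 21)]

# Two hypothetical contribution functions standing in for trained additive components.
monotonic_curve = ice_curve(lambda x: x, grid)      # effect rises steadily
v_shaped_curve = ice_curve(lambda x: abs(x), grid)  # slope flips sign at zero

# Same scalar score for both edges, different response at the left end of the grid.
print(scalar_edge_score(monotonic_curve), scalar_edge_score(v_shaped_curve))
print(monotonic_curve[0], v_shaped_curve[0])
```

Both edges receive the same scalar score, yet the curves differ wherever the source value is negative, which is the kind of distinction a single edge score cannot express.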

2605.26194 2026-05-29 cs.LG

On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series

Sharmita Dey, Diego Paez-Granados

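Affiliations: ETH Zurich; Swiss Paraplegic Research

AI summary: Using PathoFM, an encoder-centric Transformer pretrained with three complementary objectives (Local Completion, Temporal Continuity, and Unsupervised In-Context Dynamics), the paper studies how the inductive biases of pretraining objectives affect transfer across task types and subjects, and finds that dynamics-centric mixtures yield the most balanced transferable representations.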

Abstract

Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting). These constraints make foundation-model pretraining appealing but raise an important question: which inductive biases should the pretraining objective impose so that representations transfer across task types and subjects? We study this question in pathological gait analysis for spinal cord injury (SCI) via PathoFM, an encoder-centric transformer pretrained on multivariate gait windows with three complementary objectives: Local Completion (reconstruct contiguous masked spans to enforce local structure), Temporal Continuity (predict a masked mid-horizon continuation from an observed prefix to enforce smoothness and causal consistency), and Unsupervised In-Context Dynamics (support-query reconstruction conditioned on subject exemplar windows via attention). Empirically comparing objective families (grouping/contrastive, dynamics-based, and generative reconstruction), we find that dynamics-centric mixtures produce the most balanced transfer: grouping objectives favor discriminative margins but can degrade magnitude fidelity needed for continuous targets, whereas reconstruction-only objectives preserve waveform structure but may underperform on classification. Overall, combining local reconstruction with temporal continuity, and adding in-context conditioning when exemplar access is realistic, yields robust subject-generalizing representations.

2605.26193 2026-05-29 cs.LG cs.AI

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

Qideng Tang, Dai Chaofan, Wubin Ma, Yahui Wu, Haohao Zhou, Tao Zhang, Huan Li, Dalin Zhang

Affiliations: National Key Laboratory of Information Systems Engineering, National University of Defense Technology; College of Systems Engineering, National University of Defense Technology; Zhejiang University; Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University

AI summary: Proposes CoAD, a framework in which a classification module generates probability-informed soft masks that guide a reconstruction module, combining the complementary strengths of the classification and reconstruction paradigms to detect subtle and complex anomalies; it significantly outperforms existing methods on benchmark datasets.

Comments 15 pages, submitted to KDD 2026

Abstract

Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for solving the above problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.

2605.26029 2026-05-29 cs.AI cs.CL

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng

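Affiliations: Tsinghua University; University of Illinois Urbana-Champaign; Carnegie Mellon University; University of Chicago; Adobe

AI summary: Introduces CausaLab, an environment that uses synthetic laboratory tasks to evaluate both the prediction accuracy of LLM agents in causal discovery and their ability to recover causal mechanisms, and finds a significant gap between the two.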

Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
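The gap between task accuracy and structural recovery is easier to read with the edge metric written out. The sketch below computes F1 over directed edge sets; the node names and graphs are hypothetical, and the benchmark's exact all-edge $F_1$ definition may differ.

```python
def edge_f1(true_edges, predicted_edges):
    """F1 score between two collections of directed edges given as (cause, effect) pairs."""
    true_set, predicted_set = set(true_edges), set(predicted_edges)
    true_positives = len(true_set & predicted_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground-truth graph and an agent's recovered graph.
true_graph = [("A", "B"), ("B", "C"), ("C", "D")]
recovered_graph = [("A", "B"), ("B", "C"), ("A", "D")]

score = edge_f1(true_graph, recovered_graph)
print(round(score, 3))
```

An agent can predict the held-out quantity correctly while still proposing a wrong edge such as A -> D, which lowers this score without necessarily affecting task accuracy.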

2605.25299 2026-05-29 cs.CV cs.LG

A Principled Self-Referenced Early Stopping Approach for Deep Image Prior

Chaoyan Huang, Cheng-Han Huang, Ismail R. Alkhouri, Rongrong Wang

Affiliations: Department of Computational Mathematics, Science, & Engineering, Michigan State University; Department of Electrical Engineering and Computer Science, University of Michigan; X Computational Physics Division, Los Alamos National Laboratory; Michigan Institute for Computational Discovery & Engineering, University of Michigan; Mathematical Sciences, Michigan State University

AI summary: To counter overfitting in Deep Image Prior (DIP), the paper proposes an overfitting detection framework built on constructed pseudo self-referenced images, yielding early stopping that does not require an accurate noise-level estimate.

Comments 35 pages, 10 figures, 14 tables

Abstract

Recently, Deep Image Prior (DIP) has demonstrated strong capabilities for solving inverse imaging problems (IIPs) by optimizing a randomly initialized convolutional neural network in a training-data-free regime. However, DIP suffers from overfitting to noisy measurements due to network over-parameterization, making early stopping (ES) essential. The most successful ES method tracks fluctuations in the running variance of the network output to detect overfitting. However, in many applications, these fluctuations may appear prematurely, leading to unstable reconstructions. In this paper, we first show that nearly optimal DIP early stopping can be achieved when two independent noisy copies of the degraded image are available. Motivated by this observation, and since obtaining two fully independent copies is infeasible, we propose an overfitting detection framework based on constructing pseudo self-referenced images, resulting in three IIP-specific algorithms. Our approach is further supported by theoretical results on single-reference validation, pseudo-validation estimation, and the impact of shared noise. Across different IIPs, ranging from natural image restoration to medical image reconstruction, and under varying noise levels and noise types, our methods consistently outperform existing DIP early stopping approaches, all without requiring an accurate estimate of the noise level.
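The two-copy observation can be illustrated with a toy trajectory. This is a sketch under stated assumptions: the signal, the two noise vectors, and the hand-built sequence of reconstructions are all hypothetical stand-ins, not DIP outputs or the paper's algorithms. The reconstructions first approach the clean signal and then drift toward the first noisy copy, mimicking overfitting.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical clean signal and two independent noise realizations.
clean = [1.0, 1.0, 1.0, 1.0]
noise_fit = [0.5, -0.5, 0.5, -0.5]  # noise in the copy the network is fitted to
noise_ref = [0.5, 0.5, -0.5, -0.5]  # noise in the held-out reference copy

fitted_copy = [c + n for c, n in zip(clean, noise_fit)]
reference_copy = [c + n for c, n in zip(clean, noise_ref)]

def reconstruction(step):
    """Toy iterate: learns the signal for steps 0-5, then absorbs the fitted noise."""
    if step <= 5:
        return [(step / 5) * c for c in clean]
    return [c + ((step - 5) / 5) * n for c, n in zip(clean, noise_fit)]

steps = range(11)
# The fitting loss keeps falling, so it gives no stopping signal on its own.
fit_losses = [mse(reconstruction(s), fitted_copy) for s in steps]
# The error against the independent copy turns upward once noise is being fitted.
reference_losses = [mse(reconstruction(s), reference_copy) for s in steps]

stop_step = min(steps, key=lambda s: reference_losses[s])
print(stop_step)
```

In this toy setting the reference error is smallest exactly where the iterate equals the clean signal, while the fitting loss is smallest at the last step. The paper's pseudo self-referenced images are meant for the realistic case where a second independent copy is not available.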

2605.25297 2026-05-29 cs.CL cs.AI cs.LG

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang

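Affiliations: Alibaba Cloud Computing Co. Ltd, Hangzhou, China; School of Computer Science, Fudan University, Shanghai, China; School of Computer Science and Technology, Tongji University, Shanghai, China; Independent Researcher, United States

AI summary: Proposes Eureka, a framework that treats feature engineering as an agentic code generation problem. Its three stages (Expert Agent, LLM Feature Factory, and Self-Evolving Alignment Engine) automatically generate executable feature code, significantly improving performance on 7 public benchmarks in healthcare, finance, and social domains and on GPU resource demand prediction at Alibaba Cloud.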

Comments accepted at NeurIPS 2025 Workshop, DASFAA 2026 (International Conference on Database Systems for Advanced Applications)

Journal ref Database Systems for Advanced Applications (DASFAA 2026), Lecture Notes in Computer Science, vol. 16540, pp. 528-540, Springer

DOI: 10.1007/978-981-92-0378-9_33

Abstract

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.

2605.25059 2026-05-29 cs.CV

VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding

Ruoyu Wang, Yong Liu, Sheng Tao, Yuhang Lin, Yukai Ma

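Affiliations: Institute of Cyber-Systems and Control

AI summary: Proposes VEOcc, a voxel-based recursive perception-and-assimilation framework whose spatio-temporal-aware online update strategy enables efficient and robust semantic occupancy prediction without initial scale estimation, reaching state-of-the-art performance in both local and embodied settings.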

Abstract

Crucial for autonomous exploration, online 3D occupancy prediction and mapping incrementally constructs dense spatial representations on the fly. However, recent Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, fundamentally limiting their operational efficiency. In this work, we present VEOcc, a voxel-centric framework formulated as a recursive perception-and-assimilation paradigm. By eliminating the need for initial scale estimation, VEOcc enables highly streamlined, open-ended map expansion. Furthermore, to robustly aggregate noisy temporal observations within the discrete voxel space, we propose a Spatio-Temporal-Aware Online Update Strategy. It integrates Cross-Temporal Logit Aggregation (TLA) for temporal consistency, Reliability-Aware Confidence Modulation (RCM) for spatial uncertainty calibration, and Confidence-Driven Incremental State Update (CSU) for robust global state assimilation. Extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate that VEOcc establishes new state-of-the-art performance in both local and embodied settings. Notably, zero-shot evaluations on self-collected video sequences further confirm its robust out-of-distribution generalization capability in completely unseen real-world environments. Ultimately, our framework provides an accurate and highly efficient solution for autonomous exploration. Code and supplementary visualizations are available on our project page: https://wryzju.github.io/VEOcc/.

2605.24846 2026-05-29 cs.LG cs.AI

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Xiangtian Ji, Yuxin Chen, Zhengzhou Cai, Xiang Wang, An Zhang, Tat-Seng Chua

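Affiliations: National University of Singapore; Beijing University of Posts and Telecommunications; University of Science and Technology of China

AI summary: By analyzing cross-task activation strength, the study finds an extremely sparse set of keystone neurons in large language models whose removal collapses model behavior. Building on this, it proposes a fine-tuning method that updates only the keystone neurons and matches or exceeds the task performance of full-parameter fine-tuning while modifying few parameters.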

Abstract

Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insufficiently understood. In this work, we show that across a wide range of open-weight Transformers, a subset of neurons remains consistently highly activated during inference across tasks of multiple capability dimensions. Probing along cross-task activation strength isolates an extremely sparse subset whose removal causes a collapse in model behavior; we term these keystone neurons. Our analysis reveals that keystone neurons are a stable and intrinsic neuron subset of the model that is largely established during pretraining. The parameters associated with these neurons are tightly calibrated during the training process, and their precise values are critical for the capabilities of the model. Building on these insights, we propose a supervised fine-tuning approach that updates only keystone neurons, achieving task gains comparable to or even better than full-parameter fine-tuning while better preserving performance in other capability dimensions, despite modifying a much smaller number of parameters.

2605.24399 2026-05-29 cs.AI

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Xuan Wang, Zhongling Xu, Gopi Kannedhara, Joakim Nguyen, Jian Yu, Jinrui Fang, Abdurrahmaan Baghdadi, Tianlong Chen, Awais Naeem, Chandra Krishnan, Edward Castillo, Andrew H. Song, Ankita Shukla, Ying Ding, Nicholas Konz, Hairong Wang

Affiliations: University of Texas at Austin; University of North Carolina at Chapel Hill; Dell Children’s Medical Center; The University of Texas MD Anderson Cancer Center; University of Nevada, Reno

AI summary: Proposes ConceptM$^3$oE, a framework that embeds concept formation in concept-guided multimodal mixture-of-experts pathways and uses residual pathways to preserve both performance and interpretability; it outperforms baselines on brain tumor classification and improves small-sample performance.

Abstract

Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes where morphology alone can be challenging to distinguish, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we utilize residual pathways within each expert to allow task-relevant signals to flow both through the concepts and directly to the final task prediction, so that high performance is maintained alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework delivers performance competitive with unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE increases macro-F1 from 56.41% to 66.70% at small training sizes compared to non-concept-informed baselines, while also showing faster training convergence, consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

2605.24140 2026-05-29 cs.AI

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Yuyu Liu, Haotian Xu, Yanan He, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma

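Affiliations: Department of Computer Science; Stony Brook University; Department of Applied Mathematics and Statistics; Yale University; Department of Data Science; New Jersey Institute of Technology; Department of Biomedical Informatics

AI summary: In multi-step reasoning, single-pass generation is efficient but inaccurate, while tree search is computationally heavy. The paper distills reasoning progress into a hyperbolic geometric signal that guides step-by-step generation, using distance and angle in hyperbolic space to encode solution proximity and distinguish branches. It trains a lightweight head to project hidden states and fine-tunes an adapter, yielding consistent gains on multiple benchmarks.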

Abstract

Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.
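The geometric intuition can be made concrete in one common model of hyperbolic space. The sketch below assumes the Poincaré ball with curvature -1, where the distance from the origin to a point $x$ is $2\,\operatorname{artanh}(\lVert x \rVert)$; the abstract does not say which model or curvature the authors use, and the example points are hypothetical.

```python
import math

def distance_to_origin(point):
    """Hyperbolic distance from the origin in the Poincare ball (curvature -1)."""
    norm = math.sqrt(sum(c * c for c in point))
    if norm >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 2.0 * math.atanh(norm)

def angular_separation(a, b):
    """Angle in radians between two nonzero points, viewed as directions from the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

# Hypothetical projected states: points closer to the boundary of the ball
# have rapidly growing hyperbolic distance from the origin.
for radius in (0.5, 0.9, 0.99):
    print(radius, distance_to_origin([radius, 0.0]))

# Two hypothetical states at the same depth but on different branches.
print(angular_separation([0.5, 0.0], [0.0, 0.5]))
```

In the abstract's framing, distance to the origin encodes solution proximity and angle separates branches; how the trained head realizes this is not specified here.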

2605.23993 2026-05-29 cs.CV cs.AI cs.LG

Nano World Models: A Minimalist Implementation of Future Video Prediction

Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz

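Affiliations: DeepMind

AI summary: Introduces Nano World Models, a minimalist codebase for future video prediction built around diffusion forcing. It supports controlled studies of world-model design choices, and its experiments analyze how prediction parameterization, architecture scale, and other factors affect video prediction quality.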

Comments Project page: https://simchowitzlabpublic.github.io/nano-world-model/

Abstract

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

2605.23657 2026-05-29 cs.CL

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

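Affiliations: Singapore Management University; Institute of Trustworthy Embodied AI, Fudan University; Joy Future Academy, JD

AI summary: Proposes OpenSkillEval, an automatic evaluation framework that dynamically constructs realistic task instances and collects community skills to systematically evaluate skill-augmented agent systems and the skills themselves. Key findings include that skill availability does not guarantee effective use and that the benefit of skill augmentation depends on the model and the agent framework.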

Abstract

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.

2605.23531 2026-05-29 cs.CV

MPDocBench-Parse: A Benchmark for Practical Multi-Page Document Parsing

Bangbang Zhou, Hangdi Xing, Yifan Chen, Jianjun Xu, Qi Zheng, Feiyu Gao, Zhibo Yang, Shuai Bai, Ming Yan, Jieping Ye, Hongtao Xie

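Affiliations: University of Science and Technology of China; Tongyi Lab, Alibaba Group

AI summary: Existing benchmarks fall short in evaluating real-world scenarios. The paper proposes MPDocBench-Parse, a benchmark of 433 multi-page documents (3,246 pages) covering 15 document types, with a comprehensive evaluation protocol for content fidelity and logical structure. Experiments show that existing models have clear limitations in semantic continuity, visual content parsing, and hierarchical structure recovery.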

Abstract

Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for realistic scenarios. Existing benchmarks either focus on specific tasks or assess only single-page, text-centric settings, making them insufficient for practical multi-page parsing. Moreover, they lack fine-grained evaluation of semantic continuity, hierarchical structure recovery, and visual content preservation. To address these gaps, we propose MPDocBench-Parse, a benchmark for multi-page document parsing in real-world applications. It contains 433 manually annotated documents with 3,246 pages, covering 15 document types in English and Chinese, with diverse layout styles, and supports document-level end-to-end evaluation. We further design a comprehensive protocol for content fidelity and logical structure, covering text, table, and formula recognition, truncated text and table merging, figure extraction, reading order, and heading hierarchy recovery. Experiments show that, while existing models perform well on basic text extraction, they still suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery. MPDocBench-Parse provides a unified foundation for advancing document parsing toward more realistic scenarios.

2605.22082 2026-05-29 cs.RO cs.LG

CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

Wentian Wang, Chutong Wen, Hongxu Ma, Wuhao Wang, Zhexiong Xue, Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun, Jianqiao Zhu

Affiliations: Synthoid AI

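AI summary: Proposes CoRMA, a framework that uses a semantic contact context and contrastive learning for meta-adaptation in force-dominant assembly tasks, without demonstrations or gradient updates; it outperforms baselines in simulation and on a real robot.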

Abstract

We present CoRMA (Contrastive Robotic Motor Adaptation), a context-based meta-adaptation framework that modifies RMA for force-dominant assembly. CoRMA replaces raw simulator-parameter adaptation with a compact 6D simulator-only semantic contact context describing contact onset, lateral engagement, guided transition, contact direction, and jamming. A deployable causal Transformer adapter infers this context online from force, proprioceptive, and action histories using semantic regression and a force-regime contrastive objective. At deployment, oracle context is removed and replaced by the inferred context, enabling within-episode adaptation without demonstrations, privileged inputs, or gradient updates. We evaluate CoRMA on PegInsert, GearMesh, and NutThread in Isaac Lab / Isaac Sim 5.0 and on a real Marvin arm. Compared with FORGE baselines that achieve high simulation success but degrade substantially on hardware, CoRMA retains higher verified real success under controlled target-pose noise. These results support semantic contact inference as a reusable adaptation interface within a related assembly task family, while broader unseen-task generalization and Real2Sim calibration remain future work.

2605.22080 2026-05-29 cs.CV cs.AI

HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone

Guanyiman Fu, Jingtao Li, Zihang Cheng, Zhuanfeng Li, Diqi Chen, Yan Xu, Xiangyu Liu, Fengchao Xiong, Jianfeng Lu, Chengrong Chen, Jun Zhou

Affiliations: Griffith University, Australia; Wuhan University, China; Nanjing University of Science and Technology, China; Huaiyin Normal University, China; Massey University, New Zealand

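AI summary: To address varying sensor configurations, scarce and inconsistent labels, and limited dataset scale in ground-based hyperspectral imaging, the paper proposes HyperVision, the first ground-based hyperspectral pre-trained backbone. It combines channel-adaptive dynamic embedding, multi-source pseudo-labeling, and cross-modal knowledge distillation, and achieves state-of-the-art performance on three downstream tasks.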

Abstract

While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, we develop an unsupervised representation learning framework. Specifically, to address label scarcity and inconsistency, a multi-source pseudo-labeling method is introduced to fuse spatial structures from SAM2 and fine-grained spectral material information from HyperFree. Furthermore, to enrich scene diversity and compensate for limited dataset scale, a cross-modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre-trained RGB vision model to our backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head-only adaptation without adjusting backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_{\mathrm{M}}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available on https://github.com/lronkitty/HyperVision .

2605.15852 2026-05-29 cs.CV

GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

Leyang Chen, Junyi Wu, Zhiteng Li, Yulun Zhang

Affiliations: Shanghai Jiao Tong University

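AI summary: Proposes GHOST, a framework that uses the model's own 3D geometry outputs to evict redundant tokens online, cutting the KV cache by nearly half and delivering a 1.75x speedup while preserving reconstruction quality.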

Abstract

Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.

2605.15422 2026-05-29 cs.LG

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis

Affiliations: Amazon Web Services; Google; University of Minnesota

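AI summary: To eliminate redundant computation on shared prompts in RL training, the paper proposes the DualKV kernel, which removes prompt replication through fused CUDA forward and backward kernels and a repacked veRL data pipeline, achieving 1.63x to 3.82x policy-update speedup.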

Abstract

Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes, duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N \geq 16$, $P \geq 8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once, a property not yet exploited at the kernel level for training. We propose DualKV, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and backward kernels that iterate over two disjoint KV regions (shared context and per-sequence response) in a single kernel launch, and (2) a data-pipeline redesign in veRL that repacks $N(P+R)$ tokens into $P+NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $\rho = N(P+R)/(P+NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with 8$\times$H100 GPUs ($N=32$, 8K-context), DualKV achieves $1.63$ to $2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from $36\%$ to $76\%$. Similar gains hold for DAPO ($2.47\times$ speedup, $77\%$ MFU). At 30B MoE scale on 16$\times$H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
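
The token-count factor quoted in the DualKV abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the response length of 1024 tokens is an assumed value (the abstract does not state $R$), and the result is a ratio of token counts, not a measured speedup.

```python
def replicated_tokens(n: int, p: int, r: int) -> int:
    """Tokens per micro-batch when the prompt is copied for every response: N(P+R)."""
    return n * (p + r)


def repacked_tokens(n: int, p: int, r: int) -> int:
    """Tokens per micro-batch when the shared prompt is stored once: P+NR."""
    return p + n * r


def reduction_factor(n: int, p: int, r: int) -> float:
    """Token reduction factor rho = N(P+R) / (P+NR) from the abstract."""
    return replicated_tokens(n, p, r) / repacked_tokens(n, p, r)


if __name__ == "__main__":
    # N and P follow the abstract's setting (N=32, 8K prompt); R=1024 is an assumption.
    n, p, r = 32, 8192, 1024
    print(replicated_tokens(n, p, r))  # 294912
    print(repacked_tokens(n, p, r))    # 40960
    print(reduction_factor(n, p, r))   # 7.2
```

Under these assumed values the factor is 7.2. It approaches $N$ as responses become short relative to the prompt and approaches 1 as responses dominate, which is consistent with the abstract's focus on long prompts and large rollouts.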

2605.15219 2026-05-29 cs.AI cs.IT math.IT

NOVA: Fundamental Limits of Knowledge Discovery Through AI

Salman Avestimehr, Ken Duffy, Muriel Médard

Affiliations: University of Southern California; Northeastern University; Massachusetts Institute of Technology

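AI summary: Proposes the NOVA framework, which models the generate-verify-accumulate-retrain loop as an adaptive sampling process over a knowledge space, identifies the conditions under which knowledge covers a finite domain along with the failure modes, and proves scaling laws for discovery cost related to Zipf's law.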

Abstract

Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is statistically consistent with optimal ranking. We further establish a calibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order $O(1/m + 1/\sqrt{n})$ and $O(1/\sqrt{m} + 1/\sqrt{n})$, respectively, where $m$ denotes the number of negative samples and $n$ the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between $m$ and $n$. Extensive experiments on large-scale vision-language models corroborate our theoretical predictions.
