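arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.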

2606.09966 2026-06-10 cs.SD New submission

RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification

Shakhrul Iman Siam, Tiantian Feng, Jiankun Zhang, Shrikanth Narayanan, Mi Zhang

AI summary: Proposes RespiraMFM, a multimodal foundation model that integrates respiratory sounds with clinical information through a contrastive audio-text alignment strategy, improving AUROC by 9.15% on supervised tasks and 20.98% on zero-shot tasks.

Comments ACL 2026 Main Conference

Abstract

Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and corresponding textual clinical information. We evaluate RespiraMFM across five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings, achieving a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks over existing baselines. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.
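
As a rough illustration of contrastive audio-text alignment, the sketch below computes a symmetric InfoNCE-style loss over paired embeddings. The abstract does not state the exact objective, so the CLIP-style formulation, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cross_entropy_diag(sim_rows):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim_rows):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(sim_rows)

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Average of audio-to-text and text-to-audio losses (temperature is assumed)."""
    sims = [[cosine(a, t) / temperature for t in text_emb] for a in audio_emb]
    sims_transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(sims_transposed))

# Toy embeddings: row i of `audio` is paired with row i of the text list.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = symmetric_contrastive_loss(audio, text_matched)
loss_swapped = symmetric_contrastive_loss(audio, text_swapped)
```

Correctly paired embeddings should give a lower loss than swapped pairs, which is the signal that pulls matching sound and text representations together.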


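2606.09962 2026-06-10 cs.LG cs.AI cs.SD New submission

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

Vadim Popov, Wenju Gu, Tasnima Sadekova, Georgii Aparin, Assel Yermekova

AI summary: Studies the latent-space structure of discrete tokens in continuous diffusion models, shows through theoretical analysis and experiments that the FSQ tokenization scheme is best suited to continuous diffusion for categorical data, and validates on text-to-speech that it outperforms an LLM-based counterpart.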

Abstract

Continuous diffusion for categorical data is a framework in the diffusion family that aims to generate discrete data. Scientific interest in such models has grown steadily as researchers pursue the challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study properties of the latent-space structure corresponding to discrete tokens, expressed in terms of the Kullback-Leibler divergence on diffusion path measures and the accuracy of correct-token prediction by the optimally trained diffusion model. We find that the FSQ tokenization scheme has a latent-space structure whose properties make it best suited for continuous diffusion for categorical data, as verified through rigorous theoretical analysis and numerical experiments. To validate our findings in a real-life scenario, we train several text-to-speech diffusion models that use speech tokens as intermediate acoustic features and show that the one based on FSQ tokens indeed performs best; moreover, it outperforms its strong LLM-based counterpart while being significantly smaller and faster.
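
For readers unfamiliar with the tokenizer, the sketch below shows the general idea, assuming FSQ here refers to finite scalar quantization: each latent coordinate is bounded and rounded to a small fixed set of integer levels, so the codebook is an implicit grid. The level counts, the tanh bounding, and the restriction to odd level counts are simplifications chosen for this illustration, not details taken from the paper.

```python
import math

def fsq_quantize(z, levels):
    """Round each coordinate of z to one of levels[i] integer values (odd counts only)."""
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = n_levels // 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(round(bounded))       # nearest integer in {-half, ..., half}
    return codes

def code_to_index(codes, levels):
    """Map per-dimension codes to a single token id using mixed-radix encoding."""
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + (code + n_levels // 2)
    return index

levels = [5, 5, 3]  # implicit codebook of 5 * 5 * 3 = 75 tokens
codes = fsq_quantize([0.2, -3.0, 1.5], levels)
token_id = code_to_index(codes, levels)
```

Because tokens sit on a regular grid, neighboring token ids along one dimension are also neighbors in latent space, which is the kind of latent-space structure the abstract analyzes.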


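2606.09961 2026-06-10 cs.LG cs.AI New submission

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Yu Han, Kailing Li, Yang Jiao, Yulin Dai, Yuqian Fu, Linhai Zhuo, Tianwen Qian

AI summary: Proposes the 3SPO algorithm, which performs step-wise policy optimization under dynamic state-score supervision to address sparse rewards and credit assignment in multi-turn agent tasks, outperforming GRPO by 22.6% on ALFWorld and 15.6 points on WebShop.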

Abstract

Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings, where rewards are sparse and delayed and credit assignment across individual steps is critical. In this work, we propose State-Score-Supervised Policy Optimization (3SPO), a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout, and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO, by +22.6% on ALFWorld and +15.6 points on WebShop, while using comparable resources to achieve 2.4× more state exploration and 1.8× faster convergence. Code is available at https://github.com/genalyu/3SPO.
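
One way to picture a state score built from historical success rates is a per-state success counter, sketched below. The abstract does not give the scoring formula, so the class name `StateScoreTable`, the smoothing prior, and the score-difference credit in `step_credits` are hypothetical choices for illustration and should not be read as the 3SPO algorithm itself.

```python
from collections import defaultdict

class StateScoreTable:
    """Tracks, per state, how often episodes passing through it succeeded."""

    def __init__(self, prior_successes=1.0, prior_trials=2.0):
        # Hypothetical smoothing prior so that unseen states score 0.5.
        self.prior_successes = prior_successes
        self.prior_trials = prior_trials
        self.successes = defaultdict(float)
        self.trials = defaultdict(float)

    def update(self, visited_states, succeeded):
        """Record one finished episode; each distinct state counts once."""
        for state in set(visited_states):
            self.trials[state] += 1.0
            if succeeded:
                self.successes[state] += 1.0

    def score(self, state):
        """Smoothed historical success rate of episodes that visited `state`."""
        return (self.successes[state] + self.prior_successes) / (
            self.trials[state] + self.prior_trials
        )

    def step_credits(self, visited_states):
        """Hypothetical per-step credit: score change between consecutive states."""
        scores = [self.score(s) for s in visited_states]
        return [after - before for before, after in zip(scores, scores[1:])]

table = StateScoreTable()
table.update(["start", "kitchen", "goal"], succeeded=True)
table.update(["start", "hallway"], succeeded=False)
credits = table.step_credits(["start", "kitchen"])
```

A step that moves the agent into a historically more successful state receives positive credit here, without any learned value function.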


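2606.09960 2026-06-10 cs.LG cs.AI New submission

HydraCIL: Decoupled Class-Incremental Learning through Prototype-Guided Multi-Head Classifiers

Daniel Vila-Cruz, Laura Morán-Fernández, Verónica Bolón-Canedo

AI summary: Proposes HydraCIL, which freezes the backbone, decouples feature extraction from learning, and selects task-specific classifier heads by prototype similarity, enabling efficient class-incremental learning in resource-constrained environments; it matches or outperforms existing methods while substantially reducing training time and carbon emissions.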

Comments Accepted for publication at the International Joint Conference on Neural Networks (IJCNN 2026)

Abstract

We present HydraCIL, a decoupled continual learning model based on prototype-guided multi-head classifiers, targeting sustainable deployment in embedded and resource-constrained environments. While most Class-Incremental Learning (CIL) methods rely on powerful hardware and long retraining cycles, real-world systems, such as robots or edge AI devices, must adapt quickly with limited resources. HydraCIL addresses this gap by freezing the backbone and decoupling feature extraction from learning. For each task, features are extracted once and a lightweight, task-specific classifier head is created, avoiding costly backbone retraining. At inference, HydraCIL selects the appropriate head via similarity with prototypes. Experiments on CIFAR-100, ImageNet-100, CoRe50, and Flowers102 datasets show that HydraCIL matches or outperforms state-of-the-art CIL methods while significantly reducing training time and carbon footprint, making it a practical solution for continual learning in real-world and embedded settings, where energy efficiency and rapid adaptation are critical.
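
The inference-time head selection can be sketched as a nearest-prototype lookup over frozen-backbone features. The abstract only says heads are chosen "via similarity with prototypes", so cosine similarity, the max-over-prototypes rule, and the toy three-dimensional features below are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_head(feature, prototypes_by_head):
    """Return the head whose closest prototype is most similar to the feature."""
    best_head, best_similarity = None, -math.inf
    for head_id, prototypes in prototypes_by_head.items():
        similarity = max(cosine(feature, p) for p in prototypes)
        if similarity > best_similarity:
            best_head, best_similarity = head_id, similarity
    return best_head

# Hypothetical class prototypes (e.g. mean features per class) stored per task head.
prototypes_by_head = {
    "task_0": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "task_1": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
chosen = select_head([0.1, 0.2, 0.9], prototypes_by_head)
```

Only the selected lightweight head then needs to run on the extracted feature, which keeps inference cheap when many tasks have been learned.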


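2606.09959 2026-06-10 cs.LG cs.AI New submission

Temporal Context Conditioning for Seasonality-Aware Precipitation Nowcasting of High-Intensity Rainfall

Gijs van Nieuwkoop, Siamak Mehrkanoon

AI summary: Proposes TA-SmaAt-UNet, which augments radar-based precipitation nowcasting with temporal conditioning layers (cyclical encodings of time of day and time of year), significantly improving prediction of high-intensity rainfall events.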

Comments 9 pages, 6 figures

Abstract

Precipitation nowcasting is increasingly being approached with deep learning models that learn directly from recent radar observations. Although such models can efficiently capture short-term precipitation motion, they often lack broader contextual information about the meteorological conditions under which rainfall develops. This paper investigates whether lightweight temporal context can improve radar-based nowcasting, particularly for high-intensity rainfall. We propose the Time-Aware Small-Attention U-Net (TA-SmaAt-UNet), which extends the core SmaAt-UNet model with temporal conditioning layers that use cyclical encodings of time-of-day and time-of-year to modulate intermediate feature representations. Experiments on KNMI radar precipitation data show that temporal conditioning is most beneficial for rare, high-intensity precipitation events, while also improving the representation of seasonal variability and predicted rainfall-intensity distributions. A layer conductance analysis further indicates that the added temporal conditioning layers are actively used by the model despite their small parameter cost. These findings suggest that simple, physically motivated temporal context can improve the realism and reliability of deep learning-based precipitation nowcasts. The implementation of our models and training setup is available on GitHub: https://github.com/gijsvn/TA-SmaAt-UNet.
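
A cyclical encoding maps a periodic quantity onto a circle so that, for example, 23:59 and 00:00 end up close together. The sketch below shows the standard sine/cosine construction for time of day and time of year; the function name, the feature ordering, and the 365.25-day year are assumptions, and how the paper feeds these features into its conditioning layers is not shown.

```python
import math
from datetime import datetime

def cyclical_time_features(timestamp):
    """Return [sin, cos] pairs for time of day, then for time of year."""
    seconds = timestamp.hour * 3600 + timestamp.minute * 60 + timestamp.second
    day_fraction = seconds / 86400.0
    # Approximate year length; day 1 of the year maps to fraction 0.
    year_fraction = (timestamp.timetuple().tm_yday - 1) / 365.25
    features = []
    for fraction in (day_fraction, year_fraction):
        angle = 2.0 * math.pi * fraction
        features.extend([math.sin(angle), math.cos(angle)])
    return features

midnight = cyclical_time_features(datetime(2024, 1, 1, 0, 0))
just_before_midnight = cyclical_time_features(datetime(2024, 1, 1, 23, 59))
noon = cyclical_time_features(datetime(2024, 1, 1, 12, 0))
```

With raw hour values, 23:59 and 00:00 would sit almost 24 units apart; in the encoded space they are nearly identical, while noon lies on the opposite side of the circle.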


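2606.09958 2026-06-10 cs.RO cs.AI New submission

Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment

Ming Cheng, Hao Chen, Ziyi Yang, Ziluowen Luo, Senzhang Wang

AI summary: Proposes Uncertainty-Aware Motion Planning (UAMP), which quantifies uncertainty in human intent and introduces uncertainty-calibrated value learning to improve the safety and comfort of autonomous driving in mixed traffic.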

Abstract

In mixed-traffic environments where autonomous and human-driven vehicles may co-exist, motion planning for autonomous vehicles requires anticipating the future behaviors of surrounding human drivers. Existing reinforcement learning-based methods generally incorporate the predicted human intents directly into the observation to enable proactive planning. However, human intent is inherently uncertain due to behavioral diversity, perception noise, and partial observability. Treating predicted intents as deterministic states can result in unsafe decisions for autonomous vehicles. To address this problem, we propose Uncertainty-Aware Motion Planning (UAMP), which incorporates uncertainty in human intent prediction into AV decision-making. Specifically, UAMP first introduces a proximity-aware uncertainty estimator to quantify the interaction-conditioned intent uncertainty and constructs an uncertainty-guided joint intent distribution over surrounding human-driven vehicles. Within this uncertainty set, UAMP further introduces Uncertainty-Calibrated Value Learning (UCVL) to correct value function learning biases arising from directly incorporating uncertain human intent predictions into the observation. Extensive experiments in various mixed-traffic scenarios show that, compared with existing approaches, UAMP significantly improves safety and driving comfort while maintaining traffic efficiency. The code is released at https://anonymous.4open.science/r/UAMP-5638.


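2606.09954 2026-06-10 cs.LG cs.AI New submission

Does Normalization Choice Matter for Causal Large Time-Series Models?

Samy-Melwan Vilhes, Gilles Gasso, Mokhtar Z Alaya

AI summary: Studies how different normalization strategies affect training convergence and forecasting performance in causal large time-series models, finding that the choice of normalization significantly affects model performance.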


Journal ref: ICLR 2026 Workshop: Time Series in the Age of Large Models, Apr 2026, Rio De Janeiro, Brazil

Abstract

Large models for time-series forecasting have emerged as a promising paradigm for training models on heterogeneous collections of signals. These models typically rely on causal autoregressive architectures, where each observation is sequentially predicted from the past. In practice, real-world time series exhibit non-stationarities, which significantly influence predictive performance. To mitigate this, normalization is commonly employed. However, in efficient causal settings it might induce information leakage from future observations during training. Recent alternatives, including causal normalization and statistics computed from initial observations, have been proposed to address this issue, but their practical implications remain insufficiently understood. In this work, we evaluate normalization strategies for transformer-based large time-series models trained with patching and an efficient causal strategy. We show that the choice of normalization significantly influences both training convergence and forecasting performance.
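
The leakage issue can be made concrete with a small comparison. Below, `global_normalize` uses statistics of the whole series, so early values depend on later ones, while `causal_normalize` uses only running statistics up to each time step. This is one plausible reading of "causal normalization", written per time step rather than per patch for brevity; it is an assumption, not the exact scheme evaluated in the paper.

```python
import math

def global_normalize(series, eps=1e-8):
    """Standardize with whole-series statistics (can leak future information)."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / (std + eps) for x in series]

def causal_normalize(series, eps=1e-8):
    """Standardize x_t with the mean/std of x_0..x_t only (Welford running update)."""
    count, mean, m2 = 0, 0.0, 0.0
    normalized = []
    for x in series:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        std = math.sqrt(m2 / count)
        normalized.append((x - mean) / (std + eps))
    return normalized

series = [1.0, 2.0, 3.0, 10.0, 20.0]
prefix = series[:3]
causal_is_prefix_stable = causal_normalize(series)[:3] == causal_normalize(prefix)
global_is_prefix_stable = global_normalize(series)[:3] == global_normalize(prefix)
```

The causal variant gives the same values for the first three points whether or not the later spike is present; the global variant does not, which is the information leakage the abstract refers to.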


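2606.09951 2026-06-10 cs.LG New submission

Hasse Diagrams for Attention: A Partial Order Framework for Designing Transformer Masks

Chentao Li, Han Guo

AI summary: Proposes a theoretical framework proving that the information flow of a multi-layer Transformer converges to a Hasse diagram, recasts the design of parallel training tasks as finding a minimal common supergraph of Hasse diagrams, and derives two new attention masks.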

Comments 21 pages, 9 figures. Theoretical framework for attention mask design; no experiments included

Abstract

During the training of large Transformer models, attention masks regulate the scope and direction of information flow across a sequence. Numerous mask variants exist, and operators such as FlexAttention already support arbitrary attention masks. Nevertheless, a systematic formal analysis of the information-flow structure induced by arbitrary masks has been missing. This paper develops a complete theoretical framework. We prove that, with sufficient depth, the information flow of a multi-layer Transformer converges to a Hasse diagram -- a directed acyclic graph representing a partial order. Building on this, we recast the design of parallel training tasks as the problem of finding a minimal common supergraph of Hasse diagrams, and we establish a criterion for the minimal common supergraph. This yields a constructive method to derive attention masks directly from a family of tasks. Applying the framework, we design two novel masks: a block-generation attention mask that ensures training-inference consistency (Block Two-Stream Attention), and a fully supervised bidirectional attention mask (Butterfly Attention). These results demonstrate the framework's capacity to discover new structures.
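
To make the central object concrete, the sketch below computes the Hasse diagram (the covering relation, i.e. the transitive reduction) of the partial order induced by an attention mask. It uses the textbook definition rather than the paper's own construction, ignores self-attention on the diagonal, and assumes the mask is acyclic once the diagonal is removed; masks with mutual attention between distinct tokens would first need their cycles collapsed, which is not handled here.

```python
def transitive_closure(mask):
    """reach[i][j] is True if token i can receive information from token j (i != j)."""
    n = len(mask)
    reach = [[bool(mask[i][j]) and i != j for j in range(n)] for i in range(n)]
    for k in range(n):  # Warshall's algorithm
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def hasse_edges(mask):
    """Edges (j, i): j flows to i and no intermediate token lies between them."""
    reach = transitive_closure(mask)
    n = len(mask)
    edges = []
    for i in range(n):
        for j in range(n):
            if reach[i][j] and not any(reach[i][k] and reach[k][j] for k in range(n)):
                edges.append((j, i))
    return sorted(edges)

# mask[i][j] = 1 means query position i may attend to key position j.
causal_mask = [[1 if j <= i else 0 for j in range(4)] for i in range(4)]
causal_hasse = hasse_edges(causal_mask)

# Two independent tokens (0 and 1) that are both visible to token 2.
merge_mask = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
merge_hasse = hasse_edges(merge_mask)
```

For the standard causal mask the Hasse diagram is a simple chain, since every longer-range attention edge is implied by composing adjacent ones across layers.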


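2606.09949 2026-06-10 cs.LG cs.AI New submission

Learning Where to Simulate: Generative Active Sampling for Online PDE Surrogate Training

Pierre Cesar, Sofya Dymchenko, Abhishek Purandare, Bruno Raffin

AI summary: Proposes Online Generative Active Sampling (OGAS), which uses a diffusion model to learn the relationship between configuration parameters and surrogate performance and actively samples high-difficulty regimes, substantially reducing tail errors and improving worst-case surrogate reliability.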

Abstract

Data-driven PDE surrogates are trained with data produced by numerical PDE solvers. However, when the surrogate's goal is to generalize across a wide range of PDE configurations (e.g., initial conditions and physical coefficients), generating a representative training set is non-trivial. Uniform sampling of configuration parameters often under-represents trajectories exhibiting challenging dynamics, leading to high prediction errors and large error variance in the trained surrogate. Online training, where data generation and surrogate training are coupled, offers a natural advantage by allowing solver parameters to be steered on-the-fly. To efficiently exploit this capability, we introduce Online Generative Active Sampling (OGAS), an active learning method that reactively learns the relationship between configuration parameters and surrogate performance to control the sampling distribution. OGAS trains a fast diffusion model in parallel to the surrogate to act as a conditional sampler, mapping a surrogate-derived difficulty signal (e.g., loss or uncertainty) to configuration parameters. By actively drawing target signals from a prior biased toward high difficulty, OGAS continuously steers data generation toward challenging regimes without delaying the training workflow. We evaluate OGAS across 2D PDEs with distinct challenging dynamics (Kuramoto-Sivashinsky, Navier-Stokes, Gray-Scott) and up to 308 parameters, using multiple surrogate architectures. Across all settings, OGAS consistently improves tail statistics, yielding substantial reductions in errors above the 99th percentile and overall error dispersion compared to uniform sampling. While prioritizing challenging trajectories introduces a trade-off with average error, OGAS effectively ensures worst-case reliability of trained surrogates with negligible wall-time overhead.


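2606.09940 2026-06-10 cs.LG cs.AI New submission

Interactions Between Crosscoder Features: A Compact Proofs Perspective

Dmitry Manning-Coe, Thomas Read, Anna Soligo, Oliver Clive-Griffin, Chun-Hei Yip, Rajashree Agrawal, Jason Gross

AI summary: Formalizes interactions between crosscoder features from a compact-proofs perspective, proposes an interaction measure, and applies it to computational sparsity, semantic clustering, and sleeper-agent detection.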

Comments Accepted at the NeurIPS 2025 Workshop on Mechanistic Interpretability

Abstract

Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into independent features. Interactions between features hence induce errors in the reconstruction. We formalize this intuition via compact proofs and make five contributions. First, we show how, in principle, a compact proof of model performance can be constructed using a crosscoder. Second, we show that an error term arising in this proof can naturally be interpreted as a measure of interaction between crosscoder features and provide an explicit expression for the interaction term in the Multi-Layer Perceptron (MLP) layers. We then provide three applications of this new interaction measure. In our third contribution we show that the interaction term itself can be used as a differentiable loss penalty. Applying this penalty, we can achieve "computationally sparse" crosscoders that retain 60% of MLP performance when only keeping a single feature at each datapoint and neuron, compared to 10% in standard crosscoders. We then show that clustering according to our interaction measure provides semantically meaningful feature clusters, and finally that sleeper agents have significant interactions. Code is available at https://github.com/chainik1125/crosscoders-feature-interactions/tree/arxiv.


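2606.09937 2026-06-10 cs.LG cs.AI cs.CL New submission

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

Anirudh Sekar

AI summary: Proposes the RKSC framework, which removes structural redundancy in multi-branch LLM reasoning through attention-similarity KV sharing, confidence-gated early exit, and reasoning-selective block cache management, achieving a mean 3.008x speedup with an error rate of only 0.37%.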

Comments Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

Abstract

We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008x over the No-KV baseline (peak 3.990x), a 1.66x mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at https://github.com/AnirudhSekar/RKSC.
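
The second exit mechanism, stopping the verification pass once per-layer entropy stabilises, can be sketched as below. The abstract does not say how per-layer distributions are obtained or what thresholds are used, so the function `exit_layer`, the tolerance `tol`, the `patience` count, and the toy probability vectors are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def exit_layer(per_layer_probs, tol=0.05, patience=2):
    """Index of the first layer after `patience` consecutive near-constant entropies.

    Falls back to the last layer if entropy never stabilises.
    """
    stable_steps = 0
    previous = None
    for layer, probs in enumerate(per_layer_probs):
        current = entropy(probs)
        if previous is not None and abs(current - previous) < tol:
            stable_steps += 1
            if stable_steps >= patience:
                return layer
        else:
            stable_steps = 0
        previous = current
    return len(per_layer_probs) - 1

# Toy per-layer next-token distributions that sharpen and then stop changing.
confident = [0.97, 0.01, 0.01, 0.01]
layers = [[0.25] * 4, [0.7, 0.1, 0.1, 0.1], confident, confident, confident, confident]
chosen_layer = exit_layer(layers)
```

In this toy run the pass would stop at layer 4 of 6; a real implementation would need hooks on the transformer backbone to read intermediate states, as the abstract notes.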


2606.09936 2026-06-10 cs.LG cs.AI New submission

One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

Bhavith Chandra Challagundla, Sanskar Pandey, Param Thakkar, Rishikesh Mallagundla, Yugandhar Reddy Gogireddy, Wenhao Lu, Hindol Roy Choudhury, Shravani Challagundla, Mohamed Deraz Nasr, Spursh Deshpande

AI summary: Proposes WorldModelLens, which unifies interpretability analysis across different world models (such as PlaNet, IRIS, and I-JEPA) through a capability-typed adapter, avoiding repeated re-implementation.

Abstract

World models are now built on substantially different computational substrates. Latent recurrent state-space models such as PlaNet and the Dreamer family compress observations into recurrent states; token-based models such as IRIS quantize observations into a learned codebook and predict autoregressively with a transformer; and joint-embedding predictive architectures such as I-JEPA predict in a learned latent space with no pixel decoder. The interpretability methods applied to these models, including probing, activation patching, sparse autoencoders, and surprise analysis, share a common set of primitives, yet they are re-implemented from scratch for each architecture because existing hook-and-cache tooling assumes a transformer language model with no notion of actions, environment steps, or imagined rollouts. We argue that this fragmentation reflects the tooling rather than the models, and that the shared structure of world models is captured by a small typed interface. We present WorldModelLens, an open-source interpretability substrate organized around a capability-typed adapter: every model implements four required methods (encode, transition, initial state, sample) and declares a set of optional heads (decode, reward, continue, actor, critic) through an explicit capability descriptor, so that reinforcement-learning and self-supervised world models are first-class without either imitating the other. A single hook and cache layer exposes time-indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once.
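
The shape of such an interface can be sketched with a Python protocol: four required methods plus a descriptor declaring which optional heads a model provides. Every name below (`Capabilities`, `WorldModelAdapter`, `ToyCounterModel`, `imagine`) is hypothetical and chosen for illustration; this is not the actual WorldModelLens API.

```python
from dataclasses import dataclass
from typing import Any, List, Protocol

@dataclass(frozen=True)
class Capabilities:
    """Declares which optional heads a model exposes (hypothetical descriptor)."""
    decode: bool = False
    reward: bool = False
    cont: bool = False  # "continue" is a reserved word in Python
    actor: bool = False
    critic: bool = False

class WorldModelAdapter(Protocol):
    """The four required methods named in the abstract."""
    capabilities: Capabilities
    def encode(self, observation: Any) -> Any: ...
    def transition(self, state: Any, action: Any) -> Any: ...
    def initial_state(self) -> Any: ...
    def sample(self, state: Any) -> Any: ...

class ToyCounterModel:
    """Toy model whose latent state is an integer counter; it only has a reward head."""
    capabilities = Capabilities(reward=True)
    def encode(self, observation): return int(observation)
    def transition(self, state, action): return state + action
    def initial_state(self): return 0
    def sample(self, state): return state
    def reward(self, state): return float(state > 2)

def imagine(model: WorldModelAdapter, actions: List[Any]) -> List[Any]:
    """Roll the model forward from its initial state, recording every state."""
    state = model.initial_state()
    trajectory = [state]
    for action in actions:
        state = model.transition(state, action)
        trajectory.append(state)
    return trajectory

model = ToyCounterModel()
trajectory = imagine(model, [1, 1, 2])
rewards = [model.reward(s) for s in trajectory] if model.capabilities.reward else []
```

An analysis such as `imagine` is written once against the protocol, and checks the capability descriptor before touching optional heads, so a model without a reward or decoder head is still a valid adapter.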


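2606.09934 2026-06-10 cs.LG cs.CR New submission

nCMD: Benign-Anchored Feature Selection for Imbalanced Network Intrusion Detection

Abu Fuad Ahmad, Istiaque Ahmed

AI summary: Proposes benign-anchored Classwise Mean Deviation (nCMD), which selects features by the deviation of attack-class distributions from the benign-class mean; it matches or exceeds classical filter baselines on four benchmark datasets, with the largest gains under tight feature budgets and severe class imbalance.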

Comments 6 pages, IEEE double columns

Abstract

Feature selection is critical for network intrusion detection systems (NIDS) operating under high-dimensional, highly imbalanced traffic, as found in operational and defense networks. Traditional filter methods rank features using global statistics computed symmetrically across classes and thus fail to capture the asymmetry of intrusion detection, where attacks are best characterized as deviations from dominant benign traffic. We propose benign-anchored Classwise Mean Deviation (nCMD), a lightweight and interpretable method that scores feature relevance based on the deviation of attack-class distributions from the benign-class mean, rather than a globally biased reference. This approach aligns feature selection with the operational semantics of NIDS at no additional computational cost. Across four benchmark datasets (CICIDS2017, CICDDoS2019, NSL-KDD, and UNSW-NB15), multiple feature budgets, and three downstream classifiers, nCMD matches or exceeds classical filter baselines in macro-averaged F1-score. It achieves the best result on three of the four datasets and under every classifier, with the strongest improvements observed under tight feature budgets and severe class imbalance. These results support benign-anchored ranking as a scalable and interpretable preprocessing component for resource-constrained NIDS.
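
A benign-anchored score can be sketched as follows: for each feature, measure how far each attack class's mean lies from the benign mean, rather than from a global mean dominated by the majority class. The abstract does not give the exact formula, so scaling by the benign standard deviation and averaging over attack classes are assumptions, and the traffic values are made up for illustration.

```python
import math
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def benign_anchored_scores(rows, labels, benign_label="benign", eps=1e-8):
    """Score each feature by the attack-class mean deviation from the benign mean."""
    rows_by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        rows_by_class[label].append(row)
    scores = []
    for f in range(len(rows[0])):
        benign_values = [r[f] for r in rows_by_class[benign_label]]
        benign_mean = mean(benign_values)
        benign_std = math.sqrt(mean([(v - benign_mean) ** 2 for v in benign_values]))
        deviations = [
            abs(mean([r[f] for r in class_rows]) - benign_mean) / (benign_std + eps)
            for label, class_rows in rows_by_class.items()
            if label != benign_label
        ]
        scores.append(mean(deviations))  # each attack class weighs equally
    return scores

# Toy flows with two features; feature 1 separates attacks from benign traffic.
rows = [[1.0, 5.0], [3.0, 5.2], [2.0, 4.8], [2.0, 9.0], [2.2, 9.4], [1.8, 1.0]]
labels = ["benign", "benign", "benign", "dos", "dos", "scan"]
scores = benign_anchored_scores(rows, labels)
ranking = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
```

Averaging over attack classes instead of over samples keeps a rare attack type from being drowned out by frequent ones, which matters under the severe imbalance the abstract describes.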


2606.09932 2026-06-10 cs.LG cs.AI New submission

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

Runze Liu, Jiashun Liu, Xu Wan, Yuqian Fu, Ling Pan

AI summary: Addresses the limited RL improvement caused by excessive SFT by proposing Rejuvenation, which restores model plasticity through base-anchored model fusion and neuron reset, improving RL performance on math reasoning and agentic tasks.

Abstract

Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become a standard pipeline for Large Language Model (LLM) post-training. SFT is expected to provide a useful behavioral prior for RL to further enhance model capabilities. However, checkpoints with excessive SFT often show limited improvement during RL. We attribute this failure to the loss of model plasticity: the reduced ability of an SFT-initialized policy to be effectively reshaped by subsequent RL. To better understand this phenomenon, we conduct a detailed analysis from multiple perspectives, including parameter changes, output spaces, and RL optimization dynamics. Our results show that models from excessive SFT tend to produce over-confident token distributions and exhibit sharp parameter landscapes, which make them harder to optimize in the RL stage. To enable a more robust SFT-to-RL handoff, we propose Rejuvenation, a simple yet effective method that restores plasticity while preserving useful SFT-acquired priors. Rejuvenation leverages base-anchored model fusion to reduce excessive SFT-induced drift, together with targeted neuron reset to mitigate model rigidity. Experimental results on both math reasoning tasks and agentic tasks demonstrate that our approach consistently improves RL performance on over-trained SFT models, while also enhancing generalization to out-of-distribution tasks.


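2606.09929 2026-06-10 cs.LG cs.AI New submission

Between Amnesia and Chaos: A Memory Stability Expressivity Trilemma for Trainable Dissipative Oscillator Networks

Caleb Munigety

AI summary: Studies trainable nonlinear oscillator networks and finds that memory horizon, gradient stability, and dynamical expressivity are all governed by damping, creating a trilemma in which the three cannot be maximized simultaneously; experiments validate the theoretical bounds.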

Abstract

Physical reservoir computing harnesses nonlinear mechanical dynamics but, by convention, freezes the substrate and trains only a linear readout, presuming the substrate is not usefully trainable. We revisit that premise for networks of nonlinear oscillators whose mass, damping, and stiffness are learned end-to-end through a symplectic integrator. Our central result is a trilemma: memory horizon, gradient stability, and dynamical expressivity cannot be simultaneously maximized, because all three are governed by the damping. The backward gradient decays at a rate set by the damping, capping how far back credit can propagate, while forward sensitivities grow exponentially in the largest Lyapunov exponent, so usable gradients require damping above a stability floor. Since the Lyapunov exponent falls as damping rises while the memory ceiling falls as the horizon grows, stable training is confined to a band that contracts with horizon and closes at a critical point. We test every step on a twenty-oscillator network. A damping sweep finds the largest Lyapunov exponent monotone and crossing zero at a well-defined stability floor, confirming the theorem's key assumption. A compute-matched comparison of learned versus frozen substrate on delayed recall across nine horizons shows the learned substrate dominating at short horizons and the advantage closing and reversing near a horizon of eleven steps, the predicted signature of band closure; trained models settle near the stability floor, seeking the edge of chaos unprompted. The analytic ceiling overestimates the empirical crossover roughly fivefold, a gap between detectable and learnable gradient that we report rather than tune away. The contribution is a confirmed account of when training a physical substrate beats freezing it.

2606.09928 2026-06-10 cs.LG cs.AI New submission

Forward-Only Convolutional Neural Networks with Learnable Channel-Class Assignment

Mohammadnavid Ghader, Saeed Reza Kheradpisheh, Bahar Farahani, Mahmood Fazlali

AI summary: Proposes a learnable channel-class assignment mechanism with entropy and orthogonality regularization, plus a loss-aware layer contribution strategy based on validation performance. Applied to residual CNNs for forward-only learning, it achieves the best results among FF models on CIFAR-10/100 and Tiny-ImageNet and narrows the gap with backpropagation.

English abstract

The Forward-Forward (FF) algorithm offers a biologically inspired alternative to backpropagation by replacing gradient-based credit assignment with local, forward-only objectives. While recent extensions have adapted FF to convolutional neural networks (CNNs), existing formulations rely on static channel-class partitions and struggle to perform effectively in complex tasks. In this work, we introduce a learnable channel-class assignment mechanism that enables adaptive, data-driven specialization of convolutional channels, supported by entropy and orthogonality regularization to promote learning performance. We further propose a loss-aware layer contribution strategy that adaptively weights intermediate-layer predictions based on their validation performance, enhancing the effectiveness of forward-only inference. Integrated into residual CNNs, the proposed method achieves consistently superior performance across CIFAR-10, CIFAR-100, and Tiny-ImageNet compared to existing similar forward-only methods. Notably, it establishes new state-of-the-art performance among FF-based models, substantially narrowing the gap with backpropagation. These findings demonstrate that introducing learnable channel specialization and layer contribution weighting significantly enhances the representational capacity of forward-only learning in deep CNNs.

2606.09927 2026-06-10 cs.LG cs.AI cs.CL New submission

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

Patrik Czakó, Gábor Kertész, Sándor Szénási

AI summary: Addresses the difficulty of activation quantization in large language models with a quantile-robust scaling policy and gradient-based learning of channel scales, substantially reducing error under W4A4 quantization.

Comments: 6 pages, 8 figures, 3 tables. Accepted to IEEE INES 2026 conference proceedings

English abstract

Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantization errors. This paper investigates whether part of this degradation is caused by over-migration in scaling-based equivalent transformations. We introduce a quantile-robust scaling policy for SmoothRot-style transforms by replacing max-based activation statistics with high quantiles, and we complement it with constrained gradient-based optimization of channel scales. On LLaMA-3.2-1B under W4A4 quantization, quantile-only policy search improves selected-layer error by 11.1% over the SmoothRot baseline, joint (alpha, q) search improves it by 12%, and training reaches 18.5%. Replaying the best selected-layer policy on all decoder-block down-projection layers reduces the corresponding full-layer mean error from 97.51 to 78.08 (19.9%). The results show that robust migration control and lightweight scale learning provide consistent gains over max-based fixed policies while preserving the equivalent-transform framework.
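
The scaling idea can be illustrated with a minimal sketch, assuming a SmoothQuant-style per-channel scale `s_j = stat_j**alpha / wmax_j**(1 - alpha)` in which the activation statistic `stat_j` is a high quantile rather than the maximum. The formula, the function names, and the toy data are assumptions for illustration and are not taken from the paper; the rotation step and the gradient-based refinement of the scales are not shown.

```python
import math

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def channel_scales(X, W, alpha=0.5, q=0.99):
    """Per-input-channel scales (assumed formula); q=1.0 recovers the max-based statistic."""
    scales = []
    for j in range(len(W)):
        act_stat = quantile([abs(row[j]) for row in X], q)
        w_max = max(abs(w) for w in W[j])
        scales.append(max(act_stat, 1e-8) ** alpha / max(w_max, 1e-8) ** (1 - alpha))
    return scales

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def apply_transform(X, W, s):
    """Equivalent transform: (X / s) @ (diag(s) W) equals X @ W."""
    Xs = [[x / s[j] for j, x in enumerate(row)] for row in X]
    Ws = [[w * s[j] for w in W[j]] for j in range(len(W))]
    return Xs, Ws

# Toy data (tokens x channels, channels x outputs): channel 1 has one outlier.
X = [[0.5, 1.0], [-0.4, 1.2], [0.6, -0.9], [0.3, 40.0]]
W = [[0.2, -0.1], [0.05, 0.3]]
s_max = channel_scales(X, W, q=1.0)  # max-based statistic
s_q = channel_scales(X, W, q=0.5)    # quantile-based statistic
Xs, Ws = apply_transform(X, W, s_q)
```

With `q` below 1 the outlier channel receives a smaller scale than under the max-based policy, so less of the activation range is migrated into the weights, while the product `Xs @ Ws` still matches `X @ W` up to rounding.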

2606.09926 2026-06-10 cs.LG cs.AI New submission

Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

Hong Guo, Nianhui Guo, Christoph Meinel, Haojin Yang

AI summary: Proposes Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampling method that uses token-level entropy from the forward pass to localize MCMC moves to high-entropy regions, reaching best or tied-best accuracy on several benchmarks with up to a 12.6x speedup.

English abstract

Sampling from the sequence-level power distribution $p^\alpha$ elicits RL-level reasoning from base language models without any parameter updates, but the standard Metropolis-Hastings (MH), a Markov Chain Monte Carlo (MCMC) sampler, is both expensive and slow-mixing. We trace both to a structural mismatch: $p^\alpha$ mainly departs from $p$ at a sparse, spatially clustered set of high-entropy decision points, yet MH proposes resampling positions uniformly along the prefix -- wasting compute on near-degenerate conditionals while under-mixing precisely where modes diverge. We propose Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampler that re-derives its proposal from token-level entropy already in the forward pass. EGPS skips deterministic blocks, localizes each MCMC move to a high-entropy neighborhood, and applies Multiple-Try Metropolis at decision points -- making sampling cost scale with entropy mass rather than sequence length. On Qwen2.5-Math-7B, EGPS reaches best or tied-best accuracy on all three benchmarks (MATH500 $75.8\%$, HumanEval $62.2\%$, GPQA $42.4\%$) at up to a $12.6\times$ wall-clock speedup over the MH baseline.
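
As a rough illustration of choosing resampling positions by entropy instead of uniformly, the sketch below computes per-token Shannon entropy and turns it into a proposal over positions. The threshold, the function names, and the toy distributions are hypothetical; this is not the paper's proposal, and a valid MH step would also have to include these proposal probabilities in its acceptance ratio.

```python
import math
import random

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_proposal(per_token_probs, threshold=0.1):
    """Position weights: zero on near-deterministic tokens, else proportional to entropy.

    `threshold` is a hypothetical cutoff chosen for this example.
    """
    ents = [token_entropy(p) for p in per_token_probs]
    kept = [h if h >= threshold else 0.0 for h in ents]
    total = sum(kept)
    if total == 0.0:
        n = len(ents)
        return [1.0 / n] * n  # fall back to a uniform proposal
    return [h / total for h in kept]

def sample_position(weights, rng):
    """Draw one resampling position according to the proposal weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

dists = [
    [0.999, 0.001],            # near-deterministic token, skipped
    [0.5, 0.5],                # decision point, entropy ln 2
    [1.0, 0.0],                # deterministic token, skipped
    [0.25, 0.25, 0.25, 0.25],  # decision point, entropy ln 4
]
weights = entropy_proposal(dists)
pos = sample_position(weights, random.Random(0))
```

Because ln 4 is twice ln 2, the last position is proposed twice as often as the second one, and the two near-deterministic positions are never proposed.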

2606.09925 2026-06-10 cs.SD New submission

AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning

Xiangyu Zhao, Junyu Yan, Yaling Shen, Zimu Wang, Yiwen Jiang, Stephanie Fong, Qingyang Xu, Jiahe Liu, Dominic Dwyer, Zongyuan Ge

AI summary: Proposes the AudioProcessBench benchmark for evaluating how well audio-language models identify process errors in reasoning steps, covering three paradigms: step correctness, error-type detection, and chain-level aggregation.

English abstract

Large audio-language models (LALMs) increasingly use explicit reasoning traces for complex audio understanding, yet the evaluation of reasoning quality remains underexplored. Although process-level benchmarks for process reward models (PRMs) have advanced reasoning evaluation in text and multi-modal domains, comparable evaluation for audio reasoning remains limited. In this paper, we present AudioProcessBench, a comprehensive benchmark for step-level process error identification in audio reasoning. AudioProcessBench contains diverse reasoning traces generated by 6 audio and omni language models. Each trace is segmented into discrete reasoning steps and annotated with binary step correctness and fine-grained error types. Our benchmark evaluates models under three complementary paradigms: (1) step correctness identification, (2) error-type-conditioned detection for diagnosing audio-specific verifier capacities, and (3) chain-level aggregation, where verifiers select or aggregate among multiple reasoning traces for the same question. This design enables a systematic analysis of whether current models can detect process errors, whether their weaknesses differ across audio-specific error types, and whether process verification translates into improved answer selection. AudioProcessBench provides a testbed for future research on audio reasoning verifiers, process reward models, and reliable omni-modal reasoning.

2606.09924 2026-06-10 cs.LG cs.AI New submission

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

Kohga Tanaka, Hiroaki Nishi

AI summary: Proposes the Sigma-Branch framework, which restructures a pretrained dense network into a hierarchical binary tree of a shared backbone, hierarchical routers, and specialized leaves, initialized by activation clustering and then fine-tuned. Inference executes a single path, cutting active parameters by 58-60% on tasks such as CIFAR-100 / ResNet-50 while staying within 1.72 percentage points of the dense baseline.

English abstract

Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather than computation: the dense network cannot be retained on-chip, and every parameter must be loaded for every input. Existing model compression reduces this transfer only at the cost of permanent capacity loss. We propose Sigma-Branch (SigmaB), a framework that restructures a pretrained dense network into a hierarchical binary tree composed of a shared backbone, hierarchical routers, and specialized leaves. Pretrained weights are distributed across the tree via activation-based spherical k-means clustering, which jointly initializes router weights and per-branch channel allocations; soft-routing fine-tuning then aligns each leaf with its routed input subset. At inference, the resulting network executes only a single root-to-leaf path, reducing the active-parameter footprint while storing the complete dense parameter set in memory. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, SigmaB-Net reduces per-inference active parameters by 58-60% while remaining within 1.72 percentage points (pp) of the dense baseline Top-1. At comparable ImageNet-1K Top-1, the active-parameter reduction exceeds static structured pruning (FPGM, HRank) by 14-23 pp. The cross-modal evaluation, spanning 2D vision and 3D point-cloud backbones, substantiates a framework-level claim that decouples per-inference memory traffic from the total parameter count.

2606.09923 2026-06-10 cs.LG cs.AI New submission

Conformal Prediction for Neural Operators: Distribution-Free Uncertainty Quantification in Physics Simulation

Michael Chin

AI summary: Applies split conformal prediction to neural-operator physics simulation, giving distribution-free prediction intervals with finite-sample coverage guarantees, and produces adaptive-width intervals through a normalized conformal prediction scheme.

Comments: 13 pages, 7 tables, 7 figures. Full-scale experiments on NVIDIA V100

English abstract

Neural operators such as the Fourier Neural Operator (FNO) have emerged as powerful surrogates for solving partial differential equations (PDEs), achieving speedups of several orders of magnitude over traditional numerical solvers. However, deploying these models in safety-critical engineering applications -- such as thermal management of electronic components and battery systems -- requires not only accurate point predictions but also rigorous uncertainty guarantees. Existing uncertainty quantification (UQ) methods for neural operators, including Monte Carlo Dropout and Deep Ensembles, provide only relative uncertainty estimates without formal coverage guarantees. In this work, we propose the first application of split conformal prediction to neural operator-based physics simulation, providing distribution-free prediction intervals with finite-sample coverage guarantees. We further introduce a normalized conformal prediction scheme that leverages MC Dropout uncertainty to produce adaptive-width intervals, yielding tighter intervals in regions of low uncertainty and wider intervals where the model is less certain. Full-scale experiments (33.7M parameters, 800 training samples, 5 ensemble members, NVIDIA V100) on steady-state heat conduction benchmarks demonstrate that our method achieves 89.1% empirical coverage at the target level of alpha=0.1, while producing spatially adaptive prediction intervals that reflect the underlying physical uncertainty structure. We also provide an uncertainty decomposition framework that separates epistemic uncertainty (68% of total) from aleatoric uncertainty (32% of total), offering actionable guidance for data collection and model improvement. Our method is implemented in an open-source platform with REST API endpoints and interactive 3D visualization.
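
For readers unfamiliar with the technique, here is a minimal sketch of generic split conformal prediction for scalar regression, together with a normalized variant that divides residuals by a per-point uncertainty estimate. It is a textbook-style illustration, not the paper's code: the function names and toy calibration data are hypothetical, and `sigma` merely stands in for an MC Dropout standard deviation.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def split_conformal(y_cal, pred_cal, alpha):
    """Half-width of a constant-width interval from absolute residuals."""
    scores = [abs(y - p) for y, p in zip(y_cal, pred_cal)]
    return conformal_quantile(scores, alpha)

def normalized_conformal(y_cal, pred_cal, sigma_cal, alpha):
    """Same idea with residuals scaled by an uncertainty estimate sigma."""
    scores = [abs(y - p) / s for y, p, s in zip(y_cal, pred_cal, sigma_cal)]
    return conformal_quantile(scores, alpha)

def interval(pred, qhat, sigma=1.0):
    """Prediction interval; sigma > 1 widens it, sigma < 1 tightens it."""
    return pred - qhat * sigma, pred + qhat * sigma

# Toy calibration set: 19 points whose absolute residuals are 1, 2, ..., 19.
y_cal = [i * (-1) ** i for i in range(1, 20)]
pred_cal = [0.0] * 19
sigma_cal = [2.0] * 19  # stand-in for a per-point MC Dropout std

qhat = split_conformal(y_cal, pred_cal, alpha=0.1)
qhat_norm = normalized_conformal(y_cal, pred_cal, sigma_cal, alpha=0.1)
lo, hi = interval(5.0, qhat_norm, sigma=0.5)
```

With 19 calibration points and alpha = 0.1, the corrected quantile is the 18th smallest score. Under exchangeability this gives marginal coverage of at least 90%, which is the kind of guarantee the abstract refers to.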

2606.09919 2026-06-10 cs.LG cs.AI cs.MA cs.RO New submission

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

Michal P. Podolinsky, Neel P. Bhatt, Pranay Samineni, Rohan Siva, Christian Ellis, Ufuk Topcu

AI summary: Proposes the Co-GLANCE system, which distills a vision-language model for real-time occlusion segmentation and robot allocation and combines conformal prediction with selective abstention to provide uncertainty quantification with statistical guarantees, which in turn drives active perception. In real-world scenarios it improves occlusion segmentation and allocation accuracy by 25% and 36%, respectively, and reduces inference latency by 350x.

Comments: Code, videos, and dataset available at https://co-glance.github.io/

English abstract

Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency by 350x. We also release an air-ground dataset for future research. Code, videos, and dataset are available at https://co-glance.github.io/.

2606.09917 2026-06-10 cs.LG New submission

SPDM: Geometry-Modulated State Space Modeling with Manifold Constraints for Time Series Forecasting

Xingsheng Chen, Siu-Ming Yiu

AI summary: Proposes SPDM, a geometry-aware architecture that introduces symmetric positive definite (SPD) manifold constraints into state-space models, modulating the selective scan through a manifold trajectory path and a geometric gating mechanism to improve multivariate time series forecasting accuracy while keeping linear complexity.

English abstract

Multivariate time series forecasting requires capturing the continuously evolving correlation structure among interacting variables. Existing state-space models process time series by scanning tokenized temporal or spatial sequences, discarding the evolutionary geometric structure. We address this limitation by introducing manifold constraints into state-space modeling: treating the cross-variable correlation structure as a continuous trajectory on the symmetric positive definite (SPD) manifold, whose Riemannian geometric features, tangent space linearity, and Fréchet mean centrality act as a principled geometric regularizer that guides and stabilizes the selective scanning dynamics of SSMs. We propose SPDM, a geometry-aware SSM architecture that realizes this principle through two cooperating mechanisms: a manifold trajectory path that projects dynamically evolving covariance matrices from the SPD manifold to a Euclidean tangent space, and a geometric gating scheme that directly modulates the SSM's internal selective parameters based on geometric signals derived from the manifold trajectory. The parameterization preserves the linear-time complexity of the Mamba parallel scan while embedding rich structural constraints, so the architecture retains both prediction accuracy and computational efficiency. Extensive experiments on eleven real-world benchmark datasets establish state-of-the-art forecasting performance, and further studies confirm that geometrically constrained state-space dynamics are the dominant architectural factor behind its performance gains.
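
The projection from the SPD manifold to a Euclidean tangent space can be sketched with the matrix logarithm. The example below handles only the 2x2 case in closed form and takes the tangent space at the identity; both choices, and the function names, are assumptions made to keep the sketch short, and the paper's reference point (for example a Fréchet mean) may differ.

```python
import math

def spd_log_2x2(S):
    """Matrix logarithm of a 2x2 symmetric positive definite matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = S
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam1, lam2 = mean + radius, mean - radius  # eigenvalues, lam1 >= lam2
    if lam2 <= 0.0:
        raise ValueError("matrix is not positive definite")
    if radius < 1e-12:
        # S is a multiple of the identity.
        l = math.log(mean)
        return [[l, 0.0], [0.0, l]]
    # log(S) = c1 * S + c0 * I, obtained by interpolating log at both eigenvalues.
    c1 = (math.log(lam1) - math.log(lam2)) / (lam1 - lam2)
    c0 = math.log(lam1) - c1 * lam1
    return [[c1 * a + c0, c1 * b], [c1 * b, c1 * d + c0]]

def tangent_vector(S):
    """Flatten log(S); the sqrt(2) factor preserves the Frobenius norm."""
    L = spd_log_2x2(S)
    return [L[0][0], math.sqrt(2.0) * L[0][1], L[1][1]]

cov = [[2.0, 1.0], [1.0, 2.0]]  # toy covariance with eigenvalues 3 and 1
L = spd_log_2x2(cov)
v = tangent_vector(cov)
```

The resulting vector lives in an ordinary Euclidean space, so it can be fed to standard layers, which is the practical reason for working in the tangent space.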

2606.09916 2026-06-10 cs.LG cs.AI New submission

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

Junjie Li, Jiong Lou, Jie Li

AI summary: Targets the KV cache as the serving bottleneck in multi-turn LLM agents with IntentKV, which performs cross-turn intent-aware KV pruning using a session-level QueryMemory and a residual attention head, sharply reducing peak request tokens and KV reads while preserving accuracy.

English abstract

Multi-turn LLM agents fan short queries into long trajectories of tool calls, search results, and intermediate reasoning. Both KV memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the key-value (KV) cache, not parameter compute, the dominant serving bottleneck for long-horizon agents. We introduce IntentKV, learned KV pruning that keeps the base LLM frozen. IntentKV maintains a session-level QueryMemory of cross-turn intent, scores live history tokens with a memory-attention rule, and adds a zero-initialized residual head with cross-attention over current-query K-vectors. To stay composable with prefix caches, eviction is a slot-map redirection: dropped positions route to a sentinel dead slot while surviving K/V rows, RoPE phases, and slot identities stay in place. IntentKV matches the no-pruning full-cache baseline with almost no accuracy drop under tight KV budgets: at an 8k KV budget, mean peak request tokens drop 23.9% on Qwen3-8B and 30.7% on Qwen2.5-14B. On the 100 longest BCP queries that all methods complete on Qwen2.5-14B, IntentKV-8k further cuts worst-case peak request tokens from 92.3k to 20.5k, a 77.8% reduction, and worst-case raw KV reads from 411M to 31M, a 92.6% reduction.
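
The slot-map redirection can be pictured with a small sketch, assuming a position-to-slot table in which evicted positions are pointed at a sentinel slot while survivors keep their original slot. The constant `DEAD_SLOT`, the function `evict`, and the scores are hypothetical; how IntentKV actually scores tokens is not reproduced here.

```python
DEAD_SLOT = 0  # hypothetical sentinel slot that holds no live K/V rows

def evict(slot_map, scores, budget):
    """Keep the `budget` highest-scoring live positions; redirect the rest to DEAD_SLOT.

    Surviving positions keep their original slot, so cached rows are not moved.
    """
    live = [i for i, s in enumerate(slot_map) if s != DEAD_SLOT]
    keep = set(sorted(live, key=lambda i: scores[i], reverse=True)[:budget])
    return [s if i in keep else DEAD_SLOT for i, s in enumerate(slot_map)]

slot_map = [1, 2, 3, 4, 5]           # position -> physical slot
scores = [0.9, 0.1, 0.7, 0.2, 0.8]   # assumed per-token importance scores
pruned = evict(slot_map, scores, budget=3)
```

Because nothing is copied or compacted, positions 0, 2, and 4 still refer to slots 1, 3, and 5 after pruning, which is what lets the scheme coexist with a prefix cache that relies on stable slot identities.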

2606.09912 2026-06-10 cs.LG cs.AI New submission

Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining

Aaryan Nagpal, Debdeep Sanyal, Murari Mandal, Dhruv Kumar, Saurabh Deshpande

AI summary: Addresses the difficulty of choosing a synthetic data generator for time-series foundation model pretraining by simply mixing all generators with equal weight, which matches or beats the best single generator; combining the mixture with real data gives the strongest pretraining corpus.

Comments: Accepted at the ICML 2026 Workshop on Foundation Models for Structured Data (FMSD), Seoul, South Korea

English abstract

Choosing the wrong synthetic generator for time-series foundation model pretraining is costly: under identical training budgets, the best and worst generators produce up to a $2\times$ gap in forecasting error, yet the field has no principled way to make this choice. The problem is compounded by the fact that generator rankings are not stable across architectures: across 11 generator families evaluated on Chronos-T5-Mini and Moirai-Small trained from scratch, we find that which generators are useful depends on the model architecture. Rather than solving the generator selection problem, we sidestep it: a simple equal-weight mixture of all generators matches or beats the best individual generator for both architectures, and composing this mixture with real data yields the strongest pretraining corpora overall. Synthetic pretraining is therefore a corpus composition problem, not a generator selection problem, and composition choices should be validated per model family rather than assumed to transfer.
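
An equal-weight mixture is simple to express in code. The sketch below draws each series from a generator chosen uniformly at random; the three toy generators and all names are hypothetical stand-ins for the 11 generator families in the paper, which are not specified in the abstract.

```python
import math
import random

def sine_series(rng, n):
    """Sine wave with a random period (toy generator)."""
    period = rng.uniform(8.0, 32.0)
    return [math.sin(2.0 * math.pi * t / period) for t in range(n)]

def random_walk(rng, n):
    """Cumulative sum of Gaussian steps (toy generator)."""
    x, out = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def white_noise(rng, n):
    """Independent Gaussian noise (toy generator)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

GENERATORS = {"sine": sine_series, "random_walk": random_walk, "white_noise": white_noise}

def equal_weight_corpus(generators, n_series, length, seed=0):
    """Build a corpus in which every generator is equally likely for each series."""
    rng = random.Random(seed)
    names = sorted(generators)
    corpus = []
    for _ in range(n_series):
        name = rng.choice(names)  # uniform choice = equal mixture weights
        corpus.append((name, generators[name](rng, length)))
    return corpus

corpus = equal_weight_corpus(GENERATORS, n_series=300, length=64)
```

The point of the design is that no per-generator ranking is needed: adding a generator only means adding an entry to the dictionary.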

2606.09907 2026-06-10 cs.LG cs.AI New submission

LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts

Maxx Richard Rahman, Prakhar Kumar, Wolfgang Maass

AI summary: Proposes the LongMoE framework, which jointly addresses modality missingness and longitudinal dynamics in clinical multimodal learning through context-aware imputation, attentional tokenization, trajectory-aware encoding, and sparse MoE routing, with robustness validated on ADNI and other datasets.

English abstract

Multimodal clinical learning is increasingly important for integrating diverse patient data, including imaging, text, and personalised health records. However, it faces two fundamental challenges: (i) modality missingness, where arbitrary subsets of modalities are unavailable at a given patient visit, and (ii) longitudinal dynamics, where the diagnostic significance of an observation depends on the patient's evolving disease trajectory over time. Existing methods address these challenges in isolation: missing-modality frameworks treat each visit as an independent static snapshot and discard temporal context, while longitudinal models often assume complete modality availability and degrade under systematic modality incompleteness. We propose LongMoE (Longitudinal Mixture-of-Experts), a unified framework that jointly addresses both challenges. LongMoE combines a context-aware imputation module with an attentional tokenization module that captures frequency-domain temporal patterns across irregular visit sequences, a trajectory-aware encoder for modeling disease progression, and context-conditioned Sparse MoE routing for patient-specific expert selection. Experiments on ADNI, OASIS-3, and MIMIC-IV show that LongMoE improves robustness under missing or weak contemporaneous modalities and remains competitive in full-modality settings, establishing a strong foundation for longitudinally-aware multimodal clinical learning.
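
Sparse MoE routing in general can be sketched as top-k gating: only the k experts with the largest router logits run, and their outputs are combined with renormalized softmax weights. The code below is a generic illustration with scalar toy experts and fixed logits; in LongMoE the logits would be conditioned on patient context, and none of the names here come from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def top_k_route(logits, k):
    """Return {expert_index: weight} for the k largest logits, weights renormalized."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def moe_output(x, experts, logits, k=2):
    """Weighted sum over the selected experts only; the others are never evaluated."""
    routes = top_k_route(logits, k)
    return sum(w * experts[i](x) for i, w in routes.items())

# Toy scalar experts and router logits (hypothetical values).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
y = moe_output(3.0, experts, logits, k=2)
```

Here experts 1 and 3 tie for the largest logit, so each receives weight 0.5 and the output for `x = 3.0` is the average of 6.0 and 9.0.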

2606.09900 2026-06-10 cs.CL cs.AI cs.IR cs.LG New submission

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Liuyin Wang

AI summary: Proposes Engram, a bi-temporal memory engine whose hybrid read path answers from a retrieved slice of about 9.6k tokens, reaching 83.6% accuracy on LongMemEval_S, 10.4 points above the full history (79k tokens), with no errored questions.

Comments: 14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: https://github.com/ly-wang19/engram

English abstract

Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
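
The "invalidate, never delete" rule and the point-in-time filter can be illustrated with a small bi-temporal fact store. This is a minimal sketch under assumed semantics, with integer timestamps and a contradiction rule based on matching subject and predicate; the `Fact` fields and both functions are hypothetical and do not describe Engram's actual schema or API.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: int                       # when the fact became true in the world
    recorded_at: int                      # when the store learned it
    invalidated_at: Optional[int] = None  # when the store superseded it

def assert_fact(store, new):
    """Append `new`, invalidating (never deleting) live facts it contradicts."""
    out = []
    for f in store:
        clash = (f.subject, f.predicate) == (new.subject, new.predicate) and f.obj != new.obj
        if clash and f.invalidated_at is None:
            f = replace(f, invalidated_at=new.recorded_at)
        out.append(f)
    out.append(new)
    return out

def as_of(store, t):
    """Facts that were valid and believed by the store at time t."""
    return [
        f for f in store
        if f.valid_from <= t
        and f.recorded_at <= t
        and (f.invalidated_at is None or f.invalidated_at > t)
    ]

store = []
store = assert_fact(store, Fact("alice", "lives_in", "Paris", valid_from=1, recorded_at=1))
store = assert_fact(store, Fact("alice", "lives_in", "Berlin", valid_from=5, recorded_at=5))
past = as_of(store, 3)
now = as_of(store, 6)
```

Both facts remain in the store, so the superseded one keeps its provenance; only the `as_of` view changes, returning Paris for time 3 and Berlin for time 6.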

2606.09899 2026-06-10 cs.LG cs.AI New submission

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

Luyang Zhang, Jialu Wang

AI summary: Studies the unreliability of attribution patching (a first-order gradient approximation) in mechanistic interpretability, finds that the dominant error comes from non-linearities in the downstream network, and proposes a reliability score, error bounds, and a second-order HVP correction.

Comments: 30 pages, 12 figures

English abstract

A central goal of mechanistic interpretability is to identify which internal components causally drive a language model's behavior. Because these importance estimates serve as the evidence for identifying circuits, systematic errors can lead to the misidentification of the underlying mechanisms. While activation patching provides a gold-standard causal metric, its computational cost is prohibitive at scale. Practitioners instead rely on attribution patching, a gradient-based, first-order approximation whose reliability remains poorly understood. In this work, we characterize the source of this unreliability, demonstrating that the dominant error stems from the non-linearities in the downstream network rather than local curvature at the patched component. This insight yields three practical tools: (i) a reliability score to detect untrustworthy estimates, (ii) error bounds quantifying potential attribution mis-specifications, and (iii) a Hessian-vector-product (HVP) correction that eliminates the leading-order error with only one additional backward pass. In evaluations across five model families (124M-9B parameters) and both random-token and naturalistic (name-swap) perturbations, HVP is the only second-order correction feasible at larger scale, where standard baselines like Integrated Gradients become computationally prohibitive. In comparative experiments, a multi-step HVP variant matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, outperforming prior second-order baselines. These improvements lead to higher-fidelity circuit recovery on standard benchmarks and support a Screen-Flag-Fix workflow that targets computational effort only toward the components flagged as unreliable.
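
To see why a first-order estimate can mislead and how a Hessian-vector product helps, consider a toy downstream function `f(a) = tanh(W . a)`. The sketch compares the true patching effect with a first-order estimate and a second-order Taylor estimate. It is a generic Taylor-expansion illustration, not the paper's estimator: the toy function and names are assumptions, and the HVP is computed by finite differences of the gradient as a stand-in for the single extra backward pass an autodiff framework would use.

```python
import math

W = [1.0, 0.5]  # toy "downstream network": f(a) = tanh(W . a)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def f(a):
    return math.tanh(dot(W, a))

def grad_f(a):
    """Analytic gradient of f with respect to the activation a."""
    s = 1.0 - math.tanh(dot(W, a)) ** 2
    return [s * w for w in W]

def hvp(a, v, eps=1e-4):
    """Hessian-vector product via central differences of the gradient."""
    gp = grad_f([x + eps * y for x, y in zip(a, v)])
    gm = grad_f([x - eps * y for x, y in zip(a, v)])
    return [(p - m) / (2 * eps) for p, m in zip(gp, gm)]

a_base = [0.5, 0.0]    # activation on the original run
a_patch = [0.7, 0.2]   # activation patched in from another run
delta = [p - b for p, b in zip(a_patch, a_base)]

true_effect = f(a_patch) - f(a_base)                             # what activation patching measures
first_order = dot(grad_f(a_base), delta)                         # attribution-patching style estimate
second_order = first_order + 0.5 * dot(delta, hvp(a_base, delta))  # adds the curvature term
```

On this example the first-order estimate is off by about 0.034, while the second-order estimate is off by about 0.001, because the curvature of `tanh` accounts for most of the gap.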

2606.09898 2026-06-10 cs.LG cs.MA q-bio.QM New submission

TRAPS: Therapeutic Response Analysis via Pathway-informed Stratification

Sujoy Banik, Sayantan Chakraborty, Boishakhi Das Toma, Zainab Ghafoor, Ushashi Bhattacharjee, Koushik Howlader, Tirtho Roy

AI summary: Presents the first unified benchmark evaluating three pathway-guided deep learning models on joint prediction of cancer therapy response and survival, finding that each model has strengths and weaknesses on different tasks.

English abstract

Cancer treatment planning requires decisions across multiple clinical dimensions at once. Clinicians must determine whether a patient should receive targeted molecular therapy, radiation therapy, and whether they are likely to survive beyond six months. Existing pathway-informed deep learning models have been developed and tested in isolation, making fair comparison across architectures impossible. We present the first unified benchmark for pathway-guided therapy response modeling, evaluating three biologically informed architectures, BINN, GraphPath, and PATH, across five cancer cohorts drawn from The Cancer Genome Atlas, representing 2,622 patients encoded using Reactome pathway activity scores. Each model is trained jointly on all three clinical outcomes under identical data and evaluation conditions, the first study to treat pathway-structured deep learning as a combined therapy and survival prediction problem. Our results show that no single architecture wins across all tasks: PATH performs best for targeted molecular therapy prediction overall, BINN is most reliable for survival prediction, and no model produces useful predictions for radiation therapy, as the key drivers of that decision are clinical variables not captured in gene expression data. Most strikingly, GraphPath achieves an AUROC of 0.92 on prostate targeted molecular therapy prediction, the highest score in the entire benchmark, demonstrating that lateral co-regulation structure produces exceptional discriminative power when matched to a cohort with a narrow targetable driver programme, even under conditions of extreme class imbalance at only 11% positive prevalence.

2606.09894 2026-06-10 cs.LG cs.CL 新提交

A Navigable Manifold of Hypothesized Consciousness-Spectrum States in Language Model Representations

语言模型表示中假设的意识谱状态的可导航流形

Sophie Zhao

AI总结：研究语言模型嵌入空间中与意识谱相关的几何结构，发现嵌入形成可导航流形：较高和较低层级区域稳定，中间区域为过渡走廊，且可导航性是表示空间的内在属性。

AI中文摘要

在沉思传统、哲学和心理学的论述中，人类意识常被描述为沿着一条相似的谱系展开：从反应性、自我聚焦的模式，到更具整合性和连贯性的模式。理解语言模型是否在表示空间中编码了这种结构化、人类可解释的意识谱系，对于模型引导、评估和对齐具有重要意义。在这项工作中，我们研究了 Transformer 嵌入空间中沿该谱系的模式的几何结构与动态特性。我们表明，嵌入呈现出与该谱系对齐的全局有组织的几何结构：与相似状态相关的句子聚集成局部连贯的区域，形成结构化流形。其中，较高层级和较低层级的区域表现出类似凸性的稳定性，而中间区域则形成过渡走廊。在动态方面，效用引导的贪婪轨迹和纯几何的贪婪轨迹都一致地从较低层级区域出发，经由中间层级到达较高层级区域，表明可导航性是表示空间的内在属性，受全局方向信号引导，但并不由其决定。这些结果表明，嵌入空间编码了与一种假设的意识谱分类法相对齐的结构化、可导航的几何结构；该分类法广泛受到沉思传统、哲学和现代心理学中对人类意识反复出现的结构性描述的启发。这为分析和引导模型行为提供了表示层面的视角。

英文摘要

Across contemplative, philosophical, and psychological accounts, human consciousness is often described along a similar spectrum, ranging from reactive and self-focused patterns to more integrative and coherent ones. Understanding whether language models encode such a structured, human-interpretable consciousness spectrum in representation space is important for model guidance, evaluation and alignment. In this work, we study the geometric structure and dynamics of patterns along this spectrum in transformer embedding spaces. We show that embeddings exhibit a globally organized geometry aligned with this spectrum: sentences associated with similar states cluster into locally coherent regions, forming a structured manifold. In particular, higher-level and lower-level regions exhibit convexity-like stability, while intermediate regions form a transition corridor. Dynamically, both utility-guided and geometry-only greedy trajectories consistently traverse from lower- to higher-level regions, passing through intermediate tiers, indicating that navigability is an intrinsic property of the representation space, guided but not dictated by a global directional signal. These results suggest that embedding spaces encode structured and navigable geometry aligned with a hypothesized consciousness-spectrum taxonomy, broadly inspired by recurring structural descriptions of human consciousness across contemplative traditions, philosophy, and modern psychology, providing a representation-level perspective for analyzing and guiding model behavior.
