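arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.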

2606.09967 2026-06-10 cs.CV New submission

ABot-Earth 0.5: Generative 3D Earth Model

Ming Qian, Tianjian Ouyang, Mingchao Sun, Zijian Wang, Jincheng Xiong, Jiarong Han, Yongchang Zhang, Jiawei Zhang, Xu Wang, Yu Liu, Luyang Tang, Fei Yu, Zengye Ge, Mengmeng Du, Yuan Liu, Nianfei Fan, Song Wang, Yingliang Peng, Chunxue Jia, Yang Liu, Shiying Zeng, Haozhe Shi, Junnan Lai, Hongyu Pan, Zheng Wu, Ning Guo, Mu Xu, Hang Zhang

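AI summary: Proposes the ABot-Earth 0.5 framework, which uses 3D Gaussian Splatting to generate large-scale seamless 3D environments from satellite imagery, with synthesis time under 10 minutes per square kilometer, support for real-time interactive visualization, and lower 3D reconstruction cost.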

Comments: From Amap-cvlab, Alibaba. Official page: https://abot-earth.amap.com/

Abstract

We present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from ubiquitous, geospatially referenced satellite imagery. To achieve this, we propose a novel generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation. The model is trained on a diverse corpus of existing real-world urban reconstructions, learning to generate realistic geometry and textures. At inference, it synthesizes novel 3D scenes conditioned solely on satellite imagery at a scalable rate of under 10 minutes per square kilometer, while demonstrating exceptional realism. The framework is designed for accessibility, with integrated hierarchical level-of-detail (LOD) structures that permit real-time, interactive visualization on web-based map engines. This high-fidelity simulation sandbox effectively mitigates the sim-to-real domain gap, enabling critical downstream Embodied AI applications like closed-loop UAV navigation. By providing an ultra-low-cost and high-efficiency solution, ABot-Earth 0.5 significantly lowers the technical and financial barriers to large-scale 3D reconstruction and empowers the future of global digital earth visualization.

2606.09966 2026-06-10 cs.SD New submission

RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification

Shakhrul Iman Siam, Tiantian Feng, Jiankun Zhang, Shrikanth Narayanan, Mi Zhang

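AI summary: Proposes RespiraMFM, a multimodal foundation model that integrates respiratory sounds with clinical information through a contrastive audio-text alignment strategy, improving AUROC by 9.15% on supervised tasks and 20.98% on zero-shot tasks.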

Comments: ACL 2026 Main Conference

Abstract

Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and corresponding textual clinical information. We evaluate RespiraMFM across five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings, achieving a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks over existing baselines. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.
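
The abstract does not spell out the alignment objective. As a rough illustration of what a contrastive audio-text alignment can look like, the sketch below computes a symmetric InfoNCE (CLIP-style) loss over a batch of paired embeddings. The loss form, the `temperature` value, and the function name `contrastive_alignment_loss` are assumptions made for illustration, not details taken from the paper.

```python
import math

def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def _mean_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_alignment_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; audio_embs[i] and text_embs[i] form a pair.

    Assumed objective, chosen only to illustrate the idea of pulling
    matched audio/text pairs together and pushing mismatched pairs apart.
    """
    audio = [_normalize(v) for v in audio_embs]
    text = [_normalize(v) for v in text_embs]
    # Cosine similarity of every audio clip with every text description.
    logits = [[sum(a * t for a, t in zip(av, tv)) / temperature for tv in text]
              for av in audio]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))
```

With this form, a batch whose pairs already line up gives a loss near zero, while a batch with shuffled pairs gives a large loss.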

2606.09962 2026-06-10 cs.LG cs.AI cs.SD New submission

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

Vadim Popov, Wenju Gu, Tasnima Sadekova, Georgii Aparin, Assel Yermekova

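AI summary: Studies the latent-space structure of discrete tokens in continuous diffusion models, shows through theoretical analysis and experiments that the FSQ tokenization scheme is best suited for continuous diffusion on categorical data, and validates on a text-to-speech task that it outperforms an LLM-based approach.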

Abstract

Continuous diffusion for categorical data is a framework in the diffusion family that aims to generate discrete data. Scientific interest in such models has been growing steadily as researchers pursue the challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study the properties of the structure of the latent space corresponding to discrete tokens, expressed in terms of the Kullback-Leibler divergence on diffusion path measures and the accuracy of correct token prediction by the optimally trained diffusion model. We find that the FSQ tokenization scheme has a latent space structure whose properties make it best suited for continuous diffusion for categorical data, as verified through rigorous theoretical analysis and numerical experiments. To validate our findings in a real-life scenario, we train several text-to-speech diffusion models with speech tokens as intermediate acoustic features and show that the one based on FSQ tokens indeed performs best; moreover, it outperforms its strong LLM-based counterpart while being significantly smaller and faster.
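
For readers unfamiliar with FSQ, here is a minimal sketch of finite scalar quantization as it is commonly described: each latent coordinate is bounded with tanh and rounded to one of a few integer levels, so the implicit codebook is a regular grid. The level counts and the names `fsq_quantize` and `fsq_index` are illustrative assumptions, not taken from the paper, and odd level counts are assumed so the grid is symmetric around zero.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each coordinate of z to one of levels[i] integer values.

    Assumes every entry of `levels` is odd, e.g. 5 levels -> {-2, ..., 2}.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(int(round(bounded)))
    return codes

def fsq_index(codes, levels):
    """Map per-dimension codes to a single token id (mixed-radix number)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)
    return index
```

With three dimensions of five levels each, the grid holds 5 * 5 * 5 = 125 distinct tokens, and neighbouring token ids in one dimension correspond to neighbouring points in latent space.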

2606.09961 2026-06-10 cs.LG cs.AI New submission

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Yu Han, Kailing Li, Yang Jiao, Yulin Dai, Yuqian Fu, Linhai Zhuo, Tianwen Qian

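AI summary: Proposes the 3SPO algorithm, which performs post-step policy optimization through dynamic state score supervision to address sparse rewards and credit assignment in multi-turn agent tasks, outperforming GRPO by 22.6% on ALFWorld and 15.6 points on WebShop.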

Abstract

Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse and delayed, and credit assignment across individual steps is critical. In this work, we propose State-Score-Supervised Policy Optimization (3SPO), a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout, and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by +22.6% on ALFWorld and +15.6 points on WebShop, while using comparable resources to achieve 2.4x more state exploration and 1.8x faster convergence. Code is available at https://github.com/genalyu/3SPO.
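
The abstract says state scores are computed from historical success rates but gives no formula. The sketch below is one plausible reading, assuming a Laplace-smoothed success frequency per state and a per-step credit equal to the change in score. The class `StateScoreTable`, its methods, and the smoothing are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict

class StateScoreTable:
    """Tracks, per state, how often episodes passing through it succeeded."""

    def __init__(self):
        self.visits = defaultdict(int)
        self.successes = defaultdict(int)

    def update(self, episode_states, succeeded):
        """Record one finished episode; each distinct state counts once."""
        for state in set(episode_states):
            self.visits[state] += 1
            if succeeded:
                self.successes[state] += 1

    def score(self, state):
        """Laplace-smoothed success rate; unseen states score 0.5 (assumed prior)."""
        return (self.successes[state] + 1) / (self.visits[state] + 2)

    def step_credit(self, state, next_state):
        """Hypothetical per-step credit: change in score caused by the step."""
        return self.score(next_state) - self.score(state)
```

A step that moves the agent into a state with a better track record receives positive credit under this reading, without a learned value function.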

2606.09960 2026-06-10 cs.LG cs.AI New submission

HydraCIL: Decoupled Class-Incremental Learning through Prototype-Guided Multi-Head Classifiers

Daniel Vila-Cruz, Laura Morán-Fernández, Verónica Bolón-Canedo

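AI summary: Proposes HydraCIL, which freezes the backbone, decouples feature extraction from learning, and selects task-specific classifier heads by prototype similarity, enabling efficient class-incremental learning in resource-constrained environments; it matches or outperforms existing methods while substantially reducing training time and carbon emissions.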

Comments: Accepted for publication at the International Joint Conference on Neural Networks (IJCNN 2026)

Abstract

We present HydraCIL, a decoupled continual learning model based on prototype-guided multi-head classifiers, targeting sustainable deployment in embedded and resource-constrained environments. While most Class-Incremental Learning (CIL) methods rely on powerful hardware and long retraining cycles, real-world systems, such as robots or edge AI devices, must adapt quickly with limited resources. HydraCIL addresses this gap by freezing the backbone and decoupling feature extraction from learning. For each task, features are extracted once and a lightweight, task-specific classifier head is created, avoiding costly backbone retraining. At inference, HydraCIL selects the appropriate head via similarity with prototypes. Experiments on CIFAR-100, ImageNet-100, CoRe50, and Flowers102 datasets show that HydraCIL matches or outperforms state-of-the-art CIL methods while significantly reducing training time and carbon footprint, making it a practical solution for continual learning in real-world and embedded settings, where energy efficiency and rapid adaptation are critical.
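
The head-selection step can be pictured roughly as follows. The sketch assumes prototypes are mean feature vectors and that similarity is cosine similarity; neither choice is confirmed by the abstract, and the names `mean_vector`, `cosine`, and `select_head` are illustrative.

```python
import math

def mean_vector(vectors):
    """Prototype as the element-wise mean of a class's feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_head(feature, prototypes_by_task):
    """Return the task id whose closest prototype is most similar to `feature`."""
    best_task, best_similarity = None, -math.inf
    for task_id, prototypes in prototypes_by_task.items():
        similarity = max(cosine(feature, p) for p in prototypes)
        if similarity > best_similarity:
            best_task, best_similarity = task_id, similarity
    return best_task
```

Because the backbone is frozen, features and prototypes from earlier tasks stay comparable with those from later tasks, which is what makes a lookup like this workable.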

2606.09959 2026-06-10 cs.LG cs.AI New submission

Temporal Context Conditioning for Seasonality-Aware Precipitation Nowcasting of High-Intensity Rainfall

Gijs van Nieuwkoop, Siamak Mehrkanoon

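AI summary: Proposes the TA-SmaAt-UNet model, which augments radar-based precipitation nowcasting with temporal conditioning layers (cyclical encodings of the diurnal and seasonal cycles), markedly improving prediction of high-intensity rainfall events.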

Comments: 9 pages, 6 figures

Abstract

Precipitation nowcasting is increasingly being approached with deep learning models that learn directly from recent radar observations. Although such models can efficiently capture short-term precipitation motion, they often lack broader contextual information about the meteorological conditions under which rainfall develops. This paper investigates whether lightweight temporal context can improve radar-based nowcasting, particularly for high-intensity rainfall. We propose the Time-Aware Small-Attention U-Net (TA-SmaAt-UNet), which extends the core SmaAt-UNet model with temporal conditioning layers that use cyclical encodings of time-of-day and time-of-year to modulate intermediate feature representations. Experiments on KNMI radar precipitation data show that temporal conditioning is most beneficial for rare, high-intensity precipitation events, while also improving the representation of seasonal variability and predicted rainfall-intensity distributions. A layer conductance analysis further indicates that the added temporal conditioning layers are actively used by the model despite their small parameter cost. These findings suggest that simple, physically motivated temporal context can improve the realism and reliability of deep learning-based precipitation nowcasts. The implementation of our models and training setup is available on GitHub at https://github.com/gijsvn/TA-SmaAt-UNet.
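
A cyclical encoding of the kind the abstract mentions is usually a sine/cosine pair, so that 23:59 lands next to 00:00 and late December next to early January. The sketch below shows that encoding only; the function name `cyclical_time_features`, the 365.25-day year, and the feature order are assumptions, and the way the model's conditioning layers consume these values is not shown.

```python
import math
from datetime import datetime

def cyclical_time_features(timestamp):
    """Encode time-of-day and time-of-year as points on two unit circles."""
    seconds = timestamp.hour * 3600 + timestamp.minute * 60 + timestamp.second
    day_angle = 2 * math.pi * seconds / 86400
    # tm_yday runs from 1; 365.25 days per year is an approximation.
    year_angle = 2 * math.pi * (timestamp.timetuple().tm_yday - 1) / 365.25
    return (math.sin(day_angle), math.cos(day_angle),
            math.sin(year_angle), math.cos(year_angle))
```

Feeding the raw hour (0 to 23) instead would place midnight and 23:00 at opposite ends of the input range even though they are adjacent in time.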

2606.09958 2026-06-10 cs.RO cs.AI New submission

Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment

Ming Cheng, Hao Chen, Ziyi Yang, Ziluowen Luo, Senzhang Wang

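AI summary: Proposes Uncertainty-Aware Motion Planning (UAMP), which quantifies uncertainty in human intent and introduces Uncertainty-Calibrated Value Learning to improve the safety and comfort of autonomous driving in mixed traffic.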

Abstract

In mixed-traffic environments where autonomous and human-driven vehicles may co-exist, motion planning for autonomous vehicles requires anticipating the future behaviors of surrounding human drivers. Existing reinforcement learning-based methods typically incorporate predicted human intents directly into the observation to enable proactive planning. However, human intent is inherently uncertain due to behavioral diversity, perception noise, and partial observability. Treating predicted intents as deterministic states can result in unsafe decisions for autonomous vehicles. To address this problem, we propose Uncertainty-Aware Motion Planning (UAMP), which incorporates uncertainty in human intent prediction into AV decision-making. Specifically, UAMP first introduces a proximity-aware uncertainty estimator to quantify the interaction-conditioned intent uncertainty and constructs an uncertainty-guided joint intent distribution over surrounding human-driven vehicles. Within this uncertainty set, UAMP further introduces Uncertainty-Calibrated Value Learning (UCVL) to correct value function learning biases arising from directly incorporating uncertain human intent predictions into the observation. Extensive experiments in various mixed-traffic scenarios show that, compared with existing approaches, UAMP significantly improves safety and driving comfort while maintaining traffic efficiency. The code is released at https://anonymous.4open.science/r/UAMP-5638.

2606.09957 2026-06-10 cs.SE cs.LG New submission

Data-aware Static Analysis: Improving Detection of Semantic Faults in Machine Learning Code Using Data Characteristics

Willem Meijer, Kristian Sandahl, Dániel Varró

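AI summary: Proposes a data-aware static analysis approach that combines data and control flow analysis with API contracts to detect semantic faults in machine learning code while the code is being written rather than after training, such as mistakenly training a scale-sensitive model on unscaled data.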

DOI: 10.1145/3786582.3786805
Comments: 6 pages, 3 figures, 2 listings, 1 table; To be published in "2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE-NIER '26)"

Abstract

Semantic faults specific to the use of machine learning models are a common problem for machine learning developers, causing suboptimal predictions, high computational cost, or incorrect outputs. For example, one may erroneously use unscaled data to train a scale-sensitive model. Machine learning developers detect these faults after training their models and manually analyzing the results, making it an inefficient process. We propose a novel data-aware static analysis approach to detect semantic faults in machine learning code, allowing developers to reveal these bugs while writing code instead of after training the model. Our approach uses combined data and control flow analysis, and API contracts, enabling data-aware reasoning about machine learning code at a high level of abstraction. We highlight the potential of our solution by analyzing a sample of real-world machine learning notebooks, finding that we can detect faults that require a data-aware approach.
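
To make the unscaled-data example concrete, the toy checker below flags `fit` calls on a scale-sensitive estimator whose first argument never passed through a scaler. It is a purely syntactic, straight-line check built on Python's `ast` module and is far simpler than the combined data and control flow analysis the abstract describes; the two class lists and the function names are illustrative assumptions.

```python
import ast

SCALE_SENSITIVE = {"KNeighborsClassifier", "SVC"}   # illustrative, not exhaustive
SCALERS = {"StandardScaler", "MinMaxScaler"}        # illustrative, not exhaustive

def constructor(node):
    """Name of a directly called class, e.g. 'SVC' for SVC(), else None."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return node.func.id
    return None

def method_call(node):
    """(receiver, method) for calls like model.fit(...), else None."""
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)):
        return node.func.value.id, node.func.attr
    return None

def find_unscaled_fits(source):
    """Line numbers where a scale-sensitive model is fit on unscaled data.

    Only top-level statements are inspected, in source order.
    """
    kinds, scaled, findings = {}, set(), []
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            target = stmt.targets[0].id
            made, called = constructor(stmt.value), method_call(stmt.value)
            if made in SCALERS or made in SCALE_SENSITIVE:
                kinds[target] = made          # remember what the variable holds
            elif (called and called[1] in ("fit_transform", "transform")
                    and kinds.get(called[0]) in SCALERS):
                scaled.add(target)            # output of a scaler is scaled data
        elif isinstance(stmt, ast.Expr):
            called = method_call(stmt.value)
            if called and called[1] == "fit" and kinds.get(called[0]) in SCALE_SENSITIVE:
                args = stmt.value.args
                if args and isinstance(args[0], ast.Name) and args[0].id not in scaled:
                    findings.append(stmt.lineno)
    return findings
```

The source is only parsed, never executed, so the check can run while the notebook is still being written.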

2606.09956 2026-06-10 cs.SE cs.LG New submission

Multi-task LLMs for Bug Classification: Efficient Inference with Auxiliary Decoding Heads

Nikolai Rozanov

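AI summary: Proposes a lightweight multi-task LLM (MLC) that uses a token alignment algorithm and an optimized training strategy to perform line-level bug localization with full-file context, matching agentic approaches while cutting inference latency by orders of magnitude.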

Comments: 8 pages, 6 pages appendix

Abstract

The rapid adoption of LLM-powered code generation has dramatically accelerated software development, yet effective verification methods remain severely underdeveloped. Existing bug localization techniques are prohibitively expensive, requiring minutes of agentic reasoning and thousands of generated tokens per file, and/or operate at coarse function-level granularity unsuitable for precise debugging. Meanwhile, lighter-weight works that focus on line-level granularity are often limited in performance or context size. We introduce a novel line-level bug localization approach that addresses these limitations through three key contributions: (1) a token alignment algorithm that overcomes fundamental tokenization challenges in previous work, (2) a lightweight multi-task LLM for bug localization (MLC) enabling efficient line-level bug classification, and (3) an optimized training recipe for multi-line prediction. Our method achieves state-of-the-art performance among similar setups on line-level bug localization with full-file context. At the same time, we reach performance comparable to agentic approaches on the Defects4J and PypiBugs benchmarks while reducing inference latency by orders of magnitude, requiring only a single generated token per file. We further demonstrate strong generalization by introducing and evaluating on a small out-of-domain evaluation dataset in Python. We will open-source our code, models, and datasets upon acceptance.
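
Line-level prediction needs some mapping from tokenizer output back to source lines. The abstract does not describe the paper's alignment algorithm, so the sketch below shows only the simplest version of the problem: it assumes the tokenizer reports character offsets and assigns each token to the line containing its first character. The function names are hypothetical.

```python
from bisect import bisect_right

def line_starts(text):
    """Character offset at which each line of `text` begins."""
    starts = [0]
    for offset, char in enumerate(text):
        if char == "\n":
            starts.append(offset + 1)
    return starts

def tokens_to_lines(text, token_spans):
    """0-based line index for each (start, end) token span.

    A token that straddles a newline is attributed to the line where it
    starts, which is a simplification.
    """
    starts = line_starts(text)
    return [bisect_right(starts, start) - 1 for start, _ in token_spans]
```

Tokens that span a line break, such as a newline merged with the next line's indentation, are where a naive mapping like this becomes ambiguous.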

2606.09954 2026-06-10 cs.LG cs.AI New submission

Does Normalization Choice Matter for Causal Large Time-Series Models?

Samy-Melwan Vilhes, Gilles Gasso, Mokhtar Z Alaya

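AI summary: Studies how different normalization strategies affect training convergence and forecasting performance in causal large time-series models, finding that the choice of normalization significantly affects model quality.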

Journal ref: ICLR 2026 Workshop: Time Series in the Age of Large Models, Apr 2026, Rio De Janeiro, Brazil

Abstract

Large models for time-series forecasting have emerged as a promising paradigm for training models on heterogeneous collections of signals. These models typically rely on causal autoregressive architectures, where each observation is sequentially predicted from the past. In practice, real-world time series exhibit non-stationarities, which significantly influence predictive performance. To mitigate this, normalization is commonly employed. However, in efficient causal settings, it might induce information leakage from future observations during training. Recent alternatives, including causal normalization and statistics computed from initial observations, have been proposed to address this issue, but their practical implications remain insufficiently understood. In this work, we evaluate normalization strategies for transformer-based large time-series models trained with patching and an efficient causal strategy. We show that the choice of normalization significantly influences both training convergence and forecasting performance.
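
The leakage issue can be seen in a few lines. The sketch below contrasts whole-window normalization with one possible causal variant that uses running statistics of the observations seen so far. The running-statistics form (a Welford update) and both function names are assumptions for illustration, not the specific strategies evaluated in the paper.

```python
import math

def global_normalize(series, eps=1e-8):
    """Normalize with mean/std of the whole window (uses future values)."""
    count = len(series)
    mean = sum(series) / count
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / count)
    return [(x - mean) / (std + eps) for x in series]

def causal_normalize(series, eps=1e-8):
    """Normalize x_t with the mean/std of x_0..x_t only (Welford update)."""
    normalized, mean, squared_dev = [], 0.0, 0.0
    for count, x in enumerate(series, start=1):
        delta = x - mean
        mean += delta / count
        squared_dev += delta * (x - mean)
        std = math.sqrt(squared_dev / count)
        normalized.append((x - mean) / (std + eps))
    return normalized
```

Changing only the last observation alters every value produced by `global_normalize`, which is the leakage; the earlier outputs of `causal_normalize` stay the same.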

2606.09951 2026-06-10 cs.LG New submission

Hasse Diagrams for Attention: A Partial Order Framework for Designing Transformer Masks

Chentao Li, Han Guo

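AI summary: Presents a theoretical framework proving that the information flow of a multi-layer Transformer converges to a Hasse diagram, recasts the design of parallel training tasks as finding a minimal common supergraph of Hasse diagrams, and derives two new attention masks from it.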

Comments: 21 pages, 9 figures. Theoretical framework for attention mask design; no experiments included

Abstract

During the training of large Transformer models, attention masks regulate the scope and direction of information flow across a sequence. Numerous mask variants exist, and operators such as FlexAttention already support arbitrary attention masks. Nevertheless, a systematic formal analysis of the information-flow structure induced by arbitrary masks has been missing. This paper develops a complete theoretical framework. We prove that, with sufficient depth, the information flow of a multi-layer Transformer converges to a Hasse diagram -- a directed acyclic graph representing a partial order. Building on this, we recast the design of parallel training tasks as the problem of finding a minimal common supergraph of Hasse diagrams, and we establish a criterion for the minimal common supergraph. This yields a constructive method to derive attention masks directly from a family of tasks. Applying the framework, we design two novel masks: a block-generation attention mask that ensures training-inference consistency (Block Two-Stream Attention), and a fully supervised bidirectional attention mask (Butterfly Attention). These results demonstrate the framework's capacity to discover new structures.
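
One way to picture the connection between masks and Hasse diagrams: treat the mask as a relation, take its transitive closure (what can reach what across many layers), then keep only the cover relations, which is the standard definition of a Hasse diagram for a finite partial order. The sketch below does exactly that and assumes the mask is acyclic apart from the diagonal; it is an illustration, not the paper's construction, and `hasse_edges` is a made-up name.

```python
def hasse_edges(mask):
    """Cover relations of the partial order induced by an attention mask.

    mask[i][j] is truthy when position i may attend to position j.
    Returns edges (j, i), meaning j sits directly below i.
    Assumes the off-diagonal relation is acyclic.
    """
    size = len(mask)
    reach = [[bool(mask[i][j]) and i != j for j in range(size)]
             for i in range(size)]
    for k in range(size):                    # Warshall transitive closure
        for i in range(size):
            if reach[i][k]:
                for j in range(size):
                    if reach[k][j]:
                        reach[i][j] = True
    edges = []
    for i in range(size):
        for j in range(size):
            # Keep j -> i only if no intermediate k lies between them.
            if reach[i][j] and not any(reach[i][k] and reach[k][j]
                                       for k in range(size)):
                edges.append((j, i))
    return edges
```

For a standard causal mask this collapses to a simple chain 0 -> 1 -> 2 -> ..., even though the mask itself is a full lower triangle.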

2606.09949 2026-06-10 cs.LG cs.AI New submission

Learning Where to Simulate: Generative Active Sampling for Online PDE Surrogate Training

Pierre Cesar, Sofya Dymchenko, Abhishek Purandare, Bruno Raffin

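AI summary: Proposes Online Generative Active Sampling (OGAS), which uses a diffusion model to learn the relationship between configuration parameters and surrogate performance and actively samples high-difficulty regions, substantially reducing tail errors and improving the worst-case reliability of the surrogate.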

Abstract

Data-driven PDE surrogates are trained with data produced by numerical PDE solvers. However, when the surrogate's goal is to generalize across a wide range of PDE configurations (e.g., initial conditions and physical coefficients), generating a representative training set is non-trivial. Uniform sampling of configuration parameters often under-represents trajectories exhibiting challenging dynamics, leading to high prediction errors and large error variance in the trained surrogate. Online training, where data generation and surrogate training are coupled, offers a natural advantage by allowing solver parameters to be steered on-the-fly. To efficiently exploit this capability, we introduce Online Generative Active Sampling (OGAS), an active learning method that reactively learns the relationship between configuration parameters and surrogate performance to control the sampling distribution. OGAS trains a fast diffusion model in parallel to the surrogate to act as a conditional sampler, mapping a surrogate-derived difficulty signal (e.g., loss or uncertainty) to configuration parameters. By actively drawing target signals from a prior biased toward high difficulty, OGAS continuously steers data generation toward challenging regimes without delaying the training workflow. We evaluate OGAS across 2D PDEs with distinct challenging dynamics (Kuramoto-Sivashinsky, Navier-Stokes, Gray-Scott) and up to 308 parameters, using multiple surrogate architectures. Across all settings, OGAS consistently improves tail statistics, yielding substantial reductions in errors above the 99th percentile and overall error dispersion compared to uniform sampling. While prioritizing challenging trajectories introduces a trade-off with average error, OGAS effectively ensures worst-case reliability of trained surrogates with negligible wall-time overhead.

2606.09946 2026-06-10 cs.AR cs.CV New submission

SPARX: Secure and Privacy-Aware Approximate CNN Acceleration with Edge RISC-V SoC

Sonu Kumar, Akash Sankhe, Mukul Lokhande, Santosh Kumar Vishvakarma

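AI summary: Proposes the SPARX framework, which integrates a RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a differential-noise-based privacy engine, and an authentication mechanism, and uses an approximation-aware decision framework to select the best multiplier, enabling secure and efficient CNN inference at the edge.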

Comments: Under review in 12th International Symposium on Smart Electronic Systems (iSES) 2026

Abstract

Edge-AI systems increasingly require real-time CNN inference under strict energy, performance, security, and privacy constraints. Approximate computing improves hardware efficiency by exploiting the error resilience of neural network workloads; however, most approximate CNN accelerators do not jointly consider secure, privacy-aware edge deployment. This paper presents SPARX, a Secure and Privacy-Aware Approximate CNN Acceleration framework integrated within a heterogeneous RV32IMC RISC-V System-on-Chip (SoC). SPARX combines a custom RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a lightweight differential-noise-based privacy engine, and a challenge-response authentication mechanism. To guide arithmetic selection, an approximation-aware decision framework is introduced that uses the Approximation Severity Index (ASI), Approximation Efficiency (AE), Quality of Approximation (QoA), Approximation Figure-of-Merit (AFOM), and Hardware Acceleration Efficiency (HAE). Evaluation across 11 state-of-the-art approximate MAC architectures identifies the Iterative Logarithmic Multiplier (ILM) as the most suitable design, achieving 51.7% area reduction, 81.5% power reduction, and 2.13x throughput improvement compared with an accurate radix-4 Booth MAC, while only reducing ResNet-20/CIFAR-10 accuracy by 2.82 percentage points. FPGA implementation on a Xilinx VC707 platform achieves 58.4 GOPS/W energy efficiency at 250 MHz, while 28-nm CMOS physical implementation validates ASIC feasibility.
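
For context on why a logarithmic multiplier is cheap, the sketch below follows the usual description of a Mitchell-style iterative logarithmic multiplier: each operand is split into its leading power of two plus a residue, the product is approximated with shifts and adds, and the dropped residue-times-residue term is optionally refined by repeating the same step on the residues. Whether SPARX uses this exact variant or how many correction stages it applies is not stated in the abstract; `ilm_multiply` and `corrections` are illustrative names.

```python
def ilm_multiply(a, b, corrections=1):
    """Approximate a * b for non-negative ints using only shifts and adds.

    Each stage uses
        a * b = 2^(k1+k2) + r1 * 2^k2 + r2 * 2^k1 + r1 * r2
    with a = 2^k1 + r1 and b = 2^k2 + r2, and leaves r1 * r2 for the
    next stage. The result never exceeds the exact product.
    """
    product = 0
    for _ in range(corrections + 1):
        if a == 0 or b == 0:
            break                                  # nothing left to add
        k1, k2 = a.bit_length() - 1, b.bit_length() - 1
        r1, r2 = a - (1 << k1), b - (1 << k2)      # residues below the leading one
        product += (1 << (k1 + k2)) + (r1 << k2) + (r2 << k1)
        a, b = r1, r2                              # remaining error term is r1 * r2
    return product
```

For 13 * 11 = 143, the basic stage gives 128, one correction gives 142, and two corrections give the exact result, which illustrates the accuracy versus hardware cost trade-off.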

2606.09942 2026-06-10 cs.SE cs.AI New submission

Anomaly Detection and Root Cause Analysis for Microservice Systems

Luan Pham

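AI summary: Addresses five limitations of anomaly detection and root cause analysis for microservice systems by proposing the end-to-end methods BARO and EventADL and the multimodal RCA framework TORAI, building the RCAEval benchmark, and validating effectiveness and robustness experimentally.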

Comments: This is the pre-print of my PhD thesis, submitted to RMIT University

Abstract

Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience and causing economic loss. Automated anomaly detection and root cause analysis (RCA) are now active research areas, but existing techniques share five limitations. First, most treat anomaly detection and RCA separately, assuming anomalies are detected correctly, and falter when detection is imprecise due to noise or delay. Second, they focus on metrics, logs, and traces, leaving event data such as API calls and configuration changes underexplored. Third, many require a given service call graph and cannot diagnose without one. Fourth, the field lacks standardised datasets and evaluation frameworks, so methods are hard to compare fairly. Fifth, although causal inference-based RCA has become dominant, its effectiveness, efficiency, and robustness remain unclear. This thesis addresses these limitations through two groups of contributions. The first introduces methods that exploit observability data both independently and collectively. BARO is an end-to-end anomaly detection and RCA approach for metric data. EventADL is an end-to-end framework for event data. TORAI is a multimodal RCA framework that requires no service call graph. Extensive experiments on real microservice systems demonstrate their effectiveness and robustness. The second group delivers benchmarking datasets, an evaluation framework, and systematic evaluation efforts. RCAEval is a comprehensive benchmark providing ready-to-use datasets and reproducible baselines for future research. A systematic evaluation of existing RCA methods, especially causal inference-based approaches, offers insights that guide future directions. This thesis thereby advances automated anomaly detection and RCA for microservice failures, enabling future research on incident mitigation and remediation.

2606.09940 2026-06-10 cs.LG cs.AI New submission

Interactions Between Crosscoder Features: A Compact Proofs Perspective

Dmitry Manning-Coe, Thomas Read, Anna Soligo, Oliver Clive-Griffin, Chun-Hei Yip, Rajashree Agrawal, Jason Gross

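AI summary: Formalizes interactions between crosscoder features from a compact proofs perspective, proposes an interaction measure, and applies it to computational sparsity, semantic clustering, and detecting sleeper agents.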

Comments: Accepted at the NeurIPS 2025 Workshop on Mechanistic Interpretability

Abstract

Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into independent features. Interactions between features hence induce errors in the reconstruction. We formalize this intuition via compact proofs and make five contributions. First, we show how, in principle, a compact proof of model performance can be constructed using a crosscoder. Second, we show that an error term arising in this proof can naturally be interpreted as a measure of interaction between crosscoder features and provide an explicit expression for the interaction term in the Multi-Layer Perceptron (MLP) layers. We then provide three applications of this new interaction measure. In our third contribution, we show that the interaction term itself can be used as a differentiable loss penalty. Applying this penalty, we can achieve "computationally sparse" crosscoders that retain 60% of MLP performance when only keeping a single feature at each datapoint and neuron, compared to 10% in standard crosscoders. We then show that clustering according to our interaction measure provides semantically meaningful feature clusters, and finally that sleeper agents have significant interactions. Code is available at https://github.com/chainik1125/crosscoders-feature-interactions/tree/arxiv.

2606.09937 2026-06-10 cs.LG cs.AI cs.CL 新提交

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

RKSC：面向多步LLM推理的感知推理的KV缓存共享与置信提前退出

Anirudh Sekar

AI总结提出RKSC框架，通过注意力相似性KV共享、置信门控提前退出和推理选择性块缓存管理，消除多分支LLM推理中的结构冗余，实现平均3.008倍加速，错误率仅0.37%。

Comments: Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

AI中文摘要

我们提出RKSC（感知推理的KV缓存共享），一种无需训练的推理框架，消除了多分支LLM推理流程中的两种结构冗余。ASKS（注意力相似性KV共享）计算前缀KV缓存一次，并通过隐藏状态余弦相似度广播给所有语义相似的分支，严格推广了vLLM和SGLang使用的精确令牌前缀缓存。CGEE（置信门控提前退出）应用两种互补的退出机制：（1）当生成置信度在分支间具有决定性时，完全跳过验证前向传播；（2）当逐层熵稳定时，在中间层终止验证传播，使用Transformer骨干上的轻量级钩子。RSBCM（推理选择性块缓存管理器）通过注意力加权深度优先驱逐防止无界缓存增长。在五个模型家族（7B-10B）、四个基准测试和1000个评估问题上，RKSC相对于无KV基线实现了平均3.008倍加速（峰值3.990倍），相对于vLLM等效前缀缓存平均提升1.66倍，CGEE导致的错误率仅为0.37%（1616次验证调用中6次错误）。无需微调或架构更改。代码可在 https://github.com/AnirudhSekar/RKSC 获取。

英文摘要

We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008x over the No-KV baseline (peak 3.990x), a 1.66x mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at https://github.com/AnirudhSekar/RKSC.
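
The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the RKSC implementation: the function names, the greedy grouping rule, the similarity threshold, and the `tol`/`patience` stopping rule are all assumptions chosen for illustration. The first function lets a branch reuse an earlier branch's prefix cache when their hidden states have high cosine similarity; the second stops a verification pass once the per-layer entropy has stopped changing.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_shared_prefix(branch_states, threshold=0.95):
    """Hypothetical greedy grouping: owner[i] is the branch whose prefix
    KV cache branch i would reuse (itself if no earlier branch is similar)."""
    representatives = []  # branches that compute and own a cache
    owner = []
    for i, state in enumerate(branch_states):
        match = next(
            (r for r in representatives
             if cosine_similarity(state, branch_states[r]) >= threshold),
            None,
        )
        if match is None:
            representatives.append(i)
            owner.append(i)
        else:
            owner.append(match)
    return owner


def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def exit_layer(per_layer_probs, tol=0.05, patience=2):
    """Hypothetical stopping rule: return the first layer at which entropy
    changed by less than `tol` for `patience` consecutive layers, or the
    last layer if it never stabilises."""
    stable = 0
    previous = None
    for layer, probs in enumerate(per_layer_probs):
        current = entropy(probs)
        if previous is not None and abs(current - previous) < tol:
            stable += 1
            if stable >= patience:
                return layer
        else:
            stable = 0
        previous = current
    return len(per_layer_probs) - 1
```

With exact token matching, only branches whose prefixes are identical can share a cache; a similarity threshold below 1.0 relaxes that condition, which is the sense in which the abstract describes ASKS as a generalisation of prefix caching.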

2606.09936 2026-06-10 cs.LG cs.AI 新提交

One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

一个镜头，多个世界：面向世界模型可解释性的能力类型接口

Bhavith Chandra Challagundla, Sanskar Pandey, Param Thakkar, Rishikesh Mallagundla, Yugandhar Reddy Gogireddy, Wenhao Lu, Hindol Roy Choudhury, Shravani Challagundla, Mohamed Deraz Nasr, Spursh Deshpande

AI总结提出WorldModelLens，通过能力类型适配器统一不同世界模型（如PlaNet、IRIS、I-JEPA）的可解释性分析，避免重复实现。

AI中文摘要

世界模型现在建立在截然不同的计算基板上。潜在循环状态空间模型（如PlaNet和Dreamer系列）将观测压缩为循环状态；基于token的模型（如IRIS）将观测量化到学习到的码本中，并用transformer进行自回归预测；联合嵌入预测架构（如I-JEPA）在没有像素解码器的学习潜在空间中进行预测。应用于这些模型的可解释性方法，包括探针、激活修补、稀疏自编码器和意外度（surprise）分析，共享一组共同的基元，但由于现有的钩子和缓存工具假设一个没有动作、环境步骤或想象推演（rollout）概念的transformer语言模型，这些方法却要为每个架构从头重新实现。我们认为这种碎片化反映的是工具而非模型的问题，并且世界模型的共享结构可以通过一个小型类型接口捕获。我们提出了WorldModelLens，一个围绕能力类型适配器组织的开源可解释性基板：每个模型实现四个必需方法（编码、转移、初始状态、采样），并通过显式能力描述符声明一组可选头（解码、奖励、继续、行动者、评论者），使得强化学习和自监督世界模型成为一等公民，而无需模仿对方。单一的钩子和缓存层在此接口上暴露时间索引的激活、想象推演和干预重放，使得每个分析只需编写一次。

英文摘要

World models are now built on substantially different computational substrates. Latent recurrent state-space models such as PlaNet and the Dreamer family compress observations into recurrent states; token-based models such as IRIS quantize observations into a learned codebook and predict autoregressively with a transformer; and joint-embedding predictive architectures such as I-JEPA predict in a learned latent space with no pixel decoder. The interpretability methods applied to these models, including probing, activation patching, sparse autoencoders, and surprise analysis, share a common set of primitives, yet they are re-implemented from scratch for each architecture because existing hook-and-cache tooling assumes a transformer language model with no notion of actions, environment steps, or imagined rollouts. We argue that this fragmentation reflects the tooling rather than the models, and that the shared structure of world models is captured by a small typed interface. We present WorldModelLens, an open-source interpretability substrate organized around a capability-typed adapter: every model implements four required methods (encode, transition, initial state, sample) and declares a set of optional heads (decode, reward, continue, actor, critic) through an explicit capability descriptor, so that reinforcement-learning and self-supervised world models are first-class without either imitating the other. A single hook and cache layer exposes time-indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once.
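
The capability-typed adapter described above can be sketched as a small typed interface. The class and method names below (`Capabilities`, `WorldModelAdapter`, `ToyCounterModel`, `imagine`, `total_reward`) are hypothetical and are not taken from the WorldModelLens codebase; the sketch only shows how four required methods plus an explicit capability descriptor let one analysis function run against any conforming model.

```python
from dataclasses import dataclass
from typing import Any, List, Protocol


@dataclass(frozen=True)
class Capabilities:
    """Explicit descriptor of the optional heads a model provides."""
    decode: bool = False
    reward: bool = False
    cont: bool = False  # "continue" is a Python keyword, hence the short name
    actor: bool = False
    critic: bool = False


class WorldModelAdapter(Protocol):
    """The four required methods named in the abstract (signatures assumed)."""
    capabilities: Capabilities

    def encode(self, observation: Any) -> Any: ...
    def transition(self, state: Any, action: Any) -> Any: ...
    def initial_state(self) -> Any: ...
    def sample(self, state: Any) -> Any: ...


class ToyCounterModel:
    """Toy model whose latent state is a running sum; declares a reward head only."""
    capabilities = Capabilities(reward=True)

    def encode(self, observation):
        return float(observation)

    def transition(self, state, action):
        return state + float(action)

    def initial_state(self):
        return 0.0

    def sample(self, state):
        return state

    def reward(self, state):
        return -abs(state)


def imagine(model: WorldModelAdapter, actions: List[Any]) -> List[Any]:
    """Imagination rollout: returns the time-indexed list of latent states."""
    state = model.initial_state()
    trace = [state]
    for action in actions:
        state = model.transition(state, action)
        trace.append(state)
    return trace


def total_reward(model, trace):
    """An analysis that needs an optional head checks the descriptor first."""
    if not model.capabilities.reward:
        raise NotImplementedError("model declares no reward head")
    return sum(model.reward(state) for state in trace)
```

A self-supervised model with no reward head would declare `Capabilities()` and still work with `imagine`, while `total_reward` would fail with a clear error instead of requiring the model to fake a head it does not have.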

2606.09935 2026-06-10 cs.CR cs.AI 新提交

GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines

GitInject：AI驱动的CI/CD流水线中的真实提示注入攻击

Jafar Isbarov, Umid Suleymanov, Ilia Shumailov, Murat Kantarcioglu

AI总结提出GitInject框架，在真实GitHub工作流中评估AI代理的提示注入漏洞，发现所有测试提供商均存在结构性风险，并给出最低成本防护措施。

AI中文摘要

AI代理越来越多地嵌入持续集成和持续交付/部署（CI/CD）流水线中，以自主审查拉取请求（PR）、分类问题和维护代码库。这些代理在操作时摄入不可信内容，同时拥有提升的仓库权限，使其成为具有供应链后果的提示注入攻击的自然目标。我们提出GitInject，一个开源框架，用于评估真实、活跃的GitHub工作流（CI/CD流水线的广泛部署实例）中的提示注入漏洞。与先前模拟工具调用的代理安全基准不同，GitInject提供临时仓库并触发实际工作流运行，因此沙箱约束、凭证处理和权限边界的行为与生产环境完全一致。使用GitInject，我们研究了四个AI提供商的工作流配置，并记录了十一种命名攻击，涵盖配置文件注入、凭证窃取、判断操纵和可用性。我们发现，所有测试的提供商在其默认配置中至少容易受到一类攻击，且最关键的漏洞是结构性的：它们源于CI/CD基础设施处理凭证和配置文件的方式，而非任何特定模型的行为。对于每个确认的攻击类别，我们确定了最低成本的工作流级对策，并分析了其覆盖范围和局限性。GitInject已公开发布，以促进这一方向的进一步研究。

英文摘要

AI-powered agents are increasingly embedded in continuous integration and continuous delivery/deployment (CI/CD) pipelines to autonomously review pull requests (PRs), triage issues, and maintain codebases. These agents ingest untrusted content while operating with elevated repository permissions, making them a natural target for prompt injection attacks with supply chain consequences. We present GitInject, an open-source framework for evaluating prompt injection vulnerabilities in real, live GitHub workflows, a widely deployed instance of CI/CD pipelines. Unlike prior agent security benchmarks that simulate tool calls, GitInject provisions ephemeral repositories and triggers actual workflow runs, so that sandbox constraints, credential handling, and permission boundaries behave exactly as in production. Using GitInject, we study workflow configurations across four AI providers and document eleven named attacks spanning config-file injection, credential exfiltration, judgment manipulation, and availability. We find that all tested providers are susceptible to at least one attack class in their default configuration, and that the most critical vulnerabilities are structural: they arise from how CI/CD infrastructure handles credentials and configuration files, not from any specific model's behavior. For each confirmed attack class, we identify the minimum-cost workflow-level countermeasure and analyze its coverage and limitations. GitInject is released publicly to facilitate further research in this direction.

2606.09934 2026-06-10 cs.LG cs.CR 新提交

nCMD: Benign-Anchored Feature Selection for Imbalanced Network Intrusion Detection

nCMD：面向不平衡网络入侵检测的良性锚定特征选择

Abu Fuad Ahmad, Istiaque Ahmed

AI总结提出良性锚定类均值偏差（nCMD）方法，通过计算攻击类分布与良性类均值的偏差进行特征选择，在四个基准数据集上优于传统过滤方法，尤其适用于特征预算紧张和类别严重不平衡的场景。

Comments: 6 pages, IEEE double columns

AI中文摘要

特征选择对于在操作和防御网络中常见的高维、高度不平衡流量下运行的网络入侵检测系统（NIDS）至关重要。传统的过滤方法使用跨类别对称计算的全局统计量对特征进行排序，因此无法捕捉入侵检测的不对称性，其中攻击最好被描述为对主导良性流量的偏离。我们提出了良性锚定类均值偏差（nCMD），一种轻量级且可解释的方法，该方法基于攻击类分布与良性类均值的偏差（而非全局有偏的参考）对特征相关性进行评分。这种方法使特征选择与NIDS的操作语义保持一致，且不增加额外计算成本。在四个基准数据集（CICIDS2017、CICDDoS2019、NSL-KDD和UNSW-NB15）、多个特征预算和三个下游分类器上，nCMD在宏平均F1分数上达到或超过了经典过滤基线。它在四个数据集中的三个以及每个分类器下均取得了最佳结果，在特征预算紧张和类别严重不平衡的情况下改进最为显著。这些结果支持良性锚定排序作为资源受限NIDS的可扩展且可解释的预处理组件。

英文摘要

Feature selection is critical for network intrusion detection systems (NIDS) operating under high-dimensional, highly imbalanced traffic, as found in operational and defense networks. Traditional filter methods rank features using global statistics computed symmetrically across classes and thus fail to capture the asymmetry of intrusion detection, where attacks are best characterized as deviations from dominant benign traffic. We propose benign-anchored Classwise Mean Deviation (nCMD), a lightweight and interpretable method that scores feature relevance based on the deviation of attack-class distributions from the benign-class mean, rather than a globally biased reference. This approach aligns feature selection with the operational semantics of NIDS at no additional computational cost. Across four benchmark datasets (CICIDS2017, CICDDoS2019, NSL-KDD, and UNSW-NB15), multiple feature budgets, and three downstream classifiers, nCMD matches or exceeds classical filter baselines in macro-averaged F1-score. It achieves the best result on three of the four datasets and under every classifier, with the strongest improvements observed under tight feature budgets and severe class imbalance. These results support benign-anchored ranking as a scalable and interpretable preprocessing component for resource-constrained NIDS.
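
The idea of anchoring feature scores to the benign class can be sketched as follows. The abstract does not give the exact nCMD formula, so the scoring rule here is an assumption: for each feature, take the absolute difference between each attack class's mean and the benign mean, average over attack classes, and divide by the benign standard deviation. The function names and the `benign_label` parameter are likewise hypothetical.

```python
from statistics import fmean, pstdev


def benign_anchored_scores(X, y, benign_label="benign", eps=1e-12):
    """Score each feature by how far attack-class means sit from the
    benign-class mean, in units of benign standard deviation.
    (Assumed scoring rule, not the paper's exact definition.)"""
    rows_by_class = {}
    for row, label in zip(X, y):
        rows_by_class.setdefault(label, []).append(row)

    benign_rows = rows_by_class[benign_label]
    attack_groups = [rows for label, rows in rows_by_class.items()
                     if label != benign_label]

    scores = []
    for j in range(len(X[0])):
        benign_col = [row[j] for row in benign_rows]
        benign_mean = fmean(benign_col)
        benign_std = pstdev(benign_col)
        deviations = [abs(fmean([row[j] for row in rows]) - benign_mean)
                      for rows in attack_groups]
        scores.append(fmean(deviations) / (benign_std + eps))
    return scores


def top_k_features(scores, k):
    """Indices of the k highest-scoring features (the feature budget)."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

Because the reference point is the benign mean rather than a global mean, a rare attack class contributes as much to a feature's score as a frequent one, which is one plausible reading of why the method is reported to help most under severe imbalance.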

2606.09932 2026-06-10 cs.LG cs.AI 新提交

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

当强化学习在监督微调后失效：恢复模型可塑性以实现稳健的SFT到RL交接

Runze Liu, Jiashun Liu, Xu Wan, Yuqian Fu, Ling Pan

AI总结针对SFT过度训练导致RL阶段改进有限的问题，提出Rejuvenation方法，通过基模型锚定融合和神经元重置恢复模型可塑性，在数学推理和智能体任务上提升RL性能。

AI中文摘要

监督微调（SFT）后接强化学习（RL）已成为大语言模型（LLM）后训练的标准流程。SFT预期为RL提供有用的行为先验，以进一步增强模型能力。然而，过度SFT的检查点在RL中往往表现出有限的改进。我们将此失败归因于模型可塑性的丧失：SFT初始化的策略被后续RL有效重塑的能力降低。为了更好地理解这一现象，我们从参数变化、输出空间和RL优化动态等多个角度进行了详细分析。我们的结果表明，过度SFT的模型倾向于产生过度自信的token分布，并表现出尖锐的参数景观，这使得它们在RL阶段更难优化。为了实现更稳健的SFT到RL交接，我们提出了Rejuvenation，一种简单而有效的方法，在保留有用的SFT获取先验的同时恢复可塑性。Rejuvenation利用基于基模型的模型融合来减少过度SFT引起的漂移，并通过有针对性的神经元重置来缓解模型僵化。在数学推理任务和智能体任务上的实验结果表明，我们的方法在过度训练的SFT模型上持续提升了RL性能，同时也增强了对分布外任务的泛化能力。

英文摘要

Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become a standard pipeline for Large Language Model (LLM) post-training. SFT is expected to provide a useful behavioral prior for RL to further enhance model capabilities. However, checkpoints with excessive SFT often show limited improvement during RL. We attribute this failure to the loss of model plasticity: the reduced ability of an SFT-initialized policy to be effectively reshaped by subsequent RL. To better understand this phenomenon, we conduct detailed analysis from multiple perspectives, including parameter changes, output spaces, and RL optimization dynamics. Our results show that models from excessive SFT tend to produce over-confident token distributions and exhibit sharp parameter landscapes, which make them harder to optimize in the RL stage. To enable a more robust SFT-to-RL handoff, we propose Rejuvenation, a simple yet effective method that restores plasticity while preserving useful SFT-acquired priors. Rejuvenation leverages base-anchored model fusion to reduce excessive SFT-induced drift, together with targeted neuron reset to mitigate model rigidity. Experimental results on both math reasoning tasks and agentic tasks demonstrate that our approach consistently improves RL performance on over-trained SFT models, while also enhancing generalization to out-of-distribution tasks.

2606.09931 2026-06-10 cs.GT cs.AI 新提交

A Note on the Strategic Confinement Problem

关于战略约束问题的一个注记

Christian Schroeder de Witt

AI总结本文引入战略约束问题，指出当通信方为具有共享协调资源的战略智能体时，即使信道容量极小，也可能导致机密信息的高影响泄露，并论证学习型战略智能体系统自然实例化该问题。

AI中文摘要

Lampson的约束问题询问如何防止处理机密信息的程序将其泄露给第三方。我们引入战略约束问题，当通信方是具有共享协调资源的战略智能体时出现该问题。在此设置中，剩余通信能力可以集中在机密数据的低熵、高影响谓词上。因此，信息泄露的界限不一定导致最坏情况危害的相应界限：一个容量可忽略的信道仍可能足以选择破坏性结果。我们认为，学习型战略智能体系统自然实例化此问题，因为它们不允许完整的行为规范，它们习得的惯例通常无法被外部观察者预测或重现，并且足够能力的智能体可以构建难以检测或消除的隐蔽通信方案。因此，我们的贡献不是一种新的通信理论，而是在存在战略智能体的情况下对约束的重新解释。经典约束限制了可能流动的信息；战略约束强调这不一定限制战略智能体可以共同实现的目标。

英文摘要

Lampson's confinement problem asks how to prevent a program that processes confidential information from leaking it to a third party. We introduce the strategic confinement problem, which arises when the communicating parties are strategic agents with shared coordination resources. In this setting, residual communication capacity can be concentrated on low-entropy, high-impact predicates of the confidential data. Consequently, bounds on information leakage need not induce corresponding bounds on worst-case harm: a channel with negligible capacity may still suffice to select damaging outcomes. We argue that systems of learnt strategic agents naturally instantiate this problem because they do not admit complete behavioural specifications, their learnt conventions generally cannot be predicted or reproduced by an external observer, and sufficiently capable agents can construct covert communication schemes that are difficult to detect or eliminate. Our contribution is therefore not a new theory of communication, but a reinterpretation of confinement in the presence of strategic agents. Classical confinement bounds what information may flow; strategic confinement highlights that this need not bound what strategic agents can jointly achieve.

2606.09930 2026-06-10 cs.PL cs.LG cs.SC 新提交

Compile Once, Differentiate Everywhere: A Differentiable Meta-Circular Interpreter

一次编译，处处微分：可微分元循环解释器

Lucas Sheneman

AI总结提出一种将Scheme子集编译为可微分计算图的编译器，实现可微分元循环解释（DMCI），支持对包含闭包、递归和数据结构的程序进行反向模式自动微分，无需重新编译。

AI中文摘要

程序执行与基于梯度的优化之间的界限长期以来限制了代码本身作为可学习科学模型的使用。我们提出一个编译器，将Scheme的自托管子集转换为用于自动微分后端的可微分计算图。由于该子集可以编译自身的求值器，这产生了可微分元循环解释（DMCI）：一个编译后的Scheme解释器执行作为数据提供的程序，而反向模式自动微分将梯度传播到嵌入在这些程序中的连续常数。解释器只编译一次，因此新程序无需重新编译或自定义梯度机制即可继承可微性，同时保留闭包、递归和数据结构。我们证明通过编译解释器的梯度几乎处处正确，并表明它们在171个递归和高阶程序-种子对上与直接编译的结果在数值精度范围内一致。然后，我们使用DMCI进行程序与参数联合搜索，其中大型语言模型提出Scheme程序，精确梯度通过单个冻结的解释器校准其连续参数。这实现了OpenEvolve风格的程序搜索，其中外部循环提出离散程序结构，DMCI提供每个候选程序连续参数的精确基于梯度的校准。在电池容量衰减数据上，该搜索恢复了拐点状（knee-like）退化结构，并在更难的早期外推划分上取得了优于手工基线的留出外推性能，在后期划分上与之持平。在高维厄尔尼诺反问题中，DMCI优化了由解释器执行的卡尔曼滤波似然，而无梯度搜索失败。这些结果将符号回归和神经符号搜索从闭式表达式扩展到可执行、有状态的程序，使模型生成的代码可直接针对数据进行优化。

英文摘要

The boundary between program execution and gradient-based optimization has long limited the use of code itself as a learnable scientific model. We present a compiler that translates a self-hosting subset of Scheme into differentiable computation graphs for autograd backends. Because the subset can compile its own evaluator, this yields differentiable meta-circular interpretation (DMCI): a compiled Scheme interpreter executes programs supplied as data, while reverse-mode autodiff propagates gradients to continuous constants embedded in those programs. The interpreter is compiled once, so new programs inherit differentiability without recompilation or custom gradient machinery, while retaining closures, recursion, and data structures. We prove that gradients through the compiled interpreter are correct almost everywhere and show that they match direct compilation to numerical precision across 171 recursive and higher-order program-seed pairs. We then use DMCI for program-and-parameter co-search, where a large language model proposes Scheme programs and exact gradients calibrate their continuous parameters through a single frozen interpreter. This enables OpenEvolve-style program search in which an outer loop proposes discrete program structures and DMCI supplies exact gradient-based calibration of each candidate's continuous parameters. On battery capacity-fade data, the search recovers a knee-like degradation structure and improves held-out extrapolation over hand-crafted baselines on the harder early-extrapolation split, matching them on the later split. On a high-dimensional El Nino inverse problem, DMCI optimizes an interpreted Kalman-filter likelihood where gradient-free search fails. These results extend symbolic regression and neurosymbolic search from closed-form expressions to executable, stateful programs, making model-generated code directly optimizable against data.

2606.09929 2026-06-10 cs.LG cs.AI 新提交

Between Amnesia and Chaos: A Memory Stability Expressivity Trilemma for Trainable Dissipative Oscillator Networks

介于遗忘与混沌之间：可训练耗散振荡器网络的记忆稳定性表现力三难困境

Caleb Munigety

AI总结本文研究可训练非线性振荡器网络，发现记忆范围、梯度稳定性和动态表现力三者受阻尼控制，存在无法同时最大化的三难困境，并通过实验验证了理论边界。

AI中文摘要

物理储层计算利用非线性机械动力学，但传统上冻结基底并仅训练线性读出层，假定基底不可有效训练。我们重新审视这一前提，研究非线性振荡器网络，其质量、阻尼和刚度通过辛积分器端到端学习。我们的核心结果是三难困境：记忆范围、梯度稳定性和动态表现力无法同时最大化，因为三者均由阻尼控制。反向梯度以阻尼决定的速率衰减，限制了信用传播的距离，而前向灵敏度以最大李雅普诺夫指数为速率指数增长，因此可用梯度需要阻尼高于稳定下限。由于李雅普诺夫指数随阻尼增加而下降，而记忆上限随范围增加而下降，稳定训练被限制在一个随范围收缩并在临界点闭合的带状区域内。我们在一个二十振荡器网络上测试了每一步。阻尼扫描发现最大李雅普诺夫指数单调变化并在明确的稳定下限处过零，证实了定理的关键假设。在九个范围上的延迟回忆任务中，学习基底与冻结基底的算力匹配比较显示，学习基底在短范围占优，其优势在约十一步的范围附近缩小并反转，这是带状区域闭合的预测特征；训练后的模型稳定在稳定下限附近，自发趋向混沌边缘。解析上限将经验交叉点高估约五倍，这是可检测梯度与可学习梯度之间的差距，我们如实报告这一差距，而非通过调参将其消除。本文的贡献在于给出了一个经过验证的结论：何时训练物理基底优于冻结基底。

英文摘要

Physical reservoir computing harnesses nonlinear mechanical dynamics but, by convention, freezes the substrate and trains only a linear readout, presuming the substrate is not usefully trainable. We revisit that premise for networks of nonlinear oscillators whose mass, damping, and stiffness are learned end-to-end through a symplectic integrator. Our central result is a trilemma: memory horizon, gradient stability, and dynamical expressivity cannot be simultaneously maximized, because all three are governed by the damping. The backward gradient decays at a rate set by the damping, capping how far back credit can propagate, while forward sensitivities grow exponentially in the largest Lyapunov exponent, so usable gradients require damping above a stability floor. Since the Lyapunov exponent falls as damping rises while the memory ceiling falls as the horizon grows, stable training is confined to a band that contracts with horizon and closes at a critical point. We test every step on a twenty-oscillator network. A damping sweep finds the largest Lyapunov exponent monotone and crossing zero at a well-defined stability floor, confirming the theorem's key assumption. A compute-matched comparison of learned versus frozen substrate on delayed recall across nine horizons shows the learned substrate dominating at short horizons and the advantage closing and reversing near a horizon of eleven steps, the predicted signature of band closure; trained models settle near the stability floor, seeking the edge of chaos unprompted. The analytic ceiling overestimates the empirical crossover roughly fivefold, a gap between detectable and learnable gradient that we report rather than tune away. The contribution is a confirmed account of when training a physical substrate beats freezing it.

2606.09928 2026-06-10 cs.LG cs.AI 新提交

Forward-Only Convolutional Neural Networks with Learnable Channel-Class Assignment

具有可学习通道-类别分配的仅前向卷积神经网络

Mohammadnavid Ghader, Saeed Reza Kheradpisheh, Bahar Farahani, Mahmood Fazlali

AI总结提出可学习的通道-类别分配机制，结合熵和正交正则化，以及基于验证性能的损失感知层贡献策略，在残差CNN上实现仅前向学习，在CIFAR-10/100和Tiny-ImageNet上达到FF模型最佳性能，缩小与反向传播的差距。

AI中文摘要

前向-前向（FF）算法通过用局部的仅前向目标替代基于梯度的信用分配，提供了一种受生物学启发的反向传播替代方案。虽然最近的扩展已将FF适应到卷积神经网络（CNN），但现有公式依赖于静态的通道-类别分区，并且在复杂任务中难以有效执行。在这项工作中，我们引入了一种可学习的通道-类别分配机制，实现了卷积通道的自适应、数据驱动特化，并辅以熵和正交正则化以提升学习性能。我们进一步提出了一种损失感知的层贡献策略，该策略根据中间层的验证性能自适应地加权其预测，从而增强仅前向推理的有效性。集成到残差CNN中，所提出的方法在CIFAR-10、CIFAR-100和Tiny-ImageNet上相比现有的同类仅前向方法持续实现了更优的性能。值得注意的是，它在基于FF的模型中建立了新的最先进性能，显著缩小了与反向传播的差距。这些发现表明，引入可学习的通道特化和层贡献加权显著增强了深度CNN中仅前向学习的表示能力。

英文摘要

The Forward-Forward (FF) algorithm offers a biologically inspired alternative to backpropagation by replacing gradient-based credit assignment with local, forward-only objectives. While recent extensions have adapted FF to convolutional neural networks (CNNs), existing formulations rely on static channel-class partitions and struggle to perform effectively in complex tasks. In this work, we introduce a learnable channel-class assignment mechanism that enables adaptive, data-driven specialization of convolutional channels, supported by entropy and orthogonality regularization to promote learning performance. We further propose a loss-aware layer contribution strategy that adaptively weights intermediate-layer predictions based on their validation performance, enhancing the effectiveness of forward-only inference. Integrated into residual CNNs, the proposed method achieves consistently superior performance across CIFAR-10, CIFAR-100, and Tiny-ImageNet compared to existing similar forward-only methods. Notably, it establishes new state-of-the-art performance among FF-based models, substantially narrowing the gap with backpropagation. These findings demonstrate that introducing learnable channel specialization and layer contribution weighting significantly enhances the representational capacity of forward-only learning in deep CNNs.

2606.09927 2026-06-10 cs.LG cs.AI cs.CL 新提交

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

可训练平滑旋转变换与学习通道尺度用于LLM量化

Patrik Czakó, Gábor Kertész, Sándor Szénási

AI总结针对大语言模型量化中激活值量化困难的问题，提出基于分位数鲁棒的缩放策略和梯度优化的通道尺度学习，在W4A4量化下显著降低误差。

Comments: 6 pages, 8 figures, 3 tables. Accepted to IEEE INES 2026 conference proceedings

AI中文摘要

后训练量化（PTQ）是降低大语言模型（LLM）服务成本最实用的方法之一，但激活值量化仍然困难，因为异常值主导的通道会导致较大的量化误差。本文研究了这种退化是否部分由基于缩放的等效变换中的过度迁移引起。我们引入了一种用于SmoothRot风格变换的分位数鲁棒缩放策略，用高分位数替代基于最大值的激活统计量，并辅以通道尺度的约束梯度优化。在LLaMA-3.2-1B的W4A4量化下，仅分位数策略搜索相比SmoothRot基线将选定层误差降低11.1%，联合(alpha, q)搜索降低12%，训练达到18.5%。将最佳选定层策略重放到所有解码器块的下投影层，相应的全层平均误差从97.51降至78.08（19.9%）。结果表明，鲁棒的迁移控制和轻量级尺度学习在保持等效变换框架的同时，相比基于最大值的固定策略提供了持续改进。

英文摘要

Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantization errors. This paper investigates whether part of this degradation is caused by over-migration in scaling-based equivalent transformations. We introduce a quantile-robust scaling policy for SmoothRot-style transforms by replacing max-based activation statistics with high quantiles, and we complement it with constrained gradient-based optimization of channel scales. On LLaMA-3.2-1B under W4A4 quantization, quantile-only policy search improves selected-layer error by 11.1% over the SmoothRot baseline, joint (alpha, q) search improves it by 12%, and training reaches 18.5%. Replaying the best selected-layer policy on all decoder-block down-projection layers reduces the corresponding full-layer mean error from 97.51 to 78.08 (19.9%). The results show that robust migration control and lightweight scale learning provide consistent gains over max-based fixed policies while preserving the equivalent-transform framework.
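
The effect of swapping a max statistic for a high quantile can be shown with a minimal sketch. It assumes the commonly used smoothing rule in which channel j receives the scale s_j = a_j^alpha / w_j^(1 - alpha), with a_j an activation-magnitude statistic and w_j the largest weight magnitude for that channel; the paper's exact policy, its rotation step, and its trained scales are not reproduced here. The helper names and the choice q = 0.9 are hypothetical.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


def channel_scales(acts, weights, alpha=0.5, q=1.0, eps=1e-8):
    """Per-channel smoothing scales.

    acts:    list of rows, one per token, each with one value per channel
    weights: list of rows, one per input channel
    q=1.0 reproduces the max-based statistic; q<1.0 is the quantile variant.
    """
    scales = []
    for j, weight_row in enumerate(weights):
        act_stat = quantile([abs(row[j]) for row in acts], q)
        weight_stat = max(abs(w) for w in weight_row)
        scale = max(act_stat, eps) ** alpha / max(weight_stat, eps) ** (1.0 - alpha)
        scales.append(scale)
    return scales


# One channel, 99 ordinary activations and a single outlier.
acts = [[1.0]] * 99 + [[100.0]]
weights = [[1.0]]
max_based = channel_scales(acts, weights, alpha=0.5, q=1.0)
quantile_based = channel_scales(acts, weights, alpha=0.5, q=0.9)
```

Dividing activations by s_j and multiplying the matching weight row by s_j leaves the layer output unchanged, so the transform stays equivalent. In this toy case the single outlier drives the max-based scale to 10, which pushes ten times more magnitude into the weights than the quantile-based scale of about 1; that is the over-migration the abstract refers to.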

2606.09926 2026-06-10 cs.LG cs.AI 新提交

Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

在你挣扎处采样：通过熵引导的幂采样增强基础模型推理

Hong Guo, Nianhui Guo, Christoph Meinel, Haojin Yang

AI总结提出熵引导的幂采样（EGPS），一种无需训练和验证器的采样方法，通过利用前向传播中的token级熵将MCMC移动定位到高熵区域，在多个基准上以高达12.6倍加速达到最优或并列最优准确率。

AI中文摘要

从序列级幂分布 $p^\alpha$ 采样可以在不更新任何参数的情况下从基础语言模型中引出强化学习级别的推理，但标准的Metropolis-Hastings（MH），一种马尔可夫链蒙特卡洛（MCMC）采样器，既昂贵又慢混合。我们将这两个问题归因于结构不匹配：$p^\alpha$ 主要在稀疏、空间聚集的高熵决策点集上偏离 $p$，然而MH沿着前缀均匀地提出重采样位置，在近简并条件分布上浪费计算，同时恰恰在模式发散处混合不足。我们提出熵引导的幂采样（EGPS），一种无需训练和验证器的采样器，它从前向传播中已有的token级熵重新推导其提议分布。EGPS跳过确定性块，将每个MCMC移动定位到高熵邻域，并在决策点应用多尝试Metropolis，使得采样成本随熵质量而非序列长度缩放。在Qwen2.5-Math-7B上，EGPS在所有三个基准（MATH500 $75.8\%$，HumanEval $62.2\%$，GPQA $42.4\%$）上达到最佳或并列最佳准确率，同时相对于MH基线实现了高达12.6倍的实际运行时间（wall-clock）加速。

英文摘要

Sampling from the sequence-level power distribution $p^\alpha$ elicits RL-level reasoning from base language models without any parameter updates, but the standard Metropolis-Hastings (MH), a Markov Chain Monte Carlo (MCMC) sampler, is both expensive and slow-mixing. We trace both to a structural mismatch: $p^\alpha$ mainly departs from $p$ at a sparse, spatially clustered set of high-entropy decision points, yet MH proposes resampling positions uniformly along the prefix, wasting compute on near-degenerate conditionals while under-mixing precisely where modes diverge. We propose Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampler that re-derives its proposal from token-level entropy already in the forward pass. EGPS skips deterministic blocks, localizes each MCMC move to a high-entropy neighborhood, and applies Multiple-Try Metropolis at decision points, making sampling cost scale with entropy mass rather than sequence length. On Qwen2.5-Math-7B, EGPS reaches best or tied-best accuracy on all three benchmarks (MATH500 $75.8\%$, HumanEval $62.2\%$, GPQA $42.4\%$) at up to a $12.6\times$ wall-clock speedup over the MH baseline.
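
The contrast between uniform and entropy-guided position proposals can be sketched in a few lines. This covers only the position-selection step: the acceptance rule, Multiple-Try Metropolis, and block skipping from the abstract are not implemented. The function names and the "weight proportional to entropy" rule are assumptions made for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_distributions, threshold):
    """Positions whose next-token entropy reaches the threshold; these are
    the candidate decision points for resampling."""
    return [t for t, probs in enumerate(step_distributions)
            if token_entropy(probs) >= threshold]


def proposal_weights(step_distributions):
    """Probability of proposing each position for resampling, taken here to
    be proportional to its entropy (uniform if every entropy is zero)."""
    entropies = [token_entropy(probs) for probs in step_distributions]
    total = sum(entropies)
    if total == 0.0:
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in entropies]


# Three positions: deterministic, a coin flip, and a mildly uncertain token.
steps = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
weights = proposal_weights(steps)
decision_points = high_entropy_positions(steps, threshold=0.5)
```

A uniform proposal would spend a third of its moves on the deterministic first position, where resampling cannot change anything. Under the entropy-weighted rule that position gets zero mass, which is a small instance of cost scaling with entropy mass rather than sequence length.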

2606.09925 2026-06-10 cs.SD 新提交

AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning

AudioProcessBench：音频基础推理中过程错误识别的基准

Xiangyu Zhao, Junyu Yan, Yaling Shen, Zimu Wang, Yiwen Jiang, Stephanie Fong, Qingyang Xu, Jiahe Liu, Dominic Dwyer, Zongyuan Ge

AI总结提出AudioProcessBench基准，用于评估音频-语言模型在推理步骤中的过程错误识别能力，涵盖步骤正确性、错误类型检测和链级聚合三种范式。

AI中文摘要

大型音频-语言模型（LALMs）越来越多地使用显式推理轨迹进行复杂的音频理解，但对推理质量的评估仍未被充分探索。尽管过程级基准（用于过程奖励模型PRMs）在文本和多模态领域推进了推理评估，但音频推理的类似评估仍然有限。在本文中，我们提出了AudioProcessBench，一个用于音频推理中步骤级过程错误识别的综合基准。AudioProcessBench包含由6个音频和全模态语言模型生成的不同推理轨迹。每个轨迹被分割成离散的推理步骤，并标注了二元步骤正确性和细粒度错误类型。我们的基准在三种互补范式下评估模型：（1）步骤正确性识别，（2）错误类型条件检测，用于诊断音频特定验证器能力，以及（3）链级聚合，其中验证器为同一问题选择或聚合多个推理轨迹。这种设计使得系统分析当前模型是否能检测过程错误、它们的弱点是否因音频特定错误类型而异，以及过程验证是否能转化为改进的答案选择成为可能。AudioProcessBench为未来关于音频推理验证器、过程奖励模型和可靠的全模态推理研究提供了测试平台。

英文摘要

Large audio-language models (LALMs) increasingly use explicit reasoning traces for complex audio understanding, yet the evaluation of reasoning quality remains underexplored. Although process-level benchmarks for process reward models (PRMs) have advanced reasoning evaluation in text and multi-modal domains, comparable evaluation for audio reasoning remains limited. In this paper, we present AudioProcessBench, a comprehensive benchmark for step-level process error identification in audio reasoning. AudioProcessBench contains diverse reasoning traces generated by 6 audio and omni language models. Each trace is segmented into discrete reasoning steps and annotated with binary step correctness and fine-grained error types. Our benchmark evaluates models under three complementary paradigms: (1) step correctness identification, (2) error-type-conditioned detection for diagnosing audio-specific verifier capacities, and (3) chain-level aggregation, where verifiers select or aggregate among multiple reasoning traces for the same question. This design enables a systematic analysis of whether current models can detect process errors, whether their weaknesses differ across audio-specific error types, and whether process verification translates into improved answer selection. AudioProcessBench provides a testbed for future research on audio reasoning verifiers, process reward models, and reliable omni-modal reasoning.

2606.09924 2026-06-10 cs.LG cs.AI 新提交

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

Sigma-Branch：用于动态推理的分层单路径网络重构，减少活跃参数

Kohga Tanaka, Hiroaki Nishi

AI总结提出Sigma-Branch框架，通过分层二叉树结构将预训练密集网络重构为共享主干、分层路由器和专用叶子，利用激活聚类初始化并微调，推理时仅执行单一路径，在CIFAR-100/ResNet-50等任务上减少58-60%活跃参数，性能损失小于1.72个百分点。

AI中文摘要

在内存受限的边缘加速器上部署深度神经网络，瓶颈在于每次推理的片外权重传输而非计算：密集网络无法保留在芯片上，每个输入都必须加载所有参数。现有模型压缩仅在永久容量损失代价下减少这种传输。我们提出Sigma-Branch (SigmaB)，一个将预训练密集网络重构为分层二叉树的框架，该树由共享主干、分层路由器和专用叶子组成。预训练权重通过基于激活的球形k-means聚类分布在树中，该聚类联合初始化路由器权重和每分支通道分配；然后通过软路由微调使每个叶子与其路由输入子集对齐。在推理时，所得网络仅执行一条根到叶路径，减少活跃参数占用，同时将完整密集参数集存储在内存中。在CIFAR-100 / ResNet-50、ImageNet-1K / ResNet-50和ModelNet40 / PointNet++上，SigmaB-Net将每次推理的活跃参数减少58-60%，同时与密集基线Top-1相比误差在1.72个百分点以内。在可比的ImageNet-1K Top-1下，活跃参数减少超过静态结构化剪枝（FPGM、HRank）14-23个百分点。跨模态评估涵盖2D视觉和3D点云骨干网络，证实了将每次推理内存流量与总参数数量解耦的框架级主张。

英文摘要

Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather than computation: the dense network cannot be retained on-chip, and every parameter must be loaded for every input. Existing model compression reduces this transfer only at the cost of permanent capacity loss. We propose Sigma-Branch (SigmaB), a framework that restructures a pretrained dense network into a hierarchical binary tree composed of a shared backbone, hierarchical routers, and specialized leaves. Pretrained weights are distributed across the tree via activation-based spherical k-means clustering, which jointly initializes router weights and per-branch channel allocations; soft-routing fine-tuning then aligns each leaf with its routed input subset. At inference, the resulting network executes only a single root-to-leaf path, reducing the active-parameter footprint while storing the complete dense parameter set in memory. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, SigmaB-Net reduces per-inference active parameters by 58-60% while remaining within 1.72 percentage points (pp) of the dense baseline Top-1. At comparable ImageNet-1K Top-1, the active-parameter reduction exceeds static structured pruning (FPGM, HRank) by 14-23 pp. The cross-modal evaluation, spanning 2D vision and 3D point-cloud backbones, substantiates a framework-level claim that decouples per-inference memory traffic from the total parameter count.

2606.09923 2026-06-10 cs.LG cs.AI 新提交

Conformal Prediction for Neural Operators: Distribution-Free Uncertainty Quantification in Physics Simulation

神经算子的共形预测：物理模拟中无分布不确定性量化

Michael Chin

AI总结提出将分裂共形预测应用于神经算子物理模拟，实现无分布预测区间和有限样本覆盖保证，并通过归一化共形预测方案生成自适应宽度区间。

Comments: 13 pages, 7 tables, 7 figures. Full-scale experiments on NVIDIA V100

AI中文摘要

神经算子如傅里叶神经算子（FNO）已成为求解偏微分方程（PDE）的强大替代方法，比传统数值求解器快几个数量级。然而，在安全关键工程应用（如电子元件和电池系统的热管理）中部署这些模型，不仅需要准确的点预测，还需要严格的不确定性保证。现有的神经算子不确定性量化（UQ）方法，包括蒙特卡洛Dropout和深度集成，仅提供相对不确定性估计，没有正式的覆盖保证。在这项工作中，我们首次将分裂共形预测应用于基于神经算子的物理模拟，提供具有有限样本覆盖保证的无分布预测区间。我们进一步引入了一种归一化共形预测方案，利用MC Dropout不确定性生成自适应宽度区间，在低不确定性区域产生更紧的区间，在模型不太确定的区域产生更宽的区间。在稳态热传导基准上的全规模实验（3370万参数，800个训练样本，5个集成成员，NVIDIA V100）表明，我们的方法在目标水平alpha=0.1下达到89.1%的经验覆盖率，同时生成反映底层物理不确定性结构的空间自适应预测区间。我们还提供了一个不确定性分解框架，将认知不确定性（占总量的68%）与偶然不确定性（占总量的32%）分离，为数据收集和模型改进提供可操作指导。我们的方法在一个开源平台上实现，具有REST API端点和交互式3D可视化。

英文摘要

Neural operators such as the Fourier Neural Operator (FNO) have emerged as powerful surrogates for solving partial differential equations (PDEs), achieving speedups of several orders of magnitude over traditional numerical solvers. However, deploying these models in safety-critical engineering applications -- such as thermal management of electronic components and battery systems -- requires not only accurate point predictions but also rigorous uncertainty guarantees. Existing uncertainty quantification (UQ) methods for neural operators, including Monte Carlo Dropout and Deep Ensembles, provide only relative uncertainty estimates without formal coverage guarantees. In this work, we propose the first application of split conformal prediction to neural operator-based physics simulation, providing distribution-free prediction intervals with finite-sample coverage guarantees. We further introduce a normalized conformal prediction scheme that leverages MC Dropout uncertainty to produce adaptive-width intervals, yielding tighter intervals in regions of low uncertainty and wider intervals where the model is less certain. Full-scale experiments (33.7M parameters, 800 training samples, 5 ensemble members, NVIDIA V100) on steady-state heat conduction benchmarks demonstrate that our method achieves 89.1% empirical coverage at the target level of alpha=0.1, while producing spatially adaptive prediction intervals that reflect the underlying physical uncertainty structure. We also provide an uncertainty decomposition framework that separates epistemic uncertainty (68% of total) from aleatoric uncertainty (32% of total), offering actionable guidance for data collection and model improvement. Our method is implemented in an open-source platform with REST API endpoints and interactive 3D visualization.
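
Split conformal prediction, as named in the abstract, follows a standard recipe that fits in a few lines. The sketch below uses the usual finite-sample quantile, the ceil((n + 1)(1 - alpha))-th smallest calibration score, and a normalized variant in which residuals are divided by a per-point uncertainty estimate `sigma`. How the paper obtains `sigma` from MC Dropout, and how it handles spatial fields, is not shown; the function names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested level."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def absolute_scores(y_true, y_pred):
    """Plain nonconformity scores: absolute residuals."""
    return [abs(t - p) for t, p in zip(y_true, y_pred)]


def normalized_scores(y_true, y_pred, sigma, eps=1e-12):
    """Normalized scores: residuals divided by an uncertainty estimate."""
    return [abs(t - p) / (s + eps) for t, p, s in zip(y_true, y_pred, sigma)]


def prediction_interval(y_pred, q_hat, sigma=1.0):
    """Interval y_pred +/- q_hat * sigma; sigma=1.0 gives fixed-width intervals."""
    half_width = q_hat * sigma
    return (y_pred - half_width, y_pred + half_width)


# Calibration residuals 1..19 at alpha = 0.1: the 18th smallest score is used.
q_hat = conformal_quantile(list(range(1, 20)), alpha=0.1)
confident = prediction_interval(10.0, q_hat=2.0, sigma=0.5)
uncertain = prediction_interval(10.0, q_hat=2.0, sigma=2.0)
```

Under exchangeability of calibration and test points, this construction covers the true value with probability at least 1 - alpha on average over calibration draws. The normalized variant keeps that guarantee while making intervals narrower where `sigma` is small and wider where it is large, matching the adaptive-width behaviour the abstract describes.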


2606.09919 2026-06-10 cs.LG cs.AI cs.MA cs.RO New submission

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teams

Michal P. Podolinsky, Neel P. Bhatt, Pranay Samineni, Rohan Siva, Christian Ellis, Ufuk Topcu

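AI summary: Proposes Co-GLANCE, a system that distills a vision-language model for real-time occlusion segmentation and robot allocation, and combines conformal prediction with selective abstention to provide uncertainty quantification with statistical guarantees that drives active perception. In real-world scenarios, occlusion segmentation and robot allocation accuracy improve by 25% and 36%, respectively, and inference latency drops by 350x.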

Details

Comments: Code, videos, and dataset available at https://co-glance.github.io/

AI Chinese abstract

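Perceptual uncertainty is a core challenge for heterogeneous robot teams operating in unstructured outdoor environments, because no single viewpoint provides reliable scene understanding. Perceptual uncertainty arising from sources such as occlusion manifests differently across robot viewpoints depending on scene structure. Detecting and resolving its sources requires scene-based contextual reasoning and capability-aware robot allocation. Although vision-language models provide strong semantic priors for both, they are too computationally expensive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, removing the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention, providing statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most suitable robot to acquire informative viewpoints and resolve the uncertainty. In real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency by 350x. We also release an air-ground dataset for future research. Code, videos, and dataset are available at https://co-glance.github.io/.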

English abstract

Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency by 350x. We also release an air-ground dataset for future research. Code, videos, and dataset are available at https://co-glance.github.io/.
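One way to picture how conformal prediction can be paired with selective abstention for a discrete decision such as robot allocation is sketched below in Python. This is a generic illustration under stated assumptions, not Co-GLANCE's actual pipeline: the robot labels `"uav"` and `"ugv"`, the hand-written calibration probabilities, the score `1 - p(true label)`, and the rule "abstain unless the prediction set is a singleton" are all hypothetical choices made for the example.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a threshold on the score 1 - p(true label)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample-corrected rank (1-indexed) for coverage level 1 - alpha.
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else 1.0

def prediction_set(probs, q_hat):
    """All labels whose score 1 - p does not exceed the calibrated threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= q_hat]

def decide(probs, q_hat):
    """Commit on a singleton set; otherwise abstain and request another view."""
    labels = prediction_set(probs, q_hat)
    if len(labels) == 1:
        return ("commit", labels[0])
    return ("abstain", labels)

# Hypothetical calibration data: model probability assigned to the true robot.
true_label_probs = [0.95, 0.9, 0.9, 0.85, 0.8, 0.8, 0.7, 0.5, 0.3, 0.2]
cal_probs = [{"uav": p, "ugv": 1.0 - p} for p in true_label_probs]
cal_labels = ["uav"] * len(cal_probs)

q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

confident = decide({"uav": 0.9, "ugv": 0.1}, q_hat)
ambiguous = decide({"uav": 0.55, "ugv": 0.45}, q_hat)
print(q_hat, confident, ambiguous)
```

In this toy setup the confident frame yields a single-label set and is committed, while the ambiguous frame yields a two-label set and is deferred. In a system like the one described, such a deferral is the kind of signal that could trigger dispatching another robot for a better viewpoint.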
