arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.20045 2026-06-19 cs.CV cs.AI New submission

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun

Affiliations: School of Information Science and Engineering, Shandong University; Faculty of Engineering and Information Technology, University of Technology Sydney; School of Computer Science and Technology, Shandong University; School of Artificial Intelligence, Shandong University; School of Computer Science and Artificial Intelligence, Shandong Normal University; Interdisciplinary Research Center of General Artificial Intelligence, Shandong Normal University

AI summary: To address the insufficient evaluation of precise reaching once the target is visible in UAV vision-language navigation, the paper proposes the UAV-VLN-FOV task and the 3DG-VLN framework, which strengthens fine-grained visual grounding and spatial alignment with dynamic 3D direction cues and markedly raises the success rate in benchmark and real-world experiments.

Comments 12 pages, 7 figures

Abstract

UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark that contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at https://github.com/xuefanfu/3DG-VLN.
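The online update of the target-relative direction can be pictured with a minimal Python sketch. It assumes an estimate of the target position is already available in the same frame as the UAV position, which is an illustrative simplification; in the abstract this cue comes from vision-language grounding. The function names, step size, and reach radius are hypothetical and are not taken from 3DG-VLN.

```python
import math


def target_relative_direction(uav_position, target_position):
    """Return the unit vector from the UAV to the target and the distance."""
    delta = [t - u for u, t in zip(uav_position, target_position)]
    distance = math.sqrt(sum(d * d for d in delta))
    if distance == 0.0:
        return (0.0, 0.0, 0.0), 0.0
    return tuple(d / distance for d in delta), distance


def fly_to_target(start, target, step=1.0, reach_radius=0.5, max_steps=100):
    """Closed-loop toy controller: recompute the direction cue before every move."""
    position = tuple(start)
    for _ in range(max_steps):
        direction, distance = target_relative_direction(position, target)
        if distance <= reach_radius:
            break  # close enough to count as reached
        move = min(step, distance)  # never overshoot the target
        position = tuple(p + move * d for p, d in zip(position, direction))
    return position
```

Because the direction is recomputed from the current position at every step, an error made in one step does not carry over into the next, which is the intuition behind reducing accumulated direction drift.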

2606.20044 2026-06-19 cs.CV New submission

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

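Affiliations: School of Cyber Science and Engineering, Xi'an Jiaotong University; School of Informatics, Xiamen University; National University of Singapore

AI summary: Proposes FUSE, a frequency-domain framework that addresses low-frequency bias in multi-modal re-identification through two stages, spectral disentanglement and energy alignment, improving mAP by 9.1% across three datasets.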

Comments Accepted in ICML 2026

Abstract

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low-, mid-, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
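To make the idea of low, mid, and high frequency bands and their energy concrete, here is a rough Python sketch on a 1D signal. It uses fixed band cut-offs and a plain discrete Fourier transform, whereas the abstract describes an adaptive, learned partition; the function names, cut-offs, and the L1 misalignment measure are assumptions for illustration only.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, fine for short illustrative signals."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def band_energy_shares(signal, low_cut, mid_cut):
    """Fraction of spectral energy falling in the low, mid, and high bands."""
    n = len(signal)
    energy = {"low": 0.0, "mid": 0.0, "high": 0.0}
    for k, coeff in enumerate(dft(signal)):
        freq = min(k, n - k)  # fold negative frequencies onto positive ones
        band = "low" if freq <= low_cut else "mid" if freq <= mid_cut else "high"
        energy[band] += abs(coeff) ** 2
    total = sum(energy.values()) or 1.0  # avoid dividing by zero on an all-zero signal
    return {band: value / total for band, value in energy.items()}


def energy_misalignment(shares_a, shares_b):
    """L1 gap between two band-energy profiles (0 means identical profiles)."""
    return sum(abs(shares_a[band] - shares_b[band]) for band in shares_a)
```

A constant signal puts all of its energy in the low band and a rapidly alternating one puts it in the high band, so their misalignment reaches the maximum value of 2; a regularizer in this spirit would push the profiles of two modalities toward each other.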

2606.20037 2026-06-19 cs.LG New submission

Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET

Loukas Ilias, Anthi-Maria Vozinaki, Christos Ntanos, Dimitris Askounis

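Affiliations: DSS Lab, School of ECE, NTUA

AI summary: Proposes a multimodal model for Alzheimer's disease diagnosis that combines 3D convolutional feature extractors with three fusion strategies (concatenation, Gated Multimodal Unit, gated self-attention) and a sparsely gated Mixture-of-Experts classifier; results on three binary classification tasks support the value of input-adaptive modeling.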

Comments 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Abstract

Alzheimer's disease (AD) is an irreversible neurodegenerative disorder and a leading cause of death worldwide. Early diagnosis is especially important at the Mild Cognitive Impairment (MCI) stage, where timely intervention can help slow progression to AD. Neuroimaging data, such as Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) scans, can help detect the disease early by revealing the structural and functional brain changes associated with it. Yet, many multimodal models still fuse MRI and PET with static concatenation and apply identical computation to all subjects, which limits robustness to patient/site heterogeneity and can waste computation. To address these limitations, we present the first study of combining 3D convolutional feature extractors with three fusion strategies, namely concatenation, Gated Multimodal Unit (GMU), and gated self-attention, and a sparsely gated Mixture-of-Experts (MoE) classifier that performs input-adaptive routing, activating only the most informative experts per case. Finally, we utilize Grad-CAM to visualize disease-related regions, ensuring model interpretability. Experiments are performed across three binary classification tasks (NC vs. MCI, MCI vs. AD, and NC vs. AD). Results show that GMU achieves accuracies of 80.46% (NC vs. MCI) and 95.47% (NC vs. AD), while gated self-attention attains 82.08% on MCI vs. AD. Ablations show that removing the MoE consistently degrades accuracy across all tasks. These findings underscore the value of input-adaptive, multimodal modeling for AD diagnosis by leveraging the complementary nature of MRI and PET.
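As an illustration of how a Gated Multimodal Unit differs from static concatenation, the following Python sketch implements the commonly cited two-modality GMU formulation: each modality gets a tanh projection, and a sigmoid gate computed from both inputs mixes them per feature. The weight matrices and the tiny feature vectors are made-up placeholders; the paper's actual layer sizes and 3D feature extractors are not shown here.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def gmu(x_mri, x_pet, w_mri, w_pet, w_gate):
    """Two-modality Gated Multimodal Unit on plain Python lists."""
    h_mri = [math.tanh(v) for v in matvec(w_mri, x_mri)]  # MRI hidden features
    h_pet = [math.tanh(v) for v in matvec(w_pet, x_pet)]  # PET hidden features
    # The gate sees both modalities (concatenated) and outputs one weight per feature.
    z = [1.0 / (1.0 + math.exp(-v)) for v in matvec(w_gate, x_mri + x_pet)]
    return [zi * a + (1.0 - zi) * b for zi, a, b in zip(z, h_mri, h_pet)]
```

With an all-zero gate matrix the gate equals 0.5 everywhere and the unit reduces to a plain average of the two modalities; learned gate weights let the mix vary per subject, which is the input-adaptive behavior the abstract emphasizes.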

2606.20035 2026-06-19 cs.CV cs.LG New submission

PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation

Ziyuan Li, Osamah Sufyan, Uwe Jaekel, Babette Dellen

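Affiliations: Department of Mathematics, Informatics and Technology, University of Applied Sciences Koblenz; Technical University of Munich

AI summary: Proposes PU-UNet, which uses stabilized product-unit residual blocks to realize explicit multiplicative feature interactions at low-resolution stages, improving Dice and IoU on three medical image segmentation datasets and lowering the false-positive rate.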

Comments Accepted to ICANN 2026

Abstract

Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic-exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
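A product unit computes a weighted product of its inputs, which is usually written as the exponential of a weighted sum of logarithms. The Python sketch below shows one way the two stabilizers named in the abstract could fit together: a softplus as the smooth positivity mapping and a clamp on the log-domain sum. The choice of softplus, the epsilon, the clip value, and clipping the sum rather than individual logarithms are all assumptions, not details confirmed by the paper.

```python
import math


def softplus(x):
    """Numerically stable softplus, always strictly positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_product_unit(inputs, exponents, eps=1e-6, clip=10.0):
    """exp(sum_i w_i * log(softplus(x_i) + eps)), with the log-domain sum clamped."""
    log_sum = sum(w * math.log(softplus(x) + eps) for x, w in zip(inputs, exponents))
    log_sum = max(-clip, min(clip, log_sum))  # keeps exp() within a safe range
    return math.exp(log_sum)
```

Without the positivity mapping the logarithm is undefined for non-positive activations, and without the clamp a few large exponents can overflow the exponential; the two guards together bound the output between `exp(-clip)` and `exp(clip)`.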

2606.20034 2026-06-19 cs.LG New submission

Exploring the potential of AlphaEarth and TESSERA embeddings for Fine-scale Local Climate Zone Mapping: A case study across five cities in Switzerland

Htet Yamin Ko Ko, Clement Atzberger

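AI summary: Compares TESSERA and AlphaEarth embeddings with conventional Sentinel-1/2 data, using an attention U-Net to refine coarse LCZ maps to 10 m; the embedding-based models transfer better across cities and are more accurate, but transfer across years remains a challenge.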

Abstract

Understanding urban spatial morphology is critical for climate modeling, risk assessment, and sustainable urban design, and Local Climate Zone (LCZ) mapping provides the basic framework for this. However, many cities still use coarse ~100-m resolution LCZ records, which are unsuitable for fine-scale urban research. In this study, precomputed embeddings from TESSERA (Feng et al., 2025) and AlphaEarth (Brown et al., 2025) are compared to traditional Sentinel-1/2 (S1S2) composites in five Swiss cities to see if they can upscale coarse LCZ maps to 10-m resolution using an attention-based U-Net. Three experiments assess multi-city transferability, the impact of higher-resolution reference data, and temporal robustness to year-to-year phenology changes. We find that all datasets achieve strong performance, with test-data Intersection-over-Union (IoU) of 0.59-0.69 and 0.77-0.82 in the first and second experiments, respectively. TESSERA consistently outperforms both S1S2 and AlphaEarth across both settings. As expected, we find that the transfer of embedding-based models from one year to another remains an open challenge. Overall, however, our results demonstrate the promising potential of embeddings derived from Earth observation (EO) foundation models to reduce time-consuming preprocessing and manual feature engineering tasks and to guide a universal deep learning-based LCZ mapping workflow. When combined with a simple location-aware attention U-Net architecture, the embeddings enhance regional transferability and scalability, supporting the development of comprehensive and reproducible fine-scale LCZ maps for global urban climate applications. Improving reference data quality remains the strongest lever for further accuracy gains.
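For readers unfamiliar with the metric, the Intersection-over-Union values quoted above can be computed along the lines of this small Python sketch. It works on flattened label maps and averages per-class IoU; whether the study averages over classes in exactly this way is an assumption, and the class labels in the example are arbitrary.

```python
def class_iou(predicted, reference, cls):
    """IoU for one class on two equally long, flattened label maps."""
    intersection = sum(1 for p, r in zip(predicted, reference) if p == cls and r == cls)
    union = sum(1 for p, r in zip(predicted, reference) if p == cls or r == cls)
    return intersection / union if union else float("nan")


def mean_iou(predicted, reference, classes):
    """Average IoU over the classes that occur in either map."""
    scores = [class_iou(predicted, reference, c) for c in classes]
    scores = [s for s in scores if s == s]  # drop NaN for classes absent from both maps
    return sum(scores) / len(scores)
```

For example, with predicted labels `[1, 1, 2, 2, 3, 3]` and reference labels `[1, 2, 2, 2, 3, 1]`, the per-class scores are 1/3, 2/3, and 1/2.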

2606.20032 2026-06-19 cs.CV New submission

ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu

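Affiliations: School of Computer Science and Technology, Tongji University; College of Surveying and Geo-Informatics, Tongji University

AI summary: Proposes a training-free, reliability-aware open-vocabulary change detection framework that uses semantic change reasoning and boundary-aware refinement to address two problems, instance-level comparison missing fine-grained changes and pixel-level comparison being unreliable, improving F1 by 2.13% to 9.75% on multiple datasets.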

Abstract

Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty through validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_{1}^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at https://github.com/Funny0101/ReA-OVCD.

2606.20031 2026-06-19 cs.RO cs.AI New submission

A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems

Junzhe Xu, Zecui Zeng, Lusong Li, Yuetong Fang, Renjing Xu

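Affiliations: The Hong Kong University of Science and Technology (Guangzhou); JD Explore Academy

AI summary: Proposes the SDQN-RMFS framework, which uses ANN-to-SNN conversion and hard-label knowledge distillation to run ultra-low-power pathfinding on a neuromorphic chip, cutting energy use by up to 11,281× relative to a GPU and latency by nearly half.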

Abstract

Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281× energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.
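Hard-label distillation can be read as training the student to reproduce only the teacher's chosen action rather than its full output distribution. The Python sketch below expresses that reading as a cross-entropy loss against the argmax of the teacher's Q-values. Treating the student's outputs as ordinary logits is an assumption; in a spiking network they might be spike counts or membrane potentials, and the paper's exact loss is not given in the abstract.

```python
import math


def hard_label_distillation_loss(teacher_q_values, student_logits):
    """Cross-entropy between the student's softmax and the teacher's greedy action."""
    # The hard label is the action the teacher policy would actually take.
    hard_label = max(range(len(teacher_q_values)), key=teacher_q_values.__getitem__)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(student_logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in student_logits))
    return log_norm - student_logits[hard_label]
```

Because only the argmax matters, the student does not need to match the scale or shape of the teacher's Q-values, which is one way to sidestep a mismatch between the two output distributions.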

2606.20027 2026-06-19 cs.CV New submission

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Luca Zedda, Davide Antonio Mura, Cecilia Di Ruberto, Maurizio Atzori, Muhammed Furkan Dasdelen, Carsten Marr, Andrea Loddo

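Affiliations: Department of Mathematics and Computer Science, University of Cagliari; Institute of AI for Health, Helmholtz Munich

AI summary: Proposes QG-MIL, a gated transformer aggregator that counters attention concentration with RMSNorm pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU feed-forward modules, gaining an average of +6.1 macro F1 points across six benchmarks.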

Abstract

Attention-based Multiple Instance Learning aggregators in medical imaging are prone to attention concentration, producing overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL across six benchmarks spanning whole-slide pathology and cell-level hematology, covering two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average improvement of +6.1 mean macro F1 points. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and tightest variance when compared to selected baselines. We release a configurable implementation to support reproducibility at: https://github.com/unica-visual-intelligence-lab/QG-MIL
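Three of the building blocks named above have widely used textbook forms, sketched here in plain Python on single vectors: RMSNorm, a sigmoid output gate, and a SwiGLU feed-forward layer. This is a generic illustration, not the released QG-MIL code; the helper names, weight shapes, and the way the gate is computed from the block input are assumptions.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def rms_norm(x, gain, eps=1e-6):
    """Scale x to unit root-mean-square, then apply a learned per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_output(attention_out, x, w_gate):
    """Element-wise sigmoid gate, computed from x, applied to the attention output."""
    gate = [1.0 / (1.0 + math.exp(-v)) for v in matvec(w_gate, x)]
    return [g * a for g, a in zip(gate, attention_out)]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    hidden = [silu(g) * u for g, u in zip(matvec(w_gate, x), matvec(w_up, x))]
    return matvec(w_down, hidden)
```

Normalizing before attention and gating its output both limit how strongly any single instance can dominate the aggregated bag representation, which matches the stated goal of spreading attention more evenly.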

2606.20023 2026-06-19 cs.SE cs.AI cs.CL New submission

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

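AI summary: Addresses the safety problem of LLM agents preferring high-privilege tools during tool selection; proposes the ToolPrivBench evaluation framework, finds that over-privileged selection is widespread among mainstream agents and amplified by transient failures, and designs a privilege-aware post-training defense that effectively reduces unnecessary high-privilege tool use.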

Comments code: https://github.com/AISafetyHub/agent-tool-selection-bias

Abstract

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
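The least-privilege behavior the benchmark looks for can be stated as a simple rule, sketched below in Python: among the tools that cover the required capabilities, pick the one with the lowest privilege, and on a transient failure retry that same tool instead of escalating. The `Tool` record, the numeric privilege levels, and the use of `TimeoutError` as the transient failure are hypothetical and are not part of ToolPrivBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int  # higher means more powerful and riskier
    capabilities: frozenset


def select_tool(tools, required):
    """Return the lowest-privilege tool whose capabilities cover the request."""
    sufficient = [tool for tool in tools if required <= tool.capabilities]
    if not sufficient:
        raise LookupError("no tool covers the required capabilities")
    return min(sufficient, key=lambda tool: tool.privilege)


def call_with_retry(tools, required, invoke, max_retries=2):
    """Retry the selected tool on transient failures; never escalate silently."""
    tool = select_tool(tools, required)
    for _ in range(max_retries + 1):
        try:
            return tool.name, invoke(tool)
        except TimeoutError:
            continue  # transient failure: try the same low-privilege tool again
    raise TimeoutError(f"{tool.name} kept failing; escalation needs explicit approval")
```

An over-privileged agent, by contrast, would reach for the broader tool either at the first choice or right after the first timeout, which are the two behaviors the abstract says the benchmark measures.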

2606.20015 2026-06-19 cs.LG New submission

Adaptive Distance-Aware Trunk Deep Operator Learning for Long-Span Roadway Bridges

Bilal Ahmed, Diab W. Abueidda, Waleed El-Sekelly, Tarek Abdoun, Mostafa E. Mobasher

Affiliations: Urban Engineering Department, New York University Abu Dhabi, United Arab Emirates; National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, United States of America; Department of Structural Engineering, Mansoura University, Mansoura, Egypt

AI summary: Proposes an adaptive-trunk DeepONet framework that uses KNN to build load-dependent learning domains, distance-aware features, and stiffness-informed Schur complement reconstruction to predict localized responses of long-span bridges quickly and accurately, with relative errors below 5% and a speedup of about 60×.

Comments 39 pages, 26 figures

Abstract

Long-span roadway bridges exhibit highly localized structural responses under vehicular loading, making repeated FE analysis computationally expensive for applications such as influence surface generation and structural digital twins. Existing SciML approaches struggle to accurately capture these localized responses. To address this challenge, this study proposes an adaptive-trunk DeepONet for localized structural response prediction in large-scale bridge systems. The framework dynamically constructs a load-dependent learning domain using a KNN strategy, allowing the network to focus on structural influence zones. The trunk network is further enhanced using distance-aware features that encode the geometric relationship between the load and structural nodes. A physics-based full-field reconstruction is incorporated through a stiffness-informed Schur complement formulation, enabling predictions at adaptive nodes to be extended to the entire structural domain. To enable scalable training, response data are generated using a reduced-order equivalent shell model that preserves the dominant global behavior while significantly reducing computational cost. The proposed framework is validated on both a benchmark bridge model and the real-world Mussafah Bridge. Results show that the method achieves FEM-level accuracy with relative errors below 5%, while reducing the total response evaluation time (including full-field reconstruction) by approximately 60x; excluding the post-processing reconstruction step, the AD-DeepONet inference is up to four orders of magnitude faster than FEM. In addition, the framework enables rapid generation of full-field responses, influence lines, and influence surfaces under arbitrary vehicular loading configurations, demonstrating strong potential for large-scale bridge analysis and digital twin applications.
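One plausible reading of the stiffness-informed full-field reconstruction is the back-substitution step of static condensation: once displacements are known on a subset of nodes (here, the adaptive nodes), the remaining displacements follow from the partitioned equilibrium equations K_rr u_r = f_r - K_ra u_a. The Python sketch below applies this to a four-node spring chain. The partitioning, the toy stiffness matrix, and the helper names are assumptions; the paper's actual formulation may differ.

```python
def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def reconstruct_full_field(stiffness, load, known_idx, u_known):
    """Recover all displacements from values predicted on a subset of nodes."""
    rest = [i for i in range(len(load)) if i not in known_idx]
    k_rr = [[stiffness[i][j] for j in rest] for i in rest]
    # Move the known displacements to the right-hand side: f_r - K_ra u_a.
    rhs = [
        load[i] - sum(stiffness[i][j] * u for j, u in zip(known_idx, u_known))
        for i in rest
    ]
    u_rest = solve(k_rr, rhs)
    full = [0.0] * len(load)
    for i, value in zip(known_idx, u_known):
        full[i] = value
    for i, value in zip(rest, u_rest):
        full[i] = value
    return full
```

For a chain of unit springs fixed at one end and pulled with unit force at the free end, the exact displacements are 1, 2, 3, 4; supplying only the last two recovers the first two from the stiffness matrix alone, without a second network evaluation.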

2606.20014 2026-06-19 cs.LG cs.AI New submission

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén

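AI summary: Proposes a hierarchical architecture in which an LLM acts as a central strategic controller that selects among RL skill policies; in a 2v2 adversarial environment it reaches a win rate comparable to a hand-crafted BT and is perceived as the most human-like.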

Comments 12 pages, 9 figures

Abstract

Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and "Flat" RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4% vs 51.5% win rate, p=0.103) while both significantly outperform Flat RL trained without skill decomposition. A user study (n=15) reveals that 60% of participants perceive LLM+RL agents as the most human-like (p=0.027), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.

2606.20010 2026-06-19 cs.LG New submission

Self-Adaptive Scale Handling for Forecasting Time Series with Scale Heterogeneity

Xu Zhang, Zhengang Huang, Yunzhi Wu, Xun Lu, Erpeng Qi, Yunkai Chen, Zhongya Xue, Peng Wang, Wei Wang

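Affiliations: College of Computer Science and Artificial Intelligence, Fudan University; Ant Group

AI summary: Proposes a self-adaptive scale-handling module that learns adaptive scale factors to preserve semantic discriminability and reduce inverse-scaling errors, improving mainstream forecasting models on fund sales datasets.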

Comments This is the full version of the paper accepted by the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). The code and dataset are available at https://github.com/Meteor-Stars/ASTSF

Abstract

Current time series forecasting (TSF) research predominantly focuses on scale-homogeneous data, where different time series share similar numerical magnitude ranges. However, in real-world industrial scenarios such as financial product sales, different time series often differ by orders of magnitude (scale heterogeneity). Since these series share similar temporal patterns, joint modeling is desirable for better data utilization, yet existing scaling methods either compress low-scale signals (global normalization) or destroy semantic discriminability and amplify inverse-scaling errors (window-based scaling). This paper proposes a self-Adaptive Scale-handling (AS) module that learns adaptive scale factors tailored to each input, preserving semantic discriminability while reducing inverse-scaling errors. AS consists of Scale Calibrating (SC), which calibrates prior mean scaling factors through neural networks, and Scaling Selection (SS), which decides whether to apply calibration or retain the original factor, avoiding over-calibration. Experiments on real-world fund sales datasets from Ant Fortune and Alipay show that AS seamlessly integrates into popular TSF models and consistently improves their performance. The code and dataset are available at the link https://github.com/Meteor-Stars/ASTSF.

2606.20008 2026-06-19 cs.LG New submission

VIMPO: Value-Implicit Policy Optimization for LLMs

Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao

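Affiliations: UC Berkeley; Yale University

AI summary: Proposes VIMPO, which derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning, achieving fine-grained credit assignment without training a critic and outperforming GRPO on mathematical reasoning benchmarks.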

Abstract

Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially large gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.
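The value recurrence mentioned above can be reconstructed from the standard optimality condition of KL-regularized RL. The derivation below is a sketch under stated assumptions (temperature β, reference policy π_ref, deterministic token transitions, no discounting); the symbols are chosen here for illustration, and the paper's exact loss and notation may differ.

```latex
% Assumptions: KL-regularized objective with temperature \beta and reference
% policy \pi_{\mathrm{ref}}; state s_{t+1} is s_t with token a_t appended.
\begin{align}
\pi^{*}(a \mid s)
  &= \pi_{\mathrm{ref}}(a \mid s)\,
     \exp\!\Big(\tfrac{1}{\beta}\big(Q^{*}(s,a) - V^{*}(s)\big)\Big)
  \;\Longrightarrow\;
  \beta \log \frac{\pi^{*}(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
  = Q^{*}(s,a) - V^{*}(s), \\
Q^{*}(s_t, a_t)
  &= r_t + V^{*}(s_{t+1}), \\
V^{*}(s_t)
  &= r_t + V^{*}(s_{t+1})
     - \beta \log \frac{\pi^{*}(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)},
  \qquad V^{*}(s_T) = 0 .
\end{align}
% With a single outcome reward R on the final token (r_t = 0 otherwise),
% unrolling the recurrence from the terminal condition gives
\begin{equation}
V^{*}(s_t) = R - \beta \sum_{k=t}^{T-1}
  \log \frac{\pi^{*}(a_k \mid s_k)}{\pi_{\mathrm{ref}}(a_k \mid s_k)} .
\end{equation}
```

Under these assumptions every prefix gets its own value from log-ratios alone, which is how a per-token signal can arise without a separately trained critic.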

2606.20005 2026-06-19 cs.LG cs.AI 新提交

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

StreamKL: 快速且内存高效的KL散度用于提升注意力蒸馏

Guangda Liu, Yiquan Wang, Chengwei Li, Wenhao Chen, Jing Lin, Yiwu Yao, Danning Ke, Wenchao Ding, Jieru Zhao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Huawei(华为) Fudan University(复旦大学)

AI总结 提出StreamKL,首个融合GPU原语,通过在线公式和逐块重计算将注意力蒸馏的内存和IO成本从O(N_QN_K)降至O(1),实现高达43倍前向和14倍反向加速。

详情
AI中文摘要

注意力蒸馏通过最小化Kullback-Leibler (KL)散度来训练一个注意力分布匹配另一个,广泛应用于知识蒸馏、模型压缩、持续学习和稀疏注意力LLM训练。然而,现有方法在计算KL归约前需要具体化两个注意力分布,导致$O(N_QN_K)$的内存和IO成本,在长上下文长度下变得不可接受。我们提出StreamKL,首个用于注意力KL散度的融合GPU原语,消除了这种二次具体化。StreamKL推导了一种新颖的在线公式用于耦合的双分布KL归约,使得单个前向内核能够通过片上SRAM流式处理查询-键块。对于反向传播,StreamKL逐块重计算注意力概率,避免存储二次中间结果。我们进一步设计并实现了具有专用优化的高效GPU内核。实验表明,StreamKL在前向和反向传播中分别比基线方法快高达43倍和14倍。最重要的是,StreamKL将注意力蒸馏的额外HBM占用从$O(N_QN_K)$减少到$O(1)$,使得在单个GPU上进行长上下文蒸馏成为可能。

英文摘要

Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to $43\times$ and $14\times$ speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, enabling long-context distillation on a single GPU.
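
One way to see why the KL reduction can be streamed is the identity KL(p || q) = sum_j p_j (s_j - t_j) - logsumexp(s) + logsumexp(t), where p = softmax(s) and q = softmax(t). The sketch below applies it to a single query row in pure Python, keeping five running scalars. It is an illustration under assumptions (the KL direction, the tile size, and all function names are hypothetical) and is not the StreamKL GPU kernel.

```python
import math


def streaming_kl(s, t, tile=4):
    """KL(softmax(s) || softmax(t)) for one query row, one key tile at a time."""
    m_s = m_t = -math.inf   # running maxima of the two score rows
    l_s = l_t = 0.0         # running sums of exp(score - max)
    acc = 0.0               # running sum of exp(s_j - m_s) * (s_j - t_j)
    for start in range(0, len(s), tile):
        s_blk = s[start:start + tile]
        t_blk = t[start:start + tile]
        new_m_s = max(m_s, max(s_blk))
        new_m_t = max(m_t, max(t_blk))
        # Rescale earlier partial sums to the new maxima (exp(-inf) == 0.0).
        scale_s = math.exp(m_s - new_m_s)
        scale_t = math.exp(m_t - new_m_t)
        l_s *= scale_s
        acc *= scale_s
        l_t *= scale_t
        for sj, tj in zip(s_blk, t_blk):
            w = math.exp(sj - new_m_s)
            l_s += w
            acc += w * (sj - tj)
            l_t += math.exp(tj - new_m_t)
        m_s, m_t = new_m_s, new_m_t
    return acc / l_s - (m_s + math.log(l_s)) + (m_t + math.log(l_t))


def reference_kl(s, t):
    """Same quantity, computed by materializing both distributions."""
    def softmax(x):
        m = max(x)
        e = [math.exp(v - m) for v in x]
        z = sum(e)
        return [v / z for v in e]
    p, q = softmax(s), softmax(t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The streamed version never stores the full probability rows, so its extra state per query is constant in the number of keys.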

2606.20002 2026-06-19 cs.LG cs.AI cs.CL 新提交

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Connect the Dots:通过强化学习训练具备跨域泛化能力的长期生命周期智能体

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出Connect the Dots框架,通过端到端强化学习训练LLM在长期任务中自我更新上下文并泛化到新领域,实验验证了跨域泛化能力。

Comments Work in progress; we will continuously update the codebase and arXiv version

详情
AI中文摘要

本文提出了一个通用框架,用于训练大型语言模型(LLMs)具备“Connect the Dots”(CoD)这一元能力,该能力是长期生命周期智能体所必需的:当基于LLM的AI智能体部署在环境中时,它解决一系列长期任务,同时持续探索环境、从自身经验中学习,并迭代地自我更新关于环境的上下文,从而在更新上下文的条件下,在未来任务上实现逐步更好的性能。CoD框架的主要组成部分包括:(1)用于端到端强化学习(RL)的算法设计和基础设施,其中包含交替执行任务和更新上下文的长展开序列;(2)用于在训练过程中激励和激发LLM中目标元能力的任务和环境,以及在评估过程中忠实衡量进展的任务和环境。我们展示了CoD框架的概念验证实现,包括具有细粒度信用分配的GRPO风格RL算法,以及针对目标元能力(而非特定领域的LLM能力或标准的逐任务RL)量身定制的任务和环境。实证结果验证了CoD设置中端到端RL训练的有效性,并展示了所激发元能力的分布外泛化潜力——在训练领域内、跨不同领域以及从CoD到Ralph-loop设置中。我们对CoD的研究连接了多项先前工作,并为推进LLM和AI智能体开辟了新的机遇。为促进进一步研究和应用,我们在 https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod 上发布了我们的实现。

英文摘要

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

2606.19998 2026-06-19 cs.RO cs.AI cs.CV cs.LG 新提交

Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory

Tri-Info: 基于信息论的VLA模型可泛化、可解释的故障预测

Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang, Zhengyang Hu, Yanchao Yang

发表机构 * InfoBodied AI Lab, The University of Hong Kong(香港大学信息具身人工智能实验室) HKU Musketeers Foundation Institute of Data Science(香港大学同心基金数据科学研究院)

AI总结 提出Tri-Info方法,通过信息论信号捕捉动作多样性、时间一致性和状态耦合,实现跨架构、环境及仿真到现实的零样本故障检测,准确率达83%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地部署在各种任务中,但它们仍然是黑箱,其物理交互可能导致不可逆的伤害,因此需要可泛化和可解释的故障检测。我们观察到成功和失败的轨迹具有系统不同的信息论特征。基于此,我们将VLA控制形式化为闭环信息管道,并推导出三重信息论(Tri-Info)信号,这些信号捕捉动作是否保持多样性、时间一致性以及与状态转换的耦合。在六个VLA模型和三个基准环境中,Tri-Info在域内匹配最强的基线。此外,Tri-Info无需重新训练即可跨架构、环境和仿真到现实差距迁移,在现实世界任务中达到83%的准确率,而先前的检测器则降至随机水平。这确立了Tri-Info作为一种简单而强大的方法,不仅能够检测故障并具有强大的跨域泛化能力,还能提供底层故障模式的可解释诊断。

英文摘要

Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.

2606.19996 2026-06-19 cs.SD cs.CL 新提交

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

基于自编码器与对比学习的段级普通话语音认知障碍检测

Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang

发表机构 * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(上海交通大学自动化与智能感知学院) Key Laboratory of System Control and Information Processing, Ministry of Education of China(教育部系统控制与信息处理重点实验室) Shanghai Key Laboratory of Perception and Control in Industrial Network Systems(上海市工业网络系统感知与控制重点实验室) Department of Computer Science and Engineering, University of Bologna(博洛尼亚大学计算机科学与工程系) Department of Mathematical, Physical and Computer Sciences, University of Parma(帕尔马大学数学、物理与计算机科学系)

AI总结 提出段级表示学习框架,结合自编码器和对比学习,在四个普通话数据集上实现稳定的二分类和三分类认知障碍检测,尤其改善了临床困难的三分类性能。

Comments 15 pages, 7 figures, 5 tables

详情
AI中文摘要

背景与目标:语音已成为一种低成本、非侵入性的数字生物标志物,在认知障碍检测方面具有巨大潜力。然而,有限的标注数据和跨数据集变异性仍然是构建稳健的语音筛查系统的主要挑战。方法:我们开发了一个用于语音认知障碍检测的段级表示学习框架。将语音录音分割成短片段并转换为语谱图表示。为了在有限数据条件下提高鲁棒性,将离线和在线增强策略与基于自编码器的表示学习和对比目标相结合,以增强判别性潜在表示。结果:在四个独立的普通话语音数据集上进行的实验表明,在二分类和三分类任务中均取得了稳定且有竞争力的性能,尤其是在临床具有挑战性的三分类设置中取得了显著改进。消融研究进一步支持了所提框架的有效性。结论:研究结果表明,段级语音表示学习可能为资源受限的临床环境中的认知障碍筛查提供一种可扩展且实用的方法。

英文摘要

Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.

2606.19993 2026-06-19 cs.LG 新提交

Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

激活与影响感知秩 (AIR):保持功能的SVD压缩用于大语言模型

Nico Harder, Daniel Becking, Karsten Mueller, Wojciech Samek

AI总结 提出AIR框架,基于SVD和反向信号影响度量,通过单次交替最小二乘扫描实现权重矩阵的低秩近似,在参数保留≤60%时困惑度比SVD-LLM(W)改善>18%,并减少90%校准数据。

Comments Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM), Seoul, South Korea (non-archival)

详情
AI中文摘要

我们提出了激活与影响感知秩(AIR),一个基于SVD的大语言模型压缩框架,它使用反向信号影响度量来指导每个权重矩阵的低秩近似。从SVD-LLM(W)的激活感知最优解出发,AIR运行单次封闭形式的交替最小二乘(ALS)扫描,在单调下降保证下逐元素整合影响。AIR是层局部的,并与端到端方法正交组合:单独使用时超过ACIP,AIR+LoRA进一步超越。AIR在参数保留≤60%时,困惑度比SVD-LLM(W)改善超过18%,使用约90%更少的校准数据达到相同质量,并将参数节省转化为FLOP、峰值内存和每令牌延迟的收益。

英文摘要

We present Activation- and Influence-Aware Ranks (AIR), an SVD-based LLM compression framework that guides each weight matrix's low-rank approximation with a backward-signal influence metric. Starting from the activation-aware optimum of SVD-LLM(W), AIR runs a single closed-form alternating least squares (ALS) sweep that integrates influence element-wise under a monotone-descent guarantee. AIR is layer-local and composes orthogonally with end-to-end methods: alone it exceeds ACIP, and AIR+LoRA outperforms it further. AIR improves perplexity over SVD-LLM(W) by >18% at <=60% parameter retention, matches its quality with ~90% less calibration data, and turns parameter savings into FLOP, peak-memory, and per-token latency gains.
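
To make "parameter retention" concrete: replacing an m x n weight matrix with a rank-r factorization (an m x r factor times an r x n factor) stores r(m + n) values instead of mn. The sketch below solves for the largest rank that fits a retention budget. It is plain arithmetic under that storage assumption; the function names are hypothetical, and AIR or SVD-LLM may account for parameters differently.

```python
def factored_params(m, n, rank):
    """Parameters stored by a rank-`rank` factorization of an m x n matrix."""
    return rank * (m + n)


def rank_for_retention(m, n, retention):
    """Largest rank whose factorization keeps at most `retention` * m * n values."""
    return max(1, int(retention * m * n / (m + n)))


if __name__ == "__main__":
    m = n = 4096
    r = rank_for_retention(m, n, 0.6)
    print(r, factored_params(m, n, r) / (m * n))
```

For a square matrix the retained fraction is 2r/n, so a factorization only saves parameters when the rank is below half the matrix dimension.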

2606.19992 2026-06-19 cs.SE cs.AI 新提交

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

超越静态端点:工具程序作为灵活智能体网络服务的接口

Mugeng Liu, Shuoqi Li, Yixuan Zhang, Yun Ma

AI总结 提出ToolPro,将工具意图表示为可执行程序,通过约束引导构建、效应感知重放和策略决策,在MCP服务上实现最高53.4%的延迟降低和96.1%的流量减少。

Comments Accepted by ICML 2026

详情
AI中文摘要

在智能体网络时代,基于LLM的智能体越来越多地将网络服务作为工具调用,然而大多数接口仍然是静态端点,难以表达包含循环、条件、连接和重试的长周期工作流。我们提出ToolPro,它将智能体的工具意图表示为一个可执行工具程序,该程序紧凑地编码了多步服务交互并带有显式效应类型。ToolPro结合了约束引导的程序构建、用于精确一次状态修改调用的效应感知重放,以及一个基于配置文件的策略,该策略决定何时程序执行优于逐步调用。我们在具有WebAssembly沙箱的MCP风格服务上实例化ToolPro,并在现实应用的各种工作流上进行了评估。ToolPro将端到端延迟降低了高达53.4%,客户端流量减少了高达96.1%,在网络延迟和工作流复杂度更高时收益更大。

英文摘要

In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4% and client-side traffic by up to 96.1%, with larger gains under higher network latency and workflow complexity.

2606.19990 2026-06-19 cs.AI 新提交

Reward as An Agent for Embodied World Models

奖励作为具身世界模型的智能体

Pu Li, Zhigang Lin, Qiang Wu, Yongxuan Lv, Fei Wang, Shan You

发表机构 * ACE Robotics(ACE机器人)

AI总结 提出奖励智能体框架和动态感知 rollout 多样化方法,通过鲁棒验证支持更广泛探索,缓解奖励黑客问题,提升世界模型性能。

详情
AI中文摘要

虽然强化学习已成为改进世界模型的有前景工具,现有方法大多依赖于训练分布附近的保守 rollout,限制了探索、行为多样性和更丰富的动态发现。在这项工作中,我们挑战这种保守范式。我们认为核心限制不是探索本身,而是缺乏支持更广泛探索的可靠验证策略。没有可靠的验证,扩展的探索极易受到奖励黑客攻击,即策略利用不完美的奖励而未能实现真正的改进。为了评估这一动机,我们在具身世界模型中实例化我们的方法,其中物理合理性和任务完成性为复杂动态下的可扩展强化学习提供了严格的测试平台。在验证方面,我们引入奖励作为智能体,一种主动评估生成行为以提供鲁棒奖励信号并减轻分布偏移下奖励黑客攻击的智能体奖励框架。在探索方面,我们通过 DynDiff-GRPO 引入动态感知 rollout 多样化,显式扩展动作空间探索以多样化轨迹、拓宽状态-动作覆盖范围,并鼓励超越保守 rollout 机制的更丰富具身行为。通过将奖励作为智能体与 DynDiff-GRPO 统一,我们在更可靠的奖励基础上实现强化学习,并大幅多样化采样,有效缓解奖励黑客攻击,同时在多个开源世界模型上取得显著的精度提升,从而证明当基于鲁棒验证时,更广泛的探索可以成功扩展。

英文摘要

While RL has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To evaluate this motivation, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open-source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.

2606.19989 2026-06-19 cs.DC cs.LG 新提交

Online Dynamic Batching with Formal Guarantees for LLM Training

面向LLM训练的具有形式保证的在线动态批处理

Dian Li, Zekun Wang, Yaoru Wang, Jiahong Yan

AI总结 提出在线动态批处理(ODB)系统,在数据加载器侧将批构建延迟到样本真实成本可观测时,解决离线批采样中预处理成本不可见问题,实现1.58-4.43x吞吐量提升,并提供无死锁有界终止的形式化保证。

Comments 29 pages, 3 figures, 21 tables

详情
AI中文摘要

现代LLM训练打破了离线批采样器背后的一个核心假设:样本的真实训练成本只有在预处理、增强、模板化、分词和多模态视觉标记扩展之后才能观察到。除非为依赖于预处理和增强的长度缓存付费,否则批构建对于决定填充、内存使用和GPU饱和度的量是盲目的。我们引入了在线动态批处理(ODB),这是一个数据加载器侧的即插即用系统,它将批形成移动到这一精确可观测性点,同时保持DDP步骤对齐。我们将这一同步需求形式化为分布式组对齐问题,并证明了在默认加入模式身份覆盖和可选非加入样本配额封闭下的无死锁有界终止。ODB不需要修改模型、优化器或注意力核,并以轻量级训练器适配器的形式发布为online-dynamic-batching。在UltraChat/LLaVA/ShareGPT4o上对公开的2B/8B Qwen3-VL进行的实验中,与固定批Standard相比,ODB在单节点全量微调/LoRA上实现了1.58-2.51倍的逐字样本吞吐量提升,在两节点全量微调上实现了1.71-3.78倍提升,质量与Standard相当;生产环境MM-Mix达到4.43倍。与GMT/BMT离线令牌预算预言机相比,ODB在UltraChat/LLaVA上差距在15%以内,在高变异系数的ShareGPT4o上更快:单节点全量微调/LoRA为2.24-2.39倍,两节点全量微调为3.06-3.69倍。总之,ODB占据了高异质性LLM微调的在线/即插即用领域:在质量与Standard相当的情况下实现大幅吞吐量提升,提供形式化的DGAP保证,无需长度缓存预计算或核重写。

英文摘要

Modern LLM training breaks a core assumption behind offline batch samplers: the true training cost of a sample is only observable after preprocessing, augmentation, templating, tokenization, and multimodal visual-token expansion. Unless one pays for a preprocessing- and augmentation-dependent length cache, batch construction is therefore blind to the quantity that determines padding, memory use, and GPU saturation. We introduce Online Dynamic Batching (ODB), a DataLoader-side drop-in system that moves batch formation to this point of accurate observability while preserving DDP step alignment. We formalize this synchronization requirement as the Distributed Group Alignment Problem and prove deadlock-free bounded termination with default join-mode identity coverage and opt-in non-join sample-quota closure. ODB requires no model, optimizer, or attention-kernel changes and is released as online-dynamic-batching with lightweight trainer adapters. Across public 2B/8B Qwen3-VL runs on UltraChat/LLaVA/ShareGPT4o, ODB improves literal emitted-sample throughput vs. fixed-batch Standard by 1.58-2.51x on single-node Full FT/LoRA and 1.71-3.78x on two-node Full FT, with Standard-comparable quality; production MM-Mix reaches 4.43x. Against GMT/BMT offline token-budget oracles, ODB is within 15% on UltraChat/LLaVA and faster on high-CV ShareGPT4o: 2.24-2.39x single-node Full FT/LoRA and 3.06-3.69x two-node Full FT. Together, ODB occupies the online/drop-in regime for high-heterogeneity LLM fine-tuning: large throughput gains at Standard-comparable quality, formal DGAP guarantees, and no length-cache precompute or kernel rewrites.
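
The core idea of forming batches only after true sample lengths are known can be sketched with a greedy token-budget batcher. This is a single-process illustration under assumptions: the padded cost of a batch is taken to be its longest sample times its size, and the names `token_budget_batches` and `padded_tokens` are hypothetical. It deliberately omits the distributed step alignment (DGAP) that the abstract describes, and it is not the released online-dynamic-batching package.

```python
def token_budget_batches(lengths, budget):
    """Greedily group post-tokenization lengths under a padded-token budget.

    A batch is emitted as soon as adding the next sample would push
    (longest sample) * (batch size) over `budget`. A single sample longer
    than the budget is emitted as a batch of its own.
    """
    batch, longest = [], 0
    for n in lengths:
        new_longest = max(longest, n)
        if batch and new_longest * (len(batch) + 1) > budget:
            yield batch
            batch, new_longest = [], n
        batch.append(n)
        longest = new_longest
    if batch:
        yield batch


def padded_tokens(batches):
    """Total tokens processed when every batch is padded to its longest sample."""
    return sum(max(b) * len(b) for b in batches)


if __name__ == "__main__":
    lengths = [100, 900, 120, 80, 1000, 50]  # observed after preprocessing
    dynamic = list(token_budget_batches(lengths, budget=1000))
    fixed = [lengths[i:i + 2] for i in range(0, len(lengths), 2)]
    print(dynamic, padded_tokens(dynamic), padded_tokens(fixed))
```

Comparing `padded_tokens` for the two groupings shows how fixed-size batches waste compute when short and long samples land in the same batch.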

2606.19988 2026-06-19 cs.SE 新提交

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

基于大语言模型的仓库级Solidity代码生成:从提示到微调

Shi Chen, Rongcun Wang, Yuan Tian, Xiaoyuan Xie, Wei Song, Rubing Huang

AI总结 提出SolidityBench基准和SolidityScore指标,评估多种LLM方法在仓库级Solidity代码生成中的表现,发现监督微调最有效。

Comments 33 pages

详情
AI中文摘要

大语言模型(LLMs)在通用代码生成方面表现出强大的能力,但其在专业软件领域的有效性仍未得到充分探索。Solidity智能合约代表了一个高风险领域,生成的代码必须满足严格的语言级、安全性和软件工程约束。现有的基准和指标对于仓库级Solidity生成仍然不足,其中模型必须从自然语言需求中合成完整的合约。为了解决这一差距,我们引入了SolidityBench,一个包含5,470个仓库级Solidity智能合约及其自然语言描述的基准。我们还提出了SolidityScore,一种基于Solidity的语义度量,强调领域关键结构,如安全修饰符、合约声明和Solidity特定关键词。使用该基准,我们评估了代表性的代码LLM,包括Qwen2.5-Coder、DeepSeek-Coder和CodeLlama,涵盖零样本提示、思维链推理、上下文学习、检索增强生成和监督微调。结果表明,通用模型在仓库级Solidity生成中表现出系统性的结构缺陷。在非参数方法中,检索增强生成表现最佳,而上下文学习在超过两个示例后因上下文饱和而性能下降。监督微调通过将Solidity特定约束内化到模型参数中实现了最大的改进。总体而言,我们的研究为仓库级Solidity代码生成提供了全面的基准,并表明高质量领域数据结合监督微调是提高LLM生成智能合约可靠性的最有效策略。

英文摘要

Large Language Models (LLMs) have shown strong capabilities in general-purpose code generation, but their effectiveness in specialized software domains remains underexplored. Solidity smart contracts represent a high-stakes domain where generated code must satisfy strict language-level, security, and software-engineering constraints. Existing benchmarks and metrics remain insufficient for repository-level Solidity generation, where models must synthesize complete contracts from natural language requirements. To address this gap, we introduce SolidityBench, a benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions. We also propose SolidityScore, a Solidity-aware semantic metric that emphasizes domain-critical constructs such as security modifiers, contract declarations, and Solidity-specific keywords. Using this benchmark, we evaluate representative code LLMs, including Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama, across zero-shot prompting, Chain-of-Thought reasoning, in-context learning, retrieval-augmented generation, and supervised fine-tuning. The results show that general-purpose models exhibit systematic structural deficiencies in repository-level Solidity generation. Among non-parametric methods, retrieval-augmented generation performs best, while in-context learning degrades beyond two examples due to context saturation. Supervised fine-tuning achieves the largest improvement by internalizing Solidity-specific constraints into model parameters. Overall, our study provides a comprehensive benchmark for repository-level Solidity code generation and shows that high-quality domain data combined with supervised fine-tuning is the most effective strategy for improving the reliability of LLM-generated smart contracts.

2606.19987 2026-06-19 cs.SD eess.AS 新提交

PolSeT: Polish Semantics of Timbre Dataset

PolSeT: 波兰语音色语义数据集

Jan Jasiński

AI总结 介绍PolSeT数据集,通过自由言语化和语义差异实验,收集波兰语语义描述符和音色评分,填补音色研究数据空白,支持跨文化心理声学和MIR研究。

Comments 8 pages, 7 figures. Data descriptor for the PolSeT dataset (Polish Semantics of Timbre), available at https://doi.org/10.5281/zenodo.17830609 under CC BY 4.0

详情
AI中文摘要

本数据报告介绍了PolSeT(波兰语语义音色)数据集,该数据集旨在促进波兰语及跨文化背景下的心理声学和音乐信息检索(MIR)研究。数据集包含两个连续实验的数据。实验1(N=60)是一项自由言语化任务,旨在创建波兰语语义描述符词汇表。使用11个刺激,共收集了1901个描述符(701个唯一)。实验2(N=105)利用该词汇表进行语义差异研究,参与者对18种乐器声音在8个双极量表上进行评分,并进行了重复试验以进行信度分析。发布的数据集包括原始听众响应、全面的人口统计数据(经验、性别、年龄)、音频刺激以及提取的声学特征及Python提取代码。该数据集填补了开放音色研究数据的空白,为心理声学研究和多语言语义嵌入模型的训练提供了必要的定性语言基础和定量评分。

英文摘要

This data report introduces PolSeT (Polish Semantic Timbre), a dataset designed to facilitate research in psychoacoustics and Music Information Retrieval (MIR) in Polish and cross-cultural contexts. The dataset contains data from two sequential experiments. Experiment 1 (N=60) was a free-verbalization task aimed at creating a lexicon of Polish semantic descriptors. Using 11 stimuli, a total of 1901 descriptors (701 unique) were gathered. Experiment 2 (N=105) utilized this lexicon to conduct a semantic differential study, where participants rated 18 instrument sounds on 8 bipolar scales, with repeated trials for reliability analysis. The released dataset includes raw listener responses, comprehensive demographics (experience, gender, age), audio stimuli, and extracted acoustic features with Python extraction code. This dataset addresses a gap in open timbre research data, providing both the qualitative linguistic groundwork and the quantitative ratings necessary for psychoacoustic research and the training of multilingual semantic embedding models.

2606.19985 2026-06-19 cs.CV 新提交

Vision-Reasoning-Guided Occlusion Removal from Light Fields

视觉推理引导的光场遮挡去除

Mohamed Youssef, Oliver Bimber

发表机构 * Johannes Kepler University(约翰·开普勒大学)

AI总结 提出结合光场积分与视觉语言模型的框架,通过多视图融合和语义先验恢复被遮挡场景,在合成和真实数据上取得最优性能。

详情
AI中文摘要

遮挡鲁棒的场景恢复仍然是计算成像中的一个主要挑战,特别是在自然环境中,密集的前景植被严重限制了可见性。我们提出了一种视觉推理引导的光场遮挡去除框架,该框架结合了光场积分(LFI)的可见性恢复能力和视觉语言模型(VLM)的语义推理能力。首先通过LFI集成多视图观测以抑制前景遮挡,生成初始的可见性增强表示。然后,引入VLM作为条件语义先验,在观测测量的指导下恢复退化结构并恢复细节。为了提高恢复一致性并减少幻觉伪影,我们引入了一种多样本融合策略,将多个生成的假设聚合为统一的估计。在合成和真实世界数据集上的实验结果表明,该方法达到了最先进的性能,在四个合成光场基准场景(4-Syn)上取得了最高的平均SSIM,并在结构化和非结构化采集设置中表现出强大的泛化能力。这些结果凸显了将物理成像约束与视觉语言推理相结合在严重遮挡下实现鲁棒感知的有效性,可应用于搜索救援和探索性机器人导航。

英文摘要

Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.

2606.19984 2026-06-19 cs.LG 新提交

Kolmogorov-Arnold Reservoir Computing

Kolmogorov-Arnold 储层计算

Juntian Huang, Jurgen Kurths, Ying Tang

发表机构 * Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China(电子科技大学基础与前沿科学研究所) Potsdam Institute for Climate Impact Research(波茨坦气候影响研究所) Department of Physics, Humboldt University Berlin(柏林洪堡大学物理系) Research Institute of Intelligent Complex Systems, Fudan University(复旦大学智能复杂系统研究所) School of Physics, University of Electronic Science and Technology of China(电子科技大学物理学院) Key Laboratory of Quantum Physics and Photonic Quantum Information, Ministry of Education, University of Electronic Science and Technology of China(电子科技大学教育部量子物理与光子量子信息重点实验室) Non-classical Information Science Basic Discipline Research Center of Sichuan Province, University of Electronic Science and Technology of China(电子科技大学四川省非经典信息科学基础学科研究中心)

AI总结 提出Kolmogorov-Arnold储层计算(KARC),用显式基函数展开替代储层,结合KAN的表达能力和储层计算的闭式训练,在偏微分方程等基准上优于现有方法。

详情
AI中文摘要

储层计算为预测动力系统提供了轻量级框架,但由于表示能力有限,可能难以捕捉长程依赖。传统储层计算循环使用可训练储层,对超参数敏感,而下一代储层计算以特征维度快速增长为代价去除了循环。在此,我们开发了Kolmogorov-Arnold储层计算(KARC),它受Kolmogorov-Arnold表示定理启发,用显式基函数展开替代储层。我们严格证明KARC是Kolmogorov-Arnold网络(KAN)的轻量级设计,保留了KAN的潜在表达能力,同时允许储层计算的高效闭式训练。在相当的成本下,KARC在包括偏微分方程在内的挑战性基准上优于现有储层计算方法。它还可以与生成扩散模型集成用于文本到图像生成。因此,本工作建立了储层计算与KAN之间的原则性桥梁,实现了高效高保真的动力系统预测。

英文摘要

Reservoir computing offers a lightweight framework for forecasting dynamical systems but may struggle to capture long-range dependencies due to limited representational capacity. Conventional reservoir computing recurrently uses trainable reservoirs with hyperparameter sensitivity, while the next-generation reservoir computing removes recurrence at the cost of rapidly growing feature dimensions. Here, we develop Kolmogorov-Arnold Reservoir Computing (KARC), which replaces reservoirs with explicit basis-function expansions inspired by the Kolmogorov-Arnold representation theorem. We rigorously show that KARC is a lightweight design of Kolmogorov-Arnold networks (KANs), preserving the potential expressive capacity of KANs while admitting efficient closed-form training of reservoir computing. At comparable cost, KARC outperforms existing reservoir computing methods on challenging benchmarks including partial differential equations. It can also be integrated with generative diffusion models for text-to-image generation. This work thus establishes a principled bridge between reservoir computing and KANs, enabling efficient and high-fidelity dynamical system forecasting.
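
The combination of an explicit basis-function expansion with closed-form training can be sketched on a toy system. The example below expands the state of the logistic map in a polynomial basis and fits a linear readout by ridge regression through the normal equations. The polynomial basis, the ridge constant, the toy system, and all function names are assumptions made for illustration; this is not the KARC architecture itself.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_readout(xs, ys, degree=2, ridge=1e-10):
    """Closed-form ridge fit of y ~ sum_d w[d] * x**d (no gradient descent)."""
    feats = [[x ** d for d in range(degree + 1)] for x in xs]
    k = degree + 1
    gram = [[sum(f[i] * f[j] for f in feats) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(k)]
    return solve(gram, rhs)


def logistic_series(r=3.9, x0=0.2, n=200):
    """Trajectory of the logistic map x -> r * x * (1 - x)."""
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


if __name__ == "__main__":
    series = logistic_series()
    print(fit_readout(series[:-1], series[1:]))
```

Since the logistic map is itself a quadratic, the fitted weights should land close to (0, r, -r), which gives a simple check that the closed-form solve works.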

2606.19983 2026-06-19 cs.CR 新提交

A Measurement Study of Cryptographic Misuse in Embodied AI Mobile Applications

具身AI移动应用中加密误用的测量研究

Junchao Li, Xuelei Wang, Yuhang Huang, Qi Wang, Boyang Ma, Xuelong Dai, Minghui Xu, Yue Zhang

AI总结 首次大规模测量具身AI移动应用的加密误用,通过自动化语义分析管道发现12,975个误用实例,揭示延迟敏感控制路径和离线配置导致的结构性安全权衡。

详情
AI中文摘要

具身AI (EAI) 移动应用正从辅助用户界面演变为主动控制路径组件,直接将移动端加密安全与网络物理信任联系起来。尽管发生了这种转变,现有的安全研究主要关注具身AI设备和云基础设施,而移动控制层作为关键攻击面在很大程度上未被探索。为了弥补这一差距,我们提出了首个针对EAI移动生态系统内加密误用的大规模测量研究。我们构建了EAIAppZoo,一个涵盖六个EAI领域的507个真实世界应用的基准测试,并采用自动化语义分析管道来测量五种主要加密失效模式的普遍性和特征。我们的测量结果产生了12,975个误用发现(评估精度为80.74%),揭示这些加密失效是由EAI特定的工程约束而非随机开发者错误驱动的。我们揭示了结构性的安全权衡:延迟敏感的控制路径系统性地削弱了传输保护,而对离线设备配置和遗留物联网SDK的严重依赖加剧了本地硬编码认证凭证的问题。通过真实世界案例研究,我们展示了这些移动端加密缺陷如何绕过名义上的网络保护,使攻击者能够拦截命令通道并劫持EAI实体的物理控制。最终,我们的发现强调,移动应用已成为网络物理系统中一个脆弱但被忽视的加密信任边界。

英文摘要

Embodied AI (EAI) mobile applications are evolving from auxiliary user interfaces into active control-path components, directly linking mobile-side cryptographic security to cyber-physical trust. Despite this shift, existing security research predominantly focuses on embodied AI devices and cloud infrastructures, leaving the mobile control layer largely unexplored as a critical attack surface. To bridge this gap, we present the first large-scale measurement study of cryptographic misuse within the EAI mobile ecosystem. We construct EAIAppZoo, a benchmark of 507 real-world applications across six EAI domains, and employ an automated semantic-aware analysis pipeline to measure the prevalence and characteristics of five major cryptographic failure modes. Our measurement yields 12,975 misuse findings (with an evaluated precision of 80.74%), revealing that these cryptographic failures are driven by EAI-specific engineering constraints rather than random developer errors. We uncover structural security trade-offs: latency-sensitive control paths systematically weaken transport protection, while the heavy reliance on offline device provisioning and legacy IoT SDKs exacerbates the local hardcoding of authentication credentials. Through real-world case studies, we demonstrate how these mobile-side cryptographic flaws bypass nominal network protections, enabling adversaries to intercept command channels and hijack the physical control of EAI entities. Ultimately, our findings highlight that mobile applications have become a fragile, yet overlooked, cryptographic trust boundary in cyber-physical systems.

2606.19980 2026-06-19 cs.AI 新提交

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

ENPIRE: 现实世界中智能体机器人策略的自我改进

Wenli Xiao, Jia Xie, Tonghe Zhang, Haotian Lin, Letian "Max" Fu, Haoru Xue, Jalen Lu, Yi Yang, Cunxi Dai, Zi Wang, Jimmy Wu, Guanzhi Wang, S. Shankar Sastry, Ken Goldberg, Linxi "Jim" Fan, Yuke Zhu, Guanya Shi

发表机构 * NVIDIA(英伟达) CMU(卡内基梅隆大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出ENPIRE框架,通过环境重置、策略执行、结果验证和迭代优化的闭环反馈,使编码智能体自主改进机器人操作策略,在灵巧操作任务上达到99%成功率。

详情
AI中文摘要

在现实世界中实现灵巧的机器人操作严重依赖人工监督和算法工程,这成为追求通用物理智能的核心瓶颈。尽管新兴的编码智能体可以生成代码来自动化算法搜索,但其成功主要局限于数字环境。我们推测,自动化机器人研究缺失的抽象是一个可重复的反馈循环,用于现实世界策略改进:重置场景、执行策略、验证结果并优化下一次迭代。为弥补这一差距,我们引入ENPIRE,一个用于编码智能体的框架,通过四个核心模块实例化这一物理反馈例程:环境模块(EN)用于自动重置和验证,策略改进模块(PI)启动策略优化,推出模块(R)用于评估一个或多个并行运行的物理机器人的策略,以及进化模块(E),其中编码智能体分析日志、查阅文献、改进训练基础设施和算法代码以解决失败模式。这一闭环系统将现实世界操作学习转化为可控的优化过程,在最小化人工努力的同时,允许对训练方案和智能体变体进行公平消融。在ENPIRE的支持下,前沿编码智能体可以自主训练策略,在具有挑战性的灵巧操作任务(如整理针盒、紧固扎带和工具使用)上达到99%的成功率,并且当我们派遣智能体团队在机器人集群上工作时,这一过程会进一步加速。我们的结果展示了将编码智能体部署到物理世界中自主推进机器人技术的实用且可扩展的路径。

英文摘要

Achieving dexterous robotic manipulation in the real world heavily relies on human supervision and algorithm engineering, which becomes a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction to automate robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, and improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advance robotics in the physical world.

2606.19975 2026-06-19 cs.CY cs.AI 新提交

The Algorithmic-Human Manager: AI, Apps, and Workers in the Indian Gig Economy

算法-人类管理者:印度零工经济中的AI、应用程序与工人

Omir Kumar, Krishnan Narayanan

AI总结 本文研究AI和数字技术对印度蓝领零工经济中算法管理的影响,发现其虽扩大就业机会但引发公平性、透明度和工人尊严问题,提出算法-人类管理者混合治理模型。

Comments Published by the Centre for Responsible AI (CeRAI) at IIT Madras

详情
AI中文摘要

本文考察了人工智能和数字技术对印度蓝领零工经济的影响,重点关注算法管理——即在基于位置的服务(如拼车和配送)中使用自动化系统来分配、监控和评估工作。采用社会正义框架和混合方法(包括对16名零工工人和21名关键利益相关者的访谈),研究揭示了一个双重现实:虽然AI驱动的系统扩大了工作机会并产生了运营效率,但它们同时引入了与公平、透明度和工人尊严相关的重大挑战。关键发现表明,算法系统设计上不透明,产生不公平的结果,并且其结构不能为额外劳动提供相应报酬。研究倡导一种务实的混合治理模型——算法-人类管理者框架,其中技术效率和人类问责制共同运作而非对立。研究结果对政策制定者、平台公司以及致力于为印度和全球南方的零工经济设计公平AI治理框架的民间社会组织具有启示意义。

英文摘要

This paper examines the impact of artificial intelligence and digital technologies on the blue-collar gig economy in India, focusing on algorithmic management: the use of automated systems to allocate, monitor, and evaluate work in location-based services such as ride sharing and delivery. Using a social justice framework and a mixed-methods approach comprising interviews with 16 gig workers and 21 key stakeholders, the study uncovers a dual reality: while AI-powered systems expand access to work and generate operational efficiencies, they simultaneously introduce significant challenges related to fairness, transparency, and worker dignity. Key findings reveal that algorithmic systems are opaque by design, produce inequitable outcomes, and are not structured to reward additional labour with proportionate pay. The study advocates for a pragmatic hybrid governance model, an Algorithmic-Human Manager framework, in which technological efficiency and human accountability operate together rather than in opposition. The findings carry implications for policymakers, platform companies, and civil society organizations working to design equitable AI governance frameworks for the gig economy in India and across the Global South.

2606.19971 2026-06-19 cs.RO 新提交

Evaluation of Augmented Reality-based Intuitive Interface for Robot-Assisted Transesophageal Echocardiography: A User Study

基于增强现实的机器人辅助经食管超声心动图直观界面评估:用户研究

Xiu Zhang*, Matteo Di Mauro*, Sofia Breschi, Angela Peloso, Emiliano Votta, Arianna Menciassi, Elena De Momi

AI总结 本研究提出并评估了一种基于增强现实的直观界面,用于机器人辅助经食管超声心动图,通过3D可视化与尖端控制显著提升空间精度并降低操作误差。

详情
AI中文摘要

经食管超声心动图(TEE)对于诊断和引导结构性心脏病(SHD)介入治疗至关重要。然而,手动TEE操作需要操作者具备丰富的专业技能,体力消耗大,并且在透视下操作会使临床医生暴露于辐射中。机器人辅助TEE系统已被引入以改进探头操作并减少操作者疲劳,但直观有效的用户界面设计仍是一个开放挑战。本研究提出并评估了一种模型增强的、基于增强现实(AR)的直观界面,用于机器人辅助TEE,旨在提高空间意识和控制直观性。使用集成电磁跟踪和虚拟模拟器的机器人TEE平台,比较了三种在可视化和交互模式上不同的用户界面:2D关节级(2D-JI)、3D关节级(3D-JI)和3D尖端级(3D-TI)。36名参与者执行标准化导航任务以再现目标超声心动图视图,通过位置和方向误差、完成时间和NASA-TLX工作量评分评估性能。结果表明,3D可视化显著提高了空间精度,与2D界面相比,中位位置误差从13毫米减少到3毫米,方向误差减半。尖端级交互相比关节级控制,方向误差进一步降低50%,并减少了用户间变异性。总体而言,3D-TI配置结合了沉浸式可视化与直接尖端级控制,被证明是最有效且符合人体工程学的界面,支持将基于AR的可视化和直观控制范式集成到下一代机器人TEE系统中,以增强操作者性能和手术安全性。

英文摘要

Transesophageal Echocardiography (TEE) is essential for diagnosing and guiding Structural Heart Disease (SHD) interventions. However, manual TEE manipulation demands significant operator expertise, is physically demanding, and exposes clinicians to radiation when performed alongside fluoroscopy. Robotic-assisted TEE systems have been introduced to improve probe handling and reduce operator fatigue, yet the design of intuitive and effective user interfaces remains an open challenge. This study presents and evaluates a model-enhanced, Augmented Reality (AR)-based intuitive interface for robot-assisted TEE, designed to improve spatial awareness and control intuitiveness. A robotic TEE platform integrated with electromagnetic tracking and a virtual simulator was used to compare three user interfaces differing in visualization and interaction modalities: 2D joint-level (2D-JI), 3D joint-level (3D-JI), and 3D tip-level (3D-TI). Thirty-six participants performed standardized navigation tasks to reproduce target echocardiographic views, with performance assessed via position and orientation errors, completion time, and NASA-TLX workload scores. Results show that 3D visualization significantly improved spatial accuracy, reducing median position error from 13 mm to 3 mm and halving the orientation error compared with the 2D interface. Tip-level interaction yielded a further 50% reduction in orientation error and reduced inter-user variability relative to joint-level control. Overall, the 3D-TI configuration, combining immersive visualization with direct tip-level control, proved the most effective and ergonomic interface, supporting the integration of AR-based visualization and intuitive control paradigms into next-generation robotic TEE systems to enhance operator performance and procedural safety.

2606.19970 2026-06-19 cs.CV 新提交

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

CrossFlow: 跨潜在空间与像素空间的单步生成

Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang

发表机构 * Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) Tencent(腾讯) Fudan University(复旦大学)

AI总结 提出CrossFlow,一种跨空间流模型,将噪声潜在输入直接映射到像素图像,通过无速度单步目标实现潜在到像素的生成,并替代潜在扩散中的解码器,在ImageNet-1k上达到1.62 FID。

Comments Preprint, Under Review

详情
AI中文摘要

大多数扩散和流匹配生成器在相同的表示空间中定义先验、概率路径和预测目标。潜在扩散通过将该路径移动到自编码器潜在空间来提高效率,但最终样本仍由单独训练的解码器生成。这种分离造成了不匹配:生成器针对潜在空间预测进行优化,而最终质量取决于解码器如何处理可能与干净编码器输出不同的生成潜在变量。我们引入了CrossFlow,一种跨空间流公式,将噪声潜在输入直接映射到像素空间图像。关键技术步骤是一个无速度的单步目标:潜在轨迹定义了训练路径,但监督预测是图像而非潜在位移。这使得一个模型既可以作为单步潜在到像素生成器,也可以作为潜在扩散管道的解码器替代品。在类别条件ImageNet-1k $256\times256$上,CrossFlow-XL通过一次函数评估达到了1.62 FID。消融实验表明,潜在编码器以及像素空间感知和对抗损失对保真度很重要。这些结果表明,跨空间流目标可以结合潜在表示的效率与直接像素空间监督,而无需在推理时使用单独的解码器。

英文摘要

Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.