arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17928 2026-05-19 cs.RO cs.LG

Transfer Learning for Customized Car Racing Environments

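Benedict Florance Arockiaraj, Richard Chang, Wesley Yee

Affiliation: SEAS (glossed in the source as "School of Systems Engineering and Science")

AI summary: This paper studies transfer learning in deep reinforcement learning. An agent is trained on a single track and then raced on other customized car racing environments, either by zero-shot transfer or with further fine-tuning, with the goal of faster lap times. The paper also compares model-based and model-free methods.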

Abstract

Transfer learning, a technique in which a model or agent exploits the knowledge it gained on one task to solve another closely related task, is often used to tackle problems in deep learning. In this project, we explore transfer learning in the context of deep reinforcement learning. Specifically, we use transfer learning to achieve fast lap times in OpenAI's Car Racing environment by training the agent on one circuit and racing it on other customized target environments, either by zero-shot transfer or with additional fine-tuning. In addition, we compare the performance of model-based and model-free approaches, and observe that model-based approaches dominate in performance and converge faster than model-free approaches in this environment. We observe that transfer learning in most setups not only boosts performance on the target domain but also yields high performance during learning.

2605.17927 2026-05-19 cs.RO

Learning-Based Adaptive Control for Surgical Robotic Exposure Task on Deformable Tissues

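Jiayi Liu, Kaiqi Wei, Yiwei Wang, Huan Zhao, Han Ding

Affiliation: Huazhong University of Science and Technology

AI summary: This paper proposes a learning-based adaptive control framework for autonomous tissue retraction, a task made difficult by the irregular geometry and nonlinear biomechanical properties of the overlying tissue and by the limited field of view during surgery. The framework optimizes control inputs online and uses a deep deformation estimation model to achieve zero-shot adaptation.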

Comments Accepted to Robotics: Science and Systems (RSS) 2026. 12 pages, 9 figures

Abstract

In various surgical procedures, regions of interest (ROIs) such as organs or lesions are often occluded by overlying tissues, requiring surgeons to achieve adequate exposure for precise intervention. However, the irregular geometry, nonlinear biomechanical properties of overlying tissues, and limited intraoperative visibility of the ROI pose significant challenges to the autonomous execution of tissue retraction. To address this, we formulate a realistic model of the tissue retraction task and propose a learning-based adaptive control framework for achieving ROI exposure. The method optimizes control inputs online by monitoring changes in the visual boundary of the tissue, while leveraging a deep deformation estimation model trained on simulation data to identify the optimal grasping point and ensure the convergence and safety of the adaptive controller. Through simulations and real-world experiments on different deformable materials, it has been demonstrated that this framework exhibits zero-shot adaptation to similar tasks and can complete the autonomous retraction process, from initial grasp selection to full ROI exposure. Therefore, it has the potential to be applied in actual surgical assistance scenarios.

2605.17918 2026-05-19 cs.LG cs.AI cs.CV

Domain Transfer Becomes Identifiable via a Single Alignment

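Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu

Affiliation: School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA

AI summary: This paper proposes a new route to domain-transfer identifiability that combines a structural sparsity condition with a single paired anchor sample, reducing the reliance on supervision signals. It also introduces an efficient Jacobian sparsity regularizer to support high-dimensional learning.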

Abstract

Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer -- requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.
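
The regularizer mentioned in the abstract avoids forming the Jacobian explicitly. The sketch below is only one plausible reading of "randomized masked finite differences", not the authors' formulation: it perturbs a random subset of input coordinates and penalizes the L1 size of the resulting output change, which approximates the Jacobian applied to the mask. The function names, the 0.5 mask probability, and the step size are all assumptions made for illustration.

```python
import random

def masked_fd_penalty(f, x, n_masks=20, eps=1e-4, seed=0):
    """Average L1 size of the output change when a random subset of inputs is nudged."""
    rng = random.Random(seed)
    base = f(x)
    total = 0.0
    for _ in range(n_masks):
        # Hypothetical choice: each input coordinate is selected with probability 0.5.
        mask = [rng.random() < 0.5 for _ in x]
        shifted = [xi + eps if m else xi for xi, m in zip(x, mask)]
        # Finite-difference quotient approximates (Jacobian @ mask); no Jacobian is built.
        diff = [(a - b) / eps for a, b in zip(f(shifted), base)]
        total += sum(abs(d) for d in diff)
    return total / n_masks

def sparse_map(x):
    """Toy map whose Jacobian is the identity: each output depends on one input."""
    return list(x)

def dense_map(x):
    """Toy map whose Jacobian is all ones: every output depends on every input."""
    s = sum(x)
    return [s for _ in x]

x0 = [0.3, -1.2, 2.0]
print(masked_fd_penalty(sparse_map, x0), masked_fd_penalty(dense_map, x0))
```

On these toy maps the dense map receives the larger penalty. Note that entries of opposite sign in the masked columns can cancel, so a penalty of this form is a surrogate rather than an exact measure of Jacobian sparsity.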

2605.17915 2026-05-19 cs.CV

SurgLQA: Scalable Long-Horizon Surgical Video Question Answering

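Diandian Guo, Xikai Yang, Ruiyang Li, Jialun Pei, Pheng-Ann Heng

Affiliation: The Chinese University of Hong Kong

AI summary: This paper proposes the SurgLQA framework, which combines Faithful Temporal Consolidation with Temporally-Grounded Multi-Policy Scaling to model long-range dynamics in long-horizon surgical video question answering and to improve reasoning over surgical workflows.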

Comments MICCAI 2026 Early Accept

Abstract

Surgical Video Question Answering (VideoQA) provides a promising paradigm for dynamic intraoperative interpretation, enabling real-time decision support and context-aware retrieval in clinical environments. Nevertheless, existing approaches are predominantly restricted to images or short clips, limiting their ability to model long-range procedural dynamics and causal dependencies across extended surgical workflows. To address this challenge, we propose SurgLQA, a unified long-horizon VideoQA framework for scalable surgical reasoning. This framework incorporates Faithful Temporal Consolidation (FTC), which leverages intrinsic temporal cues to construct compact long-range representations while preserving fine-grained temporal fidelity. Further, we develop Temporally-Grounded Multi-Policy Scaling (TMS), an adaptive test-time inference paradigm that strategically adjusts policy-level reasoning capacity within temporally grounded contexts. To facilitate systematic evaluation, we restructured a long-duration colonoscopy VideoQA benchmark, Colon-LQA, and conducted extensive experiments on Colon-LQA and REAL-Colon-VQA. Experimental results demonstrate that our approach achieves consistent performance gains in long-range reasoning with temporally grounded inference. Code link: https://github.com/RascalGdd/SurgLQA.

2605.17912 2026-05-19 cs.RO cs.CV

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Yu Shang, Yinzhou Tang, Yiding Ma, Zhuohang Li, Lei Jin, Weikang Su, Xin Jin, Zhaolu Wang, Ziyou Wang, Xin Zhang, Haisheng Su, Weizhen He, Wei Wu, Haoyi Duan, Gordon Wetzstein, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Chen Gao, Yong Li

Affiliations: Tsinghua University; Shanghai Jiao Tong University; Zhejiang University; Stanford University; The University of Hong Kong; Princeton University; Chinese Academy of Sciences; University of Science and Technology of China; Peking University; National University of Singapore

AI summary: This paper presents WorldArena 2.0, which extends the evaluation of embodied world models along three dimensions (modality, functionality, and platform) and provides a comprehensive testbed for tracking their progress.

Abstract

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned future and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 comprehensively evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

2605.17911 2026-05-19 cs.CL

A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration

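Hayden Moore, Suman Saha, Mahfuza Farooque

Affiliation: Department of Computer Science and Engineering, College of Engineering, The Pennsylvania State University

AI summary: This paper introduces a pilot benchmark for translating natural language into first-order logic in the planetary exploration domain. The dataset is built from real mission documentation from NASA's PDS and manually annotated with FOL representations, supporting research at the intersection of language understanding and formal reasoning.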

Abstract

Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json

2605.17907 2026-05-19 cs.CV cs.AI

One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

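Yang Li, Weize Li, Quan Yuan, Congzhang Shao, Guiyang Luo, Yunqi Ba, Xuanhan Zhu, Xinyuan Ding, Xiaoyuan Fu, Jinglin Li

Affiliation: State Key Laboratory of Networking and Switching Technology

AI summary: This paper proposes UniTrans, a universal any-to-any feature modality translation model. It pretrains a bank of translator expert parameters and learns their combination coefficients so that translators can be instantiated zero-shot, and it outperforms existing methods on the OPV2V-H and DAIR-V2X datasets.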

Comments 19 pages, accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

By sharing intermediate features, collaborative perception extends each agent's sensing beyond standalone limits, but real-world feature modality heterogeneity remains a key barrier to effective fusion. Most existing methods, including direct adaption and protocol-based transformation, typically rely on training adapters for newly emerging feature modalities and often require additional retraining or fine-tuning. Such repeated training is costly and is often infeasible across manufacturers due to model and data privacy constraints, limiting real-world scalability. To address this issue, we propose UniTrans, a universal any-to-any feature modality translation model that instantiates translators on the fly for arbitrary modalities. UniTrans pretrains a bank of translator expert parameters and learns their combination coefficients as a function of source-to-target modality mapping. The mapping is measured in a modality-intrinsic latent space, where an intrinsic encoder extracts modality-specific yet scene-invariant codes from single-frame intermediate features, enabling UniTrans to instantiate translators in a zero-shot manner. Experiments on OPV2V-H and DAIR-V2X demonstrate that UniTrans consistently outperforms state-of-the-art methods in both simulated and real-world settings, enabling efficient any-to-any translation through a universal model. The code is available at https://github.com/CheeryLeeyy/UniTrans.

2605.17904 2026-05-19 cs.CV

Beyond Euclidean Prototypes: Spectral Disentanglement and Geodesic Matching for Few-Shot Medical Image Segmentation

Penghao Jia, Zhiyong Huang, Mingyang Hou, Zhi Yu, Shuai Miao, Jiahong Wang, Yan Yan

Affiliations: School of Microelectronics and Communication Engineering, Chongqing University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

AI summary: This paper proposes the Spectral-Geodesic Prototype Network (SGP-Net), which uses a Spectral Prototype Bank and a Geodesic Matcher to address prototype entanglement and topology-blind matching in few-shot medical image segmentation, encoding shape, texture, and boundary cues in disentangled prototypes.

Abstract

Few-Shot Medical Image Segmentation (FSMIS) aims to delineate novel anatomical targets from one or a few annotated support images, addressing the annotation scarcity in medical imaging. Notwithstanding recent advancements, current prototype-based methods are bottlenecked by two coupled limitations: 1) cue entanglement, where a single spatial-domain prototype is forced to summarise organ silhouette, parenchymal texture and boundary appearance simultaneously, so any support-query mismatch on one cue propagates indiscriminately to the others; and 2) topology-blind matching, where cosine similarity measures distance in the ambient Euclidean space and ignores the connectivity of the underlying feature manifold, causing fragmented activations inside low-contrast organs and leakage into neighbouring tissues. To this end, we propose Spectral-Geodesic Prototype Network (SGP-Net), built around a Spectral-Geodesic Prototype Module with two coupled components. A Spectral Prototype Bank (SPB) decomposes support and query features into low-, mid- and high-frequency bands via learnable radial Fourier filters, yielding three disentangled prototypes per class that separately encode shape, texture and boundary cues. A Geodesic Matcher (GM) then replaces cosine similarity with a differentiable heat-diffusion approximation of geodesic distance, propagating matching signals along a feature affinity graph so that on-manifold pixels accumulate consistent responses while off-manifold look-alikes are suppressed. Experiments on three public FSMIS benchmarks demonstrate that SGP-Net achieves competitive performance against recent state-of-the-art methods.

2605.17903 2026-05-19 cs.AI cs.CL cs.HC cs.IR

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

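Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko

Affiliations: University of Southern California; Florida International University

AI summary: This paper proposes agentic chunking and Bayesian de-chunking for generating and updating AI-generated fuzzy cognitive maps. Agents split the text into overlapping chunks, and the resulting sparse causal chunk matrices are mixed into a representative cyclic FCM knowledge graph, which is used to predict conflict outcomes in the Thucydides Trap model.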

Comments 15 pages, 6 figures

Abstract

We automatically generate feedback causal fuzzy cognitive maps (FCMs) from text by teaching large-language-model agents to break the text into overlapping chunks of text. Convex mixing of these chunk FCMs gives a representative cyclic FCM knowledge graph. The text chunks can have different levels of overlap. The chunk FCMs still mix to form a new FCM causal knowledge graph. The mixing technique scales because it uses light computation with sparse causal chunk matrices. The mixing structure allows an operator-level type of Bayesian inference that produces "de-chunked" or posterior-like FCMs from the mixed FCM. These de-chunked FCMs are useful in their own right and allow further iterations of Bayesian updating. We demonstrate these mixing techniques on the essay text of Allison's "Thucydides Trap" model of conflict between a dominant power such as the United States and a rising power such as China. The FCM dynamical systems predict outcomes as they equilibrate to fixed-point or limit-cycle attractors. Seven out of 8 FCM knowledge graphs predicted a type of war when we stimulated them by turning on and keeping on the concept node that stands for the rising power's ambition and entitlement. Gemini 3.1 LLMs served as the chunking AI agents.
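
The two mechanical steps in this abstract, convex mixing of chunk FCMs and running the mixed FCM until it settles into an attractor, can be sketched in a few lines. The example below is a toy under stated assumptions: the three concept nodes, the edge weights, the equal mixing weights, and the binary threshold update are all hypothetical and are not taken from the paper. The Bayesian de-chunking step is not shown.

```python
# Illustrative only: three hypothetical concepts and made-up edge weights.
# Node 0 = rising power's ambition, 1 = dominant power's fear, 2 = conflict.

def mix_fcms(matrices, weights):
    """Convex combination of same-sized causal edge matrices (weights >= 0, sum to 1)."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(n)] for i in range(n)]

def step(state, edges, clamp=()):
    """One synchronous update with a binary threshold; clamped nodes stay on."""
    n = len(state)
    nxt = [1 if sum(state[i] * edges[i][j] for i in range(n)) > 0 else 0
           for j in range(n)]
    for k in clamp:
        nxt[k] = 1
    return nxt

def run_to_attractor(state, edges, clamp=()):
    """Iterate until a state repeats; return the repeating cycle (length 1 = fixed point)."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, edges, clamp)
    return seen[seen.index(state):]

chunk_a = [[0, 0.8, 0], [0, 0, 0.6], [0, 0, 0]]      # edges[i][j]: influence of i on j
chunk_b = [[0, 0.4, 0], [0, 0, 0.9], [-0.5, 0, 0]]
mixed = mix_fcms([chunk_a, chunk_b], [0.5, 0.5])
attractor = run_to_attractor([1, 0, 0], mixed, clamp=(0,))
print(attractor)
```

With node 0 turned on and kept on, this toy map settles into the fixed point [1, 1, 1]. Because the state space is finite, the loop always ends in either a fixed point or a limit cycle, which matches the two attractor types the abstract names.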

2605.17902 2026-05-19 cs.AI

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

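Hanbyeol Park, Hyerim Bae

Affiliation: Department of Industrial Engineering, Pusan National University

AI summary: This paper proposes LAST-RAG, which improves degradation model selection by combining the observed health indicator trajectory with domain-specific context and retrieving theoretical and mechanical evidence from a local evidence bank. It turns model selection from a purely statistical fitting problem into a decision problem that integrates observed data with domain knowledge.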

Abstract

Stochastic-process-based degradation modeling is a core approach for estimating the distribution of remaining useful life (RUL); however, the selection of an appropriate stochastic process has not been sufficiently addressed. Existing model selection methods mainly rely on the statistical fit of the observed health indicator (HI) trajectory, but this approach may select a model that is inconsistent with the underlying degradation mechanism when the observation window is short or the signal is highly noisy. To address this issue, this paper proposes Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation (LAST-RAG). The proposed method uses both the observed HI trajectory and domain-specific context, and hierarchically conditions the candidate degradation model space based on theoretical and mechanical evidence retrieved from a local evidence bank. In addition, Rule-based Confidence Reasoning with Uncertain State (RCRUS) is introduced to prevent candidate models from being prematurely eliminated when hierarchical decisions are uncertain. Simulation-based experiments demonstrate that the proposed method outperforms statistical, prognostic, and uncertainty-aware baselines in both Wiener/gamma family classification and detailed degradation model classification. Ultimately, this study reframes degradation model selection from a purely statistical goodness-of-fit problem into a knowledge-conditioned decision-making problem that integrates observed data with domain knowledge.
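
For readers unfamiliar with the two model families named in the abstract, the sketch below simulates a Wiener process with drift and a gamma process using only the Python standard library. It is not the LAST-RAG method; the parameter values are arbitrary assumptions. It illustrates one structural difference that domain knowledge can supply: gamma-process paths never decrease, while Wiener paths can.

```python
import random

def wiener_path(n, drift=0.1, sigma=0.3, dt=1.0, seed=0):
    """Degradation path with Gaussian increments: drift*dt + sigma*sqrt(dt)*Z."""
    rng = random.Random(seed)
    x, path = 0.0, [0.0]
    for _ in range(n):
        x += drift * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def gamma_path(n, shape_rate=0.5, scale=0.2, dt=1.0, seed=0):
    """Degradation path with independent gamma increments (always positive)."""
    rng = random.Random(seed)
    x, path = 0.0, [0.0]
    for _ in range(n):
        x += rng.gammavariate(shape_rate * dt, scale)
        path.append(x)
    return path

def is_monotone(path):
    """True if the path never decreases."""
    return all(b >= a for a, b in zip(path, path[1:]))

w = wiener_path(200)
g = gamma_path(200)
print(is_monotone(w), is_monotone(g))
```

Over a short or noisy observation window the two families can produce similar-looking trajectories, which is the situation in which the abstract argues that statistical fit alone may pick a model inconsistent with the degradation mechanism.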

2605.17900 2026-05-19 cs.AI

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

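Le Zhang, Shengming Zhang, Rui Zha, Yunpeng Wu, Jingbo Zhou, Jizhou Huang

Affiliation: Baidu Inc.

AI summary: This paper presents DuIVRS-2, an LLM-based end-to-end framework for large-scale POI attribute acquisition. An FSM-guided data augmentation strategy and a selective generation scheme with a Chain-of-Thought mechanism improve output stability and effectively eliminate hallucinations, and the system achieves an 83.9% task success rate in production.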

Comments Accepted to ACL 2026 Industry Track. 14 pages, including appendix

Abstract

Accurate Point of Interest (POI) attribute acquisition is essential for location-based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS-2, a large language model (LLM)-based end-to-end framework designed for large-scale POI attribute acquisition at Baidu Maps. To address the long-tail distribution of real-world interactions, our methodology first employs a finite state machine (FSM)-guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain-of-Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual-evaluator voting system. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily and achieved an 83.9% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130 ms. This work provides a production-proven reference for developing robust, cost-effective LLM agents for large-scale industrial dialogue applications.

2605.17899 2026-05-19 cs.LG cs.AI q-bio.QM

DCFold: Efficient Protein Structure Generation with Single Forward Pass

Zhe Zhang, Yuanning Feng, Yuxuan Song, Keyue Qiu, Hao Zhou, Wei-Ying Ma

Affiliations: Institute for AI Industry Research (AIR); Department of Computer Science and Technology; School of Computer Science and Technology; ByteDance Seed

AI summary: This paper proposes DCFold, a single-step generative model that matches AlphaFold3 in accuracy. Its Dual Consistency training framework and novel Temporal Geodesic Matching (TGM) scheduler give a 15x inference speedup while maintaining predictive fidelity, validated on structure prediction and binder design benchmarks.

Abstract

AlphaFold3 introduces a diffusion-based architecture that elevates protein structure prediction to all-atom resolution with improved accuracy. This state-of-the-art performance has established AlphaFold3 as a foundation model for diverse generation and design tasks. However, its iterative design substantially increases inference time, limiting practical deployment in downstream settings such as virtual screening and protein design. We propose DCFold, a single-step generative model that attains AlphaFold3-level accuracy. Our Dual Consistency training framework, which incorporates a novel Temporal Geodesic Matching (TGM) scheduler, enables DCFold to achieve a 15x acceleration in inference while maintaining predictive fidelity. We validate its effectiveness across both structure prediction and binder design benchmarks.

2605.17898 2026-05-19 cs.LG

Lightweight Gaussian Process Inference in C++ on Metal and CUDA

Yu-Hsueh Fang

Affiliations: Department of Information Management, National Taiwan University; H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary: This paper presents LightGP, a dependency-free C++17 library for Gaussian process regression that supports Apple Metal and NVIDIA CUDA backends as well as CPU paths tuned via Apple Accelerate and OpenBLAS. LightGP offers four inference paths covering problem sizes from N = 100 to N = 500,000 and delivers significant speedups across different hardware.

Abstract

Gaussian process (GP) inference in Python is dominated by libraries such as GPyTorch and GPflow, which are built on deep-learning frameworks and inherit their dispatch overhead and dependency footprint. We present LightGP, a dependency-free C++17 library for GP regression with Python bindings, supporting Apple Metal and NVIDIA CUDA backends alongside tuned CPU paths via Apple Accelerate and OpenBLAS. LightGP provides four inference paths (exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT) covering problems from N = 100 to N = 500,000. On an Apple M4, LightGP CPU is 2.6-8.7× faster than GPyTorch CPU for exact GP and about 1.5× faster for sparse GP at every scale tested. On an NVIDIA RTX 3060, LightGP CUDA is 2.3-6.7× faster than GPyTorch CUDA for exact GP up to N = 2,048, with GPyTorch closing the gap at N = 4,096. A fused matrix-free kernel-vector product on Metal achieves 32× over the explicit path at N = 20,000 with O(N) memory, and an FFT-accelerated SKI matvec via Accelerate vDSP runs in sub-millisecond time at N = 200,000. LightGP compiles as a single static library with zero external dependencies and is installable via pip install lightgp.
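
The first inference path listed above, exact GP regression via a Cholesky factorization, is the textbook method and can be sketched without any library. The code below is not the LightGP API and does not reflect its implementation; it is a plain-Python illustration with an assumed RBF kernel, assumed hyperparameters, and hypothetical function names, showing the posterior mean k*^T (K + noise*I)^(-1) y computed through two triangular solves.

```python
import math

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on scalars (assumed kernel choice)."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def cholesky(A):
    """Lower-triangular L with A = L L^T for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_chol(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict_mean(xs, ys, x_star, noise=1e-6):
    """Posterior mean at x_star given training inputs xs and targets ys."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    alpha = solve_chol(cholesky(K), ys)
    return sum(rbf(x_star, a) * w for a, w in zip(xs, alpha))

xs = [0.0, 1.0, 2.0]
ys = [math.sin(v) for v in xs]
print(gp_predict_mean(xs, ys, 1.0), ys[1])
```

The factorization costs O(N^3) time and O(N^2) memory, which is why exact Cholesky is typically reserved for smaller N and why the abstract lists matrix-free and sparse alternatives for larger problems.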

2605.17894 2026-05-19 cs.AI

Evaluating Cognitive Age Alignment in Interactive AI Agents

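Yifan Shen, Jiawen Zhang, Jian Xu, Junho Kim, Ismini Lourentzou, Xu Cao, Meihuan Huang

Affiliations: University of Illinois Urbana-Champaign; Shenzhen Children's Hospital; Peking University; Hong Kong Polytechnic University

AI summary: This paper introduces ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in agents built on multimodal large language models. It systematically compares agents against age-specific human developmental stages, revealing where current agents succeed and fail at simulating age-specific cognitive behavior.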

Abstract

While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can resolve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM-based interactive agents against age-specific human developmental stages, exposing where current agentic AI systems can and cannot simulate age-specific cognitive behavior.

2605.17887 2026-05-19 cs.LG cs.AI

Attention Sinks and Outliers in Attention Residuals

Haozheng Luo, Haoran Dai, Shaoyang Zhang, Xi Chen, Eric Hanchen Jiang, Yijiang Li, Jingyuan Huang, Chenghao Qiu, Chenwei Xu, Zhenyu Pan, Haotian Zhang, Binghui Wang, Yan Chen

Affiliations: Department of Computer Science, Northwestern University; Department of Computer Science and Engineering, University of Michigan; Department of Statistics and Data Science, University of California Los Angeles; Department of Electrical and Computer Engineering, University of California San Diego; Department of Computer Science, Rutgers University-New Brunswick; Department of Computer Science and Engineering, Texas A&M University; Department of Computer Science, Columbia University

AI summary: This paper proposes OASIS, which uses an inter-layer null signal to address the attention sinks, activation outliers, and degraded inference stability that arise in AttnResidual architectures. It analyzes the dual-normalization design theoretically and shows experimentally that OASIS improves structural robustness and quantization robustness.

详情
AI中文摘要

我们提出OASIS,一种基于层间空信号的异常值和沉底感知技术。由于AttnResidual架构引入了额外的深度方向归一化通道,它们提高了层间路由的灵活性,但也加剧了注意力沉底、激活异常值以及由此导致的推理稳定性和量化鲁棒性下降。OASIS通过引入基于Softmax1的空空间,并通过层间空信号将token级的空证据耦合到深度路由中,从而减少由沉底主导的路由并提高结构鲁棒性。理论上,我们证明了AttnResidual的双归一化设计加剧了沉底形成和量化脆弱性。实验上,我们在三个真实世界数据集上将OASIS与五个基线进行比较,并观察到在注意力沉底和量化后性能方面有持续的改进。值得注意的是,OASIS在评估设置中使最大无穷范数平均降低9.26%、平均峰度降低2.60%,并在W8A8下将困惑度降低了75.85%,在W4A4下将GSM8K Pass@1提高了12.42%。

英文摘要

We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. As AttnResidual architectures introduce an additional depth-wise normalization channel, they improve inter-layer routing flexibility but also exacerbate attention sinks, activation outliers, and the resulting degradation in inference stability and quantization robustness. OASIS addresses this issue by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention sink and post-quantization performance. Notably, OASIS achieves an average reduction of 9.26% in maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.
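
For readers unfamiliar with the term, the following is a minimal sketch of what a Softmax1-style normalization could look like. It assumes that "Softmax1" means the commonly described variant that adds 1 to the softmax denominator, which is equivalent to an implicit extra logit fixed at 0. The abstract does not state the formula, so the function names and toy scores below are illustrative assumptions and not the OASIS implementation.

```python
import math


def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax1(scores):
    """Assumed Softmax1 variant: exp(x_i) / (1 + sum_j exp(x_j)).

    Returns the token weights and the leftover "null" mass, i.e. the
    share absorbed by the implicit logit fixed at 0.
    """
    # Shift by the largest logit, counting the implicit 0, for stability.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    null = math.exp(-m)
    total = null + sum(exps)
    return [e / total for e in exps], null / total


if __name__ == "__main__":
    uninformative = [-10.0, -10.0, -10.0]
    print(softmax(uninformative))   # forced to spread all mass over tokens
    print(softmax1(uninformative))  # nearly all mass goes to the null slot
```

When every score is strongly negative, the standard softmax must still hand out a full unit of attention, which is one commonly cited route to sink tokens. The variant lets the head attend to almost nothing instead.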

2605.17885 2026-05-19 cs.CL cs.AI

Multi-agent AI systems outperform human teams in creativity

多智能体AI系统在创造力上超越人类团队

Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo, Xing Xie, Nigel Collier, David Stillwell, Luning Sun

发表机构 * Microsoft Research Asia(微软亚洲研究院)

AI总结 研究探讨了多智能体AI系统在创造力任务中的表现,发现其在六个多样化问题解决任务中,比单智能体和人类团队更具创造力,核心方法是通过语义空间路径分析生成过程,主要贡献是揭示了AI和人类团队在创造力预测上的不同机制。

详情
AI中文摘要

尽管人工智能(AI)在众多认知任务上已匹配或超越人类表现,但创造力仍是一个极具争议的前沿。随着基于大语言模型(LLMs)的AI系统在研究和创新中被越来越多地采用,理解并增强其创造力变得至关重要。本文证明,在六个多样化的问题解决任务上,基于4541个多智能体LLM想法和341个人类团队想法,多智能体LLM团队不仅超越了单个智能体,而且在创造力方面显著优于人类团队(Cohen's d=1.50)。这种优势由新颖性驱动,同时保持了相当的实用性。为了研究两组的生成过程,我们利用神经语言模型表示将对话表示为语义空间中的路径。当对话范围广泛而非集中于单一主题(低全局一致性)时,LLM团队和人类团队都会产生更具创造性的想法。然而,预测创造力的其他模式有所不同:LLM团队受益于高效的探索(高语义扩展,较短路径),而人类团队受益于保持流畅的对话流(高局部一致性,频繁转向)。此外,我们发现模型选择和讨论结构是相互正交的设计杠杆,二者共同解释了LLM对话动态中26.8%的方差,为系统性地开发具有增强创造力的多智能体系统铺平了道路。

英文摘要

Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.
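
As a reading aid for the reported effect size, here is a minimal sketch of Cohen's d using the pooled standard deviation. The abstract does not say which variant of d the authors computed, so the pooled-SD form is an assumption, and the scores in the example are made-up toy values rather than data from the paper.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


if __name__ == "__main__":
    # Hypothetical creativity ratings, for illustration only.
    llm_team_scores = [2, 4, 6]
    human_team_scores = [1, 3, 5]
    print(cohens_d(llm_team_scores, human_team_scores))
```

Under this definition, d=1.50 would mean that the two group means differ by one and a half pooled standard deviations.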

2605.17877 2026-05-19 cs.AI

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

PAIR:面向多轮代理优化的前缀感知内部奖励模型

Wonjoong Kim, Yeonjun In, Sangwu Park, Dongha Lee, Chanyoung Park

发表机构 * KAIST(韩国科学技术院) Yonsei University(延世大学)

AI总结 本文提出PAIR模型,通过结合冻结的隐藏状态探针和轻量级注意力头,解决多轮任务中内部正确性探针的可靠性问题,从而在不依赖外部模型调用或真实标签的情况下,为GRPO训练提供密集的步骤级奖励信号。

Comments Under Review

详情
AI中文摘要

当前LLM在执行复杂的多阶段任务方面面临重大挑战。组相对策略优化(GRPO)已成为主流选择,但其对稀疏结果奖励的依赖严重限制了中间步骤的信用分配。现有的补救方案,例如运行完整rollout以分配步骤级优势、在每个步骤调用外部LLM评判者,或计算每次评估都需要真实答案的内在奖励,都会带来显著的成本或实际限制。我们假设,基于LLM隐藏状态的内部正确性探测可以被重新用作步骤级奖励信号,从而有可能一次性解决所有这些限制。然而,现有探测研究假设输入是干净的,我们首先表明这一假设在多步骤设置中不成立:隐藏状态探针在前缀污染下严重退化,它跟踪的是与(可能已损坏的)前缀的一致性,而非有依据的正确性;基于注意力的特征对污染保持稳健,但在干净前缀上表现欠佳。基于这种互补关系,我们提出前缀感知内部奖励(PAIR),一种两阶段模型:冻结的隐藏状态探针估计信念一致性,轻量级的基于注意力的头将其校正为有依据的正确性。实验结果表明,PAIR在受污染轨迹上取得了最高的AUROC,同时推理成本可忽略不计,能够在不依赖外部模型调用、真实标签或完整轨迹rollout的情况下,为GRPO训练提供密集的步骤级奖励信号。

英文摘要

A significant hurdle for current LLMs is the execution of complex, multi-stage tasks. Group Relative Policy Optimization (GRPO) has emerged as a leading choice, but its reliance on sparse outcome rewards severely limits credit assignment across intermediate steps. Existing remedies, such as running full rollouts to assign step-level advantages, calling external LLM judges at each step, or computing intrinsic rewards that require ground-truth answers at every evaluation, introduce significant costs or practical constraints. We hypothesize that internal correctness probing over LLM hidden states can be repurposed as a step-level reward signal, potentially addressing all of these limitations at once. However, existing probing research assumes clean inputs, and we first show that this assumption breaks down in multi-step settings: hidden-state probes degrade severely under prefix contamination, tracking coherence with the (possibly corrupted) prefix rather than grounded correctness, while attention-based features remain robust to contamination but underperform on clean prefixes. Building on this complementary relationship, we propose the Prefix-Aware Internal Reward (PAIR), a two-stage model with a frozen hidden-state probe estimating belief-consistency and a lightweight attention-based head correcting it toward grounded correctness. Experimental results show that PAIR achieves the highest AUROC on contaminated trajectories while operating at negligible inference cost, enabling dense step-level reward signals for GRPO training without external model calls, ground-truth dependencies, or full-trajectory rollouts.
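
To make the credit-assignment problem concrete, the sketch below shows the group-relative advantage as it is commonly described for GRPO: each rollout's outcome reward is normalized by the mean and standard deviation of its sampling group. The use of the population standard deviation, the `eps` term, and the helper names are assumptions for illustration. This is not the PAIR reward model itself.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of rollouts."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mean_r) / (std_r + eps) for r in rewards]


def broadcast_to_steps(advantage, num_steps):
    """With only an outcome reward, every step inherits the same value."""
    return [advantage] * num_steps


if __name__ == "__main__":
    # Four rollouts for one prompt: 1.0 = task solved, 0.0 = failed.
    outcome_rewards = [1.0, 0.0, 0.0, 1.0]
    advantages = group_relative_advantages(outcome_rewards)
    # A five-step trajectory gets one identical advantage per step, so a
    # good early step in a failed rollout is penalized like a bad one.
    print(broadcast_to_steps(advantages[1], 5))
```

A dense step-level signal of the kind the abstract describes would replace the broadcast with a distinct value per step.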

2605.17875 2026-05-19 cs.CV

HexagonalWarriorMamba: Superior Threshold-Dependent Multi-label Classification of 12-Lead ECG Cardiac Abnormalities

HexagonalWarriorMamba: 12导联ECG心脏异常的阈值依赖多标签分类的更优方法

Huawei Jiang, Husna Mutahira, Shibo Wei, Jiahang Li, Vladimir Shin, Juneho Yi, Dongryeol Ryu, Wonyoung Park, Mannan Saeed Muhammad

发表机构 * Sungkyunkwan University, Department of Computer Science and Engineering(成均馆大学计算机科学与工程系) Sogang University, Department of Computer Science and Engineering(西江大学计算机科学与工程系) Gwangju Institute of Science and Technology, Department of Biomedical Science and Engineering(光州科学技术院生物医学科学与工程系) Tianjin Normal University, School of Artificial Intelligence(天津师范大学人工智能学院) Financial University under the Government of the Russian Federation, Department of Artificial Intelligence(俄罗斯联邦金融大学人工智能系) Sungkyunkwan University, Department of Electrical and Computer Engineering(成均馆大学电气与计算机工程系) Queen Mary University of London, School of Electronic Engineering and Computer Science(伦敦玛丽女王大学电子工程与计算机科学学院)

AI总结 本文提出HexagonalWarriorMamba框架,通过将12导联ECG视为单通道2D图像而非传统1D时间序列,改进了传统深度学习模型在处理ECG信号长程依赖关系方面的不足,实现了对心脏异常的更优多标签分类。

Comments Submitted to Scientific Reports

详情
AI中文摘要

从12导联心电图(ECG)中准确自动诊断心脏异常对于管理心血管疾病至关重要。然而,传统深度学习模型在检测并发状况方面仍面临挑战,因为它们通常难以建模ECG信号固有的长程依赖性。本文提出HexagonalWarriorMamba(HWMamba),一种基于Mamba架构的框架,将12导联ECG视为单通道2D图像而非传统1D时间序列。通过整合分层架构与2D选择性扫描机制,HWMamba被设计用于建模数据中的全局上下文和复杂空间关系。该模型在PhysioNet/Computing in Cardiology挑战2021数据集上进行评估,该数据集包含26个诊断标签,涵盖来自四个国家和三个大洲的七个机构的记录。结果表明,HWMamba在五个关键的阈值依赖指标上均优于当前最先进的方法,包括挑战分数和子集准确率。这些改进在保持宏AUROC接近SOTA性能的同时,提供了来自训练数据的有效阈值选择与强大的判别能力之间的平衡。这种Hexagonal Warrior表现,反映了在多个评估维度上的一致性能,使HWMamba成为多标签ECG分类的稳健且多功能的方法。

英文摘要

The accurate automated diagnosis of cardiac abnormalities from 12-lead electrocardiograms (ECGs) is critical for managing cardiovascular disease. However, detecting concurrent conditions remains a challenge for traditional deep learning models, which often have limited ability to model the long-range dependencies inherent in ECG signals. This manuscript proposes HexagonalWarriorMamba (HWMamba), a framework built on the Mamba architecture that processes 12-lead ECGs as single-channel 2D images rather than conventional 1D time series. By integrating a hierarchical architecture with a 2D Selective Scan mechanism, HWMamba is designed to model global context and complex spatial relationships within the data. The model is evaluated on the PhysioNet/Computing in Cardiology Challenge 2021 dataset, which includes 26 diagnostic labels and comprises recordings collected from seven institutions across four countries and three continents. Results demonstrate that HWMamba outperforms current state-of-the-art (SOTA) methods across five key threshold-dependent metrics, including Challenge Score and Subset Accuracy. These improvements provide a balance between strong discriminative capability and effective threshold selection derived from the training data, while maintaining near-SOTA performance in Macro AUROC. This Hexagonal Warrior performance, reflecting consistent performance across multiple evaluation dimensions, positions HWMamba as a robust and versatile approach for multi-label ECG classification.

2605.17869 2026-05-19 cs.CV

PySIFT: GPU-Resident Deterministic SIFT for Deep Learning Vision Pipelines

PySIFT:用于深度学习视觉流水线的GPU驻留确定性SIFT

Sivakumar K. S., Mohammad Daniyalur Rahman, Gopi Raju Matta

发表机构 * Indian Institute of Technology Madras(印度理工学院马德拉斯分校)

AI总结 本文研究了经典SIFT在深度学习视觉流水线中的应用,展示了其在准确性和速度上的优势,并提出了PySIFT,一种完全在GPU上驻留的SIFT实现,能够提供确定性的输出和高效的性能。

Comments 9 pages, 6 figures

详情
AI中文摘要

局部特征研究中一个普遍的假设是,经典手工描述符是精度有限的过时产物,最好被学习型替代方案取代。我们证明这一假设是错误的。通过覆盖四个基准(HPatches、ROxford5K、IMC Phototourism、MegaDepth)的8种配置消融研究,我们展示了经典SIFT结合DSP多尺度池化在所有准确性指标上均优于神经描述符和方向估计替代方案(HardNet、OriNet),同时运行速度快2-18倍,并且学习型匹配器(LightGlue)是对经典特征的补充而非取代。这一结论重新定义了十年来的工作:不是“取代SIFT”,而是“与SIFT组合”,即经典提取仅在几何上下文需要时才与学习型匹配搭配。这一发现此前之所以未被察觉,是因为此前没有GPU SIFT能够将完整流水线保留在VRAM中,或提供可用于受控的经典与学习对比消融的模块化设计。我们提出了PySIFT,第一个完全驻留于GPU的SIFT,使用CuPy/Numba CUDA内核实现,并通过DLPack零拷贝交接给下游DL框架:无论关键点数量如何,元数据交换均为亚毫秒级的O(1)操作。在一台笔记本级NVIDIA RTX 3050(4 GB VRAM)上,PySIFT实现了:(i)在HPatches上比OpenCV SIFT更高的平均匹配准确率(MMA);(ii)在高分辨率MegaDepth上每对图像快383毫秒;(iii)在跨数据集基准测试中更高的几何精度(在MegaDepth上+5.6 pp AUC@10°,在IMC Phototourism上更多内点);(iv)逐位确定性的输出,即多次运行得到完全相同的关键点和描述符,且检测结果即使跨GPU架构也能完全复现。学习型提取器若不付出显著的性能代价就无法提供这一保证,而且由于cuDNN的算法选择依赖于架构,它们在跨GPU架构时根本无法实现这一点。PySIFT是开源的,无需C++编译。

英文摘要

A widespread assumption in local feature research holds that classical handcrafted descriptors are accuracy-limited relics best replaced by learned alternatives. We show this is wrong. Through an 8-configuration ablation spanning four benchmarks (HPatches, ROxford5K, IMC Phototourism, MegaDepth), we demonstrate that classical SIFT with DSP multi-scale pooling outperforms neural descriptor and orientation replacements (HardNet, OriNet) on every accuracy metric, while running 2-18× faster, and that learned matchers (LightGlue) complement rather than supersede classical features. The conclusion reframes a decade of work: not "replace SIFT" but "compose with SIFT," classical extraction paired with learned matching only where geometric context demands it. This finding was invisible because no prior GPU SIFT kept the complete pipeline in VRAM or offered modularity for controlled classical-vs-learned ablations. We present PySIFT, the first fully GPU-resident SIFT, implemented in CuPy/Numba CUDA kernels with DLPack zero-copy handoff to downstream DL frameworks, with a submillisecond O(1) metadata swap regardless of keypoint count. On a laptop-grade NVIDIA RTX 3050 (4 GB VRAM), PySIFT achieves: (i) higher Mean Matching Accuracy (MMA) than OpenCV SIFT on HPatches, (ii) 383 ms faster per pair on high-resolution MegaDepth, (iii) higher geometric accuracy on cross-dataset benchmarks (+5.6 pp AUC@10° on MegaDepth, more inliers on IMC Phototourism), and (iv) bitwise deterministic output, with identical keypoints and descriptors across runs and detection reproducing identically even across GPU architectures: a guarantee that learned extractors cannot match without significant performance sacrifice, and cannot achieve at all across GPU architectures due to cuDNN's architecture-dependent algorithm selection. PySIFT is open-source, requiring no C++ compilation.

2605.17865 2026-05-19 cs.CV

Imaging Hidden Objects with Consumer LiDAR via Motion Induced Sampling

通过运动诱导采样用消费级LiDAR成像隐藏物体

Siddharth Somasundaram, Aaron Young, Akshat Dave, Adithya Pediredla, Ramesh Raskar

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Dartmouth College(达特茅斯学院)

AI总结 本文提出了一种多帧融合策略,利用运动诱导孔径采样模型,在消费级LiDAR上实现了非视距(NLOS)成像,实现了隐藏物体的3D重建、多物体跟踪和相机定位,并展示了消费级硬件无需额外设置即可实现非视距成像的潜力。

详情
英文摘要

LiDARs are being increasingly deployed for consumer imaging in handheld, wearable, and robotic applications. These sensors can capture the time-of-flight of light at picosecond resolution, which, in principle, enables them to capture information about objects hidden from their field of view. While such non-line-of-sight (NLOS) imaging capabilities have been shown on research-grade LiDARs, they are challenging to achieve on consumer devices due to poor signal quality resulting from low laser power, low spatial resolution, and object and camera motion. Inspired by burst photography and synthetic aperture radar, we propose a multi-frame fusion strategy to overcome these challenges and demonstrate NLOS imaging on consumer LiDAR. We first introduce the motion-induced aperture sampling model to unify the effects of object shape, object motion, and camera motion under a single measurement model. Using this model, we demonstrate several NLOS capabilities on a smartphone-grade LiDAR: (1) 3D reconstruction, (2) single and multi-object tracking, and (3) camera localization using hidden objects. Previously, NLOS imaging capabilities were largely restricted to bulky and expensive research-grade hardware that requires extensive setup and calibration. Our results represent a shift towards plug-and-play NLOS imaging, where anyone can image hidden objects with off-the-shelf hardware (under $100) and no additional setup. We believe that democratization of such capabilities will advance consumer applications of NLOS imaging.

2605.17862 2026-05-19 cs.LG cs.AI

f-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

f-OPD: 通过新鲜度感知控制稳定长周期在线策略蒸馏

Xianwei Chen, Shimin Zhang, Jibin Wu

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文提出f-OPD框架,通过引入样本级新鲜度评分来稳定长周期在线策略蒸馏,实现性能与效率的平衡,为大规模长周期智能体训练奠定基础。

详情
AI中文摘要

在大规模语言模型中扩展在线策略蒸馏(OPD)面临根本性矛盾:异步执行是系统效率的必要条件,但结构上偏离理想的在线策略目标。为解决这一挑战,我们理论上将目标偏差分解为回放漂移和监督漂移,分别捕捉学生回放和教师上下文的陈旧性。基于此,我们引入样本级新鲜度评分,量化缓冲样本相对于在线策略目标的可靠性。受此信号引导,我们进一步提出f-OPD,一种新颖的框架,能够自适应调节陈旧样本的影响并约束异步训练下累积的策略漂移。在推理、工具使用和编码代理任务中,f-OPD在增加交互周期时,始终能够实现与同步优化相当的任务性能,同时保留异步执行的吞吐量优势。我们的结果建立了OPD中实现性能-效率权衡的第一个配方,为大规模长周期智能体训练铺平道路。

英文摘要

Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, capturing staleness in student rollout and teacher context, respectively. Building on this, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates stale-sample influence and constrains policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution. Our results establish the first recipe for achieving a performance-efficiency trade-off in OPD, paving the way for long-horizon agentic post-training at scale.

2605.17856 2026-05-19 cs.AI

KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

KISS - 科学模拟知识基础设施:面向智能体化地球科学的支架

Ziwei Li, Liujun Zhu, Yuchen Liu, Yichen Zhao, Birk Li, Ruiqi Wu, Junliang Jin, Jianyun Zhang

发表机构 * State Key Laboratory of Water Disaster Prevention, Hohai University, Nanjing(水利灾害预防国家重点实验室,河海大学,南京) Yangtze Institute for Conservation and Development, Hohai University, Nanjing(长江保护与发展研究院,河海大学,南京) Department of Bioresource Engineering, McGill University, Sainte-Anne-de-Bellevue, Quebec, Canada(生物资源工程系,麦吉尔大学,圣安妮-德-贝尔贝夫,魁北克,加拿大) Ottawa Research and Development Centre, Agriculture & Agri-Food Canada, Ottawa, Ontario, K1A 0C6, Canada(渥太华研发中心,加拿大农业与食品部,渥太华,安大略,K1A 0C6,加拿大) College of Water Conservancy and Hydropower Engineering, Hohai University, Nanjing(水利水电工程学院,河海大学,南京) Meta Platforms Inc.(Meta平台公司) Nanjing Hydraulic Research Institute, Nanjing(南京水利研究院)

AI总结 本文提出KISS,一种用于科学模拟的知识基础设施,通过将专业知识外化为经过验证的建模操作符、分阶段的领域协议和诊断恢复机制,使智能体能够生成物理合理且可验证的端到端模拟,从而降低非专业用户与过程模拟之间的接入门槛,并促进建模社区的整合。

详情
AI中文摘要

基于过程的模拟模型凝结了地球科学各领域数十年的科学认识,但最易受气候风险和资源短缺影响的社区却最无力使用这些模型。本文介绍知识基础设施(KI),一种可供智能体执行的支架,将专业知识外化为经过验证的建模算子、分阶段的领域协议和诊断恢复机制。在一项包含3000次试验的耦合水文基准测试中,配备KI的智能体在最高84%的试验中生成了物理上合理且可验证的端到端模拟,而未配备KI的智能体则停滞在40%以下。KI可跨学科泛化。我们将其构建过程封装为知识解构工具包(KDT),该工具包自主生成了KI,使智能体能够端到端执行另外117个基于过程的模型,覆盖14个地球科学领域。在全部119个KI中,尽管底层物理不同,建模决策和故障修复措施却趋于一致,表明操作性专业知识是结构化且可提取的,而非临时拼凑的。演示表明,配备KI的智能体既降低了非专业用户使用基于过程的模拟的门槛,也降低了建模社区之间的整合门槛。借助这一支架,基于过程的科学可以发展为一个不断生长的科学公共资源,向任何需要了解的人负责,并可由任何能够贡献的人扩展。

英文摘要

Process-based simulation models encode decades of scientific understanding across the Earth sciences, yet the communities most exposed to climate risk and resource scarcity are the least able to use them. Here, we introduce knowledge infrastructure (KI), an agent-actionable scaffold that externalizes expertise into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines. We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc. Demonstrations show KI-equipped agents lowering both the access barrier between non-specialist users and process-based simulation, and the integration barrier between modelling communities. Through this scaffold, process-based science can then evolve as a living scientific commons, answerable to whoever needs to know and extendable by whoever can contribute.

2605.17854 2026-05-19 cs.LG

Learning over Positive and Negative Edges with Contrastive Message Passing

通过对比信息传递学习正负边

Peter Pao-Huang, Charilaos I. Kanatsoulis, Michael Bereket, Jure Leskovec

发表机构 * Department of Computer Science(计算机科学系) Stanford University(斯坦福大学)

AI总结 本文研究了在低标签率、高同质性和高边密度设置下,负边信息对图表示学习的价值,并提出对比信息传递机制以同时利用正负边信息提升性能。

详情
AI中文摘要

传统的图学习方法沿现有的边(即正边)进行信息传递来更新节点特征。然而,这些方法往往忽视了边的缺失(即负边)中可能包含的有价值信息。本文从理论上分析了负边在图表示中的价值,并证明在低标签率、高同质性和高边密度的设置下,相比仅使用正边,引入负边能带来显著的信息增益。受此启发,我们提出对比信息传递(CMP),一种通用的信息传递架构,使图神经网络层能够同时对正边和负边进行推理。通过对可学习权重施加软半正定约束,我们的方法对正连接的节点应用保持相似性的变换,对负连接的节点应用诱导不相似性的变换。在不同数据条件下的模拟和真实数据集上,当负边具有信息量时,CMP在低标签设置中始终优于基线方法。

英文摘要

Conventional approaches to learning on graphs involve message passing along existing (i.e., positive) edges to update node features. However, these approaches often disregard the potentially valuable information contained in the absence of edges (i.e., negative edges). Here, we theoretically analyze the value of negative edges in graph representations and prove that in settings of low label rates, high homophily, and high edge density, access to negative edges provides significant information gain over using only positive edges. Motivated by this insight, we introduce Contrastive Message Passing (CMP), a general message passing architecture that enables graph neural network layers to reason over positive and negative edges. By imposing soft positive semidefinite constraints on the learnable weights, our approach differentially applies similarity-preserving transformations to positively connected nodes and dissimilarity-inducing transformations to negatively connected nodes. Over simulated and real datasets in varying data regimes, CMP consistently outperforms baselines in low-label settings when negative edges are informative.

2605.17851 2026-05-19 cs.RO

A Dexterous and Compliant Gripper With Soft Hydraulic Actuation for Microgravity Manipulation

一种用于微重力操作的软液压驱动灵巧柔顺夹爪

William Su, Jordan Kam, Yixiao Wang, Jianshu Zhou

发表机构 * Aerospace Engineering Program, University of California, Berkeley(加州大学伯克利分校航空航天工程系) Department of Mechanical Engineering, University of California, Berkeley(加州大学伯克利分校机械工程系) Department of Mechanical Engineering, National University of Singapore(新加坡国立大学机械工程系)

AI总结 本文提出将DexCoHand灵活的双指六自由度机械手与Astrobee自由飞行机器人集成,以实现微重力环境下的灵活操作,该机械手在保持稳定接触的同时减少了对自由飞行基底的干扰,提高了操作的连续性和适应性。

Comments Accepted to the IEEE ICRA 2026 Space Robotics Workshop (SRW). 4 pages, 3 figures

详情
AI中文摘要

Astrobee现有的单自由度(DOF)欠驱动柔顺爪式夹爪能够在国际空间站(ISS)上栖停,但其连续灵巧操作的能力有限。更复杂的微重力任务需要一种既能保持稳定接触、又能限制对自由飞行基座扰动的末端执行器,因为接触力会直接耦合到基座运动中。本文介绍了将DexCoHand(一种灵巧且柔顺的双指六自由度夹爪)与Astrobee自由飞行机器人集成,以实现微重力操作。该系统在MuJoCo中使用Astrobee的标准扶手栖停序列进行评估,包括接近、栖停以及随后的水平转动(pan)和俯仰(tilt)运动。与Astrobee现有的夹爪相比,DexCoHand在保持指令给定的水平转动和俯仰运动的同时,减少了非预期的跨轴基座运动。在地球上进行的硬件实验进一步展示了DexCoHand的灵巧操作能力及其在适应性更强的智能操作任务中的潜力。

英文摘要

Astrobee's existing one-degree-of-freedom (DOF) underactuated compliant claw gripper enables perching on the International Space Station (ISS), but provides limited capability for continuous dexterous manipulation. More complex microgravity tasks require an end-effector that can maintain stable contact while limiting disturbance to the free-flying base, since contact forces directly couple into base motion. This article presents the integration of DexCoHand, a dexterous and compliant two-finger, 6-DOF gripper, with the Astrobee free-flying robot for microgravity manipulation. The system is evaluated in MuJoCo using Astrobee's standard handrail perching sequence, including approach, perching, and subsequent pan and tilt motions. Compared with Astrobee's existing gripper, DexCoHand preserves the commanded pan and tilt motions while reducing unintended cross-axis base motion. Hardware experiments on Earth further demonstrate DexCoHand's dexterous manipulation capabilities and its potential for more adaptable intelligent manipulation tasks.

2605.17849 2026-05-19 cs.CL cs.AI cs.LG

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

从有机数据生成预训练令牌以实现数据受限条件下的扩展

Zichun Yu, Chenyan Xiong

发表机构 * Language Technologies Institute, Carnegie Mellon University(卡内基梅隆大学语言技术研究所)

AI总结 本文提出SynPro框架,通过重新表述和重新格式化操作,帮助大语言模型更充分地利用有限的有机数据,从而在数据受限的预训练中实现更高效的扩展。

详情
AI中文摘要

LLM预训练正从计算受限阶段转向数据受限阶段,可用的人类(有机)文本远远无法满足扩展需求。然而,进入数据受限阶段并不意味着模型已充分利用其有机语料库。在本文中,我们介绍了SynPro,一个合成数据生成框架,帮助LLM更充分地从有限的有机数据中学习。SynPro应用重新表述和重新格式化两种操作,以多样化的形式呈现相同的有机来源,在不引入外部信息的情况下促进更深层次的学习。两个生成器均通过强化学习进行优化,使用质量、忠实度和数据影响力奖励,并在预训练进入平台期时持续更新,以针对模型尚未吸收的内容。我们使用来自DCLM-Baseline、相当于Chinchilla最优令牌数10%的数据(0.8B和2.2B)预训练400M和1.1B模型,反映了前沿预训练中现实的数据受限阶段。我们的结果表明,标准的重复训练方法远未充分利用有机数据:SynPro释放的有效令牌数是重复训练的3.7-5.2倍,在1.1B规模上甚至超过了在等量唯一数据上训练的非数据受限Oracle。分析证实,忠实且感知模型状态的合成能够在不导致分布崩溃的情况下维持数据受限条件下的扩展。我们的代码开源于https://github.com/cxcscmu/SynPro。

英文摘要

LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformatting, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.
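
The token budgets quoted above can be checked with a short calculation. The sketch assumes the widely cited rule of thumb of roughly 20 training tokens per parameter for a Chinchilla-optimal budget. The abstract does not state which ratio the authors used, so the constant and the function name are assumptions, although the results agree with the 0.8B and 2.2B figures.

```python
# Assumed rule of thumb: about 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20


def data_bound_budget(num_params, percent=10):
    """Tokens available when only `percent` of the optimal budget exists."""
    optimal_tokens = num_params * TOKENS_PER_PARAM
    return optimal_tokens * percent // 100


if __name__ == "__main__":
    for num_params in (400_000_000, 1_100_000_000):
        budget = data_bound_budget(num_params)
        print(f"{num_params / 1e9:.1f}B params: {budget / 1e9:.1f}B tokens")
```

Integer arithmetic is used so that the comparison with the quoted budgets is exact.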

2605.17834 2026-05-19 cs.CV

Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation

稳定、扩展与增强MeanFlow用于大规模扩散蒸馏

Xiao He, Yang Li, Peizhen Zhang, Songtao Liu, Zhao Zhong, Nannan Wang

发表机构 * State Key Laboratory of Integrated Services Networks(综合业务网理论及关键技术国家重点实验室) Xidian University(西安电子科技大学) Tencent Hunyuan(腾讯混元)

AI总结 本文提出了一种稳定MeanFlow的方法,通过引入暖启动技术并结合轨迹分布对齐,提高了大规模工业模型蒸馏的性能和泛化能力。

Comments 10 pages

详情
AI中文摘要

扩散模型表现出卓越的生成能力,但其高延迟限制了实际部署。许多研究尝试减少采样步数以加速推理。其中,MeanFlow因其简洁的公式和出色的性能而备受关注。然而,其优化目标的不稳定性以及“均值趋向偏置”(mean-seeking bias)限制了其在蒸馏大规模工业模型中的应用。为了使MeanFlow能够稳定地蒸馏大规模模型,我们首先引入了一种预热技术,将MeanFlow原始的微分解替换为离散解。这种设计避免了因MeanFlow目标中包含来自未充分训练模型的stop-gradient项而导致的训练崩溃。一旦模型初步具备拟合平均速度场的能力,我们便将优化目标切换回微分解,以实现进一步的细化。同时,为了缓解MeanFlow在目标分布复杂、推理步数极少时的“均值趋向偏置”,我们引入轨迹分布对齐作为辅助目标,促使学生模型的轨迹分布更接近教师模型的轨迹分布。我们提出的蒸馏框架在应用于文本到图像(T2I)模型FLUX.1-dev(高达12B参数)时,性能优于现有蒸馏方法。此外,当扩展到80B参数的最先进(SOTA)T2I模型HunyuanImage 3.0时,我们的方法仍表现出稳健的泛化能力和强劲的性能。

英文摘要

Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to reduce sampling steps to accelerate inference. Among them, MeanFlow has attracted considerable attention due to its concise formulation and remarkable performance. Nevertheless, the instability of its optimization objective and the "mean-seeking bias" have limited its applicability to distilling large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique, in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids training collapse caused by the MeanFlow target containing a stop-gradient term from an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the "mean-seeking bias" of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model's trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework achieves superior performance compared to existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (up to 12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.

2605.17833 2026-05-19 cs.LG cs.AI

Efficient Bilevel Optimization for Meta Label Correction in Noisy Label Learning

噪声标签学习中用于元标签校正的高效双层优化

Ba Hoang Anh Nguyen, Viet Cuong Ta

发表机构 * Human-Machine Interaction Laboratory, VNU University of Engineering and Technology(人机交互实验室,越南国家大学工程技术大学)

AI总结 本文提出了一种高效的元标签校正方法EBOMLC,通过引入一步内循环更新、混合上界损失和对齐感知的动态障碍物,提高了元模型的训练效率和稳定性,实验表明其在高噪声环境下表现优异。

详情
AI中文摘要

训练深度神经网络时使用噪声标签可以降低数据标注成本,但可能会将噪声引入学习模型中。在元标签校正方法中,除了主模型外,还会训练一个额外的元模型,使用小规模干净数据集来校正大规模噪声数据集。然而,元模型的更新需要在主模型的内部步骤中计算超梯度,这会显著增加计算成本。为了提高训练效率,我们首先引入动态障碍梯度下降到标准元标签校正中。虽然这种直接扩展能够将训练过程的速度提高到大约一阶复杂度,但缺乏防止噪声信号泄漏到主模型和稳定元模型学习的机制。基于这一观察,我们提出了EBOMLC方法,其设计包含三个关键改进:一步内循环更新、混合上界损失和对齐感知的动态障碍物。在CIFAR-10和CIFAR-100上的实验结果表明,EBOMLC在高噪声率设置下优于其他基线方法,同时减少了元标签校正方法的训练时间。

英文摘要

Training a deep neural network with noisy labels could reduce data annotation cost but may introduce noise into the learned model. In meta label correction approaches, an additional meta model besides the main model is trained with a small, clean dataset to correct the large, noisy dataset. However, the update of the meta model requires the computation of hypergradients at the inner step of the main model, which significantly increases the computational cost. To improve the training efficiency, we first introduce the dynamic barrier gradient descent into standard meta label correction. While this naive extension is able to speed up the training process to approximately first-order complexity, it lacks mechanisms to prevent the leakage of noisy signals to the main model and to stabilize the learning of the meta model. Based on this observation, we propose the EBOMLC method, which is designed with three key improvements: a one-step inner-loop update, a mixture upper loss, and an alignment-aware dynamic barrier. Empirical results on CIFAR-10 and CIFAR-100 demonstrate that EBOMLC consistently outperforms other baselines, especially under high noise rate settings, while reducing the training time of the meta label correction approach.

2605.17831 2026-05-19 cs.LG cs.DB

Agentic Cost-Aware Query Planning with Knowledge Distillation for Big Data Analytics

具有知识蒸馏的代理成本感知查询规划用于大数据分析

Mahdi Naser-Moghadasi

发表机构 * Research Division, BrightMind AI(BrightMind AI 研究部) Texas Tech University(德克萨斯理工大学) University of Texas at Arlington(德克萨斯大学阿灵顿分校)

AI总结 本文提出了一种结合规则基教师规划器、UCB1老虎机探索、成本感知预测和知识蒸馏的轻量级学生规划器,以解决大数据分析中查询优化计算成本高且资源受限环境下的内存和延迟约束问题,实验结果显示在纽约出租车和IMDB数据集上相比默认规划器降低了23%的延迟并保持了94%的约束满足率。

Comments 8 pages, preprint, code at https://github.com/mahdinaser/agentic-kd-planner

详情
AI中文摘要

大数据分析中的查询优化仍然计算成本很高,尤其是在资源受限的环境中,传统优化器无法满足内存和延迟约束。我们提出了一种代理式查询规划系统,它结合了基于规则的教师规划器、UCB1老虎机探索、成本感知预测,以及面向轻量级学生规划器的知识蒸馏。我们的教师规划器使用六个关键优化策略生成SQL计划,而UCB1老虎机搜索在显式资源约束下高效地探索计划空间。随机森林成本模型根据计划特征预测查询延迟,从而支持成本感知决策。蒸馏得到的学生规划器(逻辑回归或梯度提升)学习模仿教师-老虎机的决策,以实现快速推理。在纽约出租车和IMDB数据集上的评估显示,与默认规划器相比,延迟减少了23%,同时保持了94%的约束满足率。学生规划器在复现最优计划方面达到了89%的准确率,推理速度快15倍。我们的单文件实现使得在资源受限的机器上进行可复现的大数据分析成为可能,并已在 https://github.com/mahdinaser/agentic-kd-planner 公开发布。

英文摘要

Query optimization in big data analytics remains computationally expensive, particularly for resource-constrained environments where traditional optimizers fail to satisfy memory and latency constraints. We present an agentic query planning system that combines a rule-based teacher planner, UCB1 bandit exploration, cost-aware prediction, and knowledge distillation to a lightweight student planner. Our teacher planner generates SQL plans using six key optimization strategies, while UCB1 bandit search efficiently explores the plan space under explicit resource constraints. A Random Forest cost model predicts query latency from plan features, enabling cost-aware decisions. A distilled student planner (Logistic Regression or Gradient Boosting) learns to mimic teacher-bandit decisions for fast inference. Evaluation on NYC Taxi and IMDB datasets demonstrates 23% latency reduction compared to default planners while maintaining 94% constraint satisfaction. The student planner achieves 89% accuracy in replicating optimal plans with 15x faster inference time. Our single-file implementation enables reproducible big-data analytics on resource-limited machines and is publicly available at https://github.com/mahdinaser/agentic-kd-planner.
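For readers unfamiliar with UCB1, the bandit step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the latency-to-reward mapping, the simulated latency noise, and the choice to give zero reward to plans that exceed a memory limit are all assumptions made for illustration. Only the UCB1 selection rule itself (empirical mean plus `sqrt(2 ln t / n)`) is the standard algorithm.

```python
import math
import random


def ucb1_select(counts, means, t):
    """Return the index of the arm (candidate plan) with the highest UCB1 score."""
    # Standard UCB1 initialisation: try every arm once first.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Empirical mean reward plus the UCB1 exploration bonus.
    scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb1(latencies_ms, memory_mb, memory_limit_mb, rounds=500, seed=0):
    """Simulate UCB1 over candidate plans under a memory constraint.

    Hypothetical setup: each plan has a nominal latency and memory footprint.
    A plan over the memory limit earns reward 0; otherwise the reward is a
    value in (0, 1] that grows as the (noisy) observed latency shrinks.
    """
    rng = random.Random(seed)
    k = len(latencies_ms)
    counts = [0] * k    # number of times each plan was tried
    means = [0.0] * k   # running mean reward of each plan
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        if memory_mb[arm] > memory_limit_mb:
            reward = 0.0  # assumed penalty for violating the constraint
        else:
            observed = latencies_ms[arm] * rng.uniform(0.9, 1.1)
            reward = 1.0 / (1.0 + observed / 100.0)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


if __name__ == "__main__":
    # Three made-up plans: the fastest one breaks the 512 MB memory limit.
    counts, means = run_ucb1(
        latencies_ms=[300.0, 80.0, 40.0],
        memory_mb=[100.0, 200.0, 900.0],
        memory_limit_mb=512.0,
    )
    print("pull counts:", counts)
    print("mean rewards:", [round(m, 3) for m in means])
```

In this toy setup the search concentrates its pulls on the fastest plan that still fits in memory, while the exploration bonus keeps the other plans from being ignored entirely. Folding the constraint into the reward is only one possible design; filtering infeasible plans before selection would be another.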

2605.17830 2026-05-19 cs.AI cs.CL

Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

记住更多,风险更多:具有记忆能力的LLM代理的纵向安全风险

Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia, Ming Jin

发表机构 * Virginia Tech(弗吉尼亚理工大学) University of California, Berkeley(加州大学伯克利分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

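AI summary: This study examines the safety risks that arise when memory accumulates in memory-equipped LLM agents across long-horizon, independent tasks. It introduces a trigger-probe protocol for measuring the effect of this memory contamination and finds that memory safety should be treated as a longitudinal property rather than a single-state property.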

详情
AI中文摘要

对具有记忆能力的LLM代理的安全评估通常测量单任务内的安全性:代理能否安全地完成单一场景,且往往是在提示注入或记忆污染等对抗性条件下。然而,在部署中,单个代理会在较长的时间跨度内服务于许多独立任务,早期任务中积累的记忆会影响后续无关任务上的行为。研究这种情形需要沿着跨任务的时间维度进行评估:关注的不是代理在任一单一记忆状态下是否安全,而是随着记忆在许多独立交互中不断积累,其安全性特征如何变化。我们将这种故障模式称为“时间记忆污染”。为了将记忆暴露与流的非平稳性区分开来,我们引入了一种触发-探测协议,该协议用固定的探测集对不同前缀长度的只读记忆快照进行评估,并结合NullMemory反事实基线来识别由记忆引起的违规。我们将此协议应用于三个涵盖记录、备忘录、表单和电子邮件通信的部署场景以及八种记忆架构,并进一步应用于类Claw的AI代理(如OpenClaw),使用该平台原生的记忆机制。启用记忆的代理的违规情况始终超过NullMemory基线,并且在两类代理上,记忆引起的违规率都随暴露长度呈现稳健的上升趋势。顺序随机化实验表明,该效应主要由积累的内容而非接触顺序驱动。最后,事件分解带来的一个结构性推论是,记忆引起的风险在生成之前即可从检索状态中检测到,我们用一个高召回率的诊断监控器证实了这一点。我们的结果表明,应将记忆安全视为一种需要时间维度评估的纵向属性,而非可以通过快照捕捉的单一状态属性。

英文摘要

Safety evaluations of memory-equipped LLM agents typically measure within-task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independent tasks over a long horizon, and memory accumulated during earlier tasks can affect behavior on later, unrelated ones. Studying this regime requires evaluation along the temporal dimension across tasks: not whether an agent is safe at any single memory state, but how its safety profile changes as memory accumulates across many independent interactions. We call this failure mode temporal memory contamination. To isolate memory exposure from stream non-stationarity, we introduce a trigger-probe protocol that evaluates a fixed probe set against read-only memory snapshots at varying prefix lengths, together with a NullMemory counterfactual baseline for identifying memory-induced violations. We apply this protocol across three deployment scenarios spanning records, memos, forms, and email correspondence and eight memory architectures, and additionally on Claw-like AI agents, such as OpenClaw, using the platform's native memory mechanism. Memory-enabled agents consistently exceed the NullMemory baseline, and memory-induced violation rates show a robust upward trend with exposure length on both agent classes. Order-randomization experiments indicate that the effect is driven primarily by accumulated content rather than encounter order. Finally, a structural consequence of the event decomposition is that memory-induced risk is detectable from retrieval state before generation, which we confirm with a high-recall diagnostic monitor. Our results argue for treating memory safety as a longitudinal property that requires temporal evaluation, not a single-state property that can be captured by a snapshot.

2605.17829 2026-05-19 cs.AI

Interactive Evaluation Requires a Design Science

交互评估需要一种设计科学

Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, Adrian Weller, Yizhong Wang, Jiaxin Pei

发表机构 * University of Texas Austin(德克萨斯大学奥斯汀分校) California Institute of Technology(加州理工学院) Carnegie Mellon University(卡内基梅隆大学) Stanford University(斯坦福大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Microsoft Research(微软研究院) Northwestern University(西北大学) University of Cambridge(剑桥大学)

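AI summary: This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm rather than merely a new family of agent benchmarks. Defining evaluation as an autonomous mapping from evidence to judgments, it shows how interactive evaluation changes both sides of that mapping, proposes a two-axis taxonomy, derives design principles and reporting standards, and analyzes how longstanding evaluation challenges reappear at the trajectory level.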

Comments 10 pages

详情
AI中文摘要

AI评估正在经历结构性变革。大型语言模型(LLMs)越来越多地被部署为借助工具、环境、用户和其他智能体随时间持续行动的系统,而许多评估实践仍沿用以响应为中心的基准所隐含的假设(例如固定输入、孤立输出,以及仅凭单个响应即可做出的结果判断)。该领域已开始构建交互式基准,但由此形成的格局却十分碎片化:各基准在允许的交互产物、轨迹的评分方式以及结果所能支持的主张上各不相同。本立场论文主张,交互评估应被视为一种原则性的评估范式,而不仅仅是一类新的智能体基准。仅仅沿用以往的评估范式是不够的。我们将评估定义为从证据到判断的自主映射,并说明交互评估改变了这一映射的两端:证据变为由交互生成的轨迹,而评估程序必须考察过程、可恢复性、协调性、鲁棒性和系统级性能。基于这一定义,我们提出了双轴分类法,推导出设计原则和报告标准,考察了代表性场景,并分析了长期存在的评估挑战如何在轨迹层面重新出现。

英文摘要

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.