arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25737 2026-05-26 cs.CV

SFR-Net: Learning Scale-Frustum Representations for Ultra-Wide Area Remote Sensing Image Segmentation

Chuyu Zhong, Keyan Chen, Qinzhe Yang, Bowen Chen, Zhengxia Zou, Zhenwei Shi

Affiliations: Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University; Key Laboratory of Spacecraft Design Optimization and Dynamic Simulation Technologies, Ministry of Education, Beihang University; Shen Yuan Honors College, Beihang University; College of Computing and Data Science, Nanyang Technological University

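AI summary: To handle the large variation in ground-object scale and the need for long-range contextual semantic continuity in ultra-wide-area remote sensing images, the paper proposes the Scale-Frustum Representation Network (SFR-Net). It builds scale-frustum representations and a cascaded cross-scale fusion mechanism, improving mIoU by 1.72% on GID and 4.29% on FBPS.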

Abstract

Pixel count and geographical coverage are two key characteristics of remote sensing images. Existing remote sensing image segmentation methods typically focus on images with either a small pixel count or a large pixel count but limited geographical coverage. In this paper, we introduce a novel segmentation task targeting ultra-wide area (UWA) remote sensing images, characterized by both a large pixel count and extremely wide geographical coverage. The core challenges of UWA segmentation lie in simultaneously handling ground objects with significantly varying scales and maintaining long-range contextual semantic continuity. To address these challenges, we propose the Scale-Frustum Representation Network (SFR-Net). Inspired by the viewing frustums of remote sensing images captured from different altitudes, we construct scale-frustum representations, enabling unified modeling of ground objects and contextual features at different scales. Furthermore, we design a cascaded cross-scale fusion mechanism to effectively integrate these representations, enhancing local semantic understanding while ensuring long-range contextual continuity. Experimental results on GID and FBPS demonstrate that SFR-Net achieves state-of-the-art performance, improving mIoU by 1.72% and 4.29%, respectively, over the strongest competing methods. In addition, the proposed scale-frustum representations can be integrated into generic segmentation networks to improve both segmentation accuracy and convergence speed. The implementation code will be publicly available at https://github.com/ChuyuZhong/SFR-Net.

2605.25735 2026-05-26 cs.AI

A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

Aydin Homay

Affiliations: Technische Universität Dresden

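AI summary: This paper focuses on the problem-formulation step of axiomatic design. It clarifies the definition and properties of first-level functional requirements, analyzes common misconceptions and difficulties, offers practical guidance, and closes by discussing the role of large language models in this step.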

Comments Accepted at ICAD 2026 (MIT); the final camera-ready version will be available once it is published by Springer

Abstract

Problem formulation, translating customer needs and constraints into a minimum set of independent first-level functional requirements (FRs), is arguably the most critical step in every design framework, including axiomatic design, yet it is frequently misunderstood or underestimated in practice. This paper focuses exclusively on problem formulation in axiomatic design: it clarifies what first-level FRs are (and are not), explains why they should not legitimately vary across designers given the same needs and constraints, and highlights intrinsic difficulties and recurring pitfalls that lead to design failure. The discussion is grounded primarily in Nam P. Suh's three books, The Principles of Design, Axiomatic Design: Advances and Applications, and Complexity Theory, and it offers practical guidance to help designers formulate well-posed first-level FRs. Finally, the paper briefly revisits problem formulation in the era of large language models and discusses what such tools can (and cannot) contribute at the first level.

2605.25730 2026-05-26 cs.CV

DeCoDrift: Stabilizing Decoder Coupling in Closed-Loop Foundation Segmentation

H. M. Shadman Tabib, Md. Shamsuzzoha Bayzid, M Sohel Rahman

Affiliations: Department of Computer Science and Engineering

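AI summary: Decoder coupling drift causes errors to accumulate in closed-loop iterative segmentation. The paper proposes DeCoDrift, an inference-time stabilization framework that needs no training or ground-truth supervision; it constrains prompt updates and preserves decoder coupling to improve attention stability, temporal consistency, and segmentation quality.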

Comments 18 Pages, 5 Figures

Abstract

Foundation segmentation models such as Segment Anything Model (SAM) are now routinely used in iterative pipelines, where each predicted mask is fed back as the next prompt. This practice turns segmentation into a closed-loop dynamical process, yet the decoder-level behavior of these systems remains largely unexamined. We show that this feedback loop can induce a previously overlooked failure mode, decoder coupling drift, in which the mask decoder's cross-attention progressively loses alignment with the target object, causing errors to accumulate across iterations. We study this phenomenon by instrumenting SAM's mask decoder and deriving ground-truth-free measures of prompt-image coupling, attention stability, and temporal consistency. On volumetric electron microscopy data, these decoder-internal signals reveal that standard iterative prompting systematically degrades attention alignment and temporal coherence relative to oracle-anchored feedback. We then formalize iterative prompting as a discrete-time dynamical system and show how proximal anchoring reduces error amplification in the feedback loop. Building on this analysis, we introduce DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates and preserves decoder coupling across iterations. Across extensive experiments, DeCoDrift consistently improves attention stability, temporal coherence, and segmentation quality over standard iterative prompting, without retraining or ground-truth supervision. More broadly, our results show that decoder-internal dynamics are not merely diagnostic: they provide actionable signals for stabilizing foundation segmentation models in closed-loop use.

2605.25725 2026-05-26 cs.CV

TriDP-PTM: a three-stage distortion-perception tradeoff guides the pre-training model for radar cardiac sensing

Jinye Li, Aidong Men, Yang Liu, Qingchao Chen

Affiliations: National Institute of Health Data Science, Peking University; Institute of Medical Technology, Peking University; Beijing University of Posts and Telecommunications; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Wangxuan Institute of Computer Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University

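AI summary: Proposes the Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM), which uses an indirect radar-to-ECG-to-task path and a composite loss function, and finds that downstream clinical accuracy is best in the coopetitive stage.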

Abstract

Cardiovascular diseases (CVDs) remain a leading cause of death globally, necessitating continuous, accurate non-invasive cardiac monitoring. While non-contact radar-based approaches show great promise, they often employ a single "distortion-driven" or "perception-driven" paradigm, frequently facing a trade-off between "low distortion but weak semantic information" and "high perceptual fidelity but poor interpretability." To address this, we propose a Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM), a radar-based multi-scale fusion dual-path framework that systematically compares the "direct radar-to-task" path against an "indirect radar-to-ECG-to-task" path. By integrating an ECG generator with a feature discriminator to form a composite loss function, our approach effectively incorporates medical priors - such as ECG morphology and rhythm - into downstream tasks. Through empirical analysis, we reveal that this trade-off manifests in three distinct phases (Positive-Sum, Coopetitive, and Negative-Sum), showing optimal downstream clinical accuracy typically emerges in the coopetitive stage. Extensive experiments on a dataset involving 30 subjects across 5 physiological states reveal that the indirect path consistently outperforms the direct path in diverse tasks, achieving 0.80 mean IoU in waveform segmentation, 98.3% average classification accuracy across four tasks, and a 56% MAE reduction in blood pressure regression compared to the strongest baselines. These findings validate our framework and indicate that, within the indirect radar-to-ECG pathway, appropriately weighting distortion and perception losses to operate in the coopetitive regime is critical for achieving both clinically interpretable ECG morphology and strong downstream accuracy in non-contact cardiac monitoring.
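
One common way to write a composite objective of the kind this abstract describes is a weighted sum of a distortion term and a perception term. The symbols below ($\lambda_d$, $\lambda_p$, and the two loss names) are assumptions for illustration only; the abstract does not give the paper's exact loss.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{d}\,\mathcal{L}_{\text{distortion}}
  + \lambda_{p}\,\mathcal{L}_{\text{perception}},
\qquad \lambda_{d},\ \lambda_{p} \ge 0
```

Under this reading, sweeping the ratio $\lambda_p / \lambda_d$ is one way such a trade-off could be traced; which ratios fall in the coopetitive stage is an empirical question that the paper itself addresses.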

2605.25720 2026-05-26 cs.AI

Learning to Search and Searching to Learn for Generalization in Planning

Michael Aichmüller, Yannik Hesse, Hector Geffner

Affiliations: Department of Machine Learning and Reasoning, RWTH Aachen University

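AI summary: Proposes a self-improving WA* learning framework with a value heuristic represented by a relational graph neural network. The heuristic guides search and is updated by Q-learning on the resulting search data, giving zero-shot generalization and outperforming deep reinforcement learning on several planning tasks.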

Comments Accepted at ICML 2026

Abstract

Combinatorial generalization remains a central challenge in Deep Reinforcement Learning (DRL). Classical planning provides a simple yet challenging setting to study this problem through explicit relational descriptions, without requiring learning from perception. In sparse-reward domains, standard RL exploration via real-time search is ineffective, and learning-based planning methods often rely on expert demonstrations, hindsight relabeling, or random walks from the goal state. In contrast, planners rely on best-first search methods such as $\mathrm{A}^\star$ to solve problems from scratch. We propose a self-improving $\mathrm{WA}^\star$ learning framework in combination with a value heuristic represented by a Relational Graph Neural Network: the heuristic guides search, and the resulting search data updates the heuristic via $Q$-learning. This loop yields heuristics that can function as general policies and solve new instances even without search, where DRL otherwise fails, as we show on puzzles such as Sokoban, PushWorld, The Witness, and the 2023 International Planning Competition benchmarks. Notably, we demonstrate strong zero-shot generalization: For example, heuristics trained on Blocksworld instances with fewer than 30 blocks successfully solve instances with 488 blocks without search.
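
As a rough illustration of the search component named in this abstract, the sketch below runs weighted A* (WA*) on a small grid, expanding nodes in order of f = g + w * h. The grid, the Manhattan-distance heuristic, and the names `weighted_a_star` and `w` are assumptions made for this example; the paper's heuristic is a learned relational GNN updated by Q-learning, which is not reproduced here.

```python
import heapq


def weighted_a_star(grid, start, goal, w=2.0):
    """Weighted A* on a 4-connected grid; cells equal to 1 are walls.

    Nodes are expanded in order of f = g + w * h, where h is the Manhattan
    distance (a hand-written stand-in for a learned heuristic).
    Returns the path as a list of (row, col) cells, or None if unreachable.
    """
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    best_g = {start: 0}
    parent = {start: None}
    frontier = [(w * h(start), 0, start)]  # entries are (f, g, cell)
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if g > best_g[cell]:
            continue  # stale entry: a cheaper route to this cell was found later
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue
            ng = g + 1  # unit step cost
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                parent[nxt] = cell
                heapq.heappush(frontier, (ng + w * h(nxt), ng, nxt))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = weighted_a_star(grid, (0, 0), (2, 0))
print(path)
```

With w = 1 this reduces to plain A*; larger w trusts the heuristic more and usually expands fewer nodes, at the cost of possibly longer paths.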

2605.25717 2026-05-26 cs.AI cs.CE cs.LG

FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue

João Alves Ribeiro, Bruno Alves Ribeiro, Francisco Pimenta, Sérgio M. O. Tavares, Faez Ahmed

Affiliations: Department of Mechanical Engineering; Massachusetts Institute of Technology; School of Engineering; Brown University; CONSTRUCT, Faculty of Engineering, University of Porto; University of Aveiro

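AI summary: Presents FLOATBench, a tabular benchmark with 582,120 fatigue-damage labels derived from high-fidelity simulations of 22 MW floating wind turbine towers, together with a regime-aware evaluation protocol that detects performance rank shifts that random splits cannot reveal.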

Abstract

Most of the world's offshore wind resource lies in waters too deep for fixed-bottom foundations, making floating offshore wind turbines (FOWTs) essential for deep-water deployment. As the industry scales toward $22$ MW class designs, tower fatigue becomes increasingly critical because larger structures amplify the coupled aero-hydro-servo-elastic loads induced by continuous wind and wave excitation. Accurate fatigue-damage prediction is therefore central to certification, design optimization, and cost reduction. Yet the field lacks a shared surrogate benchmark: studies report different simulations, splits, and metrics, making methods difficult to compare. We present FLOATBench, a public tabular benchmark with $582{,}120$ per-section fatigue-damage labels across three $22$ MW FOWT tower geometries, derived from $19{,}404$ high-fidelity OpenFAST simulations across the three towers ($6{,}468$ per tower: $1{,}078$ aligned wind/wave operating points $\times$ six turbulence seeds), labeled at $30$ cross-sections per tower. FLOATBench includes a regime-aware alpha-shape partition of the joint wind/wave operating envelope, stratifying test points into in-train, interpolation, and extrapolation regimes. It is paired with a reproducible evaluation harness covering three protocol levels: random validation (E1), within-tower regime-aware evaluation (E2), and cross-tower transfer (E3). The regime-aware protocol reveals rank shifts between global and extrapolation performance that random-split leaderboards cannot detect. To the authors' knowledge, FLOATBench is the first FOWT fatigue benchmark for tabular surrogate modeling, and offers an evaluation protocol that generalizes to engineering surrogates defined over physical operating envelopes. Dataset and code available at: https://github.com/Joao97ribeiro/FLOATBench.
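
The dataset sizes quoted in this abstract follow from one another by simple multiplication. The short check below reproduces that arithmetic; the variable names are chosen here for readability and are not taken from the FLOATBench code.

```python
# Counts stated in the abstract.
operating_points = 1_078    # aligned wind/wave operating points per tower
turbulence_seeds = 6        # turbulence seeds per operating point
towers = 3                  # 22 MW tower geometries
sections_per_tower = 30     # labelled cross-sections per tower

sims_per_tower = operating_points * turbulence_seeds  # simulations per tower
total_sims = sims_per_tower * towers                  # simulations overall
total_labels = total_sims * sections_per_tower        # per-section labels

print(sims_per_tower, total_sims, total_labels)
```

The three products match the 6,468 simulations per tower, 19,404 simulations in total, and 582,120 labels reported above.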

2605.25708 2026-05-26 cs.CV cs.CL cs.ET

CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

Sriram Mandalika

Affiliations: Hasso Plattner Institute

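AI summary: For multi-domain task-incremental learning, proposes cross-modal adaptive prompting, which uses CLIP's text embedding space for task routing, confidence estimation, and encoder adaptation, and surpasses prior work on the MTIL benchmark.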

Abstract

Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
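
The text-space routing idea in this abstract, picking a task by cosine similarity between an image feature and frozen text prototypes, can be sketched in a few lines. Everything concrete here is an assumption: the 3-dimensional toy vectors stand in for CLIP embeddings, and the task names and the functions `cosine` and `route_to_task` are hypothetical, not the paper's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def route_to_task(image_feature, text_prototypes):
    """Return (task, score) for the text prototype most similar to the image feature."""
    scores = {task: cosine(image_feature, proto)
              for task, proto in text_prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy stand-ins for frozen per-task text prototypes (hypothetical values).
text_prototypes = {
    "aircraft": [1.0, 0.0, 0.0],
    "flowers": [0.0, 1.0, 0.0],
    "food": [0.0, 0.0, 1.0],
}
task, score = route_to_task([0.1, 0.9, 0.2], text_prototypes)
print(task, round(score, 3))
```

Because the prototypes are fixed and no routing parameters are trained in this sketch, the result does not depend on the order in which tasks were added, which is the property the abstract highlights.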

2605.25707 2026-05-26 cs.AI

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Jingwei Sun, Jianing Zhu, Yuanyi Li, Tongliang Liu, Xia HU, Bo Han

Affiliations: TMLR Group, Hong Kong Baptist University; The University of Texas at Austin; Sydney AI Centre, The University of Sydney; Shanghai Artificial Intelligence Laboratory

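AI summary: Presents AgentHijack, a benchmark that uses 9 configurable common environment corruptions to evaluate the robustness of computer-use agents driven by multimodal large language models, and designs the AgentHijack-Agent framework to make such agents more resistant to interference.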

Comments Accepted at ICML 2026

Abstract

Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where the uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate a variety of desktop tasks that utilize MLLM-based agents and discover that even minor instances of corruption can result in substantial performance degradation, which emphasizes the fragility of agents and underscores the necessity of robustness evaluation. We then propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models, and data are publicly available at: https://AgentHijack.github.io.

2605.25706 2026-05-26 cs.CV

Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker

Zongjian Wu, Lei Zhang

Affiliations: School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, China

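AI summary: Existing referring expression comprehension (REC) benchmarks are limited to simple scenes and a single-target assumption. The paper proposes the OpenRef benchmark, which covers diverse visual scenes, variable target counts, and rich vocabulary types, and introduces a training-free Multi-task Consistency Checker (MCC) to improve model performance in open-world settings.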

Comments 17 pages, 7 figures. Project Page: https://zongjianwu.github.io/openref

Abstract

Referring expression comprehension (REC) aims to localize a target object within an image based on a given expression. Although recent advances in vision-language models have led to substantial improvements in REC tasks, current REC benchmarks are often limited to simple scenarios and assume that each expression maps to a unique object. These limitations hinder the deployment of REC models in open-world environments. To fill this gap, we introduce OpenRef, a new benchmark for REC in complex visual and linguistic scenarios. OpenRef features three key advancements: 1) Diverse visual scenarios: spanning diverse visual domains, including ground views, drone views, dark scenes, and adverse weather conditions; 2) Variable target counts: breaking the single-target limitation with multi-target and none-target samples; 3) Rich vocabulary types: incorporating proper nouns, polysemous words, and ordinal terms to fit a wider range of expression needs. Furthermore, as traditional metrics are insufficient for the open-world setting, we leverage F1 to measure grounding accuracy and propose N3R (Negative Relative Rejection Reliability) to assess relative rejection reliability against negative expressions. Finally, we introduce the Multi-task Consistency Checker (MCC), a training-free, plug-and-play strategy that enhances model performance with one click by enforcing consistency self-verification. Extensive experiments demonstrate that this work significantly advances the performance of existing REC models in complex scenarios, paving the way for open-world REC. Project page: https://zongjianwu.github.io/openref

2605.25704 2026-05-26 cs.CL cs.LG

PowLU: An Activation Function for Stable Pre-Training of LLMs

Peijie Jiang, Yuqi Feng, Cunyin Peng, Qian Zhao, Jia Liu, KunLong Chen, Zhiqiang Zhang, Jun Zhou

Affiliations: Ant Group

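AI summary: Proposes the PowLU activation function, which uses a rational power function for adaptive nonlinearity to address the numerical instability of SwiGLU in low-precision LLM training. In large-scale training it performs on par with SwiGLU and SwiGLU-Clip and improves scalability.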

Comments 17 pages, 7 figures, techreport

Abstract

In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate the information flow and introduce non-linearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expressive capacity. However, this property also causes numerical instability as the input or model scale increases, particularly in low-precision LLM training. The main reason is its approximate quadratic amplification, which enlarges the output range and exacerbates outliers. To address this issue, we propose a stable activation function, Power Linear Unit (PowLU), for large-scale LLM pre-training. Specifically, PowLU employs a rational power function to achieve adaptive nonlinearity, thereby improving representation ability and enabling stable training in spike regions. Moreover, we provide theoretical justification for several key properties of PowLU. Scaling law experiments confirm that the performance is consistent across model sizes, and further experimental results with the Ling architecture (7.9B and 124B total parameters) demonstrate that PowLU achieves competitive results against SwiGLU and SwiGLU-Clip in large-scale training of LLMs. In addition, the experimental results also show that PowLU effectively improves the scalability of the large-scale training of LLMs.
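
The claim that SwiGLU behaves like $x^2$ for large positive inputs can be checked numerically in the scalar case. The sketch below assumes both SwiGLU projections are the identity, so the unit reduces to swish(x) * x; the name `swiglu_scalar` is introduced here for illustration, and PowLU itself is not implemented because the abstract does not give its formula.

```python
import math


def swish(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_scalar(x):
    """Scalar SwiGLU with both projections set to identity: swish(x) * x."""
    return swish(x) * x


# The ratio to x^2 equals sigmoid(x), which approaches 1 as x grows.
for x in (1.0, 4.0, 16.0):
    ratio = swiglu_scalar(x) / (x * x)
    print(f"x={x:5.1f}  swiglu={swiglu_scalar(x):10.4f}  ratio to x^2={ratio:.6f}")
```

This quadratic growth is what makes low-precision formats fragile: in IEEE float16, for instance, the largest finite value is 65504, so a product of the form x * x overflows once x exceeds roughly 256.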

2605.25698 2026-05-26 cs.LG cs.AI

How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

Zhitao Zhu, Xili Wang, Shizhe Wu, Jiawei Fu, Xiaoqing Liu

Affiliations: Peking University; Meituan

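AI summary: Extends functional scaling laws with a data-quality dimension and analytically solves the joint data-quality and batch-size scheduling problem, revealing a dual role for high-quality data. The proposed Drop-Stable-Rampup schedule improves average accuracy on a 15B MoE model by +1.70 over WSD and +2.98 over cosine decay.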

Abstract

High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theoretical guidance. We extend functional scaling laws by incorporating a data-quality dimension, and solve the joint data-quality and batch-size scheduling problem in asymptotic closed form. The solution reveals two regimes and a dual role of high-quality data. In the noise-limited regime, high-quality data should be used as a signal amplifier: lowering the batch size converts cleaner data into more signal without amplifying noise. In the signal-limited regime, it should be used as a noise suppressor: late placement reduces terminal noise without sacrificing signal accumulation. Existing curriculum-style pipelines primarily exploit the second role by placing cleaner data late, but miss the first role because conventional decay schedules reduce update intensity exactly when high-quality data becomes available. Guided by this, we propose Drop-Stable-Rampup for LLM midtraining: upon the quality transition, drop the batch size, hold it stable to accumulate signal, then ramp up to suppress terminal noise. On a 15B Mixture-of-Experts model midtrained on 108B tokens, Drop-Stable-Rampup improves average accuracy over Warmup-Stable-Decay (WSD) by +1.70 and over Cosine-decay by +2.98, with particularly large gains on mathematical reasoning benchmarks such as GSM8K (+4.23) and MATH (+2.80).
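
The drop, hold, and ramp-up shape described in this abstract can be sketched as a batch-size schedule. This is a minimal sketch under stated assumptions: the function name `drop_stable_rampup`, all default batch sizes, the `stable_fraction` parameter, and the linear ramp are placeholders chosen for illustration and are not values or design choices from the paper.

```python
def drop_stable_rampup(step, transition_step, total_steps,
                       base_batch=1024, dropped_batch=256, final_batch=2048,
                       stable_fraction=0.5):
    """Batch size at `step` for a drop / hold / linear ramp-up schedule.

    Before `transition_step` (where high-quality data starts) the batch size
    is `base_batch`. It then drops to `dropped_batch`, stays there for
    `stable_fraction` of the remaining steps, and ramps linearly to
    `final_batch` by `total_steps`. All defaults are illustrative placeholders.
    """
    if step < transition_step:
        return base_batch
    remaining = total_steps - transition_step
    stable_end = transition_step + int(stable_fraction * remaining)
    if step < stable_end:
        return dropped_batch  # hold phase: small batches, more updates per token
    progress = (step - stable_end) / max(1, total_steps - stable_end)
    progress = min(progress, 1.0)
    return int(round(dropped_batch + progress * (final_batch - dropped_batch)))


schedule = [drop_stable_rampup(s, transition_step=100, total_steps=300)
            for s in (50, 150, 250, 300)]
print(schedule)
```

The contrast with a conventional decay schedule is that update intensity rises, rather than falls, at the moment the cleaner data arrives, and large batches are saved for the end of training.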

2605.25696 2026-05-26 cs.LG

Evaluating passing decision-making in professional football: An enhanced MPNN approach to Receiver Selection

Gabriel Masella, Giuseppe Alessio D'Inverno, Max Goldsmith, Gianluigi Rozza

Affiliations: Department of Mathematics, Informatics and Geoscience; University of Trieste; MathLab; International School for Advanced Studies (SISSA); Royal Belgium Football Association

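AI summary: Proposes a graph neural network framework that predicts the best passing target by modeling on-field interactions as dynamic graphs. It reaches competitive accuracy on receiver selection and can evaluate more than 1,000 passes in seconds.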

Abstract

The process of decision-making in football is characterized by a complex interplay between spatial positioning, opponent pressure, and player intent. This work introduces a Graph Neural Network (GNN) framework designed to predict Receiver Selection, the optimal passing target, by modeling on-field interactions as dynamic graphs. Each player is represented as a node with positional and contextual features, while potential passing lines form weighted edges characterized by distance, angle, and pressure metrics. A Message-Passing Neural Network (MPNN) has been developed and trained using a combination of tracking data and event data from professional matches, synchronized through a robust pipeline based on an optimized version of the Needleman-Wunsch Algorithm. The model achieves competitive accuracy in identifying the actual chosen receiver and state-of-the-art accuracy within its top three suggestions. Our model further offers quantification of each option's likelihood, threat, and creativity, enabling performance analysts to evaluate over 1,000 passes in seconds.
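
The synchronization step in this abstract builds on the Needleman-Wunsch algorithm. The sketch below is the textbook global-alignment version, not the optimized variant the authors describe, and the two token sequences (event labels on one side, actions inferred from tracking on the other) are hypothetical inputs invented for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences a and b.

    Returns (score, aligned_a, aligned_b); gaps are represented by None.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None); i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1]); j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


events = ["pass", "shot", "pass"]                 # hypothetical event-data labels
tracking = ["pass", "dribble", "shot", "pass"]    # hypothetical tracking-derived labels
total, aligned_events, aligned_tracking = needleman_wunsch(events, tracking)
print(total, aligned_events, aligned_tracking)
```

In this toy case the alignment inserts a single gap opposite the unmatched "dribble", which is the kind of missing or extra record that a synchronization pipeline has to absorb.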

2605.25693 2026-05-26 cs.CL cs.DB cs.MA

From Facts to Insights: A Persona-Driven Dual Memory Framework and Dataset for Role-Playing Agents

Rongsheng Zhang, Ruofan Hu, Weijie Chen, Jiji Tang, Junnan Ren, Wanying Wu, Xunuoyan Chen, Tangjie Lv, Tao Jin, Zhou Zhao

Affiliations: Zhejiang University; Fuxi AI Lab, Netease Inc.

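AI summary: Role-playing agents lose persona consistency in long conversations because of context-window limits. The paper introduces the RoleMemo dataset and the DualMem dual-memory framework, which decouples memory into factual cognition and persona-conditioned insight. Trained with supervised fine-tuning and reinforcement learning, a 4B-parameter model outperforms zero-shot persona-agnostic frameworks built on DeepSeek-V3.2.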

Comments Preprint

Abstract

While role-playing agents excel in short-term interactions, long-term conversations overwhelm context windows, motivating external memory frameworks. Current systems typically rely on persona-agnostic summarization, which records facts without persona-specific interpretation, yielding generic responses that compromise persona fidelity. To bridge this gap, we introduce RoleMemo, a dataset featuring four reasoning tasks where the factual fragments must be interpreted through the persona to reach the correct answer. Evaluation on RoleMemo exposes critical limitations of persona-agnostic frameworks. We thus propose DualMem, which decouples memory into two streams: factual cognition and persona-conditioned insight. Trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), our framework with a 4B-parameter model outperforms zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 for sustained persona fidelity. Our resources are available at https://github.com/role2026/rolememo.

2605.25686 2026-05-26 cs.CL

Testing the Deliteralization Hypothesis in Human and Machine Translation

Malik Marmonier, Rachel Bawden, Benoît Sagot

Affiliations: Inria

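AI summary: Compares the literality of human translations, NMT systems, and LLMs across 54 language pairs to test whether the deliteralization hypothesis holds for LLM generation and revision.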

Abstract

The recent shift from dedicated NMT systems to general-purpose LLMs has reshaped machine translation, with LLMs reported to produce more fluent, less literal output than their predecessors. We test whether this shift extends to the deliteralization hypothesis, the long-standing claim from translation studies that translations become progressively less literal as they are drafted and revised. Using the WMT24++ dataset, we compare the literality of human translations and post-editions to that of two NMT systems and six LLMs across 54 language pairs and three tasks: direct translation, iterative self-revision, and post-editing of human drafts. Literality is measured via a validated Synthetic Literality Index built from six heuristics. We find that (i) human translations remain significantly less literal than those of all tested MT systems, though recent LLMs narrow the gap; (ii) when prompted to iteratively revise their own output, LLMs deliteralize monotonically, providing the first evidence that the hypothesis applies natively to LLM generation; and (iii) as post-editors, LLMs invert the revision triggers of human post-editors, tolerating literal drafts and targeting idiomatic human formulations for revision.

2605.25685 2026-05-26 cs.RO

HumanFlow -- Diffusion-Driven MAV Navigation Among Humans via Tightly-Coupled Motion Tracking, Forecasting, and Control

HumanFlow -- 通过紧耦合运动跟踪、预测和控制的扩散驱动MAV在人群中导航

Simon Schaefer, Joshua Näf, Stefan Leutenegger

发表机构 * Technical University of Munich(慕尼黑技术大学) MCML MIRMI ETH Zurich(苏黎世联邦理工学院)

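AI Summary Proposes HumanFlow, a latent diffusion model that unifies human motion tracking and forecasting conditioned on 3D scene context, yields accurate and efficient motion estimates under heavy occlusion, and is tightly coupled with control to give collision-free MAV navigation among humans.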

Comments Accepted to Robotics Science and Systems (RSS), 2026

详情
AI中文摘要

在3D场景上下文中对人类的鲁棒和准确感知对于将机器人集成到日常环境中至关重要。然而,现有方法通常无法预测与周围场景一致的合理且准确的人体运动估计,尤其是在存在严重遮挡或部分可见性的情况下。这可能会限制机器人操作的安全性和效率。我们引入了HumanFlow,一种潜在扩散模型,它统一了人体运动跟踪和预测,并以3D场景上下文为条件。我们展示了我们的人体运动模型在具有挑战性的条件下(包括严重遮挡)能够产生平滑且准确的预测,并且在跟踪精度上优于最先进的方法,同时效率显著更高。此外,我们展示了如何通过将这些表示作为基于流匹配的近似MPC策略的条件,将HumanFlow的潜在空间与控制紧密耦合。我们在模拟中使用真实人类轨迹验证了我们的策略用于MAV社交导航,展示了优越的导航性能,并且在人类部分可观察的情况下仍能保持无碰撞。

英文摘要

Robust and accurate perception of humans in their 3D scene context is essential for integrating robots into everyday environments. Existing approaches, however, often fail to predict plausible and accurate human motion estimates that are consistent with the surrounding scene, especially in the presence of heavy occlusions or partial visibility. This can limit both safety and efficiency for robotic operations. We introduce HumanFlow, a latent diffusion model that unifies human motion tracking and forecasting, conditioned on the 3D scene context. We show that our human motion model produces smooth and accurate predictions under challenging conditions, including heavy occlusions, and outperforms state-of-the-art methods in tracking accuracy while being significantly more efficient. Furthermore, we show how HumanFlow's latent space can be tightly coupled with control by conditioning a flow-matching-based, approximate MPC policy on these representations. We validate our policy in simulation with real human trajectories for MAV social navigation, demonstrating superior navigation performance and remaining collision-free, even under partial observability of the human.

2605.25681 2026-05-26 cs.LG cs.AI

Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models

不要重新训练,只需重用:从单目标扩散模型中恢复双目标分子

Qingyuan Zeng, Pengxiang Cai, Zixin Guan, Ziyang Chen, Anglin Liu, Lang Qin, Xinyao Lai, Jintai Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Guangzhou University of Chinese Medicine(广州中医药大学)

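AI Summary Proposes REUSE, a hierarchical evolutionary input-space search that recovers dual-target molecules from a frozen single-target diffusion model without retraining or modifying the diffusion process, gaining 20.9 percentage points in Dual High Affinity over the strongest prior baseline.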

详情
AI中文摘要

设计一个能调节两个靶点的单一分子是多药理学中一种有前景的策略,但它比标准的单目标生成要困难得多,因为一个候选分子必须满足两个结合要求,同时保持药物相似性和可合成性。现有的双目标生成方法通常通过在采样期间重新训练生成器或干预扩散过程来引入双目标能力。前者在双目标监督稀疏时可能成本高昂且难以稳定,而后者可能对去噪时的目标平衡和竞争性更新方向敏感。这些局限性促使我们寻找一种保持生成器不变的替代方案:能否在不修改参数或去噪动态的情况下,从冻结的单目标扩散模型的输入空间中恢复双目标候选分子?我们将此任务表述为一个受约束的多目标优化问题,并提出REUSE,一种层次化进化输入空间搜索框架,结合配对条件探索和结构化多阶段选择,以强制执行双目标亲和力、化学质量和多样性。实验表明,与修改扩散过程的方法相比,REUSE持续改善了双目标亲和力和平衡性,在双高亲和力指标上比最强基线提高了20.9个百分点,同时保持了竞争性的分子质量。

英文摘要

Designing a single molecule that modulates two targets is a promising strategy for polypharmacology, but it remains substantially harder than standard single-target generation because one candidate must satisfy two binding requirements while preserving drug-likeness and synthesizability. Existing dual-target generative methods typically introduce dual-target capability by either retraining the generator or intervening in the diffusion process during sampling. The former can be costly and difficult to stabilize when dual-target supervision is sparse, while the latter may be sensitive to denoising-time target balancing and competing update directions. These limitations motivate a generator-preserving alternative that keeps the pretrained prior intact: can dual-target candidates instead be recovered from the input space of a frozen single-target diffusion model, without modifying its parameters or denoising dynamics? We formulate this task as a constrained multi-objective optimization problem and propose REUSE, a hierarchical evolutionary input-space search framework that combines pair-conditioned exploration with structured multi-stage selection to enforce dual-target affinity, chemical quality, and diversity. Experiments show that, compared with methods that modify the diffusion process, REUSE consistently improves dual-target affinity and balance, achieving a 20.9-percentage-point gain in Dual High Affinity over the strongest prior baseline while maintaining competitive molecular quality.

2605.25680 2026-05-26 cs.CL cs.AI

Simulating Human Memory with Language Models

用语言模型模拟人类记忆

Qihan Wang, Nicholas Tomlin, Michael Hu, Brian Dillon, Tal Linzen

发表机构 * NYU(纽约大学) UMass Amherst(马萨诸塞大学阿姆赫斯特分校)

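AI Summary Compares language-model and human memory on classic psychology memory experiments. Out-of-the-box models remember better than humans, but prompting strategies and a compactor make their forgetting more human-like, and preliminary evidence suggests this makes them more effective user simulators in a downstream education task.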

详情
AI中文摘要

语言模型越来越多地被部署为用户模拟器,但它们的记忆远比真实用户可靠。为了衡量这一差距,我们在人类和语言模型上进行了一系列来自心理学的经典记忆实验。跨任务我们发现,未经调优的语言模型表现出比人类更好的记忆,即使在被提示模仿人类行为时也是如此。然后我们表明,更好的提示策略和使用压缩器可以使语言模型以更类似人类的方式遗忘内容。使用这些方法,我们初步证明,具有人类类似记忆约束的语言模型可以在下游教育任务中作为更有效的用户模拟器。最后,我们发布人类参考数据和基准,以支持未来关于用语言模型模拟人类记忆的工作。

英文摘要

Language models are increasingly being deployed as user simulators, but their memory is far more reliable than that of real users. To measure this gap, we run a series of classic memory experiments from psychology on both humans and language models. Across tasks, we find that out-of-the-box language models exhibit better memory than humans, even when prompted to imitate human behavior. We then show that better prompting strategies and the use of a compactor can cause language models to forget content in a more human-like way. Using these methods, we show preliminary evidence that language models with human-like memory constraints can function as more effective user simulators in a downstream education task. Finally, we release human reference data and benchmarks to support future work on simulating human memory with language models.

2605.25676 2026-05-26 cs.CL

Llamion Technical Report

Llamion 技术报告

Kisu Yang, Yoonna Jang, Hyeonseok Moon, Hwanseok Jang, Taewoo Lee, Hyungjin Lee, Jeseung Lee, Juhyoung Park, Heuiseok Lim

发表机构 * VAIV Company(VAIV公司) Korea University(高丽大学) University of Copenhagen(哥本哈根大学) Samsung Electronics(三星电子)

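AI Summary Proposes the KEPT recipe for converting Orion-14B into the Llama-architecture Llamion models, recovering performance from a small amount of data through parameter mapping and knowledge distillation, and reaching leading results on KoMMLU.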

Comments Research conducted in 2024

详情
AI中文摘要

我们发布了 Llamion,一个 14B 参数的开源语言模型系列,通过将 Orion-14B 转换为标准化的 Llama 家族架构得到。该转换通过高效知识保留转换(KEPT)方法完成,该方法结合了 (i) 用于未改变模块的正常参数映射(NPM),(ii) 优化参数映射(OPM),一种无需训练的 LayerNorm 到 RMSNorm 初始化,我们证明在权重衰减引起的近零均值激活机制下该初始化是最优的,以及 (iii) 跨架构知识蒸馏(XKD),一种等大小的冻结教师蒸馏,将转换后模型的输出与源模型在任何合理输入分布上的输出对齐。Llamion 在单个 A100 上仅用约 1.23 亿 token 和四天时间,在 H6、MT-Bench 和 KoMMLU 上恢复了 Orion 的行为;Llamion-Base 在 KoMMLU 上达到 66.87%,在提交时比 Open Ko LLM Leaderboard 的次优条目高出超过 7.0 个绝对百分点。转移语料库中完全缺失的能力(Python 编程和 20 万 token 上下文处理)在架构转换后完整保留。我们发布了三个检查点(Base、Chat、LongChat),可在 Hugging Face Transformers 库中以 trust_remote_code=False 加载。

英文摘要

We release Llamion, a family of 14B-parameter open-weight language models obtained by transforming Orion-14B into the standardized Llama-family architecture. The transformation is performed by Efficient Knowledge Preservation for Transformation (KEPT), a recipe that combines (i) Normal Parameter Mapping (NPM) for unchanged modules, (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization we prove optimal under the near-zero-mean activation regime induced by weight decay, and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model's outputs with the source model's on any reasonable input distribution. Llamion recovers Orion's behaviour on H6, MT-Bench, and KoMMLU with only ~123M tokens on a single A100 in four days; Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry of the Open Ko LLM Leaderboard by >7.0 absolute points at submission time. Capabilities entirely absent from the transfer corpus (Python programming and 200K-token context handling) survive the architectural transition intact. We release three checkpoints (Base, Chat, LongChat) that load with trust_remote_code=False in the Hugging Face Transformers library.
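
The LayerNorm-to-RMSNorm step can be illustrated with a small numerical check. The sketch below is not the paper's OPM derivation. It only shows the underlying identity: when activations have exactly zero mean and the LayerNorm bias is zero (both assumptions made here), LayerNorm and RMSNorm with the same gain produce the same output. The function names `layer_norm` and `rms_norm` are hypothetical helpers written for this illustration.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # Standard LayerNorm over the last axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gain + bias

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: no mean subtraction, no bias.
    rms2 = np.mean(x ** 2, axis=-1, keepdims=True)
    return x / np.sqrt(rms2 + eps) * gain

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
x = x - x.mean(axis=-1, keepdims=True)   # force zero-mean activations (assumption)
gain = rng.normal(size=16)
zero_bias = np.zeros(16)                 # assumption: bias is zero or handled elsewhere

ln_out = layer_norm(x, gain, zero_bias)
rms_out = rms_norm(x, gain)
max_gap = np.abs(ln_out - rms_out).max()
print(max_gap)
```

With a non-zero activation mean the two outputs diverge, which is presumably why the abstract restricts its optimality claim to the near-zero-mean regime.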

2605.25674 2026-05-26 cs.LG

Stochastic Estimation of the Layer-wise Hessian Trace for Monitoring Neural-network Training

逐层Hessian迹的随机估计用于监测神经网络训练

Maxim Bolshim, Alexander Kugaevskikh

发表机构 * ITMO University, St. Petersburg, Russia(圣彼得堡ITMO大学)

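AI Summary Proposes a stochastic estimator that combines Hutchinson trace estimation with a single Hessian-vector product to obtain unbiased estimates of the trace of every per-layer diagonal Hessian block in one backward pass, and applies it to detecting the label-memorisation regime.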

Comments 9 pages, 1 table

详情
AI中文摘要

损失及其梯度范数只能微弱地区分神经网络训练的健康和病态阶段,而经验风险的曲率在两者间有质的差异,但在参数数量$P\sim 10^{6}-10^{8}$时无法显式计算。我们提出了一种神经网络经验风险Hessian矩阵对角块迹的随机估计器。该过程将Hutchinson随机迹估计与整个参数向量上的单次Hessian-向量积相结合,并在计算图的单次反向传播中恢复每层迹的无偏估计。我们证明,在权重共享下,正确性要求逐层Hessian在第二次微分之前组装:将共享权重展开为独立坐标会引入系统偏差,其符号和大小由展开Hessian的跨实例块控制。推导了固定Hessian下估计器方差的闭式表达式,以及小批量采样分布下总方差的分解。该分解产生一个临界探测次数$K^{\star}$,平衡了两个随机源,并支持在线监测模式下$K\in[5,10]$的实用建议。该估计器应用于检测ResNet-18、ResNet-34和VGG-11在CIFAR-10和CIFAR-100上的标签记忆化阶段,其中校准的累积和决策规则在虚警率$16/120$下达到了$179/180$的经验检测能力。

英文摘要

The loss and the norm of its gradient separate the healthy and the pathological regimes of neural-network training only weakly, whilst the curvature of the empirical risk differs qualitatively between them but is inaccessible explicitly at parameter counts $P\sim 10^{6}-10^{8}$. We present a stochastic estimator of the trace of the diagonal blocks of the Hessian matrix of the empirical risk of a neural network. The procedure combines the Hutchinson stochastic trace estimator with a single Hessian-vector product over the whole parameter vector and recovers unbiased estimates of every per-layer trace in one backward pass through the computational graph. We show that correctness under weight sharing requires the layer-wise Hessian to be assembled before the second differentiation: unrolling shared weights into independent coordinates introduces a systematic bias whose sign and magnitude are governed by the cross-instance blocks of the unrolled Hessian. A closed-form expression for the variance of the estimator at a fixed Hessian is derived, together with a decomposition of the total variance under the mini-batch sampling distribution. This decomposition yields a critical probe count $K^{\star}$ that balances the two sources of randomness and supports the practical recommendation $K\in[5,10]$ in the on-line monitoring regime. The estimator is applied to the detection of the label-memorisation regime of ResNet-18, ResNet-34, and VGG-11 on CIFAR-10 and CIFAR-100, where a calibrated cumulative-sum decision rule attains an empirical detection power of $179/180$ at a false-alarm rate of $16/120$.
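
A minimal sketch of the per-layer Hutchinson idea, assuming Rademacher probes and an explicit symmetric matrix standing in for the Hessian. In a real network `hvp` would be an autodiff Hessian-vector product. The names `layerwise_hutchinson_trace`, `hvp`, and `block_slices` are hypothetical, and the weight-sharing correction discussed in the abstract is not covered. One product over the whole parameter vector is sliced per layer, since the expectation of v_l^T (Hv)_l is tr(H_ll).

```python
import numpy as np

def layerwise_hutchinson_trace(hvp, block_slices, dim, num_probes=8, seed=0):
    """Estimate tr(H_ll) for every diagonal block from full-vector HVPs."""
    rng = np.random.default_rng(seed)
    estimates = np.zeros(len(block_slices))
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=dim)   # Rademacher probe over all parameters
        hv = hvp(v)                              # a single Hessian-vector product
        for i, sl in enumerate(block_slices):
            estimates[i] += v[sl] @ hv[sl]       # per-layer slice of v^T H v
    return estimates / num_probes

# Toy check: a random symmetric matrix with two "layers" of sizes 2 and 4.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
H = A + A.T
blocks = [slice(0, 2), slice(2, 6)]
est = layerwise_hutchinson_trace(lambda v: H @ v, blocks, dim=6, num_probes=5000)
exact = np.array([np.trace(H[sl, sl]) for sl in blocks])
print(est, exact)
```

The toy run uses many probes only to make the agreement visible. The abstract's recommendation of K in [5, 10] concerns the on-line monitoring regime, where mini-batch noise also contributes to the total variance.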

2605.25672 2026-05-26 cs.RO

Compliant Non-Prehensile Pushing Manipulation

顺应性非抓取推动操作

Francesco Cufino, Mario Selvaggio, Fabio Amadio, Fabio Ruggiero

发表机构 * PRISMA Lab, department of Electrical Engineering and Information Technology of the University of Naples Federico II(PRISMA实验室,那不勒斯费德里科二世大学电气工程与信息科技系) Inria, CNRS, Université de Lorraine(Inria、CNRS、洛林大学) ABB Corporate Research Center(ABB企业研究中心)

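AI Summary For non-prehensile pushing with compliant robots, proposes a framework combining impedance control and model predictive control that optimizes position/velocity set-points to achieve compliant pushing, with an energy-tank passivity filter for safe interaction.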

详情
AI中文摘要

在本文中,我们解决了使用顺应性机器人操作系统执行非抓取推动操作的挑战。为了确保在人类环境中安全操作,机器人必须顺从外部物理交互并表现出被动行为。为此,我们扩展了最先进的推动模型,将其与阻抗控制机器人集成。我们开发了一个基于该模型的模型预测控制框架,通过最优调节机器人的位置/速度设定点来实现顺应性推动,同时实现所需的推动力和接触点适应,以获得期望的物体运动。然而,外部交互可能导致跟踪误差,从而引起推动力潜在的无限增加。为了防止这种情况,我们集成了一个能量罐无源性滤波器,进一步调节机器人速度设定点以保证无源性并避免不受控制的能量积累。所提出的方法已在仿真中严格测试,并通过两个不同机器人系统的实验验证,展示了在人机交互过程中的被动顺应性,并评估了轨迹跟踪性能和对物体物理参数变化的鲁棒性。

英文摘要

In this paper, we address the challenge of performing non-prehensile pushing operations with a compliant robotic manipulation system. To ensure safe operations in human-populated environments, robots must comply with external physical interactions and exhibit passive behavior. To achieve this, we extend a state-of-the-art pushing model to integrate it with impedance-controlled robots. We develop a model predictive control framework built upon this model that enables compliant pushing through optimal modulation of the robot's position/velocity set-point, jointly realizing the required pushing force and contact point adaptation to obtain desired object motion. However, external interactions may induce tracking errors, which can in turn cause the pushing force to grow without bound. To prevent this, we integrate an energy tank passivity filter that further modulates the robot velocity set-point to guarantee passivity and avoid uncontrolled energy buildup. The proposed method has been rigorously tested in simulation and validated through experiments on two different robotic systems, demonstrating passive compliance during human-robot interactions and assessing trajectory tracking performance and robustness to variations in the object's physical parameters.
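
The energy-tank idea can be sketched generically. The code below is not the authors' filter. It assumes a scalar tank energy, a fixed lower bound `e_min`, and a simple rule that scales the velocity set-point whenever the energy the controller would inject over one time step exceeds what the tank can supply. The function `tank_filter` and all of its parameters are hypothetical.

```python
import numpy as np

def tank_filter(v_des, force, energy, dt, e_min=0.5, e_max=5.0):
    """Scale a velocity set-point so the tank energy never drops below e_min."""
    power_out = float(force @ v_des)            # power the set-point would inject
    alpha = 1.0
    if power_out > 0.0:
        available = max(energy - e_min, 0.0)    # energy the tank can still give
        alpha = min(1.0, available / (power_out * dt))
    v_cmd = alpha * v_des                       # modulated set-point
    # Negative power refills the tank, capped at e_max.
    energy_next = min(e_max, energy - float(force @ v_cmd) * dt)
    return v_cmd, energy_next

force = np.array([10.0, 0.0])
v_des = np.array([1.0, 0.0])
v_cmd, e_next = tank_filter(v_des, force, energy=1.0, dt=0.1)
print(v_cmd, e_next)
```

In this toy call the requested motion would drain 1.0 J in one step while only 0.5 J is available above `e_min`, so the set-point is halved and the tank ends the step exactly at its lower bound.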

2605.25663 2026-05-26 cs.LG cs.CV

Opportunistic Target Selection: Early Directional Commitment for Query-Efficient Black-Box Adversarial Attacks

机会目标选择:面向查询高效黑盒对抗攻击的早期定向承诺

Florent Tariolle, Florian Yger

发表机构 * INSA Rouen Normandy(鲁昂-诺曼底国立应用科学学院) LITIS

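AI Summary Proposes OTS, a lightweight method that switches an untargeted attack to a targeted one early on, locking onto the currently leading non-true class to reduce query counts and raise success rates.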

Comments 13 pages, 10 figures, 3 tables; code available at https://github.com/Tariolle/opportunistic-target-selection

详情
AI中文摘要

仅最小化真实置信度的黑盒对抗攻击存在类别漂移问题:扰动在特征空间中游荡而不承诺特定对抗类别,浪费查询在分散、无方向的进展上。我们引入机会目标选择(OTS),一种轻量级包装器,在攻击轨迹早期将无目标攻击切换为有目标目标,锁定当前领先的非真实类别。OTS不需要对底层攻击进行架构修改,不需要梯度访问,也不需要先验的目标类别知识。我们在五个标准ImageNet分类器(4500次运行)上对三种基于分数的攻击(SimBA、使用交叉熵损失的Square Attack和Bandits)验证了OTS。在随机搜索攻击上,OTS紧密跟踪oracle性能,在ResNet-50上成功率提升高达27个百分点,审查均值迭代次数相对减少43%。在梯度估计攻击(Bandits)和边际损失攻击上,OTS是冗余的,这一负面结果强化了我们将OTS解释为边际损失替代的观点。在对抗训练模型上,双峰难度分布消除了目标帮助的机制。

英文摘要

Black-box adversarial attacks that minimize only the ground-truth confidence suffer from class drift: perturbations wander through the feature space without committing to a specific adversarial class, wasting queries on diffuse, undirected progress. We introduce Opportunistic Target Selection (OTS), a lightweight wrapper that switches an untargeted attack to a targeted objective early in its trajectory, locking onto whichever non-true class currently leads. OTS requires no architectural modification to the underlying attack, no gradient access, and no a priori target-class knowledge. We validate OTS on three score-based attacks (SimBA, Square Attack with cross-entropy loss, and Bandits) across five standard ImageNet classifiers (4,500 runs). On random-search attacks, OTS closely tracks oracle performance, with gains up to +27 pp in success rate and 43% relative reduction in censored-mean iterations on ResNet-50. On gradient-estimation attacks (Bandits) and attacks with margin loss, OTS is redundant, a negative result that reinforces our interpretation of OTS as a margin-loss surrogate. On adversarially-trained models, a bimodal difficulty distribution eliminates the regime where targeting helps.
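
A minimal sketch of the wrapper idea, assuming the attack only sees class probabilities. The abstract does not state when OTS switches, so the fixed `switch_step` used here is a placeholder assumption, and the class `OpportunisticTarget` is a hypothetical name. Before the switch the objective lowers the true-class confidence, and afterwards it raises the confidence of the locked class.

```python
import numpy as np

def leading_non_true_class(probs, true_label):
    """Index of the highest-probability class other than the true one."""
    masked = np.array(probs, dtype=float)
    masked[true_label] = -np.inf
    return int(np.argmax(masked))

class OpportunisticTarget:
    def __init__(self, true_label, switch_step):
        self.true_label = true_label
        self.switch_step = switch_step    # hypothetical fixed switch point
        self.target = None                # locked adversarial class, once chosen

    def loss(self, probs, step):
        """Objective to minimise, given the model's output probabilities."""
        if self.target is None and step >= self.switch_step:
            self.target = leading_non_true_class(probs, self.true_label)
        if self.target is None:
            return probs[self.true_label]     # untargeted phase
        return -probs[self.target]            # targeted phase

ots = OpportunisticTarget(true_label=0, switch_step=2)
l0 = ots.loss([0.7, 0.2, 0.1], step=0)
l2 = ots.loss([0.5, 0.2, 0.3], step=2)      # locks onto class 2
l3 = ots.loss([0.4, 0.35, 0.25], step=3)    # stays on class 2
print(l0, l2, l3, ots.target)
```

Once locked, the target does not change even if another class overtakes it, which is the "commitment" the abstract contrasts with class drift.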

2605.25662 2026-05-26 cs.LG

Closed-Form Node Classification with Exact Graph Unlearning

具有精确图遗忘的闭式节点分类

Aditya Gaur, Charu Sharma

发表机构 * Machine Learning Lab IIIT Hyderabad(IIIT Hyderabad 机器学习实验室)

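AI Summary Proposes a routed closed-form framework selected by adjusted homophily, whose closed-form solvers (SGC with Ridge regression, or LCF-Net) match or exceed graph neural network performance and support fast exact graph-unlearning updates and privacy analysis.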

Comments 19 pages, 5 figures, 12 tables (7 main + 5 appendix)

详情
AI中文摘要

用于节点分类的图神经网络通常通过梯度下降训练数百或数千个epoch。最近的工作表明,当适当调整时,经典的GCN/SAGE/GAT架构可以在许多节点分类基准上匹配图变换器。我们提出一个互补的问题:通过确定性闭式求解器能恢复多少性能,以及这能提供什么保证? 我们引入了一个由调整同配性选择的路由闭式框架。对于同配图,我们使用SGC风格的传播后接Ridge回归;对于异配图,我们引入LCF-Net,一种逐层闭式图特征精炼网络,其每层Ridge求解由高斯核-Ridge头部限制。在14个基准上,包括ogbn-arxiv和ogbn-proteins,我们的闭式预测器在9个测量数据集中的9个上匹配或击败了最佳普通2层GCN/SAGE/GAT,在12个小基准中的9个上在1个标准差内与调优的深度配方持平,并在两个大图上超过了OGB排行榜的普通GCN。剩余的异配差距紧密跟踪从普通2层到深度SAGE的增益,表明残差差异主要是架构性的。 由于我们的预测器是确定性线性系统的显式解,修改后的图输入可以重新求解以获得重训练等效参数。我们形式化了标签、特征、边、节点和子图修改的精确图对象遗忘,证明了Ridge组件的K跳局部性,并在109个配置上验证了精确性。在ogbn-arxiv上,局部更新比完全重新求解快21-45倍,比梯度重训练快约10^6倍。结构反演实验进一步量化了精确重训练的隐私下限和近似图遗忘方法的额外泄漏。

英文摘要

Graph neural networks for node classification are typically trained by gradient descent over hundreds or thousands of epochs. Recent work has shown that, when properly tuned, classic GCN/SAGE/GAT architectures can match graph transformers on many node-classification benchmarks. We ask a complementary question: how much of this performance can be recovered by deterministic closed-form solvers, and what guarantees does this enable? We introduce a routed closed-form framework selected by adjusted homophily. For assortative graphs, we use SGC-style propagation followed by Ridge regression; for heterophilous graphs, we introduce LCF-Net, a layer-wise closed-form graph feature-refinement network whose per-layer Ridge solves are capped by a Gaussian kernel-Ridge head. Across 14 benchmarks, including ogbn-arxiv and ogbn-proteins, our closed-form predictors match or beat the best vanilla 2-layer GCN/SAGE/GAT on 9 of 9 measured datasets, tie tuned deep recipes within one standard deviation on 9 of 12 small benchmarks, and exceed the OGB-leaderboard plain GCN on both large graphs. The remaining heterophilous gap closely tracks the gain from vanilla 2-layer to deep SAGE, suggesting that the residual difference is primarily architectural. Because our predictors are explicit solutions of deterministic linear systems, modified graph inputs can be re-solved to obtain retrain-equivalent parameters. We formalize exact graph-object unlearning for label, feature, edge, node, and subgraph modifications, prove K-hop locality for Ridge components, and verify exactness across 109 configurations. On ogbn-arxiv, localized updates give $21$--$45\times$ speedups over full re-solving and roughly $10^{6}\times$ speedups over gradient retraining. Structural-inversion experiments further quantify the privacy floor of exact retraining and the additional leakage of approximate graph-unlearning methods.
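
The assortative branch and the label-unlearning claim can be sketched on a toy graph. The code assumes the usual SGC recipe (symmetric normalisation with self-loops, K propagation steps) followed by closed-form ridge regression. It then forgets one node's label with a Sherman-Morrison downdate and compares the result with re-solving on the retained rows. All names, the toy graph, and the regularisation value are assumptions for illustration, and the authors' localized update may be organised differently.

```python
import numpy as np

def sgc_features(adj, x, k=2):
    a = adj + np.eye(adj.shape[0])                        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    s = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]     # D^-1/2 A D^-1/2
    for _ in range(k):
        x = s @ x                                         # parameter-free propagation
    return x

def ridge_fit(z, y_onehot, lam):
    d = z.shape[1]
    return np.linalg.solve(z.T @ z + lam * np.eye(d), z.T @ y_onehot)

# Two triangles joined by one edge, labels 0/0/0/1/1/1.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
x = np.eye(6)[:, :4] + 0.1 * np.random.default_rng(0).normal(size=(6, 4))
y = np.eye(2)[[0, 0, 0, 1, 1, 1]]
lam = 1e-2
z = sgc_features(adj, x, k=2)

# Reference: retrain without node 5's label (the node stays in the graph).
keep = np.arange(6) != 5
w_retrain = ridge_fit(z[keep], y[keep], lam)

# Exact unlearning by a rank-one downdate of the full solution.
g_inv = np.linalg.inv(z.T @ z + lam * np.eye(z.shape[1]))
zf, yf = z[5], y[5]
g_inv_down = g_inv + np.outer(g_inv @ zf, zf @ g_inv) / (1.0 - zf @ g_inv @ zf)
w_downdate = g_inv_down @ (z.T @ y - np.outer(zf, yf))
print(np.abs(w_downdate - w_retrain).max())
```

Removing an edge or a node changes the propagated features of nearby rows as well, so those cases need more than a single rank-one correction. This is where the K-hop locality result in the abstract becomes relevant.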

2605.25661 2026-05-26 cs.CV

DRM: Diffusion-based Reward Model With Step-wise Guidance

DRM: 基于扩散的奖励模型与逐步引导

Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu

发表机构 * Peking University(北京大学) WeChat Vision, Tencent Inc.(腾讯微信视觉实验室)

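AI Summary Proposes the Diffusion-based Reward Model (DRM), which uses a pre-trained diffusion model as the evaluative backbone and exploits its step-wise evaluation ability to improve reinforcement-learning alignment and inference-time sampling, raising image generation quality.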

详情
AI中文摘要

当前主流将扩散模型与人类偏好对齐的方法通常采用基于VLM的奖励模型。然而,这些为语义对齐预训练的奖励模型难以捕捉关键的感知质量,如美学、构图和视觉和谐。在这项工作中,我们认为一个能够高保真生成的模型必须对这些视觉属性有深刻理解。基于这一见解,我们引入了基于扩散的奖励模型(DRM),这是一种新颖的范式,使用预训练的扩散模型作为强大的评估骨干。DRM的一个关键优势是其独特的能力,不仅可以评估最终图像,还可以评估生成过程中任何阶段的噪声中间潜变量。我们以两种方式利用这种逐步评估能力。首先,我们提出了逐步GRPO,一种强化学习算法,提供密集的每步奖励,以解决GRPO算法中不精确的信用分配问题,从而实现更稳定和有效的对齐。其次,我们引入了逐步采样,一种新颖的推理策略,使用DRM作为动态引导,在每一步评估多个生成路径,引导过程朝向更高质量的结果。大量实验证实,我们的方法显著提升了生成图像的最终质量。代码:https://github.com/jjaxonx/DRM。

英文摘要

Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture essential perceptual qualities such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that uses the pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in the GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.
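
The Step-wise Sampling idea, stripped of any diffusion machinery, is a guided search: at each step several candidate continuations are scored and the best one is kept. The sketch below assumes a scalar "latent", a hypothetical `propose` function returning candidates, and a hypothetical `reward` function standing in for the DRM. It illustrates the control flow only and is not the authors' sampler.

```python
def step_wise_sample(x0, propose, reward, num_steps):
    """Greedy per-step selection among candidate next states."""
    x = x0
    for step in range(num_steps):
        candidates = propose(x, step)       # several possible next latents
        x = max(candidates, key=reward)     # keep the highest-reward path
    return x

# Toy stand-ins: candidates move the state by -1, 0, or +1,
# and the reward prefers states close to 3.
final = step_wise_sample(
    0,
    propose=lambda x, t: [x - 1, x, x + 1],
    reward=lambda x: -abs(x - 3),
    num_steps=5,
)
print(final)
```

The property this relies on is that the reward can score intermediate states, which the abstract says the DRM can do for noisy latents at any stage of generation.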

2605.25659 2026-05-26 cs.CV

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

StreamChar: 基于解耦编排的长时程流式角色音频-视频生成

Linrui Tian, Qi Wang, Bang Zhang

发表机构 * Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团)

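AI Summary Proposes StreamChar, a streaming framework that uses an LLM orchestrator and a joint audio-video DiT to decouple long-horizon orchestration from short-window denoising, enabling real-time, stable, high-quality character animation generation.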

详情
AI中文摘要

实时流式联合音频-视频生成用于角色动画需要生成器说出请求的文本、跨块保持视觉身份并在严格的播放预算内运行。这些要求难以同时满足:逐块自回归生成会累积文本-音频错位和视觉漂移,而低延迟所需的少步蒸馏通常会降低空间多样性和时间质量。我们提出StreamChar,一种将长时程编排与短窗音频-视频去噪分离的流式框架。基于LLM的编排器使用文本和历史上下文生成帧对齐的音频条件,联合音频-视频DiT在参考和运动帧条件下执行局部双向去噪。为高效部署,我们使用两阶段蒸馏流程,首先压缩采样器,然后在在线块展开下微调学生模型。进度感知指针在展开训练期间将部分文本与生成的音频对齐,而汇块记忆提供持久视觉锚点以减少长时程漂移。在短片段和长时程协议上的实验表明,StreamChar在单个H100 GPU上实时运行,与最近的联合和音频驱动基线相比,在文本保真度、音视频同步、视觉质量和流式稳定性方面提供了有利的系统级权衡。

英文摘要

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.

2605.25658 2026-05-26 cs.CL cs.AI

AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization

AutoSG: 仅从任务提示出发的LLM驱动的昂贵优化求解器生成

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang

发表机构 * Xidian University(西安电子科技大学) Victoria University of Wellington(威灵顿维多利亚大学)

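AI Summary Proposes AutoSG, which generates executable customized solvers directly from natural-language prompts via retrieval-augmented generation, one-step self-refinement, and instance-free evaluation, addressing hallucination, structural dismantling, and evaluation cost in expensive optimization.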

详情
AI中文摘要

昂贵优化任务在现实应用中普遍存在,需要高度专业化的求解器。虽然LLM驱动的自动求解器生成显示出前景,但当前范式在处理昂贵优化时面临三个关键问题:由于领域知识不足导致的事实幻觉、在细化过程中频繁破坏先前建立的局部最优结构,以及在训练实例上执行带来的高昂评估成本和受限的泛化能力。为了解决这些问题,我们引入了AutoSG,一个完全自动化的流程,直接将自然语言提示转换为可执行的定制求解器。AutoSG具有三个核心创新:一个检索增强的求解器生成模块,严格将代码基于经过验证的文献;一个单步自优化算子,在保留关键结构组件的同时引入特定任务的改进;以及一个基于Elo的无实例LLM-as-a-Judge评估机制,快速建立全局排名。在多种昂贵优化任务上的广泛评估证实,AutoSG显著优于人工设计的最先进框架和现有的LLM生成的求解器。

英文摘要

Expensive optimization tasks are ubiquitous in real-world applications, demanding highly specialized solvers. While LLM-driven automated solver generation shows promise, current paradigms face three critical issues when tackling expensive optimization: factual hallucinations due to deficient domain knowledge, the frequent dismantling of previously established locally optimal structures during refinement, and the prohibitive evaluation costs alongside restricted generalization caused by executing on training instances. To address these issues, we introduce AutoSG, a fully automated workflow directly translating natural language prompts into executable customized solvers. AutoSG features three core innovations: a retrieval-augmented solver generation module strictly grounding code in verified literature; a one-step self-refinement operator introducing task-specific improvements while preserving critical structural components; and an instance-free Elo-based LLM-as-a-Judge evaluation mechanism rapidly establishing global rankings. Extensive evaluations across diverse expensive optimization tasks confirm AutoSG significantly outperforms human-designed state-of-the-art frameworks and existing LLM-generated solvers.
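
The Elo part of the evaluation mechanism can be sketched with the standard Elo update. The judged outcomes below are hard-coded placeholders. In AutoSG they would come from an LLM judge comparing two solvers, a step not modelled here. The solver names, the K-factor of 32, and the starting rating of 1000 are all assumptions.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(ratings, a, b, score_a, k=32.0):
    """Update both ratings in place; score_a is 1 (A wins), 0.5, or 0."""
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] -= k * (score_a - e_a)     # zero-sum counterpart
    return ratings

ratings = {"solver_a": 1000.0, "solver_b": 1000.0, "solver_c": 1000.0}
judged = [                                 # hypothetical pairwise verdicts
    ("solver_a", "solver_b", 1.0),
    ("solver_b", "solver_c", 1.0),
    ("solver_a", "solver_c", 1.0),
]
for a, b, s in judged:
    elo_update(ratings, a, b, s)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because each comparison needs only a judgement, not an execution on a training instance, a global ranking can be built without running the solvers. This matches the "instance-free" property described in the abstract.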

2605.25657 2026-05-26 cs.CV

ARMA-C3: A Contrastive ARMA Convolutional Framework for Unsupervised and Semi-supervised Classification

ARMA-C3: 一种用于无监督和半监督分类的对比ARMA卷积框架

VSS Tejaswi Abburi, Saurabh J. Shigwan, Nitin Kumar

发表机构 * VSS Tejaswi Abburi Saurabh J. Shigwan Nitin Kumar

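AI Summary Proposes ARMA-C3, which uses contrastive learning and graph-cut regularization to learn discriminative graph-node representations in unsupervised and semi-supervised settings, performing well on several medical imaging datasets.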

详情
AI中文摘要

在生物医学和神经退行性疾病中,由于标记数据的稀缺和成像模式的复杂性,准确和早期疾病识别仍然具有挑战性。为了解决这些问题,我们引入了ARMA-C3,一个统一的无监督和半监督图学习框架,用于基于对比学习和图割正则化的节点分类,以学习结构上有意义且具有判别性的表示。通过将样本或图像建模为图节点并利用样本间关系,所提出的框架捕获了传统机器学习方法通常忽略的受试者级别依赖关系。我们在五个临床相关数据集上进行了广泛的二分类实验:阿尔茨海默病神经影像学倡议(ADNI)、额颞叶痴呆神经影像学(NIFD)数据集以及三个医学影像基准(BreastMNIST、PneumoniaMNIST和一个肝脏超声数据集)。实验结果表明,ARMA-C3在多个评估设置中,特别是在有限监督和严重类别不平衡下,与经典聚类技术、最先进的机器学习模型以及现有的基于图的深度学习方法相比,取得了具有竞争力且通常更优越的性能。所提出的框架进一步展示了在多样化生物医学成像模态中的鲁棒表示学习和强跨模态泛化能力。

英文摘要

In biomedical and neurodegenerative disorders, accurate and early disease identification remains challenging due to the scarcity of labeled data and the complexity of imaging patterns. To address these challenges, we introduce ARMA-C3, a unified unsupervised and semi-supervised graph learning framework for node classification based on contrastive learning and graph-cut regularization to learn structurally meaningful and discriminative representations. By modeling samples or images as graph nodes and exploiting inter-sample relationships, the proposed framework captures subject-level dependencies that conventional machine learning methods typically overlook. We conduct extensive binary classification experiments across five clinically relevant datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Neuroimaging in Frontotemporal Dementia (NIFD) dataset, and three medical imaging benchmarks (BreastMNIST, PneumoniaMNIST, and a liver ultrasound dataset). Experimental results demonstrate that ARMA-C3 achieves competitive and frequently superior performance compared to classical clustering techniques, state-of-the-art machine learning models, and existing graph-based deep learning approaches across multiple evaluation settings, particularly under limited supervision and severe class imbalance. The proposed framework further demonstrates robust representation learning and strong cross-modal generalization across diverse biomedical imaging modalities.

2605.25656 2026-05-26 cs.CV

Event-based Batting Impact Estimation

基于事件的击球冲击估计

Ryotaro Ishida, Wataru Ikeda, Ryosei Hara, Akemi Kobayashi, Toshitaka Kimura, Mariko Isogawa

发表机构 * Keio University(庆应大学) NTT Communication Science Laboratories(NTT通信科学实验室)

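AI Summary Exploits the high temporal resolution and high dynamic range of event cameras to estimate batting impact timing from the weighted centroid distance between ball and bat, and introduces a mask refinement network to handle the domain gap between event frames and RGB images, cutting Mean Absolute Error by about 63% under low light and severe occlusion.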

Comments Accepted to IEEE International Conference on Image Processing (ICIP) 2026. (c) 2026 IEEE. Personal use of this material is permitted

详情
AI中文摘要

精确估计击球冲击时刻对于理解快速感觉运动控制至关重要。然而,由于时间分辨率不足和运动模糊,RGB相机难以完成此任务。同样,惯性测量单元(IMU)由于传感器侵入性和有限的时间精度,在实际比赛中不实用。为克服这些限制,我们提出了一种新颖框架,利用事件相机(具有微秒级分辨率和高动态范围)基于检测到的球与球棒之间的加权质心距离来估计冲击时刻。为解决事件帧与RGB图像之间的域差异(这会降低分割精度),我们生成高密度事件帧。然后,我们引入一个掩膜细化网络,利用这些帧和双向掩膜信息,并通过一种新颖的损失函数进行优化。在真实数据集上的实验表明,我们的方法在具有挑战性的条件下(包括低光环境和严重遮挡)实现了卓越的准确性,将平均绝对误差降低了约63%,优于基线方法。

英文摘要

Estimating the precise timing of batting impact is crucial for understanding rapid sensorimotor control. However, this task is challenging for RGB cameras due to insufficient temporal resolution and motion blur. Similarly, Inertial Measurement Units (IMUs) are impractical for actual matches due to sensor intrusiveness and their limited temporal precision. To overcome these limitations, we propose a novel framework leveraging event-based cameras, which offer microsecond resolution and high dynamic range, to estimate impact timing based on the weighted centroid distance between the detected ball and bat. To address the domain gap between event frames and RGB images that degrades segmentation accuracy, we generate high-density event frames. We then introduce a mask refinement network that leverages these frames and bidirectional mask information, optimized using a novel loss function. Experiments on real-world datasets demonstrate that our method achieves superior accuracy under challenging conditions, including low-light environments and severe occlusions, outperforming baselines by reducing the Mean Absolute Error by approximately 63%.
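
A minimal sketch of the weighted-centroid-distance idea, assuming per-frame weight maps for the ball and the bat (for example, mask pixels weighted by event counts). The abstract does not state the decision rule that turns the distance into an impact time, so taking the frame with the smallest distance is an assumption. The function names are hypothetical.

```python
import numpy as np

def weighted_centroid(weights):
    """Centroid (row, col) of a non-negative weight map, or None if empty."""
    total = weights.sum()
    if total == 0:
        return None
    rows, cols = np.indices(weights.shape)
    return np.array([(rows * weights).sum(), (cols * weights).sum()]) / total

def estimate_impact_index(ball_frames, bat_frames):
    """Frame index with the smallest ball-to-bat centroid distance."""
    dists = []
    for ball, bat in zip(ball_frames, bat_frames):
        cb, ct = weighted_centroid(ball), weighted_centroid(bat)
        dists.append(np.inf if cb is None or ct is None else float(np.linalg.norm(cb - ct)))
    return int(np.argmin(dists)), dists

def blob(r, c, shape=(8, 8)):
    f = np.zeros(shape)
    f[r, c] = 1.0
    return f

# Toy sequence: the ball approaches a fixed bat and then rebounds.
ball_frames = [blob(4, c) for c in (1, 3, 5, 3)]
bat_frames = [blob(4, 6) for _ in range(4)]
idx, dists = estimate_impact_index(ball_frames, bat_frames)
print(idx, dists)
```

With event cameras the frames can be binned at very fine time intervals, so the temporal resolution of such an estimate is set by the bin width and not by a fixed RGB frame rate.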

2605.25646 2026-05-26 cs.RO

G-DRAGON: Geospatial Reasoning and Dynamic Planning for Retrieval-Augmented Outdoor Navigation

G-DRAGON:面向检索增强的户外导航的地理空间推理与动态规划

Dongzhihan Wang, Yi Du, Jianan Sun, Yuan Xue, Yingchen Zhang, Bing Xiao, Chen Wang, Liang Xu

发表机构 * Spatial AI & Robotics Lab(空间人工智能与机器人实验室) University at Buffalo(布法罗大学) School of Future Technology(未来技术学院) Shanghai University(上海大学) University of Chinese Academy of Sciences(中国科学院大学)

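AI Summary Proposes G-DRAGON, which maps natural-language commands to local OSM entities via generative retrieval with a lightweight LLM, bridges global route planning with the SLAM system, and uses frontier-based exploration with open-set semantic voxel mapping for last-mile target localization, outperforming baselines in simulation and validated in real-world urban environments.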

Comments Accepted by IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

在大型户外环境中运行的自主地面机器人需要强大的远程导航和细粒度的“最后一英里”探索。当前视觉语言导航(VLN)的进展在短距离任务中表现良好,但缺乏长距离任务的地理空间基础。一些基于OpenStreetMap(OSM)的方法依赖云端大型语言模型(LLM),容易产生事实幻觉,且无法根据人类指令进行“最后一英里”探索。为解决这些挑战,我们提出了G-DRAGON,一个用于户外开放世界导航的检索增强框架。该框架通过基于轻量级LLM的生成式检索将自然语言命令映射到版本化的本地OSM实体,为全局路径规划生成精确坐标。高级规划模块将全局拓扑路线与SLAM系统桥接,将地理空间路点投影到机器人的可导航框架中。对于“最后一英里”,框架转换为基于前沿的探索和开放集语义体素映射,以定位开放词汇目标。仿真实验表明,我们的框架优于最先进的基线。此外,我们在未见过的真实城市环境中使用无人地面车辆(UGV)验证了该系统,成功完成了轨迹长达500米的人员搜索任务。

英文摘要

Autonomous ground robots operating in large-scale outdoor environments require both robust long-range navigation and fine-grained "last-mile" exploration. Current advances in visual-language navigation (VLN) work well on short-range tasks but lack geospatial grounding for long-distance missions. Some OpenStreetMap (OSM)-based methods relying on cloud-based Large Language Models (LLMs) are prone to factual hallucination and cannot conduct "last-mile" exploration based on human instruction. To address these challenges, we present G-DRAGON, a retrieval-augmented framework for outdoor, open-world navigation. This framework maps natural-language commands to versioned, local OSM entities via generative retrieval based on a lightweight LLM, yielding accurate coordinates for global route planning. A high-level planning module bridges global topological routes with the SLAM system, projecting geospatial waypoints into the robot's navigable frame. For the "last mile," the framework transitions to frontier-based exploration and open-set semantic voxel mapping to localize open-vocabulary targets. Experimental results in simulation demonstrate our framework outperforms state-of-the-art baselines. Furthermore, we validate the system in unseen real-world urban environments on an Unmanned Ground Vehicle (UGV), successfully completing person-search missions with trajectories of up to 500 m.
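
Frontier-based exploration relies on finding frontier cells, meaning free cells that border unexplored space. The sketch below assumes a 2D occupancy grid with a hypothetical encoding (0 free, 1 occupied, -1 unknown) and 4-connectivity. G-DRAGON works with voxel maps, so this only illustrates the concept on a simpler structure.

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1   # hypothetical cell encoding

def frontier_cells(grid):
    """Return free cells that have at least one unknown 4-neighbour."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers

grid = np.array([
    [0, 0, -1],
    [0, 1, -1],
    [0, 0, 0],
])
cells = frontier_cells(grid)
print(cells)
```

An exploration loop would repeatedly pick one of these cells as the next goal, drive there, update the map, and recompute the frontier until the target is found or no frontier remains.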

2605.25641 2026-05-26 cs.CL

Iterate Until Retrieved: Factual Nugget Optimization for Discoverable Continual Corrections in Agentic RAG

迭代直到检索到:面向可发现持续修正的事实性片段优化在智能体RAG中的应用

Moshe Hazoom, Gal Patel, Alon Talmor, Tom Hope

发表机构 * Mosaic AI The Hebrew University of Jerusalem(耶路撒冷希伯来大学)

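AI Summary Proposes Iterative Nugget Optimization (INO), which turns feedback into factual nuggets and iteratively refines them against the production agentic RAG system, improving the discoverability and usage of factual corrections.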

详情
AI中文摘要

在复杂的B2B(企业对企业)环境中,智能体检索增强生成(RAG)系统经常接收自由形式的反馈。我们关注可操作的事实性修正,而非风格、偏好或整体响应质量等通用反馈信号。我们识别这些实例并将其转化为紧凑的知识库条目,称为事实性片段。我们引入迭代片段优化(INO),一种索引时优化方法,将生产环境中的智能体RAG作为测试平台:它创建初始片段,使用触发查询及其释义进行探测,反思失败的检索和回答轨迹,并修订片段直到其可被发现。我们使用两个生产B2B知识辅助代理(一个回答公司特定知识库问题的产品支持代理,以及一个协助支持工程师的支持工单代理)在多家使用我们系统的公司中评估INO。在自动化和人工评估中,INO在事实性修正的可发现性和使用率方面持续优于基线。

英文摘要

Agentic retrieval-augmented generation (RAG) systems in complex B2B (business-to-business) settings often receive free-form response feedback. Rather than generic feedback signals such as style, preference, or overall response quality, we focus on actionable factual corrections. We identify these instances and convert them into compact knowledge-base entries, which we call factual nuggets. We introduce Iterative Nugget Optimization (INO), an index-time optimization method that uses the production agentic RAG as a test harness: it creates an initial nugget, probes it with the triggering query and paraphrases, reflects over failed retrieval and answer traces, and revises the nugget until it is discoverable. We evaluate INO with two production B2B knowledge-assistance agents across multiple companies that use our system: a product support agent that answers questions over company-specific knowledge bases, and a support ticket agent that assists support engineers. INO consistently improves results over baselines in terms of discoverability and usage of factual corrections, in automated and human evaluations.
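
The probe-reflect-revise loop can be sketched with toy components. Here a token-overlap retriever stands in for the production agentic RAG, and a string-appending `revise` function stands in for LLM reflection. Both are placeholder assumptions, as are the function names and the example texts. Only the control flow (iterate until every probe retrieves the nugget) follows the abstract.

```python
def tokens(text):
    return set(text.lower().split())

def retrieve(query, entries, top_k=1):
    """Toy retriever: rank entries by token overlap with the query."""
    ranked = sorted(entries, key=lambda e: len(tokens(query) & tokens(e)), reverse=True)
    return ranked[:top_k]

def optimize_nugget(nugget, probes, knowledge_base, revise, max_rounds=5):
    """Revise the nugget until every probe query retrieves it."""
    for rounds in range(max_rounds):
        failures = [q for q in probes
                    if nugget not in retrieve(q, knowledge_base + [nugget])]
        if not failures:
            return nugget, rounds
        nugget = revise(nugget, failures)   # stand-in for LLM reflection
    return nugget, max_rounds

kb = ["reset password via settings page", "billing cycles start monthly"]
probes = ["how long do sso tokens last",
          "when does the single sign-on session end"]
final_nugget, rounds = optimize_nugget(
    "SSO tokens expire after 12 hours",
    probes,
    kb,
    revise=lambda n, fails: n + " " + " ".join(fails),
)
print(rounds, final_nugget)
```

The second probe shares no vocabulary with the initial nugget, so it fails on the first pass and succeeds after one revision. A real system would rewrite the nugget, not append queries, and would also check that the answer actually uses the correction.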

2605.25632 2026-05-26 cs.AI cs.LG q-fin.RM

Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

为每个行动投保:自主AI代理运行时精算控制的权威边界框架

Hao-Hsuan Chen

发表机构 * Department of Risk Management and Insurance(风险管理与保险系)

AI总结 提出精算行动接口(AAI)和权威边界框架,通过确定性运行时合约对自主AI代理的副作用行动进行定价、门控和评估,实现跨领域的精算控制与基准测试。

Comments 35 pages, 4 figures, 11 tables. Companion paper on the mathematical foundations: SSRN 6761960

详情
AI中文摘要

自主AI代理越来越多地产生带有副作用的行动:数据库变更、退款、支付、外部承诺。我们提出精算行动接口(AAI),这是一个确定性的运行时合约,它在时间一致的风险映射下,对每个此类行动按照合约固定的安全默认值进行定价,并根据每个边界的储备资本预算门控执行。然后我们开发了权威边界,这是一种评估原语,用于衡量运行时在每个储备资本水平下释放的自主权威量。该框架提供:(i) 一个确定性的报价-绑定-提交协议,带有通行费限制的能力令牌;(ii) 一个通用的七类行动分类法,将异构工具调用映射到可比较的权威单位;(iii) 在alpha支出下的重放确定性和逐路径储备覆盖;(iv) 通过全储备需求C_full和资本指标Capital@k进行跨域归一化。我们在四个代理环境(数据库变更、客服退款以及公共tau-bench零售和航空工具使用轨迹)中实例化AAI,并报告一个实时Postgres面板,其中三个Azure托管的模型通过同一合约提出行动。边界在跨域中表现出常见的低储备拒绝和中间释放模式,仅在预算网格达到全储备需求时饱和;所需储备资本变化达22倍(Capital@50从289到6457)。该框架不强制域采用相同形状;它揭示每个域的精算几何。在实时面板中,合约在低预算下防止了所有三个模型的实现损失,但在拒绝下的承保持续性方面有所不同:模型身份是一个精算承保变量。贡献是一个用于自主代理副作用运行时精算控制的基准就绪评估框架。

英文摘要

Autonomous AI agents increasingly issue side-effect-bearing actions: database mutations, refunds, payments, external commitments. We propose the Actuarial Action Interface (AAI), a deterministic runtime contract that prices each such action against a contractually fixed safe default under a time-consistent risk mapping, and gates execution against a per-boundary reserve capital budget. We then develop the Authority Frontier, an evaluation primitive measuring how much autonomous authority the runtime releases at each level of reserve capital. The framework provides (i) a deterministic quote-bind-commit protocol with toll-bounded capability tokens; (ii) a universal seven-class action taxonomy mapping heterogeneous tool calls to comparable authority units; (iii) replay determinism and pathwise reserve coverage under alpha-spending; (iv) cross-domain normalization via full reserve demand C_full and capital metrics Capital@k. We instantiate AAI across four agentic environments (database mutation, customer-service refund, and the public tau-bench retail and airline tool-use traces) and report a live Postgres panel in which three Azure-hosted models propose actions through the same contract. The frontier exhibits a common low-reserve refusal and intermediate-release pattern across domains, with saturation only where the budget grid reaches full reserve demand; required reserve capital varies by 22x (Capital@50 from 289 to 6457). The framework does not force domains into the same shape; it surfaces each domain's actuarial geometry. In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable. The contribution is a benchmark-ready evaluation framework for runtime actuarial control of autonomous-agent side effects.
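
As a rough illustration of gating priced actions against a reserve budget, the following Python sketch may help. It is a toy under stated assumptions, not the AAI contract or the paper's Authority Frontier definition: the `Action` class, the `reserve_demand` prices, the `gate` and `released_fraction` helpers, and the in-order release rule are all hypothetical, and a denied action is assumed to fall back to the safe default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    reserve_demand: float  # hypothetical price in reserve-capital units


def gate(actions, budget):
    """Release actions in order while reserve remains; deny the rest."""
    remaining = budget
    decisions = []
    for action in actions:
        if action.reserve_demand <= remaining:
            remaining -= action.reserve_demand
            decisions.append((action.name, "release"))
        else:
            # Denied actions are assumed to revert to the safe default.
            decisions.append((action.name, "deny"))
    return decisions


def released_fraction(actions, budget):
    """Share of proposed actions released at a given reserve budget."""
    decisions = gate(actions, budget)
    return sum(1 for _, verdict in decisions if verdict == "release") / len(actions)


# Hypothetical actions and budget grid.
actions = [Action("update_row", 5.0), Action("refund", 40.0), Action("payment", 120.0)]
curve = {budget: released_fraction(actions, budget) for budget in (0, 10, 50, 165)}
```

Sweeping the budget in this toy reproduces the qualitative shape the abstract mentions: refusal at low reserve, partial release in between, and full release only once the budget reaches the summed demand of all actions (165 here).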