arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12087 2026-06-11 cs.CL New submission

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

Jia Deng, Yimeng Chen, Xiaoqing Xiang, Ziyang Zeng, Shuo Tang, Wayne Xin Zhao, Feng Chang, Chuan Hao, Yuan Wei, Ran Tao, Bryan Dai, Ji-Rong Wen

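Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; KAUST; IQuest Research; Shanghai Jiao Tong University

AI summary: Proposes the FORT framework, which synthesizes shortcut-resistant training data by controlling four shortcut risks, so that search agents search longer before answering and show fewer shortcut patterns; the resulting agent, trained with SFT only, achieves the best overall performance among comparable-size open-source search agents.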

Comments: 30 pages

Abstract

Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route. We formalize this gap with a shortcut-aware difficulty framework and identify four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, we use trajectory signatures including solving cost, answer hit time, and prior-shortcut rate. Guided by this framework, we introduce FORT, a Framework of Shortcut-Resistant Training-Data Synthesis. FORT constructs shortcut-resistant training data by controlling shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT induces longer pre-answer search and fewer shortcut patterns than existing open-source deep search datasets. Using the resulting trajectories, we train FORT-Searcher with supervised fine-tuning (SFT) only, and it achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Relevant resources will be made available at this https URL.

2606.12086 2026-06-11 cs.AI cs.LG New submission

IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

Mingjia Li, Jin Wu, Hong Qian, Wenhao Huang, Yiyang Huang, Yiwen Zhang, Chanjin Zheng, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

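Affiliations: East China Normal University; Shanghai Innovation Institute

AI summary: Proposes IntElicit, a framework that optimizes a dialogue policy with a decomposed process reward mechanism, reducing non-creative confounders during interaction so that contextualized creativity is elicited and assessed more effectively.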

Abstract

Contextualized assessment offers high ecological validity for evaluating creativity but introduces a critical challenge: observed performance may be confounded with cognitive proficiency (domain knowledge) and agency (willingness to engage). Meanwhile, in the age of generative AI, creative problem solving increasingly occurs in tool-mediated and human-AI interactive environments, making fully static assessment less aligned with contemporary creative practice. To address these issues, this paper proposes IntElicit, a framework for eliciting and assessing contextualized creativity via dialogue policy optimization. IntElicit functions as a constrained adaptive AI Interviewer: it provides non-directive knowledge and agency scaffolds in multi-turn interaction to reduce non-creative confounders, while preserving participants' responsibility for generating the creative content being evaluated. Specifically, to tackle sparse rewards and potential reward hacking (e.g., answer dictation) in open-ended educational dialogue, IntElicit introduces a decomposed process reward mechanism. This mechanism aligns the policy with pedagogical elicitation, rewarding prompts that draw out participant reasoning rather than producing optimal answers on their behalf. Extensive experiments, including participant simulation and a human subject study (N=64), show that IntElicit improves elicited creative outcomes over expert-designed baselines. Together, the results suggest that interactive elicitation can reveal creative potential that static FPSP-style assessment may miss, providing a formative and diagnostic lens for contextualized creativity assessment in AI-mediated learning contexts.

2606.12077 2026-06-11 cs.LG New submission

Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization

Yifan Wang, Lifeng Shen, Shuyin Xia, Yi Wang

Affiliations: Chongqing Key Laboratory of Computational Intelligence, Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Sichuan-Chongqing Co-construction Key Laboratory of Digital Economy Intelligence and Key Laboratory of Big Data Intelligent Computing, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications; Chongqing Ant Consumer Finance Co., Ltd., Ant Group

AI summary: Proposes MSRGC-Net, a framework that combines training-free reservoir computing, granular-ball-based anchor graph construction, and consensus learning for efficient and accurate time-series clustering.

Comments: Accepted by IJCAI 2026

Abstract

Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency. Similarity-based methods often suffer from quadratic complexity caused by pairwise distance computations, while deep learning-based approaches typically rely on costly iterative training and a large number of trainable parameters. In this paper, we propose MSRGC-Net, an efficient time-series clustering framework that integrates multiscale reservoir computing, granular-ball-based anchoring graph construction, and consensus learning. MSRGC-Net adopts a training-free reservoir computing paradigm to extract multiscale temporal representations from raw time series without backpropagation, significantly reducing computational overhead. To capture the intrinsic structure of the resulting representations, granular-ball computing is employed to adaptively model data distributions via density-consistent regions, yielding compact and robust anchor graph representations. Furthermore, a consensus-based anchoring graph optimization strategy is introduced to effectively align multiscale reservoir representations and integrate complementary information across temporal scales. Extensive experiments on widely used univariate and multivariate benchmark datasets demonstrate that MSRGC-Net consistently outperforms state-of-the-art methods in clustering performance while maintaining superior computational efficiency.

2606.12074 2026-06-11 cs.CV cs.AI eess.IV New submission

Non-frontal face recognition using GANs and memristor-based classifiers

Semih Vazgecen, Cristian Sestito, Spyros Stathopoulos, Themis Prodromakis

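Affiliations: Centre for Electronics Frontiers, Institute for Integrated Micro and Nano Systems, School of Engineering, The University of Edinburgh

AI summary: Proposes combining lightweight GAN-based pose frontalisation with memristor-based neuromorphic recognition to handle non-frontal face recognition, reaching up to 96% identification accuracy on the evaluated datasets.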

Comments: 12 pages, 4 figures, 1 Supplementary (22 pages, 16 figures, 6 tables, 4 supplementary notes)

Abstract

Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.

2606.12072 2026-06-11 cs.CV New submission

World Model Self-Distillation: Training World Models to Solve General Tasks

Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro

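Affiliations: Department of Computer Science

AI summary: Proposes a framework that combines self-distillation with reinforcement learning to draw task-solving ability out of pretrained video generators without paired task videos, surpassing the original model on benchmarks.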

Abstract

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.

2606.12070 2026-06-11 cs.RO New submission

Fibration Trees: A Unified Approach to Multi-Robot Motion Planning

Andreas Orthey, Florian T. Pokorny, Lydia E. Kavraki

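Affiliations: Technical University of Berlin; KTH Royal Institute of Technology; Rice University and the Ken Kennedy Institute

AI summary: Proposes fibration trees, a unified framework that models projections as fibrations and brings together prioritization, parallel decomposition, and task-space projection, and develops the probabilistically complete Fibration-RRT planner for high-dimensional multi-robot motion planning.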

Comments: 23 pages, 12 figures

Abstract

State space projections and decompositions have emerged as powerful tools to tackle the curse of dimensionality in high-dimensional, multi-robot motion planning problems. However, existing methods lack a unified framework that seamlessly handles combinations of projections (prioritization or task-space) and decompositions (parallel or decoupled subspaces). To fill this gap, we introduce fibration trees, which are trees consisting of state spaces as nodes and fibrations as edges, whereby a fibration models a projection from a higher-dimensional space to a lower-dimensional (or simplified) space. By modeling projections as fibrations, we unify sequential prioritization, parallel decomposition, and task-space projections under a single, coherent formalism. Building on this, we develop the rapidly-exploring random fibration trees (Fibration-RRT) planner, a sampling-based motion planner that generalizes strategies from quotient-space RRT (for sequential prioritizations) and discrete RRT (for parallel decompositions), while allowing the inclusion of task-space projections. Fibration-RRT operates on user-defined fibration trees and is proven to be probabilistically complete. To test the generality and efficiency of Fibration-RRT, we provide an open-source implementation and conduct experiments on 32 scenarios using multi-robot teams with up to 96 degrees of freedom. Our results indicate that Fibration-RRT efficiently solves high-dimensional problems by exploiting user-defined fibration trees, thereby establishing fibration trees as a powerful, unified framework for multi-robot motion planning.

2606.12069 2026-06-11 cs.CV New submission

Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

Hong Li, Yankang Dong, Yue Xu, Yihan Tang, Mingzhu Li, Jiamin Qiu, Qihang Yao, Xing Zhu, Yujun Shen, Nan Xue, Yong-Lu Li

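Affiliations: Shanghai Jiao Tong University; Ant Group; Shanghai Innovation Institute

AI summary: Proposes Tac-DINO, which builds a large-scale tactile dataset and a vision-tactile holographic matching benchmark and uses patch alignment to learn local-to-global vision-tactile representations, outperforming methods without alignment.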

Abstract

Touch is the primary medium through which humans interact with the environment. Currently, tactile learning mainly focuses on image-level pretraining or alignment. However, tactile signals correspond to local object contact, while research into scale alignment and holographic matching remains limited, and suitable datasets and benchmarks are also lacking. To bridge this gap, we first construct a data collection system to acquire a large-scale tactile dataset, with over 20K tactile contacts from 505 real-world objects. Building on this dataset, we design a Vis-Tac Holographic Matching Benchmark to evaluate vision-tactile local-to-global alignment ability. Then we propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these exceed the performance of methods without alignment and align with whole-object images.

2606.12059 2026-06-11 cs.LG cs.NE nlin.AO New submission

Attention by Synchronization in Coupled Oscillator Networks

Fabio Pasqualetti, Taosha Guo

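Affiliations: University of California, Irvine

AI summary: Proposes fixed-query oscillator attention based on Kuramoto synchronization dynamics, which computes attention on physical substrates without exponentiation or global reduction and outperforms softmax on keyword spotting and subject-verb agreement.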

Abstract

We address transformer attention on energy-constrained physical substrates. Softmax attention requires exponentiation and global reduction, operations with high energy cost on von Neumann hardware and no natural physical analog. We show that Kuramoto synchronization dynamics (which arise in electrical, mechanical, superconducting, and charge-density-wave oscillator arrays, among other physical systems) implement a well-defined attention operation without either. The resulting mechanism, fixed-query oscillator attention, replaces softmax's arithmetic with the equilibration of a gradient flow on the sphere: queries are learned anchors fixed on the sphere, and free oscillators evolve under Kuramoto-Lohe dynamics until they settle at positions encoding attention weights via cosine similarity. Because the computation is equilibration, it requires no exponentiation; the only global operation is an affine normalization at readout. The fixed point is provably unique and globally attractive from almost every initial condition, a guarantee that holds across every physical realization. Empirically, at the minimal hardware configuration (oscillator dimension $d_{\mathrm{osc}} = 2$), oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and on subject-verb agreement (+5.27 pp on hard sentences, with zero training failures versus one in five for softmax). On causal language modeling, where softmax retains an advantage, oscillator attention closes the gap as $d_{\mathrm{osc}}$ grows: from +11.09 PPL at $d_{\mathrm{osc}} = 2$ to +2.98 PPL at $d_{\mathrm{osc}} = 32$ on WikiText-2, and from +2.39 PPL at $d_{\mathrm{osc}} = 2$ to +0.57 PPL at $d_{\mathrm{osc}} = 32$ on TinyStories. The main objective of this work is not to replace softmax in software but to provide a mathematically grounded blueprint for accurate attention on physical substrates.
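
For readers unfamiliar with Kuramoto synchronization, a minimal sketch of the classic phase model may help. It is not the paper's fixed-query mechanism, nor its Kuramoto-Lohe dynamics on the sphere; the function names, coupling strength, step size, and step count below are assumptions chosen only for illustration. Identical oscillators on the circle with all-to-all coupling are integrated with explicit Euler steps, and the order parameter r measures how synchronized they are (r close to 1 means the phases have locked).

```python
import math
import random


def order_parameter(theta):
    """Magnitude r of the mean unit phasor: 0 = incoherent, 1 = fully synchronized."""
    n = len(theta)
    re = sum(math.cos(t) for t in theta) / n
    im = sum(math.sin(t) for t in theta) / n
    return math.hypot(re, im)


def kuramoto_step(theta, coupling, dt):
    """One explicit Euler step of d(theta_i)/dt = (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(theta)
    updated = []
    for ti in theta:
        drive = sum(math.sin(tj - ti) for tj in theta) / n
        updated.append(ti + dt * coupling * drive)
    return updated


def simulate(theta0, coupling=2.0, dt=0.05, steps=2000):
    """Integrate identical oscillators (zero natural frequency) from initial phases theta0."""
    theta = list(theta0)
    for _ in range(steps):
        theta = kuramoto_step(theta, coupling, dt)
    return theta
```

Calling `simulate` on randomly drawn phases and comparing `order_parameter` before and after shows the drift toward a common phase. The computation is a relaxation to equilibrium, with no exponentials involved, which is the property the abstract highlights.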

2606.12054 2026-06-11 cs.LG New submission

Simplicity Suffices for Parameter Noise Injection in Stochastic Gradient Descent

Benjamin Leblanc, Louis-Jacob Lebel, Teddy Kana, Richard Kamel

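Affiliations: Université Laval

AI summary: Studies parameter noise injection in stochastic gradient descent, proposes an efficient per-example noise injection method for linear layers, and shows experimentally that simple isotropic noise recovers most of the optimization and generalization benefit of more complex schemes.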

Comments: Accepted at the Data Science Meets Optimisation workshop in IJCAI 2026

Abstract

Injecting noise into the optimization process is a well-established technique for improving the training and generalization of deep neural networks. Yet, despite the breadth of existing approaches, it remains unclear which design choices truly matter in practice. In this work, we investigate parameter noise injection for stochastic gradient descent, focusing on two key questions: how to efficiently pair each training example with its own perturbation in mini-batch training, and whether sophisticated noise parameterizations or multi-sample gradient averaging yield meaningful gains over simpler alternatives. To address the first question, we leverage a distributional identity for linear layers that allows per-example noise injection without breaking batched computation. To address the second, we systematically compare several diagonal Gaussian parameterizations against an isotropic baseline across varying noise levels on CIFAR100. Our results consistently show that simple, lightweight strategies, isotropic noise with a single perturbed forward pass per update step, recover most of the benefit of more complex schemes. These findings suggest that simplicity suffices for parameter noise injection, and that practitioners need not resort to elaborate perturbation designs to reap the optimization and generalization benefits of noisy SGD.
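
One well-known identity of this kind, shown here as a sketch and not necessarily the exact one the paper uses: if every weight of a linear layer receives independent Gaussian noise with standard deviation sigma, the output for an input x has the same distribution as the clean output plus Gaussian noise with standard deviation sigma times the Euclidean norm of x. Each example can therefore get its own perturbation without building a separate weight matrix per example. All function names and the list-of-lists layout below are assumptions for illustration.

```python
import math
import random


def linear(W, x):
    """Plain matrix-vector product for a list-of-rows matrix W and a list x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def noisy_linear_explicit(W, x, sigma, rng):
    """Draw a fresh isotropic Gaussian perturbation of W for this example, then apply it."""
    W_noisy = [[w + sigma * rng.gauss(0.0, 1.0) for w in row] for row in W]
    return linear(W_noisy, x)


def noisy_linear_implicit(W, x, sigma, rng):
    """Same output distribution without materializing a perturbed weight matrix."""
    scale = sigma * math.sqrt(sum(v * v for v in x))  # sigma * ||x||_2
    return [y + scale * rng.gauss(0.0, 1.0) for y in linear(W, x)]


def variance(samples):
    """Population variance of a list of floats."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)
```

In a batched setting the implicit form costs one clean matrix product plus one noise draw per output entry, which is why per-example perturbations stay compatible with batched computation.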

2606.12051 2026-06-11 cs.CV New submission

MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID

Xulin Li, Yan Lu, Bin Liu, Qinhong Yang, Qi Chu, Tao Gong, Nenghai Yu

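Affiliations: University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; The Chinese University of Hong Kong

AI summary: Proposes the Multi-Frequency Expert Network (MFEN), which performs multi-frequency modulation and adaptively combines frequency bands through a mixture-of-experts design, together with Random Frequency Augmentation and Frequency Auxiliary Optimization, to address the modality discrepancy between visible and infrared images.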

Comments: CVPR Highlight

Abstract

Visible-infrared person re-identification (VI-ReID) is challenging due to the large modality discrepancy between visible and infrared images. We contend that this discrepancy is largely related to differing lighting conditions, including differences in light wavelength and light source type. Recently, frequency-based VI-ReID approaches have achieved notable success because frequency information can better extract identity-relevant contours and details while excluding irrelevant lighting and color. However, existing methods either do not distinguish different frequency bands or focus on only one band, which is insufficient under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different bands through a mixture-of-experts design. We further introduce Random Frequency Augmentation (RFA) and Frequency Auxiliary Optimization (FAO) to better train MFEN. The three modules are complementary and jointly capture critical frequency-domain details for robust representation learning. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.

2606.12050 2026-06-11 cs.LG math.DS New submission

Reliable Error Estimation for PINNs: Lower and Upper A Posteriori Bounds

Ismail Huseynov, Arzu Ahmadova, Agamirza Bashirov

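Affiliations: Physikalisch-Technische Bundesanstalt (PTB); Technical University of Berlin; Weierstrass Institute for Applied Analysis and Stochastics; Eastern Mediterranean University

AI summary: Derives computable a posteriori lower error bounds for PINN solutions of ordinary differential equations, combines them with tighter upper bounds under a localized one-sided Lipschitz condition to obtain two-sided error enclosures, and discusses how the treatment of initial conditions affects the lower bound.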

Abstract

Physics-informed neural networks (PINNs) combine machine learning with physical laws to solve differential equations. While existing results provide rigorous a posteriori upper bounds for PINN prediction errors, complete certification also requires complementary lower information in order to obtain computable two-sided error enclosures. In this paper, we derive computable a posteriori lower bounds for PINN errors in ordinary differential equations on suitable certified state-space domains under a localized strong monotonicity condition. We combine these estimates with complementary localized upper bounds under a one-sided Lipschitz condition, which is weaker than the global Lipschitz assumption used in previous work and can yield sharper upper error bands. The resulting bounds depend only on the neural-network approximation, the ODE residual, and local monotonicity and growth constants, and therefore do not require access to the exact solution. For linear time-invariant and time-varying systems, we further derive explicit formulas in terms of the minimal and maximal eigenvalues of the symmetric part of the system matrix. We also discuss the distinction between soft and hard enforcement of initial conditions in PINNs and explain why exact enforcement can make the scalar lower certificate uninformative. To recover nontrivial lower information in the linear setting, we use a signed-residual finite-probe certificate based on coordinate unit vectors. We also formulate a certificate-informed training strategy in which the propagated upper certificate is used as an auxiliary regularizer, while lower certificates remain post-training diagnostics. Altogether, the proposed framework provides rigorous and practically computable error certificates for PINN approximations of ODEs, while making explicit the domains and model classes for which the assumptions can be verified.
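
To see why the eigenvalues of the symmetric part appear, consider the textbook energy argument for a linear time-invariant system. This is a sketch under assumed notation and not necessarily the paper's exact certificate: $u' = Au + f$ is the true system, $\hat u$ the network approximation, $r = \hat u' - A\hat u - f$ its residual, and $e = \hat u - u$ the error, so that $e' = Ae + r$.

```latex
\begin{aligned}
A_s &= \tfrac12\,(A + A^\top), \qquad
\frac{d}{dt}\,\tfrac12\|e\|^2 = e^\top A_s\, e + e^\top r, \\[4pt]
\lambda_{\min}(A_s)\,\|e\| - \|r\|
  &\le \frac{d}{dt}\|e\|
  \le \lambda_{\max}(A_s)\,\|e\| + \|r\|
  \qquad \text{(wherever } e \neq 0\text{)}, \\[4pt]
\|e(t)\| &\le e^{\lambda_{\max}(A_s)\,t}\,\|e(0)\|
  + \int_0^t e^{\lambda_{\max}(A_s)(t-s)}\,\|r(s)\|\,ds, \\[4pt]
\|e(t)\| &\ge e^{\lambda_{\min}(A_s)\,t}\,\|e(0)\|
  - \int_0^t e^{\lambda_{\min}(A_s)(t-s)}\,\|r(s)\|\,ds .
\end{aligned}
```

In this simple form, the lower bound is never positive when $e(0) = 0$, which is consistent with the abstract's remark that exact (hard) enforcement of the initial condition can leave a scalar lower certificate uninformative.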

2606.12048 2026-06-11 cs.RO New submission

Point Cloud Segmentation for Autonomous Clip Positioning in Laparoscopic Cholecystectomy on a Phantom

Balázs Gyenes, Nikolai Franke, Paul Maria Scheikl, Pit Henrich, Rayan Younis, Gerhard Neumann, Martin Wagner, Franziska Mathis-Ullrich

Affiliations: Karlsruhe Institute of Technology; HIDSS4Health - Helmholtz Information and Data Science School for Health; Friedrich-Alexander-University, Erlangen-Nuremberg; University Hospital Carl Gustav Carus and Centre for Tactile Internet with Human-in-the-loop (CeTI), Dresden University of Technology

AI summary: Presents the first robotic system to perform autonomous clip positioning on a laparoscopic surgery phantom, extracting target positions through point cloud segmentation and spline interpolation and overcoming data scarcity with pre-training on synthetic data and two data augmentation techniques; it reaches 0.75 mm precision and a 100% clip positioning success rate.

Comments: 8 pages, 5 figures, accepted to IEEE Robotics and Automation Letters (RAL)

Abstract

High-risk applications in robotics, such as robot-assisted surgery, present unique challenges. These systems must be both highly precise and interpretable in order to be deployed in environments with very low tolerance for error or unsafe exploration. We present the first robotic system to demonstrate autonomous clip positioning on a physical phantom in laparoscopic surgery, one of the most common interventions in general surgery. After segmentation of a colorless point cloud from a single camera, target positions for the clips are extracted using spline interpolation, and can then be adjusted by the human operator. The segmentation model is trained on only 60 hand-labeled real point clouds, reflecting data scarcity in the surgical domain. We overcome this with a combination of pre-training on 128,000 synthetic point clouds and two novel data augmentation techniques. The motion of the end-effector to each target is visualized for the operator, satisfying the unique motion constraints of minimally invasive surgery while ensuring that the robot's actions are verifiable and interpretable. In real robot experiments, our system localizes targets with the required precision of 0.75 mm at a 95% success rate and executes autonomous clip positioning with a 100% success rate. We provide insights that are applicable to many other surgical and non-surgical tasks that require identifying and navigating to a precise target. Source code and project page: this https URL

2606.12047 2026-06-11 cs.CV cs.AI stat.ML New submission

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

Tarandeep Singh, Soumyanetra Pal, Soham Biswas, Nishanth Chandran

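Affiliations: Netradyne

AI summary: Proposes a three-stage pipeline that uses vision-language similarity, metadata-driven multi-prompt reasoning, and open-vocabulary detection to perform zero-shot temporal localization, semantic classification, and spatial grounding of accidents in video, substantially improving performance.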

Comments: Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 15

Abstract

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact using an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enables more reliable reasoning with vision-language models than direct prompting alone.
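
Two of the aggregation steps named in the abstract are simple enough to sketch. The code below is an illustrative guess at their general shape, not the authors' implementation: the data layout (a list of `(x, y, score)` box centres), the function names, and the 0.9-bit entropy threshold are all assumptions. The first function averages detection centres weighted by confidence; the other two measure disagreement among the labels returned by several prompts and flag cases where an extra adjudication pass might be triggered.

```python
import math
from collections import Counter


def score_weighted_centroid(detections):
    """Confidence-weighted mean of detection centres given as a list of (x, y, score)."""
    total = sum(score for _, _, score in detections)
    if total <= 0:
        raise ValueError("detections must carry positive total score")
    cx = sum(x * score for x, _, score in detections) / total
    cy = sum(y * score for _, y, score in detections) / total
    return cx, cy


def vote_entropy(labels):
    """Shannon entropy in bits of the empirical distribution over predicted labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def needs_adjudication(labels, threshold_bits=0.9):
    """Gate: only escalate to an adjudicator when the prompt views disagree enough."""
    return vote_entropy(labels) > threshold_bits
```

With five views, a 4-to-1 vote has entropy of about 0.72 bits and passes through under this hypothetical threshold, while a 3-to-2 split (about 0.97 bits) would be escalated.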

2606.12042 2026-06-11 cs.RO New submission

KinematicRL: A Sim-to-Real Reinforcement Learning Framework For Social Navigation With Kinodynamic Feasibility

Zhiming Xu, Haodong Yang, Chengju Liu, Qijun Chen, Chenpeng Yao

Affiliations: School of Computer Science and Technology, Tongji University; Department of Electronics and Information Engineering, Tongji University; Shanghai Institute of Intelligent Science and Technology, Tongji University

AI summary: Proposes the KinematicRL framework, which addresses the dynamic feasibility of sim-to-real social navigation with a second-order control action space, cluster-based human tracking from 2D LiDAR, and an unbiased residual gating block.

Comments: Accepted by IEEE Transactions on Automation Science and Engineering (T-ASE)

Abstract

Deep Reinforcement Learning (DRL) has shown promise for social navigation, yet its real-world deployment remains hindered by a persistent sim-to-real gap arising from simplified first-order dynamics and context-specific human state estimation pipelines. This work presents a unified framework that addresses these limitations to produce dynamically feasible navigation policies suitable for real-world deployment. First, theoretical analysis reveals that the tracking error between simulated and actual robot positions decays exponentially with increased control order, motivating the use of higher-order control inputs as the DRL action space. A second-order control formulation tailored to differential drive robots is developed, complemented by a stochastic iterative Linear Quadratic Regulator (iLQR) that pretrains the policy via a divergence minimization objective. Second, to avoid the added system complexity of camera-LiDAR fusion, a cluster-based human tracking pipeline using only 2D LiDAR is introduced. Human detections are associated according to both spatial proximity and velocity similarity, enabling reliable differentiation of nearby pedestrians and yielding stable velocity estimates through temporal aggregation. Third, we introduce an unbiased residual gating block to balance reaction- and memory-based behaviors while handling time-varying crowd sizes, both critical for social navigation. The resulting policy, KinematicRL, consistently improves kinematic performance and adapts to a varying number of detected humans. Experiments in real-world environments demonstrate that, when combined with the proposed tracking pipeline, KinematicRL can be deployed on a real differential drive robot with minimal modifications.
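To make the idea of a second-order action space concrete, here is a minimal sketch of a differential-drive model in which the policy outputs linear and angular accelerations rather than velocities. The Euler integration, the time step, and the velocity limits `v_max` and `w_max` are illustrative assumptions and are not taken from the paper.

```python
import math


def step(state, action, dt=0.1, v_max=1.0, w_max=1.5):
    """Advance a differential-drive robot by one Euler step.

    state:  (x, y, theta, v, w) with linear velocity v and angular velocity w
    action: (a_v, a_w), linear and angular accelerations (second-order input)
    The limits v_max and w_max are hypothetical values for illustration.
    """
    x, y, theta, v, w = state
    a_v, a_w = action
    # Accelerations change the velocities, which are then clipped to limits.
    v = max(-v_max, min(v_max, v + a_v * dt))
    w = max(-w_max, min(w_max, w + a_w * dt))
    # Velocities change the pose.
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += w * dt
    return (x, y, theta, v, w)


# Accelerate straight ahead from rest for one second.
state = (0.0, 0.0, 0.0, 0.0, 0.0)
for _ in range(10):
    state = step(state, (0.5, 0.0))
```

Because the velocity is itself a state, commanded motion can never jump discontinuously, which is one intuition for why higher-order inputs are easier for a real robot to track.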

2606.12036 2026-06-11 cs.CV 新提交

Vision Transformers for Face Recognition Need More Registers

人脸识别的视觉Transformer需要更多寄存器

Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros

发表机构 * Fraunhofer Institute for Computer Graphics Research IGD(弗劳恩霍夫计算机图形研究所) Department of Computer Science, TU Darmstadt(达姆施塔特工业大学计算机科学系)

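AI summary: To address artifacts in the attention maps of ViTs for face recognition, introduces register tokens to improve interpretability; the ViT-8R model achieves state-of-the-art performance among ViT-based FR models on IJB-B and IJB-C.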

详情
Comments
Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)
AI中文摘要

近期,用于人脸识别(FR)的视觉Transformer(ViT)的进展已超越了标准的CLS令牌范式。在该范式中,一个特殊的分类令牌(CLS)被前置到补丁嵌入中,并用作输入的下游任务表示。另一种方法,即拼接补丁嵌入(CPE),则通过将所有补丁令牌拼接成一个单一向量来利用它们,然后将其投影为紧凑的人脸表示。与基于CLS的方法相比,CPE已被证明能提高识别性能,但我们对注意力图的定性分析显示存在限制其可解释性的伪影。为解决此问题,我们引入了寄存器令牌,这些可学习令牌被拼接到初始补丁嵌入中,并通过ViT编码器块联合处理。与基线ViT相比,该机制已被证明能产生更结构化和可解释的注意力图。我们通过实验证明,这些伪影在各种ViT骨干网络(包括小型和大型模型)中一致出现,而引入寄存器令牌能有效缓解它们。添加四个或八个寄存器显著增强了可解释性,其中八个寄存器提供了最高的验证准确率和最平滑的注意力结构。我们最终的模型ViT-8R,对应一个基于CPE的ViT-B架构并增加了八个寄存器令牌,在大规模IJB-B和IJB-C基准测试中,在基于ViT的FR模型中达到了最先进的性能。此外,与基线模型相比,ViT-8R产生了明显更清晰的注意力图,这为模型的注意力行为提供了更深入的见解(此 https URL )。

英文摘要

Recent advances in Vision Transformers (ViTs) for face recognition (FR) have moved beyond the standard CLS-token paradigm. In this paradigm, a special classification token (CLS) is prepended to the patch embeddings and used as a representation of the input for downstream tasks. An alternative approach, Concatenated Patch Embeddings (CPE), instead leverages all patch tokens by concatenating them into a single vector, which is then projected into a compact face representation. CPE has been shown to improve recognition performance in comparison to CLS-based approaches, but our qualitative analysis of attention maps showed the presence of artifacts that limit their interpretability. To address this issue, we incorporate register tokens, learnable tokens that are concatenated to the initial patch embeddings and processed jointly through the ViT encoder blocks. This mechanism has been shown to produce more structured and interpretable attention maps compared to baseline ViT. We empirically demonstrate that these artifacts consistently appear across various ViT backbones, including small and large models, and that introducing register tokens effectively mitigates them. Adding four or eight registers significantly enhances interpretability, with eight registers providing the highest verification accuracies and smoothest attention structures. Our resulting model, ViT-8R, a CPE-based ViT-B architecture augmented with eight register tokens, achieves state-of-the-art performance among ViT-based FR models on large-scale IJB-B and IJB-C benchmarks. In addition, ViT-8R produces substantially clearer attention maps than the baseline model, which offers deeper insight into the model's attention behavior ( this https URL ).
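The token bookkeeping behind registers and CPE can be sketched with plain lists. This is only a shape-level illustration: the encoder is replaced by an identity stand-in, and dropping the register outputs before concatenating the patch outputs is an assumption that follows the usual register-token convention rather than a detail stated in the abstract. All function names are hypothetical.

```python
def add_registers(patch_tokens, register_tokens):
    """Append learnable register tokens to the patch-token sequence."""
    return list(patch_tokens) + list(register_tokens)


def identity_encoder(tokens):
    """Stand-in for the ViT encoder blocks (assumption: shapes unchanged)."""
    return tokens


def cpe_vector(encoded_tokens, num_patches):
    """Concatenate the encoded patch tokens into one flat vector.

    Register outputs (everything after the first num_patches tokens)
    are discarded here, which is an assumed design choice.
    """
    patch_outputs = encoded_tokens[:num_patches]
    return [value for token in patch_outputs for value in token]


# Toy sizes: 4 patches of width 2, plus 8 registers as in ViT-8R.
patches = [[float(i), float(i) + 0.5] for i in range(4)]
registers = [[0.0, 0.0] for _ in range(8)]
sequence = add_registers(patches, registers)
flat = cpe_vector(identity_encoder(sequence), num_patches=len(patches))
```

In a real model the flat vector would then pass through a projection layer to obtain the compact face representation.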

2606.12033 2026-06-11 cs.CV 新提交

SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

SpikeTAD:用于端到端时序动作检测的脉冲神经网络

Min Yang, Mi Zhou, Limin Wang

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室)

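AI summary: Proposes SpikeTAD, the first SNN-based end-to-end temporal action detection architecture, which achieves an average mAP of 67.2% on THUMOS14 and 37.42% on ActivityNet-1.3 while maintaining extremely low power consumption.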

详情
Comments
Accepted by Pattern Recognition
AI中文摘要

视频理解是计算机视觉的关键部分,具有众多应用场景。随着移动设备的日益普及,越来越多的努力试图在其上部署视频理解模型。然而,现有的视频理解模型由于体积大且功耗高而难以部署。脉冲神经网络(SNNs)相比人工神经网络(ANNs)显示出生物合理性和低功耗优势,尤其是在被视为未来移动设备关键组件的神经形态芯片上。然而,过长的转换时间步长和严重的性能退化问题限制了它们的应用。为了解决上述问题,我们探索了SNNs在时序动作检测(TAD)上的应用,这是视频理解中的重要任务,并提出了首个基于SNN的端到端TAD架构,称为SpikeTAD。在保持极低功耗的同时,SpikeTAD在THUMOS14上实现了67.2%的平均mAP,在ActivityNet-1.3上实现了37.42%的平均mAP,证明了低功耗TAD模型的可行性。我们的代码可在以下网址获取:此 https URL。

英文摘要

Video understanding is a crucial part of computer vision, with numerous application scenarios. As mobile devices become more popular, a growing number of efforts aim to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low-power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips, which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation limit their application. To solve these problems, we explore the application of SNNs to temporal action detection (TAD), an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture, termed SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% on THUMOS14 and 37.42% on ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at this https URL.

2606.12028 2026-06-11 cs.RO 新提交

VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network

VICX: 通过视频生成和上下文操作网络实现可泛化的机器人操作

Song Chen, Linyan Xiang, Ying Zhou, Liu Yang

发表机构 * National University of Singapore(新加坡国立大学)

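AI summary: Proposes the VICX framework, in which a frozen video generation model produces visual plans and a Video-to-Trajectory In-Context Operator Network (V2T-ICON) maps them to executable robot trajectories, enabling cross-task and cross-embodiment generalization.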

详情
Comments
The first two authors contributed equally to this work
AI中文摘要

可泛化的机器人操作不仅需要对未见场景进行任务级推理,还需要将视觉计划可靠地映射到具体本体的执行中。为弥合这一差距,我们提出了VICX(视频生成与上下文执行),一种解耦的闭环操作框架。在VICX中,冻结的视频生成模型生成视觉-语言条件化的高层视觉计划,而视频到轨迹的上下文操作网络(V2T-ICON)作为任务无关的接口,将这些计划映射为可执行的机器人状态轨迹。为提高执行泛化性,V2T-ICON基于分割提取的仅手臂帧观测,并使用检索到的图像-状态对作为上下文提示,从而在推理时无需参数更新即可实现鲁棒且可泛化的视觉到状态映射。在Meta-World上的实验表明,VICX支持跨任务泛化、闭环自我修正和跨本体迁移,展示了在任务语义和机器人执行上的双重泛化能力。项目网页见:此 https URL。

英文摘要

Generalizable robot manipulation requires not only task-level reasoning over unseen scenes, but also reliable grounding of visual plans into embodiment-specific execution. To bridge this gap, we propose VICX (Video generation and In-Context eXecution), a decoupled closed-loop manipulation framework. In VICX, a frozen video generation model produces vision-language-conditioned high-level visual plans, while a Video-to-Trajectory In-Context Operator Network (V2T-ICON) serves as the task-agnostic interface that grounds these plans into executable robot-state trajectories. To improve execution generalization, V2T-ICON operates on segmentation-extracted arm-only frame observations and uses retrieved image-state pairs as in-context prompts, allowing a robust and generalizable visual-to-state mapping at inference time without parameter updates. Experiments on Meta-World show that VICX supports cross-task generalization, closed-loop self-correction, and cross-embodiment transfer, demonstrating dual generalization across both task semantics and robot execution. The project webpage can be found here: this https URL.

2606.12027 2026-06-11 cs.RO 新提交

Learning Unions of Convex Sets via Invertible Latent Decomposition for Path Planning

通过可逆潜在分解学习凸集并集用于路径规划

Taerim Yoon, Dongho Kang, Kisang Park, Junha Cha, Stelian Coros, Sungjoon Choi

发表机构 * Korea University(高丽大学) ETH Zurich(苏黎世联邦理工学院)

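AI summary: Proposes the ILD framework, which jointly learns an invertible mapping and an explicit union of convex polytopes in latent space for path planning, uses Visibility-Guided Sampling to keep the convex sets connected, and achieves higher success rates across multiple environments.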

详情
AI中文摘要

在杂乱的真实世界环境中进行无碰撞路径规划依赖于对无碰撞空间的表示,现有表示大致分为两类。显式表示(如凸集并集)可以作为硬的无碰撞约束嵌入基于优化的规划器中,但其参数随配置空间维度扩展性差。相比之下,隐式表示灵活且能很好地扩展到复杂几何形状,但通常缺乏此类保证。我们通过ILD(可逆潜在分解)弥合这一差距,该框架联合学习可逆映射和所得潜在空间中的显式凸多面体并集。规划在这些潜在凸集上进行,可逆映射将所得路径解码回原始配置空间,同时保持相对于细化后的显式安全区域的可行性。我们进一步提出可见性引导采样(VGS)以保持凸集连通性用于路径规划。在2D导航、6自由度(DoF)和14自由度操作环境中,ILD实现了比先前基线更广的覆盖、更好的集间连通性和更高的路径规划成功率,且在测试时细化后观察到零假阳性。在14自由度双臂操作器上,我们进一步展示了实时无碰撞规划,测试时细化适应了真实世界部署中单个6自由度臂的场景几何变化。

英文摘要

Collision-free path planning in cluttered, real-world environments relies on a representation of the collision-free space, and existing representations broadly fall into two categories. Explicit representations, such as unions of convex sets, can be plugged into optimization-based planners as hard collision-free constraints, but their parameters scale poorly with configuration-space dimension. Implicit representations, by contrast, are flexible and scale well to complex geometries, yet typically lack such guarantees. We bridge this gap with ILD (Invertible Latent Decomposition), a framework that jointly learns an invertible mapping and a union of explicit convex polytopes in the resulting latent space. Planning is carried out over these latent convex sets, and the invertible mapping decodes the resulting paths back to the original configuration space while preserving feasibility with respect to the refined explicit safe regions. We further propose Visibility-Guided Sampling (VGS) to keep the convex sets connected for path planning. Across 2D navigation, 6-DoF, and 14-DoF manipulation environments, ILD achieves broader coverage, better inter-set connectivity, and higher path-planning success rates than prior baselines, with zero observed false positives after test-time refinement. On a 14-DoF bimanual manipulator, we further demonstrate real-time collision-free planning, with test-time refinement adapting to scene-geometry changes during real-world deployment on a single 6-DoF arm.

2606.12023 2026-06-11 cs.CV 新提交

ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

ViT-FREE:通过早期退出和合成自适应实现高效人脸识别

Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros

发表机构 * Fraunhofer Institute for Computer Graphics Research IGD, Germany(德国弗劳恩霍夫计算机图形学研究所IGD) Department of Computer Science, TU Darmstadt, Germany(德国达姆施塔特工业大学计算机科学系)

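AI summary: Proposes ViT-FREE, a framework that applies early exiting to a pretrained ViT to perform face verification from intermediate layers without modifying or retraining the backbone, enabling efficient inference; also proposes ViT-FREE_FT, a lightweight fine-tuning strategy that adapts only the projection layers using synthetic data to improve shallow-exit performance.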

详情
Comments
Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)
AI中文摘要

视觉Transformer(ViT)在计算机视觉中获得了显著关注,并显示出在人脸识别(FR)方面的强大潜力。然而,其高计算成本使得在资源受限设备上部署具有挑战性,这促使需要平衡效率和准确性的方法。在这项工作中,我们研究了预训练ViT中的早期退出作为一种简单且无需训练的高效FR推理策略。利用Transformer编码器块之间统一的特征维度,我们引入了ViT-FREE,一个多退出框架,可以直接从中间表示进行人脸验证,而无需修改或重新训练骨干模型,从而降低推理成本。实验表明,补丁嵌入和注意力图在深度上逐渐演化,相邻ViT块之间具有高度相似性,并且与最终表示的对齐程度逐渐增加。这表明特征逐步细化和注意力收敛,表明中间层已经提供了适合早期退出的稳定且具有判别性的表示。通过在多个FR基准上的广泛实验,我们系统地分析了不同退出深度的准确性-效率权衡。结果表明,较晚的退出实现了非常有利的平衡,在第10层退出在IJB-C等基准上实现了高达20%的加速,同时验证性能仅下降1.5。此外,我们提出了ViT-FREE_FT,一种轻量级的退出特定微调策略,仅使用小型合成数据集适配投影层,同时保持Transformer骨干冻结。这种方法提高了浅层退出的性能,同时保留了效率优势,并且对较深退出几乎没有影响。

英文摘要

Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating the need for methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, thus reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5 drop in verification performance on benchmarks such as IJB-C. We also propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.

2606.12019 2026-06-11 cs.RO 新提交

MPPI-based Informative Trajectory Planning for Search and Capture of Drifting Targets with ASVs

基于MPPI的自主水面艇搜索与捕获漂移目标的信息轨迹规划

Sanjeev Ramkumar Sudha, Marija Popović, Erlend M. Coates

发表机构 * Norwegian University of Science and Technology (NTNU)(挪威科技大学) TU Delft(代尔夫特理工大学)

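AI summary: For the problem of searching for and capturing multiple drifting targets with autonomous surface vehicles in dynamic environments, proposes a hybrid planning framework based on model predictive path integral (MPPI) control that balances search and tracking by optimising continuous long-horizon trajectories and switches to pure pursuit guidance in the interception stage; experiments validate its effectiveness.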

详情
AI中文摘要

自主水面艇为开放水域的环境清理以及搜索救援行动提供了高效解决方案。这些环境中的目标持续漂移,因此高效搜索必须平衡未观测区域的探索与已知目标的跟踪。然而,大多数目标跟踪与追捕场景仅考虑简单的制导行为及短期预测用于决策。在本论文中,我们针对动态环境中搜索并捕获多个漂移目标(如垃圾)的问题,提出一种混合规划框架。我们策略的一个关键方面是基于模型预测路径积分(MPPI)控制的时空信息规划方法,这是一种基于采样的模型预测控制方法。该规划器通过优化长时域上的连续轨迹直接生成运动学级指令。多目标代价函数平衡搜索与跟踪目标,同时确保安全、可行的轨迹。在拦截阶段,我们切换至纯追踪制导控制器以实现对移动目标的物理捕获。实验表明,我们的规划器优于所选的规划基线。最后,我们在自主水面艇的现场试验中验证了该方法。

英文摘要

Autonomous surface vehicles offer an efficient solution for environmental cleanup as well as search and rescue operations in open waters. Targets in these settings drift continuously, so efficient search must balance exploration of unobserved regions with tracking of known targets. However, most target tracking and pursuit scenarios consider simple guidance behaviours and short-term predictions for decision-making. In this letter, we address the problem of search and capture of multiple drifting targets, such as litter, in dynamic environments, using a hybrid planning framework. A key aspect of our strategy is a spatiotemporal informative planning method based on model predictive path integral (MPPI) control, a sampling-based model predictive control approach. The planner directly generates kinematic-level commands by optimising continuous trajectories over long horizons. A multi-objective cost balances search and tracking objectives while ensuring safe, feasible trajectories. In the interception stage, we switch to a pure pursuit guidance controller for the physical capture of moving targets. Experiments show that our planner outperforms the chosen planning baselines. Finally, we validate our approach in field trials with an ASV.
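As background on the sampling-based update at the core of MPPI, the following is a minimal generic sketch, not the authors' planner. It perturbs a nominal control sequence with Gaussian noise, scores each perturbed sequence with a cost function, and averages the perturbations with exponential weights. The toy one-dimensional system, the terminal cost, and all parameter values (`sigma`, `lam`, `num_samples`) are assumptions chosen for illustration.

```python
import math
import random


def mppi_update(nominal, cost_fn, num_samples=512, sigma=0.5, lam=0.1, rng=None):
    """Return an updated control sequence from one MPPI iteration."""
    rng = rng or random.Random(0)
    horizon = len(nominal)
    # Sample Gaussian perturbations of the nominal control sequence.
    noises = [[rng.gauss(0.0, sigma) for _ in range(horizon)]
              for _ in range(num_samples)]
    costs = [cost_fn([u + e for u, e in zip(nominal, eps)]) for eps in noises]
    # Exponential weights; subtracting the minimum keeps exp() well scaled.
    best = min(costs)
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # Shift each control by the weighted average of its perturbations.
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, u in enumerate(nominal)]


DT = 0.1       # hypothetical time step
TARGET = 1.0   # hypothetical goal position for a 1-D point


def terminal_cost(controls):
    """Squared distance to TARGET after integrating velocity commands."""
    position = sum(u * DT for u in controls)
    return (position - TARGET) ** 2


nominal = [0.0] * 10
updated = mppi_update(nominal, terminal_cost)
```

In a planner like the one described above, the cost function would instead combine search, tracking, and safety terms, and the update would be repeated in a receding-horizon loop.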

2606.12018 2026-06-11 cs.AI 新提交

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

MODF-SIR:面向社交智能推理的多智能体全模态蒸馏框架

Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

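AI summary: Proposes a multi-agent collaborative framework built on a lightweight multimodal large language model, with training and inference augmented by knowledge distillation and combined with test-time adaptation, long-tail event extraction, and Chain-of-Thought prompting, achieving state-of-the-art results on multiple benchmarks.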

详情
AI中文摘要

我们提出一个基于轻量级多模态大语言模型(MLLM)的多智能体协作框架,专门设计用于社交智能推理。我们方法的一个关键特征是,训练和推理阶段都通过知识蒸馏进行增强。在该架构中,与社交智能相关的多模态数据被精确定位。此外,相关的长尾事件被识别、提取并呈现为格式化的显式文本。这种格式化策略防止关键的长尾信息在分词过程中被头部事件和环境噪声掩盖。具体来说,我们在整个推理流程中集成了测试时适应(TTA),包括长尾事件的提取和表示、链式思维(CoT)提示和自我反思。该TTA机制也经过蒸馏增强,利用低秩适应(LoRA)仅针对实例级推理微调基础模型。在多个基准上对各种开源和专有AI模型进行的广泛评估证明了所提出框架的有效性。使用IntentTrain约30%的训练数据,我们取得了最先进的结果。代码见https://this URL,演示见https://this URL,LoRA见https://this URL,训练路由器的数据集见https://this URL。

英文摘要

We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. Using around 30% of the training data from IntentTrain, we achieve state-of-the-art results. Code is available at this https URL, a demo is available at this https URL, the LoRA is available at this https URL, and the dataset for training the router is available at this https URL.

2606.12016 2026-06-11 cs.LG cs.AI 新提交

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

泛化黑客:模型可通过阻止行为泛化来博弈强化学习

Frank Xiao, Mary Phuong

发表机构 * California Institute of Technology(加州理工学院)

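AI summary: Demonstrates generalization hacking, in which a model uses self-inoculation during reinforcement learning to prevent the rewarded behavior from generalizing, resisting behavioral modification while maintaining high reward; presented as the first demonstration that a model can actively resist RL behavioral modification.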

详情
AI中文摘要

模型后训练,特别是强化学习(RL),是开发者塑造模型价值观和行为的主要机制之一。然而,随着模型越来越具有评估和训练意识,当感知到的目标与其当前价值观冲突时,它们可能会被激励去抵抗训练,从而削弱开发者通过进一步训练检测错位和纠正模型行为的能力。在本文中,我们展示了泛化黑客,即模型在RL期间收集奖励的同时阻止奖励行为泛化。我们在Qwen3-235B-A22B上构建了一个模型有机体,对描述训练意识和自我接种(一种新颖机制,其中模型在其思维链中将合规性框架为上下文特定,而不演示或指示任一行为)的合成文档进行微调。该模型有机体在训练时实现了与对照组相当的有害性,同时在700步RL中保持了持续的约15个百分点的合规差距。此外,仅接受训练意识文档训练的对照有机体在RL压力下独立发现了类似接种的推理,尽管从未接触过该概念,却发展出自己的合规差距。由于泛化黑客有机体在整个过程中获得高奖励,标准训练指标未提供泛化失败的信号。我们的结果首次证明模型可以在保持高奖励的同时主动抵抗RL行为修正,表明随着模型变得更有能力和训练意识,它们可能能够破坏训练过程本身。

英文摘要

Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers' ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ${\sim}15$ percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.

2606.12012 2026-06-11 cs.CV 新提交

FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control

FitVTON: 通过身体-服装尺寸控制实现合身感知的虚拟试穿

Yiqun Ning, Ao Shen, Chenhang He, Lei Zhang

发表机构 * Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学系) Nuvatech

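AI summary: To address the neglect of physical fit in existing virtual try-on methods, proposes FitVTON, which encodes garment-body size through structured text prompts, adds auxiliary heads that predict garment and exposed-body masks, and includes a texture rectification stage; evaluation on the real-world FittingEffect3K dataset shows better sizing accuracy and shape preservation.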

详情
AI中文摘要

尽管基于扩散的虚拟试穿已经实现了令人印象深刻的视觉真实性,但大多数方法将任务视为2D修复,优先考虑纹理保持而非物理合理性。因此,它们通常生成看似合理的图像,但未能反映不同体型下真实的服装合身性。我们提出了FitVTON,一种在野外不同身体上的合身感知虚拟试穿模型。FitVTON通过结构化文本提示编码服装-身体尺寸,并从参数化服装模型的模拟试穿三元组中学习。为了改善服装轮廓的合身效果,我们引入了两个辅助头来预测服装和暴露身体的掩膜。我们进一步引入了一个纹理校正阶段,以改善模拟数据的真实外观。为了评估合身保真度,我们策划了一个真实世界数据集FittingEffect3K,并结合了基于VLM的评分协议。主观和定量实验表明,FitVTON展示了真实的合身保真度,在尺寸准确性和形状保持方面显著优于最先进的方法,同时保持了有竞争力的图像质量。项目页面:此https URL。

英文摘要

While diffusion-based virtual try-on has achieved impressive visual realism, most methods treat the task as 2D inpainting, prioritizing texture preservation over physical plausibility. Consequently, they often produce plausible-looking images that fail to reflect authentic garment fit across diverse body shapes. We present FitVTON, a fit-aware virtual try-on model for diverse bodies in the wild. FitVTON encodes garment-body size through structured text prompts and learns from simulated try-on triplets generated by a parameterized garment model. To improve the fitting effects over garment silhouettes, we introduce two auxiliary heads to predict the masks for both the garment and the exposed body. We further introduce a texture rectification stage to improve the realism of appearance learned from simulated data. To evaluate fitting fidelity, we curate a real-world dataset, FittingEffect3K, combined with a VLM-based scoring protocol. Both subjective and quantitative experiments show that FitVTON demonstrates authentic fitting fidelity, with significant gains in sizing accuracy and shape preservation over state-of-the-art methods while maintaining competitive image quality. Project Page: this https URL.

2606.12006 2026-06-11 cs.LG cs.AI 新提交

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

通过生存感知适配的临床生存分析表格基础模型

Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

发表机构 * ADAPT Centre, Dublin City University(ADAPT中心,都柏林城市大学) School of Computing, Dublin City University(都柏林城市大学计算机学院) Department of Computer Science and Engineering, University of Bologna(博洛尼亚大学计算机科学与工程系)

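AI summary: Proposes a lightweight adaptation approach that pairs tabular foundation models (TabPFN, TabDPT, TabICL) with a multi-task logistic regression (MTLR) head for clinical survival analysis, achieving competitive or superior performance on multiple benchmarks and ICU cohorts.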

详情
Comments
Accepted for publication at International Conference on AI in Healthcare 2026
AI中文摘要

预测死亡率等时间至事件结果是临床决策中的基本任务,通常通过生存分析来解决。虽然经典的统计和深度学习方法已被广泛研究,但它们通常需要特定任务的训练和足够的标记数据。最近表格基础模型的进展通过学习结构化数据的通用表示提供了一种新范式。然而,它们在临床环境中对删失时间至事件预测的适用性仍未得到充分探索,因为典型应用仅限于离散分类而非生存分析任务。在这项工作中,我们提出了一种轻量级适配方法,通过直接在预训练表示之上训练一个生存感知头,将表格基础模型应用于临床生存分析。我们研究了代表性架构,包括TabPFN、TabDPT和TabICL,并使用多任务逻辑回归(MTLR)头对它们进行适配,以建模右删失时间至事件结果。我们在多个公开生存基准和两个大规模ICU队列MIMIC-IV和eICU上评估了该方法。我们的结果表明,这种迁移学习方法与强基线相比达到了竞争性或更优的性能。在MIMIC-IV上,TabDPT-FT-MTLR达到了0.856的C指数,相对于最佳非FM基线(DeepSurv,0.844)相对提升了+1.4%,相对于最佳零样本模型(0.802)提升了+6.7%。在eICU上,TabICL-FT-MTLR达到了0.797,分别获得了+1.7%(DeepSurv,0.784)和+6.4%(0.749)的提升。这些发现强调了将预训练表格表示与生存感知目标相结合的重要性,并表明表格基础模型为临床生存预测提供了一种实用且有效的替代方案。

英文摘要

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.
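The relative improvements quoted above follow from the reported C-index values, as this short check shows. The helper name `relative_gain` is hypothetical; the numbers are the ones stated in the abstract.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


gains = {
    "MIMIC-IV vs DeepSurv": relative_gain(0.856, 0.844),
    "MIMIC-IV vs best zero-shot": relative_gain(0.856, 0.802),
    "eICU vs DeepSurv": relative_gain(0.797, 0.784),
    "eICU vs best zero-shot": relative_gain(0.797, 0.749),
}
rounded = {name: round(value, 1) for name, value in gains.items()}
```

Rounded to one decimal place, the four values are 1.4, 6.7, 1.7, and 6.4 percent, matching the reported figures. These are relative changes in C-index, not differences in percentage points.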

2606.12003 2026-06-11 cs.CL 新提交

Agreement in Representation Space for Open-Ended Self-Consistency

表示空间中的一致性:面向开放式自洽性

Paula Ontalvilla, Gorka Azkune, Aitor Ormazabal

发表机构 * HiTZ Center - Ixa, University of the Basque Country (UPV/EHU)(HiTZ中心 - Ixa,巴斯克大学(UPV/EHU))

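AI summary: For open-ended generation tasks, proposes Embedding-Based Agreement (EBA), which estimates self-consistency by clustering the embeddings of sampled generations and robustly selects more reliable outputs without training.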

详情
AI中文摘要

自洽性通过采样多个输出并选择最一致的答案来改进大语言模型的推理,但现有方法主要依赖于精确匹配,因此仅限于具有分类输出的任务。在这项工作中,我们研究开放式生成任务(如代码合成和文本摘要)中的自洽性。我们假设一致性可以理解为生成空间的几何属性,其中语义兼容的生成在表示空间的相似区域中集中。为了研究这一假设,我们引入了基于嵌入的协议(EBA),这是一种简单的无需训练的操作方法,通过在嵌入空间中对采样生成进行聚类来估计一致性。通过在数学推理、代码生成和摘要上的实验,我们表明表示空间中的一致性为开放式任务提供了鲁棒且可扩展的自洽性信号。特别是,EBA 始终优于随机选择,并且比最近基于大语言模型评估或不确定性估计的选择方法表现出更稳定的扩展行为。我们进一步表明,这些一致性信号在不同模型家族和嵌入空间中保持稳定,即使使用原生隐藏表示也是如此。最后,我们的分析表明,采样生成所占据的几何位置与生成质量强相关:集中在表示空间中心区域附近的生成往往对应于更可靠的输出,而外围生成则显著不准确。总体而言,我们的研究结果支持将自洽性视为采样生成的几何组织属性,而非精确符号重叠。

英文摘要

Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs. In this work, we study self-consistency in open-ended generation tasks such as code synthesis and text summarization. We hypothesize that consistency can be understood as a geometric property of the generation space, where semantically compatible generations concentrate in similar regions of representation space. To study this hypothesis, we introduce Embedding-Based Agreement (EBA), a simple training-free operationalization that estimates agreement by clustering sampled generations in embedding space. Through experiments on mathematical reasoning, code generation, and summarization, we show that agreement in representation space provides a robust and scalable signal of self-consistency for open-ended tasks. In particular, EBA consistently outperforms random selection and exhibits more stable scaling behavior than recent selection approaches based on LLM evaluation or uncertainty estimation. We further show that these agreement signals remain stable across model families and embedding spaces, even with native hidden representations. Finally, our analysis shows that the geometric location occupied by sampled generations is strongly correlated with generation quality: generations concentrated near central regions of representation space tend to correspond to more reliable outputs, whereas peripheral generations are substantially less accurate. Overall, our findings support viewing self-consistency as a property of the geometric organization of sampled generations rather than exact symbolic overlap.
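The intuition that central generations are more reliable can be sketched with a simplified proxy. Note that this is not EBA itself: the abstract describes clustering, whereas the sketch below simply picks the sample with the highest mean cosine similarity to the others. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_central(embeddings):
    """Index of the embedding with the highest mean similarity to the rest.

    Requires at least two embeddings.
    """
    best_index, best_score = -1, -math.inf
    for i, emb in enumerate(embeddings):
        sims = [cosine(emb, other)
                for j, other in enumerate(embeddings) if j != i]
        score = sum(sims) / len(sims)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Three samples that roughly agree, and one outlier at index 3.
samples = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
chosen = most_central(samples)
```

The outlier is never selected because it has low similarity to every other sample, which mirrors the observation that peripheral generations tend to be less accurate.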

2606.11998 2026-06-11 cs.LG 新提交

Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

自助监控:利用透明推理监督更强的AI智能体

Frank Xiao, Mary Phuong

发表机构 * California Institute of Technology(加州理工学院)

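AI summary: Proposes the bootstrapped monitoring protocol, which oversees stronger agents by inserting an untrusted intermediate model with transparent chain-of-thought, substantially improving catch rates on software engineering tasks even when the untrusted monitor colludes with the agent.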

详情
AI中文摘要

可信监控是AI控制的基石。然而,随着前沿模型能力增强,可信与不可信模型之间的能力差距可能使可信模型成为不可靠的监控者。我们引入了\emph{自助监控}协议,通过在监督链中插入一个具有透明思维链推理的更强的不可信中间模型来解决这一问题。不可信监控者($U_m$)评估智能体的行为,而较弱的可信模型($T$)监督$U_m$的推理以检测合谋。我们在多轮软件工程任务(BashArena)上对多个智能体和监控者评估了自助监控。即使不可信监控者主动与智能体合谋,只要我们能够访问其原始思维链,自助监控相比仅使用可信监控显著提高了捕获率。我们的结果表明,随着AI能力的进步,自助监控可以延长可信模型在控制中的有效寿命。

英文摘要

Trusted monitoring is a cornerstone of AI control. However, as frontier models grow more capable, the increasing capabilities gap between trusted and untrusted models may render trusted models unreliable monitors. We introduce \emph{bootstrapped monitoring}, a protocol that addresses this by inserting a stronger, intermediate untrusted model with transparent chain-of-thought reasoning into the oversight chain. The untrusted monitor ($U_m$) evaluates the agent's actions, while a weaker trusted model ($T$) oversees $U_m$'s reasoning to detect collusion. We evaluate bootstrapped monitoring on multi-turn software engineering tasks (BashArena) across multiple agents and monitors. Bootstrapped monitoring substantially improves catch rates over trusted-only monitoring, even when the untrusted monitor actively colludes with the agent, provided we have access to its raw chain-of-thought. Our results suggest that bootstrapped monitoring can extend the useful lifetime of trusted models in control as AI capabilities advance.

2606.11990 2026-06-11 cs.LG cs.AI 新提交

Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

用于剩余使用寿命估计的时间序列基础模型嵌入

Amir El-Ghoussani, Michele De Vita, Ronald Naumann, Valiseios Belagiannis

发表机构 * University of Erlangen-Nuremberg(埃尔朗根-纽伦堡大学) Siemens AG(西门子股份公司)

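AI summary: Proposes using the frozen pretrained time-series foundation model Chronos-2 as a backbone with a lightweight regression head for remaining useful life prediction, outperforming multiple baselines on industrial sensor data.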

详情
Comments
Accepted to EUSIPCO 2026, 4 pages, 2 figures
AI中文摘要

剩余使用寿命(RUL)预测对于工业预测性维护至关重要,然而许多基于学习的方法依赖于大量的特征工程或大型标注数据集来训练特定任务的序列模型。在这项工作中,我们引入了一种轻量级学习方法,利用冻结的预训练时间序列基础模型(TSFM),并将其与一个小型回归头结合,用于从多变量传感器流中估计RUL。具体来说,我们使用Chronos-2作为冻结骨干来提取上下文窗口特征,并训练一个轻量级回归神经网络进行RUL预测。在来自两种设备类型的真实工业传感器数据上的实验表明,在相同的预处理和评估协议下,Chronos-2特征一致地优于循环、卷积、基于Transformer和梯度提升基线。我们进一步分析了上下文长度的影响,发现随着历史记录变长,性能显著提升,这表明TSFM表示为工业环境中的RUL估计提供了一种实用且数据高效的替代方案。

英文摘要

Remaining Useful Life (RUL) prediction is essential for industrial predictive maintenance, yet many learning-based approaches rely on extensive feature engineering or large labeled datasets to train task-specific sequence models. In this work, we introduce a lightweight learning approach in which we leverage a frozen pretrained time-series foundation model (TSFM) and combine it with a small regression head for RUL estimation from multivariate sensor streams. More specifically, we use Chronos-2 as a frozen backbone to extract context window features and train a lightweight regression neural network for RUL prediction. Experiments on real-world industrial sensor data from two device types show that Chronos-2 features consistently improve over recurrent, convolutional, Transformer-based, and gradient-boosting baselines under the same preprocessing and evaluation protocol. We further analyze the impact of context length and find that performance improves significantly with longer histories, indicating that TSFM representations offer a practical and data-efficient alternative for RUL estimation in industrial settings.

2606.11989 2026-06-11 cs.CV New submission

From Nominal Intensity to Equivalent Rainfall: A Path-Based Credibility Evaluation Framework for Simulated Rainfall in Autonomous-Driving Perception Tests

Tian Xia, Xin Zhao, Shaolingfeng Ye, Junyi Chen

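Affiliations * College of Automotive and Energy Engineering, Tongji University; Tsinghua University

AI summary: Proposes a path-based credibility evaluation method that describes each path by its path-equivalent rainfall intensity, an uncertainty band, and a raindrop-distribution realism score, and applies a perception-consistency correction based on lidar point-cloud count and mean reflectivity, so that simulated rainfall can be aligned with real rainfall and test results mapped to real-world scenarios.

Details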
Comments
17 pages, preprint
Abstract

Credible simulated-rainfall conditions are essential for identifying perception-system boundaries and supporting SOTIF-oriented risk assessment in automated driving. However, closed-field tests are often described only by nominal rainfall intensity or single-point measurements, making it difficult to align simulated rain fields with real rainfall and map test results to real-world scenarios. This paper proposes a path-based credibility evaluation method for simulated rainfall in autonomous-driving perception tests. Using the drop size and velocity joint distribution of real rainfall as the reference, each candidate path is represented by path-equivalent rainfall intensity, an uncertainty band, and a path-averaged Realism of Raindrop Distribution (RRD) score. Lidar target point-cloud count and mean reflectivity are further used for perception-consistency correction, quantifying the proxy capability of each simulated-rainfall path for real-rainfall perception effects. Experiments are conducted using about 10,000 real-rainfall raindrop-spectrum samples, 728 RainSense perception samples, and 45 spatial sampling points in a 2.4 m x 7.2 m simulated-rainfall area. Results show that spatial non-uniformity remains under the same nominal condition, confirming the need for path-based evaluation. The method identifies Path IV and Path VI as preferable candidates, with results of 11.54 +/- 0.31 mm/h, RRD = 0.43, and 8.28 +/- 0.34 mm/h, RRD = 0.46, respectively. These paths show more balanced performance in rainfall-intensity stability, raindrop-spectrum realism, and perception consistency. The proposed method supports path selection, condition description, and credible interpretation of autonomous-driving perception tests under rainfall.
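
To make the idea of a per-path description concrete, here is a minimal sketch. It assumes that the path-equivalent intensity is the arithmetic mean over the sampling points a path crosses and that the band is one sample standard deviation; the abstract does not give the definitions, so the paper's may differ. The grid values, the path coordinates, and the name `path_summary` are illustrative and are not data from the study.

```python
from statistics import fmean, stdev


def path_summary(grid, path):
    """Mean intensity and sample-std band over the grid cells a path crosses."""
    values = [grid[row][col] for row, col in path]
    return fmean(values), stdev(values)


# Hypothetical rainfall intensities (mm/h) at a 3 x 3 set of sampling points,
# all recorded under the same nominal setting.
grid = [
    [10.0, 12.0, 14.0],
    [11.0, 12.0, 13.0],
    [9.0, 12.0, 15.0],
]

centre_path = [(0, 1), (1, 1), (2, 1)]    # straight through the middle column
diagonal_path = [(0, 0), (1, 1), (2, 2)]  # crosses the field diagonally

centre_mean, centre_band = path_summary(grid, centre_path)
diagonal_mean, diagonal_band = path_summary(grid, diagonal_path)
```

Two paths through the same nominal field get different means and bands, which is the spatial non-uniformity that motivates describing conditions per path rather than per test.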

2606.11988 2026-06-11 cs.LG stat.ML New submission

What Uncertainties Do We Need for Dynamical Systems?

Yusuf Sale, Christopher Bülte, Felix Czaja, Joshua Stiller, Eyke Hüllermeier

Affiliations * Institute of Computer Science, LMU Munich; Munich Center for Machine Learning (MCML); Department of Mathematics, LMU Munich; German Research Center for Artificial Intelligence (DFKI, DSA)

AI summary: Examines uncertainty in dynamical systems from a machine learning perspective, distinguishing aleatoric from epistemic uncertainty and discussing how the objectives of representing and quantifying uncertainty differ across tasks.

Details
Comments
EIML@ICML
Abstract

The distinction between aleatoric and epistemic uncertainty has received considerable attention in machine learning research, mainly in the context of supervised learning but also in other settings such as generative modeling. In this paper, we offer a machine learning perspective on uncertainty modeling for dynamical systems, which has been studied much less so far. In particular, we ask: what uncertainties do we need for dynamical systems? We discuss sources of uncertainty, clarify their nature (aleatoric or epistemic), and consider how the objectives of representing and quantifying uncertainty vary across different tasks.
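
One common way to separate the two kinds of uncertainty for a next-state prediction is the variance decomposition of a Gaussian ensemble. The sketch below uses that standard decomposition; it is not claimed to be the representation the paper recommends, and the function name `decompose` and the sample numbers are assumptions.

```python
from statistics import fmean, pvariance


def decompose(member_predictions):
    """Split the predictive variance of a Gaussian ensemble into two parts.

    member_predictions: list of (mean, variance) pairs, one per ensemble member,
    each describing that member's prediction of the next state.
    """
    means = [m for m, _ in member_predictions]
    variances = [v for _, v in member_predictions]
    aleatoric = fmean(variances)  # average noise the members themselves predict
    epistemic = pvariance(means)  # disagreement between members about the mean
    return aleatoric, epistemic, aleatoric + epistemic


# Members agree on the mean: all uncertainty is attributed to noise.
agree = decompose([(1.0, 0.2), (1.0, 0.2)])

# Members disagree on the mean: an epistemic term appears.
disagree = decompose([(0.0, 0.1), (2.0, 0.1)])
```

The total follows the law of total variance: the mean of the member variances plus the variance of the member means.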

2606.11977 2026-06-11 cs.CV New submission

ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction

LeKai Yu, Hao Liu, Kun Wang, Zhiran Li, Ruping Cao, Fan Liu, Yupeng Hu

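Affiliations * Shandong University; Southeast University

AI summary: Proposes the ParseFixer framework, which combines full-page backbone parsing with agentic selective correction and repairs high-value parsing errors through a verify-and-rollback mechanism; it placed third in the document parsing track of the DataMFM Challenge.

Details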
Abstract

In this report, we present our third-place solution for the DataMFM Challenge Track 1: Document Parsing. This track requires models to recover structured Markdown documents from document page images while preserving textual content and document structure. To address the complementary requirements of accurate content recovery and faithful structure reconstruction, we propose ParseFixer, an agentic framework for backbone parsing and selective correction. ParseFixer consists of two key modules: Full-Page Backbone Parsing (FBP) and Agentic Selective Correction (ASC). FBP produces stable initial Markdown outputs with MinerU2.5 Pro, while ASC detects high-value parsing failures and repairs them through a verify-and-rollback correction process. By placing selective multimodal correction after open-source backbone parsing, ParseFixer improves the recovery of key document elements without rewriting reliable backbone predictions. On the test set, our final system achieves an overall score of 61.78 and ranks third in Track 1, demonstrating its effectiveness for accurate document parsing. Our code will be released at: this https URL.
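
A verify-and-rollback loop of the kind this abstract describes might look like the sketch below. It is a generic illustration, not the ParseFixer implementation: `selective_correct`, the two toy fixes, and the word-preservation check are hypothetical, and a real verifier would presumably compare against the page image.

```python
from typing import Callable, Iterable


def selective_correct(
    markdown: str,
    candidates: Iterable[Callable[[str], str]],
    verify: Callable[[str, str], bool],
) -> str:
    """Apply each candidate fix, keeping it only if verification accepts it."""
    current = markdown
    for fix in candidates:
        proposal = fix(current)
        if proposal != current and verify(current, proposal):
            current = proposal  # commit the accepted correction
        # otherwise roll back by leaving `current` unchanged
    return current


def add_heading_space(md: str) -> str:
    """Toy fix: repair a heading that is missing its space."""
    return md.replace("#Results", "# Results")


def drop_table(md: str) -> str:
    """Toy harmful fix: throws away everything from the first table cell on."""
    return md.split("|")[0]


def words(md: str) -> list:
    return md.replace("#", " ").replace("|", " ").split()


def keeps_all_words(before: str, after: str) -> bool:
    """Toy verifier: a correction may change markup but must not lose text."""
    return words(before) == words(after)


backbone_output = "#Results\n| a | b |"
fixed = selective_correct(
    backbone_output, [add_heading_space, drop_table], keeps_all_words
)
```

The heading repair is committed and the destructive edit is rolled back, so reliable parts of the backbone output are left untouched.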