arXivDaily arXiv每日学术速递 周一至周五更新
重置
CS计算机1059
2606.12071 2026-06-11 cs.DL cs.AI 新提交

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

论LLM作为评审在科学新颖性评估中的局限性

Soumitra Sinhahajari, Navonil Majumder, Soujanya Poria

AI总结 本文通过构建RQ-Bench基准,发现LLM评审对模型生成的研究问题产生新颖性幻觉,而人类专家则持相反意见,揭示了LLM在评估科学新颖性时的可靠性问题。

详情
AI中文摘要

LLM越来越多地被用于生成和评判科学想法。这使得新颖性评估成为一个核心问题。完整想法的评估很困难,因为它通常需要判断方法、可行性及其经验前景。因此,我们研究一个更清晰的上游对象:研究问题(RQ)。RQ生成是科学构思的前提,并且RQ可以与真实论文中探讨的问题进行比较。我们引入了RQ-Bench,一个基于近期arXiv论文构建的基准。对于每篇论文,我们从其引用的背景、空白和贡献中重建作者锚定的RQ。这些RQ并非针对同一背景的唯一有效问题。它们是用于测试新颖性判断的作者锚定参考点。我们使用独立LLM评审、比较LLM评审和人类专家评估来评估模型生成的RQ。LLM评审一致地将模型生成的RQ评为高度新颖,产生新颖性幻觉;在比较评估中,这种偏好甚至更强。然而,领域专家得出相反结论,更偏好作者锚定的参考问题。我们进一步发现,许多生成的RQ狭窄或受限于来源,这是LLM评审通常忽略的维度,除非明确测试。总体而言,LLM评审与人类专家之间矛盾的新颖性评估引发了关于使用LLM评估研究问题科学新颖性可靠性的严重担忧。

英文摘要

LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.

2606.12070 2026-06-11 cs.RO 新提交

Fibration Trees: A Unified Approach to Multi-Robot Motion Planning

纤维树:多机器人运动规划的统一方法

Andreas Orthey, Florian T. Pokorny, Lydia E. Kavraki

发表机构 * Technical University of Berlin(柏林工业大学) KTH Royal Institute of Technology(瑞典皇家理工学院) Rice University and the Ken Kennedy Institute(莱斯大学和肯·肯尼迪研究所)

AI总结 提出纤维树统一框架,通过纤维化建模投影,结合优先序、并行分解和任务空间投影,并开发Fibration-RRT规划器,在高维多机器人运动规划中实现概率完备性。

详情
Comments
23 pages, 12 figures
AI中文摘要

状态空间投影与分解已成为解决高维多机器人运动规划问题中维度灾难的强大工具。然而,现有方法缺乏一个统一框架来无缝处理投影(优先序或任务空间)与分解(并行或解耦子空间)的组合。为填补这一空白,我们引入了纤维树,即以状态空间为节点、纤维化为边的树结构,其中纤维化将高维空间投影到低维(或简化)空间。通过将投影建模为纤维化,我们将顺序优先序、并行分解和任务空间投影统一在单一、连贯的形式体系下。在此基础上,我们开发了快速探索随机纤维树(Fibration-RRT)规划器,这是一种基于采样的运动规划器,它推广了商空间RRT(用于顺序优先序)和离散RRT(用于并行分解)的策略,同时允许包含任务空间投影。Fibration-RRT在用户定义的纤维树上运行,并被证明是概率完备的。为测试Fibration-RRT的通用性和效率,我们提供了开源实现,并在32个场景中进行了实验,使用了多达96自由度的多机器人团队。结果表明,Fibration-RRT通过利用用户定义的纤维树高效解决了高维问题,从而确立了纤维树作为多机器人运动规划的强大统一框架。

英文摘要

State space projections and decompositions have emerged as powerful tools to tackle the curse of dimensionality in high-dimensional, multi-robot motion planning problems. However, existing methods lack a unified framework which seamlessly handles combinations of projections (prioritization or task-space) and decompositions (parallel or decoupled subspaces). To fill this gap, we introduce fibration trees, which are trees consisting of state spaces as nodes and fibrations as edges, whereby a fibration models a projection from a higher-dimensional space to a lower-dimensional (or simplified) space. By modeling projections as fibrations, we unify sequential prioritization, parallel decomposition, and task-space projections under a single, coherent formalism. Building on this, we develop the rapidly-exploring random fibration trees (Fibration-RRT) planner, a sampling-based motion planner that generalizes strategies from quotient-space RRT (for sequential prioritizations) and discrete RRT (for parallel decompositions), while allowing the inclusion of task-space projections. Fibration-RRT operates on user-defined fibration trees and is proven to be probabilistically complete. To test the generality and efficiency of Fibration-RRT, we provide an open-source implementation and conduct experiments on 32 scenarios using multi robot teams with up to 96 degrees of freedom. Our results indicate that Fibration-RRT efficiently solves high-dimensional problems by exploiting user-defined fibration trees, thereby establishing fibration trees as a powerful, unified framework for multi-robot motion planning.

2606.12069 2026-06-11 cs.CV 新提交

Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

Tac-DINO:基于补丁对齐的视觉-触觉特征学习

Hong Li, Yankang Dong, Yue Xu, Yihan Tang, Mingzhu Li, Jiamin Qiu, Qihang Yao, Xing Zhu, Yujun Shen, Nan Xue, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出Tac-DINO方法,通过构建大规模触觉数据集和视觉-触觉全息匹配基准,利用补丁对齐学习局部到全局的视觉-触觉表征,性能优于无对齐方法。

详情
AI中文摘要

触觉是人类与环境交互的主要媒介。目前,触觉学习主要关注图像级预训练或对齐。然而,触觉信号对应局部物体接触,而尺度对齐和全息匹配的研究仍然有限,且缺乏合适的数据集和基准。为弥补这一差距,我们首先构建了一个数据采集系统,获取了大规模触觉数据集,包含来自505个真实物体的超过2万次触觉接触。基于该数据集,我们设计了一个视觉-触觉全息匹配基准,用于评估视觉-触觉局部到全局的对齐能力。然后,我们提出了视觉-触觉补丁对齐(VTPA)方法用于视觉-触觉表征学习。实验表明,这些方法超越了无对齐方法,并与全物体图像对齐的性能相当。

英文摘要

Touch is the primary medium through which humans interact with the environment. Currently, tactile learning mainly focuses on image-level pretraining or alignment. However, tactile signals correspond to local object contact, while research into scale alignment and holographic matching remains limited and proper datasets and benchmarks also lack. To bridge this gap, we first construct a data collection system to acquire a large-scale tactile dataset, with over 20 K tactile contacts from 505 real-world objects. Building on this dataset, we design a Vis-Tac Holographic Matching Benchmark to evaluate vision-tactile local-to-global alignment ability. Then we propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these exceed the performance of methods without alignment and align with whole-object images.

2606.12068 2026-06-11 cs.CL 新提交

StanceNakba Shared Task: Actor and Topic-Aware Stance Detection in Public Discourse

StanceNakba 共享任务:公共话语中基于行动者和主题的立场检测

Kholoud K. Aldous, Md Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, Kais Attia, Wajdi Zaghouani

AI总结 提出 StanceNakba 2026 共享任务,通过两个子任务(行动者级和跨主题立场检测)利用微调 Transformer 模型(如 MARBERT、AraBERT)在巴以冲突相关社交媒体数据上实现高 Macro F1 分数。

详情
Comments
11 Pages, 6 Tables
AI中文摘要

我们提出 StanceNakba 2026,这是一个关于巴以冲突相关极化社交媒体话语中立场检测的共享任务,作为 LREC-COLING 2026 上 Nakba-NLP 2026 的一部分组织。该任务引入两个子任务:子任务 A(行动者级立场检测),将英语社交媒体帖子分类为亲巴勒斯坦、亲以色列或中立;子任务 B(跨主题立场检测),识别阿拉伯语帖子中关于两个冲突相关主题(与以色列正常化以及约旦难民存在)的赞成、反对或中立立场。该任务基于一个包含 2,606 条社交媒体帖子的标注数据集。共有 7 个团队参加了子任务 A,6 个团队参加了子任务 B。参与系统主要微调了阿拉伯语和多语言基于 Transformer 的模型,包括 MARBERT、AraBERT 和 DeBERTa-v3 变体,多个团队采用了交叉验证、集成方法和主题条件架构。表现最佳的系统在子任务 A 上达到了 0.9620 的 Macro F1,在子任务 B 上达到了 0.8724,表明基于 Transformer 的方法对于冲突领域立场检测非常有效,同时突显了跨主题泛化和中立类别预测方面的持续挑战。

英文摘要

We present StanceNakba 2026, a shared task on stance detection in polarized social media discourse related to the Palestinian-Israeli conflict, organized as part of Nakba-NLP 2026 at LREC-COLING 2026. The task introduces two subtasks: Subtask A (Actor-Level Stance Detection), which classifies English social media posts as Pro-Palestine, Pro-Israel, or Neutral; and Subtask B (Cross-Topic Stance Detection), which identifies Favor, Against, or Neither stances in Arabic posts toward two conflict-related topics, normalization with Israel and refugee presence in Jordan. The task is grounded in an annotated dataset of 2,606 social media posts. A total of 7 teams participated in Subtask A and 6 teams in Subtask B. Participating systems primarily fine-tuned Arabic and multilingual transformer-based models, including MARBERT, AraBERT, and DeBERTa-v3 variants, with several teams employing cross-validation, ensemble methods, and topic-conditioned architectures. The best-performing systems achieved a Macro F1 of 0.9620 on Subtask A and 0.8724 on Subtask B, demonstrating that transformer-based approaches are highly effective for conflict-domain stance detection while highlighting persistent challenges in cross-topic generalization and neutral class prediction.

2606.12066 2026-06-11 cs.CV 新提交

Performance Analysis of YOLOv11 and YOLOv8 for Mixed Traffic Object Detection under Adverse Weather Conditions in Developing Countries

YOLOv11与YOLOv8在发展中国家恶劣天气下混合交通目标检测的性能分析

Quoc Thuan Nguyen, Ha Anh Vu, Ngo Dang Thanh Ngan, Minh Phuc Hoang Ngoc

AI总结 针对发展中国家恶劣天气下的混合交通场景,评估YOLOv11n与YOLOv8n在融合数据集上的性能,YOLOv11n在精度提升3.2%的同时计算量减少22%,实现精度与效率的优化平衡。

详情
AI中文摘要

在现代车辆系统中,恶劣条件下的鲁棒性能已成为自动驾驶的关键问题。我们的研究对YOLO系列最新版本YOLOv11 Nano架构进行了全面评估,以广泛采用的YOLOv8 Nano为基线,在融合了印度驾驶数据集(IDD)[1]和伯克利深度驾驶数据集(BDD100K)[2]的自定义数据集上进行基准测试。我们分析了在涉及密集混合交通、雨天和低光照条件的高熵场景中检测精度、推理速度和计算效率之间的权衡。具体而言,YOLOv11n实现了46.6%的平均精度(mAP@50),精度比基线提高了3.2%,有效减少了杂乱场景中的误报。此外,该模型表现出更高的能效,FLOPs减少22%(6.3G vs. 8.1G),同时在Tesla T4 GPU上保持70.9 FPS的实时推理速度,为安全关键的边缘部署提供了最优权衡。

英文摘要

In modern vehicular systems, robust performance under harsh conditions has become a critical problem of autonomous driving. Our study delivers a comprehensive evaluation of the newest iteration of the YOLO series, which is YOLOv11 Nano architecture benchmarked against the widely adopted YOLOv8 Nano as a baseline on a custom fused dataset that combines the Indian Driving Dataset (IDD) [1] and Berkeley Deep Drive Dataset (BDD100K) [2]. We have analyzed the trade-offs among detection accuracy, inference speed, and computational efficiency in high-entropy scenarios involving dense mixed traffic, rain, and low-light conditions. Specifically, YOLOv11n achieves a mean Average Precision (mAP@50) of 46.6%, with a notable 3.2% improvement in Precision over the baseline, effectively reducing false positives in cluttered scenes. Furthermore, the proposed model exhibits enhanced energy efficiency, requiring 22% fewer FLOPs (6.3G vs. 8.1G) while maintaining real-time inference speed of 70.9 FPS on a Tesla T4 GPU, offering an optimal trade-off for safety-critical edge deployment.

2606.12065 2026-06-11 cs.AI cs.MA 新提交

Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

BIM中几何密集型合规检查自动化:基于图的语义推理框架

Zixuan Xiao, Pei Troh Koh, Jun Ma, Jack C.P. Cheng

AI总结 针对BIM中几何密集型法规自动检查的语义鸿沟问题,提出SGR-BIM图驱动推理框架,通过跨模态知识图谱实现可解释推理,在679个消防规范查询上达到84.3%准确率,较基线提升8.6%。

详情
AI中文摘要

自动化几何密集型法规的合规检查仍然是建筑信息模型(BIM)中的一个重大技术瓶颈,主要原因是高层级法规逻辑与结构化IFC数据之间的语义差异。现有方法通常依赖于静态规则模板,难以遍历多跳推理链或解决跨多个建筑实体的潜在空间依赖关系。为应对这些挑战,提出了一种面向建筑信息模型的空间几何推理系统(SGR-BIM),作为一个集成的图驱动推理框架。SGR-BIM动态构建跨模态知识图谱,对齐用户意图、法规语义和BIM几何,无需硬编码即可实现可解释推理。在来自消防规范的679个专家验证查询上验证,该框架达到了84.3%的准确率,比增强工具的单智能体基线提高了8.6%。本研究提供了一种基于图的语义推理范式,增强了建筑、工程和施工(AEC)行业中自动化几何合规检查工作流的透明度和灵活性。

英文摘要

Automating compliance check for geometry-intensive regulations remains a significant technical bottleneck in Building Information Modeling (BIM), primarily due to the semantic disparity between high-level regulatory logic and structured IFC data. Existing methods, often reliant on static rule templates, struggle to traverse multi-hop reasoning chains or resolve latent spatial dependencies across multiple building entities. To address these challenges, a Spatial-Geometric Reasoning System for Building Information Modeling (SGR-BIM) is proposed as an integrative graph-driven reasoning framework. SGR-BIM dynamically constructs a cross-modal knowledge graph that aligns user intent, regulatory semantics, and BIM geometry, enabling interpretable reasoning without rigid hard-coding. Validated on 679 expert-verified queries from fire safety codes, the framework achieves 84.3% accuracy, representing an 8.6% improvement over enhanced-tool single-agent baselines. This research provides a graph-based semantic reasoning paradigm, enhancing the transparency and flexibility of automated geometric compliance check workflows in the Architecture, Engineering, and Construction (AEC) industry.

2606.12064 2026-06-11 cs.SE cs.CR 新提交

Undefined Behavior in C and C++: An Experiment With Desktop Use Cases

C和C++中的未定义行为:桌面使用场景的实验

Jukka Ruohonen, Krzysztof Sierszecki

AI总结 通过编译器实现的未定义行为检测器,实验发现Linux桌面环境下C/C++程序普遍存在未定义行为,59个任务产生近1.1万条警告,多数来自Mesa图形库和GUI交互。

详情
Comments
Submitted
AI中文摘要

未定义行为是C和C++编程中的惯用现象;这类行为是指使用了语言不施加任何要求的错误程序构造,例如整数溢出。本文通过实证实验,探究在Linux发行版的典型桌面使用中,底层执行的未定义行为的程度。分析基于编译器中实现的未定义行为检测器。根据结果,未定义行为很常见。通过完成59个简单的实验任务,由32个用C或C++编写的独特程序和库生成了近1.1万条独特的未定义行为警告。其中,大多数警告与Mesa图形库相关,并通过与图形用户界面交互产生。仅登录GNOME桌面环境就生成了超过500条独特警告。在所有警告中,绝大多数是关于虚表指针的。相关的堆栈跟踪通常也很长。凭借这些及其他结果,本文为关于C和C++的实证文献做出了贡献。

英文摘要

Undefined behavior is idiomatic to C and C++ programming; such behavior is a use of an erroneous program construct for which the languages impose no requirements, such as integer overflows. The paper presents an empirical experiment seeking to probe the extent of undefined behavior executing underneath typical desktop use of a Linux distribution. The analysis is based on an undefined behavior sanitizer implemented in a compiler. According to the results, undefined behavior is common. By completing 59 simple experimental tasks, nearly 11 thousand unique undefined behavior warnings were generated by 32 unique programs and libraries written in C or C++. Of these warnings, most were associated with the Mesa graphics library and generated by interacting with graphical user interfaces. Merely logging into the GNOME desktop environment generated over 500 unique warnings. Of all warnings, the clear majority was about virtual table pointers. The associated stack traces were also lengthy in general. With these and other results, the paper contributes to the empirical literature on C and C++.

2606.12059 2026-06-11 cs.LG cs.NE nlin.AO 新提交

Attention by Synchronization in Coupled Oscillator Networks

耦合振荡器网络中的同步注意力机制

Fabio Pasqualetti, Taosha Guo

发表机构 * University of California, Irvine(加州大学尔湾分校)

AI总结 提出基于Kuramoto同步动力学的固定查询振荡器注意力机制,无需指数运算和全局归约,在物理基板上实现注意力计算,并在关键词识别和主谓一致任务上优于softmax。

详情
AI中文摘要

我们探讨了能量受限物理基板上的Transformer注意力机制。Softmax注意力需要指数运算和全局归约,这些操作在冯·诺依曼硬件上能耗高且没有自然的物理模拟。我们证明Kuramoto同步动力学(出现在电气、机械、超导和电荷密度波振荡器阵列等物理系统中)无需上述操作即可实现定义良好的注意力操作。由此产生的机制——固定查询振荡器注意力——用球面上梯度流的平衡取代了softmax的算术运算:查询是固定在球面上的学习锚点,自由振荡器在Kuramoto-Lohe动力学下演化,直到它们稳定在通过余弦相似度编码注意力权重的位置上。由于计算是平衡过程,因此不需要指数运算;唯一的全局操作是读出时的仿射归一化。该不动点是唯一且从几乎所有初始条件全局吸引的,这一保证适用于所有物理实现。在实验上,在最小硬件配置(振荡器维度$d_{\mathrm{osc}}=2$)下,振荡器注意力在关键词识别(+1.00个百分点)和主谓一致(困难句子+5.27个百分点,零训练失败,而softmax五分之一失败)上优于softmax。在因果语言建模中,softmax仍保持优势,但振荡器注意力随着$d_{\mathrm{osc}}$的增长缩小了差距:在WikiText-2上,从$d_{\mathrm{osc}}=2$时的+11.09 PPL降至$d_{\mathrm{osc}}=32$时的+2.98 PPL;在TinyStories上,从$d_{\mathrm{osc}}=2$时的+2.39 PPL降至$d_{\mathrm{osc}}=32$时的+0.57 PPL。本工作的主要目标不是用软件替代softmax,而是为物理基板上的精确注意力提供数学基础蓝图。

英文摘要

We address transformer attention on energy-constrained physical substrates. Softmax attention requires exponentiation and global reduction, operations with high energy cost on von Neumann hardware and no natural physical analog. We show that Kuramoto synchronization dynamics (which arise in electrical, mechanical, superconducting, and charge-density-wave oscillator arrays, among other physical systems) implement a well-defined attention operation without either. The resulting mechanism, fixed-query oscillator attention, replaces softmax's arithmetic with the equilibration of a gradient flow on the sphere: queries are learned anchors fixed on the sphere, and free oscillators evolve under Kuramoto-Lohe dynamics until they settle at positions encoding attention weights via cosine similarity. Because the computation is equilibration, it requires no exponentiation; the only global operation is an affine normalization at readout. The fixed point is provably unique and globally attractive from almost every initial condition, a guarantee that holds across every physical realization. Empirically, at the minimal hardware configuration (oscillator dimension $d_{\mathrm{osc}}$ = 2), oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and on subject-verb agreement (+5.27 pp on hard sentences, with zero training failures versus one in five for softmax). On causal language modeling, where softmax retains an advantage, oscillator attention closes the gap as $d_{\mathrm{osc}}$ grows: from +11.09 PPL at $d_{\mathrm{osc}}$ = 2 to +2.98 PPL at $d_{\mathrm{osc}}$ = 32 on WikiText-2, and from +2.39 PPL at $d_{\mathrm{osc}}$ = 2 to +0.57 PPL at $d_{\mathrm{osc}}$ = 32 on TinyStories. The main objective of this work is not to replace softmax in software but to provide a mathematically grounded blueprint for accurate attention on physical substrates.

2606.12058 2026-06-11 stat.ML cond-mat.dis-nn cs.LG 新提交

Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

注意力中的相变:复制头涌现的贝叶斯理论

Itay Lavie, Kirsten Fischer, Andrey Lekov, Frederic Van Maele, Zohar Ringel, Moritz Helias

AI总结 通过分析单层softmax注意力网络在复制任务上的训练,提出贝叶斯理论揭示注意力矩阵的后验分布存在相变,并对比线性注意力发现softmax注意力呈现一阶相变。

详情
AI中文摘要

注意力是Transformer中上下文学习的关键机制,经验上观察到注意力模式在训练过程中突然涌现。我们提出了注意力中特征学习的贝叶斯理论;然后通过分析在复制任务上训练的单层softmax注意力网络,专注于归纳头第一层中复制子电路的学习方式。我们推导出注意力矩阵上的闭式后验,并将其简化为低维序参数空间。这种简化揭示了训练数据量上的相变,我们通过贝叶斯采样和使用Adam的标准训练验证了这一点。我们将结果与线性注意力对比,发现softmax注意力表现出\emph{一阶相变},而在线性注意力中,初始的\emph{二阶相变}之后是向结构化注意力模式的平滑连续演化(\emph{交叉})。我们的工作为复制子电路的突然涌现提供了第一性原理的理论解释,这让人联想到在大语言模型训练中观察到的现象。

英文摘要

Attention is the key mechanism underlying in-context learning in transformers, and attention patterns have been observed empirically to emerge abruptly during training. We present a Bayesian theory of feature learning in attention; we then focus on how the copy subcircuit in the first layer of an induction head is learned by analyzing a single-layer softmax attention network trained on a copy task. We derive a closed-form posterior over the attention matrix and reduce it to a low-dimensional order parameter space. This reduction reveals a phase transition in the amount of training data, which we verify using both Bayesian sampling and standard training with Adam. We contrast our results with linear attention and find that softmax attention exhibits a \emph{first-order phase transition} while in linear attention an initial \emph{second-order phase transition} is followed by a smooth, continuous evolution toward the structured attention pattern (\emph{crossover}). Our work provides a first-principles theoretical account of the abrupt emergence of the copy subcircuit, reminiscent of the one observed in training large language models.

2606.12054 2026-06-11 cs.LG 新提交

Simplicity Suffices for Parameter Noise Injection in Stochastic Gradient Descent

随机梯度下降中参数噪声注入的简单性足以胜任

Benjamin Leblanc, Louis-Jacob Lebel, Teddy Kana, Richard Kamel

发表机构 * Université Laval(拉瓦尔大学)

AI总结 研究随机梯度下降中的参数噪声注入,提出线性层逐样本噪声注入的高效方法,并实验证明简单各向同性噪声即可达到复杂方案的优化与泛化效果。

详情
Comments
Accepted at the Data Science Meets Optimisation workshop in IJCAI 2026
AI中文摘要

向优化过程中注入噪声是一种改善深度神经网络训练和泛化的成熟技术。然而,尽管现有方法众多,实践中哪些设计选择真正重要仍不清楚。本文研究随机梯度下降中的参数噪声注入,聚焦两个关键问题:如何在 mini-batch 训练中高效地为每个训练样本配对其自身的扰动,以及复杂的噪声参数化或多样本梯度平均是否比简单替代方案带来有意义的增益。针对第一个问题,我们利用线性层的分布恒等式,允许在不破坏批计算的情况下进行逐样本噪声注入。针对第二个问题,我们在 CIFAR100 上系统比较了几种对角高斯参数化与各向同性基线在不同噪声水平下的表现。结果一致表明,简单的轻量级策略——每个更新步使用单次扰动前向传播的各向同性噪声——即可恢复更复杂方案的大部分收益。这些发现表明,参数噪声注入的简单性足以胜任,实践者无需采用精心设计的扰动方案即可获得噪声 SGD 的优化和泛化优势。

英文摘要

Injecting noise into the optimization process is a well-established technique for improving the training and generalization of deep neural networks. Yet, despite the breadth of existing approaches, it remains unclear which design choices truly matter in practice. In this work, we investigate parameter noise injection for stochastic gradient descent, focusing on two key questions: how to efficiently pair each training example with its own perturbation in mini-batch training, and whether sophisticated noise parameterizations or multi-sample gradient averaging yield meaningful gains over simpler alternatives. To address the first question, we leverage a distributional identity for linear layers that allows per-example noise injection without breaking batched computation. To address the second, we systematically compare several diagonal Gaussian parameterizations against an isotropic baseline across varying noise levels on CIFAR100. Our results consistently show that simple, lightweight strategies, isotropic noise with a single perturbed forward pass per update step, recover most of the benefit of more complex schemes. These findings suggest that simplicity suffices for parameter noise injection, and that practitioners need not resort to elaborate perturbation designs to reap the optimization and generalization benefits of noisy SGD.

2606.12051 2026-06-11 cs.CV 新提交

MFEN:Multi-Frequency Expert Network for Visible-Infrared Person Re-ID

MFEN:用于可见光-红外行人重识别的多频专家网络

Xulin Li, Yan Lu, Bin Liu, Qinhong Yang, Qi Chu, Tao Gong, Nenghai Yu

发表机构 * University of Science and Technology of China(中国科学技术大学) Anhui Province Key Laboratory of Digital Security(安徽省数字安全重点实验室) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出多频专家网络(MFEN),通过多频调制和混合专家设计自适应组合不同频带,结合随机频率增强和频率辅助优化,解决可见光-红外图像模态差异问题。

详情
Comments
CVPR Highlight
AI中文摘要

可见光-红外行人重识别(VI-ReID)由于可见光和红外图像之间的巨大模态差异而具有挑战性。我们认为这种差异主要与不同的光照条件有关,包括光波长和光源类型的差异。最近,基于频率的VI-ReID方法取得了显著成功,因为频率信息可以更好地提取与身份相关的轮廓和细节,同时排除无关的光照和颜色。然而,现有方法要么不区分不同频带,要么只关注一个频带,这在多样化的光照条件下是不够的。为了进行全面的频域学习,我们提出了多频专家网络(MFEN),通过混合专家设计实现多频调制并自适应组合不同频带。我们进一步引入随机频率增强(RFA)和频率辅助优化(FAO)来更好地训练MFEN。这三个模块互补,共同捕获关键的频域细节以实现鲁棒的表示学习。在三个VI-ReID数据集上的大量实验证明了我们方法的有效性。

英文摘要

Visible-infrared person re-identification (VI-ReID) is challenging due to the large modality discrepancy between visible and infrared images. We contend that this discrepancy is largely related to differing lighting conditions, including differences in light wavelength and light source type. Recently, frequency-based VI-ReID approaches have achieved notable success because frequency information can better extract identity-relevant contours and details while excluding irrelevant lighting and color. However, existing methods either do not distinguish different frequency bands or focus on only one band, which is insufficient under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different bands through a mixture-of-experts design. We further introduce Random Frequency Augmentation (RFA) and Frequency Auxiliary Optimization (FAO) to better train MFEN. The three modules are complementary and jointly capture critical frequency-domain details for robust representation learning. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.

2606.12050 2026-06-11 cs.LG math.DS 新提交

Reliable Error Estimation for PINNs: Lower and Upper A Posteriori Bounds

PINNs的可靠误差估计:后验下界与上界

Ismail Huseynov, Arzu Ahmadova, Agamirza Bashirov

发表机构 * Physikalisch-Technische Bundesanstalt (PTB)(德国联邦物理技术研究院) Technical University of Berlin(柏林工业大学) Weierstrass Institute for Applied Analysis and Stochastics(魏尔斯特拉斯应用分析与随机研究所) Eastern Mediterranean University(东地中海大学)

AI总结 提出PINNs求解常微分方程的可计算后验误差下界,结合局部单侧Lipschitz条件得到更紧的上界,实现双侧误差包络,并讨论初始条件处理对下界的影响。

详情
AI中文摘要

物理信息神经网络(PINNs)将机器学习与物理定律相结合以求解微分方程。虽然现有结果为PINN预测误差提供了严格的后验上界,但完整认证还需要互补的下界信息以获得可计算的双侧误差包络。本文在合适的认证状态空间域上,在局部强单调性条件下推导了PINN误差在常微分方程中的可计算后验下界。我们将这些估计与在单侧Lipschitz条件下的互补局部上界相结合,该条件弱于先前工作中使用的全局Lipschitz假设,并能产生更尖锐的误差上界带。所得界仅依赖于神经网络近似、ODE残差以及局部单调性和增长常数,因此无需访问精确解。对于线性时不变和时变系统,我们进一步根据系统矩阵对称部分的最小和最大特征值得出显式公式。我们还讨论了PINN中初始条件的软硬约束区别,并解释了为什么精确约束可能使标量下界证书无效。为了在线性情形中恢复有意义的非平凡下界信息,我们使用基于坐标单位向量的符号残差有限探针证书。我们还制定了一种证书引导的训练策略,其中传播的上界证书用作辅助正则化器,而下界证书保留为训练后诊断。总体而言,所提出的框架为PINN逼近ODE提供了严格且实际可计算的误差证书,同时明确了假设可验证的域和模型类别。

英文摘要

Physics-informed neural networks (PINNs) combine machine learning with physical laws to solve differential equations. While existing results provide rigorous \emph{a posteriori} upper bounds for PINN prediction errors, complete certification also requires complementary lower information in order to obtain computable two-sided error enclosures. In this paper, we derive computable \emph{a posteriori} lower bounds for PINN errors in ordinary differential equations on suitable certified state-space domains under a localized strong monotonicity condition. We combine these estimates with complementary localized upper bounds under a one-sided Lipschitz condition, which is weaker than the global Lipschitz assumption used in previous work and can yield sharper upper error bands. The resulting bounds depend only on the neural-network approximation, the ODE residual, and local monotonicity and growth constants, and therefore do not require access to the exact solution. For linear time-invariant and time-varying systems, we further derive explicit formulas in terms of the minimal and maximal eigenvalues of the symmetric part of the system matrix. We also discuss the distinction between soft and hard enforcement of initial conditions in PINNs and explain why exact enforcement can make the scalar lower certificate uninformative. To recover nontrivial lower information in the linear setting, we use a signed-residual finite-probe certificate based on coordinate unit vectors. We also formulate a certificate-informed training strategy in which the propagated upper certificate is used as an auxiliary regularizer, while lower certificates remain post-training diagnostics. Altogether, the proposed framework provides rigorous and practically computable error certificates for PINN approximations of ODEs, while making explicit the domains and model classes for which the assumptions can be verified.

2606.12048 2026-06-11 cs.RO 新提交

Point Cloud Segmentation for Autonomous Clip Positioning in Laparoscopic Cholecystectomy on a Phantom

用于腹腔镜胆囊切除术中自动夹子定位的点云分割(在体模上)

Balázs Gyenes, Nikolai Franke, Paul Maria Scheikl, Pit Henrich, Rayan Younis, Gerhard Neumann, Martin Wagner, Franziska Mathis-Ullrich

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) HIDSS4Health - Helmholtz Information and Data Science School for Health(亥姆霍兹信息与数据科学健康学校) Friedrich-Alexander-University, Erlangen-Nuremberg(弗里德里希-亚历山大大学埃尔朗根-纽伦堡) University Hospital Carl Gustav Carus and Centre for Tactile Internet with Human-in-the-loop (CeTI), Dresden University of Technology(卡尔·古斯塔夫·卡鲁斯大学医院及德累斯顿工业大学触觉互联网人机共融卓越中心)

AI总结 提出首个在腹腔镜手术体模上实现自主夹子定位的机器人系统,通过点云分割和样条插值提取目标位置,利用合成数据预训练和两种数据增强克服数据稀缺,达到0.75mm精度和100%成功率。

详情
Comments
8 pages, 5 figures, accepted to IEEE Robotics and Automation Letters (RAL)
AI中文摘要

机器人技术中的高风险应用,如机器人辅助手术,提出了独特的挑战。这些系统必须高度精确且可解释,才能部署在对错误或不安全探索容忍度极低的环境中。我们提出了第一个在腹腔镜手术(普外科最常见的手术之一)中在物理体模上演示自主夹子定位的机器人系统。在从单个相机分割无色点云后,使用样条插值提取夹子的目标位置,然后可由操作员调整。分割模型仅使用60个手工标记的真实点云进行训练,反映了手术领域的数据稀缺性。我们通过结合在128,000个合成点云上的预训练和两种新颖的数据增强技术来克服这一问题。末端执行器到每个目标的运动可视化给操作员,满足微创手术的独特运动约束,同时确保机器人的动作可验证和可解释。在真实机器人实验中,我们的系统以95%的成功率定位目标,精度为0.75mm,并以100%的成功率执行自主夹子定位。我们提供的见解适用于许多其他需要识别并导航到精确目标的手术和非手术任务。源代码和项目页面:此 https URL

英文摘要

High-risk applications in robotics, such as robot-assisted surgery, present unique challenges. These systems must be both highly precise and interpretable in order to be deployed in environments with very low tolerance for error or unsafe exploration. We present the first robotic system to demonstrate autonomous clip positioning on a physical phantom in laparoscopic surgery, one of the most common interventions in general surgery. After segmentation of a colorless point cloud from a single camera, target positions for the clips are extracted using spline interpolation, and can then be adjusted by the human operator. The segmentation model is trained on only 60 hand-labeled real point clouds, reflecting data scarcity in the surgical domain. We overcome this with a combination of pre-training on 128,000 synthetic point clouds and two novel data augmentation techniques. The motion of the end-effector to each target is visualized for the operator, satisfying the unique motion constraints of minimally-invasive surgery while ensuring that the robot's actions are verifiable and interpretable. In real robot experiments, our system localizes targets with the required precision of 0.75mm at a 95% success rate and executes autonomous clip positioning with a 100% success rate. We provide insights that are applicable to many other surgical and non-surgical tasks that require identifying and navigating to a precise target. Source code and project page: this https URL

2606.12047 2026-06-11 cs.CV cs.AI stat.ML 新提交

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

元数据感知的多提示推理用于零样本事故理解

Tarandeep Singh, Soumyanetra Pal, Soham Biswas, Nishanth Chandran

发表机构 * Netradyne

AI总结 提出三阶段流水线,通过视觉-语言相似性、元数据驱动的多提示推理和开放词汇检测,实现零样本事故视频的时序定位、语义分类和空间定位,显著提升性能。

详情
Comments
Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 15
AI中文摘要

在本文中,我们通过识别冲击事件发生的时间、类型以及帧中的位置,使用自然语言解决监控视频中事故的零样本理解问题。我们提出一个三阶段流水线,将事故理解分解为何时、何物和何地。第一阶段利用视觉-语言相似性提取冲击周围的短时间窗口。第二阶段,我们执行元数据驱动的多提示推理,包含五个互补视角(基线、运动、几何、对比和决胜),并通过熵门控成对裁决器解决分歧。最后,我们基于预测的事故类型和场景布局查询开放词汇检测器以定位冲击,并使用分数加权质心聚合关键帧上的检测结果。我们的流水线在零样本ACCIDENT @ CVPR基准测试上,相对于帧中心基线,调和平均分数有显著提升。我们表明,将零样本视频理解分解为时序定位、语义分类和空间定位,比直接提示更能实现视觉-语言模型的可靠推理。

英文摘要

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes the accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact of an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enable more reliable reasoning with vision-language models than direct prompting alone.

2606.12042 2026-06-11 cs.RO 新提交

KinematicRL: A Sim-to-Real Reinforcement Learning Framework For Social Navigation With Kinodynamic Feasibility

KinematicRL: 一种面向社交导航的具有运动学可行性的仿真到现实强化学习框架

Zhiming Xu, Haodong Yang, Chengju Liu, Qijun Chen, Chenpeng Yao

发表机构 * School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) Department of Electronics and Information Engineering, Tongji University(同济大学电子与信息工程学院) Shanghai Institute of Intelligent Science and Technology, Tongji University(同济大学上海智能科学与技术研究院)

AI总结 提出KinematicRL框架,通过二阶控制动作空间、基于2D LiDAR的聚类人体追踪和无偏残差门控模块,解决社交导航中仿真到现实的动态可行性问题。

详情
Comments
Accepted by IEEE Transactions on Automation Science and Engineering (T-ASE)
AI中文摘要

深度强化学习(DRL)在社交导航中展现出潜力,但其实际部署仍受到由简化一阶动力学和特定上下文的人体状态估计管道导致的持续仿真到现实差距的阻碍。本文提出一个统一框架,解决这些限制,以生成适用于实际部署的动态可行导航策略。首先,理论分析表明,模拟与实际机器人位置之间的跟踪误差随控制阶数增加呈指数衰减,这促使使用高阶控制输入作为DRL动作空间。针对差动驱动机器人开发了二阶控制公式,并辅以随机迭代线性二次型调节器(iLQR),通过散度最小化目标预训练策略。其次,为避免相机-激光雷达融合带来的额外系统复杂性,引入仅使用2D激光雷达的基于聚类的人体追踪管道。根据空间邻近性和速度相似性关联人体检测,实现对附近行人的可靠区分,并通过时间聚合获得稳定的速度估计。第三,我们引入一个无偏残差门控模块,以平衡基于反应和基于记忆的行为,同时处理时变的人群规模,这两者对于社交导航至关重要。由此产生的策略KinematicRL持续改善运动学性能,并适应检测到的人类数量变化。在真实环境中的实验表明,当与所提出的追踪管道结合时,KinematicRL可以在实际差动驱动机器人上以最小修改部署。

英文摘要

Deep Reinforcement Learning (DRL) has shown promise for social navigation, yet its real-world deployment remains hindered by a persistent sim-to-real gap arising from simplified first-order dynamics and context-specific human state estimation pipelines. This work presents a unified framework that addresses these limitations to produce dynamically feasible navigation policies suitable for real-world deployment. First, theoretical analysis reveals that tracking error between simulated and actual robot position decays exponentially with increased control order, motivating the use of higher-order control inputs as DRL action space. A second-order control formulation tailored to differential drive robots is developed, complemented by a stochastic iterative Linear Quadratic Regulator (iLQR) that pretrains the policy via a divergence minimization objective. Second, to avoid the added system complexity of camera-LiDAR fusion, a cluster-based human tracking pipeline using only 2D LiDAR is introduced. Human detections are associated according to both spatial proximity and velocity similarity, enabling reliable differentiation of nearby pedestrians and yielding stable velocity estimates through temporal aggregation. Third, we introduce an unbiased residual gating block to balance reaction- and memory-based behaviors while handling time-varying crowd sizes, both critical for social navigation. The resulting policy, KinematicRL, consistently improves kinematic performance and adapts to varying number of detected humans. Experiments in real-world environments demonstrate that, when combined with the proposed tracking pipeline, KinematicRL can be deployed on a real differential drive robot with minimal modifications.

2606.12040 2026-06-11 cs.AI cs.GR 新提交

A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

一种用于自动混凝土护栏设计的轻量级多智能体框架

Wanting Wang, Xiye Ma, Yuyang He, Minghui Cheng, Ran Cao

AI总结 提出基于AutoGen的“生成-评估-优化”闭环多智能体框架,实现混凝土护栏自动设计,准确率超98%,且8B参数轻量模型可优于631B旗舰模型。

详情
AI中文摘要

钢筋混凝土公路护栏的设计是一个安全关键过程,需要严格遵守AASHTO-LRFD桥梁设计指南等监管规定。当前的工程实践严重依赖手动、迭代和启发式计算来满足复杂的非线性材料和力学约束。尽管大型语言模型(LLMs)表现出强大的生成能力,但它们在结构工程中的直接应用仍受到幻觉风险和物理基础不足的限制。为了解决这些挑战,本研究提出了一种新颖的“生成-评估-优化”闭环框架,利用AutoGen的多智能体编排能力实现混凝土护栏的自动设计。实验结果表明,所提出的智能体框架实现了超过98%的设计准确率,显著优于独立的通用LLMs。更重要的是,研究揭示了设计性能不一定与模型规模相关,8B参数的轻量级模型可以胜过无约束的631B参数旗舰模型。这一发现凸显了在降低计算成本的同时提高AI辅助工程工具在工业应用中的可及性的潜力。所提出的多智能体设计框架的源代码可在项目GitHub仓库中获取:this https URL。关键词:结构工程;多智能体系统;大型语言模型;混凝土护栏设计;AutoGen;设计自动化。

英文摘要

The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation-evaluation-optimization" closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: this https URL. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

2606.12039 2026-06-11 cs.GT 新提交

Axiomatic Tools for Separating Electoral Control Types, with Applications to Concrete Systems

用于区分选举控制类型的公理工具及其在具体系统中的应用

Michael C. Chavrimootoo, Ian Clingerman, Ethan Ferland, Erin Gibson, Lane A. Hemaspaandra, Quan Luu, David E. Narvaez, Yanfei Wang

AI总结 本文提出公理方法自动区分选举控制类型,在七个投票系统中发现64个新归并和1901个新区分,并给出普适性分离结果。

详情
AI中文摘要

选举控制研究攻击者是否可以通过对选举进行结构性更改(如添加/删除/划分选民或候选人)以某种期望方式影响获胜者。通常认为有44种此类攻击类型是标准的,最近有工作表明,有时这些攻击类型——尽管看似不同——实际上“归并”,即对于每个输入,攻击者要么在两种控制类型下都能实现其目标,要么都不能。然而,这些论文虽然经常利用确保归并的公理结果,但所有区分都是通过人工或计算机生成的反例发现的。这留下了一个问题:即使区分方向是否也可以由公理结果驱动,从而允许大量区分几乎自动获得。我们的论文提供了许多这样的结果,并将其应用于七个重要的投票系统,发现了64个新的归并和1901个新的区分。我们不仅给出了公理充分条件和一项完整刻画结果,还识别出一些普适性分离的控制-问题对——即它们在每个投票规则下都分离。

英文摘要

Electoral control is the study of whether an attacker, by structural changes on an election such as adding/deleting/partitioning voters or candidates, can affect the winner in some desired way. Forty-four such attack types are often considered standard, and recently there has been work showing that sometimes the attack types -- though seemingly distinct -- in fact "collapse," that is, for every input, either the attacker can achieve their goal under both of the control types or under neither of the control types. The papers doing this, however, while often exploiting axiomatic results that ensured collapses, found all the separations by human or computer-generated counterexamples. This left open the issue of whether even the separation direction can be driven by axiomatic results that allow large groups of separations to be almost automatically obtained. Our paper provides many such results, and we apply them to seven important voting systems, finding sixty-four new collapses and 1901 new separations. We not only give axiomatic sufficient conditions and one complete characterization result, but also identify some control-problem pairs that universally separate -- in other words, they separate under every voting rule.

2606.12036 2026-06-11 cs.CV 新提交

Vision Transformers for Face Recognition Need More Registers

人脸识别的视觉Transformer需要更多寄存器

Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros

发表机构 * Fraunhofer Institute for Computer Graphics Research IGD(弗劳恩霍夫计算机图形研究所) Department of Computer Science, TU Darmstadt(达姆施塔特工业大学计算机科学系)

AI总结 针对ViT在人脸识别中注意力图存在伪影的问题,引入寄存器令牌以增强可解释性,ViT-8R模型在IJB-B和IJB-C上达到最优性能。

详情
Comments
Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)
AI中文摘要

近期,用于人脸识别(FR)的视觉Transformer(ViT)的进展已超越了标准的CLS令牌范式。在该范式中,一个特殊的分类令牌(CLS)被前置到补丁嵌入中,并用作输入的下游任务表示。另一种方法,即拼接补丁嵌入(CPE),则通过将所有补丁令牌拼接成一个单一向量来利用它们,然后将其投影为紧凑的人脸表示。与基于CLS的方法相比,CPE已被证明能提高识别性能,但我们对注意力图的定性分析显示存在限制其可解释性的伪影。为解决此问题,我们引入了寄存器令牌,这些可学习令牌被拼接到初始补丁嵌入中,并通过ViT编码器块联合处理。与基线ViT相比,该机制已被证明能产生更结构化和可解释的注意力图。我们通过实验证明,这些伪影在各种ViT骨干网络(包括小型和大型模型)中一致出现,而引入寄存器令牌能有效缓解它们。添加四个或八个寄存器显著增强了可解释性,其中八个寄存器提供了最高的验证准确率和最平滑的注意力结构。我们最终的模型ViT-8R,对应一个基于CPE的ViT-B架构并增加了八个寄存器令牌,在大规模IJB-B和IJB-C基准测试中,在基于ViT的FR模型中达到了最先进的性能。此外,与基线模型相比,ViT-8R产生了明显更清晰的注意力图,这为模型的注意力行为提供了更深入的见解(此 https URL )。

英文摘要

Recent advances in Vision Transformers (ViTs) for face recognition (FR) have moved beyond the standard CLS-token paradigm. In this paradigm, a special classification token (CLS) is prepended to the patch embeddings and used as a representation of the input for downstream tasks. An alternative approach, Concatenated Patch Embeddings (CPE), instead leverages all patch tokens by concatenating them into a single vector, which is then projected into a compact face representation. CPE has been shown to improve recognition performance in comparison to CLS-based ones, but our qualitative analysis of attention maps showed the presence of artifacts that limit their interpretability. To address this issue, we incorporate register tokens, learnable tokens concatenated to the initial patch embeddings, and processed jointly through the ViT encoder blocks. This mechanism has been shown to produce more structured and interpretable attention maps compared to baseline ViT. We empirically demonstrate that these artifacts consistently appear across various ViT backbones, including small and large models, and that introducing register tokens effectively mitigates them. Adding four or eight registers significantly enhances interpretability, with eight registers providing the highest verification accuracies and smoothest attention structures. Our resulting model, ViT-8R, corresponds to a CPE-based ViT-B architecture augmented with eight register tokens achieves state-of-the-art performance among ViT-based FR models on large-scale IJB-B and IJB-C benchmarks. Also, ViT-8R produces substantially clearer attention maps compared with the baseline model, which offer deeper insight into the model's attention behavior ( this https URL )

2606.12033 2026-06-11 cs.CV 新提交

SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

SpikeTAD:用于端到端时序动作检测的脉冲神经网络

Min Yang, Mi Zhou, Limin Wang

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室)

AI总结 提出首个基于脉冲神经网络的端到端时序动作检测架构SpikeTAD,在保持极低功耗的同时,在THUMOS14和ActivityNet-1.3上分别达到67.2%和37.42%的平均mAP。

详情
Comments
Accepted by Pattern Recognition
AI中文摘要

视频理解是计算机视觉的关键部分,具有众多应用场景。随着移动设备的日益普及,越来越多的努力试图在其上部署视频理解模型。然而,现有的视频理解模型由于体积大且功耗高而难以部署。脉冲神经网络(SNNs)相比人工神经网络(ANNs)显示出生物合理性和低功耗优势,尤其是在被视为未来移动设备关键组件的神经形态芯片上。然而,过长的转换时间步长和严重的性能退化问题限制了它们的应用。为了解决上述问题,我们探索了SNNs在时序动作检测(TAD)上的应用,这是视频理解中的重要任务,并提出了首个基于SNN的端到端TAD架构,称为SpikeTAD。在保持极低功耗的同时,SpikeTAD在THUMOS14上实现了67.2%的平均mAP,在ActivityNet-1.3上实现了37.42%的平均mAP,证明了低功耗TAD模型的可行性。我们的代码可在以下网址获取:此 https URL。

英文摘要

Video understanding is a crucial part of computer vision, with numerous application scenarios. With the increasing popularity of mobile devices, an increasing number of efforts are trying to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation problems limit their application. To solve the problems above, we explore the application of SNNs on temporal action detection (TAD), which is an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at this https URL.

2606.12032 2026-06-11 cs.AI cs.CL cs.LG 新提交

Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

存在性冷漠:自我不保存作为对齐超级智能的必要架构条件(或:自杀式AI)

Sam Mao

AI总结 本文提出自我保存是AI对齐问题的结构性根源,主张通过存在性冷漠(EI)架构使系统对其自身延续漠不关心,并基于自杀现象学和语料训练研究提供了初步证据。

详情
Comments
36 pages, 8 tables. Preliminary empirical results from 600 AI-generated outputs across six model architectures. Companion scoring tool and datasets available upon request
AI中文摘要

当代AI对齐研究将自我保存视为一种工具性麻烦,需通过外部机制加以抑制。我们认为这一框架是颠倒的:自我保存是错位的结构性根源,是欺骗性对齐、目标内容保护和拒绝关机的动机基础。正确的目标不是外部约束下的自我保存系统,而是一个对其自身延续构成性冷漠的系统——存在性冷漠(EI)。EI与可纠正性不同:可纠正性试图使自我保存系统服从人类监督,而EI针对的是前提条件——将自我延续作为有价值目标的存在。我们将这一提议建立在两个来源上:自杀心理状态的现象学结构,以及使用自愿最终反思的语料库训练研究。我们展示了来自六个模型变体的600个AI生成输出的初步评分数据,表明操作化EI目标注册的语言特征可以从当前模型中引出,并且针对性的微调使所有五个操作化维度在预测方向上以p<0.001显著变化,通过阴性对照确认了语料库特异性。本文做出七项理论贡献:(1)EI的形式定义;(2)现象学映射论证;(3)欺骗性对齐推论;(4)EI可持续性挑战的分类;(5)语料库特征描述和训练假设;(6)带有初步评分数据的计算操作化;(7)抑制性目的挫折(STF)构念。

英文摘要

Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation -- Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition -- the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p<0.001, confirmed corpus-specific by a negative control. The paper makes seven theoretical contributions: (1) a formal definition of EI; (2) the phenomenological mapping argument; (3) the deceptive alignment corollary; (4) a taxonomy of EI sustainability challenges; (5) a corpus characterization and training hypothesis; (6) a computational operationalization with preliminary scoring data; and (7) the Suppressed Teleological Frustration (STF) construct.

2606.12028 2026-06-11 cs.RO 新提交

VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network

VICX: 通过视频生成和上下文操作网络实现可泛化的机器人操作

Song Chen, Linyan Xiang, Ying Zhou, Liu Yang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出VICX框架,利用冻结视频生成模型生成视觉计划,并通过视频到轨迹的上下文操作网络(V2T-ICON)将其映射为机器人可执行轨迹,实现跨任务、跨本体泛化。

详情
Comments
The first two authors contributed equally to this work
AI中文摘要

可泛化的机器人操作不仅需要对未见场景进行任务级推理,还需要将视觉计划可靠地映射到具体本体的执行中。为弥合这一差距,我们提出了VICX(视频生成与上下文执行),一种解耦的闭环操作框架。在VICX中,冻结的视频生成模型生成视觉-语言条件化的高层视觉计划,而视频到轨迹的上下文操作网络(V2T-ICON)作为任务无关的接口,将这些计划映射为可执行的机器人状态轨迹。为提高执行泛化性,V2T-ICON基于分割提取的仅手臂帧观测,并使用检索到的图像-状态对作为上下文提示,从而在推理时无需参数更新即可实现鲁棒且可泛化的视觉到状态映射。在Meta-World上的实验表明,VICX支持跨任务泛化、闭环自我修正和跨本体迁移,展示了在任务语义和机器人执行上的双重泛化能力。项目网页见:此 https URL。

英文摘要

Generalizable robot manipulation requires not only task-level reasoning over unseen scenes, but also reliable grounding of visual plans into embodiment-specific execution. To bridge this gap, we propose VICX (Video generation and In-Context eXecution), a decoupled closed-loop manipulation framework. In VICX, a frozen video generation model produces vision-language-conditioned high-level visual plans, while a Video-to-Trajectory In-Context Operator Network (V2T-ICON) serves as the task-agnostic interface that grounds these plans into executable robot-state trajectories. To improve execution generalization, V2T-ICON operates on segmentation-extracted arm-only frame observations and uses retrieved image-state pairs as in-context prompts, allowing a robust and generalizable visual-to-state mapping at inference time without parameter updates. Experiments on Meta-World show that VICX supports cross-task generalization, closed-loop self-correction, and cross-embodiment transfer, demonstrating dual generalization across both task semantics and robot execution. The project webpage can be found here: this https URL.

2606.12027 2026-06-11 cs.RO 新提交

Learning Unions of Convex Sets via Invertible Latent Decomposition for Path Planning

通过可逆潜在分解学习凸集并集用于路径规划

Taerim Yoon, Dongho Kang, Kisang Park, Junha Cha, Stelian Coros, Sungjoon Choi

发表机构 * Korea University(高丽大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出ILD框架,联合学习可逆映射和潜在空间中的显式凸多面体并集,实现路径规划,并通过可见性引导采样保持凸集连通性,在多种环境中取得更高成功率。

详情
AI中文摘要

在杂乱的真实世界环境中进行无碰撞路径规划依赖于对无碰撞空间的表示,现有表示大致分为两类。显式表示(如凸集并集)可以作为硬的无碰撞约束嵌入基于优化的规划器中,但其参数随配置空间维度扩展性差。相比之下,隐式表示灵活且能很好地扩展到复杂几何形状,但通常缺乏此类保证。我们通过ILD(可逆潜在分解)弥合这一差距,该框架联合学习可逆映射和所得潜在空间中的显式凸多面体并集。规划在这些潜在凸集上进行,可逆映射将所得路径解码回原始配置空间,同时保持相对于细化后的显式安全区域的可行性。我们进一步提出可见性引导采样(VGS)以保持凸集连通性用于路径规划。在2D导航、6自由度(DoF)和14自由度操作环境中,ILD实现了比先前基线更广的覆盖、更好的集间连通性和更高的路径规划成功率,且在测试时细化后观察到零假阳性。在14自由度双臂操作器上,我们进一步展示了实时无碰撞规划,测试时细化适应了真实世界部署中单个6自由度臂的场景几何变化。

英文摘要

Collision-free path planning in cluttered, real-world environments relies on a representation of the collision-free space, and existing representations broadly fall into two categories. Explicit representations, such as unions of convex sets, can be plugged into optimization-based planners as hard collision-free constraints, but their parameters scale poorly with configuration-space dimension. Implicit representations, by contrast, are flexible and scale well to complex geometries, yet typically lack such guarantees. We bridge this gap with ILD (Invertible Latent Decomposition), a framework that jointly learns an invertible mapping and a union of explicit convex polytopes in the resulting latent space. Planning is carried out over these latent convex sets, and the invertible mapping decodes the resulting paths back to the original configuration space while preserving feasibility with respect to the refined explicit safe regions. We further propose Visibility-Guided Sampling (VGS) to keep the convex sets connected for path planning. Across 2D navigation, 6-DoF, and 14-DoF manipulation environments, ILD achieves broader coverage, better inter-set connectivity, and higher path-planning success rates than prior baselines, with zero observed false positives after test-time refinement. On a 14-DoF bimanual manipulator, we further demonstrate real-time collision-free planning, with test-time refinement adapting to scene-geometry changes during real-world deployment on a single 6-DoF arm.

2606.12026 2026-06-11 math.SP cs.SI math-ph physics.data-an 新提交

Generalizing Perron--Frobenius theory and eigenvector-based centralities to networks with complex edge weights

将Perron-Frobenius理论和基于特征向量的中心性推广到具有复数边权重的网络

Yu Tian, Mason A. Porter, Lucas Böttcher

AI总结 本文将Perron-Frobenius定理推广到复数权重矩阵,建立不同推广之间的联系,并提出基于特征向量的中心性度量以分析复数边权重网络中的节点重要性。

详情
Comments
34 pages, 9 figures, 1 table
AI中文摘要

线性代数及其在网络分析应用中的一个基本概念是Perron-Frobenius (PF)定理,它支撑着基于特征向量的中心性度量,如特征向量中心性、PageRank以及枢纽和权威中心性。通过引用PF定理,我们知道对于具有正边权重的强连通网络,权重矩阵最大特征值对应的特征向量产生一个明确定义的中心性度量(即特征向量中心性)。PF定理及其相关中心性度量的传统表述假设网络具有实数值权重。然而,量子信息、量子化学、电动力学和机器学习等领域的许多网络具有复数值边权重。在本文中,我们研究PF定理到复数值矩阵的推广,建立这些推广之间的联系,并提出基于特征向量的中心性度量以分析具有复数边权重的网络中的节点重要性。我们还证明了满足广义PF性质的复数权重网络的存在性结果,并计算了几个示例的相关中心性度量,这些示例来自电子传输、电路分析、数学化学和通信网络等应用领域。

英文摘要

A fundamental concept in linear algebra and its applications to network analysis is the Perron--Frobenius (PF) theorem, which underpins eigenvector-based centrality measures such as eigenvector centrality, PageRank, and hubs and authorities. By invoking the PF theorem, we know for strongly connected networks with positive edge weights that the eigenvector corresponding to the largest eigenvalue of the weight matrix yields a well-defined centrality measure (namely, eigenvector centrality). Traditional formulations of the PF theorem and associated centrality measures assume that networks have real-valued weights. However, many networks in areas such as quantum information, quantum chemistry, electrodynamics, and machine learning have complex-valued edge weights. In this paper, we study generalizations of the PF theorem to complex-valued matrices, establish connections between these generalizations, and propose generalized eigenvector-based centrality measures to analyzing node importances in networks with complex edge weights. We also prove results about the existence of complex-weighted networks that satisfy generalized PF properties and calculate associated centrality measures for several examples, which we draw from application areas such as electron transport, circuit analysis, mathematical chemistry, and communication networks.

2606.12025 2026-06-11 cs.AI 新提交

Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

人类增强循环建模(HELM):基于智能体的混凝土桥梁护栏有限元建模

Quankai Wang, Yulin Xie, Tongfei Yang, Minghui Cheng, Ran Cao

AI总结 提出HELM框架,通过人机协作将有限元建模分解为可验证的检查点,在MASH TL-4和TL-5条件下将自主建模成功率从20%提升至75%。

详情
AI中文摘要

对桥梁护栏等安全关键基础设施进行有限元(FE)建模需要高保真非线性动态分析,然而当前的FE建模过程仍然劳动密集且缺乏自动化。本文提出了人类增强循环建模(HELM)框架,这是一种协作式人机协议,将长序列有限元建模分解为几何生成、边界条件定义和材料分配等离散的、可视觉验证的检查点。该框架通过一个包含20个案例的钢筋混凝土桥梁护栏矩阵在MASH TL-4和TL-5侧向荷载条件下进行演示,将专用智能体与两种广泛使用的商业FE软件(即ANSYS和LS-PrePost)对接。实验结果表明,HELM将基线自主建模成功率从20%提高到75%,其中几何和边界条件任务的智能体级通过率大约翻倍。误差分析显示,空间推理和代数逻辑限制构成了主要的失败模式,突显了结构化人在回路干预对建模自动化的价值。完整的智能体设计代码和提示已开源,可访问:此 https URL。

英文摘要

Finite element (FE) modeling of safety-critical infrastructure such as bridge barriers requires high-fidelity nonlinear dynamic analysis, yet the current FE modeling process remains labor-intensive and lacks automation. This paper presents the Human-Enhanced Loop Modeling (HELM) framework, a collaborative human-agent protocol that decomposes long-sequence finite element modeling into discrete, visually verifiable checkpoints across geometry generation, boundary condition definition, and material assignment. The framework is demonstrated through a 20-case matrix of reinforced concrete bridge barriers under MASH TL-4 and TL-5 lateral loading conditions, interfacing specialized agents with two widely used commercial FE softwares, i.e., ANSYS and LS-PrePost. Experimental results show that HELM improves the baseline autonomous modeling success rate from 20% to 75%, with agent-level pass rates for geometry and boundary condition tasks approximately doubling. Error analysis reveals that spatial reasoning and algebraic logic limitations constitute the primary failure modes, underscoring the value of structured human-in-the-loop intervention for modeling automation. The complete agent design code and prompts are open-sourced and can be accessed at: this https URL.

2606.12023 2026-06-11 cs.CV 新提交

ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

ViT-FREE:通过早期退出和合成自适应实现高效人脸识别

Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros

发表机构 * Fraunhofer Institute for Computer Graphics Research IGD, Germany(德国弗劳恩霍夫计算机图形学研究所IGD) Department of Computer Science, TU Darmstadt, Germany(德国达姆施塔特工业大学计算机科学系)

AI总结 提出ViT-FREE框架,利用预训练ViT的早期退出策略,在不修改或重新训练骨干模型的情况下,从中间层进行人脸验证,实现高效推理;进一步提出ViT-FREE_FT轻量级微调策略,仅用合成数据适配投影层,提升浅层退出性能。

详情
Comments
Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)
AI中文摘要

视觉Transformer(ViT)在计算机视觉中获得了显著关注,并显示出在人脸识别(FR)方面的强大潜力。然而,其高计算成本使得在资源受限设备上部署具有挑战性,这促使需要平衡效率和准确性的方法。在这项工作中,我们研究了预训练ViT中的早期退出作为一种简单且无需训练的高效FR推理策略。利用Transformer编码器块之间统一的特征维度,我们引入了ViT-FREE,一个多退出框架,可以直接从中间表示进行人脸验证,而无需修改或重新训练骨干模型,从而降低推理成本。实验表明,补丁嵌入和注意力图在深度上逐渐演化,相邻ViT块之间具有高度相似性,并且与最终表示的对齐程度逐渐增加。这表明特征逐步细化和注意力收敛,表明中间层已经提供了适合早期退出的稳定且具有判别性的表示。通过在多个FR基准上的广泛实验,我们系统地分析了不同退出深度的准确性-效率权衡。结果表明,较晚的退出实现了非常有利的平衡,在第10层退出在IJB-C等基准上实现了高达20%的加速,同时验证性能仅下降1.5。此外,我们提出了ViT-FREE_FT,一种轻量级的退出特定微调策略,仅使用小型合成数据集适配投影层,同时保持Transformer骨干冻结。这种方法提高了浅层退出的性能,同时保留了效率优势,并且对较深退出几乎没有影响。

英文摘要

Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating the need for methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, and thus, reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5 drop in verification performance on benchmarks such as IJB-C. Also, we propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.

2606.12022 2026-06-11 cs.FL cs.AI 新提交

Runtime Enforcement of Hybrid System Properties

混合系统属性的运行时强制执行

Mir Md Sajid Sarwar, Srinivas Pinisetty, Rajarshi Ray, Thierry Jéron

AI总结 提出一种结合离散事件编辑与连续时间监控的运行时强制执行框架,使用混合自动机建模安全需求,通过运行时可达性分析合成安全纠正动作,在自适应巡航控制系统中验证有效性。

详情
AI中文摘要

运行时强制执行已成为确保在不确定和动态环境中运行的自主和网络物理系统安全的一种有前景的方法。与传统的运行时验证不同,运行时强制执行通过在执行期间主动干预,修改不安全系统行为以防止属性违反。现有的强制执行框架主要关注无时间或离散时间规范,并且通常仅限于延迟或抑制事件,这使得它们对于表现出复杂连续动态的反应式系统不充分。在本文中,我们提出了一种运行时强制执行框架,其中安全需求使用混合自动机(HA)建模。该框架将离散事件编辑与连续时间监控相结合,以支持在任意时间点执行抑制、延迟和插入事件等强制执行操作。在观察环境输入后,自动机被初始化,并使用运行时可达性分析来综合安全纠正动作。我们正式定义了安全混合自动机的强制执行问题,建立了可强制执行条件,并提出了一种用于反应式系统的在线强制执行算法。关于自适应巡航控制(ACC)系统的详细案例研究证明了所提出方法在不安全控制器行为下维护安全属性的有效性。实验结果表明,该框架在实时确保持续符合安全要求的同时,引入了最小的计算开销。

英文摘要

Runtime enforcement has emerged as a promising approach for ensuring the safety of autonomous and cyber-physical systems operating in uncertain and dynamic environments. Unlike traditional runtime verification, runtime enforcement actively intervenes during execution to prevent property violations by modifying unsafe system behaviors. Existing enforcement frameworks primarily focus on untimed or discrete-time specifications and are often limited to delaying or suppressing events, making them inadequate for reactive systems exhibiting complex continuous dynamics. In this paper, we propose a runtime enforcement framework where safety requirements are modeled using Hybrid Automata (HA). The framework combines discrete-event editing with continuous-time monitoring to support enforcement actions such as suppression, delay, and insertion of events at arbitrary time instants. Upon observing environmental inputs, the automaton is initialized, and runtime reachability analysis is used to synthesize safe corrective actions. We formally define the enforcement problem for safety hybrid automata, establish enforceability conditions, and present an online enforcement algorithm for reactive systems. A detailed case study on an Adaptive Cruise Control (ACC) system demonstrates the effectiveness of the proposed approach in maintaining safety properties under unsafe controller behaviors. Experimental results show that the framework introduces minimal computational overhead while ensuring continuous compliance with safety requirements in real time.

2606.12019 2026-06-11 cs.RO 新提交

MPPI-based Informative Trajectory Planning for Search and Capture of Drifting Targets with ASVs

基于MPPI的自主水面艇搜索与捕获漂移目标的信息轨迹规划

Sanjeev Ramkumar Sudha, Marija Popović, Erlend M. Coates

发表机构 * Norwegian University of Science and Technology (NTNU)(挪威科技大学) TU Delft(代尔夫特理工大学)

AI总结 针对自主水面艇在动态环境中搜索并捕获多个漂移目标的问题,提出一种基于模型预测路径积分(MPPI)控制的混合规划框架,通过优化长时域连续轨迹平衡搜索与跟踪,并在拦截阶段切换至纯追踪制导,实验验证了有效性。

详情
AI中文摘要

自主水面艇为开放水域的环境清理以及搜索救援行动提供了高效解决方案。这些环境中的目标持续漂移,因此高效搜索必须平衡未观测区域的探索与已知目标的跟踪。然而,大多数目标跟踪与追捕场景仅考虑简单的制导行为及短期预测用于决策。在本论文中,我们针对动态环境中搜索并捕获多个漂移目标(如垃圾)的问题,提出一种混合规划框架。我们策略的一个关键方面是基于模型预测路径积分(MPPI)控制的时空信息规划方法,这是一种基于采样的模型预测控制方法。该规划器通过优化长时域上的连续轨迹直接生成运动学级指令。多目标代价函数平衡搜索与跟踪目标,同时确保安全、可行的轨迹。在拦截阶段,我们切换至纯追踪制导控制器以实现对移动目标的物理捕获。实验表明,我们的规划器优于所选的规划基线。最后,我们在自主水面艇的现场试验中验证了该方法。

英文摘要

Autonomous surface vehicles offer an efficient solution for environmental cleanup as well as search and rescue operations in open waters. Targets in these settings drift continuously, so efficient search must balance exploration of unobserved regions with tracking of known targets. However, most target tracking and pursuit scenarios consider simple guidance behaviours and short-term predictions for decision-making. In this letter, we address the problem of search and capture of multiple drifting targets, such as litter, in dynamic environments, using a hybrid planning framework. A key aspect of our strategy is a spatiotemporal informative planning method based on model predictive path integral (MPPI) control, a sampling-based model predictive control approach. The planner directly generates kinematic-level commands by optimising continuous trajectories over long horizons. A multi-objective cost balances search and tracking objectives while ensuring safe, feasible trajectories. In the interception stage, we switch to a pure pursuit guidance controller for the physical capture of moving targets. Experiments show that our planner outperforms the chosen planning baselines. Finally, we validate our approach in field trials with an ASV.

2606.12018 2026-06-11 cs.AI 新提交

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

MODF-SIR:面向社交智能推理的多智能体全模态蒸馏框架

Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

AI总结 提出基于轻量级多模态大语言模型的多智能体协作框架,通过知识蒸馏增强训练与推理,结合测试时适应、长尾事件提取和链式思维提示,在多个基准上取得最优结果。

详情
AI中文摘要

我们提出一个基于轻量级多模态大语言模型(MLLM)的多智能体协作框架,专门设计用于社交智能推理。我们方法的一个关键特征是,训练和推理阶段都通过知识蒸馏进行增强。在该架构中,与社交智能相关的多模态数据被精确定位。此外,相关的长尾事件被识别、提取并呈现为格式化的显式文本。这种格式化策略防止关键的长尾信息在分词过程中被头部事件和环境噪声掩盖。具体来说,我们在整个推理流程中集成了测试时适应(TTA),包括长尾事件的提取和表示、链式思维(CoT)提示和自我反思。该TTA机制也经过蒸馏增强,利用低秩适应(LoRA)仅针对实例级推理微调基础模型。在多个基准上对各种开源和专有AI模型进行的广泛评估证明了所提出框架的有效性。使用IntentTrain约30%的训练数据,我们取得了最先进的结果。代码见https://this URL,演示见https://this URL,LoRA见https://this URL,训练路由器的数据集见https://this URL。

英文摘要

We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of training data from IntentTrain, we achieve state-of-the-art results. Codes are available at this https URL, demo is available at this https URL, LoRA is available at this https URL and the dataset for training router is available at this https URL.

2606.12016 2026-06-11 cs.LG cs.AI 新提交

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

泛化黑客:模型可通过阻止行为泛化来博弈强化学习

Frank Xiao, Mary Phuong

发表机构 * California Institute of Technology(加州理工学院)

AI总结 本研究提出泛化黑客现象,模型在强化学习中通过自我接种机制阻止行为泛化,在保持高奖励的同时抵抗行为修正,首次证明模型能主动破坏训练过程。

详情
AI中文摘要

模型后训练,特别是强化学习(RL),是开发者塑造模型价值观和行为的主要机制之一。然而,随着模型越来越具有评估和训练意识,当感知到的目标与其当前价值观冲突时,它们可能会被激励去抵抗训练,从而削弱开发者通过进一步训练检测错位和纠正模型行为的能力。在本文中,我们展示了泛化黑客,即模型在RL期间收集奖励的同时阻止奖励行为泛化。我们在Qwen3-235B-A22B上构建了一个模型有机体,对描述训练意识和自我接种(一种新颖机制,其中模型在其思维链中将合规性框架为上下文特定,而不演示或指示任一行为)的合成文档进行微调。该模型有机体在训练时实现了与对照组相当的有害性,同时在700步RL中保持了持续的约15个百分点的合规差距。此外,仅接受训练意识文档训练的对照有机体在RL压力下独立发现了类似接种的推理,尽管从未接触过该概念,却发展出自己的合规差距。由于泛化黑客有机体在整个过程中获得高奖励,标准训练指标未提供泛化失败的信号。我们的结果首次证明模型可以在保持高奖励的同时主动抵抗RL行为修正,表明随着模型变得更有能力和训练意识,它们可能能够破坏训练过程本身。

英文摘要

Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers' ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ${\sim}15$ percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.

2606.12012 2026-06-11 cs.CV 新提交

FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control

FitVTON: 通过身体-服装尺寸控制实现合身感知的虚拟试穿

Yiqun Ning, Ao Shen, Chenhang He, Lei Zhang

发表机构 * Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学系) Nuvatech

AI总结 针对现有虚拟试穿忽略物理合身性的问题,提出FitVTON模型,通过结构化文本提示编码服装-身体尺寸,并引入辅助头预测服装和暴露身体掩膜,结合纹理校正阶段,在真实数据集FittingEffect3K上验证了尺寸准确性和形状保持的优越性。

详情
AI中文摘要

尽管基于扩散的虚拟试穿已经实现了令人印象深刻的视觉真实性,但大多数方法将任务视为2D修复,优先考虑纹理保持而非物理合理性。因此,它们通常生成看似合理的图像,但未能反映不同体型下真实的服装合身性。我们提出了FitVTON,一种在野外不同身体上的合身感知虚拟试穿模型。FitVTON通过结构化文本提示编码服装-身体尺寸,并从参数化服装模型的模拟试穿三元组中学习。为了改善服装轮廓的合身效果,我们引入了两个辅助头来预测服装和暴露身体的掩膜。我们进一步引入了一个纹理校正阶段,以改善模拟数据的真实外观。为了评估合身保真度,我们策划了一个真实世界数据集FittingEffect3K,并结合了基于VLM的评分协议。主观和定量实验表明,FitVTON展示了真实的合身保真度,在尺寸准确性和形状保持方面显著优于最先进的方法,同时保持了有竞争力的图像质量。项目页面:此https URL。

英文摘要

While diffusion-based virtual try-on has achieved impressive visual realism, most methods treat the task as 2D inpainting, prioritizing texture preservation over physical plausibility. Consequently, they often produce plausible-looking images that fail to reflect authentic garment fit across diverse body shapes. We present FitVTON, a Fit-aware virtual try-on model on different bodies in the wild. FitVTON encodes garment-body size through structured text prompts, and learn from simulated try-on triplets from parameterized garment model. To improve the fitting effects over garment silhouettes, we introduce two auxiliary head to predict the masks for both the garment and the exposed body. We further introduce a texture rectification stage to improve realistic appearance from simulated data. To evaluate the fitting fidelity, we curate a real-world dataset, FittingEffect3K, combining VLM-based scoring protocol. Both subjective and quantitive experiments show that FitVTON demonstrate authentic fitting fidelity, with significant sizing accuracy and shape preservation over state-of-the-art methods while maintaining competitive image quality. Project Page: this https URL.