arXivDaily: daily arXiv digest, updated Monday through Friday
2605.21935 2026-05-22 cs.RO

Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments

Peifeng Jiang, Hong Liu, Jin Jin, Wenshuai Wang, Xia Li

Affiliations: State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School; Oxford Robotics Institute, University of Oxford; Institute for Machine Learning, Department of Computer Science, ETH Zurich

AI summary: This paper proposes the Multi-modal Interactive Field (MIF), a system that combines confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction in a closed-loop perception-adaptation pipeline for robust humanoid navigation, substantially improving relocation success in non-static environments and reducing the semantic memory footprint.

Comments: Accepted by Robotics: Science and Systems 2026

English abstract

Safe manipulation-oriented navigation for humanoid robots requires scene memory that remains reliable under locomotion-induced perceptual distortion, environmental changes, and interaction-level geometric safety constraints. Existing semantic mapping and scene-graph systems are difficult to deploy directly in this setting because they often assume stable camera trajectories, static environments, or coarse object geometry. We introduce the Multi-modal Interactive Field (MIF), a humanoid-oriented system that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction within a closed-loop perception-adaptation pipeline. MIF couples three fields: an uncertainty-aware 3DGS Appearance Field that suppresses gait-induced blur, a Spatial Field that maintains topological memory, and a Geometry Field that supports Interaction Pose Safety (IPS) before manipulation. A discrepancy detection score separates locomotion-induced false-positive changes from persistent changes, so that only locally inconsistent regions are updated. On a Unitree-G1 humanoid in a real dynamic office, MIF improves relocation success in non-static environments from 12% to 94% compared with static scene-graph memory, while reducing semantic memory footprint by 91.4% through feature distillation for practical online operation. Project page and code: https://ziya-jiang.github.io/MIF-homepage/

2605.21932 2026-05-22 cs.RO

Auction-Consensus Algorithm with Learned Bidding Scheme for Multi-Robot Systems

Jose Rodriguez, Constantine Tarawneh, Sven Koenig, Wenjie Dong, Qi Lu

Affiliations: Department of Electrical and Computer Engineering, The University of Texas at Rio Grande Valley (UTRGV); Department of Mechanical Engineering, UTRGV; Department of Computer Science, Donald Bren School of Information and Computer Sciences, University of California, Irvine; Department of Computer Science, UTRGV

AI summary: This paper proposes a learning-enhanced auction-consensus framework that improves task allocation in multi-robot systems by training a neural bidding policy with reinforcement learning, while retaining the standard auction and consensus phases for decentralized coordination.

Comments: The 23rd International Conference on Ubiquitous Robots, 9 figures, 6 pages

English abstract

Multi-Robot Task Allocation (MRTA) is a central challenge in decentralized multi-agent systems, where teams of robots must cooperatively assign and execute tasks under limited communication while optimizing global performance objectives. Auction-consensus algorithms, such as the Consensus-Based Bundle Algorithm (CBBA), provide scalable decentralized coordination with provable convergence, but rely on hand-crafted greedy scoring functions that often lead to suboptimal task allocations. This paper proposes a learning-enhanced auction-consensus framework in which CBBA's deterministic bidding mechanism is replaced by a neural bidding policy trained using reinforcement learning. Under a centralized training and decentralized execution paradigm, agents learn to compute task bids from partial local observations while retaining the standard auction and consensus phases for decentralized coordination. The learned bidding policy is trained using Proximal Policy Optimization with rewards shaped by proximity to globally optimal solutions obtained via mixed-integer linear programming. Multiple neural architectures are evaluated, including a Neural Additive Model, the Long Short-Term Memory (LSTM) model, and the Set Transformer Model. Experimental results across varying swarm sizes demonstrate that learned bidding policies can improve solution quality over classical CBBA while preserving decentralized execution. The proposed approach highlights the effectiveness of integrating reinforcement learning with classical distributed coordination algorithms, offering a scalable pathway toward higher-quality decentralized multi-robot task allocation.
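The bid-then-resolve loop that the abstract describes can be illustrated with a heavily simplified sketch. This is not CBBA and not the paper's method: each agent takes at most one task, the "consensus" step is reduced to picking the single highest bid per round, and `greedy_score` and `auction_allocate` are hypothetical names. The `score` argument marks the place where a learned bidding policy would replace the hand-crafted function.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]


def greedy_score(agent: Point, task: Point) -> float:
    """Hand-crafted bid: closer tasks score higher (negative Euclidean distance)."""
    return -((agent[0] - task[0]) ** 2 + (agent[1] - task[1]) ** 2) ** 0.5


def auction_allocate(
    agents: List[Point],
    tasks: List[Point],
    score: Callable[[Point, Point], float] = greedy_score,
) -> Dict[int, int]:
    """Assign at most one task per agent; returns {task index: agent index}."""
    assignment: Dict[int, int] = {}
    free_agents = set(range(len(agents)))
    free_tasks = set(range(len(tasks)))
    while free_agents and free_tasks:
        # Auction phase: every free agent bids on its best remaining task.
        bids = []
        for a in sorted(free_agents):
            best = max(sorted(free_tasks), key=lambda t: score(agents[a], tasks[t]))
            bids.append((score(agents[a], tasks[best]), a, best))
        # Simplified consensus phase: the highest bid wins this round
        # (ties go to the lower agent index).
        _, winner, task = max(bids, key=lambda b: (b[0], -b[1]))
        assignment[task] = winner
        free_agents.remove(winner)
        free_tasks.remove(task)
    return assignment
```

Swapping `score` for a function backed by a trained model leaves the rest of the loop unchanged, which mirrors the abstract's point that only the bidding mechanism is replaced.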

2605.21931 2026-05-22 cs.CV

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

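Shiqi Huang, Ziyue Wang, Zhongrong Zuo, Han Qiu, Qi She, Bihan Wen

Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University; ByteDance

AI summary: This paper proposes EvoVid, a temporal-centric self-evolving framework that lets Video-LLMs improve directly from unannotated videos. With two complementary temporal-centric rewards, a temporal-aware Questioner reward and a temporal-grounded Solver reward, EvoVid improves over base models and existing self-evolving baselines across four base models and six benchmarks, showing the effectiveness of temporal-centric self-evolution for video understanding and reasoning.

Comments: Project page: https://huangshiqi128.github.io/EvoVid.io/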

English abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.

2605.21928 2026-05-22 cs.LG cs.AI stat.ME

CausalGuard: Conformal Inference under Graph Uncertainty

Vikash Singh, Weicong Chen, Debargha Ganguly, Yanyan Zhang, Nengbo Wang, Sreehari Sankar, Mohsen Hariri, Alexander Nemecek, Chaoda Song, Shouren Wang, Biyao Zhang, Van Yang, Erman Ayday, Jing Ma, Vipin Chaudhary

Affiliations: Case Western Reserve University

AI summary: This paper proposes CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes, providing distribution-free finite-sample marginal coverage under graph uncertainty.

English abstract

Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
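For readers unfamiliar with the conformal calibration step, a minimal sketch of standard split conformal prediction follows. It uses plain absolute-residual scores; it does not implement CausalGuard's composite nonconformity score, graph weighting, or doubly robust pseudo-outcomes, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def conformal_quantile(scores: List[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return sorted(scores)[k - 1]


def conformal_interval(
    prediction: float,
    calib_preds: List[float],
    calib_targets: List[float],
    alpha: float = 0.1,
) -> Tuple[float, float]:
    """Symmetric interval around `prediction` with marginal coverage >= 1 - alpha
    under exchangeability of calibration and test points."""
    scores = [abs(y - p) for p, y in zip(calib_preds, calib_targets)]
    q = conformal_quantile(scores, alpha)
    return prediction - q, prediction + q
```

The interval width is set entirely by the calibration residuals, which is why a poor underlying estimator (for example one built on a misspecified graph) can only regain nominal coverage through wide intervals, the "padding" the abstract refers to.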

2605.21924 2026-05-22 cs.CV

Visual-Advantage On-Policy Distillation for Vision-Language Models

Ruiqi Liu, Xiaolei Lv, Gengsheng Li, Ximo Zhu, Zhiheng Wang, Zhengbo Zhang, Junkai Chen, Zhiheng Li, Bo Li, Jun Gao, Shu Wu

Affiliations: Institute of Automation, CAS; School of Advanced Interdisciplinary Sciences, UCAS; Hello Group Inc.; Sun Yat-sen University

AI summary: This paper proposes Visual-Advantage On-Policy Distillation, which strengthens a vision-language model's reliance on visual input by introducing a visual advantage metric that separates vision-critical tokens from language tokens, improving distillation.

English abstract

On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language tokens. We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.
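A small sketch may help make the token-level quantities concrete. It computes VA as the difference of teacher log-probabilities with and without visual detail, then averages per-token KL separately within high-VA and low-VA groups. The fixed `threshold` used to split the groups and the choice to sum the two group means are assumptions for illustration; the abstract does not specify either, and the function names are hypothetical.

```python
from typing import List


def visual_advantage(logp_with: List[float], logp_without: List[float]) -> List[float]:
    """Per-token teacher log-prob with visual detail minus without it."""
    return [a - b for a, b in zip(logp_with, logp_without)]


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def grouped_kl(token_kl: List[float], va: List[float], threshold: float) -> float:
    """Average KL within high-VA and low-VA token groups separately, then add.

    Assumption: a fixed threshold defines the groups and the two group means
    are summed with equal weight.
    """
    high = [k for k, v in zip(token_kl, va) if v >= threshold]
    low = [k for k, v in zip(token_kl, va) if v < threshold]
    return _mean(high) + _mean(low)
```

With three tokens whose KL values are 0.1, 0.9, and 0.2 and only the middle token above the threshold, a plain mean gives 0.4 while the grouped form gives 0.9 + 0.15 = 1.05, so the single vision-critical token is no longer diluted by the surrounding language tokens.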

2605.21919 2026-05-22 cs.CV cs.AI

SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

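Zihang Lin, Huaiyuan Qin, Muli Yang, Hongyuan Zhu

Affiliations: Nanyang Technological University

AI summary: This paper proposes SDGBiasBench, a large-scale benchmark suite for evaluating and mitigating vision-language model biases on the Sustainable Development Goals. It analyzes decision-level and estimation-level bias and proposes CADE to reduce that bias and improve model accuracy and reliability.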

English abstract

Assessing progress toward the Sustainable Development Goals (SDGs) requires multi-step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real-world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large-scale benchmark suite for SDG-oriented vision-language reasoning. Spanning 500k expert-involved multiple-choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision-level and estimation-level bias in Vision-Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG-specific priors rather than reliable multi-modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training-free, plug-and-play method that leverages modality-specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of fairer and more reliable AI systems for sustainable development.

2605.21917 2026-05-22 cs.CV cs.AI

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

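Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, Vidya Murali

Affiliations: NVIDIA

AI summary: This paper proposes MAVEN, a multi-stage agentic annotation pipeline that generates multi-task training data with Chain-of-Thought reasoning traces for video event reasoning. Its core is a Multi-Scale Spatio-Temporal Event Description; it supports agent-driven domain adaptation, improves data quality through a hierarchical refinement loop, and is validated on multiple datasets.

Comments: CVPR 2026 Workshop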

English abstract

Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a +38.8-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by +10.7 MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.

2605.21914 2026-05-22 cs.RO

Non-Contact Vibration-Based Damage Detection of Civil Structures Using a Cost-Effective Autonomous UAV

Javier Becerril, Maximiliano Vargas, Jennifer Herrera, Joanna Gutierrez, Jorge Rios, Mohsen Amjadian, Constantine Tarawneh, Jinghao Yang, Qi Lu

Affiliations: Department of Computer Science, The University of Texas at Rio Grande Valley (UTRGV); Department of Mechanical Engineering, UTRGV; Department of Civil Engineering, UTRGV; Department of Electrical and Computer Engineering, UTRGV

AI summary: This paper proposes a non-contact, vibration-based method for detecting damage in civil structures using a low-cost autonomous UAV. Vibration signals are extracted from video through vision-based motion tracking to identify natural-frequency shifts that indicate structural degradation. Experiments on a laboratory-scale frame structure under healthy and simulated-damage conditions show that the UAV reliably detects damage-induced frequency changes despite somewhat higher error, with inspection performance comparable to commercial UAV systems at significantly lower cost.

Comments: 8 pages, 8 figures, The 2026 International Conference on Unmanned Aircraft Systems, ICUAS 2026

English abstract

This paper presents a non-contact approach for vibration-based structural damage detection using an autonomous, custom-built, cost-effective unmanned aerial vehicle (UAV). Vibration signals are extracted from video recordings through vision-based motion tracking to identify shifts in natural frequencies indicative of structural degradation. A laboratory-scale frame structure is evaluated under healthy and simulated-damage conditions. The proposed system is validated through an experimental study involving two smartphones, a USB camera, and a custom-built low-cost UAV equipped with an onboard camera and an autonomous alignment system for operation in GPS-denied environments. The displacement time history is extracted, analyzed in the frequency domain, and compared to reference measurements from contact accelerometers and a finite element model. Experimental results show that all platforms successfully capture the fundamental frequency and its shift due to damage. Although the UAV exhibits slightly higher errors (up to 5.7%) due to platform-induced disturbances and sensing limitations, it reliably detects damage-induced frequency changes. Compared to commercial UAV systems, the proposed platform achieves comparable inspection performance at significantly lower cost. These results demonstrate that low-cost autonomous UAVs provide a practical, flexible, and scalable solution for structural health monitoring, particularly in scenarios where contact-based sensing is impractical. The findings also support the potential for the deployment of multiple cooperative UAVs to further enhance inspection coverage and robustness.
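The frequency-domain step can be sketched as follows: take a displacement signal, find the frequency bin with the largest spectral magnitude, and compare it between a healthy and a damaged state. This is a generic illustration using a naive discrete Fourier transform from the standard library, not the authors' processing chain; the function names and the synthetic signals are assumptions.

```python
import cmath
import math
from typing import List


def dominant_frequency(signal: List[float], sample_rate: float) -> float:
    """Return the frequency (Hz) of the largest non-DC spectral peak."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the static offset
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        # Naive DFT coefficient for bin k (O(n^2) overall, fine for short signals).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n


def relative_shift(f_reference: float, f_measured: float) -> float:
    """Signed fractional change of the measured frequency versus the reference."""
    return (f_measured - f_reference) / f_reference
```

A drop in the dominant frequency between the two states (a negative `relative_shift`) is the kind of indicator the abstract describes. The frequency resolution is `sample_rate / n`, so longer recordings are needed to resolve small shifts.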

2605.21913 2026-05-22 cs.CV

Multi-scale interaction network for stereo image super-resolution

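Liyi Xu, Lin Qi

Affiliations: Ocean University of China

AI summary: This paper proposes a multi-scale interaction network for stereo image super-resolution that improves intra-view feature extraction and cross-view matching accuracy, achieving competitive super-resolution results.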

English abstract

Stereo image super-resolution aims to generate high-resolution images by leveraging complementary information from binocular systems. Although previous studies have achieved impressive results, the potential of intra-view and cross-view information has not been fully exploited. To address this issue, we propose a novel multi-scale interaction network for stereo image super-resolution. Specifically, we design a Multi-scale Spatial-Channel Attention Module that utilizes multi-scale large separable kernel attention and simple channel attention to improve intra-view feature extraction. Additionally, we propose a Dual-View Epipolar Attention Module, utilizing an optimal transport algorithm to achieve more accurate matching along the epipolar line. Extensive experiments and ablation studies show that our method achieves competitive results that outperform most SOTA methods.
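The abstract does not say which optimal transport algorithm is used. As one common choice, the sketch below shows entropy-regularized optimal transport solved with Sinkhorn iterations, which turns a matching cost between left-view and right-view positions on the same epipolar line into a soft assignment. Uniform marginals, the regularization strength, the iteration count, and the function name are all assumptions.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], reg: float = 0.1, iters: int = 200) -> List[List[float]]:
    """Return a transport plan with uniform marginals for the given cost matrix."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> large affinity.
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Unlike a row-wise softmax over similarities, the transport plan is constrained in both directions, which discourages many left-view positions from matching the same right-view position.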

2605.21911 2026-05-22 cs.LG

Noise Schedule Design for Diffusion Models: An Optimal Control Perspective

Seo Taek Kong, Weina Wang, R. Srikant

Affiliations: ECE & CSL, University of Illinois Urbana-Champaign; Computer Science Department, Carnegie Mellon University; ECE, CSL & NCSA, University of Illinois Urbana-Champaign

AI summary: Taking an optimal control perspective, this paper proposes a framework for analyzing and designing noise schedules in diffusion models. Recasting schedule design as an optimal control problem yields sufficient conditions under which state-of-the-art sampling error is achievable, and tuning the parameters of the resulting schedules gives new schedules with better FID scores on image generation.

English abstract

We develop a principled framework for analyzing and designing noise schedules in diffusion models. We show that one can recast this design problem as an optimal control problem whose state is the Fisher information of the diffusion process, which evolves according to an ODE, and whose control input is the noise schedule. The objective of the optimal control problem is a functional involving the Fisher information, which is shown to be an upper bound on the Kullback-Leibler sampling error. By solving this optimal control problem, we obtain sufficient conditions on noise schedules under which state-of-the-art $\tilde{\mathcal{O}}(d/n)$ sampling error is achievable, where $d$ is the data dimension and $n$ is the number of discretization steps. While existing theoretical work also proves that $\tilde{\mathcal{O}}(d/n)$ sampling error bounds are achievable, these results hold for specific noise schedules, which do not include the schedules used in practice. Under a further parametric assumption on the data distribution, we show that one can obtain closed-form expressions for the noise schedules. These noise schedules generalize standard empirical schedules such as exponential and sigmoid schedules by allowing additional parameters that can be tuned. Systematically tuning the parameters of these schedules yields new schedules that achieve superior FID scores on image generation benchmarks.

2605.21907 2026-05-22 cs.CV

Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

Gang Dai, Yining Huang, Yiming Xia, Guohao Chen, Shuaicheng Niu

Affiliations: Guangdong University of Technology; South China University of Technology; Nanyang Technological University

AI summary: This paper proposes RTS, which improves the generation performance of diffusion models through a reward-guided noise optimization strategy and a sparse test-time scaling framework; experiments report gains over baselines on both GenEval and ImageReward.

English abstract

The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from inflexible noise exploration across the denoising trajectory. To bridge this gap, we propose RTS, a novel Reward-guided Trajectory Scaling method to fully unlock the generative potential of diffusion models. Unlike existing methods, RTS facilitates the synthesis of refined, high-fidelity images via two core innovations: 1) a reward-guided noise optimization strategy to actively direct the search towards promising regions; and 2) a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space, effectively compressing the search space. Experiments show that our approach outperforms baselines by 15.6% on GenEval score and by 60.4% on ImageReward score, setting a new SOTA while providing a practical guideline for more effective test-time scaling across diffusion-specific architectures.

2605.21901 2026-05-22 cs.RO

Higher Order Reasoning for Collaborative Communicationless Mobile Robot Operations

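Jonathan Reasoner, Nicola Bezzo

Affiliations: Department of Electrical and Computer Engineering, University of Virginia

AI summary: This paper proposes a dynamic epistemic planning framework based on higher-order reasoning that enables implicit coordination and long-horizon planning among robots without communication; simulations and physical experiments show reduced task completion time in communication-restricted domains.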

English abstract

In communicationless environments, multi-robot systems must operate without the constant information exchange that many coordination strategies typically assume. This paper presents a novel dynamic epistemic planning framework that enables implicit coordination and long-horizon planning through higher-order reasoning among robots. With our approach, robots form and propagate higher-order belief particles, update world beliefs using Bayesian inference, and select actions via a behavior tree that anticipates teammates' likely decisions. A temporally aware Model Predictive Path Integral (MPPI) controller integrates this reasoning into low-level execution, allowing robots to plan intercepts and adapt trajectories under partial observability. The proposed framework is evaluated in both simulations and physical experiments, where it consistently reduces task completion time compared to a first-order baseline, demonstrating that epistemic logic can serve as a robust foundation for resilient coordination in communication-restricted domains.
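The Bayesian belief update mentioned in the abstract can be shown in its simplest discrete form. The sketch below updates a belief over a teammate's possible goals after one observation; it does not model higher-order belief particles, the behavior tree, or MPPI, and the hypothesis labels and likelihood values are made up for illustration.

```python
from typing import Dict


def bayes_update(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior over hypotheses: proportional to prior times observation likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical example: is the teammate heading to goal A or goal B?
prior = {"goal_A": 0.5, "goal_B": 0.5}
# Assumed likelihood of the observed heading under each hypothesis.
likelihood = {"goal_A": 0.9, "goal_B": 0.1}
posterior = bayes_update(prior, likelihood)
```

Repeating this update as new observations arrive is what lets a robot track a teammate's likely intent without any message exchange; a higher-order version would additionally maintain beliefs about what the teammate believes.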

2605.21882 2026-05-22 cs.CV

Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

Rusiru Thushara, Yasiru Ranasinghe, Jay Paranjape, Vishal M. Patel

Affiliations: Department of Electrical & Computer Engineering, Johns Hopkins University

AI summary: This paper proposes Thermo-VL, a vision-language model with thermal infrared perception. A trainable thermal encoder and a text-guided dual-attention fusion module improve multispectral fusion under low illumination, with strong gains on thermal-only and RGB+thermal reasoning tasks.

Comments: 18 pages, 11 figures

English abstract

Vision-language models (VLMs) often fail under low illumination because their visual grounding is learned predominantly from RGB imagery, whereas thermal infrared preserves complementary scene structure when visible cues degrade. We present Thermo-VL, a wavelength-aware VLM that augments a frozen Molmo-7B backbone with a trainable thermal encoder and a text-guided dual-attention fusion module. Given aligned RGB tokens, thermal tokens, and prompt embeddings, the fusion module conditions thermal features on both language and RGB context, then injects a gated residual into the frozen RGB stream so thermal evidence can be incorporated without disrupting Molmo's pretrained RGB-language interface. We train the model with the standard language-modeling objective together with auxiliary alignment and regularization losses that improve cross-modal grounding and reduce over-reliance on RGB. We also introduce a pixel-aligned RGB-thermal instruction-tuning dataset and Thermo-VL-Bench, a manually screened RGB-thermal VQA benchmark for low-light and cross-spectrum reasoning. Experiments show strong gains on challenging thermal-only and RGB+thermal reasoning tasks, highlighting the value of prompt-conditioned multispectral fusion. Our dataset and code are publicly available at: https://thusharakart.github.io/Thermo-VL

2605.21869 2026-05-22 cs.CV cs.AI cs.HC

Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction

Dinithi Dissanayake, Shaveen Silva, Ovindu Atukorala, Prasanth Sasikumar, Suranga Nanayakkara

Affiliations: Augmented Human Lab, National University of Singapore, Singapore; University of Moratuwa, Sri Lanka

AI summary: This paper proposes a two-stage multimodal framework that predicts six continuous emotion intensity dimensions from in-the-wild video clips by combining textual, acoustic, and visual representations with an optional motion branch, providing a practical and reproducible baseline.

Comments: 10th Affective & Behavior Analysis in-the-wild, CVPR Workshop 2026

English abstract

We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions (Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy) from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text-audio-vision-motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only very slight gains, its behavior is worth studying. Our team placed third in the EMI challenge, achieving an average Pearson correlation of 0.57 on the test set. Overall, we provide a practical and reproducible baseline for EMI prediction.
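The reported metric, an average Pearson correlation across emotion dimensions, can be computed as in the sketch below. It assumes the average is an unweighted mean of per-dimension correlations, which is the natural reading of the abstract but is not stated there; the function names and data layout are hypothetical.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mean_pearson(preds: Dict[str, List[float]], targets: Dict[str, List[float]]) -> float:
    """Unweighted mean of per-dimension Pearson correlations."""
    return sum(pearson(preds[k], targets[k]) for k in targets) / len(targets)
```

Because Pearson correlation is invariant to the scale and offset of the predictions, a model can score well on this metric even when its absolute intensity values are miscalibrated.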

2605.21863 2026-05-22 cs.RO

OCELOT: Odometry and Contact Estimation for Legged Robots

OCELOT:用于腿部机器人的步态和接触估计

Emre Girgin, Cagri Kilic

发表机构 * Department of Aerospace Engineering, Embry-Riddle Aeronautical University(航空航天工程系,埃默里-瑞德航空航天大学)

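AI summary: This paper presents a complete leg odometry pipeline based on an Error-State Extended Kalman Filter (ESEKF) that achieves accurate odometry using only proprioceptive data (a body-fixed IMU, joint encoders, and force sensors). Its core contribution is a fused contact detection and uncertainty quantification module that explicitly identifies and rejects slippage.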

Comments 8 pages

详情
AI中文摘要

腿部机器人中的一项重大挑战是仅使用机载本体感觉传感器实现准确的里程计。在本研究中,我们提出了一种基于误差状态扩展卡尔曼滤波器(ESEKF)的完整腿部里程计管道,该管道仅依赖于本体感觉数据:固定IMU、关节编码器和力传感器,其中滤波器的状态通过确定处于静止支撑的脚来校正。我们的核心贡献是融合接触检测和一个不确定性量化模块,该模块设计用于显式识别并拒绝滑动。该模块为每只脚运行两个检测器:1)一个基于力的去抖 Gaussian Mixture Model(GMM)引导的有限状态机(FSM)以确认物理接触,2)一个基于运动学的广义似然比检验(GLRT)在估计的脚速度上。两个估计器的连续质量分数被融合,以检测脚是否同时物理加载和运动学静止,并作为每种接触的不确定性信号。为了验证我们的方法,我们收集了一个多模态数据集,包含29个序列,覆盖多样的室内外地形(例如混凝土、草地、鹅卵石和岩石),总长度为2.4公里。我们对比了本体感觉和外源感觉方法。结果表明,我们的方法在提供准确的里程计估计和在易滑动环境中具有鲁棒性。我们还分享了我们的代码和实时ROS2包作为开源。

英文摘要

One of the significant challenges in legged robotics is achieving accurate odometry using only onboard proprioceptive sensors. In this study, we present a complete leg odometry pipeline based on an Error-State EKF (ESEKF) that relies exclusively on proprioceptive data: a body-fixed IMU, joint encoders, and force sensors, where the filter's state is corrected by feet determined to be in a stationary stance. The core of our contribution is a fused contact detection and uncertainty quantification module designed to explicitly identify and reject slippage. This module runs two detectors in parallel for each foot: 1) a debounced, force-based Gaussian Mixture Model (GMM) guided Finite State Machine (FSM) to confirm physical contact, and 2) a kinematic-based Generalized Likelihood Ratio Test (GLRT) on the estimated velocity of the foot. The continuous quality scores from both estimators are fused to detect whether the foot is both physically loaded and kinematically stationary, and serve as an uncertainty signal for each contact. To validate our approach, we collected a multi-modal dataset of 29 sequences spanning diverse indoor and outdoor terrains (e.g., concrete, grass, pebble, and rock), totaling 2.4 km. We benchmarked our approach against both proprioceptive and exteroceptive methods. The results demonstrate our method's efficacy in providing accurate odometry estimates and robustly handling slippage-prone environments. We also share our code and real-time ROS2 package as open source.
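The idea of fusing a force-based score with a kinematic score, and reusing the result as a per-contact uncertainty signal, might look roughly like the sketch below. The abstract does not give the fusion rule, so everything here is an assumption for illustration: the product rule, the `min_quality` threshold, the `base_var` value, and the inverse scaling of the measurement variance.

```python
from dataclasses import dataclass


@dataclass
class ContactDecision:
    quality: float     # fused score in [0, 1]
    use_update: bool   # whether this foot may correct the filter state
    meas_var: float    # variance assigned to the stationary-foot measurement


def fuse_contact(force_score, kin_score, base_var=1e-3, min_quality=0.5):
    """Fuse a force-based and a kinematic quality score (both in [0, 1]).

    Assumed rule: the foot counts as a usable contact only when it is both
    loaded AND stationary, so the two scores are multiplied. A lower fused
    quality inflates the measurement variance; a rejected contact gets
    infinite variance, i.e. no correction.
    """
    for s in (force_score, kin_score):
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    if min_quality <= 0.0:
        raise ValueError("min_quality must be positive")
    quality = force_score * kin_score
    use_update = quality >= min_quality
    meas_var = base_var / quality if use_update else float("inf")
    return ContactDecision(quality, use_update, meas_var)
```

With these assumed defaults, a loaded but sliding foot (`force_score=0.95`, `kin_score=0.1`) is rejected, which is the slippage case the abstract targets.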

2605.21862 2026-05-22 cs.RO cs.AI

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

EvoScene-VLA: 在动作解码器中进化场景信念用于分块机器人控制

Chushan Zhang, Ruihan Lu, Jinguang Tong, Xuesong Li, Yikai Wang, Hongdong Li

发表机构 * Australian National University(澳大利亚国立大学) The University of Queensland(昆士兰大学) Beijing Normal University(北京师范大学)

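AI summary: This paper proposes EvoScene-VLA, which maintains an action-updated scene state inside the action decoder to improve multi-step control prediction in chunked robot control, making scene beliefs more persistent and accurate.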

详情
AI中文摘要

Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.

英文摘要

Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.

2605.21861 2026-05-22 cs.CV cs.AI

Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

在多模态医学视觉基础模型中学习涌现的模块化表示

Yuting He, Chenyu You, Shuo Li

发表机构 * Case Western Reserve University(凯斯西储大学) Stony Brook University(石溪大学)

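AI summary: This paper proposes the Director-Experts (DEX) framework, which regulates modular dynamics to learn stable modular representations in multi-modality medical vision foundation models, and demonstrates its advantages on 26 downstream tasks using a new medical vision benchmark dataset.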

Comments Accepted by KDD 2026

详情
AI中文摘要

多模态医学视觉(MV)基础模型(FM)在异质成像模态间面临显著的非独立同分布(Non-IID)特征统计挑战。对这类数据进行单一监督优化会引发冲突梯度,导致表示向模态主导的捷径坍缩。本文将这一失败重新解释为涌现模块化中专门化与协调之间的失衡,并提出Director-Experts(DEX)模块化网络,该网络在堆叠模块中显式调控这些动态。每个DEX模块包含一组专家,通过我们的图像级激活策略动态适应,自主专注于模态主导的统计特征,同时结合通过我们组指数移动平均更新的Director,将多专家知识蒸馏到共享空间,实现跨模态的语义整合,从而驱动模块化表示的涌现。我们构建了一个新的基准数据集Medical Vision Universe,包含超过400万张图像,覆盖10种模态,为DEX提供了最广泛的模态覆盖的FM级预训练。在26个下游任务上的广泛评估表明,DEX在优化行为和迁移性方面有所改进,表明DEX是通用多模态医学AI的有原则的一步。我们的代码和数据集将在https://github.com/YutingHe-list/DEX上公开。

英文摘要

Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy and autonomously specializing in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, with over 4 million images across 10 modalities, which provides DEX with FM-level pre-training that has the broadest coverage of distinct imaging modalities. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating that DEX is a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be released at https://github.com/YutingHe-list/DEX.

2605.21858 2026-05-22 cs.CL

Hypergraph as Language

超图作为语言

Mengqi Lei, Guohuan Xie, Shihui Ying, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao

发表机构 * Tsinghua University(清华大学) Yangtze Delta Region Institute(长江三角洲研究院) Shanghai Institute of Applied Mathematics and Mechanics(上海应用数学和力学研究所) State Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室) National Engineering Research Center for Visual Information and Applications(视觉信息与应用国家工程研究中心) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究所)

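AI summary: This paper proposes Hyper-Align, a hypergraph-based alignment framework for language models that converts hypergraph structures into hypergraph tokens a large language model can understand, handling high-order associations more effectively and achieving significant gains on structural modeling tasks.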

详情
AI中文摘要

大型语言模型(LLMs)最近在建模关系结构方面展现出了强大的潜力。然而,现有方法仍然从根本上以图为中心:它们专注于将成对图结构转换为LLMs可以理解的令牌。相比之下,许多现实世界的关系模式并不自然地符合成对边的假设,而更适合用超图中的高阶关联来建模。对于超图结构,现有方法往往无法保留多个对象由同一高阶关系共同连接的本源语义,限制了其对复杂结构的利用能力。为了解决这一限制,我们提出了

英文摘要

Large language models (LLMs) have recently shown strong potential in modeling relational structures. However, existing approaches remain fundamentally graph-centric: they focus on processing pairwise graph structures into tokens that LLMs can understand. In contrast, many real-world relational patterns do not naturally conform to the pairwise-edge assumption, and are better modeled as high-order associations in hypergraphs. For hypergraph structures, existing methods often fail to preserve the native semantics that multiple objects are jointly connected by the same high-order relation, limiting their ability to exploit complex structures. To address this limitation, we put forth the "Hypergraph as Language" perspective and propose Hyper-Align, a hypergraph-native alignment framework for large language models. Hyper-Align compiles the query-object-centered hypergraph context into hypergraph tokens directly consumable by a base LLM. Specifically, we introduce Hypergraph Incidence Detail Template with Overview (HIDT-O), which serializes high-order association structures into a fixed-shape hybrid template combining local incidence details and overview-level summaries. We then design a Hypergraph Incidence Projector (HIP), which maps native high-order incidence structures into the LLM token space through explicit semantic-structural decoupling and bidirectional message passing between vertices and hyperedges. We further define a concrete Hypergraph-as-Language input protocol, which jointly feeds hypergraph tokens and textual prompts into a frozen base LLM, supporting both vertex-level and hyperedge-level tasks under a unified question-answering paradigm. To systematically evaluate different methods in hypergraph structural modeling, we introduce HyperAlign-Bench. Extensive experiments show that Hyper-Align significantly outperforms existing methods across in-domain and zero-shot evaluations.

2605.21856 2026-05-22 cs.LG cs.AI

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

推理的幻觉:通过零CoT截断揭示LLM中的逃避数据污染

Yifan Lan, Yuanpu Cao, Hanyu Wang, Lu Lin, Jinghui Chen

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

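AI summary: This paper proposes the Zero-CoT Probe (ZCP), which truncates the entire reasoning process to expose latent shortcut mappings in a model and thereby detect both direct and evasive data contamination in LLMs, and introduces the Contamination Confidence metric to quantify the likelihood and severity of contamination.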

详情
AI中文摘要

大型语言模型(LLMs)在广泛的任务上展示了令人印象深刻的推理能力,但数据污染破坏了这些能力的客观评估。这个问题进一步加剧了恶意模型发布者使用逃避或间接污染策略,例如改写基准数据以逃避现有检测方法并人为提升排行榜表现。当前的方法难以可靠地检测这种隐蔽的污染。在本工作中,我们揭示了一个关键现象:模型生成的推理步骤主动掩盖其底层的记忆。受此启发,我们提出了零CoT探针(ZCP),一种新颖的黑盒检测方法,故意截断整个链式思维(CoT)过程以暴露潜在的捷径映射。为进一步将记忆与模型的内在问题解决能力区分开来,ZCP将模型在原始基准上的零CoT表现与等价扰动的参考数据集进行比较。此外,我们引入了污染置信度(Contamination Confidence),一个量化污染可能性和严重性的指标,超越了简单的二元分类。对已识别的污染模型和特别微调的污染模型的广泛实验表明,ZCP能够稳健地检测直接和逃避的数据污染。ZCP的代码可在https://github.com/Yifan-Lan/zero-cot-probe获取。

英文摘要

Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model's intrinsic problem-solving capabilities, ZCP compares the model's zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at https://github.com/Yifan-Lan/zero-cot-probe.
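The comparison at the heart of ZCP, zero-CoT accuracy on the original benchmark versus an isomorphically perturbed reference set, can be illustrated with a simple accuracy gap. This sketch is not the paper's Contamination Confidence metric, which the abstract does not define. The function names and the use of exact-match accuracy are assumptions.

```python
def accuracy(answers, gold):
    """Exact-match accuracy of model answers against gold labels."""
    if len(answers) != len(gold) or not gold:
        raise ValueError("answers and gold must be non-empty and equal length")
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


def zero_cot_gap(orig_answers, orig_gold, pert_answers, pert_gold):
    """Accuracy on the original benchmark minus accuracy on the perturbed one.

    Both answer lists are assumed to come from prompts where the reasoning
    trace was truncated, so the model must answer directly. A large positive
    gap hints that the model recalls the original items rather than solving
    them; a gap near zero is what genuine problem solving would predict.
    """
    return accuracy(orig_answers, orig_gold) - accuracy(pert_answers, pert_gold)
```

Using a perturbed reference set of equal difficulty is what separates memorization from ability: a strong model scores high on both sets, so raw zero-CoT accuracy alone would not be informative.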

2605.21852 2026-05-22 cs.CV

Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding

癫痫半相学套件(S3):一个临床多模态数据集、基准和模型用于癫痫半相学理解

Lina Zhang, Tonmoy Monsoor, Peizheng Li, Jiarui Cui, Xinyi Peng, Chong Han, Prateik Sinha, Siyuan Dai, Jessica Nichole Pasqua, Colin M McCrimmon, Weiting Liu, Hailey Marie Miranda, Bing Hu, Xiangting Wu, Tengyou Xu, Chunhan Li, Jiaye Tian, Jiarui Tang, Detao Ma, Lingye Kong, Junnan Lyu, Jungang Li, Yan Zan, Junhua Huang, Rajarshi Mazumder, Vwani Roychowdhury

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) University of Pittsburgh(匹兹堡大学) Fudan University(复旦大学) University of California, Riverside(加州大学河滨分校) Hong Kong University of Science(香港科学大学) Maharishi International University(玛希拉国际大学)

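AI summary: This paper presents the S3 dataset and benchmark for fine-grained, structured seizure semiology understanding. By evaluating multimodal large language models on low-level visual perception, temporal sequencing, narrative report generation, and seizure diagnosis, it reveals systematic weaknesses of existing models in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting, and shows that seizure-specific fine-tuning and a two-stage neuro-symbolic framework reach a high F1 score on epileptic versus non-epileptic seizure classification.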

Comments Accepted to ICML 2026 as a Spotlight presentation

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)在一般视频理解方面表现出色,但其解释非自主、时空演变的病理运动行为如癫痫半相学的能力仍鲜有研究。为此,我们引入癫痫半相学套件(S3),一个临床导向的数据集和基准,用于细粒度、结构化的癫痫半相学理解。数据集包含438个癫痫视频,标注超过35,000个密集标签,涵盖20个ILAE定义的半相学特征。基于此数据集,我们提出了一个七任务分层基准,系统评估MLLMs从低级视觉感知到时间序列处理、叙述报告生成和癫痫诊断的能力。为进一步评估生成报告的临床意义,我们引入了癫痫半相学报告质量指数(Seizure-RQI)。在11个开放权重MLLMs上的广泛基线揭示了在左右脑推理、时间定位、症状序列和临床忠实报告方面的系统性弱点。我们展示,针对癫痫的微调显著提高了各任务的性能,而双阶段神经符号框架在癫痫与非癫痫癫痫分类中的F1分数达到0.96。S3数据集为评估多模态模型在安全关键医疗视频理解中的严谨基准,并指导开发临床可靠、领域适应的多模态智能。

英文摘要

While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in general video understanding, their capacity to interpret involuntary, spatio-temporally evolving pathologic motor behaviors such as seizure semiology remains largely untested. To address this gap, we introduce Seizure-Semiology-Suite, a clinically grounded dataset and benchmark for fine-grained, structured seizure semiology understanding. The dataset includes 438 seizure videos annotated with over 35,000 dense labels covering 20 ILAE-defined semiological features. Building on this dataset, we propose a seven-task hierarchical benchmark that systematically evaluates MLLMs from low-level visual perception to temporal sequencing, narrative report generation, and seizure diagnosis. To enable clinically meaningful evaluation of generated reports, we further introduce the Report Quality Index for Seizure Semiology (Seizure-RQI). Extensive baselines across 11 open-weight MLLMs reveal systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. We show that seizure-specific fine-tuning substantially improves performance across tasks, and that a two-stage neuro-symbolic framework achieves an F1 score of 0.96 on epileptic versus non-epileptic seizure classification. Seizure-Semiology-Suite establishes a rigorous benchmark for evaluating multimodal models in safety-critical medical video understanding and guides the development of clinically reliable, domain-adaptive multimodal intelligence.

2605.21849 2026-05-22 cs.LG cs.CL

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

基于几何适应的解释器:在分布偏移下字典基础可解释性的忠实性

Sungjun Lim, Heedong Kim, Andrew Lee, Kyungwoo Song

发表机构 * Yonsei University(延世大学) Harvard University(哈佛大学)

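AI summary: This paper proposes the Geometry-Adaptive Explainer (GAE) to improve dictionary-based interpretability under distribution shift. By realigning the explainer's dictionary with the shifted active subspace while preserving the original feature structure, GAE reduces the faithfulness gap under distribution shift without supervision.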

详情
AI中文摘要

机制可解释性旨在通过识别因果负责的内部结构来解释模型的行为。基于字典的解释器如稀疏自编码器和转码器是主要工具,但其在分布外(OOD)偏移下的忠实性却很少受到系统性关注。我们证明分布偏移会旋转模型所使用的子空间,导致解释器的字典在训练分布(ID)激活上训练时出现对齐偏差。我们将这种偏差正式化为忠实性差距,即ID字典与OOD活跃子空间之间的几何距离,并证明其控制OOD忠实性退化。为了减少这种差距,我们提出了几何适应解释器(GAE),它在保持原始特征结构的同时,重新对齐解释器的字典与OOD活跃子空间。这只需要未标记的OOD激活,并且不需要梯度更新。我们证明GAE在无适应ID解释器上有所改进,其额外损失被二次限制于二阶矩偏移。经验上,GAE在多个模型和OOD设置中甚至匹配或超过了所有基于训练的基线在因果忠实性上的表现。

英文摘要

Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.

2605.21845 2026-05-22 cs.CL cs.AI

Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

对比LLM和微调模型在不同提示复杂度下的NVDRS情境提取性能

Geoffrey Martin, Xuan Zhong Feng, Yifan Peng

发表机构 * Department of Population Health Sciences, Weill Cornell Medicine(人群健康科学系,威尔·康奈尔医学学院) Systems Engineering, Cornell University(系统工程,康奈尔大学)

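AI summary: This paper studies how LLMs and fine-tuned models differ on NVDRS circumstance extraction under varying prompt complexity, proposes a complexity-scoring algorithm, and presents a hybrid approach that selects a prompt strategy per circumstance. It finds that LLMs perform better on low-prevalence circumstances and that the framework generalizes across frontier LLMs.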

Comments Accepted at IEEE ICHI 2026

详情
AI中文摘要

自杀是美国的主要死亡原因之一,理解其前因需要从死亡调查叙述中提取结构化信息。许多前因需要语义推理而非简单的关键词匹配。我们开发了一种“复杂度评分”算法,分析编码手册结构以预测何时详细提示(包含完整编码指南)比仅名称提示更优。随后,我们构建了一种混合方法,根据情境选择提示策略。我们评估了大型语言模型(LLMs)与微调的RoBERTa在25个从国家暴力死亡报告系统(NVDRS)中提取的推断复杂情境上的表现。我们发现,在训练数据不足的低 prevalence 情境中,LLMs表现显著优于微调模型。我们进一步展示了我们的框架能够跨前沿LLM通用,GPT-5.2、Gemini 2.5 Pro和Llama-3 70B显示出一致的表现模式。这些发现支持了一种混合架构,其中LLMs处理罕见的推断复杂情境,而微调模型处理常见情境。

英文摘要

Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a "Complexity Score" algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We found that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.

2605.21842 2026-05-22 cs.LG cs.CL eess.SP

Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention

能量门控注意力:频谱显著性作为Transformer注意力的归纳偏置

Athanasios Zeris

发表机构 * Independent Researcher, Athens, Greece(雅典,希腊独立研究者)

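AI summary: This paper proposes Energy-Gated Attention (EGA), which uses spectral salience as an inductive bias for transformer attention. Gating on the spectral energy of key embeddings raises the attention paid to informationally dense positions, and experiments report improvements on TinyShakespeare and Penn Treebank.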

Comments 12 pages, 4 figures

详情
AI中文摘要

标准的Transformer注意力计算查询和键之间的成对相似性,将所有标记视为具有同等显著性,无论其内在信息含量如何。在湍流流体力学中,相干结构——在背景混沌中持续存在的能量主导、空间组织化的模式——承载了总能量的不成比例份额,并控制所有传输。我们提出,标记在Transformer注意力中扮演类似的角色:信息密集的位置(形态边界、语法头、话语标记)集中了频谱能量,并应比背景标记(功能词、重复模式、低信息填充词)获得更多的注意力。我们提出能量门控注意力(EGA):一种简单的修改,通过键标记嵌入的频谱能量来门控值聚合,该计算通过一个单个学习的线性投影完成,以发现嵌入场的主导频谱模式。在TinyShakespeare上,EGA仅使用12,480个额外参数(<0.26%的开销)和没有可测量的计算成本,就实现了+0.103的验证损失改进。结果在Penn Treebank上也一致(+0.101),证明了数据集的独立性。在三种小波家族(固定Morlet、Daubechies db2/db4和参数化Morlet)的系统消融研究中,发现固定结构基底是次优的——最优的能量方向是数据自适应的且非正弦的——同时识别出学习的小波包作为有前途的开放方向。学习的能量阈值收敛到tau ~ 0.35,无论初始化如何,对应于英语文本中携带高于平均频谱能量的约36%的标记比例,这是一个稳定的语言属性,与英语文本中内容词的比例一致。

英文摘要

Standard transformer attention computes pairwise similarity between queries and keys, treating all tokens as equally salient regardless of their intrinsic informational content. In turbulent fluid dynamics, coherent structures -- the energetically dominant, spatially organized patterns that persist amid background chaos -- carry a disproportionate fraction of total energy and govern all transport. We propose that tokens play an analogous role in transformer attention: informationally dense positions (morphological boundaries, syntactic heads, discourse markers) concentrate spectral energy and should attract proportionally more attention than background tokens (function words, repeated patterns, low-information filler). We propose Energy-Gated Attention (EGA): a simple modification that gates value aggregation by the spectral energy of key token embeddings, computed by a single learned linear projection that discovers the dominant spectral mode of the embedding field. On TinyShakespeare, EGA achieves +0.103 validation loss improvement with only 12,480 additional parameters (<0.26% overhead) and no measurable computational cost. The result is consistent on Penn Treebank (+0.101), demonstrating dataset independence. A systematic ablation across three wavelet families (fixed Morlet, Daubechies db2/db4, and a parametric Morlet) establishes that fixed structured bases are suboptimal -- the optimal energy direction is data-adaptive and non-sinusoidal -- while identifying learned wavelet packets as a promising open direction. The learned energy threshold converges to tau ~= 0.35 independently of initialization, corresponding to the fraction (~36%) of tokens carrying above-average spectral energy in English text, a stable linguistic property consistent with the fraction of content words in running English text.
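A minimal single-head sketch of gating value aggregation by key energy is given below, to make the mechanism concrete. The abstract only states that the energy comes from a single learned linear projection of the key token embeddings and that a threshold tau is learned. The squared projection, the sigmoid gate, and the `sharpness` constant are assumptions made for illustration, not the paper's formulation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def energy_gates(keys, w, tau, sharpness=10.0):
    """One gate per key: high when the assumed energy (w . k)^2 exceeds tau."""
    return [sigmoid(sharpness * (dot(w, k) ** 2 - tau)) for k in keys]


def gated_attention(queries, keys, values, w, tau):
    """Scaled dot-product attention whose value aggregation is energy-gated."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    gates = energy_gates(keys, w, tau)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out = [0.0] * len(values[0])
        for a, g, v in zip(weights, gates, values):
            for d in range(len(out)):
                out[d] += a * g * v[d]  # the gate scales each key's contribution
        outputs.append(out)
    return outputs
```

In this sketch the only extra parameters are the projection vector `w` and the scalar `tau`, consistent in spirit with the small parameter overhead the abstract reports. Setting `tau` very low opens every gate and recovers standard attention.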

2605.21834 2026-05-22 cs.LG

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

基于策略的一致性训练通过最小能力退化提升大语言模型安全性

Andy Han, Kristina Fujimoto, Avidan Shah, Kiet Nguyen, Kai Xu, Chen Yueh-Han, Ilia Sucholutsky, Rico Angell

发表机构 * New York University(纽约大学)

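AI summary: This paper proposes On-Policy Consistency Training (OPCT), which improves LLM safety by supervising the model's own responses with contrastive prompts. Experiments show that OPCT matches or outperforms conventional supervised fine-tuning (SFT) at suppressing sycophancy, resisting jailbreaks, and strengthening safety awareness, while largely avoiding the capability regressions that SFT causes.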

详情
AI中文摘要

对齐的模型可能以多种方式表现不当:它们常常谄媚,容易被越狱攻击,或未能包含适当的安全警告。一致性训练是一种有前途的新对齐范式,通过使用对比输入对训练模型的不变性来缓解此类失败。现有的一致性训练过程在离线生成监督信号,并使用监督微调(SFT)来更新模型。不幸的是,由此产生的模型往往只是记忆训练分布的表面形式,因此泛化能力差且能力退化。我们引入基于策略的一致性训练(OPCT),一种新的一致性训练方法,其目标是在模型自身对提示的响应上计算,由自身对相应对比提示的条件监督。我们评估了OPCT在三个安全轴上的表现:顺从性、越狱和安全意识。在三个模型家族中,OPCT在所有安全目标上均优于其SFT对应物。与基线相比,OPCT将顺从率几乎减半(8.1% vs. 15.4%,相比之下SFT为11.2%)。在适应性目标攻击者下,OPCT在保持的越狱行为上保持越狱防御成功率接近99%,而SFT平均达到87%。在安全意识方面,OPCT在两个模型中优于SFT,其余模型中与SFT相当。OPCT还大大避免了SFT引发的能力退化,如在MATH-500上下降28分。我们的结果表明,一致性训练最好以OPCT而不是SFT的方式实施,尤其是在希望超越训练分布泛化时。

英文摘要

Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such failures by training invariants into the model using contrastive input pairs. Existing consistency training procedures generate the supervision signal once, offline, and use supervised fine-tuning (SFT) to update the model. Unfortunately, the resulting models tend to merely memorize the surface forms of the training distribution and thus generalize poorly and regress in their capabilities. We introduce On-Policy Consistency Training (OPCT), a new consistency training approach where the objective is computed over the model's own responses to prompts, supervised by itself conditioned on corresponding contrastive prompts. We evaluate OPCT on three safety axes: sycophancy, jailbreaking, and safety awareness. Across three model families, OPCT outperforms its SFT counterpart on all safety desiderata. It nearly halves the sycophancy rate relative to baseline (8.1% vs. 15.4%, compared to 11.2% for SFT). Under an adaptive per-target attacker, OPCT holds jailbreak defense success near 99% on held-out jailbreak behaviors, whereas SFT achieves 87% on average. On safety awareness, OPCT outperforms SFT in two out of three models, and matches it on the other. OPCT also largely avoids the capability regressions that SFT induces, such as a 28-point drop on MATH-500. Our results suggest that consistency training is best implemented as OPCT rather than as SFT, especially when generalization beyond the training distribution is desired.
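The "nearly halves" claim can be checked from the sycophancy rates quoted in the abstract. The helper name below is hypothetical and the three rates are the only inputs taken from the source.

```python
def relative_reduction(baseline, treated):
    """Fraction by which `treated` lowers a rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline - treated) / baseline


# Sycophancy rates (percent) quoted in the abstract.
BASELINE, SFT, OPCT = 15.4, 11.2, 8.1

opct_cut = relative_reduction(BASELINE, OPCT)  # about 0.474
sft_cut = relative_reduction(BASELINE, SFT)    # about 0.273
```

OPCT removes about 47% of baseline sycophancy against about 27% for SFT.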

2605.21827 2026-05-22 cs.CL cs.AI

Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

‘稍微’意味着‘ somewhat’吗?在LLM数值行为中测量模糊强度词

Daniel Tabach

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

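AI summary: This paper studies whether a language model preserves the ordinal meaning of intensity words when it must produce numeric actions, and finds that the model compresses vague intensity words in its numeric output, with an interpretation that is state-dependent and discontinuous near operational boundaries.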

Comments 9 figures, 2 tables, 16 references

详情
AI中文摘要

语言模型在必须生成数值行为时是否保留强度词的顺序意义?我研究了一个由研究者构建的10个英语程度修饰词尺度,从稍微到剧烈,依据Quirk等人程度修饰词分类法,在受控资源分配环境中进行测试,其中Claude Haiku接收自然语言指令,生成数值分配,并由确定性后端转换为可测量结果。在测试中,唯一变化的变量是强度词或起始系统状态,从而隔离了它们对模型数值输出的影响。在6,620次运行中(T=0.0和T=0.7),出现了三种模式。首先,模型将10个强度词压缩为5个不同的中位数输出:四个低层级词都映射到相同值,而更强的词则进入更高层级(Spearman rho=0.845,p<0.001)。其次,当当前系统状态作为上下文提供时,Kruskal-Wallis检验显示按起始分配分组捕捉到的基于排名的方差远多于按词分组(epsilon-squared基线=0.782 vs. epsilon-squared词=0.079),并且当系统接近容量时,词汇区分度降为零。第三,接近可行性极限时,模型表现出三种行为模式:弱词通过小调整进行妥协,强词完全回避,而“剧烈”词则推至局部天花板。这些模式在温度变化下保持不变,随机采样扩展了分布但未恢复词之间的顺序差异。在该模型和领域中,模型对模糊强度词的数值解释是压缩的、依赖状态的,并且在操作边界附近出现不连续性。

英文摘要

Do language models preserve the ordinal meaning of intensity words when those words must produce numeric actions? I study a researcher-constructed scale of 10 English degree modifiers, from slightly to drastically, informed by the Quirk et al. degree-modifier taxonomy, in a controlled resource-allocation environment where Claude Haiku receives a natural-language instruction, produces a numeric allocation, and a deterministic backend converts that allocation into a measurable outcome. The only variable that changes between runs is the intensity word or the starting system state, isolating their effects on the model's numeric output. Across 6,620 runs at T=0.0 and T=0.7, three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs: four lower-tier words all map to the same value, while stronger words break into higher regimes (Spearman rho = 0.845, p < 0.001). Second, when the current system state is supplied as context, separate Kruskal-Wallis tests show that grouping by starting allocation captures far more rank-based variance than grouping by word (epsilon-squared baseline = 0.782 vs. epsilon-squared word = 0.079), and lexical differentiation collapses to zero as the system approaches capacity. Third, near feasibility limits the model exhibits three behavioral modes: weak words hedge with small adjustments, strong words abstain entirely, and the word drastically pushes to the local ceiling. These patterns persist across temperature, with stochastic sampling broadening distributions but not restoring ordinal distinctions between words. In this model and domain, the model's numeric interpretation of vague intensity words is compressed, state-dependent, and discontinuous near operational boundaries.
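The two statistics quoted above can be sketched in plain Python: a Spearman correlation with average ranks for ties, and the epsilon-squared effect size for a Kruskal-Wallis H statistic, which reduces to H / (n - 1). The example outputs in the usage note are made-up values chosen to mimic a 10-word scale collapsing onto 5 distinct outputs. They are not the paper's data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the (tie-averaged) ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def epsilon_squared(h_stat, n_obs):
    """Kruskal-Wallis effect size: H / ((n^2 - 1) / (n + 1)) = H / (n - 1)."""
    if n_obs < 2:
        raise ValueError("need at least two observations")
    return h_stat / (n_obs - 1)
```

For instance, `spearman_rho(list(range(1, 11)), [5, 5, 5, 5, 10, 10, 15, 20, 20, 30])` stays high even though four words share one output. A strong rank correlation can therefore coexist with the compression the abstract describes.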

2605.21825 2026-05-22 cs.AI cs.HC

Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks

迈向AI可视化合作者:一种通用且端到端的代理工具,用于解决复杂数据可视化任务

Haichao Miao, Zhimin Li, Kuangshi Ai, Kaiyuan Tang, Chaoli Wang, Peer-Timo Bremer, Shusen Liu

发表机构 * Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室) Vanderbilt University(范德比大学) University of Notre Dame(圣母大学)

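AI summary: This paper presents an end-to-end agentic harness that automatically generates custom visual analysis applications from the data and a high-level task description, a step toward the vision of a general AI co-scientist.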

详情
AI中文摘要

检查、解释和沟通复杂数据的能力对于任何科学探索都是至关重要的,但通常需要在核心领域之外的大量专业知识,从数据管理和分析到可视化设计和实现。我们提出了一种端到端的代理工具,仅基于数据和任务的高层描述,独立设计定制的可视化分析应用(VIS应用)。这代表了朝着许多设想的通用AI合作者的重要一步,即一个能够根据高层指令自主执行长周期任务的自主系统。我们提出的VIS合作者是这一更广泛AI合作者愿景的重要组成部分:一个能够利用一组代理和专门技能,自主分析数据并设计可视化解决方案的工具,这些技能协调探索性分析、计划、配置环境、实施、验证界面,并最重要的是评估整体任务完成情况。每个阶段都产生文档和指令制品,指导后续工作并实现迭代改进。我们通过IEEE SciVis比赛验证了这种方法,这些比赛覆盖多个科学和工程领域,是理想的测试场,因为它们编码了现实世界的复杂性:模糊的要求、多样的数据模态、设计权衡和任务驱动的验证。仅给定数据和目标任务,我们的系统能够自主生成具有验证链接视图行为的功能单页VIS应用,高度定制以满足领域专家指定的任务和需求。

英文摘要

The ability to inspect, interpret, and communicate complex data is crucial for virtually any scientific endeavor, but often requires significant expertise outside the core domain, ranging from data management and analysis to visualization design and implementation. We present an end-to-end agentic harness that, based on only the data and a high-level description of the tasks, independently designs custom visual analysis applications (VIS apps). This represents an important step towards a general AI co-scientist, envisioned by many as an autonomous system that can execute long-horizon tasks based on high-level directions. Our proposed VIS co-scientist is an essential component of this broader AI co-scientist vision: a harness that can autonomously analyze data and design visualization solutions using a collection of agents and specialized skills that coordinate exploratory analysis, planning, environment configuration, implementation, interface validation, and, most importantly, evaluation of overall task completion. Each stage produces document and instruction artifacts that guide downstream work and enable iterative refinement. We validate this approach on IEEE SciVis Contests spanning multiple science and engineering fields. These contests serve as ideal proving grounds because they encode real-world complexity: ambiguous requirements, diverse data modalities, design trade-offs, and task-driven validation. Given only the data and target tasks, our system autonomously produces functional single-page VIS apps with verified linked-view behavior, highly customized to domain experts' specified tasks and needs.

2605.21822 2026-05-22 cs.AI

Implicit Safety Alignment from Crowd Preferences

从大众偏好中隐式安全对齐

Qian Lin, Daniel S. Brown

发表机构 * Kahlert School of Computing, University of Utah(犹他大学计算学校)

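AI summary: This paper studies how to extract shared safety criteria from crowd preference data and transfer them to downstream reinforcement learning tasks to regularize agent behavior and enforce safety. It proposes a crowd-preference-based safe RL framework that composes safety-aligned skills via a high-level policy to solve downstream tasks safely.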

Comments Accepted to ICML 2026. Conference paper

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)可以揭示超出任务完成的隐式目标,如安全考虑。在本工作中,我们关注嵌入在大众偏好数据集中的常见安全标准,其中不同用户可能表达不同的偏好或目标,但遵循相似的安全原则。我们的目标是从大众偏好中发现共享的安全标准,并将其转移到下游RL任务中以规范智能体行为并确保安全。我们首先证明了直接奖励组合——优化偏好学习的奖励模型与下游任务奖励——具有内在限制。受此启发,我们提出了基于大众偏好的安全强化学习(Safe Crowd Preference-based RL),这是一种分层框架,从大众偏好中提取安全对齐的技能,并通过高级策略将它们组合起来,以安全地解决下游任务。在安全RL环境和一个初步的LLM风格任务中,实验表明,我们的方法在没有访问显式安全奖励的情况下显著降低了安全成本,同时在任务性能上与使用真实安全信号训练的oracle方法相当。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination (optimizing a preference-learned reward model together with downstream task rewards) has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.

2605.21820 2026-05-22 cs.LG cond-mat.mtrl-sci

Beyond Scalar Objectives: Expert-Feedback-Driven Autonomous Experimentation for Scientific Discovery at the Nanoscale


Ralph Bulanadi, Jefferey Baxter, Arpan Biswas, Hiroshi Funakubo, Dennis Meier, Jan Schultheiß, Rama Vasudevan, Yongtao Liu

Affiliations: Center for Nanophase Materials Sciences, Oak Ridge National Laboratory; University of Tennessee-Oak Ridge Innovation Institute, University of Tennessee; Department of Material Science and Engineering, School of Materials and Chemical Technology, Institute of Science Tokyo; Department of Materials Science and Engineering, Norwegian University of Science and Technology (NTNU); Faculty of Physics and Center for Nanointegration Duisburg-Essen (CENIDE), University of Duisburg-Essen; Research Center Future Energy Materials and Systems, Research Alliance Ruhr

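AI summary: This paper proposes deep-kernel pairwise learning (DKPL), which improves autonomous microscopy experiments by integrating expert knowledge and interdisciplinary scientific knowledge, enabling more effective discovery of scientific phenomena at the nanoscale.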

English abstract

Self-driving laboratories or autonomous experimentation are emerging as transformative platforms for accelerating scientific discovery. Bayesian optimization (BO) is among the most widely used machine learning frameworks for these purposes, but these BO-based frameworks rely on predefined scalar descriptors to guide experimentation. In many situations, the determination of an appropriate scalar descriptor can be challenging, and may fail to capture subtle yet scientifically important phenomena apparent to experts with interdisciplinary insight. To overcome this limitation, here we develop deep-kernel pairwise learning (DKPL), an approach for autonomous microscopy experiments which incorporates human expertise and interdisciplinary scientific knowledge into an active learning loop. Instead of relying on explicit scalar objectives, DKPL enables experts to directly evaluate which experimental output is more promising using interdisciplinary knowledge. DKPL then learns a latent utility function from these expert judgements to guide subsequent autonomous microscopy experiments. We demonstrate DKPL's performance in learning physically meaningful nanoscale structures while effectively prioritizing high-information measurement regions using an experimental model dataset with known ground truth. We further apply DKPL to analyze the character of ferroelectric domain walls, where we find DKPL capable of distinguishing between high and low characteristic domain-wall angles in bismuth ferrite, and able to discover both head-to-head and tail-to-tail domain-wall character in erbium manganite. This development establishes an approach to integrate expert knowledge into autonomous microscopy experiments and demonstrates a pathway toward expert-guided self-driving laboratories capable of addressing scientific problems beyond the limits of scalar-metrics-driven learning.
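
The idea of learning a latent utility from pairwise expert judgements can be sketched in a few lines of Python. This is not DKPL itself: the deep-kernel model is swapped for a plain linear Bradley-Terry model, and the feature vectors, the simulated "expert" rule, and the function names are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def utility(w, x):
    """Linear latent utility of a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_pairwise(pairs, dim, lr=0.5, epochs=200):
    """Fit utility weights from (preferred, other) pairs by stochastic
    gradient ascent on the Bradley-Terry log-likelihood."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            p = sigmoid(utility(w, diff))  # model's P(better is preferred)
            w = [wi + lr * (1.0 - p) * di for wi, di in zip(w, diff)]
    return w

# Hypothetical data: a simulated expert always prefers the larger first feature.
random.seed(0)
outputs = [[random.random(), random.random()] for _ in range(20)]
pairs = []
for _ in range(40):
    a, b = random.sample(outputs, 2)
    pairs.append((a, b) if a[0] > b[0] else (b, a))

w = fit_pairwise(pairs, dim=2)
next_candidate = max(outputs, key=lambda x: utility(w, x))  # most promising output
```

In an active learning loop, the fitted utility would rank unmeasured candidates and the expert would be queried again on new pairs; here `next_candidate` simply picks the highest-utility item among the existing outputs.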

2605.21811 2026-05-22 cs.RO

Safe and Steerable Geometric Motion Policies for Robotic Dexterous Manipulation


Albert Wu, Riccardo Bonalli, Thomas Lew, C. Karen Liu

Affiliations: Computer Science Department, Stanford University; Laboratory of Signals and Systems, University of Paris-Saclay, CNRS, CentraleSupélec; Toyota Research Institute

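AI summary: This work proposes SafePBDS, a geometrically consistent framework that computes optimal, certifiably safe configuration-manifold accelerations to continuously reconcile objectives and constraints in robotic dexterous manipulation. It validates the framework's efficient planning and safety guarantees for dexterous grasping and in-hand reorientation in simulation and on a Franka Panda-Allegro Hand platform.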

Comments 24 pages, 10 figures, 5 tables. Project page and demo video: https://tml.stanford.edu/safe-pbds

English abstract

Robotic dexterous manipulation requires continuously reconciling objectives and constraints defined on heterogeneous geometric spaces: a robot controlled on an $\mathbb{R}^7$ configuration manifold may need to track end-effector poses on $\mathrm{SE}(3)$ while satisfying obstacle avoidance margins in $\mathbb{R}$. We present Safe Pullback Bundle Dynamical Systems (SafePBDS), a geometrically consistent framework that computes optimal, certifiably safe configuration manifold accelerations from objectives and safety requirements on arbitrary task manifolds. SafePBDS builds on prior work that combines predefined task manifold dynamical systems to produce autonomous motion. Its first innovation is a pullback control barrier function construction, which converts task manifold safety conditions into linear constraints on configuration manifold accelerations. The second innovation is a task manifold action interface that allows a high-level policy to inject low-dimensional residual motions; zero input recovers the autonomous behavior, while safety is preserved under arbitrary inputs. This lets high-level policies efficiently steer exploration while leaving precise motion to the autonomous behavior. We validate SafePBDS in simulation and on a 23-DOF Franka Panda-Allegro Hand platform. On dexterous grasping, SafePBDS achieves a $92.5\%$ success rate across 20 household objects and 120 trials. Using the action interface, the method can exclude any one of the four fingers during grasping via a one-dimensional action, achieving $94.4\%$ 3-finger grasp success across 3 objects and 36 trials. The efficient planning and safety guarantee of SafePBDS also enable the first model-based, fully actuated palm-down in-hand reorientation, exceeding $360^\circ$ of yaw rotation in both directions under varying object weight and wrist motion. Demo video and details: https://tml.stanford.edu/safe-pbds
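
The claim that a safety condition becomes a linear constraint on accelerations can be illustrated with a generic second-order control barrier function in flat coordinates. This sketch assumes a standard condition, not the paper's pullback construction on general manifolds: for a barrier $h(q) \ge 0$ with Jacobian $J$, requiring $\ddot h + k_1 \dot h + k_0 h \ge 0$ gives $J \ddot q \ge -\dot J \dot q - k_1 J \dot q - k_0 h$, which is linear in $\ddot q$. The gains, the 2-D point robot, and all function names are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def cbf_halfspace(h, J, Jdot_qdot, qdot, k0, k1):
    """Return (A, b) such that the safety condition reads A . qddot >= b."""
    return list(J), -Jdot_qdot - k1 * dot(J, qdot) - k0 * h

def safe_acceleration(a_nom, A, b):
    """Euclidean projection of the nominal acceleration onto {a : A . a >= b}.
    Assumes A is nonzero; this is the closed-form solution of a one-constraint QP."""
    slack = dot(A, a_nom) - b
    if slack >= 0.0:
        return list(a_nom)  # nominal acceleration is already safe
    scale = slack / dot(A, A)
    return [ai - scale * Ai for ai, Ai in zip(a_nom, A)]

# Hypothetical 2-D point robot that must keep x >= 0, so h(q) = x and J = [1, 0].
q, qdot = [0.1, 0.0], [-1.0, 0.0]
h = q[0]
J = [1.0, 0.0]
A, b = cbf_halfspace(h, J, Jdot_qdot=0.0, qdot=qdot, k0=4.0, k1=4.0)
a_safe = safe_acceleration([-2.0, 1.0], A, b)
print(a_safe)
```

The nominal acceleration pushes the point toward the wall, so the projection changes only the component along `A` and leaves the orthogonal component untouched. With several constraints a QP solver would replace the closed-form projection.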

2605.21810 2026-05-22 cs.AI cs.MA

Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents


Zijian Du, Nathaniel Pinckney

Affiliations: NVIDIA

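AI summary: This paper proposes Trace2Skill, a framework that improves hardware agents on Complex Verilog Design Problems (CVDP) through verifier-guided skill evolution, substantially raising task pass rates with dense verifier feedback and without RTL-specialized model fine-tuning.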

English abstract

Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.
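
The abstract describes an oracle, mutator, and selector loop that evolves the agent's natural-language skill. The shape of such a loop can be sketched as a toy in Python. Everything here is hypothetical: the hint strings, `verifier_score` (a stand-in for oracle lessons and dense verifier feedback), and the toggle-one-hint mutation. A real system would presumably mutate skill text with an LLM and score candidates by running the agent against the verifier.

```python
import random

# Hypothetical hints; a "skill" is modeled as a set of hint strings.
REQUIRED = {"check include paths", "run testbench", "read build deps"}
POOL = sorted(REQUIRED | {"rewrite everything", "skip validation"})

def verifier_score(skill) -> int:
    """Stand-in for verifier feedback: helpful hints minus harmful ones."""
    return len(skill & REQUIRED) - len(skill - REQUIRED)

def mutate(skill, rng):
    """Mutator: toggle one randomly chosen hint in or out of the skill."""
    return skill ^ {rng.choice(POOL)}

def evolve(seed_skill, rounds=50, children=4, seed=0):
    """Repeatedly mutate the current best skill and keep the top scorer."""
    rng = random.Random(seed)
    best = frozenset(seed_skill)
    for _ in range(rounds):
        candidates = [best] + [mutate(best, rng) for _ in range(children)]
        best = max(candidates, key=verifier_score)  # selector
    return best

evolved = evolve(set())
print(sorted(evolved), verifier_score(evolved))
```

Because the current best is always among the candidates, the score never decreases from one round to the next, which mirrors the general idea of keeping a skill only when feedback supports it.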