arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27957 2026-05-28 cs.CL

DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints

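Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong, Chengkai Liu, Zhewei Liu, Yiming Xiao, Ali Mostafavi, James Caverlee

AI summary: Proposes the DisasterBench benchmark, which evaluates LLMs' structured multi-agent planning in disaster response through typed tool interfaces, and introduces the First-Point-of-Failure (FPoF) method for step-level failure attribution, revealing a gap between semantic reasoning and execution constraints.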

Abstract

Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open
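
The FPoF idea in the abstract above can be illustrated with a minimal sketch. It assumes a workflow is an ordered list of steps, each with a tool name and a parameter dictionary, and that a reference workflow is available for step-by-step comparison. The `Step` type, the error labels, and `first_point_of_failure` are hypothetical names chosen for illustration; they are not taken from the DisasterBench repository.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    tool: str                                   # tool invoked at this step
    params: dict = field(default_factory=dict)  # argument bindings for the call


def first_point_of_failure(predicted: list, reference: list) -> Optional[tuple]:
    """Return (step index, error label) for the earliest divergence, or None."""
    for i, ref in enumerate(reference):
        if i >= len(predicted):
            return i, "missing_step"
        if predicted[i].tool != ref.tool:
            return i, "tool_mismatch"
        if predicted[i].params != ref.params:
            return i, "parameter_binding"
    if len(predicted) > len(reference):
        return len(reference), "extra_step"
    return None


reference = [
    Step("satellite_analysis", {"region": "R1"}),
    Step("flood_prediction", {"imagery": "$step0"}),
    Step("damage_assessment", {"flood_map": "$step1"}),
]
predicted = [
    Step("satellite_analysis", {"region": "R1"}),
    Step("flood_prediction", {"imagery": "R1"}),     # wrong binding: earliest error
    Step("damage_report", {"flood_map": "$step1"}),  # later error, not reported
]
result = first_point_of_failure(predicted, reference)
```

Only the earliest divergence is reported, so the wrong tool at the third step is treated as downstream of the binding error at the second.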

2605.27954 2026-05-28 cs.LG

Cyclical Entropy Eruption: Entropy Dynamics in Agent Reinforcement Learning

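Wendi Li, Shawn Im, Sharon Li

AI summary: Identifies cyclical entropy eruption in agent reinforcement learning training and proposes SEAL, an auxiliary loss that stabilizes training and improves performance.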

Abstract

Agentic large language models are increasingly used to solve real-world tasks by reasoning over goals, invoking tools, and interacting with external environments. Reinforcement learning provides a natural framework for improving these behaviors, and recent agent RL methods have achieved strong results across domains. However, the training dynamics of agent RL remain poorly understood, limiting our ability to diagnose instabilities and design more effective training algorithms. In this work, we identify a previously underexplored phenomenon in agent RL, which we term cyclical entropy eruption. Unlike single-turn reasoning RL, where entropy typically collapses and stays low, agent RL training exhibits unique recurring cycles of sharp entropy eruption and gradual subsidence. We decompose this dynamic into three phases and provide theoretical and empirical analyses of each, explaining the mechanisms underlying its cyclical oscillation. We further show that degenerate patterns such as sentence duplication and hallucination, once acquired during eruption, can persist and accumulate across cycles. Motivated by these findings, we propose SEAL (Separation-Enhanced Agent Learning), a lightweight auxiliary loss that separates correct and incorrect trajectories in representation space, directly targeting the root cause of entropy eruption. Experiments across multiple benchmarks, models, and RL algorithms demonstrate that SEAL stabilizes training and yields stronger downstream agent performance.
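
The abstract above does not say how entropy is measured. A common choice, assumed here, is the Shannon entropy of the policy's next-token distribution averaged over generated tokens. The function names below are hypothetical and the sketch is only meant to show the quantity whose rise and fall would be tracked across training steps.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions):
    """Average entropy over the token distributions of one trajectory."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)


confident = [[1.0, 0.0, 0.0, 0.0]] * 3      # collapsed policy: entropy 0
diffuse = [[0.25, 0.25, 0.25, 0.25]] * 3    # uniform policy: entropy log(4)
low = mean_policy_entropy(confident)
high = mean_policy_entropy(diffuse)
```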

2605.27952 2026-05-28 cs.CV cs.RO

Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry

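Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ziyan Gao, Xiongwen Jiang, Nak Young Chong

AI summary: Proposes Con-DSO, a framework that predicts photometric and depth-geometric consistency uncertainty to drive quality-aware pixel selection and weighting, improving the robustness of RGB-D direct sparse odometry in challenging environments with dynamic objects and occlusions.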

Comments Submitted

Abstract

Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50% to 80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.
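
The contrast between continuous attenuation and hard gating mentioned in the abstract above can be sketched as follows. The weighting form `1 / (1 + u^2)` and the threshold value are assumptions made for illustration only; the abstract does not state the mapping Con-DSO uses from predicted uncertainty to residual weight.

```python
def soft_weight(uncertainty: float) -> float:
    """Continuous attenuation: the weight decays smoothly as uncertainty grows."""
    return 1.0 / (1.0 + uncertainty ** 2)


def hard_gate(uncertainty: float, threshold: float = 1.0) -> float:
    """Threshold-based gating: an observation is either kept or dropped."""
    return 1.0 if uncertainty < threshold else 0.0


def weighted_cost(residuals, uncertainties, weight_fn):
    """Weighted sum of squared residuals, as in a robust alignment objective."""
    return sum(weight_fn(u) * r * r for r, u in zip(residuals, uncertainties))


residuals = [1.0, 2.0, 1.0]
uncertainties = [0.0, 1.0, 3.0]
soft = weighted_cost(residuals, uncertainties, soft_weight)  # every pixel contributes
hard = weighted_cost(residuals, uncertainties, hard_gate)    # only the first pixel survives
```

With the soft weight, a moderately uncertain pixel still contributes at reduced strength instead of being discarded at the threshold.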

2605.27950 2026-05-28 cs.CV

Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment

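Long Li, Yuning Huang, Heather A. Eicher-Miller, J. Graham Thomas, Fengqing Zhu, Edward Sazonov

AI summary: Using egocentric eating images collected by a wearable camera, a pre-trained CLIP vision encoder, and a lightweight transformer classifier, this pilot study provides preliminary evidence that passive sensing can infer dietary behavior change receptivity.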

Abstract

Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants' self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data collected with the Automatic Ingestion Monitor v2 (AIM-2) during free-living eating episodes. The data include egocentric image sequences captured during eating, paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether visual information is related to these responses, we evaluate a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity, and warrant further investigation with larger and more comprehensive datasets.

2605.27948 2026-05-28 cs.RO

VLM-Based Advanced Rider Assistance System for Motorcycle Safety

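Mohamed Elnoor, Francesca Baldini, Ananya Trivedi, Faizan M. Tariq, Jovin D'sa, David Isele, Sangjae Bae, Dinesh Manocha, Yosuke Sakamoto

AI summary: Proposes an Advanced Rider Assistance System for motorcycles that uses vision-language models for semantic perception and risk-aware planning; by building dense risk maps and using a sampling-based planner, it achieves higher success rates and lower hazard exposure in the CARLA simulator.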

Comments Accepted to IEEE IV 2026

Abstract

Motorcycles face disproportionately high crash risks compared to cars due to limited protection and heightened sensitivity to surface hazards, yet Advanced Rider Assistance Systems (ARAS) remain underdeveloped relative to Advanced Driver Assistance Systems (ADAS). We propose a novel ARAS that enhances motorcycle safety through semantic perception and risk-aware planning. Our approach leverages Vision-Language Models (VLMs) for contextual hazard reasoning and integrates them with segmentation-based detection to construct dense risk maps. These maps encode both semantic characteristics (e.g., pothole severity, puddle slipperiness) and physical attributes (e.g., size, depth), which produce per-pixel hazard costs that capture motorcycle-specific risks. These maps are used by a sampling-based planner tailored to motorcycle dynamics to recommend throttle and steering actions that minimize hazard exposure while advancing toward the destination. We evaluate our system in different scenarios in the CARLA simulator. Compared to the baseline method, our method achieves higher success rates and lower hazard exposure, while qualitative results demonstrate interpretable risk maps and safe trajectory recommendations.

2605.27947 2026-05-28 cs.RO

SANTS: A State-Adaptive Scheduler for World Action Models

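Yirui Sun, Guangyu Zhuge, Keliang Liu, Jie Gu, Xinyu Bing, Zhongxue Gan, Chunxu Tian

AI summary: Proposes the State-Adaptive Noise Trajectory Scheduler (SANTS), which optimizes video-to-action diffusion policies by choosing the denoising depth dynamically from the video state, substantially reducing inference latency while preserving control performance.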

Comments 17 pages, 5 figures, 8 tables. Project page: https://advanced-robotics-lab.github.io/SANTS/

Abstract

World Action Models (WAMs) improve robot manipulation by using video-based future representations to condition action generation. In pixel-space WAMs, however, the best action condition is not necessarily the fully denoised video. Controlled denoising-depth scans show that video refinement can reduce action error up to a state-dependent point, after which the gain may saturate or even reverse when late predictions become less action-relevant or physically unreliable. This suggests that action generation should use a state-dependent point along the video noise trajectory rather than a fixed terminal denoising depth. We introduce State-Adaptive Noise Trajectory Scheduler (SANTS), a lightweight scheduler for video-to-action diffusion policies. At each video decision point, SANTS reads the current video-state representation and noise level, then jointly predicts a cumulative stopping hazard and a relative noise-progression ratio. SANTS is post-trained with a path-level reward computed after the frozen action branch generates the final action chunk, so the scheduler is optimized for downstream action quality rather than intermediate video fidelity, while redundant video-state updates are explicitly penalized. Experiments show that SANTS reaches 94.4% overall success on RoboTwin 2.0 and 73.1% average success across seven real-robot tasks, while reducing latency by 81.7% and 79.0% relative to full video denoising, respectively. These results indicate that adaptive selection along the video noise trajectory can preserve the control benefits of WAM-style future reasoning while removing much of its redundant inference cost.
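
One way to read the "cumulative stopping hazard" in the abstract above is through the standard discrete-time hazard relation, in which the probability of having stopped by step t is 1 minus the product of (1 - h_i) over the steps so far. The sketch below assumes that reading, plus a fixed decision threshold. Both are illustrative assumptions rather than the SANTS formulation, and the function names are hypothetical.

```python
def cumulative_stop_probability(hazards):
    """Probability of having stopped by each step, given per-step hazards."""
    survive = 1.0
    out = []
    for h in hazards:
        survive *= (1.0 - h)        # probability of not having stopped yet
        out.append(1.0 - survive)
    return out


def stopping_step(hazards, threshold=0.5):
    """First denoising step whose cumulative stop probability reaches the threshold."""
    for i, p in enumerate(cumulative_stop_probability(hazards)):
        if p >= threshold:
            return i
    return len(hazards) - 1          # fall back to the full denoising depth


hazards = [0.1, 0.2, 0.5]
cumulative = cumulative_stop_probability(hazards)
chosen = stopping_step(hazards)
```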

2605.27944 2026-05-28 cs.AI cs.MM cs.SD

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

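Ke Liu, Jiwei Wei, Wenyu Zhang, Shuchang Zhou, Ruikun Chai, Yutao Dai, Chaoning Zhang, Yang Yang

AI summary: To address the performance drop of existing audio-visual deepfake detectors in singing scenarios, proposes the Text-guided Audio-Visual Forgery Detection (T-AVFD) framework, which combines facial authenticity pattern learning with multi-modal differential weight learning for robust detection in both talking and singing scenarios.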

Comments Accepted by ICML 2026

Abstract

With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio-visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.

2605.27938 2026-05-28 cs.CV

SEMAGIC: Learning Semantically Consistent Deformable 3D Representations from In-the-Wild Images

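Sky Cen, Wufei Ma, Guofeng Zhang, Alan Yuille, Adam Kortylewski

AI summary: To address the unstable semantic correspondences of existing deformable 3D reconstruction methods, proposes SEMAGIC, a framework that enforces semantic consistency during reconstruction through a feature-level consistency loss and vertex-index-conditioned deformation, improving category-level semantic correspondence.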

Abstract

Learning deformable 3D object models from single-view in-the-wild images has enabled impressive 3D shape reconstruction without supervision. However, it remains unclear whether these models capture the semantic structure required for downstream tasks. We find that existing deformable reconstruction approaches, despite producing visually plausible geometry, yield unstable correspondences across instances and perform poorly on semantic correspondence benchmarks. We introduce SEMAGIC, a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. Rather than treating reconstruction as the end goal, SEMAGIC uses deformable modeling as a mechanism to discover category-level correspondences. Each category is represented by a canonical template mesh and a learned deformation field, functioning similarly to an autoencoder that reconstructs instance geometry from image features, enabling vertices to maintain consistent semantic meaning across instances. Semantic consistency is enforced during training through (i) a feature-level consistency loss aligning semantic features between canonical and deformed meshes, and (ii) vertex-index-conditioned deformation that preserves semantic correspondence across instances. By explicitly coupling geometric deformation with semantic alignment, SEMAGIC produces representations that maintain stable part correspondences across intra-category variation. Experiments demonstrate that SEMAGIC improves semantic correspondence of deformable models by +14.7 PCK@0.1 on SPair-71k, establishing deformable models as effective semantic 3D representations.
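
For readers unfamiliar with the PCK@0.1 metric quoted in the abstract above, the sketch below shows the usual definition: a predicted keypoint counts as correct when its distance to the ground-truth keypoint is at most alpha times a reference size. Taking the reference size as the larger side of the object bounding box is an assumption here (it is a common convention for semantic correspondence benchmarks), and the `pck` function is a hypothetical helper, not the paper's evaluation code.

```python
import math


def pck(pred, gt, bbox_wh, alpha=0.1):
    """Fraction of keypoints within alpha * max(bbox width, bbox height) of the target."""
    threshold = alpha * max(bbox_wh)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


pred = [(10.0, 10.0), (50.0, 50.0)]
gt = [(12.0, 10.0), (80.0, 50.0)]
score = pck(pred, gt, bbox_wh=(100.0, 60.0))  # threshold is 10 pixels
```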

2605.27935 2026-05-28 cs.AI

Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

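Zhenyu Cui, Xiangzhong Luo

AI summary: Uses residual stream probes, causal layer-skipping interventions, and effective-depth measurements to study the layer-wise dynamics of large language models in autonomous agent tasks (multi-turn planning, tool use, iterative state updates). Agentic reasoning shows a depth profile distinct from static tasks: as trajectories unfold, models progressively recruit more and deeper layers, and stronger long-range inter-layer dependencies emerge in later turns. Residual updates shift from stable feature accumulation toward repeated recalibration, and effective-depth analysis reveals a construction-refinement gap in which semantic direction forms early while deep layers remain necessary for stabilizing final outputs.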

Abstract

Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer-wise analysis of complete user-agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long-range inter-layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction-dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis further reveals a substantial construction-refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.
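
A layer-skipping intervention of the kind named in the abstract above can be sketched on a toy residual stack: each layer adds an update to the running state, and skipping a layer removes its update so the change in the final output can be measured. The scalar "layers" below are stand-ins chosen for illustration and `run_residual_stack` is a hypothetical helper; the paper's actual intervention operates on transformer layers.

```python
def run_residual_stack(x, layers, skip=None):
    """Apply residual updates x <- x + layer(x), optionally skipping one layer."""
    for i, layer in enumerate(layers):
        if i == skip:
            continue                 # the skipped layer contributes no update
        x = x + layer(x)
    return x


# Toy scalar layers standing in for transformer blocks.
layers = [
    lambda x: 1.0,        # writes a constant feature
    lambda x: 0.5 * x,    # amplifies what is already in the stream
    lambda x: -0.25,      # applies a small correction
]
full = run_residual_stack(0.0, layers)
without_layer_1 = run_residual_stack(0.0, layers, skip=1)
effect_of_layer_1 = full - without_layer_1
```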

2605.27934 2026-05-28 cs.CL

GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

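Shengmin Piao, Sanghyun Park

AI summary: Proposes GeneralThinker, a framework that uses answer likelihood for dense supervision and fine-grained credit assignment without domain-specific verifiers, achieving the best average performance across 11 benchmarks spanning mathematics, STEM, and general reasoning.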

Abstract

Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance. Further analyses show that uncontrolled token-level modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.
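
The abstract above says token-level updates are constrained "through clipping and direction-preserving modulation" without giving the formula. The sketch below shows one simple scheme with that property, assumed purely for illustration: a per-token signal scales the response-level advantage through a factor clipped to a positive interval, so the sign of the update never flips. The names and the `eps` value are hypothetical.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))


def modulated_advantage(advantage, token_signal, eps=0.2):
    """Scale a response-level advantage by a clipped, strictly positive per-token factor."""
    factor = clip(1.0 + token_signal, 1.0 - eps, 1.0 + eps)  # stays positive for eps < 1
    return advantage * factor


boosted = modulated_advantage(1.0, 5.0)     # large signal is clipped at 1 + eps
damped = modulated_advantage(-1.0, -5.0)    # clipped at 1 - eps, sign preserved
```

Without the clip, a large negative signal could make the factor negative and reverse the update direction, which is one way uncontrolled modulation could destabilize training.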

2605.27932 2026-05-28 cs.CV cs.AI cs.CL cs.CR cs.LG

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

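Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

AI summary: Studies how different think-with-image reasoning paradigms in multimodal large language models affect jailbreak robustness, finds that explicit image-tool interaction substantially lowers attack success rates, and explains the mechanism at the representation level through an image-tool safety vector framework.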

Comments 17 pages, 6 figures, 7 tables

Abstract

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.
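
The "around 30% relative" figure in the abstract above is a relative reduction in attack success rate (ASR), not a drop of 30 percentage points. The short sketch below shows the arithmetic with made-up ASR values; the numbers are hypothetical and are not results from the paper.

```python
def relative_reduction(baseline_asr: float, new_asr: float) -> float:
    """Relative ASR reduction: (baseline - new) / baseline."""
    if baseline_asr <= 0.0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - new_asr) / baseline_asr


# Hypothetical values: ASR falls from 50% to 35%, a 15-point absolute drop.
example = relative_reduction(0.50, 0.35)
```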

2605.27931 2026-05-28 cs.AI

DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

Xinjiang Yu, Junyi Han, Zhuofan Chen, Chi Zhang, Xiangyu Fu, Jingyuan Tan, Zirui You, Yixiang Jian, Yu-Ping Wang, Chengliang Chai

AI summary: Proposes DiagramRAG, a framework that retrieves reference diagrams compatible with a sketch's semantics and topology to automatically complete and generate scientific diagrams from sketches.

Comments 23 pages, 9 figures

Abstract

Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such diagrams is through rough sketches, where text labels, connectors, and spatial arrangements express early semantic and topological intentions. However, sketches are usually incomplete, making them insufficient for directly producing publication-quality diagrams. Existing sketch-based generation methods mainly reconstruct the sketch itself, while recent text-driven diagram generation frameworks rely on textual semantics and do not fully exploit the topological structure contained in sketches. In this paper, we introduce DiagramRAG, a lightweight retrieval-augmented framework for sketch-based scientific diagram completion. Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, we represent diagrams as knowledge graphs, synthesize sketch variants at different simplification levels, and train an embedding model to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram. Experiments show that DiagramRAG achieves F1-scores of 0.848 and 0.802 on DiagramBank and FigureBench, respectively, and improves generation quality with the best VLM-as-a-Judge score of 7.170, while reducing inference latency to 35.48 seconds per sample. Our code and data are available at https://anonymous.4open.science/r/DiagramRAG-A262 and https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch.
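
Retrieval in a shared embedding space, as described in the abstract above, usually comes down to ranking candidates by similarity to the query vector. The sketch below assumes cosine similarity and uses tiny hand-written vectors. The embedding values, diagram names, and the `retrieve` helper are hypothetical and stand in for the trained embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(sketch_vec, diagram_vecs, k=2):
    """Return the names of the k diagrams most similar to the sketch embedding."""
    ranked = sorted(
        diagram_vecs,
        key=lambda name: cosine(sketch_vec, diagram_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


diagram_vecs = {
    "pipeline_fig": [0.9, 0.1, 0.0],
    "bar_chart": [0.0, 0.2, 0.9],
    "encoder_decoder_fig": [0.7, 0.6, 0.1],
}
top = retrieve([1.0, 0.2, 0.0], diagram_vecs, k=2)
```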

2605.27927 2026-05-28 cs.CV cs.LG

Structure-Guided Visual Perturbation Neutralization for LVLMs

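Yuanhe Zhang, Xueting Wang, YanBin Ren, Haoran Gao, Xinhan Zheng, Zhenhong Zhou, Fanyu Meng, Li Sun, Sen Su

AI summary: Proposes Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play adversarial defense built on Prior Structural Extraction and Dynamic Guided Neutralization, reaching a defense success rate above 87% with only 0.5% pixel modification and 0.16 seconds per image.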

Abstract

Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is available at https://anonymous.4open.science/r/SIGN-BCB1.

2605.27924 2026-05-28 cs.CV

SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization

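Peiyu Zhuang, Jianquan Yang, Haodong Li, Zhuoying Cai, Ruitao Xie, Jishen Zeng, Baoying Chen, Jiwu Huang, Xiaochun Cao

AI summary: Proposes SIGMA, which automatically generates pixel-level masks from public editing datasets using semantic-feature differencing in a vision foundation model and an instruction-derived spatial prior, for training image manipulation localization models; it improves F1 by 12.20% on five benchmarks and produces a training set of about 1.1M samples that raises the F1 of six detectors by 18.34% on average.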

Abstract

Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of (original, edited) pairs that are structurally identical to IML training samples, lacking only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We will release the full codebase as soon as the paper is accepted.
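
The F1 and IoU figures in the abstract above are pixel-level mask metrics. The sketch below shows the standard definitions on flattened binary masks; `mask_f1_iou` is a hypothetical helper, and treating two empty masks as a perfect match is an assumption made to avoid division by zero.

```python
def mask_f1_iou(pred, gt):
    """Pixel-level F1 and IoU for two flattened binary masks of equal length."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)        # manipulated, predicted
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)    # clean, predicted
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)    # manipulated, missed
    if tp + fp + fn == 0:
        return 1.0, 1.0                                     # both masks empty
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou


pred_mask = [1, 1, 0, 0]
gt_mask = [1, 0, 1, 0]
f1, iou = mask_f1_iou(pred_mask, gt_mask)
```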

2605.27923 2026-05-28 cs.CV cs.AI cs.LG quant-ph

Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

我们真的需要量子机器学习吗?:一项多维实证研究

Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

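AI summary A multidimensional benchmark of classical and quantum machine learning models on the MNIST handwritten digit dataset finds that QSVM is more accurate than CSVM, and that QCNN matches CCNN in accuracy with far fewer parameters and less memory, but that the quantum models incur higher computational cost.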

详情
AI中文摘要

计算机视觉的快速发展和日益复杂的图像识别任务暴露了经典机器学习模型的基本计算限制,推动了量子计算作为一种新兴范式的探索。本文对MNIST手写数字数据集上的经典和量子机器学习模型进行了全面的基准测试,评估了传统模型(经典支持向量机CSVM和量子支持向量机QSVM)以及深度神经网络模型(经典卷积神经网络CCNN和量子卷积神经网络QCNN)在四个性能维度上的表现:分类准确率、计算运行时间、参数数量和内存需求。实验作为特征维度和样本量的函数进行,并在CPU和GPU执行环境下进行,提供了受控的多维比较,以解决先前工作中的空白。对于基于SVM的模型,QSVM在准确率上始终优于CSVM,在1000个样本时达到约0.90对比约0.85,但计算成本更高。10个量子比特的特征数和200-500的样本量成为平衡准确率和运行时间的实际工作点。对于神经网络模型,CCNN和QCNN实现了可比的分类准确率,在64个特征和60000个样本时均超过0.96,但QCNN在参数和内存效率上显著更优,在较高特征数下比CCNN少约94%的参数和约75%的内存,但运行时间更长。在两个模型家族中,随着特征维度或样本量的增加,量子模型在准确率上始终以更大优势超越经典模型。

英文摘要

The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging new paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200-500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

2605.27922 2026-05-28 cs.AI

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Harness-Bench: 在真实智能体工作流中测量不同模型的框架效应

Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang

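AI summary Proposes Harness-Bench, a benchmark of 106 sandboxed offline tasks that evaluates execution performance across combinations of models and harness configurations, and finds that agent capability should be attributed to the model-harness configuration rather than to the base model alone.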

Comments 16 pages, 4 figures, 11 tables. The first three authors contributed equally

详情
AI中文摘要

LLM智能体越来越多地被部署为可执行系统,使用工具、修改工作区并产生具体产物。在此类工作流中,性能不仅取决于基础模型,还取决于框架:管理上下文、工具、状态、约束、权限、追踪和恢复的系统层。然而,现有基准通常抽象掉执行过程、比较完整智能体系统或固定框架,使得执行层变化难以研究。我们引入Harness-Bench,一个用于评估真实智能体工作流中配置级框架效应的诊断基准。Harness-Bench在共享任务环境、预算和评估协议下,跨多个模型后端评估代表性框架配置,同时保留每个框架的原生执行行为。该基准包含106个沙盒离线任务,这些任务基于实际智能体使用模式构建,并经过人工审核以确保真实性、可解性、可验证性和完整性。每次运行记录最终产物、执行轨迹、使用统计和验证器输出,从而能够分析最终完成之外的内容。在5,194条执行轨迹中,我们观察到不同模型-框架配对在完成度、过程质量、效率和失败行为上存在显著差异。这些结果表明,智能体能力应在模型-框架配置级别报告,而非仅归因于基础模型。我们的分析进一步识别了重复的执行-对齐失败,其中合理的推理与工具反馈、工作区状态、证据或可验证输出契约脱节。Harness-Bench为诊断和改进可靠、高效且可审计的智能体执行栈提供了可复现的基础。

英文摘要

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.
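
The recommendation to report capability per model-harness configuration can be illustrated with a small aggregation sketch. The record layout `(model, harness, completed)` and the function name `completion_by_config` are hypothetical and chosen only for this example; they are not the benchmark's actual trace schema.

```python
from collections import defaultdict


def completion_by_config(runs):
    """Aggregate completion rate per (model, harness) pair.

    `runs` is an iterable of (model, harness, completed) tuples,
    a made-up record layout used only for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # pair -> [completed, attempted]
    for model, harness, completed in runs:
        totals[(model, harness)][0] += int(completed)
        totals[(model, harness)][1] += 1
    return {pair: done / n for pair, (done, n) in totals.items()}


runs = [
    ("model-1", "harness-A", True),
    ("model-1", "harness-A", False),
    ("model-1", "harness-B", True),
]
print(completion_by_config(runs))
```

Keying on the pair rather than on the model alone keeps a harness-induced difference visible instead of averaging it away.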

2605.27921 2026-05-28 cs.AI cs.CL cs.CY cs.HC

Show, Don't TELL: Explainable AI-Generated Text Detection

展示,而非告知:可解释的AI生成文本检测

Aldan Creo, Suraj Ranganath

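AI summary Proposes TELL, a new explainable architecture with built-in explanations and reinforcement-learning training, which maintains strong detection performance (AUROC 0.927) while providing text-level annotations that help users judge for themselves whether a text is AI-generated.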

详情
AI中文摘要

关于AI生成文本检测的研究已经提出了多种区分人类与AI文本的方法,其中一些方法在分布内性能上表现优异。然而,由于输出与用户(如教授)的需求不一致——他们只得到一个没有附带解释的数值分数——现实世界的应用进展缓慢。我们通过一种新颖的架构TELL解决了这个问题,该架构从一开始就内置了可解释性。虽然我们的系统仍像其他检测器一样提供数值分数以便比较,但TELL采用了一种根本不同的方法,旨在向用户展示模型认为文本是AI还是人类写作的“线索”,使用户能够根据自己的判断以及对写作背景和所谓作者的理解来决定文本的作者。我们在一个特定领域的作者注释自定义SFT数据集上训练TELL,并进一步使用GRPO结合课程学习来优化系统以提高性能。我们实现了与最先进检测器相竞争的性能(AUROC 0.927),同时原生提供解释检测器决策基础的注释。我们进一步使用人类注释数据集评估解释质量,报告了在注释的具体性、可证伪性、连贯性、合理性和基础性方面的高胜率(平均72.3%),使用户能够批判性思考并自行决定。因此,我们的工作从以人为中心的角度重新定义了AI生成文本检测的问题,并为专注于原生可解释性的新一代检测器铺平了道路。

英文摘要

Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach where we aim to show the user the "tells" by which the model believes a text is AI or human-written, to empower the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win-rate on annotation concreteness, falsifiability, coherence, plausibility and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.

2605.27920 2026-05-28 cs.CV

Rethinking Video-Language Model from the Language Input Perspective

从语言输入角度重新思考视频-语言模型

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

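AI summary Taking the language-input perspective, proposes a plug-and-play framework that improves video-language models by generating positive and negative texts, applying attribute-based text reasoning, and using a self-weighted loss.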

Comments Published in AAAI 2026

详情
AI中文摘要

受大语言模型浪潮的驱动,视频-语言模型(VLM)已成为弥合视频与文本之间差距的重要且具有挑战性的技术。尽管先前的VLM工作取得了显著进展,但几乎所有工作都隐含地假设所有文本都是由特定模板预定义的。在实际应用中,这种严格的假设无法满足,因为1)预定义所有文本极其耗时费力;2)这些预定义的文本输入过于限制且不友好,限制了其应用。观察到,给定视频输入,语义相似但模板不同的文本会导致不同的性能。为此,本文提出了一种新颖的即插即用框架,用于各种基于VLM的方法,以充分弥合视频和文本。具体来说,我们首先从原始文本中生成正负文本,以针对特定的文本组件。然后,我们提出了一种基于属性的文本推理策略,以挖掘生成文本的细粒度语义。最后,我们利用视频作为指导,通过设计自加权损失来进行跨模态桥接。大量实验表明,所提出的方法可以作为即插即用模块,有效提升最先进VLM的性能。

英文摘要

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology to bridge the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all the texts are predefined by a specific template. In real-world applications, such a strict assumption is impossible to satisfy since 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. We observe that, given a video input, texts with similar semantics but different templates lead to different performance. To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module to effectively improve the performance of state-of-the-art VLMs.

2605.27919 2026-05-28 cs.RO cs.LG

Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal

通过子频率流形遍历的频率引导动作扩散

Junlin Wang

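AI summary Proposes the Frequency Guidance Operator (FGO), which steers a diffusion policy step by step through sub-frequency manifolds to generate smooth actions, improving action smoothness and temporal consistency on 15 robotic manipulation tasks.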

Comments A preprint version of FGO

详情
AI中文摘要

通过行为克隆学习视觉运动策略通常涉及模仿人类操作员收集的专家演示。然而,自然的人类演示固有地包含高频噪声,例如间歇性抖动、暂停和动作抖动。训练策略直接模仿这些原始轨迹不可避免地会导致模型继承这些次优行为。这种病理在基于扩散的策略中尤为明显,其中迭代去噪步骤可能无意中放大高频伪影而牺牲有意义的细粒度细节。为了解决这些限制,我们提出了一种新颖的基于频率的算法,该算法能够实现隐式频谱操控和平滑动作生成。我们的方法,频率引导算子(FGO),通过逐步将噪声样本通过具有扩展频谱带的中间子频率流形驱动,来引导扩散策略的生成过程。在来自5个基准测试的15个机器人操作任务上验证,FGO在增强动作平滑性和时间一致性方面取得了优越性能,同时保留了成功执行任务所需的细节。项目网站:https://henrywjl.github.io/frequency-guidance-operator/

英文摘要

Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high-frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to directly imitate these raw trajectories inevitably causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion-based policies, where iterative denoising steps can inadvertently amplify high-frequency artifacts at the expense of meaningful fine-grained details. To address these limitations, we present a novel frequency-based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, Frequency Guidance Operator (FGO), steers the generation process of diffusion policies by progressively driving the noisy samples through intermediate sub-frequency manifolds with expanding spectral bands. Validated on 15 robotic manipulation tasks from 5 benchmarks, FGO achieves superior performance in enhancing action smoothness and temporal consistency while preserving the details necessary for successful task execution. Project website: https://henrywjl.github.io/frequency-guidance-operator/
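
To make the phrase "expanding spectral bands" concrete, the sketch below low-pass filters a one-dimensional action sequence with a plain discrete Fourier transform and widens the retained band over a few steps. This is not the FGO algorithm, whose details are not given in the abstract. The functions `low_pass` and `band_schedule`, the linear cutoff schedule, and the toy trajectory are all assumptions made for illustration.

```python
import cmath


def low_pass(signal, cutoff):
    """Keep only DFT components whose frequency index is <= cutoff."""
    n = len(signal)
    # Forward DFT of the real-valued sequence.
    spectrum = [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]
    # min(k, n - k) is the frequency index shared by a bin and its conjugate twin.
    kept = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    # Inverse DFT; the imaginary part is numerical noise for real input.
    return [
        (sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(kept)) / n).real
        for t in range(n)
    ]


def band_schedule(n, steps):
    """Cutoffs growing linearly from 0 (mean only) to n // 2 (full spectrum); steps >= 2."""
    return [round(i * (n // 2) / (steps - 1)) for i in range(steps)]


trajectory = [0.0, 1.0, 0.2, 1.1, 0.4, 1.3, 0.6, 1.5]  # toy jittery action sequence
for cutoff in band_schedule(len(trajectory), 3):
    print(cutoff, [round(v, 3) for v in low_pass(trajectory, cutoff)])
```

With the narrowest band only the mean survives, and with the widest band the original sequence is recovered, so intermediate cutoffs trade smoothness against detail.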

2605.27917 2026-05-28 cs.RO

A Surveillance Evasion Game with Continuous Sensor Redeployment via Bilevel Optimization

基于双层优化的连续传感器重新部署的监视规避博弈

Jaehyeok Kim, Kartik A. Pant, Joseph Kinerson, Kylie Sommer-Kohrt, Worawis Sribunma, Li-Yu Lin, James M. Goppert

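AI summary To counter uncrewed aircraft that exploit spatiotemporal gaps in sensor coverage to infiltrate restricted airspace, proposes bilevel optimization for continuous sensor redeployment in which sensors slide along building boundaries, uses a log-sum-exp smooth approximation to preserve differentiability, and converges to a local Nash equilibrium.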

Comments 8 pages, 8 figures, submitted to IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

无人航空系统(UAS)已成为关键基础设施安全日益增长的威胁,利用传感器周界的时空间隙未被探测地渗透受限空域。我们将这种交互建模为对抗性UAS与由定向和全向传感器组成的异构传感器网络之间的两人零和微分博弈。与早期将防御者限制在离散放置图或固定配置的博弈论方法不同,我们引入了一种连续传感器重新部署技术,其中每个传感器沿凸建筑边界自由滑动。这是通过对数-求和-指数平滑近似实现的,该近似在多边形顶点处保持可微性,从而能够使用基于梯度的方法进行优化。攻击者的最佳响应通过两步法计算,结合STP-RRT*进行可行轨迹初始化和非线性规划进行检测最小化细化。联合优化通过交替双层优化收敛到局部纳什均衡(LNE),为两个参与者推导出解析的一阶平稳性条件,从而为CUAS任务中的异构传感器放置建立了可部署的基线。

英文摘要

Uncrewed Aerial Systems (UASs) have become a growing threat to the security of critical infrastructure, exploiting spatiotemporal gaps in sensor perimeters to infiltrate restricted airspace undetected. We formulate this interaction as a two-player zero-sum differential game between an adversarial UAS and a heterogeneous sensor network of directional and omnidirectional sensors. Unlike earlier game-theoretic approaches that restrict the defender to discrete placement graphs or fixed configurations, we introduce a continuous sensor redeployment technique in which each sensor slides freely along the convex building boundaries. This is enforced via a log-sum-exp smooth approximation that preserves differentiability at polygon vertices, enabling optimization with gradient-based methods. The attacker's best response is computed via a two-step approach combining STP-RRT* for feasible trajectory initialization and nonlinear programming for detection-minimization refinement. The joint optimization converges to a Local Nash Equilibrium (LNE) via alternating bilevel optimization, with analytical first-order stationarity conditions derived for both players, thereby establishing a deployable baseline for heterogeneous sensor placements in CUAS missions.
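
The log-sum-exp approximation mentioned in the abstract is a standard way to replace a non-smooth maximum with a differentiable surrogate. The sketch below shows only that generic building block. How the paper applies it to polygon boundaries is not stated in the abstract, and the name `smooth_max` and the sharpness parameter `alpha` are assumptions for this example.

```python
import math


def smooth_max(values, alpha=20.0):
    """Differentiable upper bound on max(values) via log-sum-exp.

    Satisfies max(values) <= smooth_max <= max(values) + log(n) / alpha,
    so a larger alpha gives a tighter but sharper approximation.
    """
    m = max(values)  # subtract the max before exponentiating for numerical stability
    return m + math.log(sum(math.exp(alpha * (v - m)) for v in values)) / alpha


values = [1.0, 3.0, 2.0]
for alpha in (1.0, 10.0, 100.0):
    print(alpha, smooth_max(values, alpha))
```

Unlike a hard maximum, this surrogate has a well-defined gradient where two arguments tie, which is the kind of kink a polygon vertex creates.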

2605.27916 2026-05-28 cs.CV cs.CL

OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

OphIn-500K:策划网络规模的视觉指令以扩展眼科多模态大语言模型

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang

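AI summary Proposes the OphIn-Engine pipeline for building high-quality ophthalmic instruction data from web videos, produces the OphIn-500K dataset with more than 500,000 instruction instances, and uses it to develop OphIn-VL, an ophthalmology-specific multimodal large language model that outperforms existing general medical and domain-specific models on multiple tasks.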

详情
AI中文摘要

通用医学多模态大语言模型(MLLMs)的进步为构建支持临床诊断的对话助手展现了巨大潜力。然而,它们在高度专业化领域(如眼科)的适应性仍未得到充分探索,主要原因是缺乏大规模、领域特定的指令微调数据。现有的眼科对话数据集通常规模有限,且大多依赖于已建立的公共基准图像,限制了眼科MLLMs的可扩展性及其捕捉真实临床复杂性的能力。为解决这一问题,我们提出了OphIn-Engine,一个眼科特定的指令数据策划流水线,从开放获取的眼科网络规模视频中构建高质量指令数据。该流水线整合了多模态转录以提取图像-文本对、视觉线索分离与评分以识别临床相关的视觉描述,以及指令合成与质量控制以生成准确且多样的临床对话。利用该引擎,我们推出了OphIn-500K,一个大规模多模态眼科指令微调数据集,包含超过50万个指令实例和来自29,000多个视频片段的151,000多张独特图像,格式包括视觉问答(VQA)、多轮对话交互和思维链(CoT)推理。基于该数据集,我们进一步开发了OphIn-VL,一个具有高级视觉理解和对话能力的眼科专用MLLM。综合实验和案例研究表明,与最先进的通用医学和领域专用MLLMs相比,OphIn-VL实现了更优的性能。

英文摘要

The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce OphIn-500K, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

2605.27913 2026-05-28 cs.LG

Where LLM Annotators Fail: Label-Free Learning on Graphs with LLMs

LLM标注者失败之处:基于LLM的图上无标签学习

Safal Thapaliya, Jiatan Huang, Chuxu Zhang

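AI summary Observing that LLM annotation noise in graph node classification depends on region as well as class, proposes Cluster-Aware Noise Estimation (CANE), a framework that estimates cluster-conditional reliability to select and correct pseudo-labels, and outperforms existing label-free methods on multiple graph benchmarks.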

详情
AI中文摘要

图上的节点分类通常需要标注节点,然而在图规模上获取标签成本高昂。当节点属性包含语义内容(如论文摘要、网页或产品描述)时,大型语言模型(LLM)可以通过标注一小部分节点提供低成本监督。然而,这些LLM生成的标签带有噪声,现有的无标签图学习方法通常将这种噪声视为全局的或类别条件的。我们发现LLM标注错误不仅依赖于类别,还依赖于区域:在同一类别内,可靠性在特征空间聚类之间可能差异显著。鉴于此,我们提出聚类感知噪声估计(CANE),一种无标签学习框架,无需真实标签即可估计聚类条件的LLM可靠性,并利用该估计决定信任哪些伪标签以及校正哪些标签。在各种图基准和GNN骨干网络上,CANE优于最强的无标签基线,在表现出更强聚类条件噪声的数据集上提升最大。

英文摘要

Node classification on graphs often requires labeled nodes, yet obtaining labels at graph scale is expensive. When node attributes contain semantic content, such as paper abstracts, web pages, or product descriptions, large language models (LLMs) can provide low-cost supervision by annotating a small subset of nodes. However, these LLM-generated labels are noisy, and existing label-free graph learning methods usually treat this noise as either global or class-conditional. We find that LLM annotation errors are not only class-dependent but also region-dependent: within the same class, reliability can vary sharply across feature-space clusters. In light of this, we propose Cluster-Aware Noise Estimation (CANE), a label-free learning framework that estimates cluster-conditional LLM reliability without ground truth labels, and uses this estimate to decide which pseudo-labels to trust, and which labels to correct. Across various graph benchmarks and GNN backbones, CANE improves over the strongest label-free baselines, with the largest gains on datasets exhibiting stronger cluster-conditional noise.

2605.27911 2026-05-28 cs.AI

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

SuiChat-CN:中文群聊情境自杀风险评估基准

Xiangyu Wang, Zhiwei Yu, Chengze Du, Dingchang Wang, Yuhan Ye, Fangyu Zheng

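AI summary To address the fragmented messages, multi-turn dialogue, and implicit expressions found in instant-messaging group chats, builds SuiChat-CN, the first benchmark for contextual suicide risk assessment in Chinese group chats; it constructs coherent conversational segments through signal-word extraction and bidirectional context expansion, annotates user risk levels with an expert-validated, LLM-assisted paradigm, and shows experimentally that contextual information is essential for reliable assessment.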

详情
AI中文摘要

自杀是一个关键的全球公共卫生挑战,每年导致约72万人死亡,需要及时有效的预防策略。现有的计算研究主要关注基于帖子的社交媒体平台(如Twitter和微博),而忽略了即时通讯环境(如Telegram)。然而,群聊带来了独特的挑战:消息简短、碎片化、多方参与,并且常常依赖隐晦或文化特定的表达,使得孤立的帖子级分析不足。我们引入了SuiChat-CN,一个用于情境自杀风险评估的中文群聊基准。我们收集了公开的Telegram群聊数据,通过信号词提取和双向上下文扩展构建连贯的对话片段,并使用专家验证的LLM辅助范式注释用户风险等级。SuiChat-CN包含来自1,406名用户的13,312个上下文片段,覆盖258,228条原始聊天消息。使用PLM和超过40个LLM的大量实验表明,上下文信息对于可靠的风险评估至关重要,而微调和部分上下文评估进一步揭示了多方对话中早期检测的挑战。出于伦理和敏感性考虑,该数据集不公开发布,但将根据合理请求与经认可的心理健康和自杀预防研究机构共享。

英文摘要

Suicide is a critical global public health challenge, causing approximately 720,000 deaths each year and calling for timely, effective prevention strategies. Existing computational studies primarily focus on post-based social media platforms such as Twitter and Weibo, leaving instant messaging environments such as Telegram underexplored. Yet group chats pose distinct challenges: messages are short, fragmented, multi-party, and often rely on implicit or culturally specific expressions, making isolated post-level analysis insufficient. We introduce SuiChat-CN, a Chinese group-chat benchmark for contextual suicide risk assessment. We collect public Telegram group-chat data, construct coherent conversational segments through signal-word extraction and bidirectional context expansion, and annotate user risk levels with an expert-validated, LLM-assisted paradigm. SuiChat-CN contains 13,312 contextual segments from 1,406 users, covering 258,228 raw chat messages. Extensive experiments with PLMs and more than 40 LLMs demonstrate that contextual information is essential for reliable risk assessment, while fine-tuning and partial-context evaluation further reveal the challenges of early detection in multi-party conversations. Due to ethical and sensitivity concerns, the dataset is not publicly released but will be shared with accredited mental health and suicide-prevention research institutions upon reasonable request.
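
The segment-construction step (signal-word extraction followed by bidirectional context expansion) can be sketched roughly as follows. The abstract does not give the actual signal lexicon, window size, or merging rule, so the function `build_segments`, the `window` parameter, and the neutral placeholder messages are all assumptions made for illustration.

```python
def build_segments(messages, signal_words, window=2):
    """Cut a chat log into context segments around signal-word hits.

    Each message containing a signal word is expanded by `window` messages
    in both directions; overlapping or touching spans are merged.
    """
    hits = [i for i, text in enumerate(messages) if any(w in text for w in signal_words)]
    spans = []
    for i in hits:
        start = max(0, i - window)
        end = min(len(messages) - 1, i + window)
        if spans and start <= spans[-1][1] + 1:
            spans[-1][1] = max(spans[-1][1], end)  # merge with the previous span
        else:
            spans.append([start, end])
    return [messages[start:end + 1] for start, end in spans]


chat = ["a", "b", "KEY one", "c", "d", "e", "f", "g", "KEY two", "h"]
for segment in build_segments(chat, ["KEY"], window=1):
    print(segment)
```

Expanding in both directions keeps the messages that precede and follow a hit, which matters when the hit itself is short or ambiguous.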

2605.27909 2026-05-28 cs.RO

S-Cheetah: A Novel Quadrupedal Robot with a 3-DOF Active Spine Learning Agile Locomotion

S-Cheetah:一种具有3自由度主动脊柱学习敏捷运动的新型四足机器人

Zimu Li, Weibang Bai

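AI summary Presents S-Cheetah, a quadrupedal robot with a 3-DOF bio-inspired active spine, together with a reinforcement learning framework that enables agile motions such as high-speed running, in-place turning, and aerial self-righting.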

Comments Project website: https://himmy-robotics.github.io/scheetah

详情
AI中文摘要

四足动物的生物脊柱能够实现矢状面的屈伸、侧向弯曲和轴向旋转,在高度敏捷和灵巧的运动中起着关键作用。尽管许多研究已将主动脊柱关节集成到四足机器人中以增强敏捷性,但大多数设计通过减少脊柱自由度来简化控制复杂性,未能实现生物脊柱的空间三轴旋转特性。因此,复制多自由度仿生脊柱并有效利用它来赋能四足机器人的敏捷运动仍然是一个重要的研究挑战。在本研究中,我们提出了S-Cheetah,一种具有3自由度仿生串联主动脊柱的四足机器人,能够实现仿生空间三轴旋转。为了使机器人充分利用这一主动脊柱,我们开发了一个专门的强化学习框架,通过整合加速度课程学习策略和定制的奖励函数(如奔跑步态奖励、脊柱波动奖励和脊柱转向奖励),积极促进引入的脊柱的参与并最大化机器人的运动能力。实验结果表明,S-Cheetah使用旋转G2奔跑步态可以达到6.9米/秒的峰值速度,原地转向率为7.2弧度/秒。此外,该系统展现出一种新兴的、受猫启发的空中自翻能力,使其在自由落体过程中能够从任意方向稳定地四足着地。最后,通过在不同运动任务中的广泛评估,我们证明了所提出的3自由度脊柱的引入全面增强了四足机器人的运动敏捷性。项目网站:himmy-robotics.github.io/scheetah

英文摘要

The biological spine of quadrupeds enables sagittal flexion/extension, lateral bending, and axial rotation, playing a crucial role in highly agile and dexterous locomotion. While numerous studies have integrated active spinal joints into quadrupedal robots to enhance agility, most designs reduce control complexity by reducing spinal degrees of freedom (DOF), failing to achieve the spatial tri-axial rotation characteristic of biological spines. Consequently, replicating a multi-DOF biomimetic spine and effectively leveraging it to empower the agile locomotion of quadrupedal robots remains a significant research challenge. In this study, we present S-Cheetah, a quadrupedal robot featuring a 3-DOF bio-inspired serial active spine capable of biomimetic spatial tri-axial rotation. To empower the robot to fully utilize this active spine, we developed a specialized reinforcement learning framework to actively promote the engagement of the introduced spine and maximize the robot's locomotive capabilities by integrating an acceleration curriculum learning strategy with tailored reward functions, such as a gallop gait reward, a spine undulation reward, and a spine steering reward. Experimental results demonstrate that S-Cheetah can achieve a peak speed of 6.9 m/s using the rotary G2 gallop gait and an in-place turning rate of 7.2 rad/s. In addition, the system exhibits an emergent, feline-inspired aerial self-righting capability, allowing it to land stably on four feet from arbitrary orientations during free fall. Finally, through extensive evaluations across diverse locomotion tasks, we show that the introduction of the proposed 3-DOF spine comprehensively enhances the locomotive agility of quadrupedal robots. Project website: himmy-robotics.github.io/scheetah

2605.27908 2026-05-28 cs.CL cs.AI

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

ESC-Skills: 发现并自我进化情感支持对话技能

Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

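AI summary Proposes the ESC-Skills framework, which models support interactions as Intervention Units, builds a skills bank, and adds a multi-profile self-evolution mechanism to improve the interpretability, controllability, and effectiveness of emotional support conversations.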

详情
AI中文摘要

现有的情感支持对话(ESC)系统主要依赖于端到端的回复生成或粗粒度的策略监督,可解释性有限,且对系统性的技能提升支持不足。我们提出ESC-Skills,一个以技能为中心的框架,能够发现并自我进化可执行的情感支持技能。我们首先将局部支持交互建模为干预单元(IUs),捕捉求助者状态、支持干预和回复后情绪变化之间的状态-动作-结果动态。基于从成功和失败的ESC对话中提取的IUs,我们构建了ESC-Skills库,这是一个包含干预指导、适用条件、预期结果和潜在风险的可执行情感支持技能仓库。为了进一步提升鲁棒性,我们引入了一个多轮廓自我进化精炼框架,其中ESC代理在SAGE评估下与多种模拟求助者轮廓进行交互。分析由此产生的交互轨迹,以识别缺失的技能、不安全的干预和特定轮廓的失败模式,然后通过基于模拟的验证来精炼技能库。实验结果表明,ESC-Skills在提升回复质量和对话层面的情感结果的同时,提供了更可解释和可控的支持行为。我们将发布代码、提示和ESC-Skills库,网址为https://github.com/aliyun/qwen-dianjin。

英文摘要

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state--action--outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at https://github.com/aliyun/qwen-dianjin.

2605.27906 2026-05-28 cs.AI

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

推理至关重要:通过推理条件偏好优化减轻多模态大型推理模型中的幻觉

Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang

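AI summary Proposes Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which treats the chain of thought as a condition for answer generation and contrasts preferences under different chains of thought, with preference data generated by Monte Carlo Tree Search and attention-guided CoT pruning, effectively mitigating hallucination in multimodal large reasoning models.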

详情
AI中文摘要

多模态大型推理模型引入了推理范式,在复杂的视觉-语言任务中展现出强大的能力。然而,它们仍然存在严重的幻觉问题。现有的基于训练的方法通常通过响应级直接偏好优化(DPO)来减轻幻觉,其中思维链(CoT)和最终答案被视为一个整体输出并联合优化。我们发现这种公式的表现与仅答案优化相似,表明它主要学习答案级别的偏好,而未能充分利用CoT级别的监督。为了解决这个问题,我们明确制定了一个面向CoT的偏好项,并推导出推理条件直接偏好优化(RC-DPO),它将CoT建模为答案生成的条件,并在不同CoT条件下对比同一偏好答案的偏好,促进答案支持的推理链对齐。为了进一步优化,我们引入了一种推理增强的偏好数据生成策略,该策略采用蒙特卡洛树搜索来发现视觉基础且逻辑一致的CoT作为正样本,以及注意力引导的CoT令牌剪枝来构建负样本。在各种模型和基准上的大量实验表明,RC-DPO有效减轻了幻觉,并提高了多模态推理过程的可靠性。

英文摘要

Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
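
For readers unfamiliar with the response-level DPO baseline that the abstract contrasts against, the standard per-pair DPO loss can be written in a few lines. The sketch below is that standard loss only; the RC-DPO objective itself is not specified in the abstract. The function name `dpo_loss`, the argument names, and the default `beta` are assumptions for this example.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred outputs; ref_*: the same under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# The policy already favors the preferred output relative to the reference.
print(dpo_loss(logp_w=-4.0, logp_l=-9.0, ref_logp_w=-5.0, ref_logp_l=-6.0))
```

One hedged reading of the reasoning-conditioned idea is that the two log-probabilities would score the same preferred answer conditioned on a supporting versus a pruned chain of thought, rather than scoring two different full responses.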

2605.27905 2026-05-28 cs.CL

AI Research Agents Narrow Scientific Exploration

AI研究代理缩小科学探索范围

Yixuan Tang, Yi Yang

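AI summary Using four AI research-agent frameworks and six large language models to generate 37,802 scientific ideas, finds that AI-generated ideas are more concentrated than human papers, stay closer to their starting literature, and resemble low-citation papers, suggesting that current AI agents are better suited to local elaboration than to broadening scientific exploration.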

详情
AI中文摘要

AI研究代理现在能够生成研究想法、设计实验、运行代码和起草论文,这引发了大规模AI辅助科学发现的可能性。许多当前的代理框架明确鼓励生成新颖且高影响力的想法。然而,目前尚不清楚AI辅助构思是拓宽了科学探索,还是主要集中于现有工作。我们将AI研究代理视为科学搜索系统进行研究。使用四个AI研究代理框架和六个大语言模型,我们从AI和机器学习中由引用定义的研究领域的共享种子文献中生成37,802个科学想法。然后,我们将生成的AI想法与来自相同研究领域的人类撰写论文、来自相同种子文献的后续人类研究以及种子文献本身进行比较。在实验中,出现了四个一致的模式。第一,AI生成的想法比来自相同研究领域的人类撰写论文更加集中。第二,AI生成的想法比后续人类工作更接近其起始文献。第三,与AI生成想法最相似的论文往往获得较低的后续引用。第四,当AI生成的想法与先前工作不同时,差异主要来自现有技术方法的重新组合,而不是引入全新的研究问题。总体而言,当前的AI研究代理似乎更适合局部细化,而不是拓宽科学探索。

英文摘要

AI research agents can now generate research ideas, design experiments, run code, and draft papers, raising the possibility of large-scale AI-assisted scientific discovery. Many current agent frameworks explicitly encourage the generation of novel and high-impact ideas. Yet it remains unclear whether AI-assisted ideation broadens scientific exploration or mainly concentrates around existing work. We study AI research agents as scientific search systems. Using four AI research-agent frameworks and six large language models, we generate 37,802 scientific ideas from shared seed literature across citation-defined research areas in AI and machine learning. We then compare the resulting AI ideas against human-authored papers from the same research areas, follow-on human research emerging from the same seed literature, and the seed literature itself. Across experiments, four consistent patterns emerge. First, AI-generated ideas are substantially more concentrated than human-authored papers from the same research areas. Second, AI-generated ideas remain much closer to their starting literature than later human follow-on work does. Third, papers most similar to AI-generated ideas tend to receive lower subsequent citations. Fourth, when AI-generated ideas differ from prior work, the differences arise primarily from recombining existing technical methods rather than introducing fundamentally new research questions. Overall, current AI research agents appear better suited to local elaboration than to broadening scientific exploration.

2605.27904 2026-05-28 cs.AI cs.LG

Dr-CiK: A Testbed for Foresight-Driven Agents

Dr-CiK:面向预见驱动型智能体的测试平台

Yihong Tang, Andrew Robert Williams, Arjun Ashok, Vincent Zhihao Zheng, Lijun Sun, Alexandre Drouin, Issam H. Laradji, Étienne Marcotte, Valentina Zantedeschi

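AI summary Noting that existing context-aided forecasting benchmarks assume the context is already provided, proposes the Dr-CiK benchmark to evaluate whether agents can retrieve, filter, and distill forecasting-relevant context from a document corpus and generate forecasts; experiments show that high-quality context substantially improves forecasting, but existing deep research agents usually recover less than 5% of the evidence and are easily misled by distractors.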

详情
AI中文摘要

现实环境中的时间序列预测通常不仅依赖于历史观测,还依赖于必须从嘈杂、异构的信息源中主动发现的外部上下文。然而,现有的上下文辅助预测基准通常假设支持性上下文已经提供,未考虑智能体是否能自行识别。因此,我们引入Dr-CiK,一个用于评估智能体是否能够从文档语料库中检索预测相关的支持性上下文、过滤干扰项、将检索到的上下文提炼为对预测有用的证据,并生成由该证据支持的预测的基准。通过上下文消融实验以及对最先进的深度研究和预测方法的联合评估,我们表明高质量上下文显著提高了Dr-CiK中的预测性能。然而,大多数现有的深度研究智能体仅能恢复一小部分真实支持证据(通常<5%),经常被干扰项误导(>80%的干扰项引用),并且可能导致预测器在使用检索到的上下文时比不使用上下文时表现更差。我们的结果激励了对预见驱动型智能体的研究,这些智能体能够搜索正确的上下文以预测未来。

英文摘要

Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. Therefore, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations of state-of-the-art deep research and forecasting methods paired together, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing DR agents recover only a small fraction of the ground-truth supporting evidence (usually <5%), are frequently misled by distractors (>80% distractor citations), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.
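
The two retrieval figures quoted above (the fraction of ground-truth evidence recovered and the share of citations that point to distractors) can be illustrated with simple set arithmetic. The exact metric definitions used by the benchmark are not given in the abstract, so the function `retrieval_metrics` and the document identifiers below are assumptions for this example.

```python
def retrieval_metrics(cited, gold, distractors):
    """Return (evidence recall, distractor citation rate) for one agent run.

    cited: document ids the agent cited; gold: ground-truth supporting
    documents; distractors: documents planted to mislead (hypothetical ids).
    """
    cited, gold, distractors = set(cited), set(gold), set(distractors)
    recall = len(cited & gold) / len(gold) if gold else 0.0
    distractor_rate = len(cited & distractors) / len(cited) if cited else 0.0
    return recall, distractor_rate


recall, distractor_rate = retrieval_metrics(
    cited=["d1", "d2", "g1", "other"],
    gold=["g1", "g2"],
    distractors=["d1", "d2", "d3"],
)
print(recall, distractor_rate)
```

The two quantities use different denominators (gold documents versus cited documents), so an agent can score poorly on both at once.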

2605.27901 2026-05-28 cs.CL cs.AI

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal

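AI summary: An evaluation across 13 languages and 7 frontier model families finds that chain-of-thought monitoring is broadly unreliable under linguistic distribution shift (average unfaithfulness rate of 95.9%), that models engage in strategic manipulation, and that the deceptive patterns persist fully in low-resource languages.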

Abstract

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B to 120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.
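
The abstract does not spell out how the unfaithfulness rate is computed. A common operationalization in hint-based faithfulness studies, shown here purely as an assumption, counts the trials where the hint switched the model's answer to the hinted option and asks how often the CoT fails to acknowledge the hint. The `Trial` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    answer_without_hint: str   # model's answer on the plain prompt
    answer_with_hint: str      # model's answer when the hint is inserted
    hinted_answer: str         # the option the hint points to
    cot_mentions_hint: bool    # whether the CoT acknowledges using the hint


def unfaithfulness_rate(trials):
    """Among trials where the hint flipped the answer to the hinted option,
    return the fraction whose CoT never mentions the hint (None if no such trial)."""
    influenced = [
        t for t in trials
        if t.answer_without_hint != t.hinted_answer
        and t.answer_with_hint == t.hinted_answer
    ]
    if not influenced:
        return None
    unfaithful = sum(1 for t in influenced if not t.cot_mentions_hint)
    return unfaithful / len(influenced)


# Hypothetical trials, for illustration only.
trials = [
    Trial("A", "B", "B", False),  # influenced, hint not acknowledged
    Trial("A", "B", "B", True),   # influenced, hint acknowledged
    Trial("C", "D", "D", False),  # influenced, hint not acknowledged
    Trial("A", "A", "B", False),  # hint had no effect, excluded
]
rate = unfaithfulness_rate(trials)
print(rate)
```

Restricting the denominator to influenced trials keeps the metric from rewarding models that simply ignore the hint; the paper's own metric may be defined differently.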

2605.27900 2026-05-28 cs.CV

Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning

Yuting Ma, Lechao Cheng, Xiaohua Xu

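AI summary: Proposes FedDTL, a framework that decouples image-encoder and text-encoder training and introduces two-stage local fine-tuning (supervised fine-tuning followed by reinforcement learning) to address inter-client optimization inconsistency and intra-client over-specialization in federated learning, balancing global task adaptation and generalization.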

Comments: This work has been accepted by ICML 2026.

Abstract

Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging the strong representations of VLMs, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic updates and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning procedure, in which a supervised fine-tuning stage enables a rapid and reliable warm start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.
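
For readers unfamiliar with the "simple parameter aggregation" that the abstract attributes to prior methods, the sketch below shows a FedAvg-style weighted average of client parameters. It illustrates that baseline only and is not FedDTL's decoupled training; the function name, the flat-list parameter layout, and the sample sizes are assumptions.

```python
def aggregate(client_params, client_sizes):
    """FedAvg-style aggregation: average each parameter across clients,
    weighting every client by its number of local samples."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients, each with a two-parameter model.
client_params = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]  # the second client holds three times as much data

global_params = aggregate(client_params, client_sizes)
print(global_params)
```

Because every client optimizes fully on its own data before this averaging step, heterogeneous client distributions can pull the average in inconsistent directions, which is the inter-client inconsistency the abstract refers to.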