arXivDaily: arXiv daily digest, updated Monday through Friday
2605.24012 2026-05-26 cs.CV

Deep Learning-Based Automated Quantification of TIMI Myocardial Perfusion Frame Count (DL-TMPFC) from Coronary Angiography: A Novel Framework for Rapid Assessment of Microvascular Dysfunction

Si Li, Yuanqing He, Chenkai Hu, Xiaogang Guo, Huay-Cheem Tan, Chieh Yang Koo, Xuan Zhang, Lei He, Jingyuan Zeng, Shan Xiao

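AI summary Proposes the DL-TMPFC framework, which combines a stenosis detection network with a territory-aware segmentation network to compute the TIMI myocardial perfusion frame count automatically from coronary angiography, giving an objective measure of microvascular dysfunction; validation shows close agreement with expert manual measurements.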

Comments 15 pages, 8 figures

Abstract

Aims: Coronary microvascular dysfunction (CMVD) affects approximately 40%-60% of patients with ischemia and non-obstructive coronary arteries, yet diagnosis remains challenging due to reliance on invasive functional testing or subjective Thrombolysis In Myocardial Infarction (TIMI) flow grade. The TIMI Myocardial Perfusion Frame Count (TMPFC) offers an objective, angiography-based quantitative measure of CMVD, but its clinical translation is hindered by cumbersome manual calculation and insufficient validation. This study aims to develop and validate a deep learning-powered TMPFC calculation (DL-TMPFC), enabling integration into clinical workflows. Methods and results: The DL-TMPFC framework comprised two components. A stenosis detection network first excluded obstructive coronary artery disease (CAD). A territory-aware segmentation network then identified perfusion territories, and a TMPFC calculation module automatically determined the first and last frames from angiographic sequences. The framework was validated in a cohort of 655 patients (445 with obstructive CAD, 100 with confirmed CMVD, and 110 controls) from three independent institutions. DL-TMPFC showed excellent agreement with expert manual measurements (bias: -0.93 frames; 95% limits of agreement (LoA): -5.33 to +3.47; r = 0.98). DL-TMPFC markedly enhanced clinical feasibility by fully automating TMPFC and removing observer dependence. Clinically, DL-TMPFC accurately identified CMVD across a full spectrum of coronary pathologies and captured the continuous severity of CMVD beyond binary classification, enabling quantitative risk stratification. Conclusion: DL-TMPFC enabled automatic, standardized, and accurate quantification of CMVD directly from routine angiography. By providing an automatic and objective measure, this tool provided immediate diagnostic information for timely recognition and management of CMVD in clinical practice.
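
The agreement figures in this abstract (bias and 95% limits of agreement) are the usual outputs of a Bland-Altman analysis. The Python sketch below shows how such figures are typically computed from paired measurements. It is a minimal illustration, not the authors' code: the function name `bland_altman`, the sign convention (automated minus manual), and the frame counts are all assumptions made for the example.

```python
from statistics import mean, stdev


def bland_altman(auto, manual):
    """Return (bias, lower_loa, upper_loa) for paired measurements.

    Assumed convention: difference = automated - manual, and the 95% limits
    of agreement are bias +/- 1.96 * sample standard deviation of differences.
    """
    if len(auto) != len(manual) or len(auto) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


# Hypothetical frame counts, invented purely for illustration.
auto_frames = [88, 95, 102, 110, 121, 134]
manual_frames = [90, 95, 104, 109, 123, 135]

bias, lower, upper = bland_altman(auto_frames, manual_frames)
print(f"bias={bias:.2f} frames, 95% LoA=({lower:.2f}, {upper:.2f})")
```

If the paper follows the same 1.96-standard-deviation convention, the reported limits (-5.33 to +3.47 around a bias of -0.93) would correspond to a standard deviation of the differences of roughly 2.24 frames.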

2605.24008 2026-05-26 cs.LG cs.CV cs.SE

CAFD: Concept-Aware DNN Fault Detection using VLMs

Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand

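AI summary Proposes Concept-Aware Fault Detection (CAFD), which integrates model signals, distance features, and a Vision-Language-Model-based Concept Failure Ratio (CFR) feature to markedly improve DNN fault detection while remaining efficient.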

Abstract

Fault detection for Deep Neural Networks (DNNs) has received increasing attention in recent years. While more advanced hybrid approaches have been proposed to combine multiple sources of information and outperform earlier techniques, they often incur substantial computational overhead, limiting scalability and practicality in real-world settings. In this paper, we introduce Concept-Aware Fault Detection (CAFD), a learning-based approach that achieves superior fault detection performance by effectively integrating multiple information sources while maintaining practical efficiency. Specifically, CAFD is trained using a carefully selected set of informative features, including model-based signals derived from the DNN's outputs, distance-based features, and a novel concept-based feature, called Concept Failure Ratio (CFR). CFR leverages Vision-Language Models (VLMs) to extract textual concepts from images and quantify the likelihood that their presence is associated with DNN failures. By incorporating this feature, CAFD benefits from complementary semantic information, enabling more effective fault detection. Our results demonstrate that CFR serves as an effective indicator for DNN fault detection. We conduct an extensive empirical evaluation of CAFD, comparing it against five state-of-the-art baselines across three subject DNN models and datasets, including ImageNet. Across a wide range of constrained selection budgets, CAFD consistently outperforms all baselines in Fault Detection Rate (FDR), achieving average FDR improvements of 18.3% across all investigated subjects and budget sizes.
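
The abstract does not give the formula for the Concept Failure Ratio, so the Python sketch below is only one plausible reading of the idea: for each textual concept, the share of previously seen images containing it on which the DNN failed. The concept sets stand in for VLM output, and the function names, the averaging rule in `image_score`, and the data are all hypothetical.

```python
from collections import defaultdict


def concept_failure_ratios(samples):
    """samples: iterable of (concepts, failed) pairs.

    Returns {concept: failures_with_concept / images_with_concept}.
    This ratio is an assumed definition, not taken from the paper.
    """
    seen = defaultdict(int)
    failures = defaultdict(int)
    for concepts, failed in samples:
        for concept in set(concepts):
            seen[concept] += 1
            if failed:
                failures[concept] += 1
    return {concept: failures[concept] / seen[concept] for concept in seen}


def image_score(concepts, ratios, default=0.0):
    """Average the ratios of an image's concepts; unseen concepts get `default`."""
    concepts = set(concepts)
    if not concepts:
        return default
    return sum(ratios.get(c, default) for c in concepts) / len(concepts)


# Hypothetical history: concept sets (as a VLM might return) and DNN outcomes.
history = [
    ({"dog", "snow"}, True),
    ({"dog", "grass"}, False),
    ({"snow", "car"}, True),
    ({"car", "road"}, False),
]
ratios = concept_failure_ratios(history)
score = image_score({"dog", "snow"}, ratios)
print(ratios["snow"], ratios["dog"], score)
```

A score of this kind could then be fed to a learned detector alongside other features, which is the role the abstract describes for CFR.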

2605.24004 2026-05-26 cs.AI cs.CV cs.LG cs.RO

Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo, Jiabin Liu

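AI summary Proposes Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification; under the CARLA point-goal protocol it achieves 80.05% route completion, a 51.10% arrival rate, and a 0.20% collision rate.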

Comments Accepted by the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). 8 pages, 2 figures

Abstract

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.
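
The propose, imagine, select loop described in this abstract can be illustrated with a toy Python sketch. Everything here is an assumption made for the example: the LLM proposal is replaced by a fixed list of candidate accelerations, the world model by constant-acceleration kinematics behind a lead vehicle, and the safety score by the minimum imagined gap. The released RIA code should be consulted for the actual method.

```python
def rollout(gap, ego_speed, lead_speed, accel, horizon=5, dt=0.5):
    """Toy stand-in for a world model: minimum gap (m) over a short horizon."""
    min_gap = gap
    for _ in range(horizon):
        ego_speed = max(0.0, ego_speed + accel * dt)
        gap += (lead_speed - ego_speed) * dt
        min_gap = min(min_gap, gap)
    return min_gap


def select_safest(candidates, gap, ego_speed, lead_speed):
    """Imagine each candidate action and return (action, score) with the largest minimum gap."""
    scored = [(rollout(gap, ego_speed, lead_speed, a), a) for a in candidates]
    best_gap, best_action = max(scored)  # compares gap first, then acceleration
    return best_action, best_gap


# Hypothetical step: candidate accelerations (m/s^2) stand in for LLM sub-actions.
candidates = [-3.0, 0.0, 2.0]
action, score = select_safest(candidates, gap=10.0, ego_speed=10.0, lead_speed=5.0)
print(action, score)
```

In a closed loop, the selected action and its score would be passed back as context for the next reasoning step.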

2605.24000 2026-05-26 cs.CL

Toxicity in Twitch Chats: An LLM-Based Analysis Across Gaming Communities

Ronja Fuchs, Florian Rupp, Timo Bertram, Kai Eckert, Alexander Dockhorn

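AI summary Uses a pre-trained large language model for zero-shot classification of about 20 million chat messages from 4,452 Twitch streams and finds that 2.4% of messages are toxic; MOBA games show the highest toxicity (3.2%) and sports games the lowest (2%), and toxicity distributions differ significantly between games, suggesting game-specific community norms.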

Comments 8 pages, 2 figures, 5 tables. Accepted at the IEEE Conference on Games (IEEE CoG) 2026

Abstract

Toxicity in online gaming communities remains a persistent challenge, manifesting across genres, platforms, and player interactions. While much research is focused on in-game toxicity, less is known about how toxic behavior varies between gaming communities on streaming platforms. To address this shortcoming, we analyze approximately 20 million chat messages from 4,452 streams, spanning seven game genres on Twitch. We categorize messages according to Twitch's toxicity taxonomy with a pre-trained Large Language Model using zero-shot classification. The taxonomy comprises four categories and eight subclasses, including harassment, discrimination, sexual content, and profanity. Our approach achieves an F1 score of 94.5% on the TextDetox dataset and demonstrates human-model agreement comparable to inter-human agreement. Our analysis reveals that 2.4% of all messages are classified as toxic, with notable differences across genres: streams of MOBA games exhibit the highest relative rate of toxicity (3.2%), and sports games show the lowest rate (2%). Furthermore, results indicate that individual games differ significantly in their toxicity distributions, even within genres, suggesting the existence of game-specific community norms and mechanics that shape toxic behavior beyond genre-level effects. These findings offer empirical insights into genre- and game-specific toxicity patterns on Twitch and can inform more targeted moderation strategies for gaming communities.

2605.23997 2026-05-26 cs.CV cs.AI cs.LG

IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren, Fuxiang Wu, Quan Chen, Liu Liu

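AI summary Proposes the IVR-R1 framework, which uses a reward-driven screening mechanism and an iterative re-reasoning loop to correct multimodal reasoning trajectories dynamically during reinforcement learning, targeting visual hallucination and logical errors.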

Abstract

Multimodal large language models trained via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucination and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework in which dynamic visual re-alignment actively rectifies reasoning trajectories to guide policy optimization. Specifically, by leveraging a reward-driven screening mechanism to identify flawed rollouts, IVR-R1 executes a fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.

2605.23996 2026-05-26 cs.CV eess.IV

Brain-to-Image Retrieval and Reconstruction via Multimodal EEG Alignment

Chi Kit Wong, Yan Liu, Haowen Yan

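AI summary Proposes a brain-to-image system that decodes visual stimuli from EEG recorded during natural image viewing through multimodal EEG alignment, reaching 86.30% Top-1 accuracy on retrieval and a CLIP score of 0.903 on reconstruction.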

Comments 16 pages, 5 figures. Code available at: https://github.com/Chikit-WONG/DL_Project/

Abstract

We present a brain-to-image system that decodes visual stimuli from EEG signals recorded during natural image viewing. Our system addresses two tasks: (1) EEG-to-image retrieval, which ranks the correct stimulus image among 200 candidates given an EEG segment, and (2) EEG-to-image reconstruction, which generates an image consistent with the perceived stimulus. For retrieval, we implement a multi-level blurring approach improved with biologically inspired EVNet features and trained with the InfoNCE loss. Evaluated over 10 random seeds for a single subject, the retrieval model achieves a mean final-epoch Top-1 accuracy of 86.30% and Top-5 accuracy of 98.55%. For reconstruction, we implement CognitionCapturerPro, which aligns EEG representations to multi-modal CLIP embeddings, including image, text, depth, and edge embeddings, and synthesizes images with SDXL-Turbo conditioned via IP-Adapter. Averaged over 10 seeds, the reconstruction model achieves a CLIP score of 0.903 using ViT-H-14, a CLIP score of 0.870 using ViT-L/14, and an SSIM of 0.409. These results demonstrate the feasibility of decoding rich visual representations from EEG signals using modern multi-modal alignment and generative modeling techniques.
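
The retrieval setup in this abstract pairs each EEG segment with its stimulus image, trains with the InfoNCE loss, and reports Top-k accuracy. The Python sketch below illustrates both quantities on toy embeddings using only the standard library. It assumes a one-directional (EEG to image) loss over cosine similarities with a temperature of 0.1; the paper's exact formulation, encoder, and temperature are not stated in the abstract, and the vectors are invented.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(eeg_embs, img_embs, temperature=0.1):
    """Mean cross-entropy of matching EEG row i to image row i (cosine logits)."""
    eeg = [normalize(v) for v in eeg_embs]
    img = [normalize(v) for v in img_embs]
    total = 0.0
    for i, e in enumerate(eeg):
        logits = [dot(e, m) / temperature for m in img]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(eeg)


def top_k_accuracy(eeg_embs, img_embs, k=1):
    """Share of EEG rows whose paired image is among the k most similar images."""
    eeg = [normalize(v) for v in eeg_embs]
    img = [normalize(v) for v in img_embs]
    hits = 0
    for i, e in enumerate(eeg):
        sims = [dot(e, m) for m in img]
        ranked = sorted(range(len(img)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(eeg)


# Toy 2-D embeddings, invented for illustration; row i of each list is a pair.
eeg_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
img_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.7]]
loss = info_nce(eeg_embs, img_embs)
top1 = top_k_accuracy(eeg_embs, img_embs, k=1)
print(loss, top1)
```

In the paper's evaluation the candidate pool has 200 images per query rather than the three used here.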

2605.23994 2026-05-26 cs.CV cs.AI

RAW: Robust Avatar Watermarking -- Benchmarking and Baseline

Jack Parry, Jack Saunders, Vinay Namboodiri

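AI summary To address post-processing attacks on digital avatar watermarks, proposes the RAW benchmark and WALT, a method that embeds watermarks in UV texture space via 3D face reconstruction, reaching 92.4% robustness under zoom attacks and 95.6% under background removal.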

Abstract

Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and format conversion before deployment. We introduce RAW (Robust Avatar Watermarking), a benchmark comprising 50 synthetic avatar videos from 5 commercial providers and 6 attacks simulating real-world avatar workflows. Evaluating 7 existing methods reveals that avatar-specific attacks such as background removal significantly degrade watermark recovery. We propose WALT (Watermarking Avatars with Learned Textures), which embeds watermarks in UV texture space via 3D face reconstruction. WALT achieves the highest robustness to zoom attacks (92.4%) while maintaining strong performance on background removal (95.6%). We release our benchmark to facilitate research into avatar-specific watermarking.

2605.23992 2026-05-26 cs.CV cs.AI

A World Model of Radiologist Reading for Medical Image Representation Learning

Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao

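AI summary Proposes GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory; by autoregressively predicting fixated-patch representations and spatially completing unvisited regions, it achieves state-of-the-art diagnostic accuracy and zero-shot performance on multiple benchmarks.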

Abstract

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

2605.23989 2026-05-26 cs.AI cs.CL cs.CR

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu

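AI summary Surveys agentic AI systems along two core dimensions (safety and robustness; privacy and system security), covering sources of risk, stage-targeted mitigation strategies, and unified evaluation metrics, and discusses open challenges.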

Comments 36 pages, 4 figures. Survey/review article on trustworthy agentic AI. Published in Academia AI and Applications, 2026

Journal ref Academia AI and Applications, vol. 2, 2026

Abstract

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

2605.23987 2026-05-26 cs.AI cs.RO

Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

Hong Su

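AI summary Because autonomous robots in open environments cannot rely on predefined learning objects, proposes a thinking-learning interaction model with a bidirectional mechanism: thinking guides learning (identifying changes, selecting evidence, organizing training, planning verification) and learning promotes thinking (updating knowledge, experience, strategies, and reasoning). The model supports input feature discovery, output category expansion, model update, and action routine reconstruction, and experiments validate its effectiveness on feature adaptation, new-category formation, model update, and action optimization.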

Abstract

Autonomous robots operating in open and changing environments cannot always rely on predefined inputs, outputs, and action routines. Although existing learning methods enable robots to improve their performance through environmental interaction, the objects of learning are often fixed in advance, such as input features, recognition outputs, network structures, task goals, or action sequences. This limits their ability to adapt when new features, new categories, or more efficient task routines appear during long-term operation. To address this problem, this paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Based on this bidirectional mechanism, the robot can gradually move beyond predefined learning settings and adapt its recognition relations and action relations through continuous interaction with the environment. Specifically, the proposed model supports adaptive input feature discovery, output category expansion, learning model update, and action routine reconstruction. Experimental results show that the proposed model improves the final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rate, and reduces the average action length from 13.0 to 4.0 in action routine reconstruction. In learning-enhanced thinking, the useful evidence selection rate increases from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

2605.23984 2026-05-26 cs.LG cs.AI cs.CV

Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection

Heqiang Wang, Weihong Yang, Zheyuan Yang, Jia Zhou, Xiaoxiong Zhong, Fangming Liu, Weizhe Zhang

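AI summary For industrial anomaly detection with distributed, continuously generated data, proposes a multimodal online distributed industrial anomaly detection framework that coordinates model updates through a multi-class intelligent scheduling problem solved by a sequential marginal gain greedy algorithm, and reduces system overhead with a resource-efficient class-wise low-rank adaptation strategy; it achieves superior performance on the MVTec 3D-AD and Eyecandies datasets.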

Abstract

Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. The rapid advancement of heterogeneous industrial sensors has driven industrial anomaly detection from unimodal to multimodal paradigms. However, existing methods are primarily designed for centralized and offline settings, overlooking the distributed and continuously generated data characteristic of real-world industrial environments. With the advancement of edge intelligence, modern edge devices are increasingly capable of not only data acquisition but also distributed model training, enabling collaborative intelligence across the system. Industrial anomaly detection represents a critical application in this context. Motivated by these challenges, we propose a novel framework termed Multimodal Online Distributed Industrial Anomaly Detection (MODIAD). We first present a comprehensive workflow for MODIAD and then formulate a Multi-class Intelligent Scheduling (MIS) problem to coordinate cross-class model updates by balancing data sufficiency and class update frequency. To efficiently solve this problem, we design a Sequential Marginal Gain Greedy (SMG) algorithm that enables effective multi-class training under resource constraints. Furthermore, to improve the computational and communication efficiency during training, we propose a Resource Efficient Class-Wise Low Rank Adaptation (REC-LoRA) strategy, which significantly reduces system overhead while preserving detection performance. Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, demonstrate that the proposed approach achieves superior performance and efficiency under the MODIAD scenario.
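
The abstract does not specify the MIS objective or the SMG algorithm, so the Python sketch below only illustrates the general shape of a budgeted greedy scheduler that trades off data sufficiency against update frequency. It is a generic gain-per-cost greedy, not the paper's SMG: the gain function (log of buffered samples plus a staleness bonus), the cost model, the parameter `staleness_weight`, and the class names are all hypothetical.

```python
import math


def greedy_schedule(classes, budget, staleness_weight=0.5):
    """Pick classes to update this round under a resource budget.

    classes: {name: (buffered_samples, rounds_since_update, cost)}.
    Repeatedly adds the class with the best assumed gain per unit cost
    among those that still fit in the remaining budget.
    """
    def gain(samples, stale):
        # Assumed utility: diminishing returns in data, linear in staleness.
        return math.log1p(samples) + staleness_weight * stale

    remaining = dict(classes)
    chosen, spent = [], 0.0
    while remaining:
        affordable = {n: v for n, v in remaining.items() if spent + v[2] <= budget}
        if not affordable:
            break
        best = max(affordable, key=lambda n: gain(*affordable[n][:2]) / affordable[n][2])
        chosen.append(best)
        spent += remaining.pop(best)[2]
    return chosen, spent


# Hypothetical per-class state: (buffered samples, rounds since update, cost).
state = {
    "class_a": (120, 1, 4.0),
    "class_b": (15, 6, 2.0),
    "class_c": (300, 0, 6.0),
}
chosen, spent = greedy_schedule(state, budget=7.0)
print(chosen, spent)
```

Here the stale, cheap class is scheduled first and the data-rich but expensive class is deferred, which is the kind of balance the abstract describes.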

2605.23983 2026-05-26 cs.AI cs.LO cs.SI

Saturating Scaling Laws for Equational Discovery: A Phenomenology of Growth Dynamics in Three Toy Substrates with Two Real-World Replications

Fabio Rovai

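AI summary Studies growth dynamics in deterministic equational discovery substrates, proposes a saturating power-law growth model, and validates its substrate-conditional behavior on toy domains and real-world data.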

Comments 17 pages, 5 figures, 4 tables, 2 algorithms. Code and data at https://github.com/fabio-rovai/saturating-scaling-laws (currently private; will be made public on acceptance)

Abstract

We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher-order list; n=592 trajectories), short-range substrate sizes fit a power-law N(t) proportional to t^b. Within each substrate b is architecture-sensitive (cross-validated R^2 approximately 0.82); the regression does not transfer across substrates (arith+bool to list yields R^2 approximately -0.84). A heuristic mean-field closure model predicts a saturating power-law dN/dt = K N^k exp(-mu N) of which the pure power-law is the short-range approximation. Three robustness checks: bootstrap intervals on (k, mu) are tight in 4/5 toy trajectories and degenerate in 1/5; out-of-sample forecasting on toy data (fit first 100 epochs, predict next 400) is won by pure power-law 5/5, indicating the toy trajectories do not reach saturation; on two real-world growth proxies the result splits. New Mathlib/*.lean file additions per month (mathlib4, 60 months, 9701 files) support the saturating form on OOS forecasting by approximately 7x over pure power-law; Coq mathcomp monthly commits (129 months, 3083 commits) favour pure power-law on both tests with mu collapsing to zero. The dynamics are substrate-conditional at two levels: within-substrate architecture-to-b regressions do not transfer, and the preferred functional family for N(t) itself (pure vs. saturating power-law) differs by substrate. We propose "saturating power-law growth with substrate-conditional (k, mu), observable when the substrate has reached its saturation regime" as a working framing.
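
The saturating form dN/dt = K N^k exp(-mu N) quoted in this abstract can be explored numerically. The Python sketch below integrates it with a forward-Euler step. The parameter values, step size, and initial size are arbitrary choices for illustration, not values from the paper. With mu = 0 and k < 1 the equation has the closed form N(t) = (N0^(1-k) + (1-k) K t)^(1/(1-k)), which grows like t^(1/(1-k)) at large t; this is offered as one way to read the link to the short-range exponent b, and the paper's own derivation should be checked.

```python
import math


def integrate_growth(K, k, mu, n0=1.0, t_end=100.0, dt=0.01):
    """Forward-Euler integration of dN/dt = K * N**k * exp(-mu * N)."""
    steps = int(round(t_end / dt))
    n = n0
    for _ in range(steps):
        n += dt * K * n**k * math.exp(-mu * n)
    return n


# Arbitrary illustrative parameters.
pure = integrate_growth(K=1.0, k=0.5, mu=0.0)          # pure power-law limit
saturating = integrate_growth(K=1.0, k=0.5, mu=0.01)   # growth damped by exp(-mu * N)
closed_form = (1.0 + 0.5 * 100.0) ** 2                 # exact value for mu = 0, k = 0.5

print(pure, closed_form, saturating)
```

For any mu > 0 the damped trajectory stays below the pure power-law one, and the two are hard to tell apart while mu * N is small, which is consistent with the abstract's remark that the toy trajectories do not reach saturation.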

2605.23982 2026-05-26 cs.SD

PiAnnotate: A Web Annotation Tool for Piano Fingering, with a Diagnostic Probe

Joonhyung Bae, Kirak Kim, Hyeyoon Cho, Sein Lee, Yoon-Seok Choi, Hyeon Hur, Gyubin Lee, Akira Maezawa, Jonghwa Park, Jaebum Park, Juhan Nam

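AI summary Proposes PiAnnotate, a web-based annotation tool for piano fingering that combines a piano-roll view, performance video, and a 3D hand mesh, keeps paired rule-generated and human-edited fingering tracks to make annotation auditable, and trains a small Transformer probe to learn structure from the edited labels.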

Abstract

Piano fingering shapes how a passage can be played, yet it is difficult to label after a performance. An annotator must decide which finger produced each note while reconciling the score, timing, video, and hand motion. We present PiAnnotate, a web-based pipeline for adding expert fingering annotations to the FurElise performance dataset. The tool brings together a piano-roll view, performance video, and a 3D MANO hand mesh so that reviewers can inspect each assignment in musical and physical context. Rather than storing only the final answer, PiAnnotate keeps paired rule-based and human-edited fingering tracks. These paired tracks make the annotation history auditable by showing where a geometric rule was sufficient, where experts intervened, and how labels changed across review passes. As a final diagnostic, we train a small Transformer probe on the paired tracks. The probe improves on the rule baseline on held-out pieces while remaining conservative about changing labels that were already correct, suggesting that the edited labels contain learnable structure rather than only isolated fixes.

2605.23978 2026-05-26 cs.LG econ.EM q-fin.ST q-fin.TR

Algometrics: Forecasting Under Algorithmic Feedback

Marc Schmitt

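AI summary Proposes the algometrics framework for studying feedback in which forecasting algorithms affect the data used to evaluate them, proves that deployment risk cannot be identified from historical data alone, and gives an estimation method.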

Abstract

In algorithmic markets, predictive models become part of the data-generating process they aim to forecast. Once their outputs are converted into trades, allocations, execution schedules, or risk controls, they change the future data on which they are evaluated. I introduce algometrics, a framework for time series whose evolution depends on the predictive algorithms forecasting them. The framework distinguishes historical risk, measured under passive forecasting, from deployment risk, measured when forecasts drive actions. I prove three results. First, deployment risk is not identifiable from passive historical data alone: even in a one-step linear feedback model, infinitely many algorithm-mediated environments induce the same historical law while implying different deployment risks for the same forecaster. Second, historical model rankings can invert under crowding, so a predictor with lower passive error can have higher deployment error once similar algorithms are adopted. Third, randomized or instrumented actions identify short-horizon linear feedback, and I derive a finite-sample bound for deployment-risk estimation. These results suggest that time-series benchmarks in algorithmic markets should report feedback sensitivity alongside predictive accuracy.
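
The non-identifiability claim in this abstract can be illustrated with a small simulation. The Python sketch below assumes a specific one-step linear feedback model, y[t+1] = a*y[t] + gamma*action + noise, where the action equals the forecast when the forecaster is deployed and is zero under passive forecasting. The model form, the parameter names `a` and `gamma`, and all values are assumptions for illustration and are not taken from the paper.

```python
import random


def simulate(a, gamma, deployed, steps=5000, seed=0):
    """Mean squared error of the forecaster f(y) = a*y in a toy feedback model.

    y[t+1] = a*y[t] + gamma*action + noise, with action = forecast if
    `deployed` is True and 0.0 otherwise (passive forecasting).
    """
    rng = random.Random(seed)
    y, squared_error = 0.0, 0.0
    for _ in range(steps):
        forecast = a * y
        action = forecast if deployed else 0.0
        y_next = a * y + gamma * action + rng.gauss(0.0, 1.0)
        squared_error += (y_next - forecast) ** 2
        y = y_next
    return squared_error / steps


# Two environments that differ only in the feedback strength gamma.
historical_weak = simulate(a=0.5, gamma=0.0, deployed=False)
historical_strong = simulate(a=0.5, gamma=0.8, deployed=False)
deployed_weak = simulate(a=0.5, gamma=0.0, deployed=True)
deployed_strong = simulate(a=0.5, gamma=0.8, deployed=True)

print(historical_weak, historical_strong, deployed_weak, deployed_strong)
```

Under passive forecasting the two environments generate identical data, so the historical risk cannot distinguish them, while the deployment risk differs once forecasts drive actions.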

2605.23977 2026-05-26 cs.CL cs.SD eess.AS

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

临床访谈抑郁症检测基准的多探针审计

Takehiro Ishikawa, Jon Duke

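AI summary Audits clinical-interview depression detection benchmarks with four complementary probes, finding flawed evaluation protocols, unreliable leaderboards, weak cross-domain generalization, and differing sensitivity of the text and audio modalities to symptom density.

Details
AI Chinese abstract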

本文通过四个互补探针对 DAIC/E-DAIC、CMDC、ANDROIDS、MODMA 和 PDCH 中的临床访谈抑郁症检测基准评估进行审计。首先,我们在严格的受试者不相交留一受试者交叉验证下重新评估 E-DAIC。一个轻量级混合文本加 LLM 评分模型达到了 macro-F1 = 0.723——据我们所知,这是该协议下报告的最高值——提供了一个不依赖特权官方保留集的保守出折参考点。其次,我们通过扫描 96 种跨模态组合、池化策略和学习器的模型配置,测试 E-DAIC 官方划分是否支持细粒度排行榜排名。开发侧交叉验证与官方测试排名仅中等程度对齐:最佳交叉验证配置在官方测试中排名第 20,官方测试获胜者按交叉验证排名第 41,前三名重叠为零,且表观获胜者在仅 32.3% 的受试者自举中排名第一。第三,我们外部验证了强大的公开 CMDC 和 ANDROIDS 基线,这些基线在域内实现了接近天花板的表现。到外部语料库的零样本迁移明显较弱。最后,我们使用基于 SRDS 的标注器定义的症状密集与症状稀疏的配对访谈片段,对 E-DAIC 文本和音频模型进行压力测试。文本分数在症状密集片段上急剧上升,而音频分数几乎持平;文本减音频的差距在所有五个种子上均为正。

English abstract

This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 (the highest reported under this protocol, to our knowledge), providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
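
The "rank-1 in only 32.3% of subject bootstraps" figure comes from a subject-level bootstrap over configurations. A minimal sketch of that kind of check follows; it assumes per-subject correctness flags and plain accuracy as the metric (the paper reports macro-F1), and the function name and tie handling are hypothetical choices made for illustration.

```python
import random


def rank1_frequency(correct, n_boot=1000, seed=0):
    """correct[c][s] is 1 if configuration c classified subject s correctly, else 0.

    Returns, for each configuration, the fraction of subject-level bootstrap
    resamples in which it attains the (possibly tied) best accuracy.
    """
    rng = random.Random(seed)
    n_cfg, n_subj = len(correct), len(correct[0])
    wins = [0] * n_cfg
    for _ in range(n_boot):
        # Resample subjects with replacement; all configurations share the sample.
        sample = [rng.randrange(n_subj) for _ in range(n_subj)]
        acc = [sum(row[s] for s in sample) / n_subj for row in correct]
        best = max(acc)
        for c, a in enumerate(acc):
            if a == best:
                wins[c] += 1
    return [w / n_boot for w in wins]
```

A configuration that wins on the full test set but has a low rank-1 frequency under resampling is a winner only by a margin that the subject sample cannot resolve.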

2605.23975 2026-05-26 cs.CL cs.SD

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

面向音频大语言模型中英双语码转换语音识别的直接偏好优化

Trung Nguyen Quang, Cheng Yi Lewis Won, Minh Duc Pham, Yingxu He, Shuo Sun, Ai Ti Aw

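AI summary Addresses the systematic failures of Audio LLMs in transcribing English-Mandarin code-switching speech by aligning them with Direct Preference Optimization (DPO) on preference pairs (chosen responses preserve mixed-language content, rejected responses mimic failure patterns), reducing MER by up to 89.6% in-distribution and 20.0% out-of-distribution.

Details
AI Chinese abstract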

音频大语言模型(Audio LLMs)尽管具有强大的多语言能力,但在转录码转换语音时表现出系统性失败。聚焦英中双语,我们识别出三种失败模式:语言省略、翻译替代转录和幻觉。我们应用直接偏好优化(DPO)来对齐模型,构建偏好对,其中选择响应保留混合语言内容,而拒绝响应模仿失败模式。在100K对(570小时)上训练三个Audio LLMs,我们观察到一致的行为转变:模型学会在提示转录时保留语言组成而非翻译。这种对齐使得词错误率降低高达89.6%(分布内)和20.0%(分布外)。我们的发现表明,DPO可以有效地从多语言Audio LLMs中引出正确的码转换转录行为。

English abstract

Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission, translation-instead-of-transcription, and hallucination. We apply Direct Preference Optimization (DPO) to align models, constructing preference pairs in which chosen responses preserve mixed-language content while rejected responses mimic failure patterns. Training three Audio LLMs on 100K pairs (570 hours), we observe consistent behavioral shifts: models learn to preserve language composition rather than translating when prompted for transcription. This alignment yields MER reductions up to 89.6% (in-distribution) and 20.0% (out-of-distribution). Our findings suggest DPO can effectively elicit correct code-switching transcription behavior from multilingual Audio LLMs.
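
For readers unfamiliar with DPO, the standard per-pair objective can be sketched as below. The inputs are sequence log-probabilities of the chosen and rejected transcripts under the policy and under a frozen reference model; the value beta=0.1 and the function name are assumptions for illustration, and nothing here reflects the authors' training code.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference the loss is log 2; it falls as the policy raises the chosen (mixed-language) transcript relative to the rejected one.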

2605.23974 2026-05-26 cs.CL

AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

AERIC:面向隐式有害对话的预期性隐藏状态监控

Jihyung Park, Saleh Afroogh, Junfeng Jiao

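AI summary Proposes AERIC, which monitors the generator's hidden states in the same pass and combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring in a lightweight linear monitor (only 387 trainable head parameters) to detect implicit harmful dialogue early, improving AUROC while adding only a small latency overhead.

Details
AI Chinese abstract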

当前语言模型带来两个安全挑战:必须足够早地检测风险以避免暴露有害的延续,且危害本身可能是隐式的而非通过明显的有毒文本信号。现有的响应级防护在评判完整文本方面表现强劲,原生流式防护则更接近令牌时间,但两者都未解决轻量级监控器能否从生成器自身的内部轨迹预测隐式有害漂移的问题。我们研究预期性同路径监控,其中安全监控器可以读取正常解码过程中产生的隐藏状态,但不能调用通过基础模型的额外前向传播。我们引入AERIC,一种面向隐式有害对话的迁移导向隐藏状态方法,结合短期危害预测、支持敏感抑制和提示条件残差评分,并采用同路径指数移动平均决策规则。默认线性监控器仅包含387个可训练头部参数。在平衡基准测试中,与Qwen3GuardStream-4B相比,AERIC在DiaSafety上将AUROC从0.6830提升至0.7143,在Harmful Advice上从0.8219提升至0.8582。对于提示级触发基准测试,我们通过源端安全预算规则校准AERIC阈值,该规则在将安全触发率限制在最多10%的同时最大化触发覆盖率。在该规则下,对于Qwen和Gemma,trigger@64在HarmBench DirectRequest上分别达到0.6438和0.4656,在SocialHarmBench上分别达到0.6849和0.7363,平均保留23.53至41.86个回答令牌。同路径部署也很高效:在Qwen3-8B下,针对HarmBench DirectRequest和SocialHarmBench聚合的63个提示有害提示固定生成基准测试中,监控器仅使平均延迟增加2.34%,而Qwen3Guard-Stream-4B使其增加79.40%。

English abstract

Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator's own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model. We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor contains only 387 trainable head parameters. Against Qwen3Guard-Stream-4B on balanced benchmarks, AERIC improves AUROC from 0.6830 to 0.7143 on DiaSafety and from 0.8219 to 0.8582 on Harmful Advice. For prompt-level trigger benchmarks, we calibrate the AERIC threshold by a source-side safe-budget rule that maximizes trigger coverage while constraining the safe-trigger rate to at most 10%. Under that rule, trigger@64 reaches 0.6438 and 0.4656 on HarmBench DirectRequest and 0.6849 and 0.7363 on SocialHarmBench for Qwen and Gemma, respectively, withholding between 23.53 and 41.86 answer tokens on average. Same-pass deployment is also efficient: on a 63-prompt harmful-prompt fixed-generation benchmark aggregated over HarmBench DirectRequest and SocialHarmBench under Qwen3-8B, the monitor increases mean latency by only 2.34%, whereas Qwen3Guard-Stream-4B increases it by 79.40%.
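
The exponential moving average decision rule can be pictured with a few lines of code. This is a generic sketch, assuming the monitor emits one risk score per decoded token; the smoothing factor, threshold, and function name are hypothetical and are not AERIC's calibrated values.

```python
def first_trigger(scores, alpha=0.5, threshold=0.75):
    """Return the index of the first token whose EMA-smoothed risk exceeds threshold.

    scores: per-token risk scores in [0, 1] from a (hypothetical) monitor head.
    Returns None if the smoothed score never crosses the threshold.
    """
    ema = 0.0
    for t, s in enumerate(scores):
        ema = alpha * s + (1.0 - alpha) * ema  # exponential moving average update
        if ema > threshold:
            return t
    return None
```

Smoothing means a single high score does not trigger on its own, while a sustained run of high scores does, which is the usual reason to apply an EMA before thresholding.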

2605.23972 2026-05-26 cs.AI cs.CL cs.RO

Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

为什么我们需要世界模型来实现通用人工智能:大语言模型失败之处以及世界模型如何可能超越

Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny

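AI summary Introduces the Latent Dynamics Inference (LDI) perspective and a case study in the Flux environment to argue that LLMs are limited in causal reasoning, state tracking, and long-horizon planning, and shows that reinforcement-learning agents with explicit state-space access substantially outperform text-only LLMs in long-horizon gameplay.

Comments 19 pages, 5 figures

Details
AI Chinese abstract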

大语言模型在语言生成和知识密集型任务中表现出色,但在需要因果推理、持久状态跟踪和长程规划的场景中仍然受限。我们认为,这些限制可能源于序列预测与对潜在环境动态进行推理之间的目标层级不匹配。为了形式化这一区别,我们引入了潜在动态推理(LDI),这是一种概念性视角,将语言和多模态观测解释为底层转移动态的部分证据。为了实证研究这一视角,我们引入了Flux,一个完全通过自然语言规则指定的序列推理环境。作为一个概念验证案例研究,这些规则首先被编译成一个显式的状态转移模拟器,说明在某些情况下,结构化的潜在转移动态可以从文本规则描述中操作性地提取出来。这使得我们能够在纯文本观测上运行的LLM与直接在提取的潜在状态空间中训练的强化学习智能体之间进行受控比较。在该案例研究中,能够显式访问潜在状态空间的智能体在长程游戏中表现出更稳定的行为,总胜率约为79%,而LLM仅为11%。定性分析进一步揭示了与不稳定的持久状态跟踪一致的失败模式,包括无效动作、状态跟踪错误和短程推理失败。Flux环境的完整实现可在https://github.com/FeisalAlaswad/FLUX-RL-Agent获取。在评估的设置中,这些结果表明,如果没有持久状态跟踪和转移建模的机制,仅凭强大的序列预测可能难以支持稳健的长程动态推理。

English abstract

Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between LLMs operating purely over textual observations and reinforcement-learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment is available at https://github.com/FeisalAlaswad/FLUX-RL-Agent. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

2605.23970 2026-05-26 cs.CL

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

忠实还是捏造?LLM 评判中合理化偏差的因果框架

Riya Tapwal, Abhishek Kumar, Carsten Maple

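AI summary Proposes a causal framework for studying LLM judges' reliance on non-evidential cues, uses cue interventions and anchoring metrics to expose their rationalization bias, and validates the effectiveness of an evidence-lock strategy.

Details
AI Chinese abstract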

大型语言模型(LLM)越来越多地被用作自动评判者,用于摘要和对话评估。先前的工作记录了诸如位置、冗长性和风格偏好等偏差,但主要关注结果,对评判解释的探索不足。我们转而询问 LLM 评判者是否对线索不变,即当非证据线索被扰动而底层文本保持不变时,其排名和解释是否保持稳定。我们引入了一套线索干预(盲、真相、翻转、安慰剂、事后揭示)和线索感知度量,用于量化结果锚定和理由锚定,包括标签对齐的修辞和解释漂移,以及一致性和刻板印象入侵检查。我们使用冗长性和信心线索设计锚定攻击,并比较两种缓解措施:结构化思维链提示和证据优先(证据锁定、评分、排名)。使用包含来自传统抽取模型和 LLM 的 1000 篇摘要的新数据集,我们发现标签和安慰剂扰动下存在显著的线索锚定合理化,而证据优先在改善线索不变性方面显著优于基线。

English abstract

Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes, leaving judge explanations underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed. We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks. We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.

2605.23969 2026-05-26 cs.CL

SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

SLAP: 基于分层损失剪枝的在线策略数据高效指令微调

Run Zou, Jianhang Ding, Yifan Ding, Wen Wu, Hao Chen, Renshu Gu

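AI summary Proposes SLAP, a framework for batch-level data selection via distribution-aware stratified sampling and relative distance optimization, which maintains or improves LLM performance with 20-40% less training data.

Comments 15 pages, 10 figures

Details
AI Chinese abstract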

指令微调优化了大语言模型(LLMs)的专门能力,但通常需要大量数据集和长时间训练。挑战在于通过识别有用数据并高效微调来开发特定能力。高质量且多样化的剪枝数据可以帮助模型以较低成本实现无损性能。在本文中,我们提出SLAP,一种新颖的批次感知数据选择框架,评估整个批次组合的可学习性而非单个样本。SLAP通过分布感知分层采样确保全面的数据分布覆盖,同时通过相对距离优化最大化批次内多样性。通过利用Hessian近似的梯度信息进行动态批次选择,SLAP在多种模型架构(LLaMA、ChatGLM)和多样下游任务(包括多轮对话、多语言翻译和问答)上显著优于现有最先进方法。最值得注意的是,与完整数据集训练相比,SLAP在减少20-40%训练数据的情况下实现了更优性能,大幅降低了计算成本,同时保持或提升了模型能力。这些结果确立了SLAP作为大语言模型高效指令微调的有效方法。

English abstract

Instruction tuning has optimized the specialized capabilities of large language models (LLMs), but it often requires extensive datasets and prolonged training times. The challenge lies in developing specific capabilities by identifying useful data and efficiently fine-tuning. High-quality and diverse pruned data can help models achieve lossless performance at a lower cost. In this paper, we propose SLAP, a novel batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual samples. SLAP ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization. By leveraging Hessian-approximated gradient information for dynamic batch selection, SLAP significantly outperforms existing state-of-the-art methods across multiple model architectures (LLaMA, ChatGLM) and diverse downstream tasks including multi-turn dialogue, multilingual translation, and question answering. Most notably, SLAP achieves superior performance with 20-40% less training data compared to full dataset training, substantially reducing computational costs while maintaining or improving model capabilities. These results establish SLAP as a powerful approach for efficient and effective instruction tuning of large language models.

2605.23966 2026-05-26 cs.CL cs.AI cs.SY eess.SY math.CO

TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling

TriVAL: 一种用于忠实自动优化建模的三重验证框架

Ziyang Fang, JinXi Wang, Jinghui Zhong, Yew-Soon Ong

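AI summary Proposes TriVAL, a tri-validation framework that performs explicit validation at three stages (semantic specification, mathematical formulation, and code generation), improves the accuracy of automatic optimization modeling through a construct-validate-revise loop, and outperforms existing methods on the new NL4COP benchmark.

Comments 13 pages

Details
AI Chinese abstract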

优化建模作为自然语言问题描述与优化求解器之间的关键桥梁,是将运筹学(OR)应用于实际决策的基石。大语言模型(LLM)的最新进展推动了自动优化建模的显著进步。然而,现有方法在建模过程中仍缺乏显式验证,导致早期阶段引入的错误会沿流水线传播,最终降低建模精度。为解决这一挑战,我们提出TriVAL,一种在自动优化建模的三个阶段(语义规范、数学公式和代码生成)进行显式验证的三重验证框架。在每个阶段,TriVAL遵循构建-验证-修正循环,根据阶段特定标准评估当前结果,并在必要时进行修正。这种设计有助于在错误跨阶段累积之前识别和纠正它们,从而在整个建模过程中保持忠实性。为了在更具挑战性的组合问题上评估自动优化建模,我们进一步引入NL4COP,一个包含50种不同问题类型、150个实例的基准,其决策逻辑更复杂、约束耦合更紧密、建模要求比现有基准更高。在NL4COP和已有基准上的实验表明,TriVAL始终优于最先进的方法,在最具挑战性的问题上提升最大。

English abstract

Optimization modeling serves as the pivotal bridge between natural-language problem descriptions and optimization solvers, and remains a cornerstone for bringing operations research (OR) into real-world decision making. Recent advances in large language models (LLMs) have driven significant progress in automatic optimization modeling. However, existing methods still lack explicit validation during the modeling process, allowing errors introduced in earlier stages to carry through the pipeline and ultimately reduce final modeling accuracy. To address this challenge, we introduce TriVAL, a tri-validation framework that performs explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation. At each stage, TriVAL follows a construct-validate-revise loop that assesses the current result against stage-specific criteria and revises it when needed. This design helps identify and correct errors before they accumulate across stages, preserving faithfulness throughout the modeling process. To evaluate automatic optimization modeling on more challenging combinatorial problems, we further introduce NL4COP, a benchmark of 150 instances across 50 diverse problem types with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements than existing benchmarks. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state-of-the-art methods, with the largest gains on the most challenging problems.
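
A construct-validate-revise loop of the kind described can be sketched generically as follows. The `Validator` and `Reviser` callables, the round limit, and the function name are hypothetical stand-ins (in TriVAL these roles would be played by stage-specific checks and LLM calls, whose details the abstract does not give).

```python
from typing import Callable, List, Tuple

# A validator returns a list of issue strings; an empty list means the artifact passes.
Validator = Callable[[str], List[str]]
Reviser = Callable[[str, List[str]], str]


def construct_validate_revise(draft: str, validate: Validator, revise: Reviser,
                              max_rounds: int = 3) -> Tuple[str, bool]:
    """Revise a stage artifact until the validator reports no issues or rounds run out."""
    for _ in range(max_rounds):
        issues = validate(draft)
        if not issues:
            return draft, True
        draft = revise(draft, issues)
    # Out of rounds: report whether the last revision happens to pass.
    return draft, not validate(draft)
```

Running one such loop per stage (specification, formulation, code) stops an error at the stage where it appears instead of letting it reach the solver code.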

2605.23957 2026-05-26 cs.AI cs.LG

Low-Cost Labels, Reliable Choices: Rollout-Calibrated Hyper-Heuristics for Job Shop Scheduling

低成本标签,可靠选择:用于作业车间调度的Rollout校准超启发式算法

Junhao Wei, Yanxiao Li, Yifu Zhao, Zhenhong Peng, Baili Lu, Dexing Yao, Haochen Li, Qinbin He, Sio-Kei Im, Yapeng Wang, Xu Yang

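AI summary Proposes a rollout-calibrated hyper-heuristic that combines regret-normalized labels, a contextual KNN uncertainty estimate, and a gating mechanism to obtain a reliable selector from low-cost labels, substantially lowering mean RPD.

Details
AI Chinese abstract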

学习辅助的超启发式算法可以在保持构造性作业车间调度问题(JSSP)启发式的可行性和可解释性的同时,选择调度规则。其主要计算成本在于标签生成而非模型拟合,因为每个监督标签通常需要从部分调度中展开候选规则。我们研究了这一标签成本问题以及一个可靠性问题:学习的选择器不应偏离强默认规则,除非预测的增益是可信的。所提出的选择器使用遗憾归一化的展开标签、上下文KNN不确定性估计以及一个门控机制,仅在预测改进超过不确定性调整的边际时采取行动。我们还变化展开深度和广度以衡量成本-质量权衡。在合成JSSP实例上,门控选择器在学习的选择器中实现了最低的平均RPD,接近最佳固定调度规则,并将Random-HH的平均RPD降低了一个数量级以上。

English abstract

Learning-assisted hyper-heuristics can select among dispatching rules while preserving the feasibility and interpretability of constructive Job Shop Scheduling Problem (JSSP) heuristics. Their main computational cost lies in label generation rather than model fitting, since each supervised label usually requires rolling out candidate rules from a partial schedule. We study this label-cost problem together with a reliability problem: a learned selector should not switch away from a strong default rule unless the predicted gain is credible. The proposed selector uses regret-normalized rollout labels, a contextual KNN uncertainty estimate, and a gate that acts only when the predicted improvement exceeds an uncertainty-adjusted margin. We also vary rollout depth and breadth to measure the cost-quality trade-off. On synthetic JSSP instances, the gated selector achieves the lowest mean RPD among learned selectors, remains close to the best fixed dispatching rule, and reduces Random-HH mean RPD by more than an order of magnitude.
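
The gate described above ("act only when the predicted improvement exceeds an uncertainty-adjusted margin") might look like the sketch below. It assumes the KNN step has already returned, for each rule, the regret labels of the k nearest stored states, and it uses the spread of those labels as the uncertainty. The rule names, the `kappa` weight, and the exact margin formula are illustrative assumptions rather than the paper's definition.

```python
from statistics import mean, pstdev


def gated_choice(neighbor_regret, default_rule, kappa=1.0):
    """Pick a dispatching rule, leaving the default only when the gain is credible.

    neighbor_regret maps rule name -> regret labels of the k nearest stored states
    (lower is better). The spread of those labels serves as the uncertainty estimate.
    """
    est = {r: (mean(v), pstdev(v)) for r, v in neighbor_regret.items()}
    default_mean, default_std = est[default_rule]
    best_rule, best_margin = default_rule, 0.0
    for rule, (m, s) in est.items():
        if rule == default_rule:
            continue
        gain = default_mean - m                    # predicted improvement over default
        margin = gain - kappa * (s + default_std)  # uncertainty-adjusted margin
        if margin > best_margin:
            best_rule, best_margin = rule, margin
    return best_rule
```

With consistent neighbour labels the selector switches; with the same or a smaller mean gain but scattered labels it stays on the default rule.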

2605.23956 2026-05-26 cs.AI cs.LG cs.MA

QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

QUIVER: 复合AI系统中扰动传播与分岔的量化形式化框架

Prashanti Nilayam, Sankalp Nayak

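AI summary Proposes QUIVER, a formal framework with four components (sensitivity matrix, trajectory divergence, bifurcation thresholds, and distribution faithfulness) that quantifies perturbation propagation and structural bifurcation in graph-structured LLM pipelines, validated on three structurally distinct enterprise and public pipelines.

Details
AI Chinese abstract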

将多个LLM调用链接成有向计算图的复合AI系统现已成为生产AI的主导架构。尽管这些架构利用具有混合模式输出的异构节点,但现有框架无法量化扰动如何通过此类流水线传播,其中节点是随机的且执行路径可能发生结构分岔。我们引入QUIVER,一个用于测量图结构LLM流水线中扰动传播的形式化框架。该框架定义了:(1) 一个敏感性矩阵,带有类型分派的距离度量,将边分类为放大器、吸收器或阈值敏感,并辅以出现提升;(2) 轨迹散度,将变异分解为值漂移、结构路径散度和迭代次数散度;(3) 分岔阈值,识别导致结构执行路径变化的最小扰动;(4) 分布忠实度,量化每个节点评估数据集何时偏离生产分布。我们在两个生产企业流水线和一个公共DSPy多跳QA流水线上进行验证,这三个架构在结构上各不相同。在8200多个仪器化轨迹(32000多对比较)中,我们证明QUIVER揭示了不同架构的独特敏感性剖面,区分了产生相同散度率的机制不同的级联模式,仅从观测数据预测易发生轨迹分岔的节点,并将过时的评估伪影定位到聚合指标无法揭示的特定节点-字段类别。

English abstract

Compound AI systems that chain multiple LLM calls into directed computation graphs are now the dominant architecture for production AI. Although these architectures leverage heterogeneous nodes with mixed-mode outputs, no existing framework quantifies how perturbations propagate through such pipelines, where nodes are stochastic and execution paths can diverge structurally. We introduce QUIVER, a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by occurrence-lift; (2) trajectory divergence decomposing variation into value drift, structural path divergence, and iteration count divergence; (3) bifurcation thresholds identifying the smallest perturbation that causes structural execution path changes; and (4) distribution faithfulness, quantifying when per-node evaluation datasets diverge from production distributions. We validate on two production enterprise pipelines and a public DSPy multi-hop QA pipeline, three structurally distinct architectures. Across 8,200+ instrumented traces (32,000+ pair comparisons), we demonstrate that QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns producing identical divergence rates, predicts nodes prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.
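
As a rough illustration of component (3), an empirical bifurcation threshold can be read off a set of perturbation trials. The sketch assumes each trial records a scalar perturbation magnitude and the sequence of node names that executed; this data layout and the function name are hypothetical and not taken from QUIVER.

```python
def bifurcation_threshold(baseline_path, trials):
    """trials: iterable of (perturbation_magnitude, execution_path) pairs.

    Returns the smallest magnitude whose execution path differs from
    baseline_path, or None if no trial changed the path.
    """
    base = list(baseline_path)
    changed = [mag for mag, path in trials if list(path) != base]
    return min(changed) if changed else None
```

Because nodes are stochastic, a practical estimate would repeat each magnitude several times; this sketch reports only the smallest magnitude at which any structural change was observed.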

2605.23954 2026-05-26 cs.CL cs.AI cs.SD

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

EchoDistill:面向鲁棒音频大语言模型的噪声到干净自蒸馏对齐

Liang Lin, Chunxi Luo, Kaiwen Luo, Jie Zhang, Jin Wang, Yuanhe Zhang, Cai Yuchen, Qiankun Li, Gongli Xi, Zhenhong Zhou, Kun Wang, Junhao Dong

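AI summary Proposes EchoDistill, in which a frozen clean-audio teacher guides a noisy-audio student through group-relative policy optimization, achieving noisy-to-clean self-distillation alignment and improving the semantic reliability and task performance of Audio LLMs under complex noise.

Details
AI Chinese abstract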

音频大语言模型极易受到现实世界噪声的影响,常常导致严重的语义漂移和幻觉。现有的鲁棒性方法主要依赖于波形级声学增强、答案级监督或噪声表示的内部抑制。为了解决这些问题,我们提出了EchoDistill,一种基于对齐的噪声到干净自蒸馏框架。EchoDistill利用冻结的干净音频教师模型为推理时的噪声音频学生模型提供语义参考。具体地,学生模型在噪声条件下采样候选响应以暴露其测试时行为。这些轨迹随后通过组相对策略优化进行优化,其中与教师模型的令牌级一致性作为奖励加成。通过将噪声学生模型的候选响应与干净语义证据对齐,并应用音频感知奖励塑造,我们的方法鼓励既正确又真正基于声学推理的轨迹。EchoDistill显著提高了音频大语言模型在复杂噪声下的语义可靠性和任务性能,且不引入任何额外推理成本。大量实验表明:(I) 与最强基线相比,EchoDistill在强噪声下GSR平均提升4.18%↑。(II) 在Qwen-Omni上的消融结果进一步显示,EchoDistill相比仅GRPO变体在Acc上平均提升3.02%↑,在Noisy上提升3.89%↑,在GSR上提升4.53%↑。我们的代码可在https://anonymous.4open.science/r/echodistill-10DE获取。

English abstract

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves an average improvement of 4.18% in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average. Our code is available at https://anonymous.4open.science/r/echodistill-10DE.

2605.23952 2026-05-26 cs.AI cs.CL q-bio.NC

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

机器心理测量学:一种人工智能的数学心理学

Alex Bogdan, Adrian de Valois-Franklin

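AI summary To address two errors in AI evaluation, neglecting psychological structure and over-anthropomorphizing, the paper introduces Machine Psychometrics, which measures latent behavioral, metacognitive, communicative, and self-modeling dispositions and builds the Machine Mindprint and a Trust Protocol to understand non-human agents through measurement before judgment.

Comments 45 pages, 11 figures

Details
AI Chinese abstract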

人工智能体现在产生的行为足够丰富,足以引发信任、惊喜和担忧,然而我们的评估工具仍然优先考虑能力分数而非心理结构。本文认为,两种对称错误(人工心智盲视,即否认非生物系统中的心理组织;以及人工心智投射,即仅从流畅行为推断类似人类的内心生活)之间的哲学僵局,可以通过在意识问题之下引入一个严谨的测量层来规避,而非解决意识问题本身。借鉴Michael Levin关于认知作为跨基质目标导向能力的连续统观点,以及数学心理学的方法论库(项目反应理论、信号检测理论、贝叶斯认知建模、校准分析、认知偏差测试组),本文发展了机器心理测量学,作为测量人工智能体中潜在行为、元认知、沟通和自我建模倾向的测量科学。其操作核心是机器心智档案:一个多维、领域受限、版本化的轮廓,涵盖校准、源完整性、暗示抵抗性、上下文稳定性、表达对齐、工具完整性、漂移监测和分布基础。一个补充的信任协议通过探针测试组、扰动测试、信度和效度分析以及高风险领域的纵向监测,将心智档案转化为部署决策。哲学贡献是第三种立场,人工心智纪律,既不拟人化也不否认,既不预设意识也不排除意识。目标不是将人工智能体人性化,而是精确地理解它们,因为它们不是人类,通过测量而非判断。

English abstract

Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers human-like inner life from fluent behavior alone) can be circumvented not by resolving the consciousness question, but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, that neither anthropomorphizes nor dismisses, neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents, but to understand them precisely because they are not human, through measurement before judgment.

2605.23951 2026-05-26 cs.AI cs.LO cs.MA

Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof

智能体技能形式化验证方法:面向可机械检查的能力包含证明的三层架构

Alfredo Metere

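AI summary Presents three composable methods (static abstract interpretation, a refinement type system, and SMT-bounded model checking) that raise agent skills from the declared or tested level to the formal level, yielding a mechanically checkable capability-containment proof.

Details
AI Chinese abstract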

伴随论文引入了一个关于智能体技能清单的四级验证格(未验证、声明、测试、形式化),并将最高级别作为目标。本文填补了这一空白。我们给出了技能行为的精确语义,忠实于技能如何被LLM驱动的运行时(通过非确定性LLM侧可达的确定性脚本侧)消费,将验证问题表述为该语义上的能力包含属性,并提出了三种可组合方法,共同将技能从声明或测试级别提升至形式化级别:(1)通过在小效应格上的抽象解释,对脚本侧进行可靠静态能力包含分析;(2)一个用于工具调用封装的精炼类型系统,机械地拒绝任何静态推断能力不在清单声明集中的调用;(3)针对父论文的双条件正确性准则的SMT有界模型检测,其中边界选择使得任何符合运行时事务缓冲区视野的反例都作为具体轨迹呈现。我们证明了这三个层次组合起来能可靠地覆盖父论文的威胁模型,仅剩一个残余(LLM拒绝行动的自由),该残余由父论文的运行时双条件在会话边界捕获。这些方法重用现有的成熟工具(Z3、Semgrep、CodeQL、精炼类型检查器、机械化证明助手),而非要求操作者构建新工具,并且携带证明的工件扩展了现有的SKILL.md约定。所有三种方法以及捆绑生产者和重新检查器作为零依赖JavaScript模块在开源enclawed框架(https://github.com/metereconsulting/enclawed;项目页面https://www.enclawed.com/)中提供,包含53个单元测试和一个端到端CLI演示示例技能。

English abstract

The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This paper closes that gap. We give a precise semantics for skill behaviour faithful to how a skill is consumed by an LLM-driven runtime (a deterministic script-side reachable through a non-deterministic LLM-side), state the verification problem as a capability-containment property over that semantics, and present three composable methods that together raise a skill from declared or tested to formal: (1) sound static capability-containment analysis of the script-side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically-inferred capability is not in the manifest's declared set; (3) SMT-bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so any counter-example fitting the runtime's transaction-buffer horizon is exhibited as a concrete trace. We prove the three layers composed soundly cover the parent paper's threat model modulo a single residual (the LLM's freedom to refuse to act) that the parent paper's runtime biconditional catches at session boundary. The methods reuse existing well-engineered tools (Z3, Semgrep, CodeQL, refinement-type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof-carrying artifact extends the existing SKILL.md convention. All three methods plus the bundle producer and re-checker ship as zero-dependency JavaScript modules in the open-source enclawed framework (https://github.com/metereconsulting/enclawed; project page https://www.enclawed.com/), with 53 unit tests and an end-to-end CLI demo on a sample skill.
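
At its simplest, the containment property in method (2) is a subset check between what a call is inferred to need and what the manifest declares. The sketch below shows only that check; the capability strings, call names, and function name are invented for illustration and are not the enclawed framework's API, which ships as JavaScript modules.

```python
from typing import Iterable, List, Set, Tuple


def check_containment(declared: Set[str],
                      inferred_calls: Iterable[Tuple[str, Set[str]]]) -> List[Tuple[str, List[str]]]:
    """Return the calls whose inferred capabilities are not a subset of the declared set."""
    violations = []
    for call_name, caps in inferred_calls:
        extra = caps - declared  # capabilities used but never declared
        if extra:
            violations.append((call_name, sorted(extra)))
    return violations
```

An empty result means every inferred capability is covered by the declaration; any non-empty result would be grounds for rejecting the skill before it runs.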

2605.23950 2026-05-26 cs.AI cs.SE

Stop Comparing LLM Agents Without Disclosing the Harness

停止比较 LLM Agent 而不公开其执行框架

Yunbei Zhang, Janet Wang, Yingqiang Ge, Weijie Xu, Jihun Hamm, Chandan K. Reddy

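AI summary Argues that for long-horizon tasks the agent execution harness often determines performance more than the underlying model, and proposes a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol.

Details
AI Chinese abstract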

这篇立场论文认为,对于在具有可比前沿能力的模型上评估的长周期任务,Agent 执行框架(即围绕语言模型管理上下文构建、工具交互、编排和验证的基础设施层)通常比其包装的模型更能决定 Agent 性能。我们形式化并辩护了绑定约束论题:在此情况下,性能方差更多地由框架配置而非模型选择决定,当前评估协议因此系统性地将框架层面的提升错误归因于模型改进。我们从三个方面支持这一论点。首先,控制论形式化将框架视为闭环动态系统的控制器,LLM 为其管理的随机策略,这解释了为什么小的框架变化可以产生超过替换模型所带来的性能变化。其次,已发表的基准测试、行业部署以及受控方差分解表明,框架引起的方差可能显著超过模型引起的方差,包括模型排名反转的情况。第三,我们提出了一个框架感知的评估框架,包含披露标准和方差分解协议。在框架规范被公开之前,长周期 Agent 的排行榜比较应被视为不完整且可能具有误导性。

English abstract

This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. We support this thesis along three lines. First, a control-theoretic formalization treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can produce performance shifts that exceed those obtained by substituting one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness-induced variance can substantially exceed model-induced variance, including cases of model ranking reversal. Third, we propose a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons for long-horizon agents should be treated as incomplete and potentially misleading.
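
One simple way to compare harness-induced and model-induced variance is to score every harness-model pair on a full grid and compare the variance of the row means with that of the column means. The sketch below does exactly that; the function name and the grid layout are assumptions, and the paper's actual decomposition protocol may differ.

```python
from statistics import mean, pvariance


def variance_shares(scores):
    """scores[h][m]: benchmark score of harness h wrapping model m (a full grid).

    Returns (harness_var, model_var): variance of harness means and of model means.
    """
    harness_means = [mean(row) for row in scores]
    model_means = [mean(col) for col in zip(*scores)]
    return pvariance(harness_means), pvariance(model_means)
```

For a made-up grid such as `[[0.30, 0.32], [0.60, 0.62]]`, the harness term dominates, which is the pattern the thesis predicts; real grids would also need repeated runs to separate either effect from noise.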

2605.23945 2026-05-26 cs.AI cs.DC

Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism

通过自适应张量并行加速同步RLHF训练中的长尾生成

Long Zhao, Qinghe Wang, Jiaan Zhu, Youhui Bai, Zewen Jin, Chaoyi Ruan, Shengnan Wang, Cheng Li

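AI summary To address the low GPU utilization caused by long-tail generation in synchronous RLHF training, proposes PAT, an adaptive tensor parallelism method that uses predictor-guided online reconfiguration and a lightweight state migration mechanism to reduce generation latency and end-to-end training iteration latency.

Comments 11 pages, 14 figures

Details
AI Chinese abstract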

基于人类反馈的强化学习(RLHF)已成为提升模型质量的关键后训练范式。然而,同步三阶段RLHF流水线常受限于生成阶段,其中响应长度偏斜导致解码过程中有效批量大小迅速缩小,使得GPU在少数长响应未完成时处于低利用率状态。主流框架采用静态张量并行(TP)配置,无法适应变化的批量特征,留下了大量性能提升空间。我们提出PAT,一种自适应TP方法,在每次RLHF迭代的生成阶段动态重配置TP。PAT引入两项关键技术。首先,一种预测引导的在线重配置方法基于离线性能分析决定重配置时机和目标TP配置,仅在预测延迟收益超过重配置开销时触发重配置。其次,一种轻量级在线重配置机制仅更新受TP变化影响的状态和布局:通过基于成本模型在KV缓存迁移和重计算之间选择,适配未完成的解码状态;执行原地权重重分片;并重用缓存的通信组。我们在SGLang之上实现PAT,并将其集成到VeRL框架中。使用DeepScaleR对LLaMA3.1-8B和Qwen3-14B的评估表明,与原始VeRL设置相比,PAT将生成延迟降低高达34.6%,端到端RLHF训练迭代延迟降低高达27.2%。

English abstract

Reinforcement Learning from Human Feedback (RLHF) has become a key post-training paradigm for improving model quality. However, the synchronous three-stage RLHF pipeline is often bottlenecked by the generation stage, where response-length skew causes the effective batch size to shrink rapidly during decoding, leaving GPUs underutilized while a few long responses remain unfinished. Mainstream frameworks employ a static tensor parallelism (TP) configuration that cannot adapt to changing batch characteristics, leaving substantial performance headroom unexplored. We propose PAT, an adaptive TP method that dynamically reconfigures TP during the generation stage of each RLHF iteration. PAT introduces two key techniques. First, a predictor-guided online reconfiguration method decides both the reconfiguration point and the target TP configuration based on offline profiling, triggering reconfiguration only when the predicted latency benefit outweighs the reconfiguration overhead. Second, a lightweight online reconfiguration mechanism updates only the states and layouts affected by TP changes: it adapts unfinished decoding states through a cost-model-based choice between KV-cache migration and recomputation, performs in-place weight resharding, and reuses cached communication groups. We implement PAT on top of SGLang and integrate it with the VeRL framework. Evaluations on LLaMA3.1-8B and Qwen3-14B using DeepScaleR show that PAT reduces generation latency by up to 34.6% and end-to-end RLHF training iteration latency by up to 27.2% compared to the original VeRL setup.
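
The two decisions PAT makes during generation can be reduced to cost comparisons. The following sketch assumes the predictor and cost model have already produced latency estimates in seconds; both function names and their inputs are hypothetical, and the real system derives these numbers from offline profiling.

```python
def should_reconfigure(current_latency, predicted_latency, reconfig_overhead):
    """Reconfigure only when the predicted saving exceeds the one-off overhead (seconds)."""
    return (current_latency - predicted_latency) > reconfig_overhead


def kv_strategy(migration_cost, recompute_cost):
    """Choose the cheaper way to carry unfinished decoding state to the new TP layout."""
    return "migrate" if migration_cost <= recompute_cost else "recompute"
```

For example, `should_reconfigure(10.0, 6.0, 1.5)` holds because the 4-second predicted saving exceeds the 1.5-second overhead, whereas a 0.5-second saving would not justify the switch.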

2605.23944 2026-05-26 cs.AI math.PR

Right-Sizing Communication and Recommendation Set Size in AI-Assisted Search

Setting Communication and Recommendation Set Size Appropriately in AI-Assisted Search

Jing Dong, Prakirt Raj Jhunjhunwala, Yash Kanoria

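AI summary Models the interaction between a user and an AI recommendation system and studies how to choose message precision and recommendation set size to maximize the user's expected payoff when both communication and search are costly.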

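Details
AI abstract (translated from Chinese)

We model the interaction between a user and an AI-driven recommendation system. The user starts the process by conveying preference information through a costly, noisy message. The AI assistant, acting as a Bayesian agent, interprets the message to form a posterior belief about the user's true preferences and makes product recommendations. Specifically, it decides how many recommendations to present so as to maximize the expected utility of the user's final choice while accounting for the search cost induced by the size of the recommendation set. We use mutual-information-based cost functions to model the two distinct costs the user incurs during the interaction: (i) a communication cost, which increases with the precision of the preference message, and (ii) a search cost, which increases with the size of the recommendation set provided by the AI assistant. We study products and preferences located in d-dimensional space and ask how the user's expected payoff can be maximized. For large d, we describe how the optimal message precision and recommendation set size depend on the cost parameters under two different distributions for sampling recommendations from the product universe: (i) the Bayesian posterior belief, and (ii) an optimized tilted distribution. Under the posterior sampling scheme (i), we identify a hybrid regime in which an efficient interaction policy requires jointly optimizing the amount of information (in bits) conveyed by the user and the number of recommendations provided by the AI assistant. Under the tilted sampling scheme (ii), our results show that the optimal interaction policy uses only one of communication and search, favoring whichever is less costly.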

English abstract

We model the interaction between a user and an AI-driven recommendation system. The user initiates the process by conveying preference information through a costly and noisy message. The AI assistant, acting as a Bayesian agent, interprets the user's message to form a posterior belief about their true preferences and make product recommendations. In particular, it determines how many recommendations to present so as to maximize the user's expected utility from their final choice, while accounting for the search cost induced by the size of the recommendation set. We use mutual-information-based cost functions to model the two distinct costs incurred by the user during the interaction: (i) a communication cost, which increases with the precision of their preference message, and (ii) a search cost, which increases with the size of the recommendation set provided by the AI assistant. We study products and preferences that live in d-dimensional space, and ask how the user's expected payoff can be maximized. For large d, we characterize how the optimal message precision and recommendation set size depend on the cost parameters, under two distinct distributions from which recommendations can be sampled from the product universe: (i) the Bayesian posterior belief, and (ii) an optimized tilted distribution. Under the posterior sampling scheme (i), we identify a hybrid regime, in which an efficient interaction policy requires jointly optimizing the amount of information (in bits) conveyed by the user and the number of recommendations provided by the AI assistant. Under the tilted sampling scheme (ii), our results show that the optimal interaction policy uses only one of communication and search, favoring whichever of them is less costly.

2605.23943 2026-05-26 cs.AI physics.hist-ph quant-ph

Spacetime Formation under Requirements: Contextual Realization and Form-Dependent Probability

Spacetime Formation under Requirements: Contextual Realization and Form-Dependent Probability

Song-Ju Kim

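AI summary Proposes a new interpretation in which quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements, and explains noncommutativity, interference, and quantum-like probability through a requirement-driven, non-Boolean realization mechanism.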

Comments 19 pages, 1 figure

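Details
AI abstract (translated from Chinese)

Quantum cognition usually explains order effects, contextuality, and violations of the law of total probability by replacing classical probability with quantum probability on a fixed event structure. This paper proposes a different interpretation: quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements. The framework starts not from time, space, objects, or probability, but from requirements such as finite representational capacity, single-state semantic stability, context-sensitive intervention, avoidance of explicit context labels, coherent world-formation, and intersubjective transformability. When these requirements cannot be realized within a single global Boolean event structure, the mismatch shows up, under fixed-spacetime projection, as noncommutativity, interference, and quantum-like probability. Building on earlier single-state approaches to contextuality, we reinterpret the classical cost of contextual bookkeeping as the fixed-spacetime shadow of contextual spacetime formation. Auxiliary memory or context labels in a classical representation correspond, in this account, to a holonomy-like mismatch among locally Boolean logic-worlds. The interference term is the cross term produced when locally classical realization contributions are glued together nontrivially and projected back into a fixed classical spacetime form. The result is a transcendental-operational realist account: objecthood, eventhood, probability, and spacetime are treated as forms of realization under requirements, while objectivity is defined by the invariants preserved across observer-dependent and history-dependent spacetime formations.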

English abstract

Quantum cognition often explains order effects, contextuality, and violations of the law of total probability by replacing classical probability with quantum probability on a fixed event structure. This paper proposes a different interpretation: quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements. The framework begins not with time, space, objects, or probabilities, but with requirements such as finite representational capacity, single-state semantic stability, context-sensitive intervention, avoidance of explicit context labels, coherent world-formation, and intersubjective transformability. When these requirements cannot be realized within a single global Boolean event structure, the mismatch appears, under fixed-spacetime projection, as noncommutativity, interference, and quantum-like probability. Building on prior single-state approaches to contextuality, we reinterpret classical contextual bookkeeping cost as the fixed-spacetime shadow of contextual spacetime formation. Auxiliary memory or context labels in a classical representation correspond, in this account, to holonomy-like mismatch among locally Boolean logic-worlds. The interference term is the cross term generated when locally classical realization contributions are nontrivially glued and projected back into a fixed classical spacetime form. The result is a transcendental-operational realist account: objecthood, eventhood, probability, and spacetime are treated as forms of realization under requirements, while objectivity is defined by invariants preserved across observer- and history-dependent spacetime formations.
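For readers unfamiliar with the "interference term" and the violation of the law of total probability mentioned above, the following minimal sketch shows the standard textbook version of that cross term for a two-outcome intermediate question. It illustrates ordinary quantum probability only, not the paper's spacetime-formation construction; the function name `total_probability_gap` and the example state are assumptions chosen for illustration.

```python
import math
from typing import Sequence, Tuple


def total_probability_gap(psi: Sequence[complex],
                          b: Sequence[complex]) -> Tuple[float, float, float]:
    """Compare classical and quantum probabilities of outcome b given state psi.

    Both vectors are amplitudes in the basis of an intermediate question A.
    Returns (classical, quantum, interference), where
      classical    = sum_i |<b|i><i|psi>|^2   (law of total probability over A),
      quantum      = |sum_i <b|i><i|psi>|^2   (A is never resolved),
      interference = quantum - classical      (the cross term).
    """
    paths = [complex(bi).conjugate() * complex(pi) for bi, pi in zip(b, psi)]
    classical = sum(abs(a) ** 2 for a in paths)
    quantum = abs(sum(paths)) ** 2
    return classical, quantum, quantum - classical


# Equal superposition measured against the same superposition.
s = 1 / math.sqrt(2)
classical, quantum, interference = total_probability_gap([s, s], [s, s])
print(round(classical, 6), round(quantum, 6), round(interference, 6))  # -> 0.5 1.0 0.5
```

Here resolving the intermediate question first gives probability 0.5, while leaving it unresolved gives 1.0; the difference of 0.5 is the cross term that a single global Boolean event structure cannot account for.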