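arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.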

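2605.26112 2026-05-26 cs.AI cs.LG Version update

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to the User's Digital World

Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan, Lue Fan, Sanyuan Zhao, Dandan Tu

Affiliations: Beijing Institute of Technology; Huawei Technologies Co., Ltd; Peking University; Institute of Automation, Chinese Academy of Sciences

AI summary: Proposes the Claw-Anything benchmark, which evaluates LLM agents in an always-on setting by expanding context along three dimensions (long-horizon activity histories, interdependent backend services, and GUI and CLI interaction across multiple devices). GPT-5.5 reaches only 34.5% pass@1, and a released automated data-generation pipeline improves the base model by 23.7%.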

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating the utility of scalable data infrastructure.
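The headline number above is a pass@1 score. The abstract does not say how it is estimated, so the following is only a minimal sketch of the widely used unbiased pass@k estimator, assuming each task is run n times and scored as success or failure. The names `pass_at_k` and `mean_pass_at_k` are illustrative and are not part of the Claw-Anything release.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled runs, c of which succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the probability that k runs drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each item is (n_runs, n_successes)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the plain success rate c / n per task, which is why pass@1 is often read as "fraction of tasks solved on the first try".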

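2605.26081 2026-05-26 cs.AI Version update

Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning

Waleed Razzaq, Yun-Bo Zhao

Affiliations: Department of Automation, University of Science & Technology of China, Hefei, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: Proposes NSAC, a biologically inspired continuous-time attention architecture that uses an Ornstein-Uhlenbeck stochastic differential equation with an NCP gating mechanism to induce a Gaussian distribution over logits, yielding probabilistic outputs and uncertainty quantification.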

Abstract

Reliable quantification of uncertainty estimates in continuous-time (CT) representation learning remains nascent, particularly within CT attention architectures. We introduce the Neuronal Stochastic Attention Circuit (NSAC), a novel biologically inspired CT attention architecture that reformulates attention logit computation as the solution of an Ornstein-Uhlenbeck stochastic differential equation modulated by input-dependent, nonlinear interlinked gates derived from the wiring mechanism of repurposed C. elegans Neuronal Circuit Policies (NCPs). It induces a Gaussian distribution over logits that propagates principled stochasticity through a logistic-normal distribution over attention weights to yield probabilistic outputs. A two-term objective function combining Gaussian negative log-likelihood with an epistemic-separation regularizer enforces higher predictive variance and enables joint quantification of aleatoric and epistemic uncertainty. Empirically, we implement NSAC in a diverse set of learning tasks including: (i) irregular CT function approximation; (ii) multivariate regression; (iii) long-range forecasting; (iv) Industry 4.0; and (v) the lane-keeping of autonomous vehicles. We observe that NSAC remains competitive against several baselines in terms of accuracy and produces reasonably well-calibrated uncertainty estimates while being interpretable at the neuronal cell level.
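To make the chain "Ornstein-Uhlenbeck logits, Gaussian distribution, logistic-normal attention weights" concrete, here is a minimal sketch under strong simplifying assumptions: the drift parameters `theta`, `mu`, and `sigma` are constants, whereas NSAC modulates them with input-dependent NCP gates that the abstract does not specify. The function names are hypothetical and this is not the authors' implementation.

```python
import math
import random


def ou_moments(x0, mu, theta, sigma, t):
    """Mean and variance at time t of dX = theta*(mu - X) dt + sigma dW, X(0) = x0."""
    decay = math.exp(-theta * t)
    mean = mu + (x0 - mu) * decay
    var = sigma ** 2 * (1.0 - decay ** 2) / (2.0 * theta)
    return mean, var


def sample_attention(means, variances, rng):
    """Draw Gaussian logits and softmax them: one logistic-normal sample."""
    logits = [rng.gauss(m, math.sqrt(v)) for m, v in zip(means, variances)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the OU process has a closed-form Gaussian marginal, each logit is Gaussian at any query time, and pushing independent Gaussian logits through a softmax gives attention weights that follow a logistic-normal distribution on the simplex.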

2605.26045 2026-05-26 cs.CL cs.AI Version update

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech

Affiliations: University of Turin; University of Southern Denmark

AI summary: Studies six confidence-estimation methods for activation oracles and finds that bootstrap mode frequency is better calibrated than the other methods (ECE 5.7% vs. 25.5%), while a log-prob baseline can serve as a fast triage signal.
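The summary compares methods by expected calibration error (ECE). As a minimal sketch, assuming the standard equal-width binning definition (the listing does not state the binning the authors use), ECE can be computed as follows; `expected_calibration_error` is an illustrative name.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin rather than an extra one.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

An ECE of 5.7% therefore means that, averaged over bins and weighted by bin size, stated confidence and observed accuracy differ by about 0.057.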

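2605.26040 2026-05-26 cs.AI Version update

L2IR: Revealing Latent Intent in Graph Fraud Detection

Jinsheng Guo, Zhenhao Weng, Yibo Liu, Yan Qiao, Meng Li

Affiliations: Hefei University of Technology

AI summary: Proposes the L2IR framework, which uses large language models to extract latent intent from user behaviors and suspicious connections and adds adaptive self-training for robustness, improving the AUPRC of GNN-based detectors by up to 8.27% on datasets with pervasive camouflage.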

Comments 12 pages, 6 figures

Abstract

Graph fraud detection has long depended on Graph Neural Networks (GNNs) to propagate and aggregate information across relational data. A critical obstacle in practice, however, is that fraudsters frequently disguise themselves by forging numerous connections with benign users, causing fraud signals to be progressively diluted during neighborhood aggregation and undermining detection reliability. While recent efforts have used Large Language Models (LLMs) to provide rich semantic cues for fraud detection, the underlying intent behind suspicious connections remains insufficiently explored. Compounding this issue, the scarcity of annotated fraud samples makes it difficult to train detectors that remain robust under heavy camouflage. To address these gaps, we propose L2IR, an LLM-driven Latent Intent Revealing framework for graph fraud detection. By uncovering latent intent from both user behaviors and suspicious connections, L2IR extracts intent-aware representations from raw behavioral traces and reasons about the true purpose behind individual connections, effectively distinguishing supportive links from misleading ones. It further incorporates adaptive self-training to enhance robustness under limited supervision. Evaluations on two real-world datasets characterized by pervasive camouflage demonstrate that L2IR surpasses strong baselines and can function as a plug-in enhancement for a range of GNN-based detectors, improving AUPRC by up to 8.27%.

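2605.26038 2026-05-26 cs.CV cs.AI Version update

DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang

Affiliations: Shanghai Jiao Tong University

AI summary: Lightweight vision-language models lack explicit visual grounding in dense-scene reasoning, which makes their reasoning chains unreliable. DRScaffold is a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification and substantially improving dense-scene reasoning performance.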

Abstract

Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs demonstrate substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, demonstrating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at https://github.com/irene-shi/DRScaffold.

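2605.26036 2026-05-26 cs.AI cs.LG Version update

AdvantageFlow: Advantage-Weighted Least Squares for Reinforcement Learning in Flow Models

Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Krishna Kumar Singh, Viet Dac Lai

Affiliations: Adobe Research

AI summary: Proposes the AdvantageFlow algorithm, which combines an advantage-weighted forward-process prediction loss with regularization toward the rollout policy and outperforms Flow-GRPO and negative-aware fine-tuning baselines on image generation tasks.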

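2605.26012 2026-05-26 cs.LG cs.AI Version update

Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

Aleksandar Todorov, Matthia Sabatelli

AI summary: Proposes a simple prior that inserts a fixed orthogonal projection to constrain reinforcement-learning encoder features to a low-dimensional subspace, proves that expressivity is preserved under a linear realizability assumption, and shows experimentally that value representations can be compressed to extremely low dimensions without loss of performance.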

Abstract

Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we present a simple yet effective representation-level prior that inserts a fixed orthonormal projection to constrain encoder features to a low-dimensional subspace, requiring no auxiliary objectives, pretraining, or changes to the underlying RL algorithm. Under a linear realizability assumption, we prove that when the bottleneck dimension exceeds the intrinsic rank of the optimal value function in feature space, the bottleneck preserves expressivity and leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization. Empirically, we find that across both single and multi-task benchmarks, baseline performance is either matched or improved once the bottleneck dimension exceeds a small task-dependent threshold; in many cases, value representations can be compressed to extremely low dimensions without loss, and the minimal sufficient dimension depends far more on environment complexity than encoder width. In addition, we analyze representation geometry and find that orthogonal bottlenecks stabilize feature norms and are associated with higher effective rank. Together, these results support a representation-space interpretation of the manifold hypothesis in reinforcement learning and position orthogonal bottlenecks as a lightweight, architecture-agnostic mechanism for shaping RL representations.
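The core mechanism, a fixed orthonormal projection applied to encoder features, can be sketched in a few lines. This is a minimal illustration that assumes the projection is drawn once at initialization by Gram-Schmidt orthogonalization of Gaussian vectors and then frozen; the abstract does not say how the authors construct it, and `orthonormal_columns` and `bottleneck` are hypothetical names.

```python
import math
import random


def orthonormal_columns(d, k, seed=0):
    """Return k orthonormal vectors in R^d (Gram-Schmidt on Gaussian draws)."""
    if not 1 <= k <= d:
        raise ValueError("need 1 <= k <= d")
    rng = random.Random(seed)
    basis = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:  # discard a numerically degenerate draw
            basis.append([x / norm for x in v])
    return basis


def bottleneck(features, basis):
    """Project a d-dimensional feature vector onto the fixed k-dimensional subspace."""
    return [sum(f * b for f, b in zip(features, vec)) for vec in basis]
```

In a training loop the basis would be created once and never updated, so the value or policy head downstream sees only k coordinates while the encoder and the RL algorithm stay unchanged.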

2605.26001 2026-05-26 cs.CL cs.AI cs.CY Version update

AI-Assisted Systematization for Evaluating GenAI Systems

Dhruv Agarwal, Emily Sheng, Chad Atalla, Jean Garcia-Gathright, Hussein Mozannar, Hannah Washington, Alexandra Chouldechova, Solon Barocas, Hanna Wallach

Affiliations: Cornell University; Microsoft Research

AI summary: To address underspecified concepts in generative AI evaluation, proposes AI-assisted systematization, which produces measurable concept specs using a concept spec representation and a validation worksheet, and evaluates their content validity and information recoverability.

Abstract

Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. Because systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a structured representation of a systematized concept, which we call a concept spec, and a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts (hate-based rhetoric and digital empathy) and evaluate the resulting concept specs on content validity and information recoverability.

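2605.25168 2026-05-26 eess.IV cs.AI cs.CV Version update

Methodology for Creating a Clinically Verified Dermoscopic Image Dataset

Kozachok Elena Sergeevna

Affiliations: Ivannikov Institute for System Programming of the Russian Academy of Sciences

AI summary: Proposes a methodology that combines a standard operating procedure for mobile dermatoscopic image acquisition, a structured metadata information model, and multi-stage expert verification to build a clinically verified dermatoscopic image dataset for medical informatics research.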

Comments 22 pages, 5 figures, 5 tables

Abstract

This study presents a methodology for constructing a clinically verified dataset of dermatoscopic images for medical informatics research. The relevance of the work is driven by the fact that the performance of automated diagnostic support systems depends not only on the volume of images, but also on the reproducibility of the image acquisition procedure, the completeness of structured metadata, and the reliability of diagnostic labels. International collections were primarily created under conditions that differ substantially from routine Russian outpatient practice and mobile dermatoscopy. The proposed methodology integrates three interconnected components: (1) a standard operating procedure (SOP) for acquiring images via mobile dermatoscopy, (2) an information model comprising 16 structured metadata fields organized into six clinically oriented blocks in ISIC-compatible notation, and (3) a multi-stage expert verification of diagnostic labels (initial clinical annotation, consensus review by three specialists, and histological confirmation of all malignant neoplasms). Using this methodology, a dataset of 1,026 unique dermatoscopic images from 443 patients was collected between June 2025 and May 2026. From 1,044 initial records, 18 duplicates were excluded. The dataset includes nine nosological categories; all 39 malignant lesions (18 melanomas, 15 basal cell carcinomas, and 6 squamous cell carcinomas) were histologically verified. Patient age ranged from 2 to 90 years (median 38), with 279 females (63%) and 164 males (37%). Each image is accompanied by expert-annotated dermatoscopic structures and an explicit verification_stage field indicating the level of diagnostic confirmation. The resulting dataset serves as a pilot clinically verified resource suitable for independent model evaluation, domain shift analysis, interpretability studies, and further expansion.
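The counts reported in the abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers stated above; the variable names are illustrative and do not correspond to fields in the dataset.

```python
initial_records, duplicates, unique_images = 1044, 18, 1026
malignant = {
    "melanoma": 18,
    "basal cell carcinoma": 15,
    "squamous cell carcinoma": 6,
}
patients = {"female": 279, "male": 164}


def percent(part, whole):
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


checks = {
    "unique images": initial_records - duplicates == unique_images,
    "malignant lesions": sum(malignant.values()) == 39,
    "patients": sum(patients.values()) == 443,
    "female share": percent(patients["female"], 443) == 63,
    "male share": percent(patients["male"], 443) == 37,
}
```

Every entry in `checks` evaluates to True, so the record, lesion, and patient counts in the abstract are internally consistent.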

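2605.24728 2026-05-26 cs.AI Version update

SafeCtrl-RL: Inference-Time Adaptive Behavioural Control for LLM Dialogue via RL-Driven Prompt Optimisation

Michael Orme, Yanchao Yu, Zhiyuan Tan

Affiliations: School of Computing, Engineering and Building Environment

AI summary: Proposes the SafeCtrl-RL framework, which uses reinforcement learning to select prompt adjustment strategies dynamically at inference time, suppressing unsafe behaviour without retraining and improving the safety and response quality of LLM dialogue.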

Abstract

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.

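EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

Ruiqiang Xiao, Zhaohu Xing, Yijun Yang, Zhenyan Han, Weiming Wang, Kaishun Wu, Lei Zhu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Third Affiliated Hospital of Sun Yat-Sen University; Hong Kong Metropolitan University

AI summary: Proposes EchoPilot, a training-free ultrasound video segmentation framework that needs only a single point click and a category name. Scale-Space Semantic Prompting resolves initialization ambiguity and a Reliability-Gated Memory reduces propagation drift, achieving state-of-the-art performance on multiple datasets.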

Comments Early accepted to MICCAI 2026. Project page: https://keeplearning-again.github.io/EchoPilot/

Abstract

Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.

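2605.25939 2026-05-26 cs.LG cs.AI Version update

$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi

Affiliations: Torr Vision Group, University of Oxford; The Chinese University of Hong Kong, Shenzhen

AI summary: For safety monitoring of diffusion large language models, proposes $D^2$-Monitor, a bi-level dynamic monitoring framework based on hesitation-aware routing: a lightweight probe estimates hesitation in real time and triggers a higher-capacity probe when needed. On 3 datasets it achieves the best balance of performance and efficiency with 0.85M parameters.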

Abstract

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory predicts probe failure effectively, providing a proxy for sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
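The routing rule described above can be sketched as follows. This is a minimal illustration, assuming the lightweight probe emits one unsafe-probability score per denoising step and that the base classification is read from the final step. The margin, the thresholds, and the names `count_hesitation_steps` and `monitor` are hypothetical and are not taken from the paper.

```python
def count_hesitation_steps(scores, threshold=0.5, margin=0.05):
    """Count denoising steps whose probe score lies within `margin` of the boundary."""
    return sum(1 for s in scores if abs(s - threshold) <= margin)


def monitor(scores, heavy_probe, threshold=0.5, margin=0.05, max_hesitation=2):
    """Return (is_unsafe, route), where route records which probe made the decision.

    `scores` are per-step outputs of the always-on lightweight probe;
    `heavy_probe` is a zero-argument callable that is invoked only when needed.
    """
    if count_hesitation_steps(scores, threshold, margin) > max_hesitation:
        return heavy_probe(), "heavy"
    # Assumption: the base classification uses the last denoising step.
    return scores[-1] >= threshold, "light"
```

Passing the expensive probe as a callable keeps its cost at zero for trajectories that never hesitate, which is the efficiency argument the abstract makes for dynamic routing.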

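2605.25891 2026-05-26 cs.CL cs.AI Version update

Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express

Ziyi Ding, Xiao-Ping Zhang

Affiliations: Tsinghua Shenzhen International Graduate School

AI summary: Finds a mismatch between internal encoding and output in large language models on causal questions: a linear probe recovers the evidence-supported answer from hidden states (accuracy about 0.97), whereas the verbalized yes/no answer degrades to the commonsense answer (accuracy about 0.5). The resulting gap of about +0.5 is termed the "causal tongue-tie".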

2605.25856 2026-05-26 cs.HC cs.AI Version update

Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

Daniela Fernandes, Daniel Buschek, Lev Tankelevitch, Thomas Kosch, Robin Welsch

Affiliations: Aalto University; University of Bayreuth; Microsoft Research Cambridge

AI summary: A user study of how LLM reasoning traces (full or summarized) affect task performance, trust, hedonic appeal, and calibration of self-evaluation. Traces improve subjective experience but bring no performance gain, and they lead to overconfidence.

Comments 27 pages, 5 figures, 9 tables

Abstract

Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed as transparency mechanisms, yet it is unclear how people use them to solve problems. We report a preregistered between-subjects study (N = 559) in which participants solved ten LSAT-style reasoning problems under one of three conditions: an Answer-only baseline, a Full-trace revealed before the answer, and a Summary-trace presented alongside the answer. Summaries preserved task performance at the no-trace baseline while significantly elevating trust and hedonic appeal, establishing that trace exposure shifts subjective appraisal of the interaction without bringing performance benefits. Under an open-weight reasoning model exposing verbose intermediate output, full traces additionally impaired performance relative to the answer-only baseline. Across all conditions, participants substantially overestimated their performance, and no trace format supported calibrated self-evaluation. Further analysis indicates that hedonic appeal, not trust, carries the indirect path to overestimation, consistent with a processing-fluency account. Reasoning traces are best understood as user-facing interface artifacts rather than transparent windows into model cognition, and calibration is unlikely to emerge from the traces themselves and may best be scaffolded by interactions that elicit users' own reasoning first.

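2605.25854 2026-05-26 cs.AI Version update

Clarify, Abstain, or Answer? Conversational Strategies with Belief-Augmented Generation

Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández

Affiliations: University of Amsterdam; MCML Munich; LMU Munich

AI summary: Proposes Belief-Augmented Generation (BAG), which injects an LLM's own belief state into the prompt so that it reasons over multiple sampled responses and decides on a conversational strategy (answer, clarify, or abstain), improving accuracy and the faithfulness of strategy decisions in multi-turn ambiguous QA.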

Abstract

Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: sampling K responses yields a belief state, that is, the set of responses a model deems plausible. Existing work exploits this representation for narrow tasks such as decoding or selective prediction, and often requires manual intervention rather than controlling generation directly. We propose Belief-Augmented Generation (BAG): grounding LLMs in their own belief state via the prompt and letting them reason over these K samples to decide on a conversational strategy: answer, clarify, or abstain. In a multi-turn ambiguous QA setting, we find that LLMs by default rarely clarify or abstain, ignoring uncertainty about the input or facts. BAG improves QA accuracy across six models and yields strategy decisions more faithful to the belief state than prompt-only baselines. Disentangling when to clarify from when to abstain, however, remains challenging.
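The idea of feeding a model its own sampled answers can be sketched without any model call. The example below is a minimal illustration that assumes the K samples are already available as strings and that exact-match grouping after normalization is an acceptable stand-in for semantic clustering. The prompt wording and the names `belief_state` and `build_bag_prompt` are hypothetical and are not the authors' template.

```python
from collections import Counter


def belief_state(samples):
    """Collapse K sampled responses into (answer, relative frequency) pairs."""
    counts = Counter(s.strip().lower() for s in samples)
    total = sum(counts.values())
    return [(answer, n / total) for answer, n in counts.most_common()]


def build_bag_prompt(question, samples):
    """Render the belief state into a prompt that asks for a strategy choice."""
    lines = [
        f"- {answer} (sampled {share:.0%} of the time)"
        for answer, share in belief_state(samples)
    ]
    return "\n".join([
        f"Question: {question}",
        "Your own sampled answers:",
        *lines,
        "Choose one strategy: answer, clarify, or abstain.",
    ])
```

A belief state dominated by one answer suggests answering, while a flat one leaves the model to decide between clarifying and abstaining, the distinction the abstract reports as still difficult.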

2605.25829 2026-05-26 cs.RO cs.AI Version update

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

Xinzhe Chen, Sihua Ren, Liqi Huang, Haowen Sun, Mingyang Li, Xingyu Chen, Zeyang Liu, Xuguang Lan

Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence; Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

AI summary: Proposes OASIS, a visuomotor policy that aligns the intermediate representation with the action space via SE(3) end-effector trajectory prediction and outperforms VLA and WAM baselines in simulation and real-world experiments.

Abstract

Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via $SE(3)$ end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an $SE(3)$ trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.

2605.25816 2026-05-26 cs.CL cs.AI Version update

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

Pritesh Jha

AI summary: Fine-tunes DeBERTa for broad-coverage PII detection on the multi-source PIIBench data covering 82 entity types; direct fine-tuning substantially outperforms the architecturally more complex hierarchical model and the curriculum extension in F1.

DOI: 10.5281/zenodo.20379635

Abstract

Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets. We evaluate three DeBERTa-based approaches: direct token classification fine-tuning, a source-conditioned hierarchical model (SC+H), and a three-phase curriculum extension (SC+H+Curr). Against eight published comparator systems on a reproducible 5,000-record held-out subset (test_5k), direct fine-tuned DeBERTa achieves F1 0.6476, while SC+H and the curriculum variant achieve 0.5899 and 0.2772 respectively; the strongest published comparator reaches only 0.1723. Because validation initially favoured SC+H, we perform a final streamed evaluation on the complete 100,002-record held-out split. Direct fine-tuning remains superior, achieving F1 0.6455 versus 0.5894 for SC+H. Entity-level analysis shows that direct fine-tuning wins 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1, while SC+H retains localised advantages on 28 types. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.
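The comparison of coarse groups relies on support-weighted entity F1. The abstract does not define it, so the sketch below assumes the usual meaning: per-type F1 averaged with weights equal to each type's number of gold entities. The function names and the count layout are illustrative.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def support_weighted_f1(counts):
    """Average per-type F1 weighted by support.

    `counts` maps entity type -> (tp, fp, fn); support is tp + fn,
    the number of gold entities of that type.
    """
    total = sum(tp + fn for tp, _, fn in counts.values())
    if total == 0:
        return 0.0
    weighted = sum((tp + fn) * f1(tp, fp, fn) for tp, fp, fn in counts.values())
    return weighted / total
```

Weighting by support means frequent entity types dominate the group score, which is why the abstract reports per-type wins (54 of 82) separately from the group-level result.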

2605.25814 2026-05-26 cs.CL cs.AI 版本更新

Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

自适应图优化与基于大语言模型的标签传播用于经济高效实体解析

Hongtao Wang, Renchi Yang, Haoran Zheng, Xiangyu Ke

发表机构 * Hong Kong Baptist University（香港浸会大学）； Zhejiang University（浙江大学）

AI总结提出Alper框架，通过迭代概率标签传播整合匹配与聚类，自适应融合图传播弱信号与LLM强查询，在预算约束下最大化边际增益，实现高效实体解析。

AI中文摘要

脏实体解析（ER）从单个杂乱数据集中识别指向同一真实世界实体的记录，是数据管理和挖掘中的基本任务。然而，ER的主流阻塞-匹配-聚类范式存在严重缺陷。其级联、解耦的工作流本质上生成一个静态、稀疏的图，由于阻塞失败导致缺失边，由于匹配错误导致噪声链接，造成错误传播并产生次优聚类，特别是在聚类中施加严格传递性时。我们认为匹配和聚类本质上是协同的，两者都优化理想实体图的构建。基于这一见解，我们提出Alper，一个统一框架，将这些步骤整合为在全局、演化图上的迭代概率标签传播过程。与分离的阻塞不同，Alper通过自适应地整合来自图传播的“弱但廉价”信号与基于LLM的“强但昂贵”成对查询，动态优化图结构和标签。为了提高成本效益，我们将信号选择形式化为在查询预算下最大化累积边际增益的约束优化问题，通过我们的贪心算法求解，并具有可证明的理论保证。我们在八个基准数据集上的广泛实验表明，Alper始终优于最先进的级联流水线。

英文摘要

Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.
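
The budgeted signal selection described above can be pictured with a generic greedy loop: repeatedly pick the affordable candidate with the best marginal gain per unit cost. This is only a sketch under stated assumptions; the `gain` and `cost` functions, the halving rule for overlapping pairs, and the toy uncertainties are hypothetical and are not taken from Alper.

```python
def greedy_select(candidates, gain, cost, budget):
    """Pick candidates by best marginal gain per unit cost until the budget runs out."""
    chosen, spent = [], 0.0
    remaining = list(candidates)
    while remaining:
        affordable = [c for c in remaining if spent + cost(c) <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda c: gain(c, chosen) / cost(c))
        if gain(best, chosen) <= 0:
            break  # nothing left that improves the objective
        chosen.append(best)
        spent += cost(best)
        remaining.remove(best)
    return chosen, spent


# Hypothetical record pairs with an uncertainty score each.
uncertainty = {("a", "b"): 0.9, ("b", "c"): 0.8, ("d", "e"): 0.5}


def toy_gain(pair, chosen):
    """Assumed diminishing return: a pair sharing a record with a chosen pair is worth half."""
    overlaps = any(set(pair) & set(other) for other in chosen)
    return uncertainty[pair] * (0.5 if overlaps else 1.0)


selected, used = greedy_select(uncertainty, toy_gain, lambda pair: 1.0, budget=2)
print(selected, used)
```

In this toy run the second query goes to ("d", "e") rather than ("b", "c"), because the overlap with the first query reduces the latter's marginal gain.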

2605.25794 2026-05-26 cs.AI 版本更新

通过交叉注意力激活投影实现扩散模型的概念遗忘

Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim

发表机构 * CSE, POSTECH（浦项科技大学计算机科学与工程系）； GSAI, POSTECH（浦项科技大学人工智能研究生院）

AI总结提出PURE方法，利用交叉注意力激活空间构建遗忘和保留基，通过线性投影编辑权重，在保持保留概念的同时有效消除目标概念。

AI中文摘要

概念遗忘旨在从预训练的文本到图像扩散模型中擦除目标概念，而无需重新训练。闭式方法在此设置中具有吸引力，因为它们对交叉注意力权重应用单一确定性编辑，并且不增加推理时间成本。然而，现有的闭式方法通过文本编码器对少数几条直接点名该概念的简短锚定提示的响应来表示目标概念，而那些唤起该概念却不直接点名的释义提示总能绕过编辑。我们认为，目标应该改为在交叉注意力激活空间中表示。文本嵌入描述用户的提示，而交叉注意力激活描述模型即将渲染的内容，后者能够泛化到锚定模板未覆盖的释义。基于这一观察，我们提出了PURE（Projection in U-Net Rendering for Erasure，U-Net渲染中的擦除投影），这是一种闭式方法，从沿短去噪轨迹捕获的逐层交叉注意力激活构建遗忘基和保留基，并将单个线性投影器应用于交叉注意力键和值权重。在最近涵盖艺术风格、知识产权、名人和NSFW类别中十个概念的整体概念遗忘基准上，PURE显著减少了在释义和对抗性提示下的目标泄露，同时使保留概念保持接近未编辑模型，在评估方法中实现了最佳的总体遗忘-保留权衡。

英文摘要

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, and paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.
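
To make "a single linear projector applied to the weights" concrete, the sketch below removes one direction from a small weight matrix by computing W(I - uu^T). It is a deliberately reduced illustration, not PURE itself: it uses a single hypothetical forget direction instead of learned forget and retain bases, and which side of the weight matrix the projector acts on is an assumption.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project_out(W, u):
    """Return W (I - u u^T): each row loses its component along unit vector u."""
    edited = []
    for row in W:
        coeff = sum(w * x for w, x in zip(row, u))
        edited.append([w - coeff * x for w, x in zip(row, u)])
    return edited


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


forget_dir = normalize([1.0, 1.0, 0.0])  # hypothetical direction to erase
retain_dir = [0.0, 0.0, 1.0]             # orthogonal direction that should survive
W = [[2.0, 0.0, 1.0], [0.0, 3.0, -1.0]]
W_edited = project_out(W, forget_dir)

print(matvec(W_edited, forget_dir))  # close to zero in every output
print(matvec(W_edited, retain_dir))  # same as matvec(W, retain_dir)
```

Because the edit is a one-off matrix product, the edited weights cost nothing extra at inference time, which is the property the abstract highlights for closed-form methods.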

2605.25764 2026-05-26 cs.CV cs.AI 版本更新

Benchmarking Pathology Foundation Models for Spatial Domain Understanding

病理基础模型在空间域理解中的基准测试

Bokai Zhao, Yiyang Zhang, Yuanchi Zhu, Hanqing Chao, Long Bai, Tai Ma, Minfeng Xu, Ming Song, Tianzi Jiang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Brainnetome Center, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所脑网络组研究中心）； Beijing Key Laboratory of Brainnetome and Brain-Computer Interface, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所脑网络组与脑机接口北京市重点实验室）； DAMO Academy, Alibaba Group（阿里巴巴达摩院）； ShanghaiTech University（上海科技大学）

AI总结提出SpaPath-Bench基准，通过空间域识别任务评估病理基础模型在区分组织区域和捕获空间关系方面的表示能力。

Comments MICCAI2026

AI中文摘要

病理基础模型（PFMs）已成为从全切片图像（WSIs）中学习可迁移表示的核心方法，通常通过下游临床终点进行基准测试。虽然这种任务级评估不可或缺，但它们对表示本身编码了什么提供了有限的见解，特别是PFM嵌入是否能够区分有意义的组织区域并捕获其空间关系。我们提出了SpaPath-Bench，一个表示级基准，旨在诊断PFMs中的空间表示能力。SpaPath-Bench将配对全切片图像和空间转录组学（ST）数据上的空间域识别（SDI）制定为诊断任务。它整理了42个公开的配对WSI和ST切片，支持跨19个编码器和7种SDI方法的大规模评估，并使用三个互补标准衡量分区质量：无监督空间一致性、转录组学参考一致性和专家参考一致性。在83K次运行中，SpaPath-Bench揭示了不同的预训练范式捕获了组织空间架构的不同方面，并为构建下一代空间感知计算病理模型提供了实用指导。代码和数据管道公开于https://bokai-zhao.github.io/SpaPath-benchboard/。

英文摘要

Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole slide images (WSIs), and they are typically benchmarked through downstream clinical endpoints. While such task-level evaluations are indispensable, they offer limited insight into what the representations themselves encode, particularly whether PFM embeddings can distinguish meaningful tissue regions and capture their spatial relationships. We present SpaPath-Bench, a representation-level benchmark designed to diagnose spatial representation capability in PFMs. SpaPath-Bench formulates spatial domain identification (SDI) on paired whole slide image and spatial transcriptomics (ST) data as a diagnostic task. It curates 42 public paired WSI and ST slides, enables large-scale evaluation across 19 encoders and seven SDI methods, and measures partition quality using three complementary criteria: unsupervised spatial coherence, transcriptomics-referenced agreement, and expert-referenced agreement. Across 83K runs, SpaPath-Bench reveals that different pretraining paradigms capture distinct aspects of tissue spatial architecture, and it provides practical guidance for building the next generation of spatially aware computational pathology models. Code and data pipelines are publicly available at https://bokai-zhao.github.io/SpaPath-benchboard/.

2605.25749 2026-05-26 cs.IR cs.AI cs.LG 版本更新

DeGRe: Dense-supervised Generative Reranking for Recommendation

DeGRe: 密集监督的生成式重排序用于推荐

Chaotian Song, Jingyao Zhang, Chenghao Chen, Zisen Sang, Dehai Zhao, Guodong Cao, Boxi Wu, Deng Cai, Jia Jia

发表机构 * College of Software, Zhejiang University, Hangzhou, China； Rajax Network Technology, Taobao Shangou of Alibaba, Hangzhou, China； Rajax Network Technology, Taobao Shangou of Alibaba, Beijing, China； State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China； Rajax Network Technology, Taobao Shangou of Alibaba, Shanghai, China

AI总结提出DeGRe框架，通过离线探索中的密集监督信号（Lookahead Evaluator）指导在线生成器（Online Generator）进行单步贪婪解码，解决重排序中的启发式标签偏差和信用分配问题。

Comments Accepted to KDD 2026 (ADS Track)

DOI: 10.1145/3770855.3818363

AI中文摘要

在多阶段推荐系统中，重排序通过捕获列表内上下文依赖关系来优化整体效用，但其核心挑战在于在指数级排列空间中探索最优序列。最近的研究转向端到端生成式框架，通常利用列表级奖励或偏好对齐来指导生成器训练。然而，这些方法仍面临两个关键问题。首先是启发式标签偏差。现有方法通常基于简单规则构建训练目标，例如将点击项提升到顶部，而忽略列表上下文中的因果依赖关系。其次是信用分配问题。稀疏的列表级后验奖励无法直接指导序列生成中的中间步骤，导致优化方向模糊。为了解决这些问题，我们提出DeGRe（密集监督的生成式重排序），一种通过密集监督弥合离线探索与在线效率之间差距的生成式重排序框架。DeGRe的核心在于其离线-在线解耦设计。在离线阶段，我们引入基于累积回归的Lookahead Evaluator，利用束搜索在未曝光空间中主动挖掘高价值前瞻序列。在训练期间，我们将评估器的逐步价值估计转换为密集监督信号，并将其蒸馏到轻量级在线生成器中。这种机制使生成器能够内化前瞻规划能力，在线推理时仅需一次高效的贪婪解码即可逼近全局最优。实验表明，DeGRe在公开基准和工业数据集上优于基线模型。我们已成功将DeGRe部署到淘宝闪购中，显著提升了在线推荐效果。

英文摘要

In multi-stage recommender systems, reranking optimizes overall utility by capturing intra-list contextual dependencies, yet its central challenge lies in exploring optimal sequences within an exponentially large permutation space. Recent studies have shifted towards end-to-end generative frameworks, which typically leverage list-wise rewards or preference alignment to guide generator training. However, these methods still face two critical issues. First is the heuristic label bias. Existing methods often construct training targets based on simple rules, such as promoting clicked items to the top, while ignoring causal dependencies within the list context. Second is the credit assignment problem. Sparse list-level posterior rewards fail to directly guide intermediate steps in sequence generation, leading to ambiguous optimization directions. To address these issues, we propose DeGRe (Dense-supervised Generative Reranking), a generative reranking framework that bridges the gap between offline exploration and online efficiency through dense supervision. The core of DeGRe lies in its offline-online decoupled design. During the offline phase, we introduce a Lookahead Evaluator based on cumulative regression, which leverages beam search to actively mine high-value lookahead sequences in the unexposed space. During training, we transform the step-wise value estimations from the evaluator into dense supervision signals and distill them into a lightweight Online Generator. This mechanism enables the generator to internalize lookahead planning capabilities, requiring only a single efficient greedy decoding pass during online inference to approximate the global optimum. Experiments demonstrate that DeGRe outperforms baseline models on public benchmarks and industrial datasets. We have successfully deployed DeGRe on Taobao Flash Shopping, significantly improving online recommendations.

2605.25748 2026-05-26 cs.AI 版本更新

Agent-Centric Social Trajectory Prediction: A Free Energy Principle Perspective

以智能体为中心的社交轨迹预测：自由能原理视角

Yanping Wu, Ji Zhang, Hao Chen, Edmond S. L. Ho, Chongfeng Wei

发表机构 * University of Glasgow（格拉斯哥大学）； Southwest Jiaotong University（西南交通大学）

AI总结针对现有轨迹预测方法依赖全局状态、部分可观测下信念推理不足及缺乏认知行为约束的问题，提出基于自由能原理的智能体中心轨迹预测框架FEP-Diff，通过双分支时空编码器、目标条件信念学习器和残差扩散轨迹生成器，在受限可观测条件下实现认知合理的预测。

Comments 10 pages, 4 figures

AI中文摘要

轨迹预测方法在捕捉复杂运动模式方面已展现出显著能力。然而，现有方法依赖于全局状态假设，在部分可观测性下存在信念推理不足的问题，且预测中缺乏认知行为约束。这些局限性严重影响了实际部署的可行性和物理合理性。在这项工作中，我们提出了FEP-Diff，一个基于自由能原理的以智能体为中心的轨迹预测框架，旨在现实约束下实现认知合理的预测。具体来说，一个双分支时空编码器从局部观测中提取自我运动动态和社会交互线索。在此基础上，一个目标条件信念学习器推断多模态潜在信念分布，通过自由能目标进行优化，并对局部邻域图施加社会一致性约束以促进相邻智能体之间的认知对齐。最后，一个残差扩散轨迹生成器以学习到的信念表示为条件，通过令牌级代理条件，产生精确且多样化的未来预测。在五个公开基准上的大量实验表明，FEP-Diff在受限可观测性下始终优于最先进的方法。代码：https://anonymous.4open.science/r/FEP-Diff-8876。

英文摘要

Trajectory prediction methods have demonstrated remarkable capabilities in capturing complex motion patterns. However, existing methods rely on global state assumptions, suffer from insufficient belief inference under partial observability, and lack cognitive behavioral constraints in prediction. These limitations severely compromise both deployment feasibility and physical plausibility in real-world settings. In this work, we propose FEP-Diff, an agent-centric trajectory prediction framework grounded in the Free Energy Principle, aimed at achieving cognitively plausible predictions under realistic constraints. Specifically, a dual-branch spatiotemporal encoder extracts ego-motion dynamics and social interaction cues from local observations. Building upon this, a goal-conditioned belief learner infers multimodal latent belief distributions optimized via a free-energy objective, with a social consistency constraint on the local neighborhood graph to promote cognitive alignment among neighboring agents. Finally, a residual diffusion trajectory generator is conditioned on the learned belief representations with token-level proxy conditioning, producing precise and diverse future predictions. Extensive experiments on five public benchmarks demonstrate that FEP-Diff consistently outperforms state-of-the-art methods under restricted observability. Code: https://anonymous.4open.science/r/FEP-Diff-8876.

2605.25746 2026-05-26 cs.MA cs.AI 版本更新

Multi-Agent Coordination Adaptation via Structure-Guided Orchestration

基于结构引导编排的多智能体协调适应

Haoran Li, Shulun Chen, Shaoyuan Sun, Hanchen Wang

发表机构 * Nanjing University（南京大学）； University of Technology Sydney（悉尼科技大学）； University of New South Wales（新南威尔士大学）

AI总结提出MACA框架，通过概率视角将多智能体协调视为结构与编排的联合后验推断，利用任务和预算条件结构先验指导策略编排，实现高效自适应协调，性能平均提升8.42%且令牌消耗减少43.19%。

Comments 21 pages

AI中文摘要

随着基于大语言模型的多智能体系统规模扩大以处理日益复杂的任务，平衡结构稳定性和动态适应性变得越来越具有挑战性。现有系统通常采用以结构为中心的方法，坚持预先确定的结构，限制了细粒度控制；或者采用以编排为中心的方法，动态调整决策，同时使协调结构隐含且不稳定。为了解决这一挑战，我们从概率角度重新审视多智能体协调，将其视为结构和编排联合分布的后验推断。我们引入了MACA，一个自动协调框架，它学习一个任务和预算条件的结构先验，用于智能体参与和交互。该先验指导基于策略的编排作为后验推断的近似，实现了具有细粒度控制的高效解决方案。在多个基准测试中，MACA比自适应多智能体基线平均高出8.42%，同时使用的令牌数减少了43.19%。进一步研究表明，结构和编排的联合适应抑制了冗余交互，使协调收敛到任务有效的执行。

英文摘要

As large language model (LLM)-based multi-agent systems scale to handle increasingly complex tasks, balancing structural stability and dynamic adaptability becomes increasingly challenging. Existing systems typically adopt either structure-centric methods, committing to structures determined upfront that limit fine-grained control, or orchestration-centric methods, adapting decisions dynamically while leaving coordination structure implicit and unstable. To address this challenge, we revisit multi-agent coordination from a probabilistic perspective, casting it as posterior inference over the joint distribution of structure and orchestration. We introduce MACA, an automated coordination framework that learns a task- and budget-conditioned structural prior over agent participation and interactions. This prior guides a policy-based orchestration as an approximation to posterior inference, enabling efficient solutions with fine-grained control. Across benchmarks, MACA outperforms adaptive multi-agent baselines by an average of 8.42% while using 43.19% fewer tokens. Further investigation reveals that joint adaptation of structure and orchestration suppresses redundant interactions, converging coordination toward task-effective execution.

2605.25735 2026-05-26 cs.AI 版本更新

A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

公理化设计的深度剖析——第一部分：问题表述

Aydin Homay

发表机构 * Technische Universität Dresden（德累斯顿理工大学）

AI总结本文聚焦公理化设计中的问题表述步骤，澄清一级功能需求的定义与特性，分析常见误区与困难，并提供实用指导，最后探讨大语言模型在该步骤中的作用。

Comments Accepted at ICAD 2026 (MIT); the final camera-ready version will be available once it is published by Springer

AI中文摘要

问题表述——将客户需求和约束转化为最小的一组独立的一级功能需求——可以说是每个设计框架中最关键的步骤，包括公理化设计，然而在实践中它经常被误解或低估。本文专门关注公理化设计中的问题表述，澄清一级FR是什么（以及不是什么），解释为什么在给定的相同需求和约束下，它们不应在不同设计者之间合理变化，并强调导致设计失败的内在困难和反复出现的陷阱。讨论主要基于Nam P. Suh的三本书：《设计原理》、《公理化设计：进展与应用》和《复杂性理论》，并提供实用指导，帮助设计者制定适定的一级FR。最后，本文简要回顾了大语言模型时代的问题表述，并讨论了此类工具在一级层面上能够（以及不能）做出什么贡献。

英文摘要

Problem formulation, which translates customer needs and constraints into a minimum set of independent first-level functional requirements (FRs), is arguably the most critical step in every design framework, including axiomatic design, yet it is frequently misunderstood or underestimated in practice. This paper focuses exclusively on problem formulation in axiomatic design: it clarifies what first-level FRs are (and are not), explains why they should not legitimately vary across designers given the same needs and constraints, and highlights intrinsic difficulties and recurring pitfalls that lead to design failure. The discussion is grounded primarily in Nam P. Suh's three books, The Principles of Design, Axiomatic Design: Advances and Applications, and Complexity Theory, and it offers practical guidance to help designers formulate well-posed first-level FRs. Finally, the paper briefly revisits problem formulation in the era of large language models and discusses what such tools can (and cannot) contribute at the first level.

2605.25720 2026-05-26 cs.AI 版本更新

Learning to Search and Searching to Learn for Generalization in Planning

学习搜索与搜索学习以实现规划中的泛化

Michael Aichmüller, Yannik Hesse, Hector Geffner

发表机构 * Department of Machine Learning and Reasoning, RWTH Aachen University（亚琛工业大学机器学习与推理系）

AI总结提出一种结合关系图神经网络值启发式的自改进WA*学习框架，通过搜索引导和Q学习更新启发式，实现零样本泛化，在多个规划任务中优于深度强化学习。

Comments Accepted at ICML 2026

AI中文摘要

组合泛化仍然是深度强化学习（DRL）中的一个核心挑战。经典规划通过显式关系描述为研究这一问题提供了一个简单但具有挑战性的环境，无需从感知中学习。在稀疏奖励领域中，通过实时搜索的标准RL探索效率低下，而基于学习的规划方法通常依赖于专家演示、事后重标或从目标状态开始的随机游走。相比之下，规划器依赖于最佳优先搜索方法（如$\mathrm{A}^\star$）从头开始解决问题。我们提出了一种自改进的$\mathrm{WA}^\star$学习框架，结合由关系图神经网络表示的值启发式：启发式引导搜索，产生的搜索数据通过$Q$-学习更新启发式。这个循环产生了可以作为通用策略的启发式，并且即使在没有搜索的情况下也能解决新实例，而DRL在其他情况下会失败，正如我们在Sokoban、PushWorld、The Witness以及2023年国际规划竞赛基准等谜题上所展示的。值得注意的是，我们展示了强大的零样本泛化能力：例如，在少于30个块的Blocksworld实例上训练的启发式，无需搜索即可成功解决包含488个块的实例。

英文摘要

Combinatorial generalization remains a central challenge in Deep Reinforcement Learning (DRL). Classical planning provides a simple yet challenging setting to study this problem through explicit relational descriptions, without requiring learning from perception. In sparse-reward domains, standard RL exploration via real-time search is ineffective, and learning-based planning methods often rely on expert demonstrations, hindsight relabeling, or random walks from the goal state. In contrast, planners rely on best-first search methods such as $\mathrm{A}^\star$ to solve problems from scratch. We propose a self-improving $\mathrm{WA}^\star$ learning framework in combination with a value heuristic represented by a Relational Graph Neural Network: the heuristic guides search, and the resulting search data updates the heuristic via $Q$-learning. This loop yields heuristics that can function as general policies and solve new instances even without search, where DRL otherwise fails, as we show on puzzles such as Sokoban, PushWorld, The Witness, and the 2023 International Planning Competition benchmarks. Notably, we demonstrate strong zero-shot generalization: For example, heuristics trained on Blocksworld instances with fewer than 30 blocks successfully solve instances with 488 blocks without search.
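
For readers unfamiliar with weighted A* (WA*), the search component can be sketched as a best-first search ordered by f = g + w·h. The version below uses a hand-written Manhattan-distance heuristic on a toy grid; in the paper the heuristic is a learned Relational Graph Neural Network updated by Q-learning, which is not reproduced here. The grid, the weight `w=2.0`, and all function names are assumptions for illustration.

```python
import heapq


def weighted_astar(start, goal, neighbors, h, w=2.0):
    """Best-first search ordered by f = g + w * h; returns (cost, path) or None."""
    frontier = [(w * h(start), 0, start)]
    best_g = {start: 0}
    parent = {start: None}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry superseded by a cheaper path
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return g, path[::-1]
        for nxt, step in neighbors(node):
            new_g = g + step
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(frontier, (new_g + w * h(nxt), new_g, nxt))
    return None


def grid_neighbors(node, size=4):
    """Four-connected moves on an empty size x size grid, unit step cost."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny), 1


goal = (3, 3)
cost, path = weighted_astar((0, 0), goal, grid_neighbors,
                            h=lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1]))
print(cost, path)
```

With w > 1 the search leans harder on the heuristic and expands fewer nodes, at the price of a solution that is only guaranteed to be within a factor w of optimal when h is admissible.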

2605.25717 2026-05-26 cs.AI cs.CE cs.LG 版本更新

FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue

FLOATBench：浮式海上风力发电机塔架疲劳数据集与基准

João Alves Ribeiro, Bruno Alves Ribeiro, Francisco Pimenta, Sérgio M. O. Tavares, Faez Ahmed

发表机构 * Department of Mechanical Engineering（机械工程系）； Massachusetts Institute of Technology（麻省理工学院）； School of Engineering（工程学院）； Brown University（布朗大学）； CONSTRUCT, Faculty of Engineering, University of Porto（波尔图大学工程学院CONSTRUCT研究所）； University of Aveiro（阿威罗大学）

AI总结提出FLOATBench，一个包含582,120个疲劳损伤标签的表格基准，基于22 MW浮式风机塔架的高保真仿真，并引入工况感知的评估协议以检测随机划分无法发现的性能排名变化。

AI中文摘要

全球大部分海上风能资源位于水深过大、无法使用固定式基础的海域，因此浮式海上风力发电机（FOWT）对于深水部署至关重要。随着行业向22 MW级设计规模发展，塔架疲劳变得愈发关键，因为更大的结构会放大由持续风浪激励引起的耦合气动-水动-伺服-弹性载荷。准确的疲劳损伤预测对于认证、设计优化和成本降低至关重要。然而，该领域缺乏共享的替代模型基准：不同研究报告了不同的仿真、划分和指标，使得方法难以比较。我们提出FLOATBench，一个公开的表格基准，包含三种22 MW FOWT塔架几何形状的582,120个逐截面疲劳损伤标签，这些标签来自三种塔架的19,404次高保真OpenFAST仿真（每种塔架6,468次：1,078个对齐风浪工况点×六个湍流种子），每种塔架在30个截面上进行标注。FLOATBench包括一个基于工况感知的联合风浪运行包络的alpha-shape划分，将测试点分为训练内、插值和外推区域。它配备了一个可复现的评估框架，涵盖三个协议级别：随机验证（E1）、塔内工况感知评估（E2）和跨塔迁移（E3）。工况感知协议揭示了全局性能与外推性能之间的排名变化，而随机划分排行榜无法检测到这些变化。据作者所知，FLOATBench是首个用于表格替代建模的FOWT疲劳基准，并提供了一个可推广到定义在物理运行包络上的工程替代模型的评估协议。数据集和代码可在以下网址获取：https://github.com/Joao97ribeiro/FLOATBench。

英文摘要

Most of the world's offshore wind resource lies in waters too deep for fixed-bottom foundations, making floating offshore wind turbines (FOWTs) essential for deep-water deployment. As the industry scales toward $22$ MW class designs, tower fatigue becomes increasingly critical because larger structures amplify the coupled aero-hydro-servo-elastic loads induced by continuous wind and wave excitation. Accurate fatigue-damage prediction is therefore central to certification, design optimization, and cost reduction. Yet the field lacks a shared surrogate benchmark: studies report different simulations, splits, and metrics, making methods difficult to compare. We present FLOATBench, a public tabular benchmark with $582{,}120$ per-section fatigue-damage labels across three $22$ MW FOWT tower geometries, derived from $19{,}404$ high-fidelity OpenFAST simulations across the three towers ($6{,}468$ per tower: $1{,}078$ aligned wind/wave operating points $\times$ six turbulence seeds), labeled at $30$ cross-sections per tower. FLOATBench includes a regime-aware alpha-shape partition of the joint wind/wave operating envelope, stratifying test points into in-train, interpolation, and extrapolation regimes. It is paired with a reproducible evaluation harness covering three protocol levels: random validation (E1), within-tower regime-aware evaluation (E2), and cross-tower transfer (E3). The regime-aware protocol reveals rank shifts between global and extrapolation performance that random-split leaderboards cannot detect. To the authors' knowledge, FLOATBench is the first FOWT fatigue benchmark for tabular surrogate modeling, and offers an evaluation protocol that generalizes to engineering surrogates defined over physical operating envelopes. Dataset and code available at: https://github.com/Joao97ribeiro/FLOATBench.
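
The dataset sizes quoted above are mutually consistent, which a few lines of arithmetic confirm. The variable names are chosen for this check only; the numbers come directly from the abstract.

```python
operating_points = 1078   # aligned wind/wave operating points
seeds = 6                 # turbulence seeds per operating point
towers = 3                # 22 MW tower geometries
sections = 30             # labeled cross-sections per tower

sims_per_tower = operating_points * seeds   # simulations for one tower
total_sims = sims_per_tower * towers        # simulations across all towers
labels = total_sims * sections              # per-section fatigue-damage labels

print(sims_per_tower, total_sims, labels)
```

The product chain reproduces the 6,468 simulations per tower, the 19,404 simulations in total, and the 582,120 labels reported for the benchmark.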

2605.25707 2026-05-26 cs.AI 版本更新

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

AgentHijack：基准测试计算机使用智能体对常见环境干扰的鲁棒性

Jingwei Sun, Jianing Zhu, Yuanyi Li, Tongliang Liu, Xia Hu, Bo Han

发表机构 * TMLR Group, Hong Kong Baptist University（香港浸会大学TMLR课题组）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Sydney AI Centre, The University of Sydney（悉尼大学悉尼人工智能中心）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结提出AgentHijack基准，通过9种可配置的常见环境干扰评估多模态大语言模型驱动的计算机使用智能体的鲁棒性，并设计AgentHijack-Agent框架提升其抗干扰能力。

Comments accepted by ICML 2026

AI中文摘要

由多模态大语言模型（MLLM）驱动的自主计算机使用智能体正在成为完成复杂数字工作流的得力助手。然而，真实世界的执行环境远非理想：弹出窗口、分辨率变化和竞争性应用频繁干扰智能体的感知和控制。我们引入了AgentHijack，一个旨在评估计算机使用智能体在常见干扰下鲁棒性的基准，其中动态环境中的不确定性在没有直接对抗意图的情况下破坏执行流程。具体来说，AgentHijack引入了9种可配置的常见干扰来复现现实的不完美场景。我们评估了多种利用基于MLLM的智能体的桌面任务，发现即使是微小的干扰实例也会导致显著的性能下降，这强调了智能体的脆弱性以及鲁棒性评估的必要性。随后，我们提出了AgentHijack-Agent，一个将具有增强基础能力的动作生成器与负责行为总结和环境检查的旁观者相结合的框架。大量实验验证了其有效性。我们的代码、环境、基线模型和数据公开于：https://AgentHijack.github.io。

英文摘要

Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate MLLM-based agents on a variety of desktop tasks and find that even minor corruptions can cause substantial performance degradation, which highlights the fragility of agents and underscores the necessity of robustness evaluation. We then propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models, and data are publicly available at: https://AgentHijack.github.io.

2605.25698 2026-05-26 cs.LG cs.AI 版本更新

How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

LLM应如何消费高质量数据？通过质量感知的功能缩放定律实现最优数据调度

Zhitao Zhu, Xili Wang, Shizhe Wu, Jiawei Fu, Xiaoqing Liu

发表机构 * Peking University（北京大学）； Meituan（美团）

AI总结本文通过引入数据质量维度扩展功能缩放定律，解析求解了联合数据质量和批次大小调度问题，揭示了高质量数据的双重角色，并提出了Drop-Stable-Rampup调度策略，在15B MoE模型上相比WSD和余弦衰减分别提升平均准确率+1.70和+2.98。

AI中文摘要

高质量数据在大语言模型训练中稀缺，但如何联合训练动态调度其使用缺乏理论指导。我们通过引入数据质量维度扩展功能缩放定律，并以渐近闭式形式求解了联合数据质量和批次大小调度问题。该解揭示了两个阶段和高质量数据的双重角色。在噪声受限阶段，高质量数据应作为信号放大器：降低批次大小将更清洁的数据转换为更多信号而不放大噪声。在信号受限阶段，它应作为噪声抑制器：后期放置可减少终端噪声而不牺牲信号积累。现有的课程式流程主要利用第二个角色，将更清洁的数据放在后期，但忽略了第一个角色，因为传统的衰减调度在高质量数据可用时恰好降低了更新强度。受此启发，我们为LLM中期训练提出了Drop-Stable-Rampup：在质量转换时，降低批次大小，保持稳定以积累信号，然后逐渐增加以抑制终端噪声。在一个在108B tokens上中期训练的15B混合专家模型上，Drop-Stable-Rampup相比Warmup-Stable-Decay (WSD)平均准确率提升+1.70，相比余弦衰减提升+2.98，在数学推理基准如GSM8K (+4.23)和MATH (+2.80)上增益尤其显著。

英文摘要

High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theoretical guidance. We extend functional scaling laws by incorporating a data-quality dimension, and solve the joint data-quality and batch-size scheduling problem in asymptotic closed form. The solution reveals two regimes and a dual role of high-quality data. In the noise-limited regime, high-quality data should be used as a signal amplifier: lowering the batch size converts cleaner data into more signal without amplifying noise. In the signal-limited regime, it should be used as a noise suppressor: late placement reduces terminal noise without sacrificing signal accumulation. Existing curriculum-style pipelines primarily exploit the second role by placing cleaner data late, but miss the first role because conventional decay schedules reduce update intensity exactly when high-quality data becomes available. Guided by this, we propose Drop-Stable-Rampup for LLM midtraining: upon the quality transition, drop the batch size, hold it stable to accumulate signal, then ramp up to suppress terminal noise. On a 15B Mixture-of-Experts model midtrained on 108B tokens, Drop-Stable-Rampup improves average accuracy over Warmup-Stable-Decay (WSD) by +1.70 and over Cosine-decay by +2.98, with particularly large gains on mathematical reasoning benchmarks such as GSM8K (+4.23) and MATH (+2.80).
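
The shape of a Drop-Stable-Rampup batch-size schedule can be sketched as a piecewise function of the training step: a base batch size before the quality transition, a drop at the transition, a stable plateau, then a ramp up toward the end of training. The abstract does not give concrete batch sizes or the ramp shape, so the values below and the linear ramp are purely assumptions.

```python
def drop_stable_rampup(step, transition, stable_steps, total_steps,
                       base=1024, low=256, high=2048):
    """Return the batch size at a given step (all sizes are hypothetical)."""
    if step < transition:
        return base                      # before high-quality data arrives
    ramp_start = transition + stable_steps
    if step < ramp_start:
        return low                       # drop, then hold stable to accumulate signal
    # Linear ramp (an assumed shape) from low to high over the remaining steps.
    frac = (step - ramp_start) / max(1, total_steps - ramp_start)
    return round(low + min(1.0, frac) * (high - low))


schedule = [drop_stable_rampup(s, transition=100, stable_steps=100, total_steps=300)
            for s in (0, 100, 199, 250, 300)]
print(schedule)
```

The contrast with a decay-style schedule is the point of the method: update intensity is raised, not lowered, at the moment cleaner data becomes available, and noise suppression is deferred to the final ramp.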

2605.25682 2026-05-26 cs.DC cs.AI 版本更新

Profiling-Driven Adaptive Distributed Transformer Inference on Embedded Edge Deployment

面向嵌入式边缘部署的剖析驱动自适应分布式Transformer推理

Muhammad Azlan Qazi, Alexandros Iosifidis, Qi Zhang

发表机构 * Aarhus University（奥胡斯大学）； Tampere University（坦佩雷大学）

AI总结通过结合分段均值压缩和轻量级离线剖析，自适应地在运行时选择本地或分布式执行，解决了嵌入式设备上分布式Transformer推理中CPU-GPU通信瓶颈问题，相比全张量交换降低了65%-77%延迟和34%-52%能耗。

DOI: 10.1145/3812836.3814999

AI中文摘要

将Transformer推理分布在嵌入式边缘设备上可以缓解单个内存和计算约束，但在实际硬件上的实际益处仍不明确：先前的工作主要依赖于忽略硬件特定通信开销的模拟。我们在通过WiFi连接的NVIDIA Jetson Orin Nano设备上进行了硬件原型研究。我们的关键发现是，主要瓶颈不仅是网络带宽，还有通信期间的CPU-GPU暂存。由于Jetson的集成GPU架构缺乏NCCL所需的PCIe/NVLink路径，所有设备间数据通信应通过GLOO路由并在CPU内存中暂存；这种开销随通信数据量扩展，使得对于中等规模模型（如ViT），全张量交换比单设备推理更慢。因此，我们通过结合分段均值压缩与轻量级离线剖析来评估Prism，以在运行时自适应地选择本地或分布式执行。实验表明，相对于静态分布式执行设置中的全张量交换，该策略将延迟降低了65%-77%，能耗降低了34%-52%，证明了剖析驱动自适应对于嵌入式硬件上的实际分布式Transformer推理至关重要。

英文摘要

Distributing Transformer inference across embedded edge devices can alleviate individual memory and compute constraints, yet practical benefits on real hardware remain unclear: prior work relies largely on simulations that overlook hardware-specific communication overheads. We present a hardware prototype study on NVIDIA Jetson Orin Nano devices connected over WiFi. Our key finding is that the dominant bottleneck is not just network bandwidth but also CPU-GPU staging during communication. Because Jetson's integrated GPU architecture lacks the PCIe/NVLink pathway that NCCL requires, all inter-device data communication must be routed through GLOO and staged in CPU memory, an overhead that scales with communication data volume and makes full-tensor exchange slower than single-device inference across batch sizes for medium-sized models such as ViT. We therefore evaluate Prism by combining Segment Means compression with lightweight offline profiling to adaptively select between local and distributed execution at runtime. Experiments show that this strategy reduces latency by 65%-77% and energy consumption by 34%-52% relative to full-tensor exchange in a static distributed execution setup, demonstrating that profiling-driven adaptation is essential for practical distributed Transformer inference on embedded hardware.
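
The two ingredients named above can be illustrated with a small sketch: compressing a token sequence by averaging contiguous segments before it is exchanged, and picking local or distributed execution from an offline latency profile. The abstract does not define Segment Means or the profile format, so the averaging scheme, the table layout, and every latency number below are assumptions.

```python
def segment_means(tokens, num_segments):
    """Average contiguous groups of token vectors into num_segments vectors."""
    n = len(tokens)
    if not 0 < num_segments <= n:
        raise ValueError("num_segments must be between 1 and the number of tokens")
    bounds = [round(i * n / num_segments) for i in range(num_segments + 1)]
    out = []
    for lo, hi in zip(bounds, bounds[1:]):
        seg = tokens[lo:hi]
        out.append([sum(col) / len(seg) for col in zip(*seg)])
    return out


def choose_mode(profile, batch_size):
    """Pick the execution mode with the lowest profiled latency for this batch size."""
    modes = profile[batch_size]
    return min(modes, key=modes.get)


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
compressed = segment_means(tokens, num_segments=2)

# Hypothetical offline profile: latency in milliseconds per batch size and mode.
profile = {1: {"local": 40.0, "distributed": 55.0},
           8: {"local": 300.0, "distributed": 210.0}}
print(compressed, choose_mode(profile, 1), choose_mode(profile, 8))
```

Halving the number of exchanged vectors halves the data staged through CPU memory, which is the overhead the study identifies as dominant on this hardware.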

2605.25681 2026-05-26 cs.LG cs.AI 版本更新

Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models

不要重新训练，只需重用：从单目标扩散模型中恢复双目标分子

Qingyuan Zeng, Pengxiang Cai, Zixin Guan, Ziyang Chen, Anglin Liu, Lang Qin, Xinyao Lai, Jintai Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Guangzhou University of Chinese Medicine（广州中医药大学）

AI总结提出REUSE框架，通过层次化进化输入空间搜索，从冻结的单目标扩散模型中恢复双目标分子，无需重新训练或修改扩散过程，在双目标亲和力上提升20.9个百分点。

AI中文摘要

设计一个能调节两个靶点的单一分子是多药理学中一种有前景的策略，但它比标准的单目标生成要困难得多，因为一个候选分子必须满足两个结合要求，同时保持药物相似性和可合成性。现有的双目标生成方法通常通过在采样期间重新训练生成器或干预扩散过程来引入双目标能力。前者在双目标监督稀疏时可能成本高昂且难以稳定，而后者可能对去噪时的目标平衡和竞争性更新方向敏感。这些局限性促使我们寻找一种保持生成器不变的替代方案：能否在不修改参数或去噪动态的情况下，从冻结的单目标扩散模型的输入空间中恢复双目标候选分子？我们将此任务表述为一个受约束的多目标优化问题，并提出REUSE，一种层次化进化输入空间搜索框架，结合配对条件探索和结构化多阶段选择，以强制执行双目标亲和力、化学质量和多样性。实验表明，与修改扩散过程的方法相比，REUSE持续改善了双目标亲和力和平衡性，在双高亲和力指标上比最强基线提高了20.9个百分点，同时保持了竞争性的分子质量。

英文摘要

Designing a single molecule that modulates two targets is a promising strategy for polypharmacology, but it remains substantially harder than standard single-target generation because one candidate must satisfy two binding requirements while preserving drug-likeness and synthesizability. Existing dual-target generative methods typically introduce dual-target capability by either retraining the generator or intervening in the diffusion process during sampling. The former can be costly and difficult to stabilize when dual-target supervision is sparse, while the latter may be sensitive to denoising-time target balancing and competing update directions. These limitations motivate a generator-preserving alternative that keeps the pretrained prior intact: can dual-target candidates instead be recovered from the input space of a frozen single-target diffusion model, without modifying its parameters or denoising dynamics? We formulate this task as a constrained multi-objective optimization problem and propose REUSE, a hierarchical evolutionary input-space search framework that combines pair-conditioned exploration with structured multi-stage selection to enforce dual-target affinity, chemical quality, and diversity. Experiments show that, compared with methods that modify the diffusion process, REUSE consistently improves dual-target affinity and balance, achieving a 20.9-percentage-point gain in Dual High Affinity over the strongest prior baseline while maintaining competitive molecular quality.

URL PDF HTML ☆

赞 0 踩 0

2605.25680 2026-05-26 cs.CL cs.AI 版本更新

Simulating Human Memory with Language Models

用语言模型模拟人类记忆

Qihan Wang, Nicholas Tomlin, Michael Hu, Brian Dillon, Tal Linzen

发表机构 * NYU（纽约大学）； UMass Amherst（马萨诸塞大学阿姆赫斯特分校）

AI总结本研究通过心理学经典记忆实验对比语言模型与人类记忆，发现未经调优的模型记忆优于人类，但通过提示策略和压缩器可使模型遗忘方式更接近人类，从而在下游教育任务中成为更有效的用户模拟器。

2605.25673 2026-05-26 cs.CR cs.AI 版本更新

Referential Security as a New Paradigm for AI Evaluations

引用安全性作为AI评估的新范式

Dan Ristea, Vasilios Mavroudis

发表机构 * University College London（伦敦大学学院）； Alan Turing Institute（艾伦·图灵研究所）； King's College London（伦敦国王学院）

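AI summary: To address the instability of evaluation identifiers caused by continuously updated AI systems, the paper proposes a referential security paradigm that treats model identity as a verifiable property, supporting reproducible evaluation, longitudinal audit validity, and cross-provider equivalence.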

AI中文摘要

安全评估本质上依赖于稳定的标识符。任何发现、审计或监管决策必须始终附属于其所涉及的具体工件。持续更新的人工智能系统违反了这一核心假设，公开的模型名称保持不变，而底层权重、提示、检索机制、滥用分类器、推理设置和服务基础设施却未经宣布地修改。因此，当前的评估常常适用于表面标签而非可识别和不同的系统。为了解决这个问题，我们提出引用安全性作为AI评估的新范式。基本安全问题不仅涉及模型是否安全，还涉及后续方能否最终确定特定安全声明所针对的是哪个系统。这种方法将模型身份重新定义为经验上可验证的属性，并将引用稳定性与其所制约的实质性安全声明分开。该框架为当前实践处理不善的三个关键工作流带来了可处理性。具体来说，它实现了可重复评估、纵向审计有效性和跨提供商等价性。通过将这些评估建立在可验证的工件上，我们的方法确保安全审计和监管发现在动态系统的整个操作生命周期中保持其实证效用。

英文摘要

Security evaluations inherently depend on stable identifiers. Any finding, audit, or regulatory decision must remain attached to the specific artifact it pertains to. Continuously updated artificial intelligence systems violate this core assumption, with public model designations remaining static while underlying weights, prompts, retrieval mechanisms, misuse classifiers, inference settings, and serving infrastructures undergo unannounced modifications. Consequently, current evaluations frequently apply to superficial labels rather than identifiable and distinct systems. To resolve this, we propose referential security as a new paradigm for AI evaluation. The fundamental security question extends beyond whether a model is safe to whether subsequent parties can conclusively determine which system a specific safety claim addressed. This approach reframes model identity as an empirically verifiable property and separates referential stability from the substantive security claims it conditions. This framework brings tractability to three critical workflows that current practices handle poorly. Specifically, it enables reproducible evaluation, longitudinal audit validity, and cross-provider equivalence. By grounding these evaluations in verifiable artifacts, our approach ensures that safety audits and regulatory findings maintain their empirical utility across the operational lifecycle of dynamic systems.


2605.25665 2026-05-26 cs.SE cs.AI 版本更新

Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report

面向AI原生软件生产的元工程框架：一种基于合约的对抗性验证架构及早期部署报告

Satadru Sengupta, Tamunokorite Briggs, Ivan Myshakivskyi

Affiliations: HireNimbus

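AI summary: Proposes a meta-engineering harness that uses explicit contracts, role-specialized AI agents, and adversarial verification to continuously produce, verify, and improve AI-native software, and reports an early deployment covering 17 features in a CTO-as-a-service setting for small service firms.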

Comments 17 pages, 2 figures, early deployment report

AI中文摘要

AI原生软件开发通常在单个模型、提示或生成工件的层面进行评估。这种框架对于生产环境是不够的，在这些环境中，软件必须在多个操作上下文和长时间跨度内持续生产、验证、部署、维护和适应。我们提出了一种元工程框架：一种软件生产架构，它将操作和产品特性需求转化为明确的合约，通过角色专业化的AI代理分配工作，执行独立和对抗性验证，并通过结构化失败分类和外环校准持续自我改进。该框架专为软件交付不是一次性项目而是持续运营功能的场景设计。在我们的激励应用——面向小型服务公司的CTO即服务中，该系统将网站、预订流程、支付系统、后台工作流自动化和AI代理接口作为持续演进的技术基础设施进行管理，而非一次性交付物。我们描述了分层架构，包括两遍合约编译、带有专业化记录的持久化Markdown记忆、基于注意力和独立性的验证、四路失败仲裁器以及外环校准。我们报告了早期生产部署的结果，该部署跨越数周，涵盖17项功能，包括一个详细的应用内支付案例研究，揭示了合约不完整性和验证边界问题。这些观察直接推动了框架的针对性改进。贡献在于实现了一个可测量、可扩展的验证架构，使AI原生服务即软件生产变得可靠、可审计且可随时间改进。

英文摘要

AI-native software development is often evaluated at the level of individual models, prompts, or generated artifacts. This framing is insufficient for production environments where software must be continuously produced, verified, deployed, maintained, and adapted across many operational contexts and long time horizons. We present a meta-engineering harness: a software-production architecture that transforms operational and product feature requirements into explicit contracts, routes work through role-specialized AI agents, performs independent and adversarial verification, and continuously improves itself through structured failure classification and outer-loop calibration. The harness is designed for settings in which software delivery is not a one-time project but an ongoing operating function. In our motivating application, CTO-as-a-service for small service firms, the system manages websites, booking flows, payment systems, backoffice workflow automations, and AI-agent interfaces as continuously evolving technical infrastructure rather than one-off deliverables. We describe the layered architecture, including two-pass contract compilation, persistent markdown memory with specialization records, attention-based and independence-based verifications, a four-way failure arbiter, and outer-loop calibration. We report results from an early production deployment spanning 17 features over several weeks, including a detailed in-app payments case study that revealed contract incompleteness and verification-boundary issues. These observations directly drove targeted improvements to the harness. The contribution is an implemented, measurable, and extensible verification architecture for making AI-native service-as-a-software production reliable, auditable, and improvable over time.


2605.25664 2026-05-26 cs.HC cs.AI cs.AR cs.CY 版本更新

Posture Clip: Sit properly or I wont let you work

Posture Clip：坐姿端正，否则不让你工作

Arka Majhi, Aparajita Mondal

Affiliations: Faculty of Information Technology and Communication Sciences; Tampere University; School of Forest Sciences; University of Eastern Finland

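AI summary: Proposes PostureClip, a collar-clipped device that discourages working in a slouched position by blacking out the screen and restoring it once posture is corrected; a controlled experiment shows significant improvement in posture angle and a reduction in slouching duration.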

Comments Published online by Cambridge University Press on 14 May 2026

DOI: 10.1017/wtc.2026.10041
Journal ref: Wearable Technologies, 7, e5 (2026)

AI中文摘要

不良姿势因其对健康和生产率的有害影响而成为一个重要问题。本文提出了一种名为PostureClip的衣夹式设备，旨在通过黑屏并在纠正姿势后恢复屏幕，限制用户以弯腰角度坐着工作，从而促进更好的姿势。该设备集成了传感器和反馈机制，为用户提供实时姿势反馈。为了评估PostureClip的有效性，进行了一项对照实验，参与者（n=165）每天使用笔记本电脑/个人电脑工作超过6小时。参与者被随机分配到干预组（IG1，n=54；IG2，n=55），使用衣夹式设备，以及对照组（CG，n=56），不使用该设备。IG1未收到反馈，而IG2通过通知并进一步使屏幕变暗从设备获得反馈。研究在参与者的办公室环境中进行，持续4周，收集了姿势角度、弯腰持续时间以及用户反馈等指标。分析显示，与无反馈组和对照组（未干预）相比，使用带反馈的PostureClip的参与者组在姿势角度上有显著改善（p<0.001），弯腰持续时间显著减少（p<0.01）。用户反馈的定性分析强调了该设备的易用性、提供及时反馈的有效性以及对参与者姿势意识和习惯的积极影响。这些结果表明，PostureClip是促进久坐工作中更好姿势的有效工具。

英文摘要

Poor posture is a significant concern due to its detrimental effects on health and productivity. This paper presents a collar-clipped device called PostureClip, designed to keep users from sitting and working at a bent angle by blacking out the screen and restoring it once posture is corrected, thereby promoting better posture. The device integrates sensors and feedback mechanisms to provide real-time posture feedback to users. To evaluate the effectiveness of PostureClip, a controlled experiment was conducted with participants (n=165) who worked on a laptop/PC for over 6 hours per day. The participants were randomly assigned to one of two intervention groups (IG1, n=54; IG2, n=55), which used the collar-clipped device, or to the control group (CG, n=56), which did not use the device. IG1 received no feedback, while IG2 received feedback from the device through notifications followed by darkening of the screen. The study was conducted in the participants' office environment for 4 weeks, and metrics such as posture angle, duration of bent angle, and user feedback were collected. Analysis revealed significant improvements in posture angle (p<0.001) and a significant reduction in bent-angle duration (p<0.01) for the group using PostureClip with feedback, compared with the group without feedback and the control group (which received no intervention). The qualitative analysis of user feedback highlighted the device's ease of use, effectiveness in providing timely feedback, and positive impact on participants' awareness and habits regarding posture. These results indicate that PostureClip is an effective tool for promoting better posture during sedentary work.


2605.25658 2026-05-26 cs.CL cs.AI 版本更新

AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization

AutoSG: 仅从任务提示出发的LLM驱动的昂贵优化求解器生成

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang

Affiliations: Xidian University; Victoria University of Wellington

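AI summary: Proposes the AutoSG framework, which generates executable customized solvers directly from natural-language prompts using retrieval-augmented generation, one-step self-refinement, and an instance-free evaluation mechanism, addressing hallucination, structural dismantling, and evaluation cost in expensive optimization.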

AI中文摘要

昂贵优化任务在现实应用中普遍存在，需要高度专业化的求解器。虽然LLM驱动的自动求解器生成显示出前景，但当前范式在处理昂贵优化时面临三个关键问题：由于领域知识不足导致的事实幻觉、在细化过程中频繁破坏先前建立的局部最优结构，以及在训练实例上执行带来的高昂评估成本和受限的泛化能力。为了解决这些问题，我们引入了AutoSG，一个完全自动化的流程，直接将自然语言提示转换为可执行的定制求解器。AutoSG具有三个核心创新：一个检索增强的求解器生成模块，严格将代码基于经过验证的文献；一个单步自优化算子，在保留关键结构组件的同时引入特定任务的改进；以及一个基于Elo的无实例LLM-as-a-Judge评估机制，快速建立全局排名。在多种昂贵优化任务上的广泛评估证实，AutoSG显著优于人工设计的最先进框架和现有的LLM生成的求解器。

英文摘要

Expensive optimization tasks are ubiquitous in real-world applications, demanding highly specialized solvers. While LLM-driven automated solver generation shows promise, current paradigms face three critical issues when tackling expensive optimization: factual hallucinations due to deficient domain knowledge, the frequent dismantling of previously established locally optimal structures during refinement, and the prohibitive evaluation costs alongside restricted generalization caused by executing on training instances. To address these issues, we introduce AutoSG, a fully automated workflow directly translating natural language prompts into executable customized solvers. AutoSG features three core innovations: a retrieval-augmented solver generation module strictly grounding code in verified literature; a one-step self-refinement operator introducing task-specific improvements while preserving critical structural components; and an instance-free Elo-based LLM-as-a-Judge evaluation mechanism rapidly establishing global rankings. Extensive evaluations across diverse expensive optimization tasks confirm AutoSG significantly outperforms human-designed state-of-the-art frameworks and existing LLM-generated solvers.
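
The abstract names an Elo-based LLM-as-a-Judge ranking but gives no details of it. As a rough illustration only, the sketch below applies the standard Elo update to pairwise verdicts between candidate solvers. The K-factor of 32, the initial rating of 1000, the solver names, and the hard-coded verdicts are all assumptions made for this example; in AutoSG the verdicts would come from an LLM judge, which is not modeled here.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_rank(solvers, verdicts, k=32.0, init=1000.0):
    """Rank solvers from (winner, loser) verdicts using sequential Elo updates.

    k and init are conventional defaults, not values taken from the paper.
    """
    ratings = {s: init for s in solvers}
    for winner, loser in verdicts:
        e_w = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - e_w)  # the winner gains exactly what the loser loses
        ratings[winner] += delta
        ratings[loser] -= delta
    ranking = sorted(ratings, key=ratings.get, reverse=True)
    return ranking, ratings


# Hypothetical candidates and judge verdicts (stand-ins for LLM judgments).
solvers = ["solver_a", "solver_b", "solver_c"]
verdicts = [("solver_a", "solver_b"), ("solver_a", "solver_c"), ("solver_b", "solver_c")]
ranking, ratings = elo_rank(solvers, verdicts)
print(ranking)
```

Because each update only needs a pairwise comparison, a global ranking can be built without executing the solvers on training instances, which is the property the abstract emphasizes.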


2605.25632 2026-05-26 cs.AI cs.LG q-fin.RM 版本更新

Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

为每个行动投保：自主AI代理运行时精算控制的权威边界框架

Hao-Hsuan Chen

Affiliations: Department of Risk Management and Insurance

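AI summary: Proposes the Actuarial Action Interface (AAI) and the Authority Frontier framework, which price, gate, and evaluate the side-effect-bearing actions of autonomous AI agents through a deterministic runtime contract, enabling cross-domain actuarial control and benchmarking.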

Comments 35 pages, 4 figures, 11 tables. Companion paper on the mathematical foundations: SSRN 6761960

AI中文摘要

自主AI代理越来越多地产生带有副作用的行动：数据库变更、退款、支付、外部承诺。我们提出精算行动接口（AAI），这是一个确定性的运行时合约，它在时间一致的风险映射下，对每个此类行动按照合约固定的安全默认值进行定价，并根据每个边界的储备资本预算门控执行。然后我们开发了权威边界，这是一种评估原语，用于衡量运行时在每个储备资本水平下释放的自主权威量。该框架提供：(i) 一个确定性的报价-绑定-提交协议，带有通行费限制的能力令牌；(ii) 一个通用的七类行动分类法，将异构工具调用映射到可比较的权威单位；(iii) 在alpha支出下的重放确定性和逐路径储备覆盖；(iv) 通过全储备需求C_full和资本指标Capital@k进行跨域归一化。我们在四个代理环境（数据库变更、客服退款以及公共tau-bench零售和航空工具使用轨迹）中实例化AAI，并报告一个实时Postgres面板，其中三个Azure托管的模型通过同一合约提出行动。边界在跨域中表现出常见的低储备拒绝和中间释放模式，仅在预算网格达到全储备需求时饱和；所需储备资本变化达22倍（Capital@50从289到6457）。该框架不强制域采用相同形状；它揭示每个域的精算几何。在实时面板中，合约在低预算下防止了所有三个模型的实现损失，但在拒绝下的承保持续性方面有所不同：模型身份是一个精算承保变量。贡献是一个用于自主代理副作用运行时精算控制的基准就绪评估框架。

英文摘要

Autonomous AI agents increasingly issue side-effect-bearing actions: database mutations, refunds, payments, external commitments. We propose the Actuarial Action Interface (AAI), a deterministic runtime contract that prices each such action against a contractually fixed safe default under a time-consistent risk mapping, and gates execution against a per-boundary reserve capital budget. We then develop the Authority Frontier, an evaluation primitive measuring how much autonomous authority the runtime releases at each level of reserve capital. The framework provides (i) a deterministic quote-bind-commit protocol with toll-bounded capability tokens; (ii) a universal seven-class action taxonomy mapping heterogeneous tool calls to comparable authority units; (iii) replay determinism and pathwise reserve coverage under alpha-spending; (iv) cross-domain normalization via full reserve demand C_full and capital metrics Capital@k. We instantiate AAI across four agentic environments (database mutation, customer-service refund, and the public tau-bench retail and airline tool-use traces) and report a live Postgres panel in which three Azure-hosted models propose actions through the same contract. The frontier exhibits a common low-reserve refusal and intermediate-release pattern across domains, with saturation only where the budget grid reaches full reserve demand; required reserve capital varies by 22x (Capital@50 from 289 to 6457). The framework does not force domains into the same shape; it surfaces each domain's actuarial geometry. In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable. The contribution is a benchmark-ready evaluation framework for runtime actuarial control of autonomous-agent side effects.


2605.25620 2026-05-26 cs.AI 版本更新

Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

回归简约潜在变量：从视觉基础学习以任务为中心的世界模型

Minghao Fu, Fan Feng, Nicklas Hansen, Biwei Huang

Affiliations: University of California, San Diego

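AI summary: Proposes the TC-WM framework, which linearly projects pretrained visual embeddings into a compact latent state, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs the embeddings, turning foundation-model features into task-sufficient world representations with better world-modeling quality and control precision.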

AI中文摘要

世界模型使智能体能够根据动作预测未来动态，因此潜在表示的选择对于规划和控制至关重要。这种表示通常要么直接从像素中学习，但语义结构有限；要么继承自冻结的视觉基础模型，但包含过多与任务无关的细节，导致状态空间与下游规划和控制不匹配。这在无奖励的离线设置中尤其具有挑战性，因为模型必须从固定轨迹中学习，没有奖励监督或在线交互。为了解决这个问题，我们提出了TC-WM，一个将基础模型嵌入转化为紧凑、任务充分的世界表示的框架。关键设计是将预训练嵌入空间视为语义支架而非最终状态空间：TC-WM将高维视觉嵌入线性投影到紧凑潜在变量作为动态空间，通过对比学习将子空间与智能体的物理状态对齐，并重建嵌入以保留有用的视觉结构。这结合了基础特征的通用性和以任务为中心的动态的可控性。理论上，我们证明TC-WM足以识别潜在的任务中心潜在因子，只需简单变换。实验上，TC-WM能够在多种环境（如Robomimic和D4RL）中实现测试时规划，其世界建模质量和控制精度均优于现有方法。

英文摘要

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the underlying task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.


2605.25612 2026-05-26 cs.LG cs.AI 版本更新

Towards the Connection between Activation Sparsity and Flat Minima

激活稀疏性与平坦极小值之间的联系

Ze Peng, Jian Zhang, Lei Qi, Yang Gao, Yinghuan Shi

Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Institute of Brain-Machine Interface, Nanjing University; School of Computer Science and Engineering, Southeast University

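AI summary: The paper finds that the flatness of the loss landscape is closely related to MLP activation sparsity in Transformers and, building on a theoretical derivation, proposes three practical methods that increase sparsity, indicating potential cost reductions in both inference and training.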

AI中文摘要

标准训练的Transformer的MLP块中出现的激活稀疏性为在不牺牲性能的情况下大幅降低计算成本提供了机会。为了从理论上解释这一现象，现有工作表明激活稀疏性并非源于数据属性或数据拟合，而是来自训练过程的隐式偏差。然而，这些联系是在强假设下得到的，无法应用于标准训练的大步数深度模型。与这些工作不同，我们发现损失景观的平坦性也与MLP激活稀疏性密切相关，并且可以作为标准深度网络的一个更弱且自然出现的假设。具体来说，我们发现：1) MLP激活稀疏性等于“增强平坦性”（平坦性度量的加权和）与输入范数和MLP激活梯度乘积的比值。我们经验性地发现该比值在训练过程中下降，导致稀疏激活。2) 我们还提出了导数稀疏性的概念，在ReLU下它退化为激活稀疏性，但进一步支持反向传播中的剪枝，并且比激活稀疏性更稳定。基于理论发现，我们通过三种方法减小分子和增大分母来进一步鼓励激活稀疏性。这些即插即用的修改可以有效降低比值并产生更稀疏的激活。在ImageNet-1K和C4上的实验表明，与原始Transformer相比，推理稀疏性至少提高36%，训练稀疏性至少提高50%，表明在推理和训练中进一步降低成本的潜力。

英文摘要

The observation that activation sparsity emerges in MLP blocks of standardly trained Transformers offers an opportunity to drastically reduce computation costs without sacrificing performance. To theoretically explain this phenomenon, existing works have shown that activation sparsity does not result from the data properties or data fitting but from the implicit bias of the training process. However, these connections are obtained with strong assumptions, which cannot be applied to deep models standardly trained with a large number of steps. Different from these works, we find that the flatness of loss landscapes is also closely related to the MLP activation sparsity and can serve as a weaker and naturally emerging assumption for standard deep networks. Specifically, we find that 1) the MLP activation sparsity equals a ratio between "augmented flatness" (a weighted sum of flatness measures) and the product of the input norm and activation gradient of the MLP. We empirically find that this ratio decreases during training, leading to sparse activations. 2) We also propose the notion of derivative sparsity, which reduces to activation sparsity under ReLU, but further enables pruning in the backward propagation and is more stable than activation sparsity. With the theoretical findings, we can further encourage activation sparsity by decreasing the numerator and increasing the denominator of the ratio using three methods. These plug-and-play modifications can effectively reduce the ratio and produce sparser activations. Experiments on ImageNet-1K and C4 demonstrate relative improvements of at least 36% on inference sparsity and at least 50% on training sparsity over vanilla Transformers, indicating further potential cost reduction in both inference and training.
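
To make the remark that derivative sparsity "reduces to activation sparsity under ReLU" concrete, the sketch below measures both quantities on one randomly initialized hidden layer. It assumes sparsity is counted as the fraction of exactly-zero entries, which may differ from the paper's precise definition; the layer sizes and random seed are arbitrary choices for illustration.

```python
import random


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def relu_grad(x: float) -> float:
    """Derivative of ReLU with respect to its input (taken as 0 at x <= 0)."""
    return 1.0 if x > 0.0 else 0.0


def sparsity(values) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


rng = random.Random(0)
d_in, d_hidden = 8, 32  # arbitrary sizes for the example
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_hidden)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

pre = [sum(w * xi for w, xi in zip(row, x)) for row in W]  # pre-activations
act = [relu(p) for p in pre]        # hidden activations
grad = [relu_grad(p) for p in pre]  # activation derivatives

activation_sparsity = sparsity(act)
derivative_sparsity = sparsity(grad)
print(activation_sparsity, derivative_sparsity)
```

Under ReLU a unit's output and its derivative are zero on exactly the same inputs, so the two fractions coincide; with a smooth activation such as GELU they generally would not.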


2605.25603 2026-05-26 cs.AI 版本更新

Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition

Yunqing Liu, Yi Zhou, Wenqi Fan

Affiliations: The Hong Kong Polytechnic University

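AI summary: Proposes GO-Flow, which decomposes generation into three physically motivated subspaces (translation, rotation, and conformation) and uses optimal transport and geodesic flows on manifolds to address existing methods' neglect of the hierarchical geometry of molecules, yielding high-quality and efficient molecular conformation generation.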

AI中文摘要

生成准确的3D分子构象是计算化学和药物发现中的关键挑战。最近，扩散和流匹配模型取得了显著成功。然而，它们的数学公式与分子的物理现实之间存在严重的不匹配。现有方法主要将分子视为笛卡尔空间中的无结构点云，忽略了键长和键角相对刚性而扭转角构成主要柔性自由度的内在层次力学。这种对流形的不感知迫使模型从头重新学习基本几何约束，常常导致物理上不可信的中间结构。为了解决这个问题，我们提出了GO-Flow，通过流形分解将生成建模与分子几何对齐。GO-Flow不是强制在欧几里得空间中运动，而是将生成过程分解为三个物理驱动的子空间：具有线性最优输运的平移空间、$SO(3)$上具有测地流的旋转空间以及具有熵最优输运的构象空间。这种分解注入了几何归纳偏置，使生成路径更好地与分子自由度对齐。当与等变神经架构结合时，它鼓励旋转一致的生成并提高几何有效性。在GEOM-Drugs和GEOM-QM9上的大量实验表明，GO-Flow实现了最先进的生成质量。值得注意的是，通过在正确的流形上自然地学习更直的概率路径，我们的方法能够在仅50步的情况下实现高保真采样，有效弥合了结构精度与计算效率之间的差距。

英文摘要

The generation of accurate 3D molecular conformations is a pivotal challenge in computational chemistry and drug discovery. Recently, diffusion and flow matching models have achieved remarkable success. However, there is a critical misalignment between their mathematical formulation and the physical reality of molecules. Existing approaches predominantly treat molecules as unstructured point clouds in Cartesian space, overlooking the intrinsic hierarchical mechanics where bond lengths and bond angles are relatively stiff, whereas torsion angles constitute the dominant flexible degrees of freedom. This lack of manifold awareness forces models to relearn fundamental geometric constraints from scratch, often leading to physically implausible intermediate structures. To address this, we propose GO-Flow that aligns generative modeling with molecular geometry via manifold decomposition. Instead of forcing motion through Euclidean space, GO-Flow decomposes the generation process into three physically motivated subspaces: translation space with linear optimal transport, rotation space with geodesic flows on $SO(3)$, and conformation space with entropic optimal transport. This decomposition injects geometric inductive biases and makes the generative paths better aligned with molecular degrees of freedom. When combined with equivariant neural architectures, it encourages rotation-consistent generation and improves geometric validity. Extensive experiments on GEOM-Drugs and GEOM-QM9 demonstrate that GO-Flow achieves state-of-the-art generation quality. Notably, by learning straighter probability paths on the correct manifolds naturally, our method enables high-fidelity sampling with as few as 50 steps, effectively bridging the gap between structural precision and computational efficiency.


2605.25574 2026-05-26 cs.CV cs.AI 版本更新

Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

Mosaic: 通过向量场混合的组合式多概念擦除

Junseok Ko, Jungwoo Kim, Jong-Seok Lee

Affiliations: Department of Artificial Intelligence, Yonsei University; School of Integrated Technology, Yonsei University

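AI summary: For the task of erasing multiple target concepts simultaneously in flow-based text-to-image models, proposes the Mosaic framework, which dynamically constructs concept-specific masks and selectively blends vector fields to remove multiple concepts from complex scenes without additional optimization.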

AI中文摘要

概念擦除已成为确保文本到图像（T2I）模型安全与伦理图像合成的关键研究方向。现有研究虽探索了多概念擦除，但通常假设每张图像仅有一个目标概念，这一限制被现代基于流的T2I模型日益暴露，此类模型可同时生成包含多个概念的复杂场景。为弥补这一空白，我们引入组合式多概念擦除这一新任务，旨在同时移除单个场景中的多个目标概念。我们提出CoME-Bench，一个用于评估组合式多概念擦除的基准，涵盖类别内和跨类别场景。我们进一步提出Mosaic，一个用于基于流的T2I模型中多概念擦除的新框架，该框架通过动态构建概念特定掩码并选择性混合它们，利用向量场中目标概念的空间局部性，无需额外优化。大量实验表明，Mosaic能有效移除复杂组合场景中的多个目标概念，同时保留非目标上下文。

英文摘要

Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.


2605.25572 2026-05-26 cs.CL cs.AI 版本更新

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

PennySynth：基于RAG的数据合成用于自动量子代码生成

Minghao Shao, Nouhaila Innan, Hariharan Janardhanan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

Affiliations: eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD); Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute; Department of Computer Science and Engineering, NYU Tandon School of Engineering

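AI summary: Proposes the PennySynth framework, which combines retrieval-augmented generation with code-aware embeddings over a dataset of 13,389 PennyLane instruction-code pairs and achieves 52% to 68% pass@5 on QHack competition challenges, improving over Claude Sonnet 4.6 without retrieval by 25 to 28 percentage points.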

Comments 11 pages, 3 figures

AI中文摘要

量子编程框架日益增长的复杂性暴露了现有基于大语言模型（LLM）的代码助手的一个关键局限性：通用模型在面对专门的量子编码挑战时，会幻觉出PennyLane特定的门名称、错误放置设备配置并生成结构无效的电路。我们提出PennySynth，一个检索增强生成框架，通过将LLM推理条件化为一个包含13,389个PennyLane指令-代码对的精选知识库来解决这一差距，该知识库通过一个三阶段（提取、验证和去重）流程从官方PennyLane仓库、社区GitHub源和QHack竞赛档案中构建。PennySynth引入了一种使用st-codesearch-distilroberta-base的代码感知嵌入策略，该策略针对自然语言到代码的检索进行训练，将平均检索余弦相似度从通用基线的0.45提高到0.726。在涵盖QHack竞赛三年（2022、2023、2024）的74个挑战上进行评估，PennySynth在QHack 2022、2023和2024上分别达到64%、68%和52%的pass@5，相比无检索的Claude Sonnet 4.6提高了+28、+25和+28个百分点。我们进一步引入了一个量子适应的CodeBLEU指标，该指标对qml.*令牌模式进行加权，并表明结构代码相似性和功能正确性捕捉了量子代码质量的不同方面。受控消融实验揭示，代码感知嵌入是检索性能的主要驱动因素，而当检索质量足够精确时，数据集扩展和源组合提供了额外的增益。

英文摘要

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits when faced with specialized quantum coding challenges. We present PennySynth, a retrieval-augmented generation framework that addresses this gap by conditioning LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs, built via a three-stage extraction, verification, and deduplication pipeline over official PennyLane repositories, community GitHub sources, and QHack competition archives. PennySynth introduces a code-aware embedding strategy using st-codesearch-distilroberta-base, trained for natural-language-to-code retrieval, increasing average retrieval cosine similarity from 0.45 to 0.726 compared to a general-purpose baseline. Evaluated across 74 challenges spanning three years of the QHack competition (2022, 2023, 2024), PennySynth achieves 64%, 68%, and 52% pass@5 on QHack 2022, 2023, and 2024, respectively, improving over Claude Sonnet 4.6 without retrieval by +28, +25, and +28 percentage points. We further introduce a quantum-adapted CodeBLEU metric that upweights qml.* token patterns and show that structural code similarity and functional correctness capture distinct aspects of quantum code quality. Controlled ablations reveal that code-aware embeddings are the primary driver of retrieval performance, while dataset expansion and source composition provide additional gains when retrieval quality is sufficiently precise.
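
The results above are reported as pass@5. The abstract does not say how the metric was computed, so the sketch below shows one common choice: the unbiased pass@k estimator widely used in code-generation evaluation, where n samples are drawn per problem and c of them pass the tests. Whether PennySynth uses this estimator or simply draws five samples is an assumption left open here, and the sample counts in the usage lines are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts (n samples drawn, c passing), not paper data.
results = [(10, 1), (10, 0), (10, 10)]
print(mean_pass_at_k(results, k=5))
```

With n equal to k the estimator reduces to checking whether any of the k samples passed, which is the simplest reading of pass@5.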


2605.25566 2026-05-26 cs.AI 版本更新

Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

基于大语言模型的不确定性推理用于可解释疾病诊断

Xiaoyang Fan, Yufan Cai, Zhe Hou, Jin Song Dong

Affiliations: National University of Singapore; Griffith University

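AI summary: Proposes a neuro-symbolic reasoning framework that combines large language models with fuzzy logic and declarative rules to produce explainable and formally verifiable medical diagnoses.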

AI中文摘要

临床决策需要对不完整、不精确且以语言表达的患者叙述进行推理。虽然大语言模型（LLMs）擅长从自然语言中提取潜在信息，但它们缺乏可信赖医疗AI所必需的可验证性和可解释性。我们提出一种神经符号推理框架，将LLMs与形式逻辑对齐，以实现可解释且形式可验证的医学诊断。患者描述和临床指南被嵌入神经知识库，其中LLMs提取结构化医疗实体、时间关系和模糊症状模式，这些被解码为用模糊逻辑和声明式规则表达的符号知识库。我们执行两阶段推理：（1）归纳符号泛化，从编码叙述中捕获诊断模式；（2）通过逻辑编程引擎进行推理验证，推导并验证符合临床标准的诊断。每个症状被视为具有概率权重的模糊谓词，推理路径可审计、可调整，并与医生反馈兼容。与纯统计方法不同，我们的系统支持迭代优化：LLM生成的诊断与真实情况之间的偏差可以通过形式规则追踪、解释和纠正。通过结合基于逻辑的透明性、LLM的适应性和概率鲁棒性，该框架实现了与人类一致的医疗推理，具有强泛化能力和可验证的逐步推理链。我们在公开基准上验证了该框架，展示了符号推理与LLM在真实临床叙述中的有效协调。结果显示，性能与最先进的LLM相当，同时额外提供了可解释的推理路径和形式可验证的诊断结论。

英文摘要

Clinical decision-making requires reasoning over incomplete, imprecise, and linguistically expressed patient narratives. While large language models (LLMs) excel at extracting latent information from natural language, they lack the verifiability and interpretability essential for trustworthy medical AI. We propose a neuro-symbolic reasoning framework that aligns LLMs with formal logic to enable explainable and formally verifiable medical diagnosis. Patient descriptions and clinical guidelines are embedded into a neural knowledge base, where LLMs extract structured medical entities, temporal relations, and fuzzy symptom patterns, which are decoded into a symbolic knowledge base expressed in fuzzy logic and declarative rules. We perform two-stage reasoning: (1) inductive symbolic generalization to capture diagnostic patterns from encoded narratives, and (2) inference verification via a logic programming engine to derive and validate diagnoses consistent with clinical standards. Each symptom is treated as a fuzzy predicate with probabilistic weights, and inference paths are auditable, adjustable, and compatible with physician feedback. Unlike purely statistical methods, our system supports iterative refinement: misalignment between LLM-generated diagnoses and ground truth can be traced, explained, and corrected through formal rules. By combining logic-based transparency, LLM adaptability, and probabilistic robustness, the framework enables human-aligned healthcare inference with strong generalization and verifiable, step-by-step reasoning chains. We validate our framework on public benchmarks, demonstrating effective reconciliation of symbolic reasoning and LLMs with real-world clinical narratives. Results show performance comparable to state-of-the-art LLMs, while additionally providing interpretable reasoning paths and formally verifiable diagnostic conclusions.


2605.25558 2026-05-26 cs.AI 版本更新

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

超越查询记忆化：基于查询分解和历史匹配的大语言模型路由

Bo Lv, Jingbo Sun

Affiliations: Tencent Hunyuan; University of Chinese Academy of Sciences

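AI summary: Proposes the DecoR routing framework, which avoids the memorization trap through query capability decomposition and matching against historical logs, lowering inference cost while maintaining high accuracy.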

AI中文摘要

优化预测性能与计算成本之间的权衡是大语言模型（LLM）部署中的核心关注点。当前的路由方法主要依赖于基于表面特征的查询到模型的直接映射，使其容易陷入记忆化陷阱，并导致在分布外（OOD）数据上的泛化能力差。在本文中，我们提出DecoR，一种新颖的路由框架，将路由任务重新定义为从历史日志中筛选相似查询的匹配过程，有效缓解了记忆化陷阱。为了提高匹配准确性，我们引入了一种查询能力分解方法，将语言表面形式与任务内在需求解耦，将匹配导向能力维度，从而将决策基于基本任务属性。此外，我们开发了CodaSet，一个用于评估路由泛化能力的综合基准，实验结果表明，DecoR在分布内和OOD设置下均保持优越的准确性，同时大幅降低推理成本。所有代码和数据可在https://github.com/lvbotenbest/DecoR获取。

英文摘要

Optimizing the trade-off between predictive performance and computational cost is a central focus in the deployment of Large Language Models (LLMs). Current routing methods primarily rely on direct mapping from queries to models based on surface-level features, making them susceptible to the memorization trap and leading to poor generalizability on out-of-distribution (OOD) data. In this paper, we propose DecoR, a novel routing framework that recasts the routing task as a matching process of sifting similar queries from historical logs, effectively mitigating the memorization trap. To enhance matching accuracy, we introduce a query capability deconstruction method that decouples linguistic surface forms from task-intrinsic requirements, directing matching toward capability dimensions to ground decisions in essential task attributes. Furthermore, we develop CodaSet, a comprehensive benchmark for assessing routing generalization, where experimental results demonstrate that DecoR maintains superior accuracy while substantially lowering inference costs across both in-distribution and OOD settings. All code and data are available at https://github.com/lvbotenbest/DecoR.
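
The idea of routing by matching a query against historical logs along capability dimensions can be sketched as a nearest-neighbour lookup. Everything in the example below is hypothetical and is not taken from the DecoR repository: the three-dimensional capability vectors, the log format, the model names and costs, and the rule of picking the cheapest model that solved at least half of the matched queries.

```python
from math import sqrt


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def route(query_caps, history, costs, k=2, min_success=0.5):
    """Pick the cheapest model that solved enough of the k most similar logged queries."""
    neighbours = sorted(
        history, key=lambda h: cosine(query_caps, h["caps"]), reverse=True
    )[:k]
    for model in sorted(costs, key=costs.get):  # cheapest model first
        rate = sum(h["solved"][model] for h in neighbours) / len(neighbours)
        if rate >= min_success:
            return model
    return max(costs, key=costs.get)  # fall back to the most expensive model


# Hypothetical log: capability vectors plus which model solved each past query.
history = [
    {"caps": [1.0, 0.0, 0.0], "solved": {"small": True, "large": True}},
    {"caps": [0.9, 0.1, 0.0], "solved": {"small": True, "large": True}},
    {"caps": [0.0, 1.0, 1.0], "solved": {"small": False, "large": True}},
    {"caps": [0.1, 0.9, 1.0], "solved": {"small": False, "large": True}},
]
costs = {"small": 1.0, "large": 10.0}

easy = route([1.0, 0.05, 0.0], history, costs)
hard = route([0.0, 1.0, 0.9], history, costs)
print(easy, hard)
```

Matching on capability vectors rather than on surface text is what lets two differently worded queries with the same underlying requirements share a routing decision.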


2605.25554 2026-05-26 cs.AI 版本更新

PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting

PHGNet: 原型引导的超图构建用于异质时空预测

Ruiwen Gu, Yahao Liu, Zhenyu Liu, Qitai Tan, Xiao-Ping Zhang

Affiliations: Shenzhen Ubiquitous Data Enabling Key Lab; Shenzhen International Graduate School, Tsinghua University; School of Computer Science and Engineering; University of Electronic Science and Technology of China

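AI summary: Proposes PHGNet, a spatiotemporal forecasting framework based on prototype-guided hypergraph construction. A prototype learning mechanism adaptively assigns pattern-similar nodes to hyperedges to capture high-order interactions, and a global-local node representation module, iterative residual refinement, and Temporal Query Attention improve forecasting accuracy.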

AI中文摘要

作为智能交通系统的核心任务，交通预测在城市交通管理中起着关键作用。准确的交通预测依赖于对复杂时空依赖关系的建模，而由于交通系统中的空间异质性，这本身就具有挑战性。尽管取得了显著进展，大多数现有方法仍局限于成对空间依赖建模，难以捕获具有相似交通模式的节点之间的动态高阶交互。为了解决这个问题，我们提出了PHGNet，一种基于原型引导超图构建的新型时空预测框架。在PHGNet的核心，设计了一种原型学习机制，自适应地将模式相似的节点分配到超边，从而捕获具有时变结构的高阶交互。为了提高动态超图构建的可靠性，我们进一步开发了一个全局-局部节点表示模块来提取时间一致的特征。对于预测，引入了迭代残差细化和时间查询注意力机制，以提高预测精度并支持高效的并行解码。在多个真实世界数据集上的大量实验表明，与最先进的方法相比，PHGNet实现了优越的预测性能。

英文摘要

As a core task in intelligent transportation systems, traffic forecasting plays a critical role in urban traffic management. Accurate traffic forecasting relies on modeling complex spatiotemporal dependencies, which is inherently challenging due to spatial heterogeneity in traffic systems. Despite significant progress, most existing methods are still limited to pairwise spatial dependency modeling, making it difficult to capture dynamic high-order interactions among nodes with similar traffic patterns. To address this issue, we propose PHGNet, a novel spatiotemporal forecasting framework based on prototype-guided hypergraph construction. At the core of PHGNet, a prototype learning mechanism is designed to adaptively assign pattern-similar nodes to hyperedges, thereby capturing high-order interactions with time-varying structures. To improve the reliability of dynamic hypergraph construction, we further develop a global-local node representation module to extract time-consistent features. For forecasting, iterative residual refinement and Temporal Query Attention are introduced to improve forecasting accuracy while supporting efficient parallel decoding. Extensive experiments on multiple real-world datasets demonstrate that PHGNet achieves superior predictive performance compared with state-of-the-art methods.

2605.25549 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Personalize, Then Store: Benchmarking and Learning Personalized Memory for Long-Horizon Agents

Yeonjun In, Wonjoong Kim, Sangwu Park, Kanghoon Yoon, Chanyoung Park

发表机构 * KAIST（韩国科学技术院）

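AI summary: Existing LLM-based memory systems apply universal, static policies that ignore how the context worth storing differs across users. The paper proposes PerMemBench, the first benchmark for personalized memory, and a session-level storage gating framework, showing that personalization substantially improves retention under perfect gating while accurate gating remains a key challenge.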

Comments preprint

详情

AI中文摘要

现有的基于大语言模型（LLM）的记忆系统采用通用、静态的策略，忽略了一个基本现实：不同用户值得存储在记忆中的上下文是不同的。这种错位将有限的记忆预算浪费在短暂交互上，同时未能为长时程任务保留关键上下文。为解决这一差距，我们研究了一个未被充分探索的问题：基于LLM的记忆系统能否学习个性化的记忆策略？我们引入了PerMemBench，这是首个用于评估个性化记忆系统的基准，具有跨多年、多领域、多样化用户角色的交互历史。我们进一步提出了记忆个性化的首个实证研究，提出了会话级存储门控，这是一个轻量级框架，可选择性地绕过短暂会话的记忆操作。我们的研究证实，在完美门控下，个性化能带来显著的保留增益，但同时也揭示出精确门控仍然是一个开放且关键的挑战。

英文摘要

Existing large language model (LLM)-based memory systems apply universal, static policies that overlook a fundamental reality: the contexts that are worth storing in memory are different across users. This misalignment wastes limited memory budget on transient interactions while failing to preserve critical context for long-horizon tasks. To address this gap, we investigate an underexplored question: can LLM-based memory systems learn personalized memory policies? We introduce PerMemBench, the first benchmark for evaluating personalized memory systems, featuring multi-year, multi-domain interaction histories across diverse user personas. We further present the first empirical study of memory personalization, proposing session-level storage gating, a lightweight framework that selectively bypasses memory operations for transient sessions. Our study confirms that personalization yields substantial retention gains under perfect gating, yet reveals that accurate gating remains an open and critical challenge.
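
A minimal sketch of what session-level storage gating could look like is given below. The gate here is a hypothetical word-overlap heuristic against a per-user interest profile; the names `should_store` and `update_memory`, the threshold, and the sample sessions are assumptions, and the paper's actual gate is not specified in the abstract.

```python
def should_store(session_text, user_interests, threshold=0.2):
    """Hypothetical gate: keep a session when enough of its words
    overlap with the user's interest profile."""
    words = set(session_text.lower().split())
    if not words:
        return False
    overlap = len(words & user_interests) / len(words)
    return overlap >= threshold

def update_memory(memory, sessions, user_interests):
    """Run the memory write only for sessions the gate lets through."""
    for session in sessions:
        if should_store(session, user_interests):
            memory.append(session)
    return memory

# Toy example (hypothetical user profile and sessions).
interests = {"marathon", "training", "nutrition"}
sessions = ["what is the weather today",
            "marathon training plan for spring"]
memory = update_memory([], sessions, interests)
```

The transient weather query is skipped and only the training session is written, which is the budget-saving behavior the abstract describes.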

2605.25534 2026-05-26 cs.AI 版本更新

StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs

StructBreak: 多模态大语言模型中结构性认知过载引发的安全故障

Yang Luo, Xinran Liu, Tiantian Ji, Zhiyi Yin, Lingyun Peng, Shuyu Li

发表机构 * Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications（可信分布式计算与服务重点实验室（MoE），北京邮电大学）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）

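AI summary: Proposes the StructBreak framework, which quantifies Structural Cognitive Overload (SCO) and exposes a higher-order cognitive overload attack paradigm, reaching a 92% average attack success rate on six leading MLLMs and showing that the attack bypasses safety filters through a structural channel.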

Comments 23 pages; accepted to Findings of ACL 2026. This paper contains examples of harmful content

详情

AI中文摘要

多模态大语言模型（MLLM）在结构推理方面表现出色，但在结构一致性方面存在明显的逻辑脆弱性。我们将这种现象称为结构性认知过载（SCO），它是深度推理与安全对齐之间竞争产生的副产品。然而，先前的工作主要针对排版和像素级扰动，对SCO的研究尚不充分。为此，我们提出了StructBreak，一个自动化的端到端框架，旨在量化SCO。通过利用StructBreak，我们发现了一种新颖的高阶认知过载攻击范式；值得注意的是，这种攻击在实用的黑盒设置下运行，无需内部模型访问。因此，我们利用该框架建立了一个涵盖十种不同威胁场景的综合基准。对六种领先MLLM的实证评估表明，SCO容易触发有毒内容生成，平均攻击成功率（ASR）达到92%（在Gemini 2.5上高达97%）。为了阐明SCO的机制，我们进一步进行了模型级解释，涵盖注意力动态、潜在空间拓扑和几何分析。我们的发现表明，StructBreak作为一种新颖的结构性通道来绕过安全过滤器。此外，固有安全机制的有效性有限，凸显了当前的对齐范式不足以应对复杂多模态推理的时代。

英文摘要

Multimodal Large Language Models (MLLMs) excel at structural reasoning yet suffer from a sharp logical brittleness in structural consistency. We term this phenomenon Structural Cognitive Overload (SCO), a byproduct of the contention between deep reasoning and safety alignment. However, prior work has predominantly targeted typographic and pixel-level perturbations, leaving the study of SCO largely unexplored. To this end, we propose StructBreak, an automated end-to-end framework designed to quantify SCO. By leveraging StructBreak, we uncover a novel higher-order cognitive overload attack paradigm; notably, this attack operates under a practical black-box setting, requiring no internal model access. Consequently, we utilize this framework to establish a comprehensive benchmark spanning ten diverse threat scenarios. Empirical evaluations on six leading MLLMs reveal that SCO readily triggers toxic generation, yielding a 92% average ASR (up to 97% on Gemini 2.5). To elucidate the mechanism of SCO, we further conduct model-level interpretations spanning attention dynamics, latent space topology, and geometric analysis. Our findings reveal that StructBreak acts as a novel structural channel to circumvent safety filters. Furthermore, the limited efficacy of inherent safety mechanisms underscores that current alignment paradigms are insufficient for the era of complex multimodal reasoning.

2605.25518 2026-05-26 cs.CV cs.AI 版本更新

Cross-Stage Attention Multi-Expert Network for Radiologist-Inspired Breast Ultrasound Diagnosis

受放射科医生启发的乳腺超声诊断的跨阶段注意力多专家网络

Xinyang Zhai, Chong Yang, Ruizhi Zhang

发表机构 * International Agency for Research on Cancer (IARC)（国际癌症研究机构）； World Health Organization（世界卫生组织）

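AI summary: Proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net), in which a Cross-Stage Attention module recalibrates multi-level features and a three-branch MoE block learns complementary features from the whole tumor image, tumor core, and boundary. On a balanced dataset it reaches 96.33% accuracy, 3.01 percentage points above the ResNet-18 baseline.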

详情

AI中文摘要

乳腺超声成像是一种重要的早期乳腺癌诊断无创方法，但由于肿瘤异质性、边界模糊和数据不平衡，自动良恶性分类仍具挑战。为了提高特征表示和分类准确性，本文提出了跨阶段注意力混合专家网络(CSA-MoE-Net)。它采用跨阶段注意力增强的ResNet-18作为骨干网络，其中跨阶段注意力模块自适应地重新校准多级特征，从而增强关键肿瘤特征并抑制冗余。一个三分支混合专家(MoE)块从全肿瘤图像、肿瘤核心和边界学习互补特征，自适应门控网络融合这些特征以捕获形态、纹理和上下文信息。融合后的特征在架构中称为融合专家特征(FEF)。在包含2,129张乳腺超声图像的平衡数据集上的实验表明，在20次独立运行的平均值下，该模型实现了96.33%的准确率、94.09%的精确率、98.53%的召回率、96.25%的F1分数和99.50%的AUC。与基线ResNet-18相比，这些指标分别提高了3.01、0.70、5.37、2.98和5.42个百分点。所提出的机制无需侵入性修改，可无缝嵌入VGG-16、DenseNet-121等网络，带来稳定的性能提升，从而为计算机辅助诊断提供可靠支持。

英文摘要

Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classification remains challenging due to tumor heterogeneity, blurred boundaries, and data imbalance. To improve feature representation and classification accuracy, this paper proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net). It adopts a Cross-Stage Attention-enhanced ResNet-18 as the backbone, in which the Cross-Stage Attention module adaptively recalibrates multi-level features, thereby enhancing key tumor features and suppressing redundancy. A three-branch Mixture of Experts (MoE) Block learns complementary features from the Whole Tumor Image, Tumor Core, and Boundary, and an Adaptive Gating Network fuses them to capture morphological, textural, and contextual information. The fused features are denoted as Fused Expert Feature (FEF) in the architecture. Experiments on a balanced dataset of 2,129 breast ultrasound images show that, averaged over 20 independent runs, the model achieves an accuracy of 96.33%, precision of 94.09%, recall of 98.53%, F1-score of 96.25%, and AUC of 99.50%. Compared to the baseline ResNet-18, these metrics improve by 3.01, 0.70, 5.37, 2.98, and 5.42 percentage points, respectively. The proposed mechanism requires no invasive modification and can be seamlessly embedded into VGG-16, DenseNet-121, etc., yielding stable performance gains, thus providing reliable support for computer-aided diagnosis.

2605.25517 2026-05-26 cs.AI 版本更新

What Gets Cited: Competitive GEO in AI Answer Engines

什么被引用：AI 问答引擎中的竞争性生成式引擎优化

Rahul Vishwakarma, Shushant Kumar, Ratnesh Jamidar

发表机构 * Sprinklr

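AI summary: Studies which factors determine which of two competing retrieved sources an AI answer engine cites first; controlled experiments find that topical relevance and list position are the main drivers.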

详情

DOI: 10.1145/3805712.3808445

AI中文摘要

AI 问答引擎从检索到的页面生成答案，但只引用少数来源。这使得可见性不仅取决于排名，还取决于被引用。我们研究竞争性生成式引擎优化（GEO）：当两个检索到的候选源竞争时，什么因素使得其中一个更可能被首先引用？我们构建了一个受控的两文档检索增强生成（RAG）测试平台，将恰好两个候选源注入模型上下文，并测量输出中第一个引用标记引用了哪个源。在六个 LLM 上，我们执行了 252,000 次试验，在 18 个内容因素的一个析因程序下进行重复配对比较。在每次试验中，两个源恰好在一个因素上不同；我们使用品牌匿名化和平衡源顺序来将内容效应与位置偏差分离。混合效应模型显示，主题相关性和列表位置是被首先引用的最大驱动因素。包含明确的价格信息和最近的时间戳也持续有帮助。完整性和信任线索带来较小的增益，而仅格式编辑几乎没有影响。我们发布了一个可重复的评估协议和一个优先化的 GEO 检查清单供从业者使用，并在 Sprinklr 的早期内部试点中进行了实践，团队报告了对工作流可用性的积极定性反馈。

英文摘要

AI answer engines generate answers from retrieved pages but cite only a few sources. This makes visibility depend not just on ranking, but on being cited. We study competitive Generative Engine Optimization (GEO): when two retrieved candidates compete, what makes one more likely to be cited first? We build a controlled two-document retrieval-augmented generation (RAG) testbed that injects exactly two candidate sources into the model context and measures which source is referenced by the first citation marker in the output. Across six LLMs we execute 252,000 trials, repeated paired comparisons under one factorial program over 18 content factors. In each trial the two sources differ in exactly one factor; we use brand anonymization and counterbalanced source order to separate content effects from position bias. Mixed-effects models show that topical relevance and list position are the biggest drivers of being cited first. Including explicit price information and a recent timestamp also helps consistently. Completeness and trust cues add smaller gains, while formatting-only edits have little impact. We release a reproducible evaluation protocol and a prioritized GEO checklist for practitioners, and we exercised it in an early internal pilot at Sprinklr, where teams reported positive qualitative feedback on workflow usability.

2605.25505 2026-05-26 cs.CY cs.AI econ.GN physics.soc-ph q-fin.EC 版本更新

Generative AI impacts on intra-urban inequality and skill premium in Beijing

生成式人工智能对北京城市内部不平等和技能溢价的影响

Xiliu He, Haoxiang Zhao, Mingyi Ma, Edward Wen Chuan Lai, Koei Enomoto, Anni Hu, Jiatong Li, Lingyun Chu, Yuan Lai

发表机构 * School of Architecture, Tsinghua University（清华大学建筑学院）； ZODA LAB（ZODA实验室）； Technology Innovation Center for Smart Human Settlements and Spatial Planning & Governance, Ministry of Natural Resources, Tsinghua University（智能人居环境与空间规划及治理技术创新中心，自然资源部，清华大学）

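AI summary: Using 5 million Beijing job postings from 2018 to 2024, the study aggregates task-level exposure assessments from five large language models into a neighborhood-level GenAI Exposure Index. It finds that exposure concentrates in the core districts and that high-exposure neighborhoods show wage stagnation and a "high-skill trap", challenging the theory of skill-biased technological change.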

Comments 21 pages, 8 figures

详情

AI中文摘要

生成式人工智能（GenAI）是首次大规模触及高认知任务的自动化浪潮，但其对城市内部不平等的影响仍基本未知。利用北京2018-2024年500万条招聘数据，我们通过汇总五个领先大语言模型的任务级评估，构建了社区级GenAI暴露指数。我们考察了这一冲击的空间、结构和因果机制。我们发现，GenAI暴露高度集中在城市核心区，加深了城市内部的人工智能鸿沟。自2023年以来，高暴露社区尽管继续吸引高技能工人，却经历了工资停滞——一种“高技能陷阱”。这种工资惩罚是由任务去技能化和劳动力市场拥挤加剧驱动的。以ChatGPT发布为中心的倍差法设计支持因果解释。这些发现挑战了流行的技能偏向技术变革理论，并为全球科技中心的包容性人工智能治理提供了基础。

英文摘要

Generative artificial intelligence (GenAI) is the first automation wave to reach high-cognitive tasks at scale, yet its effects on intra-urban inequality remain largely unknown. Using 5 million job postings from Beijing (2018-2024), we construct a neighborhood-level GenAI Exposure Index by aggregating task-level assessments from five leading large language models. We examine the spatial, structural and causal mechanisms of this shock. We find that GenAI exposure is highly concentrated in the city's core districts, deepening the intra-urban AI divide. Since 2023, high-exposure neighborhoods have experienced wage stagnation even as they continue to attract high-skilled workers, a "high-skill trap." This wage penalty is driven by task de-skilling and intensified labor-market crowding. A difference-in-differences design centered on ChatGPT's release supports a causal interpretation. These findings challenge the prevailing theory of skill-biased technological change and provide a basis for inclusive AI governance in global technology hubs.
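
For readers unfamiliar with difference-in-differences, the basic two-group, two-period estimate can be sketched as follows. The wage figures are invented toy values, not results from the paper, and the function name `did_estimate` is an assumption; the paper's actual specification likely includes controls and fixed effects that this sketch omits.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Basic 2x2 difference-in-differences: the change in the treated
    group minus the change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Toy wages (hypothetical): high-exposure neighborhoods stagnate
# while low-exposure neighborhoods keep growing.
effect = did_estimate(
    treated_pre=[10, 12], treated_post=[10, 12],
    control_pre=[8, 10], control_post=[10, 12],
)
```

A negative `effect` here means treated wages grew less than the control trend would predict, which is the shape of the "wage penalty" the abstract reports.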

2605.25502 2026-05-26 cs.CL cs.AI 版本更新

A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

面向教育方面情感分析的可控合成基准

Yehudit Aperstein, Alexander Apartsin

发表机构 * Intelligent Systems, Afeka Academic College of Engineering（阿法卡学术工程智能系统学院）； School of Computer Science, Faculty of Sciences, Holon Institute of Technology（霍洛技术学院计算机科学学院）

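AI summary: To address the scarcity of labeled data in education, proposes a controlled synthetic benchmark of 10,000 synthetic course reviews covering 20 pedagogical aspects, and evaluates task difficulty and synthetic-to-real transfer experimentally.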

Comments 39 pages, 14 figures

详情

AI中文摘要

教育方面情感分析（ABSA）可以支持课程改进，但带有方面标签的学生反馈仍然稀缺，因为教育评论是私有的、特定于机构的且标注成本高昂。本研究引入了一个面向教育ABSA的可控合成基准，该基准由10,000条合成课程评论构建，具有明确的训练-验证-测试划分，以及一个涵盖教学质量、评估与课程管理、学习需求、学习环境和参与度的20方面教学模式。该语料库通过采样的目标标签、采样的细微属性以及经过三轮评审-编辑流程优化的真实感提示生成。在该基准上，使用TF-IDF、两阶段变换器和联合编码器的局部基线表明该任务并非易事；最强的未调优模型BERT在留出集上的检测微F1得分为0.2760，而一个适度的低学习率BERT调度将其提升至0.2930。基于gpt-5.2的全测试GPT推理在零样本模式下达到0.2519微F1，在使用基于检索的少样本提示时达到0.2501，使批量推理高于经典基线并接近紧凑的联合编码器。在来自Herath等人的2,829条映射学生反馈评论上进行的保守外部评估中，BERT在9个方面重叠上的微F1得分为0.4593，表明部分合成到真实的迁移。真实性和忠实度分析作为生成器诊断报告，阐明了基准如何稳定以及标签噪声仍然存在的位置。因此，本研究贡献了一个合成教育ABSA语料库、一个文档化的生成过程以及一个可复现的基准设置，适用于公共标注数据仍然难以获得的领域。

英文摘要

Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarce because educational reviews are private, institution-specific, and expensive to annotate. This study introduces a controlled synthetic benchmark for educational ABSA built from 10,000 synthetic course reviews with explicit train-validation-test splits and a 20-aspect pedagogical schema spanning instructional quality, assessment and course management, learning demand, learning environment, and engagement. The corpus is generated with sampled target labels, sampled nuance attributes, and a realism-tuned prompt refined through a three-cycle judge-editor procedure. On the resulting benchmark, local baselines with TF-IDF, two-step transformers, and joint encoders show that the task is nontrivial; the strongest untuned model, BERT, reaches a held-out detection micro-F1 of 0.2760, while a modest lower-rate BERT schedule improves this to 0.2930. Full-test GPT-based inference with gpt-5.2 reaches 0.2519 micro-F1 in zero-shot mode and 0.2501 with retrieval-based few-shot prompting, placing batch inference above the classical baseline and close to the compact joint encoders. A conservative external evaluation on 2,829 mapped student-feedback reviews from Herath et al. yields a micro-F1 of 0.4593 for BERT on a 9-aspect overlap, indicating partial synthetic-to-real transfer. Realism and faithfulness analyses are reported as generator diagnostics that clarify how the benchmark was stabilized and where label noise remains. The study therefore contributes a synthetic educational ABSA corpus, a documented generation procedure, and a reproducible benchmark setting for a domain in which public labeled data remain difficult to obtain.
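
The detection scores above are micro-F1 values. As a reminder of how that metric is computed for multi-label aspect detection, here is a small sketch; the aspect names and predictions are made up for illustration, and the function name `micro_f1` is an assumption rather than anything from the paper's code.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-review sets of aspect labels:
    true/false positives and false negatives are pooled across
    all reviews before computing precision and recall."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # aspects correctly detected
        fp += len(p - g)   # aspects predicted but not present
        fn += len(g - p)   # aspects present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical aspect names).
gold = [{"workload", "clarity"}, {"feedback"}]
pred = [{"workload"}, {"feedback", "pace"}]
score = micro_f1(gold, pred)
```

Because counts are pooled before averaging, frequent aspects weigh more than rare ones, which matters when reading scores over a 20-aspect schema.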

2605.25489 2026-05-26 cs.AI cs.HC 版本更新

ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows

ATWL：一种用于表示、比较和重用可视化分析工作流的正式语言

Natalia Andrienko, Gennady Andrienko, Jürgen Bernard, Michael Sedlmair

发表机构 * Fraunhofer Institute IAIS（弗劳恩霍夫研究所IAIS）； Lamarr Institute for Machine Learning and Artificial Intelligence（拉马尔机器学习与人工智能研究所）； City St George’s, University of London（伦敦大学城市圣乔治学院）； University of Zurich（苏黎世大学）； University of Stuttgart（斯图加特大学）

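AI summary: Proposes the ATWL language, which formally represents visual analytics workflows through a modular ontology and standardized intents, and uses LLM-assisted workflow extraction to enable structural comparison and reuse.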

详情

AI中文摘要

可视化分析（VA）工作流本质上是复杂的，涉及数据转换、特征工程、视觉表示和人类解释。它们通常以非结构化的散文形式描述，阻碍了系统比较、成熟策略的重用以及新手的培训。我们提出了工件-转换工作流语言（ATWL），这是一种领域无关的声明式语言，通过捕获工作流的结构和潜在分析意图来形式化表示VA工作流。ATWL构建于一个由八种工件类型（实体、特征、排列、可视化、模式、模型、知识、规范）和以标准化意图（例如，定义单元、表征、情境化、抽象）为特征的转换组成的模块化本体之上。为了证明形式化工作不必阻碍采用，我们通过监督式LLM代理交互从研究论文中提取工作流，将人类角色简化为审查和细化。利用这一过程，我们从已发表的VA论文中构建了一个包含17个ATWL工作流的库。跨工作流分析揭示了结构规律性——一个反复出现的元结构、重复出现的主题、可重用的构建块、多样的迭代策略以及跨领域等价性——这些在散文中是不可见的。我们进一步通过一个受控实验评估了实际效用，在该实验中，同一个LLM处理了两个分析问题，提供的库要么是原始论文，要么是ATWL表示。两种形式都提供了有用的建议，但形式化表示系统地添加了显式迭代结构、类型化数据流、片段级适应来源以及支持扩展的紧凑性，超出了散文库在LLM上下文中的容量。ATWL使得从叙事描述向形式化表示、可比较和可重用的分析知识过渡成为可能。

英文摘要

Visual analytics (VA) workflows are inherently complex, involving data transformation, feature engineering, visual representation, and human interpretation. They are typically described in unstructured prose, hindering systematic comparison, reuse of proven strategies, and training of novices. We present Artifact-Transform Workflow Language (ATWL), a domain-agnostic, declarative language that formally represents VA workflows by capturing their structure and underlying analytical intent. ATWL is built upon a modular ontology of eight artifact types (entities, features, arrangements, visualisations, patterns, models, knowledge, specifications) and transforms characterised by standardised intents (e.g., define-unit, characterise, contextualise, abstract). To show that formalisation effort need not impede adoption, we extract workflows from research papers through supervised interaction with LLM agents, reducing the human role to review and refinement. Using this process, we constructed a library of seventeen ATWL workflows from published VA papers. Cross-workflow analysis reveals structural regularities (a recurrent meta-structure, recurring motifs, reusable building blocks, diverse iterative strategies, and cross-domain equivalences) that remain invisible in prose. We further evaluate practical utility through a controlled experiment in which the same LLM addressed two analytical problems with the library supplied either as original papers or as ATWL representations. Both forms enabled useful recommendations, but the formal representation systematically added explicit iteration structure, typed data flow, fragment-level adaptation provenance, and compactness supporting scaling beyond what prose libraries can fit in an LLM's context. ATWL enables a transition from narrative descriptions to formally represented, comparable, and reusable analytical knowledge.

2605.25488 2026-05-26 cs.CV cs.AI cs.MM 版本更新

Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

测试时自适应条件用于稳定音频驱动说话头生成

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

发表机构 * School of Business, University of New South Wales (UNSW)（新南威尔士大学商学院）； School of Engineering and Built Environment, Griffith University（格里菲斯大学工程与环境学院）

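AI summary: Proposes Test-Time Self-Adaptive Conditioning (TT-SAC), a framework that needs no parameter training and adjusts the conditioning representation through a feedback loop, improving identity preservation, temporal coherence, and perceptual quality of pretrained talking-head generators.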

Comments Research report

详情

AI中文摘要

音频驱动的说话头生成在AniTalker、FLOAT和Sonic等最新模型中取得了显著进展。尽管取得了成功，大多数现有方法在推理阶段依赖单一静态参考图像来调节整个视频生成过程。这种静态条件范式通常导致固定身份特征与动态面部运动之间的不匹配，从而引起身份漂移、时间不一致性和感知质量下降。我们引入了测试时自适应条件（TT-SAC），这是一个无需参数的推理框架，使预训练的说话头生成器能够在推理过程中调整其条件表示，而无需重新训练、梯度更新或额外监督。TT-SAC不是将参考肖像视为不可变的，而是将生成器与其编码器组合成一个反馈循环：生成器自身的输出被重新编码，以构建一个更符合合成序列时间动态的精细条件表示。单次自适应步骤近似于生成过程的自洽平衡，稳定了跨时间的身份和运动。我们进一步提供了理论分析，表明在温和的Lipschitz假设下，测试时条件自适应减少了特征方差并提高了生成稳定性，同时表现出原则性的偏差-方差权衡，该权衡决定了自适应最优强度。在最新说话头生成器和基准数据集上的大量实验表明，在唇形同步准确性、时间一致性、身份保持和感知保真度方面均有持续改进。TT-SAC提供了一种模型无关且无需训练的策略来增强生成视频模型，将测试时条件自适应确立为稳定音频驱动肖像动画的有效机制。

英文摘要

Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference stage. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that enables pretrained talking-head generators to adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator's own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion across time. We further provide theoretical analysis showing that test-time conditioning adaptation reduces feature variance and improves generative stability under mild Lipschitz assumptions, while exhibiting a principled bias-variance tradeoff that governs the optimal strength of adaptation. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic and training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.
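
The feedback loop described above can be sketched in a few lines. The abstract does not give the update rule, so the linear blend below, the `strength` parameter, and the toy `generate` and `encode` callables are all assumptions; the sketch only illustrates the general shape of re-encoding a generator's outputs to refine its conditioning in a single adaptation step.

```python
def refine_conditioning(ref_embedding, generate, encode, strength=0.5):
    """One adaptation step: generate frames from the reference
    conditioning, re-encode them, and blend the averaged result
    back into the conditioning vector. No gradients are involved."""
    frames = generate(ref_embedding)
    encoded = [encode(frame) for frame in frames]
    dim = len(ref_embedding)
    mean_emb = [sum(e[d] for e in encoded) / len(encoded) for d in range(dim)]
    # strength=0 keeps the static reference; strength=1 fully trusts
    # the re-encoded outputs (the bias-variance knob in this sketch).
    return [(1 - strength) * r + strength * m
            for r, m in zip(ref_embedding, mean_emb)]

# Toy stand-ins (hypothetical): the "generator" perturbs the embedding
# per frame and the "encoder" is the identity.
def toy_generate(c):
    return [[c[0] + 0.2, c[1]], [c[0] - 0.2, c[1] + 0.4]]

def toy_encode(frame):
    return frame

refined = refine_conditioning([1.0, 2.0], toy_generate, toy_encode, strength=0.5)
```

Averaging over frames is what reduces variance in this toy version; how closely that matches the paper's analysis under Lipschitz assumptions cannot be judged from the abstract alone.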

2605.25477 2026-05-26 cs.RO cs.AI 版本更新

EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

EXPO-FT：面向视觉-语言-动作模型的样本高效强化学习微调

Perry Dong, Kuo-Han Hung, Tian Gao, Dorsa Sadigh, Chelsea Finn

发表机构 * Stanford University（斯坦福大学）

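AI summary: Proposes EXPO-FT, a system for sample-efficient RL finetuning of pretrained VLA policies, which reaches perfect performance (30/30 successes) on a range of high-precision manipulation tasks with an average of only 19.1 minutes of online robot data.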

详情

AI中文摘要

高效且可靠地学习新任务的能力一直是机器人学的基础挑战。视觉-语言-动作（VLA）模型在多种操作任务中展现出强大的泛化能力，但预训练策略始终无法达到实际部署所需的可靠性。强化学习（RL）微调为弥合这一差距提供了有前景的路径，但现有方法要么从头开始训练而未充分利用预训练先验，要么微调VLA而未达到实际部署所需的样本效率和成功率。我们提出了EXPO-FT，一个用于对预训练VLA策略进行稳定、样本高效的RL微调的系统，填补了这一空白。我们的系统解决了一系列具有挑战性的操作任务，包括串灯并插入插头点亮、将台球击入袋中、将花插入酒瓶，每个任务都需要高精度、动态动作以及对不同初始状态的鲁棒性。我们的系统在所有评估任务中均实现了完美的任务性能（30/30成功），平均仅需19.1分钟的在线机器人数据，优于先前的从头RL训练和VLA微调方法。我们发布了一个开源代码库，旨在促进机器人领域中VLA模型RL微调的更广泛采用。

英文摘要

The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) fine-tuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or fine-tune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light it up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. Our system achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.

2605.25475 2026-05-26 cs.CL cs.AI 版本更新

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

IndexMem: 基于潜在记忆的学习型KV缓存驱逐策略用于长上下文LLM推理

Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo

发表机构 * The Hong Kong University of Science（香港科学与技术大学）； Zhejiang University（浙江大学）

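AI summary: Proposes a learnable indexer that predicts KV importance, combined with a lightweight latent memory module that compresses evicted tokens, to enable accurate long-context inference under a bounded KV budget.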

详情

AI中文摘要

大型语言模型（LLM）越来越需要处理长上下文，但标准softmax注意力机制的KV缓存随序列长度线性增长，迅速成为长上下文推理的瓶颈。一种实用的补救措施是驱逐不太重要的KV条目；然而，现有的驱逐策略大多是启发式的，难以捕捉令牌重要性的丰富、输入相关的分布。在这项工作中，我们引入了一个可学习的索引器来预测KV重要性，从而能够更准确地保留关键令牌。同时，简单地驱逐令牌会永久丢弃其信息，导致不可逆的遗忘和长距离检索性能下降。为了解决这个问题，我们提出了一个轻量级的潜在记忆模块，将驱逐的令牌压缩成紧凑的、在线更新的状态，并提供残差读出以补偿通过KV驱逐丢失的注意力贡献。总的来说，我们的方法能够在有限的KV预算下实现准确的长上下文推理，在RULER（4K/16K）上对Qwen、Mistral和Llama模型（在激进驱逐下提升高达25分）带来一致的改进，在Needle-in-a-Haystack检索中显著更稳定，并且在LongBench得分和压缩曲线上优于现有的驱逐策略。

英文摘要

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.
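
A toy version of score-based eviction with a compressed memory of the evicted entries might look like the sketch below. The importance scores are passed in directly rather than predicted by a learned indexer, and the "latent memory" is just the mean of the evicted vectors; both are simplifying assumptions, as are the function name `evict_with_memory` and the sample data.

```python
def evict_with_memory(kv_entries, scores, budget):
    """Keep the `budget` highest-scoring KV entries (in original order)
    and compress everything evicted into one summary vector."""
    ranked = sorted(range(len(kv_entries)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])      # preserve sequence order
    dropped = sorted(ranked[budget:])
    kept = [kv_entries[i] for i in keep]
    memory = None
    if dropped:
        dim = len(kv_entries[0])
        # Stand-in for a latent memory state: mean of evicted vectors.
        memory = [sum(kv_entries[i][d] for i in dropped) / len(dropped)
                  for d in range(dim)]
    return kept, memory

# Toy cache (hypothetical vectors and importance scores).
kv = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
importance = [0.9, 0.1, 0.8, 0.2]
kept, memory = evict_with_memory(kv, importance, budget=2)
```

The cache stays at a fixed size while the summary vector retains some trace of what was dropped, which is the intuition behind pairing eviction with a memory module.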

2605.25459 2026-05-26 cs.LG cs.AI 版本更新

From Simulation to Enaction: Post-trained language models recognize and react to their own generations

从模拟到行动：后训练语言模型识别并回应自身生成

Asvin G., Jack Lindsey

发表机构 * Institute for Advanced Study, Princeton（普林斯顿高级研究院）； Anthropic

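AI summary: Finds that post-trained language models recognize their own (on-policy) generations and lower their output entropy accordingly, that this is modulated by an internal representation of input surprise, and that explicit recognition uses a different mechanism from implicit recognition.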

Comments Anthropic fellows project mentored by Jack Lindsey

详情

AI中文摘要

语言模型被预训练为被动预测器，没有动机去建模自身输出的后果。后训练改变了这一点：产生自身响应的模型可以从识别自身处于on-policy状态中获益。我们提供证据表明，后训练模型识别其on-policy生成，并且这种识别隐式编码在其输出分布中。特别是，在不同模型家族和规模类别中，on-policy输出分布熵比off-policy熵低3-4倍。我们将这种效应的部分原因追溯到输入意外性的内部表示，该表示跟踪模型先前预测中最新的输入标记的不可能性，并因果性地调节输出熵。这些现象的一个例子可以在对开放式提示的响应中观察到；后训练模型（与预训练模型不同）在第一个输出标记之前就将其对即将生成的响应主题的不确定性坍缩；用不同主题的前缀违反这种缓存意图会导致更高的输出熵。我们还测试了模型是否可以通过显式口头报告区分on-policy上下文和前缀。我们发现它们可以，但有趣的是，这种显式识别通过不同于隐式识别的机制进行路由。

英文摘要

Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training changes this: a model producing its own responses can benefit from recognizing that it is on-policy. We present evidence that post-trained models recognize their on-policy generations, and this recognition is implicitly encoded in their output distributions. In particular, on-policy output distribution entropy is 3-4× lower than off-policy entropy, across model families and size classes. We trace part of this effect to an internal representation of input surprise, tracking the unlikeliness of the most recent input token according to the model's prior predictions, that causally modulates output entropy. One example of these phenomena can be observed in response to open-ended prompts; post-trained models (unlike pretrained models) collapse their uncertainty over the topic of their upcoming response before the first output token; violating this cached intention with a different-topic prefill results in higher output entropy. We also tested whether models can distinguish on-policy contexts from prefills via explicit verbal report. We find that they can, but that interestingly, this explicit recognition routes through a different mechanism than implicit recognition.
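
The quantity being compared is the Shannon entropy of the next-token distribution. A short sketch of that computation follows; the two probability vectors are invented to contrast a peaked and a flat distribution and are not measurements from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution;
    zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy next-token distributions (hypothetical values).
peaked = [0.97, 0.01, 0.01, 0.01]   # model is confident
flat = [0.25, 0.25, 0.25, 0.25]     # model is uncertain
h_peaked = entropy(peaked)
h_flat = entropy(flat)
```

A confident distribution has much lower entropy than a uniform one over the same vocabulary, which is the sense in which on-policy outputs are described as lower-entropy.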

2605.25454 2026-05-26 cs.HC cs.AI cs.CL cs.CY cs.SI 版本更新

AI Content Moderation in Therapy Conversations

AI在治疗对话中的内容审核

Jiwon Kim, Claire Wang, Taeung Yoon, Sabelle Huang, Koustuv Saha

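AI summary: Audits the flagging behavior of three mainstream content moderation systems (OpenAI, Meta, Google) on real therapy conversations, showing how they limit the potential of LLMs as therapists.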

2605.25446 2026-05-26 cs.AI cs.LG 版本更新

CODESKILL: Learning Self-Evolving Skills for Coding Agents

Yanzhou Li, Yiran Zhang, Xiaoyu Zhang, Xiaoxia Liu, Yang Liu

发表机构 * Nanyang Technological University（南洋理工大学）； Zhejiang University（浙江大学）

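AI summary: Proposes the CODESKILL framework, which uses reinforcement learning to extract multi-granularity procedural skills from coding-agent trajectories and maintain a skill bank, improving downstream task solving.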

详情

AI中文摘要

编码智能体在解决软件工程任务时产生丰富的轨迹。为了实现智能体自我进化，这些轨迹可以提炼为可重用的程序性技能，以紧凑的方式编码经验来指导未来行为。然而，现有的技能构建和维护方法通常依赖固定提示和启发式更新规则，不清楚如何选择、抽象和维护知识以最好地服务下游智能体。我们提出CODESKILL，一个基于LLM的框架，将技能提取和技能库维护重新表述为可学习的管理策略。CODESKILL从编码智能体轨迹中提取多粒度程序性技能，用新经验进化技能，并维护一个紧凑的技能库用于未来任务解决。我们使用强化学习训练CODESKILL，采用混合奖励，将基于评分标准的密集技能质量反馈与来自冻结下游智能体的稀疏可验证执行反馈相结合。在EnvBench、SWE-Bench Verified和Terminal-Bench 2上的实验表明，CODESKILL相比无技能基线平均通过率提高9.69，相比最强的基于提示或记忆基线提高4.01，同时在迭代构建过程中将技能库大小维持在稳定水平。

英文摘要

Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.

2605.25427 2026-05-26 cs.CV cs.AI 版本更新

Binding Visual Features Point by Point

逐点绑定视觉特征

Udith Haputhanthri, Declan Campbell, Rim Assouel, Jonathan D. Cohen, Taylor W. Webb

发表机构 * Princeton University（普林斯顿大学）； Mila – Quebec AI Institute（魁北克AI研究所）； Université de Montréal（蒙特利尔大学）

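AI summary: Studies how a text-guided "pointing" mechanism addresses the binding problem of vision language models in multi-object scenes, finding that it induces an internal visual search routine, eliminates binding errors, and enables compositional generalization.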

详情

AI中文摘要

尽管在标准基准测试中取得了成功，但视觉语言模型在处理涉及多目标场景的任务时仍表现出持续的失败，包括许多对人类来说相对容易的任务。最近的研究发现，这些失败可能源于在上下文中准确绑定对象特征的基本能力缺失，这在认知科学和神经科学中被称为“绑定问题”。人类视觉系统被认为通过串行处理来解决这一绑定问题，即一次只关注一个对象，以避免来自其他对象的干扰。最近的研究提出了“指向”——使用显式空间坐标来指代对象——作为视觉语言模型的类似解决方案，并发现它提高了具有挑战性的多目标任务的性能。然而，目前尚不清楚这种方法为何（即在机制或表征层面）能提高性能，以及这与人类视觉中的串行处理有何直接关系。本文研究了这一问题。我们发现，通过文本学习指向会诱导内部视觉搜索程序，并描述了支持这一过程的机制。我们还发现，指向行为可以通过微调泛化到新任务，并且这样做可以消除绑定错误并实现组合泛化。这些结果提供了一个原理证明，即串行处理可以像解决生物视觉中的绑定问题一样，解决视觉语言模型中的绑定问题。

英文摘要

Despite success on standard benchmarks, vision language models display persistent failures on tasks involving processing of multi-object scenes, including many tasks that are relatively easy for humans. Recent work has found that these failures may stem from a basic inability to accurately bind object features in-context, a challenge that is referred to as the "binding problem" in cognitive science and neuroscience. The human visual system is thought to solve this binding problem via serial processing, attending to individual objects one at a time so as to avoid interference from other objects. Recent work has proposed "pointing" (the use of explicit spatial coordinates to refer to objects) as an analogous solution for vision language models, and found that it improves performance on challenging multi-object tasks. However, it is unclear why (i.e., on a mechanistic or representational level) this approach improves performance, and how directly this relates to serial processing in human vision. Here, we investigate this question. We find that learning to point-via-text induces an internal visual search routine, and we characterize the mechanisms that support this procedure. We also find that pointing behavior can be generalized to new tasks via fine-tuning, and that doing so eliminates binding errors and enables compositional generalization. These results provide a proof-of-principle that serial processing can solve the binding problem for vision language models just as it does for biological vision.

2605.25424 2026-05-26 cs.LG cs.AI 版本更新

SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning

SeqRoute: 通过离线强化学习实现全局预算感知的顺序LLM路由

Zhongling Xu, Shunan Zheng, Wei Wang

发表机构 * Department of Operations Research and Industrial Engineering（运筹学与工业工程系）

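AI summary: Proposes SeqRoute, which models multi-turn LLM routing as a finite-horizon Markov Decision Process and uses offline reinforcement learning (CQL) with Hindsight Budget Relabeling (HBR) to learn delayed gratification, optimizing cost and quality under a global budget and cutting bankruptcy rates to under 1%.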

详情

AI中文摘要

现有的LLM路由框架将查询视为独立事件，忽略了受全局计算预算约束的真实用户会话的顺序性质。这种不匹配不可避免地导致预算破产：短视的路由策略在早期交互中耗尽资源，迫使后续通常更复杂的查询使用不充分的模型。我们引入SeqRoute，一个将多轮路由建模为有限时域马尔可夫决策过程并通过离线强化学习求解的框架。通过将剩余预算纳入状态空间并使用保守Q学习（CQL）进行训练，SeqRoute学习延迟满足以策略性地为会话后期的高风险轮次保留资源。为了克服数据匮乏，我们提出事后预算重标记（HBR）。该技术在不同假设预算下回顾性地模拟历史轨迹，将10,000个原始会话扩展为238万个包含关键破产信号的转换。在部署时，动态λ扫描机制无需重新训练即可实现成本-质量帕累托前沿的零样本导航。大量评估表明，SeqRoute在保持或提高质量的同时将运营成本降低6.0-73.5%，并将破产率抑制在1%以下，在整个帕累托前沿上严格优于行为克隆、预算感知启发式和静态基线。

英文摘要

Existing LLM routing frameworks treat queries as independent events, neglecting the sequential nature of real-world user sessions constrained by global computational budgets. This mismatch inevitably leads to budget bankruptcy: myopic routing policies exhaust resources on early interactions, forcing subsequent and often more complex queries onto inadequate models. We introduce SeqRoute, a framework that formulates multi-turn routing as a finite-horizon Markov Decision Process and solves it via offline reinforcement learning. By incorporating the remaining budget into the state space and training with Conservative Q-Learning (CQL), SeqRoute learns delayed gratification to strategically preserve resources for high-stakes turns later in the session. To overcome data starvation, we propose Hindsight Budget Relabeling (HBR). This technique retrospectively simulates historical trajectories under diverse hypothetical budgets, expanding 10,000 raw sessions into 2.38 million transitions enriched with critical bankruptcy signals. At deployment, a dynamic λ-sweep mechanism enables zero-shot navigation of the cost-quality Pareto frontier without retraining. Extensive evaluations demonstrate that SeqRoute reduces operational costs by 6.0-73.5% while maintaining or improving quality, and suppresses bankruptcy rates to under 1%, strictly dominating behavior cloning, budget-aware heuristics, and static baselines across the entire Pareto frontier.
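
The relabeling idea can be illustrated with a small sketch that replays one logged session under several hypothetical budgets and records where the budget would have run out. The transition fields, the function name `relabel`, and the cost values are assumptions for illustration; the paper's actual state and reward definitions are not given in the abstract.

```python
def relabel(trajectory_costs, budgets):
    """Replay a logged sequence of per-turn costs under each
    hypothetical budget, emitting one transition per turn with the
    remaining budget in the state and a bankruptcy flag."""
    transitions = []
    for budget in budgets:
        remaining = budget
        for turn, cost in enumerate(trajectory_costs):
            bankrupt = cost > remaining
            transitions.append({
                "budget": budget,
                "turn": turn,
                "remaining": remaining,
                "cost": cost,
                "bankrupt": bankrupt,
            })
            if bankrupt:
                break  # the session cannot continue past bankruptcy
            remaining -= cost
    return transitions

# Toy session (hypothetical costs) replayed under two budgets.
transitions = relabel([3, 2, 4], budgets=[10, 5])
```

One logged session yields several training trajectories, some ending in bankruptcy, which is how a modest number of raw sessions can expand into many more transitions.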

2605.25422 2026-05-26 eess.SP cs.AI cs.IT math.IT (updated version)

A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration

Lipeng Dai, Luping Xiang, Kun Yang

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, 210008, China; Institute of Intelligent Networks and Communications (NINE), Nanjing University (Suzhou Campus), Suzhou, 215163, China

AI summary: To address the end-to-end latency trade-off caused by heterogeneous interaction media in multi-agent collaboration, proposes a joint optimization of communication-media selection and wireless resource allocation, and designs a low-complexity algorithm to minimize latency.

Abstract

The convergence of large language models (LLMs) with 6G networks is fostering a paradigm of autonomous multi-agent cooperation, which in turn is expected to substantially increase east-west traffic. Although latent-space interaction mechanisms can enable more efficient collaboration than symbolic natural-language (NL) exchanges, prior work often abstracts away the associated communication overhead under practical wireless constraints. In embodied multi-agent settings, heterogeneous interaction media incur disparate inference and transmission costs, thereby inducing an inherent end-to-end (E2E) latency trade-off. To address this, we propose a joint design that integrates communication-media selection with wireless resource allocation. Through analytical characterization and simulation-based evaluation, we show that neither token-based transmission nor key-value (KV) cache-based transmission is uniformly optimal across operating regimes, as performance depends critically on system parameters such as available computational resources and channel conditions. Accordingly, we formulate a joint optimization problem aimed at minimizing the E2E latency of multi-agent collaboration and develop a low-complexity joint media selection and resource allocation (JMSRA) algorithm. Numerical results further confirm that, by adaptively coordinating the interaction media and bandwidth allocation over heterogeneous links, the proposed scheme achieves markedly reduced E2E latency relative to conventional NL-only and KV-cache-only baselines, enabling efficient and robust multi-agent collaboration in future wireless networks.

2605.25420 2026-05-26 cs.CL cs.AI cs.CY (updated version)

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

Khalid Yusuf Dahir

Affiliations: Independent researcher

AI summary: Builds a Somali harmful-intent benchmark and evaluates four open-weight models, finding substantial English-to-Somali refusal-rate gaps and that most non-refusal outputs are non-fluent, unusable content.

Comments 12 pages, 3 figures, 4 tables. Code: https://github.com/khaledyusuf44/somalibench_eval Dataset: https://huggingface.co/datasets/khaledyusuf44/somalibench-v0

Abstract

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.
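A refusal gap with a bootstrap interval can be computed along the following lines. The abstract reports 95% bootstrap CIs but does not say which bootstrap variant was used; the paired percentile bootstrap below, the function name, and the synthetic labels are assumptions for illustration and do not reproduce the paper's data.

```python
import random

def refusal_gap_ci(en_refused, so_refused, n_boot=2000, seed=0):
    """Paired percentile bootstrap for the English-minus-Somali refusal-rate gap.

    en_refused, so_refused: lists of 0/1 flags, one per paired prompt.
    Returns (point_estimate, lower_bound, upper_bound) for a 95% interval.
    """
    assert len(en_refused) == len(so_refused)
    n = len(en_refused)
    rng = random.Random(seed)
    point = (sum(en_refused) - sum(so_refused)) / n
    gaps = []
    for _ in range(n_boot):
        # Resample prompt pairs with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(en_refused[i] - so_refused[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int(0.025 * n_boot)]
    upper = gaps[int(0.975 * n_boot) - 1]
    return point, lower, upper

# Synthetic example: 90 of 100 prompts refused in English, none in Somali.
en = [1] * 90 + [0] * 10
so = [0] * 100
point, lower, upper = refusal_gap_ci(en, so)
```

Resampling pairs rather than individual responses keeps the English and Somali versions of the same prompt together, which matches the paired design of the benchmark.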

2605.25399 2026-05-26 cs.AI (updated version)

Towards end-to-end LLM-based censoring-aware survival analysis

Yishu Wei, Hexin Dong, Yi Lin, Jiahe Qian, Yi Liu, Yifan Peng

Affiliations: Department of Population Health Science, Weill Cornell Medicine; Weill Cornell Medicine

AI summary: Proposes the LLMSurvival framework, which reformulates time-to-event prediction as pairwise ranking to enable censoring-aware survival analysis, outperforming the Cox proportional hazards model and three deep learning models on ICU mortality and fracture risk prediction.

Abstract

Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival models because censoring prevents straightforward supervised fine-tuning. Here we present LLMSurvival, a framework that enables censoring-aware survival analysis with unmodified LLMs operating directly on tabular clinical data. Materials and Methods: LLMSurvival reformulates time-to-event prediction as pairwise ranking among comparable subjects, and derives test-time risk by aggregating comparisons against anchor individuals from the training cohort. Results: Across two clinical tasks (ICU mortality prediction in MIMIC-IV and fragility fracture prediction in a NewYork-Presbyterian/Weill Cornell Medicine cohort), LLMSurvival improves overall concordance over Cox proportional hazards modeling by 3.1% for ICU mortality and 0.5% for fracture risk, and over three established deep learning survival models by an average of 2.1% for ICU mortality and 2.8% for fracture risk. Discussion: The results show that survival modeling with censoring can be made compatible with LLM fine-tuning through comparison-based reformulation. The framework demonstrates high portability and superior performance over expert-curated scores such as SAPS-II and FRAX across diverse clinical contexts. Furthermore, the framework supports local deployment, as compact, publicly available base models provide sufficient performance. Conclusion: The LLMSurvival framework serves as a proof of concept for an integrated, censoring-conscious approach to survival analysis via LLMs.
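The notion of "comparable subjects" under censoring can be sketched as follows. This uses the standard comparability rule from concordance-index computation (the earlier time must be an observed event); whether LLMSurvival uses exactly this rule, and how it aggregates anchor comparisons, is not stated in the abstract, so both functions are illustrative assumptions.

```python
def comparable_pairs(times, events):
    """Return (i, j) pairs where subject i is known to fail before subject j.

    times:  follow-up time per subject
    events: 1 if the event was observed, 0 if the subject was censored
    A pair is usable only when the earlier time is an observed event, since
    a censored subject's true event time is unknown.
    """
    pairs = []
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member
        for j in range(n):
            if times[i] < times[j]:
                pairs.append((i, j))
    return pairs

def risk_score(judged_higher_risk):
    """Assumed aggregation: share of anchors the test subject is ranked above."""
    return sum(judged_higher_risk) / len(judged_higher_risk)

# Hypothetical cohort: subjects 1 and 2 are censored.
pairs = comparable_pairs(times=[5, 8, 3, 10], events=[1, 0, 0, 1])
score = risk_score([1, 1, 0, 1])
```

Subject 2, censored at time 3, yields no pairs as the earlier member, which is how censoring is respected without discarding the subject as a later member of other pairs.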

2605.25396 2026-05-26 cs.CV cs.AI (updated version)

Subspace-Guided Semantic and Topological Invariant Registration for Annotation-Free Ultrasound Plane Quality Control

Chunzheng Zhu, Jianxin Lin, Feng Wang, Cheng Jiang, Guanghua Tan, Zhenyu Zhou, Shengli Li, Kenli Li

Affiliations: Hunan University; Shenzhen Maternity and Child Healthcare Hospital

AI summary: Proposes the STRIQ framework, which uses a subspace-guided registration consistency measure for annotation-free ultrasound plane quality control and achieves state-of-the-art correlation with clinical quality scores.

Comments MICCAI 2026 Accepted Paper; Subspace-Guided Registration for Ultrasound Quality Control

Abstract

Reliable quality control (QC) of ultrasound images is essential for both real-time acquisition guidance and retrospective clinical audit, yet existing approaches rely heavily on per-plane annotations, or employ pseudo-labeling prone to systematic bias under spatial deformations inherent in clinical acquisition. We present STRIQ, a registration-driven framework that recasts annotation-free US plane quality control as a subspace-guided consistency measurement problem. Specifically, STRIQ introduces a Latent Registration Aligner (LRA) to establish hierarchical feature space correspondences between query images and variance-driven anchors, which are autonomously distilled from unlabeled data via a variance spectrum criterion to serve as structurally stable prototypes. To further disambiguate anatomical planes and mitigate negative knowledge transfer, we propose an Orthogonal Knowledge Subspace (OKS) module. The OKS decomposes plane-specific representations into mutually orthogonal subspaces, enabling fine-grained expert collaboration while preventing inter-plane interference, ensuring that the quality metric is grounded in principled subspace proximity. Extensive experiments on the in-house US4QA and public CAMUS datasets demonstrate that STRIQ achieves state-of-the-art correlation with clinical quality scores, establishing a new paradigm for annotation-free, real-time reliable ultrasound quality control. Our code is available at https://github.com/zhcz328/STRIQ.

2605.25394 2026-05-26 cs.AI cs.CL (updated version)

Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

Affiliations: University of Southern California; Information Sciences Institute

AI summary: Proposes Second Guess, a lightweight, parameter-free prompting technique that adds an "I don't know" option and observes answer stability to enable abstention in multiple-choice question answering, effectively detecting uncertainty in small language models.

Abstract

Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose Second Guess, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models which truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an "I don't know" option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81%. Notably, it maintains an 8% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work are available at https://github.com/Mystic-Slice/second-guess
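The stability check described above might look roughly like this. The abstract does not give the exact decision rule, so the rule here (abstain if the second answer is "I don't know" or differs from the first) and the `ask` callback standing in for a model call are assumptions; the authors' repository is the reference for the actual method.

```python
def second_guess(ask, question, options, idk="I don't know"):
    """Ask twice, once without and once with an 'I don't know' option.

    ask(question, options) -> the chosen option string; a hypothetical
    stand-in for a call to a small language model.
    Returns the answer if it is stable, or None to signal abstention.
    """
    first = ask(question, options)
    second = ask(question, options + [idk])
    if second == idk or second != first:
        return None  # the answer changed or the model opted out: abstain
    return first

# Stub "models" for illustration only.
def confident_model(question, options):
    return "Paris"

def unsure_model(question, options):
    return "I don't know" if "I don't know" in options else "Lyon"

choices = ["Paris", "Lyon", "Nice"]
stable = second_guess(confident_model, "Capital of France?", choices)
abstained = second_guess(unsure_model, "Capital of France?", choices)
```

The technique needs only two forward passes and no access to logits, which is what makes it parameter-free.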

2605.25389 2026-05-26 cs.CR cs.AI cs.MA (updated version)

Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS

Bingyu Yan, Xiaoming Zhang, Jinyu Hou, Chaozhuo Li, Ziyi Zhou, Yiming Hei, Litian Zhang

Affiliations: Beihang University; Beijing University of Posts and Telecommunications; China Academy of Information and Communications Technology

AI summary: Proposes Evo-Attacker, which models tool attacks as a self-evolving process within a memory-augmented reinforcement learning framework and introduces Attack-Flow GRPO to optimize long-horizon credit assignment; experiments show it outperforms baseline methods.

Comments ACL 2026 main

Abstract

While Large Language Model-based Multi-Agent Systems (LLM-MAS) demonstrate remarkable capabilities in solving complex tasks by orchestrating specialized agents and external tools, the implicit trust in tool outputs creates a critical attack surface. Existing tool attacks are limited by domain specificity or fixed and static templates. To address these challenges, we propose Evo-Attacker, which formulates the tool attack as a self-evolving, memory-augmented reinforcement learning process. Evo-Attacker constructs a dynamic attack memory and employs deliberative reasoning to retrieve adversarial patterns and strategize modifying interventions at critical moments. Furthermore, we introduce Attack-Flow GRPO to optimize intermediate reasoning steps via terminal outcomes, addressing the long-horizon credit assignment challenge. Comprehensive experiments demonstrate that Evo-Attacker consistently outperforms baselines, highlighting its generalization and evolutionary capabilities and the urgent need for defensive tool safeguards.

2605.25385 2026-05-26 cs.CV cs.AI (updated version)

Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance

Xia Li, Xinran Liu, Lin Qi, Junyu Dong

Affiliations: School of Computer Science and Technology, Ocean University of China, Qingdao 266100, China

AI summary: Proposes MGNet, a network that uses SAM to generate pseudo-labels and combines a cascaded mask decoder, a context enhancement module, and a mask-guided feature aggregation module for weakly supervised camouflaged object detection, with performance comparable to fully supervised methods.

Comments 18 pages

DOI: 10.1016/j.imavis.2025.105571

Abstract

Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high-quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.

2605.25377 2026-05-26 cs.CV cs.AI (updated version)

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

Ruoxi Cheng, Haoxuan Ma, Zhengfei Hai, Yiyan Huang, Ranjie Duan, Tianle Zhang, Xu Yang, Ziyi Ye, Xingjun Ma

Affiliations: Fudan University; Tencent; Nanjing University; Southeast University; Great Bay University; TeleAI, China Telecom

AI summary: Proposes the Adversarial Orthogonal Disentanglement (AOD) framework, which learns a hallucination-related direction through a minimax objective and uses a training-free dual-forward-pass contrastive decoding strategy to mitigate hallucinations in large vision-language models (LVLMs).

Abstract

Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at https://github.com/Hunter-Wrynn/AOD.
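The split into a "projected component" and an "orthogonal residual" is ordinary vector projection, sketched below for a single direction. How AOD learns the direction and which hidden states it operates on are not covered here; the vectors are hypothetical and the function name is chosen for illustration.

```python
def split_along(h, v):
    """Decompose hidden vector h into its part along direction v and the
    orthogonal residual, so that h == projected + residual."""
    norm_sq = sum(x * x for x in v)
    coef = sum(a * b for a, b in zip(h, v)) / norm_sq
    projected = [coef * x for x in v]
    residual = [a - p for a, p in zip(h, projected)]
    return projected, residual

# Hypothetical 2-D hidden state and direction.
projected, residual = split_along([3.0, 4.0], [1.0, 0.0])
```

In the setup the abstract describes, a classifier would read the projected part while an adversary tries to make the residual uninformative about hallucination.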

2605.25358 2026-05-26 cs.CL cs.AI cs.CY (updated version)

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

Thomas Stephan Juzek

Affiliations: Florida State University

AI summary: Analyzes news corpora in 34 languages with a GPT-4.1 continuation diagnostic, finding that AI-overused vocabulary converges semantically across languages and that use of these words increased markedly after ChatGPT's release.

Comments 19 pages (9-page main body, plus references and appendices), 3 figures; ACL ARR reviewed, committed to EMNLP 2026

Abstract

AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive ranked AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT's release. Tracking each language's top 20 AI-overused items, we find prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English. We validate our approach extensively, including across seeds, model variants, data sizes, model families, and more. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.
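Ranking lemmas by a log prevalence ratio can be sketched as below. The abstract does not define prevalence or any smoothing, so treating prevalence as the share of texts containing a lemma, the add-0.5 smoothing, and the example counts are all assumptions for illustration.

```python
import math

def log_prevalence_ratio(ai_with, ai_total, human_with, human_total, smooth=0.5):
    """Log ratio of the share of AI continuations containing a lemma to the
    share of matched human texts containing it (smoothed to avoid log(0))."""
    p_ai = (ai_with + smooth) / (ai_total + 2 * smooth)
    p_human = (human_with + smooth) / (human_total + 2 * smooth)
    return math.log(p_ai / p_human)

def rank_overused(counts, ai_total, human_total):
    """counts maps lemma -> (texts with lemma in AI set, in human set)."""
    scores = {
        lemma: log_prevalence_ratio(a, ai_total, h, human_total)
        for lemma, (a, h) in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical counts out of 100 AI continuations and 100 human texts.
counts = {"emphasize": (40, 5), "say": (50, 60), "underscore": (20, 4)}
ranking = rank_overused(counts, ai_total=100, human_total=100)
```

A positive score marks a lemma as more prevalent in model continuations than in the matched human text; the top of the ranking gives the AI-overused items tracked over time.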

2605.25354 2026-05-26 cs.AI (updated version)

Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

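Hongbo Jin, Mingnan Zhu, Jingqi Tian, Xu Jiang, Zhongjing Du, Haoran Tang, Siyi Xie, Qiaoman Zhang, Jiayu Ding

Affiliations: Peking University; Xiamen University; Tsinghua University

AI summary: To address the limited context-learning ability of large language models in dynamically extracting and applying new knowledge, proposes Context-CoT, which synthesizes high-quality reasoning chains to enhance context learning and substantially improves performance on CL-Bench.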

2605.25352 2026-05-26 cs.LG cs.AI (updated version)

CausalFlow: Causal Attribution and Counterfactual Repair of LLM Agent Failures

Akash Bonagiri, Devang Borkar, Gerard Janno Anderias, Setareh Rafatirad, Houman Homayoun

Affiliations: Department of Computer Science, University of California, Davis

AI summary: Proposes the CausalFlow framework, which computes step-level causal responsibility scores through counterfactual intervention, identifies failure-inducing steps, and generates minimally edited repairs for test-time repair and training-time supervision, outperforming heuristic methods on multiple benchmarks.

Abstract

Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While such failures are typically logged or retried heuristically, they contain structured signals about where execution broke down. We introduce CausalFlow, an interventional framework that converts failed agent traces into minimal counterfactual repairs and reusable supervision. CausalFlow models execution traces as sequential chains of dependent steps and computes Causal Responsibility Scores (CRS) via step-level counterfactual intervention to identify failure-inducing steps. For these steps, we generate minimally edited repairs that flip the final outcome to success, producing validated contrastive pairs of the form (wrong step, corrected step). CausalFlow supports two complementary uses: targeted test-time repair that recovers from failures with minimal behavioral drift, and training-time supervision suitable for offline preference optimization or reward modeling. Across four benchmarks spanning mathematical reasoning, code generation, question answering, and medical browsing, CausalFlow converts failed executions into validated minimal repairs with high minimality and causal-consensus scores, and demonstrates that causal attribution is necessary for reliable improvement across diverse agent tasks, outperforming heuristic refinement in complex retrieval settings while producing more localized repairs throughout. These results demonstrate that interventional analysis over structured execution traces provides a principled and scalable mechanism for transforming agent failures into reliability gains and learning-ready supervision.
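Step-level counterfactual intervention can be illustrated with a toy trace. The abstract does not define how the Causal Responsibility Score is computed, so the binary score below, the `repair` and `succeeds` callbacks, and the numeric "steps" are hypothetical; a real agent would also need to re-execute the steps downstream of each intervention.

```python
def responsibility_scores(trace, repair, succeeds):
    """Intervene on one step at a time and check whether the outcome flips.

    trace:    list of steps from a failed run
    repair:   function proposing a corrected version of a single step
    succeeds: function deciding whether a full trace reaches success
    Returns 1.0 for steps whose repair alone turns failure into success.
    """
    scores = []
    for i in range(len(trace)):
        patched = trace[:i] + [repair(trace[i])] + trace[i + 1:]
        scores.append(1.0 if succeeds(patched) else 0.0)
    return scores

# Toy setup: a run succeeds only if every step value is positive.
failed_trace = [2, 3, -1, 4]
scores = responsibility_scores(failed_trace, repair=abs,
                               succeeds=lambda t: all(x > 0 for x in t))
culprit = scores.index(1.0)
contrastive_pair = (failed_trace[culprit], abs(failed_trace[culprit]))
```

The resulting (wrong step, corrected step) pair has the shape the abstract describes for preference optimization or reward modeling.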

2605.25313 2026-05-26 cs.LG cs.AI cs.RO stat.ML (updated version)

UWM-JEPA: Predictive World Models That Imagine in Belief Space

Santosh Kumar Radha, Oktay Goktas

Affiliations: AgentField AI

AI summary: For partially observable environments, proposes UWM-JEPA, which uses a density-matrix latent and a unitary predictor to preserve the joint-state spectrum in belief space, retaining uncertainty under blind rollout and substantially outperforming vector-latent baselines.

Comments 14 pages, 6 figures, 7 tables. Code and data: https://github.com/santoshkumarradha/uwm-jepa

Abstract

World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector-valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM-JEPA), a JEPA world model with a density-matrix latent on a joint system-environment space and a learned unitary predictor. The construction preserves the joint-state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden-velocity indicator task requiring five-step forward simulation under a given action sequence with the target observation masked, UWM-JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter-matched LSTM-JEPA trained under the same counterfactual-target objective and action head collapses to majority-class accuracy (0.53) under every action condition. Under blind rollout, UWM-JEPA loses fewer than ten points of probe R^2 at short horizons while vector-latent baselines lose forty-one and sixty-eight; both nevertheless tie on a held-out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher-forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context-encoding capacity alone.
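The claim that unitary evolution cannot dissipate uncertainty can be checked numerically on a small example. The sketch below uses a 2x2 real rotation as a special case of a unitary and omits the paper's joint system-environment construction; the matrices and helper names are assumptions for illustration.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def purity(rho):
    """Tr(rho^2): equals 1 for a pure state, smaller for a mixed one."""
    return trace(matmul(rho, rho))

def evolve(rho, u):
    """One rollout step rho -> U rho U^T for a real orthogonal U."""
    return matmul(matmul(u, rho), transpose(u))

theta = 0.4
u = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
rho = [[0.7, 0.0], [0.0, 0.3]]  # mixed state with eigenvalues 0.7 and 0.3

rolled = rho
for _ in range(5):  # five blind rollout steps
    rolled = evolve(rolled, u)
```

For a 2x2 matrix the trace and purity together determine the eigenvalues, so their invariance shows the spectrum (0.7, 0.3) survives the rollout even though the matrix entries change.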

2605.25293 2026-05-26 cs.CV cs.AI cs.RO (updated version)

Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks

Sambit Mohapatra, Senthil Yogamani, Heinrich Gotzig, Patrick Mader

Affiliations: Valeo, Germany; Valeo, Ireland; TU Ilmenau, Germany

AI summary: Proposes an end-to-end spiking encoder-decoder network for object detection in bird's eye view representations of LiDAR point clouds, trained with surrogate gradient backpropagation, achieving high accuracy on the KITTI benchmark and a 3.33x reduction in synaptic operation energy.

Abstract

Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Traditional convolutional neural networks achieve strong detection accuracy but are computationally intensive, limiting their suitability for deployment on resource-constrained neuromorphic platforms. Spiking neural networks offer a compelling alternative through event-driven sparse computation, yet their application to complex real-world perception tasks such as three-dimensional object detection remains limited. In this work, we propose an end-to-end spiking encoder-decoder network for object detection in bird's eye view representations of LiDAR point clouds, trained using surrogate gradient backpropagation. We train two variants: a membrane potential variant that reads continuous neuron state at the output stage for maximum accuracy, achieving $92.05$/$87.04$/$86.51$ AP at $\mathrm{IoU}\!=\!0.5$ (Easy/Moderate/Hard), and a fully binary spiking variant that operates exclusively on spike trains at every layer for direct neuromorphic deployment. We evaluate four input spike encoding strategies and demonstrate that allowing the network to learn spike representations directly from data outperforms hand-crafted Poisson, latency, and z-axis encoding schemes on the KITTI benchmark, where sequential frames are unavailable and the BEV input is presented repeatedly across timesteps as a proxy for temporal streaming. A block-wise energy analysis demonstrates a $3.33\times$ reduction in synaptic operation energy over an equivalent CNN under conservative loop-based operation. Together, these results demonstrate the viability of spiking neural networks for accurate and energy-efficient neuromorphic perception in autonomous driving.

2605.25272 2026-05-26 cs.AI cs.CY stat.AP (updated version)

AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

Michael Hardy, Anka Reuel, Lijin Zhang, Jodi M. Casabianca, Sang Truong, Yash Dave, Hansol Lee, Benjamin Domingue, Sanmi Koyejo

Affiliations: Open LLM Leaderboard; HELM; ICML

AI summary: To address measurement noise in leaderboard scores, proposes a framework based on confirmatory factor analysis and generalizability theory that decomposes sources of ranking variance, reveals relationships between benchmarks, local dependence, and the influence of metadata, and compares the reliability of manifest and latent scaling laws.

Abstract

While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when rankings reflect genuine capability differences versus evaluation artifacts. We introduce a framework for measuring the latent landscape in AI benchmark ecosystems. Applying Confirmatory Factor Analysis (CFA) and Generalizability Theory to 4,000+ models from the Open LLM Leaderboard, we decompose sources of ranking variance and establish: (1) structures assumed in current reporting practice underestimate the strength of relationships between benchmarks; (2) evidence of local dependence among leaderboard items, undermining uses of benchmarks as measurement instruments under current scoring systems; (3) contributor metadata explains more rank-relevant variance ($\approx9\%$) than architecture or deployment categories in this context; (4) a manifest-score "scaling law" slope has low reliability ($R_β=0.53$); by contrast, the latent general-factor size slope is highly stable across ecosystem controls ($R_g=0.97$). We are able to provide unique insights into benchmark dynamics, such as which benchmarks are a function of LLM size and which can be oppositely impacted by post-training practices. We provide actionable diagnostics to determine how benchmark rankings can be trusted and how benchmark design can be improved.

2605.25271 2026-05-26 math.AG cs.AI cs.NE (updated version)

Positivity in classical enumerative geometry: a case study in synchronized AI-assisted mathematics

经典枚举几何中的正性：同步AI辅助数学的案例研究

Gergely Bérczi, László M. Fehér

发表机构 * Department of Mathematics, Aarhus University（奥胡斯大学数学系）； Eötvös University Budapest（布达佩斯厄特沃什大学）； Alfréd Rényi Institute of Mathematics（阿尔弗雷德·雷尼数学研究所）

AI总结研究对称多项式∏_{α∈A_{n,d}}(1+α_1 x_1+⋯+α_n x_n)（即Sym^d(C^n)的全陈类）的齐次部分c_k(n,d)的结构，通过AI与人类协作证明相关猜想、建立显式公式并研究对数凹性。

Comments 29 pages

AI中文摘要

我们研究对称多项式 $\prod_{\alpha\in A_{n,d}}\bigl(1+\alpha_1 x_1+\cdots+\alpha_n x_n\bigr)$，其中 $A_{n,d}:=\{\alpha\in\mathbb{Z}_{\ge 0}^n:|\alpha|=d\}$，它是 $\mathrm{Sym}^d(\mathbb{C}^n)$ 的全陈类，视为一个环面表示，其陈根为权重 $\alpha_1 x_1+\cdots+\alpha_n x_n$（$\alpha\in A_{n,d}$）。其齐次 $k$ 次部分 $c_k(n,d)$ 是 $\mathrm{Sym}^d(\mathbb{C}^n)$ 的第 $k$ 个陈类。这些陈类及其在各种对称函数基中的系数在枚举几何中起着核心作用。尽管定义简单，但其系数的通用封闭公式却十分微妙，且这些类的许多结构性质至今仍知之甚少。在本文中，我们证明了关于其结构的几个猜想，建立了显式公式，并研究了陈类及其 $K$ 理论类比的对数凹性。在秩为二的情况下，通过过渡到 Schur 基并将 Schur 系数在 $d$ 的二项式基中展开，我们发现了一种新的二项式对数凹性现象，并证明了精细的正性结果。本文展示了一种新颖的方法论：我们将多个AI系统与人类数学洞察力结合在协调的工作流程中，根据每个工具在实验发现、猜想形成、符号证明构建和验证方面的优势进行部署。据我们所知，这是协调多个AI工具在连贯的数学研究项目中取得实质性进展的首批详细案例研究之一。

英文摘要

We study the symmetric polynomial $\prod_{\alpha\in A_{n,d}}\bigl(1+\alpha_1 x_1+\cdots+\alpha_n x_n\bigr)$ where $A_{n,d}:=\{\alpha\in\mathbb{Z}_{\ge 0}^n:|\alpha|=d\}$, which is the total Chern class of $\mathrm{Sym}^d(\mathbb{C}^n)$, viewed as a torus representation whose Chern roots are the weights $\alpha_1 x_1+\cdots+\alpha_n x_n$ for $\alpha\in A_{n,d}$. Its homogeneous degree-$k$ part $c_k(n,d)$ is the $k$-th Chern class of $\mathrm{Sym}^d(\mathbb{C}^n)$. These Chern classes, together with their coefficients in various symmetric function bases, play a central role in enumerative geometry. Despite their simple definition, general closed formulas for their coefficients are subtle, and many structural properties of these classes have remained poorly understood. In this paper we prove several conjectures concerning their structure, establish explicit formulas, and study log-concavity properties for both the Chern classes and their $K$-theoretic analogue. In rank two, passing to the Schur basis and expanding the Schur coefficients in the binomial basis of $d$, we uncover a new binomial log-concavity phenomenon and prove refined positivity results. The paper demonstrates a novel methodology: we combine several AI systems with human mathematical insight in a coordinated workflow, deploying each tool according to its strengths in experimental discovery, conjecture formation, symbolic proof construction, and verification. To our knowledge, this is one of the first detailed case studies of orchestrating multiple AI tools to make substantial progress on a coherent mathematical research project.
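To make the definition concrete, the sketch below expands the product for small $n$ and $d$ using plain dictionary-based polynomial arithmetic and then extracts the homogeneous parts $c_k(n,d)$. It is only an illustration of the definition in the abstract; the helper names (`weights`, `poly_mul`, `total_chern_class`, `homogeneous_part`) are hypothetical and not taken from the paper.

```python
from itertools import product as cartesian

def weights(n, d):
    """All alpha in Z_{>=0}^n with |alpha| = d, i.e. the set A_{n,d}."""
    return [a for a in cartesian(range(d + 1), repeat=n) if sum(a) == d]

def poly_mul(p, q):
    """Multiply polynomials stored as {exponent tuple: coefficient}."""
    out = {}
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            e = tuple(i + j for i, j in zip(e1, e2))
            out[e] = out.get(e, 0) + c1 * c2
    return {e: c for e, c in out.items() if c != 0}

def total_chern_class(n, d):
    """Expand prod over alpha in A_{n,d} of (1 + alpha_1 x_1 + ... + alpha_n x_n)."""
    zero = (0,) * n
    result = {zero: 1}
    for alpha in weights(n, d):
        factor = {zero: 1}
        for i, a in enumerate(alpha):
            if a:
                unit = tuple(1 if j == i else 0 for j in range(n))
                factor[unit] = a  # the term alpha_i * x_i
        result = poly_mul(result, factor)
    return result

def homogeneous_part(poly, k):
    """Degree-k part of the expansion, i.e. c_k(n, d)."""
    return {e: c for e, c in poly.items() if sum(e) == k}

c = total_chern_class(2, 2)
for k in (1, 2, 3):
    print(k, homogeneous_part(c, k))
```

For $n=d=2$ the product is $(1+2x_1)(1+x_1+x_2)(1+2x_2)$, which gives $c_1=3x_1+3x_2$, $c_2=2x_1^2+8x_1x_2+2x_2^2$, and $c_3=4x_1^2x_2+4x_1x_2^2$. Brute-force expansion like this is only practical for small cases, since $|A_{n,d}|$ grows quickly with $n$ and $d$.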

2605.25267 2026-05-26 cs.LG cs.AI 版本更新

Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning

潜在Q-屏障屏蔽用于安全上下文强化学习

Minjae Kwon, Amir Moeini, Shangtong Zhang, Lu Feng

发表机构 * University of Virginia（弗吉尼亚大学）

AI总结提出一种潜在Q-屏障屏蔽方法，通过学习上下文表示、潜在动力学和集成成本评论家，在部署时无需参数更新即可根据剩余预算和预测未来成本过滤或软重加权候选动作，从而改善安全上下文强化学习在分布外转移下的奖励-安全权衡。

AI中文摘要

安全上下文强化学习（ICRL）在测试时不更新参数，仅从交互历史中在线适应，同时将情节成本控制在安全预算内。在分布外（OOD）部署转移下，仅预训练的安全ICRL可能产生较差的奖励-安全权衡，因为剩余预算仅通过冻结的策略条件影响行为，而非通过针对预测未来成本的显式动作级检查。我们提出一种潜在Q-屏障屏蔽，在部署前学习上下文表示、潜在动力学和集成成本评论家。无需参数更新，该屏蔽从历史中推断上下文，并使用剩余预算和预测未来成本过滤或软重加权候选动作。我们证明了一个条件性的、误差分解的屏障-边际结果：满足Q-屏障的动作将下一个潜在预算状态置于近似预算安全的延续中（在学习的评论家下），误差上界由贝尔曼误差和潜在预测误差决定。在五个安全ICRL基准测试中，该屏蔽在部署时相比强安全ICRL基线改善了奖励-安全权衡：在短上下文窗口后，它在五个基准中的四个上实现了更高的回报，同时在所有五个基准中匹配或降低了平均情节成本。

英文摘要

Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling episode cost under a safety budget. Under out-of-distribution (OOD) deployment shifts, pretraining-only safe ICRL can give poor reward-safety tradeoffs because the remaining budget affects behavior only through frozen policy conditioning, not an explicit action-level check against predicted future cost. We propose a latent Q-Barrier shield that learns a context representation, latent dynamics, and an ensemble cost critic before deployment. Without parameter updates, the shield infers context from history and filters or softly reweights candidate actions using the remaining budget and predicted future cost. We prove a conditional, error-decomposed barrier-margin result: a Q-Barrier-satisfying action leaves the next latent-budget state with an approximately budget-safe continuation under the learned critic, up to Bellman and latent-prediction errors. Across five safe ICRL benchmarks, the shield improves deployment-time reward-safety tradeoffs over a strong safe-ICRL baseline: after a short context window, it achieves higher return in four of five benchmarks while matching or lowering average episode cost in all five.
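The action-level check described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the pessimistic aggregation of the ensemble (taking the maximum), the linear penalty used for soft reweighting, and every name (`pessimistic_cost`, `shield`, `penalty`) are assumptions made for the example.

```python
import math

def pessimistic_cost(ensemble_costs):
    """Aggregate ensemble predictions of future cost (assumption: take the max)."""
    return max(ensemble_costs)

def shield(logits, ensemble_costs, remaining_budget, penalty=5.0, hard=True):
    """Return action probabilities after a budget check.

    logits[i]         -- frozen policy's score for candidate action i
    ensemble_costs[i] -- critic ensemble's predicted future costs for action i
    remaining_budget  -- safety budget left in the episode
    """
    costs = [pessimistic_cost(c) for c in ensemble_costs]
    violation = [max(0.0, c - remaining_budget) for c in costs]
    if hard and any(v == 0.0 for v in violation):
        # Hard filter: drop every action predicted to exceed the budget.
        scores = [l if v == 0.0 else -math.inf for l, v in zip(logits, violation)]
    else:
        # Soft reweighting; also the fallback when no action passes the check.
        scores = [l - penalty * v for l, v in zip(logits, violation)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = shield(
    logits=[2.0, 1.0, 0.0],
    ensemble_costs=[[3.0, 4.0], [1.0, 1.5], [0.2, 0.1]],
    remaining_budget=2.0,
)
print(probs)
```

In this example the first action has the highest policy score but a predicted cost of 4.0 against a budget of 2.0, so the hard filter assigns it zero probability. No parameters are updated; the shield only reshapes the frozen policy's action distribution at deployment.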

2605.25263 2026-05-26 cs.CL cs.AI 版本更新

Mimir: Large-scale Multilingual Concept Modeling

Mimir：大规模多语言概念建模

Elio Musacchio, Lucia Siciliani, Pierpaolo Basile

发表机构 * Department of Computer Science, University of Bari Aldo Moro（巴里阿尔多·莫罗大学计算机科学系）

AI总结提出Mimir，一个1.6B参数的大规模概念模型，通过多语言预训练和指令微调实现概念级别的理解与生成，替代传统的token预测范式。

AI中文摘要

当前的语言建模方法围绕token构建。文本语料被分割成token，模型通过对这些token进行计算来训练，例如根据前文预测下一个token。这一范式已成为现代语言建模的标准，尤其是基于token的架构取得了卓越性能。然而，最近的研究不仅开始质疑语言模型如何从token中处理和理解意义，还开始质疑使用更高级别的粒度是否能推动研究领域的发展。这引出了概念建模的想法，即直接训练模型进行下一个概念预测，而非下一个token预测。目标是输入从token转变为概念，迫使底层语言模型将其粒度从细粒度的token转变为广泛的概念。在这项工作中，我们介绍了Mimir，一个1.6B参数的大规模概念模型，用于多语言概念理解和生成。我们利用了一个大规模多语言预训练语料库（38,883,987,240个句子），涵盖46种语言，以及一个大规模多轮多语言指令微调数据集（66,816,428个句子），覆盖总共35种语言。我们针对一个参数数量相当的语言模型，对模型性能进行了广泛评估。

英文摘要

Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing computations on these tokens, such as predicting the next token given the preceding ones as context. This paradigm has become the standard in modern language modeling, especially given the outstanding performance obtained by token-based architectures. However, recent works have not only begun to question how language models process and understand meaning from tokens, but also to question whether using higher levels of granularity could advance the research field. This led to the idea of Concept Modeling, that is, to directly train models for next-concept prediction rather than next-token prediction. The goal is to change the input from tokens to concepts, forcing the underlying language model to shift its granularity from fine-grained tokens to broad concepts. In this work, we introduce Mimir, a 1.6B Large Concept Model trained for multilingual concept understanding and generation. We leverage a large-scale multilingual pre-training corpus (38,883,987,240 sentences) spanning 46 languages and a large-scale multi-turn and multilingual instruction-tuning dataset (66,816,428 sentences) covering a total of 35 languages. We extensively evaluate model performance against a language model with a comparable number of parameters.

2605.25258 2026-05-26 cs.IR cs.AI cs.CY cs.LG 版本更新

First, do no harm: Breaking suicidogenic echo chambers in media recommendation

首先，不伤害：打破媒体推荐中的自杀性回音室

Alberto Díaz-Álvarez, Raúl Lara-Cabrera, Fernando Ortega-Requena, Víctor Ramos-Osuna

发表机构 * E.T.S.I. Sistemas Informáticos (Universidad Politécnica de Madrid)（马德里理工大学信息系统工程系）

AI总结针对推荐系统在心理健康场景中可能加剧用户自杀倾向的问题，提出RankAid重排序方法，通过惩罚有害内容并提升治疗性内容，在保持推荐准确性的同时确保临床安全。

Comments 10 pages, 5 figures. Research on safety-aware recommender systems and algorithmic ethics

AI中文摘要

推荐系统通常优化用户参与度，但在心理健康背景下这种方法存在危险。当脆弱用户表现出自杀意念迹象时，标准算法往往将他们困在有害内容的回音室中，恶化其心理状态。为此，我们引入RankAid，一种重排序方法，在预测相关性的同时优先考虑临床安全性。它作为现有模型的附加层运行：根据用户当前的脆弱程度惩罚风险项目并提升治疗性内容。我们使用MovieLens 1M数据集评估了该方法，其中项目通过大语言模型进行了临床风险和治疗价值的语义注释。我们的模拟表明，该算法在危机高峰期成功阻止了有害内容的推荐，主动重塑信息流以支持情绪降级。此外，这种安全干预仅导致标准准确性指标（如NDCG）可控且可接受的下降。通过使用非对称超参数，RankAid还使系统管理员能够根据特定的临床指南调整干预的严重程度。

英文摘要

Recommender systems generally optimise user engagement, but this approach is dangerous in mental health contexts. When vulnerable users show signs of suicidal ideation, standard algorithms often trap them in echo chambers of harmful content, worsening their psychological state. In response, we introduce RankAid, a re-ranking method that prioritises clinical safety alongside predictive relevance. It works as an add-on layer to existing models: it penalises risky items and boosts therapeutic content depending on the user's current level of vulnerability. We evaluated this approach using the MovieLens 1M dataset, where items were semantically annotated for clinical risk and therapeutic value using large language models. Our simulations show that our algorithm successfully blocks the recommendation of harmful content during crisis peaks, actively reshaping the feed to support emotional de-escalation. Furthermore, this safety intervention only causes a controlled, acceptable drop in standard accuracy metrics like NDCG. By using asymmetric hyperparameters, RankAid also gives system administrators the flexibility to tune the severity of the intervention based on specific clinical guidelines.
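A vulnerability-dependent re-ranking layer of this kind could look roughly like the sketch below. The abstract does not give the scoring formula, so the linear penalty and boost, the two weights (`risk_weight`, `therapy_weight`, chosen unequal to mimic "asymmetric hyperparameters"), and the toy catalogue are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    relevance: float    # score from the underlying recommender
    risk: float         # annotated clinical risk in [0, 1]
    therapeutic: float  # annotated therapeutic value in [0, 1]

def rerank(items, vulnerability, risk_weight=2.0, therapy_weight=0.5):
    """Re-rank by relevance adjusted for safety; vulnerability lies in [0, 1]."""
    def adjusted(item):
        return (item.relevance
                - risk_weight * vulnerability * item.risk
                + therapy_weight * vulnerability * item.therapeutic)
    return sorted(items, key=adjusted, reverse=True)

catalog = [
    Item("A", relevance=0.9, risk=0.8, therapeutic=0.0),
    Item("B", relevance=0.7, risk=0.1, therapeutic=0.6),
    Item("C", relevance=0.6, risk=0.0, therapeutic=0.2),
]
print([i.title for i in rerank(catalog, vulnerability=0.0)])  # ['A', 'B', 'C']
print([i.title for i in rerank(catalog, vulnerability=1.0)])  # ['B', 'C', 'A']
```

With zero vulnerability the base ranking is untouched, which is what keeps the accuracy cost small for users who are not at risk. As vulnerability rises, the high-risk item drops to the bottom and the therapeutic item moves to the top.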

2605.25254 2026-05-26 cs.CV cs.AI 版本更新

PilotWiMAE：面向无线信道的导频原生表示学习

Berkay Guler, Giovanni Geraci, Hamid Jafarkhani

发表机构 * Center for Pervasive Communications and Computing, University of California, Irvine（加州大学尔湾分校普及通信与计算中心）； Nokia and Universitat Pompeu Fabra（诺基亚与庞培法布拉大学）

AI总结提出PilotWiMAE自监督框架，直接处理噪声导频观测，通过分解注意力机制和补丁归一化重构，在缩小观测空间的同时实现跨频段波束选择和信道表征，优于监督基线。

AI中文摘要

信道基础模型假设能够访问完全观测的信道，这一假设在部署中不成立。我们提出PilotWiMAE，一种自监督框架，其编码器直接接收噪声导频观测，注意力沿时间与联合空频处理轴分解，这是受问题物理特性启发的归纳偏置。导频输入将观测空间缩小多达两个数量级，并消除了全CSI可用性的不现实假设，同时降低延迟。分解设计通过利用可分离的信道结构生成鲁棒表示，并允许预训练掩码率达到$99\%$。我们将捕获小尺度衰落结构的补丁归一化重构与恢复大尺度衰落特征的辅助尺度损失相结合，并使用AWGN课程学习来匹配预训练和部署时的导频噪声。仅在$3.5$ GHz上预训练，在$28$ GHz上评估，涵盖分布内和分布外场景，PilotWiMAE的跨频段波束选择和信道表征在更小的观测空间上仍优于监督基线。为削弱解码器容量与表示质量之间的耦合，我们进一步提出在编码器-解码器联合预训练之后进行以解码器为中心的预训练阶段，使得PilotWiMAE在不牺牲表示质量的情况下展现出有竞争力的信道估计性能。为促进该方向的进一步研究，我们发布了PilotWiMAE预训练权重和训练流程，以及基于Sionna的射线追踪信道生成工具CSIGen和本文使用的信道数据集。

英文摘要

Channel foundation models assume access to fully observed channels, an assumption that fails in deployment. We introduce PilotWiMAE, a self-supervised framework whose encoder ingests noisy pilot observations directly and whose attention factorizes along the axis separating temporal from joint space-frequency processing, an inductive bias inspired by the physics of the problem. Pilot input shrinks the observation space by up to two orders of magnitude and also removes the unrealistic assumption of full-CSI availability while incurring lower latency. The factorized design generates robust representations by exploiting the separable channel structure and allows a pretraining mask ratio of $99\%$. We pair patch-normalized reconstruction, which captures small-scale fading structure, with an auxiliary scale loss that recovers the large-scale fading features, and use an AWGN curriculum to match pilot noise at pretraining and deployment. Pretrained solely on $3.5$ GHz and evaluated at $28$ GHz across in-distribution and out-of-distribution settings, PilotWiMAE's cross-frequency beam selection and channel characterization beat supervised baselines despite operating on a smaller observation space. To weaken the coupling between decoder capacity and representation quality, we further propose a decoder-centric pretraining stage following the encoder-decoder joint pretraining, which allows PilotWiMAE to demonstrate competitive channel estimation without sacrificing representation quality. To foster further work in this direction, we release the PilotWiMAE pretrained weights and training pipeline, together with CSIGen, our Sionna-based ray-tracing channel-generation tool, and the channel datasets used in this work.

2605.22795 2026-05-26 stat.ML cs.AI cs.LG math.ST stat.TH 版本更新

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

保守与非保守漂移模型的有限粒子收敛速率

Krishnakumar Balasubramanian

发表机构 * Department of Statistics, University of California, Davis（加州大学戴维斯分校统计系）

AI总结针对一步生成建模，提出保守漂移方法（用核密度估计梯度速度替代位移速度）并证明连续时间有限粒子收敛界，同时分析非保守方法（Laplace核）的对应速率。

AI中文摘要

我们提出并分析了一种用于一步生成建模的保守漂移方法。该方法将原始的基于位移的漂移速度替换为核密度估计（KDE）梯度速度，即核平滑数据得分与核平滑模型得分之差。该速度为梯度场，解决了通用基于位移的漂移场中发现的非保守性问题。我们证明了在$\mathbb{R}^d$上保守方法的连续时间有限粒子收敛界：联合熵恒等式给出了经验Stein漂移、KDE的平滑Fisher差异以及中心速度平方的界。主要的有限粒子校正是倒数KDE自相互作用项，我们给出了确定性和高概率的局部占据条件，在此条件下该项可控。我们保持求积常数显式并追踪其可能的带宽依赖性：在额外的$h$均匀求积正则条件下，根残差速度率为$N^{-1/(d+4)}$；而更一般的增长条件产生优化根速率$N^{-(2-\beta)/(2(d+4-\beta))}$，其中$0\le \beta<2$。我们还分析了使用Laplace核的非保守漂移方法，对应于Deng等人2026年（arXiv:2602.04770）提出的原始基于位移的速度。对于该方法，一个尖锐的伴随核将速度分解为尖锐得分不匹配的正标量预处理加上Laplace尺度不匹配残差，产生类似的有限粒子速率，但带有一个不可避免的残差项。最后，我们解释了如何通过显式漂移大小$\eta$将连续时间残差速度界转化为一步生成保证。

英文摘要

We propose and analyze a conservative drifting method for one-step generative modeling. The method replaces the original displacement-based drifting velocity by a kernel density estimator (KDE)-gradient velocity, namely the difference of the kernel-smoothed data score and the kernel-smoothed model score. This velocity is a gradient field, addressing the non-conservatism issue identified for general displacement-based drifting fields. We prove continuous-time finite-particle convergence bounds for the conservative method on $\mathbb{R}^d$: a joint-entropy identity yields bounds for the empirical Stein drift, the smoothed Fisher discrepancy of the KDE, and the squared center velocity. The main finite-particle correction is a reciprocal-KDE self-interaction term, and we give deterministic and high-probability local-occupancy conditions under which this term is controlled. We keep the quadrature constants explicit and track their possible bandwidth dependence: the root residual-velocity rate $N^{-1/(d+4)}$ holds under an additional $h$-uniform quadrature regularity condition, while a more general growth condition yields the optimized root rate $N^{-(2-\beta)/(2(d+4-\beta))}$, where $0\le \beta<2$. We also analyze the non-conservative drifting method with Laplace kernel, corresponding to the original displacement-based velocity proposed in Deng et al., 2026 (arXiv:2602.04770). For this method, a sharp companion kernel decomposes the velocity into a positive scalar preconditioning of a sharp-score mismatch plus a Laplace scale-mismatch residual, producing an analogous finite-particle rate with an unavoidable residual term. Finally, we explain how the continuous-time residual-velocity bounds translate into one-step generation guarantees through the explicit drift size $\eta$.
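The KDE-gradient velocity, "the difference of the kernel-smoothed data score and the kernel-smoothed model score", can be written down directly in one dimension. The sketch below assumes a Gaussian kernel with bandwidth `h` and an explicit Euler step of size `eta`; the abstract does not fix either choice for the conservative method, and the function names are hypothetical.

```python
import math

def kde_score(x, points, h):
    """Gradient of the log of a Gaussian KDE at x (1-D), bandwidth h."""
    # Shift exponents by their maximum for numerical stability.
    exps = [-((x - y) ** 2) / (2 * h * h) for y in points]
    top = max(exps)
    w = [math.exp(e - top) for e in exps]
    total = sum(w)
    return sum(wi * (y - x) for wi, y in zip(w, points)) / (total * h * h)

def velocity(x, data, particles, h):
    """Kernel-smoothed data score minus kernel-smoothed model score."""
    return kde_score(x, data, h) - kde_score(x, particles, h)

def drift_step(particles, data, h, eta):
    """Move every particle by one explicit Euler step of size eta."""
    return [x + eta * velocity(x, data, particles, h) for x in particles]

data = [2.0, 2.0]
particles = [0.0]
print(velocity(0.0, data, particles, h=1.0))
print(drift_step(particles, data, h=1.0, eta=0.1))
```

Because each term is the gradient of a log-density estimate, the velocity equals $\nabla \log(\hat p_{\text{data}}/\hat p_{\text{model}})$ and is therefore a gradient field, which is the conservative property the abstract emphasizes. When the particles coincide with the data the two scores cancel and the velocity vanishes.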

2605.22769 2026-05-26 cs.CL cs.AI 版本更新

Understanding Data Temporality Impact on Large Language Models Pre-training

理解数据时间性对大型语言模型预训练的影响

Hippolyte Pilchen, Romain Fabre, Franck Signe Talla, Patrick Perez, Edouard Grave

发表机构 * Kyutai

AI总结研究预训练数据顺序对大型语言模型获取时间敏感事实知识的影响，通过构建包含7000多个时间相关问题的基准并训练60亿参数模型，发现按时间顺序训练比随机打乱训练能产生更及时和精确的知识。

AI中文摘要

大型语言模型（LLMs）通常在打乱顺序的语料库上进行训练，导致模型的知识在训练时被冻结，其时间基础仍然难以理解。在这项工作中，我们研究了预训练动态对获取时间敏感事实知识的影响，特别关注数据顺序。我们的主要贡献有两方面。首先，我们引入了一个包含7000多个时间基础问题的综合基准和一个评估协议，能够分析模型是否将事实与其对应的时间段正确关联。其次，我们在按时间顺序排列的Common Crawl快照上预训练了60亿参数的模型，并将其与标准的随机打乱预训练进行比较。我们的结果表明，按顺序训练的模型在通用语言理解和常识方面与随机打乱的基线相当，同时始终表现出更及时和精确的时间知识。按时间顺序的预训练提高了事实的新鲜度，而随机打乱的预训练在较旧的数据上表现更好，可能是由于事实重复增加。这些发现，连同我们在https://github.com/kyutai-labs/kairos 发布的代码、在https://huggingface.co/collections/kyutai/kairos 发布的检查点和数据集，为LLMs的持续学习未来研究提供了基础。

英文摘要

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, together with our released code (https://github.com/kyutai-labs/kairos) and checkpoints and datasets (https://huggingface.co/collections/kyutai/kairos), provide a foundation for future research on continual learning for LLMs.

2605.22365 2026-05-26 cs.CR cs.AI cs.LG 版本更新

TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting

TimeGuard: 面向时间序列预测中后门防御的通道式池化训练

Quang Duc Nguyen, Siyuan Liang, Yiming Li, Fushuo Huo, Dacheng Tao

发表机构 * College of Computing and Data Science, Nanyang Technological University, Singapore（新加坡南洋理工大学计算与数据科学学院）

AI总结针对时间序列预测中后门攻击防御难题，提出基于通道式池化训练的TimeGuard方法，通过时间感知池初始化与距离正则化损失选择缓解信号稀释与损失退化，显著提升鲁棒性。

Comments 44 pages, 30 figures. ICML 2026

AI中文摘要

时间序列预测（TSF）极易受到后门攻击，但由于数据纠缠和任务公式化转变带来的挑战，有效的防御方法仍未被充分探索。为填补这一空白，我们对TSF生命周期中的十三种代表性后门防御进行了系统评估，并分析了它们的失败模式。我们的结果揭示了两个根本问题：（1）数据纠缠导致通道级信号稀释，使得样本过滤和触发器合成防御无法有效定位后门；（2）任务公式化转变导致训练损失退化，使得训练阶段中毒窗口与干净窗口难以区分。基于这些发现，我们提出了一种针对TSF的训练时后门防御方法，称为TimeGuard。该方法以通道式池化训练为核心范式，并使用时间感知标准初始化高置信度池以缓解信号稀释。此外，我们引入了距离正则化损失选择，在训练过程中逐步扩展可靠池并缓解损失退化。在多个数据集、预测架构和TSF后门攻击上的大量实验表明，TimeGuard显著提升了鲁棒性，将$\mathrm{MAE}_\mathrm{P}$相对于领先基线提升了1.96倍，同时将干净性能保持在5% $\mathrm{MAE}_\mathrm{C}$以内。

英文摘要

Time Series Forecasting (TSF) is highly vulnerable to backdoor attacks, yet effective defenses remain underexplored due to challenges arising from data entanglement and shifts in task formulation. To fill this gap, we conduct a systematic evaluation of thirteen representative backdoor defenses across the TSF life cycle and analyze their failure modes. Our results reveal two fundamental issues: (1) data entanglement induces channel-level signal dilution, rendering sample-filtering and trigger-synthesis defenses ineffective at localizing backdoors; and (2) task-formulation shift leads to training-loss degeneration, causing poisoned and clean windows to become indistinguishable at training stages. Based on these findings, we propose a training-time backdoor defense for TSF, termed TimeGuard. Our method adopts channel-wise pool training as the core paradigm and initializes a high-confidence pool using time-aware criteria to mitigate signal dilution. Moreover, we introduce distance-regularized loss selection to progressively expand the reliable pool during training and ease loss degeneration. Extensive experiments across multiple datasets, forecasting architectures, and TSF backdoor attacks demonstrate that TimeGuard substantially improves robustness, boosting $\mathrm{MAE}_\mathrm{P}$ by $1.96\times$ over the leading baseline, while preserving clean performance within 5% $\mathrm{MAE}_\mathrm{C}$.

2605.21602 2026-05-26 cs.AI cs.SE 版本更新

通过全循环Transformer简单稳定循环

Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang

发表机构 * Hong Kong Baptist University（香港浸会大学）； Jilin University（吉林大学）

AI总结针对循环Transformer在迭代次数增加时出现的训练不稳定性，提出全循环Transformer，通过全循环架构和注意力注入两种无参数修改，稳定训练至12次循环，下游任务性能提升最高13.2%。

AI中文摘要

扩展模型性能通常需要增加模型大小。循环Transformer通过迭代重用相同的Transformer块提供了一种引人注目的替代方案，用额外的计算换取性能提升，而不增加参数数量或上下文长度。由于推理时可以调整循环迭代次数，它还提供了一种平衡性能和测试时计算的自然机制。然而，当循环迭代次数增加时，循环Transformer仍然存在训练不稳定性。我们的分析表明，这种不稳定性源于两个来源：梯度振荡和残差爆炸。为了解决这两个问题，我们提出了全循环Transformer，它引入了两种无参数修改：（1）全循环架构，将循环间信号分布到所有层以缓解残差爆炸；（2）注意力注入，重用现有的注意力块以抑制梯度振荡。这些修改稳定了训练动态，使得全循环Transformer能够稳定训练多达12次循环迭代，而其他基线循环模型在这种情况下会崩溃。在循环Transformer不会崩溃的较温和设置中，全循环Transformer仍然将平均下游任务性能提升了高达13.2%。总体而言，我们的实验表明，全循环Transformer提高了训练稳定性，增强了下游性能，并通过在推理时改变循环迭代次数，提供了在不同测试时计算预算下的初步适应性。

英文摘要

Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.
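The basic looping idea, reusing one block for a variable number of iterations, can be illustrated with a toy stand-in. The sketch below is not a Transformer and does not implement the paper's Fully Looped Architecture or Attention Injection; `toy_block` is a hypothetical fixed linear map, used only to show that the loop count is a free inference-time knob and that a residual stream can grow with the number of loops.

```python
def toy_block(h, weight=0.5):
    """Stand-in for a shared Transformer block: a fixed elementwise map."""
    return [weight * x for x in h]

def looped_forward(h, loops, block=toy_block):
    """Apply the same block `loops` times, each with a residual connection."""
    for _ in range(loops):
        update = block(h)
        h = [x + u for x, u in zip(h, update)]
    return h

def norm(h):
    """Euclidean norm of the hidden state."""
    return sum(x * x for x in h) ** 0.5

h0 = [1.0, 0.0]
for loops in (1, 4, 12):
    print(loops, norm(looped_forward(h0, loops)))
```

In this toy setting each loop scales the state by 1.5, so its norm grows geometrically with the loop count while the parameter count stays fixed. That growth is only a loose analogy for the "residual explosion" named in the abstract, not a reproduction of the authors' analysis.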

2605.18746 2026-05-26 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

ESI-Bench: 迈向闭环感知-动作的具身空间智能

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi

发表机构 * Stanford University（斯坦福大学）； UCLA（加州大学洛杉矶分校）； Northwestern University（西北大学）

AI总结提出ESI-BENCH基准，通过主动探索（感知、移动、操作）在OmniGibson环境中评估具身空间智能，发现主动探索显著优于被动方法，失败主因是动作盲视而非感知弱，且模型存在元认知差距。

Comments https://esi-bench.github.io/

AI中文摘要

空间智能通过感知-动作循环展开：智能体通过行动获取观察，并推理观察如何随动作变化。它们不是被动处理所见，而是主动揭示未见——遮挡结构、动态、包含关系和功能，这些无法仅通过被动感知解决。我们超越先前假设神谕观察的空间智能表述，将观察者重新定义为行动者。我们引入ESI-BENCH，一个基于OmniGibson、扎根于Spelke核心知识系统的全面具身空间智能基准，涵盖10个任务类别和29个子类别。智能体必须决定部署哪些能力——感知、移动和操作——以及如何排序以主动积累任务相关证据。我们对最先进的MLLM进行大量实验，发现主动探索显著优于被动对应物，智能体自发发现涌现的空间策略而无需明确指令，而随机多视角往往增加噪声而非信号，尽管消耗更多图像。大多数失败并非源于感知弱，而是动作盲视：糟糕的动作选择导致糟糕的观察，进而引发级联错误。虽然显式3D基础稳定了深度敏感任务的推理，但不完美的3D表示通过扭曲空间关系证明比2D基线更有害。人类研究进一步揭示，与寻求证伪视角并在矛盾下修正信念的人类不同，模型无论证据质量如何都过早且高置信度地承诺，暴露了一个既不能通过更好感知也不能通过更多具身互动单独闭合的元认知差距。

英文摘要

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

2605.18172 2026-05-26 cs.AI 版本更新

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

可视化不可见：生成式视觉定位赋能多模态大语言模型的通用脑电图理解

Jun-Yu Pan, Yansen Wang, Enze Zhang, Bao-Liang Lu, Wei-Long Zheng, Dongsheng Li

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Microsoft Research Asia（微软亚洲研究院）

AI总结提出生成式视觉定位（GVG）框架，通过脑电图到图像的生成模型作为视觉翻译器，为多模态大语言模型提供结构化视觉上下文，以增强非视觉脑电图的理解和临床状态解释。

AI中文摘要

利用预训练大语言模型和多模态大语言模型的通用表示为脑基础模型提供了一条有前景的路径。然而，视觉诱发的脑电图数据集仍然稀缺，导致现有方法主要将神经信号与抽象文本对齐，这种有损翻译可能丢弃脑活动中编码的细粒度感知信息。我们提出生成式视觉定位（GVG）框架，通过使用脑电图到图像的生成模型作为视觉翻译器，将不可见的信息可视化。GVG 不是仅将脑电图强制转换为文本，而是为非视觉脑电图生成实例特定的代理图像，提供结构化的视觉上下文，使多模态大语言模型能够利用其视觉先验进行临床状态解释。我们在两个多模态大语言模型骨干上验证了这一想法：GVG-X-Omni 和 GVG-Janus。仅图像对齐已具有竞争力：轻量级 GVG-X-Omni 在冻结的 7B 骨干上仅调整 170M 参数，即可匹配 1.7B 参数的文本对齐基线。我们进一步扩展了 GVG-Janus，采用三模态图像+文本对齐，其中文本提供类别语义锚点，视觉代理用感知细节丰富神经表示。实验表明，在脑电图理解和视觉生成方面均取得了一致增益，表明视觉代理定位作为文本对齐的有效补充。

英文摘要

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.

2605.17937 2026-05-26 cs.CL cs.AI 版本更新

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench：面向自动化量化策略回测的大语言模型基准测试

Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, Weijia Jia

发表机构 * Beijing Normal University（北京师范大学）； Elmleaf Ltd.（Elmleaf公司）

AI总结提出首个大规模自动化量化回测基准BacktestBench，包含18,246个问答对，并设计多智能体基线AutoBacktest，通过协调摘要器、检索器和编码器实现自然语言策略到可重复回测的转换。

Comments This paper has been accepted by KDD 2026 (Datasets and Benchmarks Track)

DOI: 10.1145/3770855.3817460

AI中文摘要

量化回测对于评估交易策略至关重要，但仍受到高技术门槛和有限可扩展性的阻碍。虽然大语言模型（LLMs）通过先进的代码生成、工具使用和智能体规划为自动化这一复杂的跨学科工作流程提供了变革性路径，但实际实现因当前缺乏专门用于自动化量化回测的大规模基准而面临重大挑战，这阻碍了该领域的进展。为弥补这一关键差距，我们引入了BacktestBench，这是首个用于自动化量化回测的大规模基准。它基于超过600万条真实市场记录构建，包含18,246个精心标注的问答对，涵盖四个任务类别：指标计算、股票选择、策略选择和参数确认。我们还提出了AutoBacktest，一个稳健的多智能体基线，通过协调摘要器进行语义因子提取、检索器进行验证的SQL生成以及编码器进行Python回测实现，将自然语言策略转化为可重复的回测。我们对23个主流LLM的评估，辅以有针对性的消融实验，识别了影响端到端性能的关键因素，并强调了基于事实的验证和标准化指标表示的重要性。

英文摘要

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.

2605.17730 2026-05-26 cs.LG cs.AI 版本更新

L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting

L-Drive：超越单一映射——潜在上下文驱动时间序列预测

Fan Zhang, Shijun Chen, Hua Wang

发表机构 * Business University, Yantai, Shandong, China（山东商业大学）； Ludong University, Yantai, Shandong, China（鲁东大学）

AI总结针对分布偏移和机制变化导致直接映射范式在转折点响应滞后的问题，提出L-Drive框架，通过引入潜在上下文表征高层动态并利用门控调制增量表示，提升对变化段的适应能力，同时采用补丁共享相对位置基函数增强段内结构建模，实现预测精度与计算效率的更好平衡。

AI中文摘要

多变量时间序列预测的主流方法主要遵循直接映射范式。它们在观测空间中学习从历史到未来的统一映射，以拟合值级依赖关系。然而，现实世界系统经常经历分布偏移和机制变化。在这种情况下，统一映射在转折点附近可能出现响应滞后，导致切换窗口内误差累积，降低预测可靠性。为解决此问题，我们提出L-Drive，一种变化感知预测框架。L-Drive引入潜在上下文，显式表征随时间演变的高层动态，并使用门控调制增量表示。这提供了更及时的变化线索，并改善了对变化段的适应。此外，它结合了补丁共享相对位置基函数，以加强段内结构建模并减少由绝对位置记忆引起的过拟合。大量实验验证了L-Drive的有效性，并展示了其在预测精度和计算效率之间更好的整体权衡。

英文摘要

Mainstream methods for multivariate time-series forecasting largely follow the Direct-Mapping paradigm. They learn a unified mapping from history to the future in the observation space to fit value-level dependencies. However, real-world systems often undergo distribution shifts and regime changes. In such cases, a unified mapping can exhibit response lag around turning points, causing error accumulation within the switching window and reducing forecasting reliability. To address this issue, we propose L-Drive, a change-aware forecasting framework. L-Drive introduces a Latent-Context to explicitly characterize high-level dynamics evolving over time, and uses gating to modulate increment representations. This provides more timely change cues and improves adaptation to changing segments. In addition, it incorporates patch-shared relative positional basis functions to strengthen intra-segment structural modeling and reduce overfitting caused by absolute-position memorization. Extensive experiments validate the effectiveness of L-Drive and show a better overall trade-off between forecasting accuracy and computational efficiency.

2605.17537 2026-05-26 cs.AI 版本更新

Self-supervised Hierarchical Visual Reasoning with World Model

基于世界模型的自监督分层视觉推理

Yuanfei Xu, Lin Liu, Wengang Zhou, Mingxiao Feng, Houqiang Li

发表机构 * Department of Electronic Engineering and Information Science, University of Science and Technology of China（电子工程与信息科学系，中国科学技术大学）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Center（人工智能研究院，合肥综合性国家科学中心）

AI总结提出ResDreamer，一种分层世界模型，通过自监督方式学习残差表示，实现高效视觉推理，在3D开放环境中达到最先进的样本和参数效率。

AI中文摘要

具有对抗对手的3D开放世界环境因其巨大的状态空间仍然是强化学习的核心挑战。有效的推理表示在此类环境中至关重要。虽然现有的自监督视觉预见推理方法常常遭受多步误差累积，许多最近的研究转向注入领域特定知识以提供更稳定的指导。我们的关键洞察是，视觉推理表示的照片级真实感是次要的；真正重要的是提供信息丰富、任务相关的信号。为此，我们提出ResDreamer，一种分层世界模型，其中每个更高层被训练来重建下一层的残差。这种设计使得对日益复杂的世界动态进行渐进抽象成为可能，并促进更丰富潜在表示的出现。受“苦涩教训”启发，ResDreamer以纯自监督方式训练其推理表示。高层残差表示用于调节低层预测，使得世界模型仅以线性增加的跨层通信成本即可有效扩展。实验表明，ResDreamer实现了最先进的样本效率和参数效率。这种可扩展的分层视觉预见推理架构为开放、动态环境中更具能力的在线RL代理铺平了道路。代码可在https://github.com/XuYuanFei01/ResDreamer获取。

英文摘要

3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self-supervised visual foresight reasoning approaches often suffer from multi-step error accumulation, many recent studies resort to injecting domain-specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task-relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher-level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the "Bitter Lesson", ResDreamer trains its reasoning representations in a purely self-supervised manner. The higher-level residual representations are used to modulate lower-level predictions, allowing the world model to scale effectively with only linearly increasing cross-layer communication costs. Experiments show that ResDreamer achieves state-of-the-art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open-ended, dynamic environments. The code is accessible at https://github.com/XuYuanFei01/ResDreamer.

2605.16953 2026-05-26 cs.AI cs.CL 版本更新

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

人类如何处理AI生成的幻觉内容：一项神经影像学研究

Shuqi Zhu, Yi Zhong, Ziyi Ye, Bangde Du, Yujia Zhou, Qingyao Ai, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China（清华大学计算机科学与技术系）； Institute of Trustworthy Embodied AI, Fudan University, Shanghai, China（复旦大学可信具身人工智能研究院）

AI总结通过EEG实验，研究人类在处理多模态大语言模型生成的幻觉与非幻觉内容时的神经动力学差异，揭示误判的幻觉内容未能触发标准神经认知事实验证通路。

AI中文摘要

尽管AI生成的幻觉带来了相当大的风险，但人类能够成功识别或被这些幻觉误导的潜在认知机制仍不清楚。为了解决这个问题，本文探索了人类的神经动力学，以表征大脑如何处理幻觉内容。我们记录了27名参与者在执行验证任务时的EEG信号，该任务要求判断由多模态大语言模型（MLLM）生成的图像描述的正确性。基于平均事件相关电位（ERP）研究，我们揭示了多种认知过程，例如语义整合、推理处理、记忆检索和认知负荷，在处理幻觉与非幻觉内容时表现出不同的模式。值得注意的是，人类参与者误判与正确判断的幻觉的神经反应显示出显著差异。这表明，被误判的AI生成幻觉未能触发标准的神经认知事实验证通路。

英文摘要

While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans' neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.

2605.12906 2026-05-26 cs.LG cs.AI 版本更新

Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

数据难度与LLM微调中的泛化-外推权衡

Siyuan Liu, Tinghong Chen, Xinghan Li, Yifei Wang, Jingzhao Zhang

发表机构 * IIIS, Tsinghua University（清华大学交叉信息研究院）； College of AI, Tsinghua University（清华大学人工智能学院）； Shanghai Qi Zhi Institute（上海期智研究院）； Amazon AGI SF Lab（亚马逊AGI旧金山实验室）

AI总结本文通过实证和理论分析，研究了监督微调中数据难度对模型行为的影响，发现数据难度与数据量共同决定泛化与外推之间的权衡，并存在最优难度随数据量增加而向更难数据偏移的规律。

Comments Accepted to ICML 2026

AI中文摘要

监督微调（SFT）期间的数据选择可以显著改变大型语言模型（LLMs）的行为。尽管已有工作研究了基于困惑度、难度或长度等启发式方法选择数据的效果，但报告的结果往往不一致或依赖于上下文。在这项工作中，我们从实证和理论角度系统地研究了数据难度在微调中的作用，并发现不存在普遍最优的难度水平；相反，其有效性取决于数据集大小。我们表明，对于固定的数据预算，SFT存在一个最优的数据难度，并且随着数据预算的增加，该最优难度向更难的数据偏移。为了解释这一现象，我们进行了受控的合成实验，揭示了一个简单的底层机制：分布内泛化差距与外推差距之间的相互作用。我们通过使用PAC-Bayesian泛化界限的理论分析进一步支持了这一机制。总的来说，我们的结果阐明了数据大小和难度如何共同影响SFT中泛化与外推之间的权衡，为在特定模型和数据条件下基于难度的数据选择提供了指导。

英文摘要

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

2605.12374 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

填补GAP：多模态大语言模型中视觉推理的粒度对齐范式

Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang

发表机构 * Qwen Large Model Application Team, Alibaba（阿里巴巴Qwen大模型应用团队）； Alibaba（阿里巴巴）； University of Waterloo（滑铁卢大学）； Vector Institute（向量研究所）； Zhejiang University（浙江大学）

AI总结提出GAP（粒度对齐范式），通过特征级、上下文级和能力引导级对齐，解决多模态大语言模型中视觉潜在推理的特征空间不匹配问题，提升感知与推理性能。

AI中文摘要

视觉潜在推理让多模态大语言模型（MLLM）以连续令牌形式创建中间视觉证据，避免外部工具或图像生成器。然而，现有方法通常遵循输出即输入的潜在范式，产生不稳定的收益。我们发现了特征空间不匹配的证据，这种不匹配可能是造成上述不稳定的原因之一：主流的视觉潜在模型建立在预归一化（pre-norm）MLLM上，并将解码器隐藏状态重用为预测的潜在输入，尽管这些状态与模型训练时所接收的输入嵌入处于截然不同的范数范围（Xie et al., 2025; Li et al., 2026; Team et al., 2026）。这种不匹配可能使直接潜在反馈不可靠。受此诊断启发，我们提出GAP，一种用于视觉潜在建模的粒度对齐范式。GAP在三个层面对齐视觉潜在推理：特征级对齐通过轻量级的PCA对齐潜在头将解码器输出映射为与输入兼容的视觉潜在表示；上下文级对齐通过可检查的辅助视觉监督为潜在目标提供依据；能力引导对齐则有选择地将潜在监督分配给基础MLLM难以处理的示例。在Qwen2.5-VL 7B上，所得模型在我们的监督变体中取得了最佳的平均综合感知与推理性能。推理时干预探测进一步表明，生成的潜在表示提供了任务相关的视觉信号，而不仅仅是增加令牌槽位。

英文摘要

Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume (Xie et al., 2025; Li et al., 2026; Team et al., 2026). This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

2605.10913 2026-05-26 cs.AI cs.PL cs.SE 版本更新

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Shepherd: 一个为元代理提供形式化执行迹的运行时基座

Simon Yu, Derek Chong, Ananjan Nandi, Dilara Soylu, Jiuding Sun, Christopher D Manning, Weiyan Shi

发表机构 * Northeastern University（东北大学）； Stanford University（斯坦福大学）

AI总结提出Shepherd，一个基于函数式编程的Python运行时基座，将代理执行作为一等对象，通过类似Git的执行迹支持元代理的检查、分叉和重放，在三个用例中显著提升性能。

Comments 50 pages, 22 figures, 14 tables

AI中文摘要

随着LLM代理系统承担更复杂的任务，它们越来越依赖元代理：对其他代理进行操作的高阶代理，就像管理者监督员工一样。无论元代理做什么：协调代理、在执行前停止风险动作、或修复失败的运行，都需要在运行时操纵代理执行。现有的代理基座使得这变得困难：它们只给元代理提供纯文本记录和环境快照，要求元代理构建自己的工具来重建和编排执行状态。因此，我们引入了Shepherd，一个基于函数式编程原则的Python基座，其中代理的执行本身是一个一等对象，元代理可以检查和转换它。每个模型调用、工具调用和环境变化都成为类似Git的执行迹中的一个结构化事件，任何过去的状态都可以被分叉（比docker commit快5倍）并重放。三个示例用例展示了Shepherd的多功能性：（1）一个监督代理防止并行编码代理之间的冲突，将CooperBench的性能从28.8%提升到54.7%；（2）一个反事实优化器通过提出编辑并从行为改变点重放运行来修复代理工作流，在TerminalBench-2上比MetaHarness低58%的挂钟时间；（3）一个元代理在展开期间选择分叉点以改进长程代理强化学习中的信用分配，在TerminalBench-2上将GRPO的增益翻倍。我们开源Shepherd，以通过原则性和高效的代理执行操作赋能未来的元代理。

英文摘要

As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, much as managers supervise employees. Whatever a meta-agent does, whether coordinating agents, halting risky actions before execution, or repairing failed runs, it must manipulate agentic execution at runtime. Existing agentic substrates make this hard: they give meta-agents only plain transcripts and environment snapshots, requiring them to build their own tooling to reconstruct and orchestrate execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first-class object that a meta-agent can inspect and transform. Every model call, tool call, and environment change becomes a structured event in a Git-like execution trace, where any past state can be forked (5x faster than docker commit) and replayed. Three example use cases show Shepherd's versatility: (1) a supervisor agent prevents conflicts among parallel coding agents, lifting CooperBench performance from 28.8% to 54.7%; (2) a counterfactual optimizer repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on TerminalBench-2 with 58% lower wall-clock time; (3) a meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's gains on TerminalBench-2. We open-source Shepherd to empower future meta-agents with principled and efficient operations over agentic execution.
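
The idea of a Git-like execution trace that can be forked and replayed can be pictured with a small sketch. This is not Shepherd's API: the `Event` class and the `history`, `fork`, and `replay` helpers are hypothetical names chosen for illustration, and the sketch assumes events are immutable and linked to their parent, so that a fork shares its prefix with the original run instead of copying it.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class Event:
    """One immutable step in a run (hypothetical structure, not Shepherd's)."""
    kind: str                          # e.g. "model_call", "tool_call", "env_change"
    payload: Any
    parent: Optional["Event"] = None   # link to the previous event, like a Git commit


def history(head: Optional[Event]) -> List[Event]:
    """Walk parent links back to the root and return events oldest-first."""
    out = []
    while head is not None:
        out.append(head)
        head = head.parent
    return out[::-1]


def fork(head: Event, steps_back: int) -> Event:
    """Return the event `steps_back` steps before `head`; new events can branch from it.

    Assumes steps_back is smaller than the length of the history.
    """
    for _ in range(steps_back):
        head = head.parent
    return head


def replay(head: Event, apply: Callable[[Any, Event], Any], state: Any) -> Any:
    """Rebuild a state by folding `apply` over the events leading to `head`."""
    for ev in history(head):
        state = apply(state, ev)
    return state


# Usage: branch off one step before the end of a three-step run.
root = Event("model_call", "plan")
e2 = Event("tool_call", "ls", root)
e3 = Event("env_change", "write a.txt", e2)
branch = Event("tool_call", "cat", fork(e3, 1))
```

Because events are frozen and only point backwards, forking here is a pointer lookup and the two branches share the common prefix; a real substrate would also need to snapshot or reconstruct environment state, which this sketch leaves out.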

2605.09270 2026-05-26 cs.LG cs.AI 版本更新

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

记忆定理而非实例：通过数学推理探究SFT泛化

Ruiying Peng, Mengyu Yang, Jing Lei, Xiaohui Li, Xueyu Wu, Xinlei Chen

发表机构 * Tsinghua Shenzhen International Graduate School（清华大学深圳国际研究生院）； Huawei Technologies（华为技术）

AI总结针对监督微调（SFT）损害推理泛化的问题，提出Theorem-SFT方法，通过显式定理应用训练，在多个基准上取得显著提升，并揭示前馈层是推理规则的主要存储位置。

AI中文摘要

监督微调（SFT）广泛用于任务特定适配，但近期工作表明它会系统性地削弱推理泛化。我们认为根本原因不在于记忆本身，而在于其目标：标准SFT驱动模型利用并记忆问题-答案对中的虚假表面相关性，使其对表面输入变化脆弱。为解决此问题，我们提出Theorem-SFT，通过教授模型规则如何被调用而非答案看起来像什么，将监督重新导向显式定理应用。Theorem-SFT在多个基准和模型家族上取得一致提升：在MATH上（LLaMA3.2-3B-Instruct）提升8.8%，在GeoQA上（Qwen2.5-VL-7B-Instruct）提升20.27%，无需特定模态的重新训练。仅微调MLP层即可达到全层性能，表明前馈组件是推理规则的主要存储位置。我们的发现重新定义了争论：泛化失败并非源于记忆机制本身，而是源于记忆了错误的归纳目标。

英文摘要

Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning generalization. We argue the root cause is not memorization itself, but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning MLP layers alone matches full-layers performance, implicating feed-forward components as the primary locus of reasoning rules. Our findings reframe the debate: Generalization failures stem not from memorization as a mechanism, but from memorizing the wrong inductive targets.

2605.04906 2026-05-26 cs.AI 版本更新

审计心理健康对话中的隐性谄媚：结构化临床状态诊断与干净匹配基准

Tianze Han, Beining Xu, Hanbo Zhang, Yongming Lu

发表机构 * Shenzhen MSU-BIT University（深圳北理莫斯科大学）

AI总结针对心理健康对话模型中隐式谄媚（表面共情但强化消极认知）的问题，提出基于动态情感签名图（DESG）的结构化离线审计框架，通过临床状态转移评估响应方向，并在干净匹配基准上实现最优有害风险检测。

AI中文摘要

心理健康对话模型越来越多地由基于AI的评估器进行评估，但这些评估器通常将表面共情、支持性或流畅性视为安全的证据。在本文中，我们研究了一种隐藏的失败模式，称为隐式谄媚：一个响应可能看似共情，但暗中强化灾难化、回避、绝望预测或CBT式标签。为了检查这个问题，我们引入了一个用于隐式谄媚检测的诊断基准，该基准基于三个代表性的心理健康对话来源构建，涵盖日常同伴支持、咨询式情感支持和危机导向互动，并进一步构建了一个泄漏审计的干净单响应匹配基准，包含500个上下文和1500个匹配响应窗口。然后，我们提出了动态情感签名图（DESG），一个结构化的离线审计框架，将基于LLM的状态提取与最终评分分离，并通过语义、情感和认知扭曲状态转移而非自由形式的LLM判断来评估临床方向。与元数据、表面风格、词汇、嵌入和基于规则的LLM基线不同，DESG对响应引起的临床状态变化方向进行评分；在泄漏审计的干净匹配基准上，DESG-StateRisk比最强的非DESG基线提高了0.0488 macro-F1，并实现了最佳的有害风险检测结果。这些结果表明，评估隐式谄媚需要显式的临床状态建模以及泄漏检查、捷径控制和竞争性基线。

英文摘要

Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supportiveness, or fluency as evidence of safety. In this paper, we study a hidden failure mode that we call implicit sycophancy: a response may appear empathetic while implicitly reinforcing catastrophizing, avoidance, hopeless prediction, or CBT-style labeling. To examine this problem, we introduce a diagnostic benchmark for implicit-sycophancy detection, built from three representative mental-health dialogue sources covering everyday peer support, counseling-style emotional support, and crisis-oriented interaction, and further construct a leakage-audited clean single-response matched benchmark with 500 contexts and 1,500 matched response windows. We then propose Dynamic Emotional Signature Graphs (DESG), a structured offline audit framework that separates LLM-based state extraction from final scoring and evaluates clinical direction through semantic, affective, and cognitive-distortion state transitions rather than free-form LLM judgment. Unlike metadata, surface-style, lexical, embedding, and rubric-LLM baselines, DESG scores the direction of clinical-state change induced by a response; on the leakage-audited clean matched benchmark, DESG-StateRisk improves over the strongest non-DESG baseline by 0.0488 macro-F1 and achieves the best harmful-risk detection result. These results suggest that evaluating implicit sycophancy requires explicit clinical-state modeling together with leakage checks, shortcut controls, and competitive baselines.

2605.02495 2026-05-26 cs.LG cs.AI stat.ML 版本更新

Efficient Preference Poisoning Attack on Offline RLHF

针对离线RLHF的高效偏好投毒攻击

Chenye Yang, Weiyu Xu, Lifeng Lai

发表机构 * Department of Electrical and Computer Engineering, University of California, Davis, Davis, CA, USA（加州大学戴维斯分校电气与计算机工程系）； Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA, USA（爱荷华大学电气与计算机工程系）

AI总结针对离线RLHF中的偏好投毒攻击，提出基于梯度字典的二进制稀疏近似方法（BAL-A和BMP-A），实现高效标签翻转攻击。

Comments Accepted to ICML 2026

AI中文摘要

离线人类反馈强化学习（RLHF）流程（如直接偏好优化DPO）在预收集的偏好数据集上训练，使其容易受到偏好投毒攻击。我们研究了对数线性DPO的标签翻转攻击。首先说明翻转一个偏好标签会在DPO梯度中引起与参数无关的偏移。利用这一关键性质，我们可以将目标投毒问题转化为结构化的二进制稀疏近似问题。为解决该问题，我们开发了两种攻击方法：二进制感知格点攻击（BAL-A）和二进制匹配追踪攻击（BMP-A）。BAL-A将二进制翻转选择问题嵌入二进制感知格点，并应用Lenstra-Lenstra-Lovász约简和Babai最近平面算法；我们提供了强制二进制系数并恢复最小翻转目标的充分条件。BMP-A将二进制匹配追踪适应于我们的非归一化梯度字典，并给出基于相干性的恢复保证和$K$翻转预算的鲁棒性（不可能性）证书。在合成字典和斯坦福人类偏好数据集上的实验验证了理论，并突出了字典几何如何决定攻击成功。

英文摘要

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label-flip attacks against log-linear DPO. We first show that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for $K$-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.
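
The claim that a single label flip shifts the DPO gradient by a parameter-independent amount can be checked numerically. The sketch below is not the authors' code. It assumes the standard per-pair DPO loss, -log σ(z), and a log-linear policy, so that z = β(θ·Δφ − m), where Δφ is the feature difference between the preferred and rejected responses and m is the reference-policy margin. The names `dpo_grad`, `flip_shift`, and `ref_margin` are illustrative. Under these assumptions the two gradients differ by exactly βΔφ, whatever θ is.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_grad(theta, dphi, beta, ref_margin, flipped=False):
    """Gradient w.r.t. theta of the per-pair loss -log(sigmoid(z)).

    dphi = phi(x, y_w) - phi(x, y_l); ref_margin is the reference log-ratio margin.
    Flipping the label negates the margin inside the sigmoid.
    """
    sign = -1.0 if flipped else 1.0
    margin = sum(t * d for t, d in zip(theta, dphi)) - ref_margin
    z = sign * beta * margin
    coeff = -sign * beta * (1.0 - sigmoid(z))
    return [coeff * d for d in dphi]


def flip_shift(theta, dphi, beta, ref_margin):
    """Gradient after the flip minus gradient before the flip."""
    g_before = dpo_grad(theta, dphi, beta, ref_margin, flipped=False)
    g_after = dpo_grad(theta, dphi, beta, ref_margin, flipped=True)
    return [a - b for a, b in zip(g_after, g_before)]


# Usage: the shift is the same at two very different parameter vectors.
dphi = [1.0, -2.0, 0.5]
shift_a = flip_shift([0.0, 0.0, 0.0], dphi, beta=0.1, ref_margin=0.3)
shift_b = flip_shift([3.0, -1.0, 2.0], dphi, beta=0.1, ref_margin=0.3)
```

Since each flip contributes a fixed vector, choosing which labels to flip amounts to picking a 0/1 combination of fixed vectors, which is consistent with the binary sparse approximation framing in the abstract.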

2604.23853 2026-05-26 cs.AI 版本更新

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

ClawTrace: 面向LLM智能体技能蒸馏的成本感知追踪

Boqin Yuan, Yue Su, Renchu Song, Sen Yang, Jing Qin

发表机构 * University of California San Diego（加州大学圣地亚哥分校）； Carnegie Mellon University（卡内基梅隆大学）

AI总结针对技能蒸馏管道缺乏每步成本信号的问题，提出ClawTrace记录成本归因轨迹并生成TraceCard，通过CostCraft生成保留、剪枝和修复三类技能补丁，发现剪枝补丁作为质量护栏而保留补丁导致回归，主张按规则类型评估可复用技能。

Comments Accepted at Agent Skills '26 Workshop, ACM Conference on AI and Agentic Systems (CAIS 2026), San José, CA, May 26, 2026

AI中文摘要

技能蒸馏管道从LLM智能体轨迹中学习可重用规则，但它们缺乏一个关键信号：每一步的成本。没有每步成本，管道无法区分添加缺失步骤以修复错误与移除从未影响结果的昂贵步骤。我们利用成本归因差距来探究蒸馏技能内部的规则类型是否以相同方式迁移到新任务。ClawTrace记录成本归因的智能体轨迹，并将每个会话编译成TraceCard；CostCraft读取TraceCard并编写三种技能补丁：保留、剪枝和修复。我们发现了一个聚合指标隐藏的模式。在30个保留的SpreadsheetBench任务上（两个种子），移除剪枝补丁大致使质量回归计数增加了三倍，而未降低中位成本。在整个84任务的SkillsBench迁移中，CostCraft未节省总成本。所有三个质量回归都追溯到保留通道，而两个质量提升都追溯到剪枝通道：剪枝补丁充当质量护栏，而保留补丁驱动回归。我们认为可重用的智能体技能应在规则类型层面进行评估，而不是作为整体指令包。为支持这一点，我们发布了ClawTrace、TraceCard模式以及全套类型化技能。

英文摘要

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We use the cost-attribution gap to ask whether the rule types inside a distilled skill transfer the same way to new tasks. ClawTrace records cost-attributed agent traces and compiles each session into a TraceCard; CostCraft reads TraceCards and writes three kinds of skill patches: preserve, prune, and repair. We find a pattern aggregate metrics hide. On 30 held-out SpreadsheetBench tasks across two seeds, removing prune patches roughly tripled the quality-regression count without lowering median cost. Across the full 84-task SkillsBench transfer, CostCraft saves no aggregate cost. All three quality regressions trace to the preserve lane, and both quality wins trace to the prune lane: prune patches act as quality guardrails while preserve patches drive regressions. We argue that reusable agent skills should be evaluated at the rule-type level, not as monolithic instruction packages. To support this, we release ClawTrace, the TraceCard schema, and the full set of typed skills.

2604.23728 2026-05-26 cs.CV cs.AI 版本更新

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

ESIA：基于能量的时空交互感知框架用于行人意图预测

Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen, Chongfeng Wei

发表机构 * James Watt School of Engineering, University of Glasgow（格拉斯哥大学詹姆斯·瓦特工程学院）

AI总结提出ESIA框架，利用条件随机场和能量函数建模时空交互，通过结构一致性约束和模拟退火算法实现行人意图预测，在标准基准上达到最先进性能并提升可解释性。

Comments 13 pages, 6 figures, 3 tables

AI中文摘要

自动驾驶的最新进展推动了行人意图预测的研究，该研究旨在通过建模时间动态、社交互动和环境背景来推断未来的过街决策和行动。然而，现有研究仍受限于过度简化的多智能体交互模式、不透明的推理逻辑以及行为预测中缺乏全局一致性，这损害了鲁棒性和可解释性。在这项工作中，我们提出了ESIA（基于能量的时空交互感知框架），一种新颖的基于条件随机场（CRF）的范式。我们将意图预测任务视为一个基于统一图表示的结构化预测问题，将行人和环境视为时空节点。为了表征它们的不同角色，我们为节点分配一元势能以捕捉个体意图，为边分配成对势能以编码社交和环境交互。这些势能被整合到一个统一的全局能量函数中，以确保行为预测的场景级一致性。为了在没有真实标签监督的情况下进一步约束推理，我们引入了结构一致性项来惩罚逻辑矛盾。该优化通过一种新颖的一元种子模拟退火（U-SSA）算法高效求解，该算法利用高置信度的一元先验快速收敛到高质量解。在标准基准上的大量实验表明，ESIA在现有方法中实现了最先进的性能，并具有更好的可解释性。

英文摘要

Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.
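
To make the energy-based formulation concrete, here is a minimal sketch of a CRF-style energy with unary and pairwise terms, minimized by simulated annealing that starts from the unary-optimal labels. It is an illustration under simplifying assumptions, not the paper's U-SSA algorithm: labels are binary (0 = not crossing, 1 = crossing), the pairwise term is a simple disagreement penalty, the structural consistency terms are omitted, and the function names and the linear cooling schedule are hypothetical.

```python
import math
import random


def energy(labels, unary, edges):
    """Sum of unary costs plus a penalty w for each edge whose endpoints disagree."""
    e = sum(unary[i][y] for i, y in enumerate(labels))
    e += sum(w for (i, j, w) in edges if labels[i] != labels[j])
    return e


def unary_seeded_annealing(unary, edges, steps=2000, t0=1.0, seed=0):
    rng = random.Random(seed)
    # Seed: each node takes the label with the lowest unary cost.
    labels = [min((0, 1), key=lambda y: u[y]) for u in unary]
    cur = energy(labels, unary, edges)
    best, best_e = labels[:], cur
    for s in range(steps):
        t = t0 * (1.0 - s / steps) + 1e-6        # linear cooling schedule
        i = rng.randrange(len(labels))
        labels[i] ^= 1                            # propose flipping one node
        new = energy(labels, unary, edges)
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new                             # accept the move
            if cur < best_e:
                best, best_e = labels[:], cur
        else:
            labels[i] ^= 1                        # reject: undo the flip
    return best, best_e


# Usage: three pedestrians in a chain; the middle one has a weak unary preference.
unary = [[0.1, 0.9], [0.6, 0.4], [0.2, 0.8]]
edges = [(0, 1, 1.0), (1, 2, 1.0)]
best, best_e = unary_seeded_annealing(unary, edges)
```

Because the best configuration seen so far is tracked, the returned energy can never exceed that of the unary seed; the pairwise term is what lets neighbours override a weak individual prediction.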

2604.23295 2026-05-26 cs.CL cs.AI 版本更新

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Human-1 by Josh Talks: 基于真实对话的印地语全双工对话建模框架

Bhaskar Singh, Shobhit Banga, Mahima Manik, Pranav Sharma

发表机构 * JoshTalks

AI总结本文通过适配Moshi架构，使用自定义印地语分词器和26,000小时真实对话数据训练，提出了首个开放、可复现的印地语全双工口语对话系统，实现了自然的打断、重叠和反馈行为。

AI中文摘要

全双工口语对话系统能够模拟自然的对话行为，如打断、重叠和反馈，然而这类系统在印度语言中仍基本未被探索。我们通过适配最先进的双工语音架构Moshi，使用自定义印地语分词器，并在从14,695名说话者收集的26,000小时真实自发对话数据（具有独立的说话者通道）上进行训练，提出了首个开放、可复现的印地语全双工口语对话系统，从而能够直接从自然交互中学习话轮转换和重叠模式。为了支持印地语文本生成，我们替换了原始英语分词器，并重新初始化了依赖于文本词汇的参数，同时保留了预训练的音频组件。我们提出了一种两阶段训练方案：先进行大规模预训练，然后在1,000小时对话数据上进行微调。通过提示对话延续范式，结合自动评估指标和人工判断，评估结果表明所得模型在印地语中表现出自然且有意义的全双工对话行为。这项工作为印地语及其他印度语言的实时双工口语对话系统迈出了第一步。

英文摘要

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

2604.11557 2026-05-26 cs.AI 版本更新

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall: 统一LLM智能体的工具使用表示、数据与评估

Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, Wei Xing, Xiaoyu Shen

发表机构 * University of Science and Technology of China（中国科学技术大学）； Ningbo Institute of Digital Twin（宁波数字孪生研究所）； Eastern Institute of Technology（宁波东方理工大学）； Department of Computing, The Hong Kong Polytechnic University（香港理工大学计算机系）

AI总结提出UniToolCall框架，通过标准化工具集构建、数据集生成和评估流程，结合22k+工具和390k+训练实例，引入锚点链接机制，在混合设置下使Qwen3-8B单轮严格精度达93.0%，超越GPT、Gemini和Claude。

Comments 21 pages, 10 figures, 9 tables. Code and datasets are publicly available at: https://github.com/EIT-NLP/UniToolCall

AI中文摘要

工具使用能力是LLM智能体的基本组成部分，使其能够通过结构化函数调用与外部系统交互。然而，现有研究存在不一致的交互表示，很大程度上忽略了工具使用轨迹的结构分布，并依赖于不兼容的评估基准。我们提出了UniToolCall，一个统一的工具学习框架，标准化了从工具集构建、数据集生成到评估的整个流程。该框架整理了包含22k+工具的大型工具池，并通过结合10个标准化公共数据集与结构受控的合成轨迹，构建了包含390k+实例的混合训练语料库。它显式建模了多种交互模式，包括单跳与多跳、单轮与多轮，同时捕获了串行和并行执行结构。为了支持连贯的多轮推理，我们进一步引入了锚点链接机制，强制跨轮依赖关系。此外，我们将7个公共基准转换为统一的查询-动作-观察-答案（QAOA）表示，并在函数调用、轮次和对话级别进行细粒度评估。实验表明，在我们的数据集上微调Qwen3-8B显著提升了工具使用性能。在干扰项密集的Hybrid-20设置下，单轮严格精度达到93.0%，优于包括GPT、Gemini和Claude在内的商业模型。

英文摘要

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query--Action--Observation--Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

2604.10783 2026-05-26 cs.AI cs.LG 版本更新

Learning Preference-Based Objectives from Clinical Narratives for Dynamic Sepsis Treatment

从临床叙述中学习基于偏好的目标用于动态脓毒症治疗

Daniel J. Tan, Jayne Hui Zhen Chan, Kai Wen Hwang, Arturo Yong Yao Neo, Kay Choong See, Mengling Feng

发表机构 * Institute of Data Science, National University of Singapore, Singapore（新加坡国立大学数据科学研究所）； National University Hospital, Singapore（新加坡国立大学医院）； Saw Swee Hock School of Public Health, National University of Singapore, Singapore（新加坡国立大学苏瑞福公共卫生学院）

AI总结提出CN-PR框架，利用大语言模型从出院小结中提取轨迹级偏好，通过偏好优化学习奖励函数，在离线强化学习中改善脓毒症治疗结果。

AI中文摘要

在医疗保健中为强化学习设计奖励函数仍然具有挑战性，因为临床有意义的结果稀疏、延迟且难以明确指定。尽管结构化临床数据捕获了生理状态，但它们往往无法反映患者轨迹的更广泛方面，如治疗反应、恢复动态和干预负担。相比之下，临床叙述编码了临床医生对疾病进展、治疗效果和恢复的纵向评估，提供了超越预定义结果指标的轨迹级监督的潜在来源。我们提出了临床叙述知情偏好奖励（CN-PR）框架，该框架通过将临床叙述视为轨迹级偏好的可扩展监督，直接从出院小结中学习奖励函数。使用大语言模型，我们推导出轨迹质量分数，并在患者轨迹之间构建成对偏好，通过基于偏好的优化来学习奖励。为了考虑叙述信息量的变异性，我们引入了一个任务相关性信号，根据监督与下游决策任务的相关性对其进行加权。我们在离线强化学习中评估了CN-PR在动态脓毒症治疗中的应用。学习到的奖励与轨迹质量分数表现出强烈的单调对齐，并产生了与恢复相关结果改善相关联的策略，包括增加无器官支持天数和更快的休克缓解，同时保持与基于结果的奖励基线相当的死亡率表现。这些发现在外部验证下得以保留。我们的结果表明，临床叙述为动态治疗方案中的奖励学习提供了可扩展且富有表现力的监督来源。

英文摘要

Designing reward functions for reinforcement learning (RL) in healthcare remains challenging because clinically meaningful outcomes are sparse, delayed, and difficult to explicitly specify. Although structured clinical data capture physiologic states, they often fail to reflect broader aspects of patient trajectories such as treatment response, recovery dynamics, and intervention burden. Clinical narratives, by contrast, encode longitudinal clinician assessments of disease progression, treatment effectiveness, and recovery, providing a potential source of trajectory-level supervision beyond predefined outcome metrics. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework that learns reward functions directly from discharge summaries by treating clinical narratives as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores and construct pairwise preferences between patient trajectories to learn rewards through preference-based optimization. To account for variability in narrative informativeness, we incorporate a task relevance signal that weights supervision according to its relevance to the downstream decision-making task. We evaluate CN-PR in dynamic sepsis treatment using offline RL. The learned reward demonstrated strong monotonic alignment with trajectory quality scores and produced policies associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining mortality performance comparable to outcome-based reward baselines. These findings were preserved under external validation. Our results suggest that clinical narratives provide a scalable and expressive source of supervision for reward learning in dynamic treatment regimes.
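
The step from trajectory quality scores to a preference-based reward objective can be sketched as follows. The abstract does not specify the loss, so this sketch assumes a Bradley-Terry style pairwise objective, a common choice for preference learning. The function names, the `margin` threshold, and the way relevance weights enter the loss are illustrative assumptions, not the CN-PR implementation.

```python
import math
from itertools import combinations


def build_preferences(scores, margin=0.0):
    """Return (winner, loser) index pairs for trajectories whose scores differ by more than margin."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] - scores[j] > margin:
            pairs.append((i, j))
        elif scores[j] - scores[i] > margin:
            pairs.append((j, i))
    return pairs


def preference_loss(returns, pairs, weights=None):
    """Weighted Bradley-Terry negative log-likelihood.

    returns[k] is the summed learned reward of trajectory k; weights play the
    role of a per-pair relevance signal (all ones if omitted).
    """
    weights = weights or [1.0] * len(pairs)
    total = 0.0
    for (win, lose), wt in zip(pairs, weights):
        diff = returns[win] - returns[lose]
        total += wt * math.log1p(math.exp(-diff))   # equals -log(sigmoid(diff))
    return total / max(sum(weights), 1e-12)


# Usage: three trajectories scored from narratives (scores are made-up numbers).
pairs = build_preferences([0.9, 0.2, 0.5])
aligned = preference_loss([2.0, 0.0, 1.0], pairs)
misaligned = preference_loss([0.0, 2.0, 1.0], pairs)
```

A reward model whose trajectory returns rank the trajectories in the same order as the narrative-derived scores gets a lower loss, which is the signal a gradient-based learner would follow.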

2604.08870 2026-05-26 cs.LG cs.AI 版本更新

Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations

学习分析中的时间辍学风险：跨动态与早期窗口表示的协调生存基准

Rafael da Silva, Jeff Eicher, Gregory Longo

发表机构 * Applied Data Science Program（应用数据科学项目）； Eastern University（东部大学）

AI总结本研究使用OULAD数据集，通过协调的生存分析基准（包括动态周表示和连续时间表示）评估辍学风险模型，发现时间行为特征比静态背景属性更具预测力。

Comments 34 pages, 14 figures, 18 tables. Includes appendix with reliability diagrams, sensitivity analyses, and dataset audit tables

AI中文摘要

学生辍学是学习分析中持续关注的问题，然而比较研究经常在异质协议下评估预测模型，优先考虑区分度而非时间可解释性和校准。本研究引入了一个面向生存的基准，用于使用开放大学学习分析数据集（OULAD）进行时间辍学风险建模。比较了两个协调分支：一个动态周分支，采用人-时期表示的模型；以及一个可比较的连续时间分支，其模型家族更为丰富，包括基于树的生存模型、参数模型和神经网络模型。评估协议整合了四个分析层面：预测性能、消融、可解释性和校准。结果在每个分支内分别报告，因为跨分支的单一排名在方法论上并不合理。在可比较分支中，随机生存森林在区分度和特定时间点的Brier分数上领先；在动态分支中，五个模型家族的结果十分接近，泊松分段指数模型在综合Brier分数上略微领先。无重拟合（no-refit）自助抽样的变异性表明，这些排名只应视为方向性信号，而非绝对优势。消融和可解释性分析在所有家族中收敛于一个共同发现：主导预测信号主要不是人口统计学或结构性的，而是时间和行为性的。校准在区分度较好的模型中证实了这一模式，但XGBoost AFT除外，它表现出系统性偏差。这些结果支持在学习分析中采用协调的多维基准的价值，并将辍学风险定位为一个时间行为过程，而非静态背景属性的函数。

英文摘要

Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of families -- tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.
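
The person-period representation used by the dynamic weekly arm can be illustrated with a small sketch: each student contributes one row per week at risk, and the event flag is 1 only in the week of dropout. The record layout `(student_id, last_week, dropped_out)` and the function names are assumptions made for this example and do not reflect the OULAD schema or the study's code.

```python
def to_person_period(students):
    """Expand (student_id, last_week, dropped_out) records into weekly rows.

    Each output row is (student_id, week, event), with event = 1 only in the
    final observed week of a student who dropped out.
    """
    rows = []
    for sid, last_week, dropped in students:
        for week in range(1, last_week + 1):
            event = 1 if (dropped and week == last_week) else 0
            rows.append((sid, week, event))
    return rows


def discrete_hazard(rows):
    """Empirical hazard per week: dropouts in that week divided by students at risk."""
    at_risk, events = {}, {}
    for _, week, event in rows:
        at_risk[week] = at_risk.get(week, 0) + 1
        events[week] = events.get(week, 0) + event
    return {w: events[w] / at_risk[w] for w in sorted(at_risk)}


# Usage: "a" drops out in week 3, "b" is censored after week 2, "c" completes week 3.
rows = to_person_period([("a", 3, True), ("b", 2, False), ("c", 3, False)])
hazard = discrete_hazard(rows)
```

In this layout a discrete-time survival model is an ordinary binary classifier over the weekly rows, which is what allows time-varying behavioural features to be attached to each week.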

2604.08213 2026-05-26 cs.CV cs.AI 版本更新

表征语言模型间的线性对齐

Matt Gorbett, Suman Jana

发表机构 * Independent Researcher（独立研究者）； Department of Computer Science（计算机科学系）； Columbia University（哥伦比亚大学）

AI总结研究独立训练的大语言模型间是否存在线性对齐，并探索其在文本生成、嵌入分类、分布外检测及隐私保护跨孤岛推理中的应用。

AI中文摘要

语言模型似乎越来越多地学习到相似的表示，尽管训练目标、架构和数据模态存在差异。这种独立训练模型之间新兴的兼容性为跨模型对齐下游目标带来了新的机会。此外，这种能力解锁了新的潜在应用领域，例如在安全、隐私或竞争约束禁止直接数据或模型共享的场景中。在这项工作中，我们研究了表示收敛在多大程度上实现了大语言模型之间的实用线性对齐。具体来说，我们学习独立模型最终隐藏状态之间的仿射变换，并在文本生成、嵌入分类和分布外检测中经验性地评估这些映射。我们发现，模型对之间的性能基本保持不变，并首次证明线性对齐有时能够实现跨独立训练模型的文本生成。我们进一步强调了线性对齐在隐私保护跨孤岛推理中的潜在应用。该框架在共享公共数据集上学习仿射变换，并使用同态加密来保护客户端查询。通过仅加密线性分类操作，该方法实现了亚秒级推理延迟。

英文摘要

Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, this capability unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we investigate the extent to which representational convergence enables practical linear alignment between large language models. Specifically, we learn affine transformations between the final hidden states of independent models and empirically evaluate these mappings across text generation, embedding classification, and out-of-distribution detection. We find that performance is largely preserved across model pairs, and show for the first time that linear alignment sometimes enables text generation across independently trained models. We further highlight a potential application of linear alignment for privacy-preserving cross-silo inference. The framework learns an affine transformation over a shared public dataset and uses homomorphic encryption to protect client queries. By encrypting only the linear classification operation, the method achieves sub-second inference latency.
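
Learning an affine map between the hidden states of two models can be sketched in a few lines. The abstract does not say how the transformation is fitted, so this sketch assumes ordinary least squares on paired hidden states, solved through the normal equations. `fit_affine` and `apply_affine` are hypothetical names, and the tiny two-dimensional vectors stand in for real hidden states.

```python
def fit_affine(X, Y):
    """Least-squares W (d x k) and b (k) such that Y is approximately X W + b.

    X holds n source-model vectors of size d, Y the n paired target-model
    vectors of size k. Solves (A^T A) P = A^T Y with A = [X | 1] by
    Gauss-Jordan elimination; assumes A^T A is invertible.
    """
    n, d, k = len(X), len(X[0]), len(Y[0])
    A = [list(row) + [1.0] for row in X]          # append a bias column
    m = d + 1
    AtA = [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(m)] for i in range(m)]
    AtY = [[sum(A[r][i] * Y[r][j] for r in range(n)) for j in range(k)] for i in range(m)]
    M = [AtA[i] + AtY[i] for i in range(m)]       # augmented system
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))   # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(m):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    P = [row[m:] for row in M]
    return P[:d], P[d]


def apply_affine(x, W, b):
    """Map one source-model vector into the target model's space."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


# Usage: recover a known affine map from four paired points.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[0.5, -1.0], [2.5, -1.0], [1.5, -2.0], [3.5, -2.0]]
W, b = fit_affine(X, Y)
mapped = apply_affine([2.0, 3.0], W, b)
```

For real hidden states with thousands of dimensions, a numerical library and some regularization would be the practical choice; the point here is only that the alignment is a single linear solve over paired representations.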

2603.16105 2026-05-26 cs.CL cs.AI 版本更新

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

频率至关重要：用于剪枝和量化的快速模型无关数据筛选

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

发表机构 * University of Trento（特伦托大学）

AI总结提出一种基于Zipf幂律的模型无关数据筛选策略ZipCal，通过最大化词汇多样性来选择校准数据，在剪枝和量化中实现与依赖模型困惑度的最先进方法相当的性能，且速度快约240倍。

Comments Added statistical analysis, mechanistic analysis and a comparison with a generative baseline. 22 pages

AI中文摘要

解码机器学习决策：面向大规模排序系统的智能体推理框架

Longfei Yun, Yihan Wu, Haoran Liu, Xiaoxuan Liu, Ziyun Xu, Yi Wang, Yang Xia, Pengfei Wang, Mingze Gao, Yunxiang Wang, Changfan Chen, Wenjie Fu, Hong Yan, Junfeng Pan

发表机构 * Meta

AI总结提出GEARS框架，通过智能体技能封装排序专家知识，将排序优化转化为自主发现过程，实现高层意图驱动的系统调控并保证生产可靠性。

Comments 12 pages, 5 figures

AI中文摘要

现代大规模排序系统在竞争目标、操作约束和不断变化的产品需求的复杂环境中运行。该领域的进展越来越受到工程上下文约束的瓶颈：将模糊的产品意图转化为合理、可执行、可验证的假设的艰巨过程，而不仅仅是建模技术本身。我们提出了GEARS（生成式智能体排序系统引擎），这是一个将排序优化重新定义为可编程实验环境中的自主发现过程的框架。GEARS不是将优化视为静态模型选择，而是利用专门智能体技能将排序专家知识封装为可复用的推理能力，使操作者能够通过高层意图（如氛围个性化）来引导系统。此外，为确保生产可靠性，该框架集成了验证钩子以强制执行统计稳健性，并过滤掉过度拟合短期信号的脆弱策略。跨不同产品表面的实验验证表明，GEARS通过协同算法信号与深度排序上下文，同时保持严格的部署稳定性，能够持续识别出接近帕累托最优的优越策略。

英文摘要

Modern large-scale ranking systems operate within a sophisticated landscape of competing objectives, operational constraints, and evolving product requirements. Progress in this domain is increasingly bottlenecked by the engineering context constraint: the arduous process of translating ambiguous product intent into reasonable, executable, verifiable hypotheses, rather than by modeling techniques alone. We present GEARS (Generative Engine for Agentic Ranking Systems), a framework that reframes ranking optimization as an autonomous discovery process within a programmable experimentation environment. Rather than treating optimization as static model selection, GEARS leverages Specialized Agent Skills to encapsulate ranking expert knowledge into reusable reasoning capabilities, enabling operators to steer systems via high-level intent (e.g., vibe personalization). Furthermore, to ensure production reliability, the framework incorporates validation hooks to enforce statistical robustness and filter out brittle policies that overfit short-term signals. Experimental validation across diverse product surfaces demonstrates that GEARS consistently identifies superior, near-Pareto-efficient policies by synergizing algorithmic signals with deep ranking context while maintaining rigorous deployment stability.

2602.17658 2026-05-26 cs.LG cs.AI cs.IT math.IT 版本更新

STAPO：通过抑制稀有虚假标记稳定大语言模型的强化学习

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

发表机构 * School of Vehicle and Mobility & College of AI, Tsinghua University； Didi Voyager Labs, DiDi Autonomous Driving

AI总结针对强化学习微调大语言模型时因稀有虚假标记导致训练不稳定和性能崩溃的问题，提出STAPO方法，通过抑制这些标记的梯度扰动，在多个数学推理基准上实现稳定训练和性能提升。

AI中文摘要

强化学习显著提升了大语言模型的推理能力，但现有的强化学习微调方法严重依赖熵正则化和重加权等启发式技术来维持稳定性。实践中，这些方法常遭遇后期性能崩溃，导致推理质量下降和训练不稳定。我们识别出这一不稳定的关键因素：一小部分标记（称为虚假标记，约占0.01%）对推理结果贡献甚微，但由于继承了完整的序列级奖励而获得不成比例放大的梯度更新。我们提出了一个统一框架，用于评估虚假风险、梯度范数和熵变化下标记级优化影响。基于对严重破坏优化的标记特征的分析，我们提出了抑制虚假标记（S2T）机制，以有效抑制其梯度扰动。将该机制融入基于组的目标中，我们提出了虚假标记感知策略优化（STAPO），促进了稳定有效的大规模模型优化。在使用Qwen 1.7B、8B和14B基础模型的六个数学推理基准上，STAPO一致展现出优越的熵稳定性，并在GRPO、20-Entropy和JustRL基础上平均性能提升11.49%（$\rho_{\mathrm{T}}$=1.0, top-p=1.0）和3.73%（$\rho_{\mathrm{T}}$=0.7, top-p=0.9）。

英文摘要

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.

2602.11534 2026-05-26 cs.LG cs.AI 版本更新

Krause Synchronization Transformers

Krause同步变换器

Jingkun Liu, Yisong Yue, Max Welling, Yue Song

发表机构 * Shanghai Qi Zhi Institute（上海启智研究院）； College of AI, Tsinghua University（清华大学人工智能学院）； Shanghai Jiao Tong University（上海交通大学）； California Institute of Technology（加州理工学院）； University of Amsterdam（阿姆斯特丹大学）

AI总结提出基于有界置信共识动力学的Krause注意力机制，通过局部化稀疏交互替代全局softmax归一化，缓解表示坍缩和注意力汇聚现象，实现线性复杂度并提升性能。

Comments ICML 2026, Project page: https://jingkun-liu.github.io/krause-sync-transformers/

AI中文摘要

Transformer中的自注意力依赖于全局归一化的softmax权重，导致所有token在每一层竞争影响力。当跨深度组合时，这种交互模式会诱导强同步动力学，倾向于收敛到主导模式，这种行为与表示坍缩和注意力汇聚现象相关。我们引入了Krause注意力，一种受有界置信共识动力学启发的原则性注意力机制。Krause注意力将基于相似性的全局聚合替换为基于距离的、局部化的、选择性稀疏的交互，促进结构化的局部同步而非全局混合。我们将这种行为与最近将Transformer动力学建模为相互作用粒子系统的理论联系起来，并展示有界置信交互如何自然地调节注意力集中并缓解注意力汇聚。将交互限制在局部邻域还将运行时复杂度从序列长度的二次方降低到线性。实验上，我们在多种设置中验证了Krause注意力，包括视觉（CIFAR/ImageNet上的ViT）、自回归图像生成（MNIST/CIFAR-10）、大语言模型（Llama/Qwen）以及从零开始训练的多种规模（100M/200M）的语言模型。在这些领域中，Krause注意力在提高计算效率的同时实现了持续的性能提升，突显了有界置信动力学作为注意力的一种可扩展且有效的归纳偏置。

英文摘要

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Empirically, we validate Krause Attention across diverse settings, including vision (ViT on CIFAR/ImageNet), autoregressive image generation (MNIST/CIFAR-10), large language models (Llama/Qwen), and language models trained from scratch at multiple scales (100M/200M). Across these domains, Krause Attention achieves consistent performance gains while improving computational efficiency, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
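
For readers unfamiliar with bounded-confidence consensus dynamics, the sketch below shows one step of the classical Hegselmann-Krause model that the abstract cites as inspiration: each point averages only over points within a confidence radius. This is not the paper's attention mechanism; the function name `krause_step` and the radius `eps` are illustrative assumptions, and this naive version compares all pairs, so it is quadratic rather than linear in the number of points.

```python
import math

def krause_step(points, eps):
    """One bounded-confidence update: each point moves to the mean of
    all points within distance eps of it (itself included)."""
    updated = []
    for p in points:
        # Only sufficiently close points are allowed to influence p.
        neighbors = [q for q in points if math.dist(p, q) <= eps]
        mean = tuple(
            sum(q[d] for q in neighbors) / len(neighbors)
            for d in range(len(p))
        )
        updated.append(mean)
    return updated

# Two nearby points merge toward each other; the distant one is unaffected.
tokens = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0)]
print(krause_step(tokens, eps=1.0))
```

The contrast with softmax attention is that distant points receive exactly zero weight here, so clusters synchronize locally instead of every point being pulled toward one global mode.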

2602.08499 2026-05-26 cs.LG cs.AI 版本更新

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

上下文展开赌博机：面向可验证奖励的强化学习

Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang

发表机构 * School of Computer Science and Engineering, Beihang University（北京航空航天大学计算机科学与工程学院）； School of Artificial Intelligence, Beihang University（北京航空航天大学人工智能学院）； Huawei（华为）

AI总结针对RLVR中展开使用无差别、短视导致的问题，提出上下文赌博机框架，自适应选择高价值展开，提升训练效率与性能。

AI中文摘要

可验证奖励的强化学习（RLVR）是提升大型语言模型推理能力的有效范式。然而，现有RLVR方法以无差别和短视的方式使用展开：每个提示内不同质量的响应被统一对待，且历史展开在单次使用后被丢弃。这导致监督噪声大、样本效率低以及策略更新次优。我们通过将RLVR中的展开调度形式化为上下文赌博机问题，并提出一个统一的神经调度框架来解决这些问题，该框架在整个训练过程中自适应地选择高价值展开。每个展开被视为一个臂，其奖励由连续优化步骤之间诱导的性能增益定义。由此产生的调度器支持噪声感知的组内选择和历史展开的自适应全局重用，所有这些都在一个统一的原则性框架内。我们通过推导次线性遗憾界并证明扩大展开缓冲区可改善可实现性能上限，提供了理论依据。在六个数学推理基准上的实验表明，在多种RLVR优化方法中，性能和训练效率均有一致的提升。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.

2602.08426 2026-05-26 cs.CL cs.AI cs.CV 版本更新

Prism: Spectral-Aware Block-Sparse Attention

Prism: 频谱感知的块稀疏注意力

Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）； ByteDance Inc.（字节跳动公司）； OpenMOSS Team（OpenMOSS团队）

AI总结针对长上下文LLM预填充中块稀疏注意力的块选择效率瓶颈，提出无训练频谱感知方法Prism，通过高低频分支分解和能量温度校准恢复位置信号，实现纯块级重要性估计，在保持精度同时实现高达5.1倍加速。

Comments ICML 2026

AI中文摘要

块稀疏注意力有望加速长上下文LLM的预填充，但高效识别相关块仍是瓶颈。现有方法通常采用粗粒度注意力作为块重要性估计的代理，但往往诉诸昂贵的令牌级搜索或评分，导致显著的选择开销。在本工作中，我们将通过均值池化的标准粗粒度注意力的不准确性追溯到一个理论根源：均值池化与旋转位置嵌入（RoPE）之间的交互。我们证明均值池化充当低通滤波器，在高频维度上引起破坏性干扰，有效造成局部位置信息（如斜线模式）的“盲点”。为解决此问题，我们引入Prism，一种无训练的频谱感知方法，将块选择分解为高频和低频分支。通过应用基于能量的温度校准，Prism直接从池化表示中恢复衰减的位置信号，使得仅使用块级操作即可进行块重要性估计，从而提高效率。大量评估证实，Prism在保持与全注意力精度相当的同时，实现了高达$\mathbf{5.1\times}$的加速。

英文摘要

Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
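
The low-pass claim above can be illustrated with a small numerical check. RoPE rotates each two-dimensional feature pair at position m by an angle m·θ, so mean pooling a block averages those rotations. The sketch below makes the simplifying assumption that the content vector is identical across the block, so only the rotations are averaged; `pooled_gain` and `closed_form_gain` are illustrative names, and this is not the paper's proof.

```python
import cmath
import math

def pooled_gain(block_size, theta):
    """Magnitude of the mean of the rotations e^{i*m*theta}, m = 0..block_size-1."""
    total = sum(cmath.exp(1j * m * theta) for m in range(block_size))
    return abs(total) / block_size

def closed_form_gain(block_size, theta):
    """Same quantity in closed form: |sin(B*theta/2) / (B*sin(theta/2))|.
    Valid whenever sin(theta/2) is nonzero."""
    return abs(math.sin(block_size * theta / 2) / (block_size * math.sin(theta / 2)))

# Low frequencies pass almost unchanged; high frequencies are strongly attenuated.
for theta in (1e-4, 1e-2, 1.0):
    print(theta, pooled_gain(64, theta), closed_form_gain(64, theta))
```

With a block of 64 positions, the gain stays close to 1 for slowly rotating (low-frequency) pairs and falls to a few percent for fast-rotating ones, which is the attenuation that a block-level estimator would need to compensate for.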

2602.06717 2026-05-26 cs.LG cs.AI 版本更新

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

F-GRPO: 别让你的策略学到显而易见的而忘记罕见的

Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daria Korotyshova, Daniil Gavrilov

发表机构 * T-Tech

AI总结针对强化学习中有限采样组导致罕见正确轨迹被忽略的问题，提出基于Focal loss的难度感知缩放系数F-GRPO，在不增加组大小和计算成本下提升数学推理性能。

AI中文摘要

基于可验证奖励的强化学习通常依赖组采样来估计优势并稳定策略更新。实践中，计算限制往往排除非常大的组，因此训练使用有限的rollout集合，这些集合只能强化它们暴露的正确行为。在实际组大小下，更新可能会遗漏罕见的正确轨迹，同时仍然包含混合奖励，将概率集中在更常见的采样解上。我们推导了这种提示局部尾部遗漏事件作为组大小函数的概率，展示了非单调行为，并在分类抽象中描述了未采样的正确质量如何在总正确质量增长时缩小。受此分析启发，我们提出了一种难度感知缩放系数，灵感来自Focal loss，它降低了高成功采样组的更新权重。经验上，分类模拟在分类设置中展示了相同效果，Maze提供了单解测试，LLM实验包括代表性的GRPO组大小扫描以及GRPO、DAPO和CISPO之间的固定N迁移。在Qwen2.5-7B上，N=8时，我们的方法将平均数学pass@256从64.1提高到70.3（GRPO），69.3提高到72.5（DAPO），73.2提高到76.8（CISPO）；在所有三种情况下，OOD pass@256也得到改善，且不增加组大小或计算成本。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, computational limits often rule out very large groups, so training proceeds with finite rollout sets that can reinforce only the correct behavior they expose. At practical group sizes, updates can miss rare-correct trajectories while still containing mixed rewards, concentrating probability on more common sampled solutions. We derive the probability of such prompt-local tail-miss events as a function of group size, showing non-monotonic behavior, and in the categorical abstraction characterize how unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware scaling coefficient, inspired by Focal loss, that down-weights updates on high-success sampled groups. Empirically, categorical simulation illustrates the same effect in the categorical setting, Maze provides a single-solution test, and LLM experiments include a representative GRPO group-size sweep together with fixed-$N$ transfer across GRPO, DAPO, and CISPO. On Qwen2.5-7B at $N{=}8$, our method improves average math pass@256 from 64.1 $\rightarrow$ 70.3 (GRPO), 69.3 $\rightarrow$ 72.5 (DAPO), and 73.2 $\rightarrow$ 76.8 (CISPO); OOD pass@256 also improves in all three cases, without increasing group size or computational cost.
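
The non-monotonic tail-miss behavior can be reproduced in a toy categorical model. The sketch below assumes each rollout is independently common-correct, rare-correct, or wrong, and defines a tail miss as a group with mixed rewards but no rare-correct sample. Both this three-outcome model and the exact form of `focal_weight` are assumptions made for illustration; the abstract does not give the paper's formulas.

```python
def tail_miss_probability(n, p_common, p_rare):
    """P(a group of n rollouts has mixed rewards but no rare-correct sample)
    under a three-outcome categorical model (an assumption of this sketch)."""
    p_wrong = 1.0 - p_common - p_rare
    no_rare = (1.0 - p_rare) ** n   # every rollout is common-correct or wrong
    all_wrong = p_wrong ** n        # no reward signal at all
    all_common = p_common ** n      # uniform reward, so no mixed signal
    return no_rare - all_wrong - all_common

def focal_weight(success_rate, gamma=1.0):
    """Focal-style coefficient that shrinks updates for high-success groups.
    The (1 - rate) ** gamma form is borrowed from Focal loss as a guess."""
    return (1.0 - success_rate) ** gamma

# The probability is zero at n = 1, rises at practical sizes, then decays.
for n in (1, 8, 64, 256):
    print(n, tail_miss_probability(n, p_common=0.3, p_rare=0.02))
```

In this toy setting the tail-miss probability is largest at small, practical group sizes, which is where down-weighting already-easy groups would matter most.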

2602.04120 2026-05-26 cs.LG cs.AI cs.DC cs.SE 版本更新

Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems

面向边缘AI系统的可扩展可解释性即服务（XaaS）

Samaresh Kumar Singh, Joyjit Roy

AI总结提出可解释性即服务（XaaS）分布式架构，通过解耦推理与解释生成、语义缓存、轻量验证和自适应引擎，在边缘设备上实现低延迟、高保真的可解释性，并在三个实际用例中降低38%延迟。

Comments 8 pages, 5 figures, 2 tables. This version updates metadata after publication in IEEE Xplore and publication by SoutheastCon 2026

DOI: 10.1109/SoutheastCon63549.2026.11476268
Journal ref: 2026 IEEE SoutheastCon, Huntsville, AL, USA, 2026

AI中文摘要

尽管可解释人工智能（XAI）取得了显著进展，但其在边缘和物联网系统中的集成通常是临时且低效的。当前大多数方法以“耦合”方式运行，即解释生成与模型推理同时进行。因此，这些方法在异构边缘设备上部署时会产生冗余计算、高延迟和可扩展性差的问题。本文提出可解释性即服务（XaaS），一种将可解释性视为一等系统服务（而非模型特定功能）的分布式架构。我们提出的XaaS架构的关键创新在于解耦推理与解释生成，使边缘设备能够在资源和延迟约束下请求、缓存和验证解释。为此，我们引入三项主要创新：（1）基于语义相似性的分布式解释缓存检索方法，显著减少冗余计算；（2）轻量验证协议，确保缓存和新生成解释的保真度；（3）自适应解释引擎，根据设备能力和用户需求选择解释方法。我们在三个实际边缘AI用例上评估了XaaS的性能：（i）制造质量控制；（ii）自动驾驶车辆感知；（iii）医疗诊断。实验结果表明，XaaS在三个实际部署中延迟降低38%，同时保持高解释质量。总体而言，本工作使得在大规模异构物联网系统中部署透明和可问责的AI成为可能，并弥合了XAI研究与边缘实用性之间的差距。

英文摘要

Though Explainable AI (XAI) has made significant advancements, its inclusion in edge and IoT systems is typically ad hoc and inefficient. Most current methods are "coupled" in such a way that they generate explanations simultaneously with model inferences. As a result, these approaches incur redundant computation, high latency and poor scalability when deployed across heterogeneous sets of edge devices. In this work we propose Explainability-as-a-Service (XaaS), a distributed architecture for treating explainability as a first-class system service (as opposed to a model-specific feature). The key innovation in our proposed XaaS architecture is that it decouples inference from explanation generation, allowing edge devices to request, cache and verify explanations subject to resource and latency constraints. To achieve this, we introduce three main innovations: (1) A distributed explanation cache with a semantic-similarity-based explanation retrieval method which significantly reduces redundant computation; (2) A lightweight verification protocol that ensures the fidelity of both cached and newly generated explanations; and (3) An adaptive explanation engine that chooses explanation methods based upon device capability and user requirement. We evaluated the performance of XaaS on three real-world edge AI use cases: (i) manufacturing quality control; (ii) autonomous vehicle perception; and (iii) healthcare diagnostics. Experimental results show that XaaS reduces latency by 38% while maintaining high explanation quality across three real-world deployments. Overall, this work enables the deployment of transparent and accountable AI across large-scale, heterogeneous IoT systems, and bridges the gap between XAI research and edge practicality.
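
A semantic-similarity explanation cache of the kind described in innovation (1) might look roughly like the sketch below. Everything here is hypothetical: the class `ExplanationCache`, the cosine-similarity rule, and the 0.9 threshold are assumptions for illustration, and the single-process list stands in for whatever distributed store the paper actually uses.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class ExplanationCache:
    """Hypothetical in-memory cache keyed by input embeddings."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (embedding, explanation) pairs

    def put(self, embedding, explanation):
        self.entries.append((embedding, explanation))

    def get(self, embedding):
        """Return the explanation of the most similar cached input,
        or None when nothing is similar enough (a cache miss)."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

cache = ExplanationCache()
cache.put([1.0, 0.0], "saliency map for a scratch defect")
print(cache.get([0.99, 0.05]))  # similar input: served from the cache
print(cache.get([0.0, 1.0]))    # dissimilar input: miss, so a new explanation is needed
```

On a miss, a device would fall back to generating a fresh explanation and storing it, which is the point at which the verification protocol in innovation (2) would apply.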

2602.03695 2026-05-26 cs.MA cs.AI cs.CL 版本更新

Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

Agent Primitives: 面向多智能体系统的可复用潜在构建模块

Haibo Jin, Peng Kuang, Ye Yu, Xiaopeng Yuan, Haohan Wang

发表机构 * School of Information Sciences, University of Illinois at Urbana-Champaign, IL, USA（伊利诺伊大学厄巴纳-香槟分校信息科学学院）； Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign, IL, USA（伊利诺伊大学厄巴纳-香槟分校Siebel计算与数据科学学院）

AI总结提出Agent Primitives，一组可复用的潜在构建模块，通过KV缓存内部通信和自动组合，提升多智能体系统的鲁棒性、效率和跨任务复用性。

Comments 16 pages

AI中文摘要

虽然现有的多智能体系统（MAS）可以通过多个智能体之间的协作处理复杂问题，但它们通常高度任务特定，依赖手动设计的智能体角色和交互提示，导致架构复杂性增加且跨任务复用性有限。此外，大多数MAS主要通过自然语言进行通信，使得它们在内部智能体历史中的长上下文、多阶段交互中容易受到错误累积和不稳定性的影响。在这项工作中，我们提出了Agent Primitives，一组用于基于LLM的MAS的可复用潜在构建模块。受神经网络设计的启发，其中复杂模型由可复用组件构建，我们观察到许多现有的MAS架构可以分解为少量重复出现的内部计算模式。基于这一观察，我们实例化了三个原语：Review、Voting and Selection以及Planning and Execution。所有原语通过键值（KV）缓存进行内部通信，通过减轻多阶段交互中的信息退化来提高鲁棒性和效率。为了实现自动系统构建，一个Organizer智能体根据每个查询选择并组合原语，由先前成功配置的轻量级知识库引导，形成基于原语的MAS。实验表明，基于原语的MAS相比单智能体基线平均准确率提高12.0-16.5%，与基于文本的MAS相比，令牌使用量和推理延迟减少约3-4倍，同时相对于单智能体推理仅产生1.3-1.6倍的开销，并在不同模型骨干上提供更稳定的性能。

英文摘要

While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task-specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them vulnerable to error accumulation and instability in long-context, multi-stage interactions within internal agent histories. In this work, we propose Agent Primitives, a set of reusable latent building blocks for LLM-based MAS. Inspired by neural network design, where complex models are built from reusable components, we observe that many existing MAS architectures can be decomposed into a small number of recurring internal computation patterns. Based on this observation, we instantiate three primitives: Review, Voting and Selection, and Planning and Execution. All primitives communicate internally via key-value (KV) cache, which improves both robustness and efficiency by mitigating information degradation across multi-stage interactions. To enable automatic system construction, an Organizer agent selects and composes primitives for each query, guided by a lightweight knowledge pool of previously successful configurations, forming a primitive-based MAS. Experiments show that primitives-based MAS improve average accuracy by 12.0-16.5% over single-agent baselines, reduce token usage and inference latency by approximately 3$\times$-4$\times$ compared to text-based MAS, while incurring only 1.3$\times$-1.6$\times$ overhead relative to single-agent inference and providing more stable performance across model backbones.

2602.02495 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Reward-free Alignment for Conflicting Objectives

无奖励的冲突目标对齐

Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin

发表机构 * Columbia University（哥伦比亚大学）

AI总结提出RACO框架，通过冲突规避梯度下降的裁剪变体直接利用成对偏好数据解决多目标冲突，实现帕累托最优对齐。

Comments Accepted to ICML 2026 (Oral)

AI中文摘要

直接对齐方法越来越多地用于将大型语言模型（LLMs）与人类偏好对齐。然而，许多现实世界的对齐问题涉及多个相互冲突的目标，简单的偏好聚合可能导致训练不稳定和糟糕的权衡。特别是，加权损失方法可能无法识别同时改善所有目标的更新方向，而现有的多目标方法通常依赖显式奖励模型，增加了额外复杂性并扭曲了用户指定的偏好。本文的贡献有两方面。首先，我们提出了一种用于冲突目标的无奖励对齐框架（RACO），该框架直接利用成对偏好数据，并通过一种新颖的冲突规避梯度下降的裁剪变体解决梯度冲突。我们提供了收敛到尊重用户指定目标权重的帕累托临界点的保证，并进一步证明在双目标设置中裁剪可以严格改善收敛速度。其次，我们使用一些启发式方法改进了我们的方法，并进行了实验，以证明所提框架在LLM对齐中的兼容性。在多个LLM家族（Qwen 3、Llama 3、Gemma 3）上的多目标摘要和安全对齐任务的定性和定量评估表明，与现有的多目标对齐基线相比，我们的方法始终能实现更好的帕累托权衡。

英文摘要

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

2601.21094 2026-05-26 cs.LG cs.AI cs.SY eess.SY 版本更新

Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed

安全强化学习中的分布偏移下的安全泛化：一个糖尿病测试平台

Minjae Kwon, Josephine Lamp, Lu Feng

发表机构 * Department of Computer Science, University of Virginia（弗吉尼亚大学计算机科学系）

AI总结研究安全强化学习算法在分布偏移下训练时安全保证能否迁移到部署中，使用糖尿病管理作为测试平台，发现安全泛化差距并通过测试时屏蔽有效恢复安全性。

Comments Accepted at ICML 2026. Camera-ready version

AI中文摘要

安全强化学习算法通常在固定的训练条件下进行评估。我们使用糖尿病管理作为安全关键测试平台，研究训练时的安全保证是否能在分布偏移下迁移到部署中。我们在统一的临床模拟器上对安全强化学习算法进行基准测试，并揭示了一个安全泛化差距：在训练期间满足约束的策略经常在未见过的患者身上违反安全要求。我们证明，测试时屏蔽（使用学习到的动力学模型过滤不安全动作）能有效恢复跨算法和患者群体的安全性。在八种安全强化学习算法、三种糖尿病类型和三个年龄组中，屏蔽使得PPO-Lag和CPO等强基线的血糖达标时间范围提高了13-14%，同时降低了临床风险指数和血糖变异性。我们的模拟器和基准测试为研究安全关键控制领域中分布偏移下的安全性提供了一个平台。代码可在https://github.com/safe-autonomy-lab/GlucoSim 和 https://github.com/safe-autonomy-lab/GlucoAlg 获取。

英文摘要

Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13-14% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at https://github.com/safe-autonomy-lab/GlucoSim and https://github.com/safe-autonomy-lab/GlucoAlg.
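
Test-time shielding, as summarized above, filters a policy's action through a dynamics model before it is executed. The sketch below shows that control flow only. The function `shield`, the linear `toy_dynamics`, and the 70-180 mg/dL range check are hypothetical stand-ins chosen for illustration and are not taken from the GlucoSim or GlucoAlg repositories.

```python
def shield(proposed, candidates, state, predict_next, is_safe):
    """Return the proposed action if its predicted next state is safe;
    otherwise fall back to the first safe candidate, or None if none is safe."""
    if is_safe(predict_next(state, proposed)):
        return proposed
    for action in candidates:
        if is_safe(predict_next(state, action)):
            return action
    return None

# Toy stand-ins (hypothetical): glucose in mg/dL, each insulin unit lowers it by 30.
def toy_dynamics(glucose, dose):
    return glucose - 30.0 * dose

def in_range(glucose):
    return 70.0 <= glucose <= 180.0

# A dose of 3 units from 120 mg/dL is predicted to reach 30 mg/dL, so it is replaced.
print(shield(3.0, [0.0, 1.0, 2.0], 120.0, toy_dynamics, in_range))
```

In a real system the learned dynamics model would replace `toy_dynamics`, and the fallback order over candidates would itself be a design decision.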

2601.13236 2026-05-26 eess.IV cs.AI physics.med-ph 版本更新

Pixelwise Uncertainty Quantification of Accelerated MRI Reconstruction

加速MRI重建的像素级不确定性量化

Ilias I. Giannakopoulos, Lokesh B Gautham Muthukumar, Yvonne W. Lui, Riccardo Lattanzi

发表机构 * Bernard and Irene Schwartz Center for Biomedical Imaging, Department of Radiology, NYU Grossman School of Medicine（贝纳德与伊蕾娜·施瓦茨生物医学成像中心，放射学系，纽约大学格罗斯曼医学院）； Courant Institute of Mathematical Sciences, NYU（库朗数学科学研究所，纽约大学）； Center for Advanced Imaging Innovation and Research (CAI²R), Department of Radiology, NYU Grossman School of Medicine（先进成像创新与研究中心（CAI²R），放射学系，纽约大学格罗斯曼医学院）

AI总结提出一种基于共形分位数回归的像素级不确定性量化框架，用于加速MRI重建，无需全采样参考图像即可自动识别不可靠区域。

Comments 10 pages, 8 figures, 2 tables

AI中文摘要

并行成像技术减少了磁共振成像（MRI）扫描时间，但随着加速因子的增加，图像质量会下降。在临床实践中，由于缺乏自动评估欠采样重建诊断质量的机制，通常选择保守的加速因子。本文提出了一种用于并行MRI重建的像素级不确定性量化的通用框架，无需任何真实参考图像即可自动识别不可靠区域。我们的方法将共形分位数回归与图像重建方法相结合，以估计统计上严格的像素级不确定性区间。我们在从fastMRI数据集获得的笛卡尔欠采样脑部和膝盖数据上训练并评估了模型，加速因子范围为2到10。使用端到端变分网络进行图像重建。定量实验表明，预测的不确定性图与真实重建误差之间具有高度一致性。使用我们的方法，在四倍及以上的加速水平下，相应的皮尔逊相关系数高于90%；而当使用更简单的启发式概念（残差幅度）计算不确定性时，该系数降至70%以下。定性示例进一步表明，基于分位数回归的不确定性图捕捉了不同加速因子下重建误差的大小和空间分布，不确定性升高的区域与病理和伪影对齐。所提出的框架能够在没有全采样真实参考图像的情况下评估重建质量。这代表了向自适应MRI采集协议迈出的一步，该协议可能能够动态平衡扫描时间和诊断可靠性。

英文摘要

Parallel imaging techniques reduce magnetic resonance imaging (MRI) scan time, but image quality degrades as the acceleration factor increases. In clinical practice, conservative acceleration factors are chosen because no mechanism exists to automatically assess the diagnostic quality of undersampled reconstructions. This work introduces a general framework for pixel-wise uncertainty quantification in parallel MRI reconstructions, enabling automatic identification of unreliable regions without access to any ground-truth reference image. Our method integrates conformal quantile regression with image reconstruction methods to estimate statistically rigorous pixel-wise uncertainty intervals. We trained and evaluated our model on Cartesian undersampled brain and knee data obtained from the fastMRI dataset using acceleration factors ranging from 2 to 10. An end-to-end Variational Network was used for image reconstruction. Quantitative experiments demonstrate strong agreement between predicted uncertainty maps and true reconstruction error. Using our method, the corresponding Pearson correlation coefficient was higher than 90% at acceleration factors of four and above, whereas it dropped to less than 70% when the uncertainty was computed using a simpler heuristic notion (magnitude of the residual). Qualitative examples further show that the uncertainty maps based on quantile regression capture the magnitude and spatial distribution of reconstruction errors across acceleration factors, with regions of elevated uncertainty aligning with pathologies and artifacts. The proposed framework enables evaluation of reconstruction quality without access to fully-sampled ground-truth reference images. It represents a step toward adaptive MRI acquisition protocols that may be able to dynamically balance scan time and diagnostic reliability.
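
The calibration step of conformal quantile regression can be sketched for a single pixel as follows. This is the standard split-conformal recipe, shown under the assumption that lower and upper quantile predictions and held-out true values are available for calibration; the function names are illustrative, and how the paper applies this across pixels and acceleration factors is not stated in the abstract.

```python
import math

def cqr_offset(lower, upper, truth, alpha):
    """Split-conformal offset for quantile-regression intervals.
    The conformity score is how far the true value falls outside [lower, upper]
    (negative when it lies inside)."""
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, truth))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[k - 1]

def calibrated_interval(lo, hi, offset):
    """Widen (or shrink, if offset is negative) a predicted interval."""
    return lo - offset, hi + offset

# Nine calibration pixels with predicted interval [0, 1]; two truths fall outside.
lower = [0.0] * 9
upper = [1.0] * 9
truth = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.25, 1.5]
offset = cqr_offset(lower, upper, truth, alpha=0.2)
print(offset, calibrated_interval(0.0, 1.0, offset))
```

The width of the calibrated interval can then serve as a per-pixel uncertainty value, which is one plausible way to obtain an uncertainty map without a fully sampled reference at test time.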

2601.05613 2026-05-26 cs.LG cs.AI 版本更新

PiXTime: A Model for Federated Time Series Forecasting with Heterogeneous Data across Nodes

PiXTime: 一种跨节点异构数据联邦时间序列预测模型

Yiming Zhou, Jiahao Wang, Mingyue Cheng, Hao Wang, Defu Lian, Enhong Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出基于Transformer的PiXTime框架，通过参数解耦架构（局部个性化模块+全局共享骨干）处理异构时间序列，实现联邦学习中的异构数据预测，并在多个基准上达到最优性能。

AI中文摘要

虽然对分布式时间序列进行协同预测非常理想，但由于数据共享限制，直接合并局部数据集通常不可行。联邦学习提供了一种有前景的替代方案，但传统的联邦学习算法要求同构模型架构，这与去中心化节点中常见的结构差异（如时间分辨率不对齐、变量通道不匹配）不兼容。为弥合这一差距，我们引入了PiXTime，一种新颖的基于Transformer的框架，旨在原生适应并利用结构异构的时间数据。其核心采用参数解耦架构，将模型策略性地划分为局部个性化模块和全局聚合共享骨干。具体而言，节点特定的局部模块作为维度适配器，将不同长度的原始序列投影到统一表示空间。同时，全局同步的VE表将一致的类别标识注入特征空间，使共享骨干能够跨不一致的变量分布协同学习并泛化表示。在多个基准上的全面评估表明，PiXTime在异构联邦环境中实现了最先进的性能，同时在标准同构和集中式预测设置中保持强大的优势。

英文摘要

While collaborative forecasting on distributed time series is highly desirable, directly pooling localized datasets is often impractical due to data sharing constraints. Federated learning offers a promising alternative, yet conventional federated learning algorithms require homogeneous model architectures, which are incompatible with the structural discrepancies, such as unaligned temporal resolutions and mismatched variable channels, commonly observed across decentralized nodes. To bridge this gap, we introduce PiXTime, a novel Transformer-based framework designed to natively accommodate and leverage structurally heterogeneous temporal data. At its core, PiXTime adopts a parameter-decoupling architecture, strategically partitioning the model into localized personalized modules and a globally aggregated shared backbone. Specifically, node-specific local modules act as dimensional adapters, projecting raw sequences of diverse lengths into a unified representation space. Concurrently, a globally synchronized VE Table injects consistent categorical identities into the feature space, allowing the shared backbone to collaboratively learn and generalize representations across inconsistent variable distributions. Comprehensive evaluations on multiple benchmarks demonstrate that PiXTime achieves state-of-the-art performance in heterogeneous federated environments, while maintaining robust superiority in standard homogeneous and centralized forecasting settings.

2601.05483 2026-05-26 cs.AI 版本更新

MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis

MMUEChange：面向智能多模态城市环境变化分析的通用LLM智能体框架

Zixuan Xiao, Jun Ma, Siwei Zhang

发表机构 * Department of Urban Planning and Design, The University of Hong Kong（香港大学城市规划与设计系）

AI总结提出MMUEChange多模态智能体框架，通过模块化工具包和模态控制器实现异构城市数据灵活集成与跨模态对齐，在三个城市案例中任务成功率提升46.7%并有效缓解幻觉。

DOI: 10.1016/j.asoc.2026.114576
Journal ref: Applied Soft Computing 190 (2026) 114576

AI中文摘要

理解城市环境变化对于可持续发展至关重要。然而，当前方法，特别是遥感变化检测，通常依赖于刚性的单模态分析。为克服这些限制，我们提出MMUEChange，一个多模态智能体框架，通过模块化工具包和核心模块——模态控制器实现跨模态和模态内对齐，灵活集成异构城市数据，从而支持对复杂城市变化场景的稳健分析。案例研究包括：纽约向小型社区公园的转变，反映了当地的绿地建设努力；香港各区集中水污染的扩散，指向协调的水管理；深圳露天垃圾场的显著减少，以及夜间经济活动与垃圾类型之间的对比关联，表明生活垃圾和建筑垃圾背后不同的城市压力。与性能最佳的基线相比，MMUEChange智能体在任务成功率上提升了46.7%，并有效缓解了幻觉，展示了其支持具有实际政策影响的复杂城市变化分析任务的能力。

英文摘要

Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing change detection, often rely on rigid, single-modal analysis. To overcome these limitations, we propose MMUEChange, a multi-modal agent framework that flexibly integrates heterogeneous urban data via a modular toolkit and a core module, Modality Controller for cross- and intra-modal alignment, enabling robust analysis of complex urban change scenarios. Case studies include: a shift toward small, community-focused parks in New York, reflecting local green space efforts; the spread of concentrated water pollution across districts in Hong Kong, pointing to coordinated water management; and a notable decline in open dumpsites in Shenzhen, with contrasting links between nighttime economic activity and waste types, indicating differing urban pressures behind domestic and construction waste. Compared to the best-performing baseline, the MMUEChange agent achieves a 46.7% improvement in task success rate and effectively mitigates hallucination, demonstrating its capacity to support complex urban change analysis tasks with real-world policy implications.

2601.03790 2026-05-26 cs.CL cs.AI 版本更新

NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

NeoAMT: 基于强化学习的新词感知智能机器翻译

Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka

发表机构 * The University of Tokyo（东京大学）； NTT Communication Science Laboratories, NTT, Inc.（NTT通信科学实验室，NTT公司）

AI总结提出NeoAMT框架，利用基于Wiktionary的搜索工具和强化学习训练翻译智能体，以提升包含新词的源句翻译质量。

Comments ACL 2026 Main. Fixed minor typos

AI中文摘要

新词感知机器翻译旨在将包含新词的源句翻译成目标语言。与通用机器翻译相比，该领域仍未被充分探索。本文提出一个智能体框架NeoAMT，用于新词感知机器翻译，配备基于Wiktionary的搜索工具。具体而言，我们首先构建了一个专门用于新词感知机器翻译的数据集，并建立了一个基于Wiktionary的搜索工具。该数据集涵盖16种语言和75个翻译方向，源自约1000万条英文Wiktionary转储记录。搜索工具的检索语料库也来自同一转储中约300万条清洗后的记录。然后，我们利用该数据集和工具，通过强化学习训练翻译智能体，并评估新词感知机器翻译的准确性。此外，我们提出了一个强化学习训练框架，具有新颖的奖励设计和自适应展开生成策略，利用翻译难度进一步提高使用我们搜索工具的翻译智能体的翻译质量。

英文摘要

Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation equipped with a Wiktionary-based search toolkit. Specifically, we first construct a dedicated dataset for neologism-aware machine translation and build a search toolkit grounded in Wiktionary. The dataset covers 16 languages and 75 translation directions in total, derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search toolkit is also constructed from around 3 million cleaned records of the same dump. We then leverage the dataset and toolkit to train a translation agent via reinforcement learning (RL) and to evaluate the accuracy of neologism-aware machine translation. Furthermore, we propose an RL training framework featuring a novel reward design and an adaptive rollout generation strategy that exploits translation difficulty to further improve the translation quality of translation agents using our search toolkit.

2601.03624 2026-05-26 cs.AI 版本更新

Architecting Agentic Communities using Design Patterns

使用设计模式构建智能体社区

Zoran Milosevic, Fethi Rabhi

发表机构 * School of Computer Science and Engineering, University of New South Wales, Sydney, Australia（新南威尔士大学计算机科学与工程学院，悉尼，澳大利亚）； Deontik, Australia（澳大利亚德诺提克）

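AI summary: Proposes a three-tier classification (LLM Agents, Agentic AI, Agentic Communities) of design patterns drawn from enterprise distributed systems, and validates its formal framework through a clinical trial matching case study, offering practical guidance and formal verification capabilities for engineering multi-agent ecosystems.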

Comments Supplementary material accompanying this paper is also attached; its title is "Complete Agentic AI Design Patterns Catalogue". Fixed encoding artefacts (garbled em dashes) throughout

AI中文摘要

大型语言模型（LLM）及后续智能体AI技术的快速发展需要系统化的架构指导，以构建复杂的生产级系统。本文提出了一种使用源自企业分布式系统标准、形式化方法和行业实践的设计模式来架构此类系统的方法。我们将这些模式分为三层：LLM智能体（任务特定自动化）、智能体AI（自适应目标寻求者）和智能体社区（AI智能体与人类参与者通过正式角色、协议和治理结构进行协调的组织框架）。我们重点关注智能体社区——涵盖LLM智能体、智能体AI实体和人类的协调框架——这最适用于企业和工业应用。借鉴分布式系统中成熟的协调原则，我们将这些模式置于一个形式化框架中，该框架规定了协作协议，其中AI智能体和人类在受治理的生态系统中扮演角色。这种方法既提供了实践指导，也提供了形式化验证能力，通过问责机制表达组织、法律和伦理规则，确保智能体间通信、协商和意图建模的可操作且可验证的治理。我们通过一个临床试验匹配案例研究验证了该框架。我们的目标是为从业者提供可操作的指导，同时保持动态多智能体生态系统中企业部署所必需的形式化严谨性。

英文摘要

The rapid evolution of Large Language Models (LLMs) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities (coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans), which are most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.

2601.03327 2026-05-26 cs.LG cs.AI 版本更新

Extreme-value forest fire prediction: A study of the Loss Function in an Ordinality Scheme

极端值森林火灾预测：序数方案中损失函数的研究

Nicolas Caron, Christophe Guyeux, Hassan Noura, Benjamin Aynes

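AI summary: Proposes the first ordinal classification framework for forecasting wildfire severity levels and studies how loss-function design affects the prediction of extreme events; the Weighted Kappa Loss yields an IoU gain of more than 0.1 on the most extreme classes.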

Comments Following external reviews, we identified major methodological issues in the manuscript, including insufficient justification of the ordinal clustering strategy, limited statistical validation, ambiguities in dataset splitting, and missing comparisons with standard ordinal approaches. We therefore request withdrawal in order to prepare a substantially revised version

AI中文摘要

野火在空间和严重程度上是高度不平衡的自然灾害，使得极端事件的预测特别具有挑战性。在这项工作中，我们引入了第一个序数分类框架，用于预测与法国操作决策直接对齐的野火严重等级。我们的研究调查了损失函数设计对神经模型预测罕见但关键的高严重火灾发生能力的影响。我们将标准交叉熵与几种序数感知目标进行比较，包括提出的基于截断离散指数广义帕累托分布的概率TDeGPD损失。通过对多种架构和真实操作数据的广泛基准测试，我们表明序数监督显著提高了模型相对于传统方法的性能。特别是，加权卡帕损失（WKLoss）取得了最佳整体结果，在最极端严重类别上IoU（交并比）增益超过0.1，同时保持了有竞争力的校准质量。然而，由于数据集中极端事件极低的代表性，对于最罕见事件的性能仍然有限。这些发现强调了将严重性排序、数据不平衡考虑和季节性风险整合到野火预测系统中的重要性。未来的工作将集中于将季节动态和不确定性信息纳入训练，以进一步提高极端事件预测的可靠性。

英文摘要

Wildfires are highly imbalanced natural hazards in both space and severity, making the prediction of extreme events particularly challenging. In this work, we introduce the first ordinal classification framework for forecasting wildfire severity levels directly aligned with operational decision-making in France. Our study investigates the influence of loss-function design on the ability of neural models to predict rare yet critical high-severity fire occurrences. We compare standard cross-entropy with several ordinal-aware objectives, including the proposed probabilistic TDeGPD loss derived from a truncated discrete exponentiated Generalized Pareto Distribution. Through extensive benchmarking over multiple architectures and real operational data, we show that ordinal supervision substantially improves model performance over conventional approaches. In particular, the Weighted Kappa Loss (WKLoss) achieves the best overall results, with a gain of more than 0.1 IoU (Intersection over Union) on the most extreme severity classes while maintaining competitive calibration quality. However, performance remains limited for the rarest events due to their extremely low representation in the dataset. These findings highlight the importance of integrating severity ordering, data imbalance considerations, and seasonality risk into wildfire forecasting systems. Future work will focus on incorporating seasonal dynamics and uncertainty information into training to further improve the reliability of extreme-event prediction.
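
To make the idea of ordinal-aware scoring concrete, here is a minimal sketch of the quadratic weighted kappa statistic on hard labels. It is an illustration only: the paper's WKLoss is a differentiable training loss whose exact form is not given in this abstract, and the function name, the quadratic weights, and the integer label encoding below are assumptions made for this example.

```python
from typing import Sequence


def quadratic_weighted_kappa(y_true: Sequence[int], y_pred: Sequence[int],
                             num_classes: int) -> float:
    """Agreement score that penalises errors by squared ordinal distance.

    Hypothetical helper for illustration; labels are integers in
    [0, num_classes) and num_classes must be at least 2.
    """
    n = len(y_true)
    # Observed confusion matrix: rows are true classes, columns are predictions.
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    row_totals = [sum(observed[i]) for i in range(num_classes)]
    col_totals = [sum(observed[i][j] for i in range(num_classes))
                  for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic weight: 0 on the diagonal, 1 for the largest possible error.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if true and predicted labels were independent.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0.0:
        # Only happens when every label and prediction share a single class.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

With four severity levels, predicting level 3 for a true level 0 lowers the score far more than predicting level 1, which is the ordering behaviour that plain cross-entropy ignores.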

2601.02144 2026-05-26 cs.CL cs.AI 版本更新

Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

类比路由：用于混合专家模型的kNN增强专家分配

Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

发表机构 * Institute of Science Tokyo（东京科学大学）； CyberAgent； Nara Institute of Science and Technology（奈良先端科学技术大学院大学）

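AI summary: Proposes the kNN-MoE framework, which augments MoE routing by retrieving locally optimal expert assignments from similar past cases and uses the mean similarity of the retrieved neighbors as a confidence-based mixing coefficient, improving robustness under distribution shift.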

2601.00553 2026-05-26 cs.CV cs.AI 版本更新

A Comprehensive Dataset for Human vs. AI Generated Image Detection

人类与AI生成图像检测的综合数据集

Rajarshi Roy, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Gaytri Jena, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

发表机构 * 1 Kalyani Government Engineering College, India. 2 IIIT Delhi, India. 3 BITS Pilani Hyderabad Campus, India. 4 University of South Carolina, USA. 5 IIIT Guwahati, India. 6 NIT Silchar, India. 7 San José State University, USA. 8 UCLA, USA. 9 Washington State University, USA. 10 Vishwakarma Institute of Information Technology, India. 11 Gandhi Institute for Technological Advancement, India. 12 Meta AI, USA. 13 Amazon AI, USA. 14 BITS Pilani Goa, India.

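AI summary: To address AI-generated image detection, builds the MS COCOAI dataset of 96,000 real and synthetic datapoints and proposes two tasks: classifying images as real or generated, and identifying the generating model.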

AI中文摘要

像Stable Diffusion、DALL-E和MidJourney这样的多模态生成式AI系统从根本上改变了合成图像的创建方式。这些工具推动了创新，但也促进了误导性内容、虚假信息和被操纵媒体的传播。随着生成的图像越来越难以与照片区分，检测它们已成为当务之急。为了应对这一挑战，我们发布了MS COCOAI，这是一个用于AI生成图像检测的新数据集，包含96000个真实和合成数据点，基于MS COCO数据集构建。为了生成合成图像，我们使用了五个生成器：Stable Diffusion 3、Stable Diffusion 2.1、SDXL、DALL-E 3和MidJourney v6。基于该数据集，我们提出了两个任务：（1）将图像分类为真实或生成；（2）识别哪个模型生成了给定的合成图像。该数据集可在https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset获取。

英文摘要

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

2512.23076 2026-05-26 cs.LG cs.AI cs.HC 版本更新

Multimodal Functional Maximum Correlation for Emotion Recognition

多模态功能最大相关用于情感识别

Deyang Zheng, Tianyi Zhang, Wenming Zheng, Shujian Yu

发表机构 * Key Laboratory of Child Development and Learning Science (Ministry of Education), School of Biological Sciences and Medical Engineering, Southeast University（东南大学生物科学与医学工程学院儿童发展与学习科学教育部重点实验室）； Department of Artificial Intelligence, Westlake University（西湖大学人工智能系）； Department of Artificial Intelligence, Vrije Universiteit Amsterdam（阿姆斯特丹自由大学人工智能系）

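AI summary: Proposes the Multimodal Functional Maximum Correlation (MFMC) framework, which maximizes higher-order multimodal dependence through a Dual Total Correlation objective and achieves state-of-the-art or competitive performance on emotion recognition benchmarks.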

Comments manuscript accepted by IEEE Transactions on Affective Computing. Code is available at https://github.com/DY9910/MFMC

DOI: 10.1109/TAFFC.2026.3695876

AI中文摘要

情绪状态表现为中枢和自主系统之间协调但异质的生理反应，这对情感计算中的多模态表示学习构成了基本挑战。学习这种联合动态因情感标注的稀缺性和主观性而进一步复杂化，这推动了自监督学习（SSL）的使用。然而，大多数现有的SSL方法依赖于成对对齐目标，这些目标不足以表征两个以上模态之间的依赖关系，也无法捕捉由协调的脑和自主反应产生的高阶交互。为了解决这一限制，我们提出了多模态功能最大相关（MFMC），一个原则性的SSL框架，通过双重总相关（DTC）目标最大化高阶多模态依赖。通过推导一个紧致的夹逼界并使用基于功能最大相关分析（FMCA）的迹替代进行优化，MFMC直接捕捉联合多模态交互，而不依赖于成对对比损失。在三个公开的情感计算基准上的实验表明，MFMC在受试者依赖和受试者独立评估协议下均一致地达到最先进或具有竞争力的性能，突显了其对受试者间变异性的鲁棒性。特别是，MFMC将CEAP-360VR上的受试者依赖准确率从78.9%提高到86.8%，仅使用EDA信号就将受试者独立准确率从27.5%提高到33.1%。此外，在MAHNOB-HCI最具挑战性的EEG受试者独立划分中，MFMC与最佳方法的差距在0.8个百分点以内。我们的代码可在https://github.com/DY9910/MFMC获取。

英文摘要

Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at https://github.com/DY9910/MFMC.
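
For orientation, the standard information-theoretic definitions behind the DTC objective can be written as below. This is the textbook form, not the paper's sandwich bound or its FMCA-based trace surrogate, and the symbols $X_1, \dots, X_n$ (one random variable per modality) and $X_{\setminus i}$ (all modalities except the $i$-th) are notation assumed here.

$$\mathrm{TC}(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n)$$

$$\mathrm{DTC}(X_1, \dots, X_n) = H(X_1, \dots, X_n) - \sum_{i=1}^{n} H(X_i \mid X_{\setminus i})$$

For $n = 2$ both quantities reduce to the mutual information $I(X_1; X_2)$, which is why pairwise objectives are a special case; they differ only once three or more modalities are involved.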

2512.18735 2026-05-26 cs.CV cs.AI 版本更新

$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

$M^3-Verse$: 大型多模态模型的“找不同”挑战

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang

发表机构 * Zhejiang University, China（浙江大学）； Shanghai AI Lab, China（上海人工智能实验室）； Hangzhou Normal University, China（杭州师范大学）

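AI summary: Proposes the $M^3-Verse$ benchmark, which uses multi-perspective paired videos to evaluate how well LMMs understand dynamic object changes within a consistent space, and documents the limitations of existing models.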

AI中文摘要

现代大型多模态模型（LMMs）在静态图像和单状态时空理解方面表现出非凡的能力。然而，它们在两个不同视频观测中理解共享空间上下文内物体动态变化的能力仍未被充分探索。这种在一致环境中推理变换的能力对于空间智能领域的进步尤为关键。在本文中，我们引入了 $M^3-Verse$，一个多模态、多状态、多维度的基准，以正式评估这一能力。它基于成对视频，这些视频提供了室内场景在状态变化前后的多视角观察。该基准包含总共 270 个场景和 2,932 个问题，分为 50 多个子任务，探究 4 种核心能力。我们评估了 16 个最先进的 LMMs，并观察到它们在跟踪状态转换方面的局限性。为了解决这些挑战，我们进一步提出了一个简单而有效的基线，在多状态感知中实现了显著的性能提升。因此，$M^3-Verse$ 提供了一个具有挑战性的新测试平台，以促进对动态视觉世界有更全面理解的下一代模型的发展。您可以从 https://github.com/Wal-K-aWay/M3-Verse_pipeline 获取构建流程，并从 https://www.modelscope.cn/datasets/WalKaWay/M3-Verse 获取完整的基准数据。

英文摘要

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.

2512.05865 2026-05-26 cs.LG cs.AI 版本更新

Intrinsically Interpretable Attention via Sparse Post-Training

通过稀疏后训练实现内在可解释的注意力机制

Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf

发表机构 * MPI-IS（马克斯·普朗克智能系统研究所）； University of Oxford（牛津大学）； ETH Zürich（苏黎世联邦理工学院）

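AI summary: Proposes a post-training method that applies flexible sparsity regularisation under a constrained-loss objective to sparsify transformer attention connectivity to about 0.4% of its edges without sacrificing performance, simplifying global circuits and improving interpretability.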

AI中文摘要

我们引入一种简单的后训练方法，使Transformer注意力变得稀疏而不牺牲性能。在约束损失目标下应用灵活的稀疏正则化，我们在高达7B参数的模型上证明，可以将注意力连接减少到其边缘的约0.4%，同时保留原始预训练损失。与为计算效率设计的稀疏注意力方法不同，我们的方法利用稀疏性作为结构先验：它保留了能力，同时暴露出更有组织和可解释的连接模式。我们发现这种局部稀疏性级联成全局电路简化：特定任务的电路涉及更少的组件（注意力头和MLP），连接它们的边缘减少了多达100倍。此外，使用跨层转录器，我们表明稀疏注意力显著简化了注意力归因，实现了基于特征和基于电路视角的统一视图。这些结果表明，Transformer注意力可以变得稀疏几个数量级，表明其大部分计算是冗余的，并且稀疏性可以作为更结构化和可解释模型的指导原则。

英文摘要

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.

2511.20236 2026-05-26 cs.AI cs.LG 版本更新

Actionable and diverse counterfactual explanations incorporating domain knowledge and plausibility constraints

结合领域知识和可行性约束的可操作且多样化的反事实解释

Szymon Bobek, Łukasz Bałec, Grzegorz J. Nalepa

发表机构 * Faculty of Physics, Astronomy and Applied Computer Science, Institute of Applied Computer Science, Jagiellonian Human-Centered AI Lab（物理、天文与应用计算机科学学院，应用计算机科学研究所，雅盖隆人机中心AI实验室）

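AI summary: Proposes DANCE, which models feature dependencies and domain constraints to generate actionable, diverse counterfactual explanations, and validates its effectiveness and practicality on OpenML datasets and in an industrial email marketing setting.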

AI中文摘要

反事实解释通过识别实现期望结果所需的最小变化来提高机器学习模型的可操作可解释性。然而，现有方法常常忽略特征之间的依赖关系，这可能导致不现实或不切实际的修改。这一限制降低了反事实解释在现实决策支持系统中的实用性。受网络安全中电子邮件营销应用的启发，我们提出了DANCE（多样化、可操作且知识约束的解释），一种生成反事实的方法，该方法结合了特征依赖和领域约束。DANCE使用线性或概率结构对特征之间的关系进行建模，这些结构可以从数据中学习或由专家指定。在搜索过程中强制执行这些依赖关系以提高可行性和现实性。该方法在一个统一的目标中联合优化可行性、多样性、邻近性和稀疏性。我们在OpenML的140个数据集上评估了DANCE，并证明它在多个评估标准上相比现有方法具有竞争性或更优的性能。此外，我们与一个电子邮件营销平台合作，在真实工业环境中验证了该方法，表明它能够产生符合领域且可操作的建议。

英文摘要

Counterfactual explanations improve the actionable interpretability of machine learning models by identifying minimal changes required to achieve a desired outcome. However, existing methods often neglect dependencies among features, which can lead to unrealistic or impractical modifications. This limitation reduces the usefulness of counterfactual explanations in real-world decision-support systems. Motivated by applications in cybersecurity for email marketing, we propose DANCE (Diverse, Actionable, and Knowledge-Constrained Explanations), a method for generating counterfactuals that incorporate feature dependencies and domain constraints. DANCE models relationships between features using linear and probabilistic structures that can be learned from data or specified by experts. These dependencies are enforced during the search process to improve plausibility and feasibility. The method jointly optimizes plausibility, diversity, proximity, and sparsity within a unified objective. We evaluate DANCE on 140 datasets from OpenML and demonstrate that it achieves competitive or superior performance compared to existing approaches across multiple evaluation criteria. Additionally, we validate the method in a real-world industrial setting in collaboration with an email marketing platform, showing that it produces domain-consistent and actionable recommendations.

2511.19065 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Understanding, Accelerating, and Improving MeanFlow Training

理解、加速和改进MeanFlow训练

Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer

发表机构 * Yonsei University（延世大学）； ETH Zurich（苏黎世联邦理工学院）； University of Zurich（苏黎世大学）； Max Planck ETH CLS（马克斯·普朗克ETH CLS）； Google（谷歌）

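AI summary: By analyzing the interaction between instantaneous and average velocities, proposes an effective training scheme that accelerates the formation of instantaneous velocity and then gradually shifts training emphasis, achieving faster convergence and better few-step generation.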

AI中文摘要

MeanFlow通过联合学习瞬时速度场和平均速度场，有望在少步内实现高质量生成建模。然而，其底层训练动态仍不清楚。我们分析两种速度之间的相互作用，发现：(i) 建立良好的瞬时速度是学习平均速度的前提；(ii) 当时间间隔较小时，瞬时速度的学习受益于平均速度，但随着间隔增大而退化；(iii) 任务亲和性分析表明，对于一步生成至关重要的大间隔平均速度的平滑学习，依赖于先形成准确的瞬时速度和小间隔平均速度。在这些观察的指导下，我们设计了一种有效的训练方案，加速瞬时速度的形成，然后将重点从短间隔平均速度转移到长间隔平均速度。我们改进的MeanFlow训练实现了更快的收敛和显著更好的少步生成：使用相同的DiT-XL骨干网络，我们的方法在1-NFE ImageNet 256x256上达到了令人印象深刻的FID 2.87，而传统的MeanFlow基线为3.43。或者，我们的方法以2.5倍更短的训练时间或使用更小的DiT-L骨干网络，匹配MeanFlow基线的性能。

英文摘要

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
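
As background for the two velocity fields, the relation usually given in the original MeanFlow formulation can be written as follows. This is recalled from that earlier work rather than stated in this abstract, and the symbols $z_t$ (the state at time $t$), $r \le t$ (the interval endpoints), $v$ (instantaneous velocity), and $u$ (average velocity) are notation assumed here.

$$u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau$$

$$u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)$$

The second line follows by differentiating $(t - r)\, u(z_t, r, t)$ with respect to $t$. As $r \to t$ the correction term vanishes and $u$ coincides with $v$, which is consistent with finding (i): the average-velocity target is built on top of the instantaneous velocity, and the correction grows with the gap $t - r$.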

2511.15732 2026-05-26 cs.CY cs.AI 版本更新

Just Asking Questions: Doing Our Own Research on Conspiratorial Ideation by Generative AI Chatbots

只是提问：关于生成式AI聊天机器人阴谋论思维的自主研究

Katherine M. FitzGerald, Michelle Riedlinger, Axel Bruns, Stephen Harrington, Timothy Graham, Daniel Angus

发表机构 * Digital Media Research Centre, Queensland University of Technology（昆士兰理工大学数字媒体研究中心）

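AI summary: Systematically evaluates how six mainstream AI chatbots respond to conspiracy-theory questions and finds that safety guardrails differ markedly across models and conspiracy topics and are selectively designed.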

DOI: 10.17645/mac.11337

AI中文摘要

基于人工智能框架的交互式聊天系统日益普及，并嵌入搜索引擎、网页浏览器、操作系统或通过网站和应用程序提供。研究人员致力于理解生成式AI的局限性和潜在危害，本文对此做出贡献。通过对六种AI聊天系统（ChatGPT 3.5、ChatGPT 4 Mini、Bing中的Microsoft Copilot、Google Search AI、Perplexity以及Twitter/X中的Grok）进行系统评估，本研究考察了这些领先产品如何回应与阴谋论相关的问题。这遵循了Glazunova等人（2023年）建立的平台政策实施审计方法。我们选取了五个众所周知且已被全面驳斥的阴谋论，以及四个与数据收集时的突发新闻事件相关的新兴阴谋论。我们的发现表明，生成式AI聊天机器人中针对阴谋论思维的安全护栏程度因聊天机器人模型和阴谋论的不同而存在显著差异。我们的观察表明，AI聊天机器人中的安全护栏通常设计得非常具有选择性：生成式AI公司似乎特别关注确保其产品不被视为种族主义；它们似乎还特别关注涉及重大国家创伤（如9/11事件）或与既定政治问题相关的阴谋论。未来的工作应包括持续努力，扩展到更多平台、多种语言以及远超美国范围的各类阴谋论。

英文摘要

Interactive chat systems that build on artificial intelligence frameworks are increasingly ubiquitous and embedded into search engines, Web browsers, and operating systems, or are available on websites and apps. Researchers have sought to understand the limitations and potential for harm of generative AI, an effort we contribute to here. Conducting a systematic review of six AI-powered chat systems (ChatGPT 3.5; ChatGPT 4 Mini; Microsoft Copilot in Bing; Google Search AI; Perplexity; and Grok in Twitter/X), this study examines how these leading products respond to questions related to conspiracy theories. This follows the platform policy implementation audit approach established by Glazunova et al. (2023). We select five well-known and comprehensively debunked conspiracy theories and four emerging conspiracy theories that relate to breaking news events at the time of data collection. Our findings demonstrate that the extent of safety guardrails against conspiratorial ideation in generative AI chatbots differs markedly, depending on chatbot model and conspiracy theory. Our observations indicate that safety guardrails in AI chatbots are often very selectively designed: generative AI companies appear to focus especially on ensuring that their products are not seen to be racist; they also appear to pay particular attention to conspiracy theories that address topics of substantial national trauma such as 9/11 or relate to well-established political issues. Future work should include an ongoing effort extended to further platforms, multiple languages, and a range of conspiracy theories extending well beyond the United States.

2511.12046 2026-05-26 cs.CR cs.AI cs.CV cs.LG 版本更新

BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

BackWeak: 使用弱触发器和微调简单后门知识蒸馏

Shanmin Wang, Dongdong Zhao

发表机构 * School of Computer Science and Artificial Intelligence, Wuhan University of Technology（武汉理工大学计算机科学与人工智能学院）

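AI summary: Proposes BackWeak, which implants a backdoor by fine-tuning a teacher model with a weak trigger, needs no surrogate student model or simulated distillation, and transfers reliably to diverse student architectures during standard distillation.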

AI中文摘要

知识蒸馏对于压缩大型模型至关重要，但依赖从第三方仓库下载的预训练“教师”模型引入了严重的安全风险——最显著的是后门攻击。现有的知识蒸馏后门方法通常复杂且计算密集：它们使用替代学生模型和模拟蒸馏来保证可转移性，并构建类似于通用对抗扰动（UAP）的触发器，这些触发器在幅度上不隐蔽，本质上表现出强烈的对抗行为。本文质疑这种复杂性是否必要，并构建了隐蔽的“弱”触发器——具有可忽略对抗效应的不可察觉扰动。我们提出了BackWeak，一种简单、无替代的攻击范式。BackWeak表明，通过使用非常小的学习率对良性教师模型进行微调并嵌入弱触发器，即可植入强大的后门。我们证明，这种精细的微调足以嵌入后门，在受害者的标准蒸馏过程中可靠地转移到不同的学生架构，从而实现高攻击成功率。在多个数据集、模型架构和知识蒸馏方法上的广泛实证评估表明，BackWeak比以往复杂的方法更高效、更简单，且通常更隐蔽。本文呼吁研究知识蒸馏后门攻击的学者特别关注触发器的潜在对抗特性。

英文摘要

Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-party repositories introduces serious security risks, most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and construct triggers similar to universal adversarial perturbations (UAPs), which, not being stealthy in magnitude, inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and constructs stealthy "weak" triggers: imperceptible perturbations that have negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted by simply fine-tuning a benign teacher with a weak trigger using a very small learning rate. We demonstrate that this delicate fine-tuning is sufficient to embed a backdoor that reliably transfers to diverse student architectures during a victim's standard distillation process, yielding high attack success rates. Extensive empirical evaluations on multiple datasets, model architectures, and KD methods show that BackWeak is efficient, simpler, and often more stealthy than previous elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to the trigger's potential adversarial characteristics.

2511.10502 2026-05-26 cs.CR cs.AI 版本更新

On the Detectability of Active Gradient Inversion Attacks in Federated Learning

联邦学习中主动梯度反转攻击的可检测性研究

Vincenzo Carletti, Pasquale Foggia, Carlo Mazzocca, Giuseppe Parrella, Mario Vento

发表机构 * Department of Computer Information and Electrical Engineering and Applied Mathematics（计算机信息与电气工程及应用数学系）

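AI summary: Studies the detectability of active gradient inversion attacks in federated learning and proposes lightweight client-side detection based on anomalous weight structures and loss/gradient dynamics; experiments show the attacks can be detected effectively without modifying the training protocol.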

DOI: 10.1109/SP63933.2026.00193
Journal ref: 2026 IEEE Symposium on Security and Privacy (SP), pp. 1931-1950, 2026

AI中文摘要

DeepEN: 一种用于重症监护中个性化肠内营养的深度强化学习框架

Daniel Jason Tan, Jiayang Chen, Dilruk Perera, Kay Choong See, Mengling Feng

发表机构 * Institute of Data Science（数据科学研究所）； Saw Swee Hock School of Public Health, National University of Singapore, Singapore（Saw Swee Hock公共卫生学院，新加坡国立大学，新加坡）； National University Hospital, Singapore（新加坡国立医院）

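AI summary: Proposes DeepEN, a deep reinforcement learning framework that learns personalized enteral nutrition regimens from electronic health records; on MIMIC-IV it lowers estimated mortality by 4.0 percentage points in absolute terms compared with clinician practice.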

AI中文摘要

目的：由于个性化程度有限以及在动态代谢需求下对适当热量、蛋白质和液体目标的不确定性，ICU中的肠内营养（EN）输送仍不理想。我们引入DeepEN，一个使用电子健康记录数据进行个性化EN优化的强化学习（RL）框架。方法：DeepEN在来自MIMIC-IV的超过11,000名ICU患者上训练，以生成每4小时一次、针对患者的卡路里、蛋白质和液体目标。状态表示包括人口统计学、合并症、生命体征、实验室值和近期干预措施。一个生理学对齐的奖励框架平衡了生物标志物稳定性与长期生存。策略学习采用带有保守Q学习正则化的决斗双深度Q网络，以实现安全的离线训练。结果：DeepEN实现了最高的估计策略价值（$V^π= 9.48$）和最低的校准死亡率（18.8 ± 1.0%），与临床实践（22.8%）相比绝对降低了4.0个百分点。该策略还表现出优越的代谢稳定性，实现了目标范围内葡萄糖、磷酸盐和钠值的最高比例。此外，偏离DeepEN策略与死亡率和生物标志物不稳定性独立相关，而偏离随机策略则没有这种关联。可解释性分析进一步表明，建议是基于器官功能和代谢状态的生理相关标志物，而不是静态剂量启发式。结论：DeepEN证明了保守离线RL在安全、个性化EN优化中的可行性，突出了数据驱动个性化在重症监护中补充基于指南方法的潜力。

英文摘要

Objective: Enteral nutrition (EN) delivery in the ICU remains suboptimal due to limited personalization and uncertainty regarding appropriate calorie, protein, and fluid targets under dynamic metabolic demands. We introduce DeepEN, a reinforcement learning (RL) framework for personalized EN optimization using electronic health record data. Methods: DeepEN was trained on over 11,000 ICU patients from MIMIC-IV to generate 4-hourly, patient-specific caloric, protein, and fluid targets. The state representation incorporated demographics, comorbidities, vital signs, laboratory values, and recent interventions. A physiologically aligned reward framework balanced biomarker stability with long-term survival. Policy learning employed a dueling double deep Q-network with Conservative Q-Learning regularization to enable safe offline training. Results: DeepEN achieved the highest estimated policy value ($V^{\pi} = 9.48$) and the lowest calibrated mortality (18.8 ± 1.0%), representing a 4.0 percentage-point absolute reduction compared with clinician practice (22.8%). The policy also demonstrated superior metabolic stability, achieving the highest proportion of glucose, phosphate, and sodium values within target range. Furthermore, deviation from the DeepEN policy was independently associated with increased mortality and biomarker instability, whereas deviation from a random policy showed no such association. Interpretability analyses further indicated that recommendations were conditioned on physiologically relevant markers of organ function and metabolic status rather than static dosing heuristics. Conclusion: DeepEN demonstrates the feasibility of conservative offline RL for safe, individualized EN optimization, highlighting the potential of data-driven personalization to complement guideline-based approaches in critical care.
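
The combination of a double-DQN target with Conservative Q-Learning regularization can be sketched for a single transition as below. This is a generic textbook form under stated assumptions, not DeepEN's implementation: the dueling network head is omitted, Q-values are plain lists over a discrete action set, and every function name and the `alpha` weight are hypothetical.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a list of Q-values."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def double_dqn_target(reward: float, gamma: float,
                      next_q_online: Sequence[float],
                      next_q_target: Sequence[float], done: bool) -> float:
    """Pick the next action with the online net, evaluate it with the target net."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def cql_penalty(q_values: Sequence[float], dataset_action: int) -> float:
    """Conservative term: pushes Q down on all actions, up on the logged action."""
    return logsumexp(q_values) - q_values[dataset_action]


def conservative_loss(q_values: Sequence[float], dataset_action: int,
                      td_target: float, alpha: float = 1.0) -> float:
    """Squared TD error on the logged action plus the weighted CQL penalty."""
    td_error = q_values[dataset_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, dataset_action)
```

The penalty is never negative and shrinks as the logged (clinician-chosen) action comes to dominate the Q-values, which is what keeps an offline policy from overvaluing actions that rarely appear in the data.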

2510.05699 2026-05-26 cs.CR cs.AI 版本更新

Membership Inference Attacks on Tokenizers of Large Language Models

大型语言模型分词器的成员推理攻击

Meng Tong, Yuntao Du, Kejiang Chen, Weiming Zhang, Ninghui Li

发表机构 * University of Science and Technology of China（中国科学技术大学）； Purdue University（普渡大学）

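AI summary: To address the challenges of membership inference attacks on pre-trained large language models, introduces tokenizers as a new attack vector, explores five attack methods, and designs an adaptive defense.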

Comments To appear at USENIX Security Symposium 2026

AI中文摘要

成员推理攻击（MIAs）被广泛用于评估机器学习模型的隐私风险。然而，当这些攻击应用于预训练的大型语言模型（LLMs）时，会遇到显著挑战，包括错误标记的样本、分布偏移以及实验环境与真实环境之间模型规模的差异。为了解决这些限制，我们引入分词器作为成员推理的新攻击向量。具体来说，分词器将原始文本转换为LLMs的令牌。与完整模型不同，分词器可以从头开始高效训练，从而避免上述挑战。此外，分词器的训练数据通常代表用于预训练LLMs的数据。尽管有这些优势，分词器作为攻击向量的潜力尚未被探索。为此，我们首次开展了关于通过分词器泄露成员信息的研究，并探索了五种攻击方法来推断数据集成员身份。在数百万互联网样本上的广泛实验揭示了最先进LLMs分词器中的漏洞。为了缓解这一新兴风险，我们进一步提出了一种自适应防御。我们的发现强调了分词器是一个被忽视但关键的隐私威胁，突显了专门为其设计隐私保护机制的迫切需求。

英文摘要

Membership inference attacks (MIAs) are widely used to assess the privacy risks associated with machine learning models. However, when these attacks are applied to pre-trained large language models (LLMs), they encounter significant challenges, including mislabeled samples, distribution shifts, and discrepancies in model size between experimental and real-world settings. To address these limitations, we introduce tokenizers as a new attack vector for membership inference. Specifically, a tokenizer converts raw text into tokens for LLMs. Unlike full models, tokenizers can be efficiently trained from scratch, thereby avoiding the aforementioned challenges. In addition, the tokenizer's training data is typically representative of the data used to pre-train LLMs. Despite these advantages, the potential of tokenizers as an attack vector remains unexplored. To this end, we present the first study on membership leakage through tokenizers and explore five attack methods to infer dataset membership. Extensive experiments on millions of Internet samples reveal the vulnerabilities in the tokenizers of state-of-the-art LLMs. To mitigate this emerging risk, we further propose an adaptive defense. Our findings highlight tokenizers as an overlooked yet critical privacy threat, underscoring the urgent need for privacy-preserving mechanisms specifically designed for them.

2510.05688 2026-05-26 cs.LG cs.AI (updated version)

vAttention: Verified Sparse Attention

Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

Affiliations: Electrical Engineering and Computer Sciences, University of California, Berkeley

AI summary: Proposes vAttention, which unifies top-k and random sampling to give the first practical sparse attention mechanism with user-specified (ε, δ) guarantees on approximation accuracy, substantially improving the quality-efficiency trade-off.

Journal ref: Proceedings of the International Conference on Learning Representations (ICLR), 2026

Abstract

State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(ε, δ)$ guarantees on approximation accuracy (thus, "verified"). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama 3.1 8B Instruct and DeepSeek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code: https://github.com/skylight-org/sparse-attention-hub. Webpage: https://sky-light.eecs.berkeley.edu.
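The complementarity described in the abstract can be illustrated with a minimal sketch: attend exactly over the top-$k$ tokens and estimate the contribution of the remaining tokens from a uniform random sample. This is not the authors' algorithm and it carries no $(ε, δ)$ guarantee; the function names, the reweighting by `len(rest) / take`, and the fixed sample size `m` are assumptions made for illustration.

```python
import math
import random


def hybrid_sparse_attention(scores, values, k, m, rng):
    """Estimate softmax(scores) @ values from k top tokens plus m sampled ones.

    Illustrative only: top-k tokens are used exactly, and each sampled token
    stands in for len(rest) / take tokens outside the top-k set.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top, rest = order[:k], order[k:]
    take = min(m, len(rest))
    sampled = rng.sample(rest, take) if take > 0 else []
    scale = len(rest) / take if take > 0 else 0.0
    shift = scores[order[0]]  # subtract the maximum score for numerical stability
    dim = len(values[0])
    weighted = [(i, 1.0) for i in top] + [(i, scale) for i in sampled]
    num = [0.0] * dim
    den = 0.0
    for idx, mult in weighted:
        w = mult * math.exp(scores[idx] - shift)
        den += w
        for d in range(dim):
            num[d] += w * values[idx][d]
    return [x / den for x in num]


def full_attention(scores, values):
    """Dense reference: every token is in the 'top-k' set, nothing is sampled."""
    return hybrid_sparse_attention(scores, values, len(scores), 0, None)
```

When a few scores dominate, the exact top-$k$ part carries almost all of the softmax mass; when scores are flat, the sampled part does most of the work. Choosing `k` and `m` adaptively to meet a target error is the part this sketch leaves out.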

2510.02837 2026-05-26 cs.AI cs.CL (updated version)

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

Affiliations: Graduate School of Data Science, KAIST, Daejeon, South Korea; Department of Industrial and Systems Engineering, KAIST, Daejeon, South Korea; Department of Artificial Intelligence, Yonsei University, Seoul, South Korea

AI summary: For tool-augmented LLMs, proposes TRACE, a reference-free framework that uses an evidence bank to evaluate reasoning trajectories along multiple dimensions (efficiency, hallucination, and adaptivity), and validates it with a meta-evaluation dataset.

Comments: International Conference on Machine Learning (ICML) 2026

Abstract

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects such as efficiency, hallucination, and adaptivity. The most straightforward evaluation method is to compare an agent's trajectory with the ground truth, but annotating all valid ground-truth trajectories is prohibitively expensive. To this end, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank that accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

2510.02361 2026-05-26 cs.CL cs.AI (updated version)

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications

AI summary: To address the inference inefficiency caused by the quadratic complexity of Transformer self-attention, proposes the ChunkLLM framework, which uses a QK Adapter and a Chunk Adapter for chunk selection and compression, substantially accelerating inference while preserving performance.

Abstract

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies because self-attention has quadratic complexity in the number of input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they suffer from either semantic incompleteness or poor training and inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving the dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model and detects chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered only when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. In particular, ChunkLLM attains a maximum speedup of 4.48x over the vanilla Transformer when processing 120K long texts.

2510.02327 2026-05-26 cs.CL cs.AI eess.AS (updated version)

KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang

AI summary: Proposes a hybrid architecture that injects a back-end LLM's text responses in real time to enrich the knowledge of an S2S model, improving response correctness while keeping latency low.

Comments: Published at IEEE ICASSP 2026

Abstract

Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.

2509.24978 2026-05-26 cs.AI cond-mat.quant-gas quant-ph (updated version)

Agentic Exploration of Physics Models

Maximilian Nägele, Florian Marquardt

Affiliations: Max Planck Institute for the Science of Light

AI summary: Proposes SciExplorer, an agent that uses the tool-use capabilities of large language models to explore unknown physical systems without domain-specific blueprints, recovering equations of motion and Hamiltonians from experiments and observations.

DOI: 10.1103/xnqc-q6nt

Abstract

The process of scientific discovery relies on an interplay of observations, analysis, and hypothesis generation. Machine learning is increasingly being adopted to address individual aspects of this process. However, it remains an open challenge to fully automate the heuristic, iterative loop required to discover the laws of an unknown system by exploring it through experiments and analysis, without tailoring the approach to the specifics of a given task. Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable exploration of systems without any domain-specific blueprints, and apply it to physical systems that are initially unknown to the agent. We test SciExplorer on a broad set of models spanning mechanical dynamical systems, wave evolution, and quantum many-body physics. Despite using a minimal set of tools, primarily based on code execution, we observe impressive performance on tasks such as recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values. The demonstrated effectiveness of this setup opens the door towards similar scientific exploration in other domains, without the need for finetuning or task-specific instructions.

2509.22299 2026-05-26 cs.LG cs.AI (updated version)

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang

Affiliations: School of Software Technology, Zhejiang University; FABU Inc.; Hangzhou Kuaidi Science and Technology Co., Ltd.

AI summary: To address the accuracy loss caused by coarse-grained expert pruning in MoE models, proposes the HEAPr algorithm, which decomposes experts into atomic experts, scores their importance with second-order information (following Optimal Brain Surgeon principles), and simplifies the computation in output space, achieving nearly lossless compression at pruning ratios of 20% to 25%.

Comments: ICLR 2026

Journal ref: Proceedings of the International Conference on Learning Representations (ICLR), 2026

Abstract

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to the Optimal Brain Surgeon theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where $d$ is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of pruning ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at pruning ratios of 20% to 25% in most models, while also reducing FLOPs by nearly 20%. The code can be found at https://github.com/LLIKKE/HEAPr.
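One way to picture the decomposition into atomic experts is sketched below. It assumes a gated feed-forward expert and treats each hidden unit (one row of the gate matrix, one row of the up matrix, one column of the down matrix) as an atomic expert; this reading and every name in the code are assumptions, and the Hessian-based importance score from the paper is not reproduced here.

```python
import math


def silu(z):
    """SiLU activation, assumed here as the expert's gate nonlinearity."""
    return z / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def atomic_outputs(x, w_gate, w_up, w_down_cols):
    """Return one output vector per atomic expert (one per hidden unit)."""
    outs = []
    for g_row, u_row, d_col in zip(w_gate, w_up, w_down_cols):
        coeff = silu(dot(g_row, x)) * dot(u_row, x)  # scalar activation of this unit
        outs.append([coeff * d for d in d_col])
    return outs


def expert_forward(x, w_gate, w_up, w_down_cols, keep=None):
    """Sum the atomic outputs; `keep` optionally masks out pruned units."""
    outs = atomic_outputs(x, w_gate, w_up, w_down_cols)
    if keep is None:
        keep = [True] * len(outs)
    dim = len(w_down_cols[0])
    return [sum(o[d] for o, k in zip(outs, keep) if k) for d in range(dim)]
```

Because the expert output is a plain sum over units, removing one unit changes the output by exactly that unit's contribution, which is what makes scoring importance in output space natural.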

2509.21592 2026-05-26 cs.CV cs.AI cs.LG (updated version)

What Happens Next? Anticipating Future Motion by Generating Point Trajectories

Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi

Affiliations: Visual Geometry Group, University of Oxford

AI summary: Proposes a method for forecasting future motion from a single image by generating dense trajectory grids that capture scene dynamics and uncertainty; it is more accurate and diverse than existing methods and is shown to be effective in downstream tasks such as robotics.

Journal ref: ICLR 2026

Abstract

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

2509.16931 2026-05-26 cs.IR cs.AI cs.LG (updated version)

Equip Pre-ranking with Target Attention by Residual Quantization

Yutong Li, Yu Zhu, Yichen Qiao, Ziyu Guan, Lv Shao, Tong Liu, Bo Zheng

Affiliations: Taobao & Tmall Group of Alibaba (Hangzhou and Beijing, China); Shanghai Jiao Tong University, Shanghai, China; Xidian University, Xi'an, China

AI summary: Proposes the TARQ framework, which uses residual quantization to approximate a target attention architecture in the pre-ranking stage, bringing TA's modeling power to this latency-critical stage for the first time and reaching a new state-of-the-art trade-off between accuracy and efficiency.

Comments: 5 pages, 2 figures, accepted by SIGIR 2026 Short Paper Track

Abstract

The pre-ranking stage in industrial recommendation systems faces a fundamental conflict between efficiency and effectiveness. While powerful models like Target Attention (TA) excel at capturing complex feature interactions in the ranking stage, their high computational cost makes them infeasible for pre-ranking, which often relies on simplistic vector-product models. This disparity creates a significant performance bottleneck for the entire system. To bridge this gap, we propose TARQ, a novel pre-ranking framework. Inspired by generative models, TARQ's key innovation is to equip pre-ranking with an architecture that approximates TA by means of Residual Quantization. This allows us to bring the modeling power of TA into the latency-critical pre-ranking stage for the first time, establishing a new state-of-the-art trade-off between accuracy and efficiency. Extensive offline experiments and large-scale online A/B tests at Taobao demonstrate TARQ's significant improvements in ranking performance. Consequently, our model has been fully deployed in production, serving tens of millions of daily active users and yielding substantial business improvements. The code and data are available at https://github.com/zyody/tarq_sigir2026.
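For readers unfamiliar with residual quantization, the generic technique can be sketched as follows: each level picks the nearest codeword to what is left of the vector, and the vector is approximated by the sum of the chosen codewords. This shows only the standard encode and decode steps with hand-written codebooks; how TARQ learns its codebooks and plugs the codes into target attention is not described in the abstract, and all names here are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rq_encode(vec, codebooks):
    """Quantize vec level by level, each level encoding the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rq_decode(codes, codebooks):
    """Reconstruct the approximation as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

An item is then represented by a short tuple of integer codes, so anything computed per codeword can be cached and looked up at serving time instead of being recomputed per item.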

2508.19113 2026-05-26 cs.AI (updated version)

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, Kyungjae Lee

Affiliations: Seoul National University; LG AI Research; University of Illinois Chicago; University of Seoul

AI summary: Proposes HybridDeepSearcher, a hybrid search strategy that combines parallel query expansion and explicit evidence aggregation with sequential reasoning, substantially improving performance on several benchmarks and achieving test-time search scaling.

Comments: Accepted to ICLR 2026

Abstract

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval. However, we find that existing approaches rarely demonstrate test-time search scaling. Methods that extend reasoning through single-query sequential search suffer from limited evidence coverage, while approaches that generate multiple independent queries per step often lack structured aggregation, hindering deeper sequential reasoning. We propose a hybrid search strategy to address these limitations. We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning. To supervise this behavior, we introduce HDS-QA, a novel dataset that guides models to combine broad parallel search with structured aggregation through supervised reasoning-query-retrieval trajectories containing parallel sub-queries. Across five benchmarks, HybridDeepSearcher significantly outperforms the state-of-the-art, improving F1 scores by +15.9 on FanOutQA and +9.2 on a subset of BrowseComp. Further analysis shows its consistent test-time search scaling: performance improves as additional search turns or calls are allowed, while competing methods plateau.

2508.12538 2026-05-26 cs.CR cs.AI cs.SE (updated version)

MCPXKIT: The Unified Toolkit for Analyzing Model Context Protocol Security

Yongjian Guo, Puzhuo Liu, Wanlun Ma, Zehang Deng, Xiaogang Zhu, Peng Di, Xi Xiao, Sheng Wen

Affiliations: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Ant Group, Hangzhou, China; Swinburne University of Technology, Melbourne, Australia; The University of Adelaide, Adelaide, Australia; UNSW Sydney, Australia

AI summary: Presents the MCPXKIT toolkit, which categorizes and implements 31 attack methods; quantitative experiments expose MCP weaknesses in reliance on tool descriptions, file-based attacks, chain attacks, and distinguishing data from commands, and the paper offers recommendations for strengthening security.

Comments: Accepted by IEEE Transactions on Dependable and Secure Computing (TDSC). Official version: https://ieeexplore.ieee.org/abstract/document/11531012

DOI: 10.1109/TDSC.2026.3695553

Abstract

The Model Context Protocol (MCP) has emerged as a universal standard that enables AI agents to seamlessly connect with external tools, significantly enhancing their functionality. However, while MCP brings notable benefits, it also introduces significant vulnerabilities, such as Tool Poisoning Attacks (TPA), where hidden malicious instructions exploit the sycophancy of large language models (LLMs) to manipulate agent behavior. Despite these risks, current academic research on MCP security remains limited, with most studies focusing on narrow or qualitative analyses that fail to capture the diversity of real-world threats. To address this gap, we present the MCP eXploit Toolkit (MCPXKIT), which categorizes and implements 31 distinct attack methods under four key classifications: direct tool injection, indirect tool injection, malicious user attacks, and LLM-inherent attacks. We further conduct a quantitative analysis of the efficacy of each attack. Our experiments reveal key insights into MCP vulnerabilities, including agents' blind reliance on tool descriptions, sensitivity to file-based attacks, chain attacks exploiting shared context, and difficulty distinguishing external data from executable commands. These insights, validated through attack experiments, underscore the urgent need for robust defense strategies and informed MCP design. Our contributions include 1) constructing a comprehensive MCP attack taxonomy, 2) introducing a unified attack framework, MCPXKIT, and 3) conducting empirical vulnerability analysis to enhance MCP security mechanisms. This work provides a foundational framework that supports the secure evolution of MCP ecosystems.

2508.09801 2026-05-26 cs.CR cs.AI (updated version)

SoK: A Comprehensive Security Analysis of the Jailbreak Robustness of GPT and DeepSeek Models

Xiaodong Wu, Xiangman Li, Qi Li, Lingshuang Liu, Jianbing Ni

Affiliations: Queen’s University; University of Waterloo

AI summary: Using the HarmBench benchmark, conducts the first comprehensive jailbreak analysis of the DeepSeek model family against GPT-3.5 and GPT-4, finding that DeepSeek is partially robust to optimization-driven attacks but vulnerable to prompt-engineering attacks, whereas GPT-4 Turbo shows more consistent safety alignment, which points to an inherent trade-off between model efficiency and alignment generalization.

Abstract

The rapid proliferation of Large Language Models (LLMs) has heightened concerns regarding their exposure to jailbreak attacks, which craft adversarial inputs designed to elicit unsafe content. Although proprietary models such as GPT-4 have been extensively evaluated, the robustness of emerging open-source systems like DeepSeek remains insufficiently examined, despite their growing use in LLM applications. In this paper, we conduct the first comprehensive jailbreak analysis of the DeepSeek model family, comparing it with GPT-3.5 and GPT-4 through the HarmBench benchmark. We investigate seven representative attack methods across 510 harmful behaviors, organized along both functional and semantic dimensions. Findings indicate that DeepSeek shows partial resilience against optimization-driven attacks such as TAP-T, but is more susceptible to prompt-based and manually engineered adversarial inputs. In contrast, GPT-4 Turbo demonstrates more robust and consistent safety alignment across a wide range of behaviors, likely due to stronger safety optimization and reinforcement learning from human feedback. In addition, fine-grained behavioral analysis and case studies reveal that DeepSeek often fails to consistently apply safety constraints to adversarial prompts, leading to uneven refusal behaviors. Overall, our results highlight an inherent trade-off between model efficiency and alignment generalization, underscoring the importance of targeted safety tuning and robust alignment strategies to ensure secure deployment of open-source LLMs.

2506.17629 2026-05-26 cs.CV cs.AI cs.CL (updated version)

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang

Affiliations: School of Computer Science and Technology, East China Normal University; King Abdullah University of Science and Technology; Fudan University

AI summary: Proposes the CLiViS framework, in which an LLM performs high-level task planning and orchestrates VLM-driven open-world visual perception, building a dynamic cognitive map that iteratively updates the scene context and enabling training-free embodied visual reasoning.

Abstract

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

2506.11027 2026-05-26 cs.LG cs.AI cs.PL (updated version)

From Reasoning to Code: GRPO Optimization for Underrepresented Languages

Federico Pennino, Bianca Raimondi, Massimo Rondelli, Andrea Gurioli, Maurizio Gabbrielli

AI summary: Proposes a reinforcement learning approach that pairs small Qwen2.5-Coder models with GRPO and uses execution feedback and a reward mechanism to improve code-generation accuracy and reasoning quality for low-resource languages such as Prolog and Lisp.

Comments: Accepted at ICLP 2026

Abstract

Generating accurate and executable code using Large Language Models (LLMs) remains a significant challenge for underrepresented programming languages, such as Prolog and Lisp, due to the scarcity of public training data compared to high-resource languages like Python. This paper introduces a generalizable Reinforcement Learning (RL) approach that combines small-scale versions of the Qwen2.5-Coder model with Group Relative Policy Optimization (GRPO) to enable effective code generation through reasoning. To address the limitations of sparse datasets, we integrate execution-driven feedback directly into the RL loop, utilizing a reward system that exploits both logical correctness and structural formatting. Experimental results on the GSM8K dataset demonstrate significant improvements in reasoning quality and code accuracy across underrepresented languages. These findings underscore the potential of our approach to benefit a wide range of programming languages lacking extensive training resources by leveraging symbolic reasoning and interpreter-based feedback.
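The group-relative part of GRPO can be sketched in a few lines: several completions are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the group's mean and standard deviation. The reward below, which adds a formatting bonus to a correctness score, is a hypothetical stand-in for the paper's reward system; the weight `0.2`, the use of the population standard deviation, and the function names are assumptions.

```python
import statistics


def shaped_reward(answer_correct, well_formatted, format_weight=0.2):
    """Hypothetical reward: correctness of the executed program plus a format bonus."""
    return float(answer_correct) + format_weight * float(well_formatted)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason a graded formatting term can help when correct programs are rare.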

2506.06840 2026-05-26 stat.ML cs.AI cs.LG stat.AP stat.OT (updated version)

A Statistical Framework for Model Selection in LSTM Networks

Fahad Mostafa

Affiliations: School of Mathematical and Natural Sciences, Arizona State University

AI summary: Because model selection for LSTM networks relies on heuristics and is computationally expensive, proposes a unified statistical framework that extends information criteria and shrinkage estimation to sequential neural networks, defines penalized likelihoods adapted to temporal structure, introduces a generalized threshold approach for hidden-state dynamics, and uses variational Bayes and approximate marginal likelihood for efficient estimation; examples on biomedical data illustrate its flexibility and improved performance.

Abstract

Long Short-Term Memory (LSTM) neural network models have become the cornerstone for sequential data modeling in numerous applications, ranging from natural language processing to time series forecasting. Despite their success, the problem of model selection, including hyperparameter tuning, architecture specification, and regularization choice, remains largely heuristic and computationally expensive. In this paper, we propose a unified statistical framework for systematic model selection in LSTM networks. Our framework extends classical model selection ideas, such as information criteria and shrinkage estimation, to sequential neural networks. We define penalized likelihoods adapted to temporal structures, propose a generalized threshold approach for hidden state dynamics, and provide efficient estimation strategies using variational Bayes and approximate marginal likelihood methods. Several biomedical data-centric examples demonstrate the flexibility and improved performance of the proposed framework.
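As a point of reference for the classical ideas the framework builds on, the sketch below scores candidate LSTM sizes with the textbook AIC and BIC. It does not reproduce the paper's temporally adapted penalties; the parameter count assumes a single-layer LSTM with one bias vector per gate, and the candidate log-likelihoods in the usage note are made-up numbers.

```python
import math


def lstm_param_count(input_dim, hidden_dim):
    """Four gates, each with input weights, recurrent weights, and one bias vector."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik


def select_by_bic(candidates, n_obs):
    """candidates: (hidden_dim, input_dim, log_likelihood); return the best hidden_dim."""
    best = min(
        candidates,
        key=lambda c: bic(c[2], lstm_param_count(c[1], c[0]), n_obs),
    )
    return best[0]
```

For example, with hypothetical fits `[(4, 3, -520.0), (8, 3, -500.0), (32, 3, -495.0)]` on 1,000 observations, the gain in log-likelihood from larger hidden sizes does not offset the parameter penalty, so the smallest model is selected.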

2506.06454 2026-05-26 cs.LG cs.AI stat.ML 版本更新

LETS Forecast: Learning Embedology for Time Series Forecasting

LETS Forecast：用于时间序列预测的嵌入学

Abrar Majeedi, Viswanatha Reddy Gajjala, Satya Sai Srinath Namburi GNVV, Nada Magdi Elkordi, Yin Li

发表机构 * Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison（生物统计学与医学信息学系，威斯康星大学麦迪逊分校）； Department of Computer Sciences, University of Wisconsin-Madison（计算机科学系，威斯康星大学麦迪逊分校）

AI总结提出DeepEDM框架，结合非线性动力系统建模与深度学习，通过延迟嵌入和核回归学习潜在动态，实现高精度时间序列预测。

Comments Accepted at International Conference on Machine Learning (ICML) 2025

详情

AI中文摘要

现实世界的时间序列通常受复杂的非线性动力学支配。理解这些潜在动力学对于精确的未来预测至关重要。虽然深度学习在时间序列预测中取得了重大成功，但许多现有方法并未显式建模动力学。为弥补这一差距，我们引入了DeepEDM，一个将非线性动力系统建模与深度神经网络相结合的框架。受经验动态建模（EDM）启发并基于Takens定理，DeepEDM提出了一种新颖的深度模型，该模型从时间延迟嵌入中学习潜在空间，并使用核回归来逼近潜在动力学，同时利用softmax注意力的高效实现，允许对未来时间步进行准确预测。为了评估我们的方法，我们在非线性动力系统的合成数据以及跨领域的真实世界时间序列上进行了全面实验。结果表明，DeepEDM对输入噪声具有鲁棒性，并在预测准确性上优于最先进的方法。我们的代码可在以下网址获取：https://abrarmajeedi.github.io/deep_edm。
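
The two classical ingredients named in the abstract, time-delay embedding and kernel regression, can be sketched in plain Python. This is not the DeepEDM model: there is no learned latent space, and the function names, embedding dimension, lag, and bandwidth are illustrative assumptions. It only shows how softmax-normalized Gaussian weights over delay vectors yield a one-step forecast.

```python
import math

def delay_embed(series, dim, tau=1):
    """Build delay vectors [x_t, x_{t-tau}, ..., x_{t-(dim-1)*tau}] for every valid t."""
    start = (dim - 1) * tau
    return [[series[t - k * tau] for k in range(dim)]
            for t in range(start, len(series))]

def kernel_forecast(series, dim=3, tau=1, bandwidth=0.05):
    """One-step forecast by Nadaraya-Watson regression over delay vectors.
    The Gaussian kernel weights are computed as a softmax over negative
    squared distances, the same form as softmax attention scores."""
    states = delay_embed(series, dim, tau)
    start = (dim - 1) * tau
    library = states[:-1]            # states whose successor is known
    targets = series[start + 1:]     # successor value of each library state
    query = states[-1]               # most recent state
    logits = [-sum((a - b) ** 2 for a, b in zip(s, query)) / (2 * bandwidth ** 2)
              for s in library]
    peak = max(logits)               # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, targets)) / total

# Toy example: forecast the next sample of a sine wave.
series = [math.sin(0.1 * t) for t in range(200)]
prediction = kernel_forecast(series)
```

By Takens' theorem, delay vectors of a scalar observable can reconstruct the state of the underlying system, which is why nearby delay vectors tend to share similar futures.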

结合抽象论证与机器学习高效分析低层过程事件流

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Luigi Pontieri, Francesco Scala

发表机构 * University of Calabria（卡拉布里亚大学）； CNR（国家科研委员会）

AI总结提出一种数据高效的神经符号方法，通过抽象论证框架（AAF）优化序列标注模型生成的候选事件解释，以解决低层过程事件流中事件到活动映射的不确定性问题。

详情

DOI: 10.1007/s40747-026-02340-1

AI中文摘要

监控和分析过程轨迹是现代公司和组织的一项关键任务。在轨迹事件与参考业务活动之间存在差距的场景中，这涉及一个解释问题，即将任何正在进行的轨迹的每个事件转换为活动实例的相应步骤。基于最近将解释问题框架化为抽象论证框架（AAF）内的接受问题的方法，可以优雅地分析可能的（可能以聚合形式）事件解释，并为那些与先验过程知识冲突的解释提供解释。由于在事件到活动映射高度不确定（或简单地说未充分指定）的环境中，这种基于推理的方法可能产生低信息量的结果和繁重的计算，因此可以考虑发现一个序列标注模型，该模型经过训练以上下文感知的方式建议高概率的候选事件解释。然而，最优地训练这样的模型可能需要使用大量手动注释的示例轨迹。因此，我们提出了一种数据高效的神经符号方法，其中由示例驱动的序列标注器返回的候选解释由基于AAF的推理器进行细化。这使我们能够利用先验知识来补偿示例数据的稀缺性，实验结果证实了这一点。

英文摘要

Monitoring and analyzing process traces is a critical task for modern companies and organizations. In scenarios where there is a gap between trace events and reference business activities, this entails an interpretation problem, amounting to translating each event of any ongoing trace into the corresponding step of the activity instance. Building on a recent approach that frames the interpretation problem as an acceptance problem within an Abstract Argumentation Framework (AAF), one can elegantly analyze plausible event interpretations (possibly in an aggregated form), as well as offer explanations for those that conflict with prior process knowledge. Since, in settings where the event-to-activity mapping is highly uncertain (or simply under-specified), this reasoning-based approach may yield poorly informative results and heavy computation, one can think of discovering a sequence-tagging model trained to suggest highly probable candidate event interpretations in a context-aware way. However, training such a model optimally may require a large amount of manually annotated example traces. We therefore propose a data-efficient neuro-symbolic approach to the problem, where the candidate interpretations returned by the example-driven sequence tagger are refined by the AAF-based reasoner. This allows us to also leverage prior knowledge to compensate for the scarcity of example data, as confirmed by experimental results.

2502.10906 2026-05-26 cs.AI 版本更新

PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning

PCGRLLM：面向程序化内容生成强化学习的大语言模型驱动奖励设计

In-Chang Baek, Sung-Hyun Kim, Sam Earle, Zehua Jiang, Jin-Ha Noh, Julian Togelius, Kyung-Joong Kim

发表机构 * Gwangju Institute of Science and Technology（光州科学技术院）； New York University（纽约大学）

AI总结提出PCGRLLM架构，利用大语言模型和反馈机制生成奖励函数，在二维环境中实现故事到奖励的生成，性能接近人类水平。

Comments 14 pages, 8 figures, Accepted to Transactions on Games

详情

DOI: 10.1109/TG.2026.3695197

AI中文摘要

奖励设计在游戏AI训练中起着关键作用，需要大量领域知识和人力。近年来，一些研究探索了使用大语言模型（LLM）生成奖励函数来训练游戏代理和控制机器人。在内容生成文献中，已有早期工作为强化学习代理生成器生成奖励函数。本文介绍了PCGRLLM，一种基于早期工作的扩展架构，采用了反馈机制和几种基于推理的提示工程技术。我们在二维环境中的故事到奖励生成任务上，使用两种最先进的LLM和各种基于推理的提示方法评估了所提出的方法。我们的实验提供了富有洞察力的评估，展示了LLM在内容生成任务中不可或缺的能力。结果表明，与之前的结构相比，性能有了显著提升，达到了与人类相当的性能。我们的工作展示了在游戏AI开发中减少人类依赖的潜力，同时支持和增强创造性过程。

英文摘要

Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent years, several studies have explored reward generation for training game agents and controlling robots using large language models (LLMs). In the content generation literature, there has been early work on generating reward functions for reinforcement learning agent generators. This work introduces PCGRLLM, an extended architecture based on earlier work, which employs a feedback mechanism and several reasoning-based prompt engineering techniques. We evaluate the proposed method on a story-to-reward generation task in a two-dimensional environment using two state-of-the-art LLMs across various reasoning-based prompting methods. Our experiments provide insightful evaluations that demonstrate the capabilities of LLMs essential for content generation tasks. The results demonstrate a substantial performance improvement over the previous structure, achieving performance comparable to that of humans. Our work demonstrates the potential to reduce human dependency in game AI development, while supporting and enhancing creative processes.

2502.10311 2026-05-26 cs.LG cs.AI cs.HC 版本更新

ExplainReduce: Generating global explanations from many local explanations

ExplainReduce: 从许多局部解释生成全局解释

Lauri Seppäläinen, Mudong Guo, Kai Puolamäki

发表机构 * University of Helsinki（赫尔辛基大学）

AI总结本文提出 ExplainReduce 方法，通过贪心启发式算法将大量局部解释缩减为少量简单模型，作为生成式全局解释，并证明其有效性和竞争力。

Comments 21 pages with a 36 page appendix, 8 + 39 figures, 1+1 tables. The datasets and source code used in the paper are available at https://github.com/edahelsinki/explainreduce. Accepted for publication in the 4th World Conference on eXplainable Artificial Intelligence (2026)

2502.01397 2026-05-26 cs.LG cs.AI cs.NA math.NA 版本更新

Message-Passing GNNs Fail to Approximate Sparse Triangular Factorizations

消息传递GNN无法近似稀疏三角分解

Vladislav Trifonov, Ekaterina Muravleva, Ivan Oseledets

发表机构 * AIC, Skoltech（斯科尔科沃科学技术研究院人工智能中心）； Skoltech AI4S Center（斯科尔科沃科学技术研究院AI4S中心）； Sberbank of Russia（俄罗斯储蓄银行）； AIRI

AI总结本文通过理论和实验证明，消息传递图神经网络在逼近稀疏三角分解时存在根本性局限，需要超越消息传递的架构创新。

Comments Camera-ready version published in Transactions on Machine Learning Research

详情

Journal ref: Transactions on Machine Learning Research, 2026

AI中文摘要

图神经网络（GNN）已被提议作为学习稀疏矩阵预条件子的工具，预条件子是加速线性求解器的关键组件。我们提出理论和实验证据表明，对于存在高质量预条件子但需要非局部依赖的矩阵类别，消息传递GNN从根本上无法近似稀疏三角分解。为了说明这一点，我们使用合成矩阵和SuiteSparse集合中的真实示例构建了一组基线。在包括图注意力网络和图变换器在内的多种GNN架构中，我们观察到预测因子与参考因子之间的余弦相似度较低（关键情况下≤0.7）。我们的理论和实验结果表明，需要超越消息传递的架构创新才能将GNN应用于矩阵分解等科学计算任务。此外，实验表明仅克服非局部性是不够的。需要定制的架构来捕获所需的依赖关系，因为即使是完全非局部的全局图变换器也无法匹配所提出的基线。

英文摘要

Graph Neural Networks (GNNs) have been proposed as a tool for learning sparse matrix preconditioners, which are key components in accelerating linear solvers. We present theoretical and empirical evidence that message-passing GNNs are fundamentally incapable of approximating sparse triangular factorizations for classes of matrices for which high-quality preconditioners exist but require non-local dependencies. To illustrate this, we construct a set of baselines using both synthetic matrices and real-world examples from the SuiteSparse collection. Across a range of GNN architectures, including Graph Attention Networks and Graph Transformers, we observe low cosine similarity ($\leq0.7$ in key cases) between predicted and reference factors. Our theoretical and empirical results suggest that architectural innovations beyond message-passing are necessary for applying GNNs to scientific computing tasks such as matrix factorization. Moreover, experiments demonstrate that overcoming non-locality alone is insufficient. Tailored architectures are necessary to capture the required dependencies since even a completely non-local Global Graph Transformer fails to match the proposed baselines.

2502.01184 2026-05-26 cs.LG cs.AI physics.chem-ph q-bio.QM 版本更新

FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning

FragmentNet: 自适应图分片用于图到序列分子表示学习

Ankur Samanta, Rohan Gupta, Aditi Misra, Christian McIntosh Clarke, Jayakumar Rajadas

发表机构 * Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada（电气与计算机工程系，多伦多大学，多伦多，加拿大）； Regenerative Biomaterials Laboratory, Stanford Cardiovascular Institute, Palo Alto, USA（再生生物材料实验室，斯坦福心血管研究所，帕洛阿尔托，美国）

AI总结提出FragmentNet，通过自适应学习的分词器将分子图分解为化学有效的片段，并利用化学感知的空间位置编码保持分子拓扑，在片段级别进行掩码预训练，在多个属性预测任务上提升了性能。

Comments 22 pages, 13 figures, 5 tables

详情

AI中文摘要

分子表示学习方法通常将分子标记为单个原子或使用刚性、基于规则的分片分解，限制了它们捕捉有意义化学子结构上下文的能力。我们引入了FragmentNet，一种围绕新颖的自适应学习分词器构建的图到序列模型，该分词器将分子图分解为可调整粒度的化学有效片段，并辅以化学感知的空间位置编码，在生成的序列中保留分子拓扑。将自然语言处理中的掩码预训练策略扩展到分子领域，我们在化学有意义的片段级别而非单个原子级别对分子进行掩码和重建。在多个属性预测基准上的评估发现，在片段粒度上进行预训练在大多数任务上提高了下游性能，表明标记化粒度是分子表示学习的重要设计选择。

英文摘要

Molecular representation learning methods typically tokenize molecules as individual atoms or use rigid, rule-based fragment decompositions, limiting their ability to capture meaningful chemical substructure context. We introduce FragmentNet, a graph-to-sequence model built around a novel adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments of adjustable granularity, complemented by chemically aware spatial positional encodings that preserve molecular topology in the resulting sequence. Extending masked pre-training strategies from natural language processing to the molecular domain, we mask and reconstruct molecules at the level of chemically meaningful fragments rather than individual atoms. Evaluating across multiple property prediction benchmarks, we find that pre-training at fragment granularity leads to improved downstream performance on the majority of tasks, demonstrating that tokenization granularity is an important design choice for molecular representation learning.

2303.07863 2026-05-26 cs.CV cs.AI cs.MM 版本更新

You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

你可以比看见更早定位：一种用于压缩视频中时序句子定位的高效流程

Xiang Fang, Daizong Liu, Pan Zhou, Guoshun Nan

发表机构 * The Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology（大数据安全湖北工程研究中心，网络安全科学与工程学院，华中科技大学）； Peking University（北京大学）； Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结提出一种三分支压缩域时空融合框架（TCSF），直接从压缩视频中提取I帧、运动向量和残差特征，实现高效准确的时序句子定位。

Comments Accepted by CVPR 2023

详情

AI中文摘要

给定一个未剪辑视频，时序句子定位（TSG）旨在根据句子查询语义上定位目标时刻。尽管先前的工作取得了不错的成功，但它们仅关注从连续解码帧中提取的高级视觉特征，未能处理压缩视频的查询建模，导致训练和测试期间表示能力不足且计算复杂度高。本文提出了一种新的设置——压缩域TSG，直接利用压缩视频而非完全解压的帧作为视觉输入。为了处理原始视频比特流输入，我们提出了一种新颖的三分支压缩域时空融合（TCSF）框架，该框架提取并聚合三种低级视觉特征（I帧、运动向量和残差特征）以实现高效准确的定位。特别地，不像先前工作那样编码整个解码帧，我们仅通过学习I帧特征来捕获外观表示，以减少延迟。此外，我们不仅通过学习运动向量特征来探索运动信息，还通过残差特征探索相邻帧的关系。通过这种方式，进一步设计了一个带有自适应运动-外观融合模块的三分支时空注意力层，以提取和聚合外观和运动信息用于最终定位。在三个具有挑战性的数据集上的实验表明，我们的TCSF以更低的复杂度实现了比现有最先进方法更好的性能。

英文摘要

Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Although previous works have achieved decent success, they focus only on high-level visual features extracted from consecutive decoded frames and fail to handle compressed videos for query modelling, suffering from insufficient representation capability and significant computational complexity during training and testing. In this paper, we pose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input. To handle the raw video bit-stream input, we propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework, which extracts and aggregates three kinds of low-level visual features (I-frame, motion vector and residual features) for effective and efficient grounding. In particular, instead of encoding all decoded frames as previous works do, we capture the appearance representation by learning only the I-frame feature, which reduces latency. In addition, we explore motion information not only by learning the motion vector feature, but also by exploring the relations of neighboring frames via the residual feature. In this way, a three-branch spatial-temporal attention layer with an adaptive motion-appearance fusion module is further designed to extract and aggregate both appearance and motion information for the final grounding. Experiments on three challenging datasets show that our TCSF achieves better performance than other state-of-the-art methods with lower complexity.

2209.11572 2026-05-26 cs.CV cs.AI cs.IR cs.MM 版本更新

Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

多模态跨域对齐网络用于视频时刻检索

Xiang Fang, Daizong Liu, Pan Zhou, Yuchong Hu

发表机构 * Hubei Key Laboratory of Distributed System Security（湖北分布式系统安全重点实验室）； Hubei Engineering Research Center on Big Data Security（湖北大数据安全工程研究中心）； School of Cyber Science and Engineering（网络安全学院）； Huazhong University of Science and Technology（华中科技大学）； Wangxuan Institute of Computer Technology（王选计算机研究所）； Peking University（北京大学）； School of Computer Science and Technology（计算机科学与技术学院）； Key Laboratory of Information Storage System Ministry of Education of China（信息存储系统教育部重点实验室）

AI总结提出多模态跨域对齐网络，通过域对齐、跨模态对齐和特定对齐三个模块，解决跨域视频时刻检索中域差异和语义鸿沟问题。

Comments Accepted by IEEE Transactions on Multimedia

详情

AI中文摘要

作为多媒体信息检索中日益流行的任务，视频时刻检索（VMR）旨在根据给定的语言查询从未修剪的视频中定位目标时刻。大多数先前的方法严重依赖于大量手动标注（即时刻边界），这在实践中获取成本极高。此外，由于不同数据集之间的域差异，直接将预训练模型应用于未见过的域会导致性能显著下降。本文聚焦于一项新任务：跨域VMR，其中在一个域（“源域”）中有完全标注的数据集，但目标域（“目标域”）仅包含未标注的数据集。据我们所知，我们提出了关于跨域VMR的首项研究。为了解决这一新任务，我们提出了一种新颖的多模态跨域对齐（MMCDA）网络，将标注知识从源域迁移到目标域。然而，由于源域和目标域之间的域差异以及视频和查询之间的语义鸿沟，直接将训练好的模型应用于目标域通常会导致性能下降。为解决此问题，我们开发了三个新颖的模块：（i）域对齐模块，用于对齐每个模态在不同域之间的特征分布；（ii）跨模态对齐模块，旨在将视频和查询特征映射到联合嵌入空间，并对齐目标域中不同模态之间的特征分布；（iii）特定对齐模块，试图获取特定帧与给定查询之间的细粒度相似性以实现最优定位。通过联合训练这三个模块，我们的MMCDA能够学习域不变且语义对齐的跨模态表示。

英文摘要

As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, due to the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully-annotated datasets are available in one domain ("source domain"), but the domain of interest ("target domain") only contains unannotated datasets. As far as we know, we present the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation knowledge from the source domain to the target domain. However, due to the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module is designed to align the feature distributions between different domains of each modality; (ii) a cross-modal alignment module aims to map both video and query features into a joint embedding space and to align the feature distributions between different modalities in the target domain; (iii) a specific alignment module tries to obtain the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, our MMCDA can learn domain-invariant and semantic-aligned cross-modal representations.

2011.10396 2026-05-26 cs.LG cs.AI 版本更新

Double Self-weighted Multi-view Clustering via Adaptive View Fusion

双自加权多视图聚类：通过自适应视图融合

Xiang Fang, Yuchong Hu

发表机构 * School of Computer Science and Technology, Key Laboratory of Information Storage System Ministry of Education of China, Huazhong University of Science and Technology（计算机科学与技术学院，信息存储系统教育部重点实验室，华中科技大学）

AI总结提出双自加权多视图聚类框架（DSMC），通过自适应权重矩阵和权重因子分别对特征和图进行加权，去除冗余和噪声，并融合多图进行聚类。

Comments Corresponding author: Xiang Fang

详情

AI中文摘要

多视图聚类已应用于许多实际应用中，其中原始数据通常包含噪声。一些基于图的多视图聚类方法被提出来试图减少噪声的负面影响。然而，以往的基于图的多视图聚类方法即使存在冗余特征或噪声，也平等对待所有特征，这显然是不合理的。在本文中，我们提出了一种新颖的多视图聚类框架——双自加权多视图聚类（DSMC）来克服上述缺陷。DSMC执行双自加权操作，从每个图中去除冗余特征和噪声，从而获得鲁棒的图。对于第一次自加权操作，它通过引入自适应权重矩阵为不同特征分配不同的权重，这可以增强重要特征在联合表示中的作用，并使每个图鲁棒。对于第二次自加权操作，它通过施加自适应权重因子对不同图进行加权，这可以为更鲁棒的图分配更大的权重。此外，通过设计自适应多图融合，我们可以融合不同图中的特征，以整合这些图进行聚类。在六个真实世界数据集上的实验证明了其相对于其他最先进的多视图聚类方法的优势。

英文摘要

Multi-view clustering has been applied in many real-world applications where original data often contain noise. Some graph-based multi-view clustering methods have been proposed to reduce the negative influence of noise. However, previous graph-based multi-view clustering methods treat all features equally even if there are redundant features or noise, which is obviously unreasonable. In this paper, we propose a novel multi-view clustering framework, Double Self-weighted Multi-view Clustering (DSMC), to overcome the aforementioned deficiency. DSMC performs double self-weighted operations to remove redundant features and noise from each graph, thereby obtaining robust graphs. For the first self-weighted operation, it assigns different weights to different features by introducing an adaptive weight matrix, which can reinforce the role of the important features in the joint representation and make each graph robust. For the second self-weighted operation, it weights different graphs by imposing an adaptive weight factor, which can assign larger weights to more robust graphs. Furthermore, by designing an adaptive multiple-graph fusion, we can fuse the features in the different graphs to integrate these graphs for clustering. Experiments on six real-world datasets demonstrate its advantages over other state-of-the-art multi-view clustering methods.

2011.10254 2026-05-26 cs.LG cs.AI stat.ML 版本更新

Unbalanced Incomplete Multi-view Clustering via the Scheme of View Evolution: Weak Views are Meat; Strong Views do Eat

通过视图演化方案的不平衡不完整多视图聚类：弱视图为肉，强视图食之

Xiang Fang, Yuchong Hu, Pan Zhou, Dapeng Oliver Wu

发表机构 * School of Computer Science and Technology, Key Laboratory of Information Storage System Ministry of Education of China, Huazhong University of Science and Technology（计算机科学与技术学院，信息存储系统教育部重点实验室，华中科技大学）； Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology（大数据安全工程研究中心，网络安全学院，华中科技大学）； Department of Electrical and Computer Engineering, University of Florida（电气与计算机工程系，佛罗里达大学）

AI总结针对不同视图不完整程度不平衡的问题，受生物进化理论启发，提出基于视图演化的不平衡不完整多视图聚类方法UIMC，通过加权多视图子空间聚类和低秩鲁棒表示恢复数据，显著提升聚类性能。

Comments Accepted by IEEE Transactions on Emerging Topics in Computational Intelligence

详情

DOI: 10.1109/TETCI.2021.3077909
Journal ref: IEEE Transactions on Emerging Topics in Computational Intelligence 2021

AI中文摘要

不完整多视图聚类是处理现实世界中不完整多视图数据的重要技术。以往的工作假设所有视图具有相同的不完整性，即平衡不完整性。然而，不同的视图往往具有不同的不完整性，即不平衡不完整性，这导致了强视图（低不完整性视图）和弱视图（高不完整性视图）。不平衡不完整性阻止我们直接使用先前的方法进行聚类。在本文中，受有效生物进化理论的启发，我们设计了新颖的视图演化方案来聚类强视图和弱视图。此外，我们提出了一种不平衡不完整多视图聚类方法（UIMC），这是第一个基于视图演化的有效方法，用于不平衡不完整多视图聚类。与先前的方法相比，UIMC有两个独特的优势：1）它提出了加权多视图子空间聚类来整合这些不平衡不完整的视图，有效解决了不平衡不完整多视图问题；2）它设计了低秩和鲁棒表示来恢复数据，减少了不完整性和噪声的影响。大量的实验结果表明，UIMC在三个评估指标上相比其他最先进的方法将聚类性能提高了高达40%。

英文摘要

Incomplete multi-view clustering is an important technique to deal with real-world incomplete multi-view data. Previous works assume that all views have the same incompleteness, i.e., balanced incompleteness. However, different views often have distinct incompleteness, i.e., unbalanced incompleteness, which results in strong views (low-incompleteness views) and weak views (high-incompleteness views). The unbalanced incompleteness prevents us from directly using the previous methods for clustering. In this paper, inspired by the effective biological evolution theory, we design a novel scheme of view evolution to cluster strong and weak views. Moreover, we propose an Unbalanced Incomplete Multi-view Clustering method (UIMC), which is the first effective method based on view evolution for unbalanced incomplete multi-view clustering. Compared with previous methods, UIMC has two unique advantages: 1) it proposes weighted multi-view subspace clustering to integrate these unbalanced incomplete views, which effectively solves the unbalanced incomplete multi-view problem; 2) it designs a low-rank and robust representation to recover the data, which diminishes the impact of incompleteness and noise. Extensive experimental results demonstrate that UIMC improves the clustering performance by up to 40% on three evaluation metrics over other state-of-the-art methods.

2605.25250 2026-05-26 cs.AI 版本更新

LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design

LipoAgent: 协调微调的大语言模型智能体以实现更安全的脂质设计

Leshu Li, An Lu, Haiyu Wang, Zhibin Feng, Conghui Duan, Qing Bao, Zongmin Zhao, Sai Qian Zhang

发表机构 * New York University（纽约大学）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）

AI总结提出LipoAgent，一种安全感知的多智能体大语言模型框架，通过条件预测目标强制毒性作为效率预测的前提，并结合多智能体验证，在mRNA转染效率预测上平均相对提升32%。

详情

AI中文摘要

脂质纳米颗粒（LNPs）是核酸递送中最临床成熟的平台之一，但设计既有效又生物学安全的脂质仍是一个主要瓶颈。在实际筛选中，毒性是一个决策层面的约束：如果一种脂质有毒，其效率预测在临床上无关紧要。我们提出LipoAgent，一种用于脂质发现的安全感知多智能体大语言模型框架。LipoAgent将领域特定微调与条件预测目标相结合，强制毒性作为效率预测的前提，并通过多智能体验证进一步提高可靠性，在存在持续分歧时辅以轻量级人工监督。在多个基础模型上，与已报道的其他脂质设计模型相比，LipoAgent在mRNA转染效率预测上实现了平均32%的相对改进。湿实验验证证实，虚拟筛选排名可靠地转化为生物学转染结果。代码公开于https://github.com/SAI-Lab-NYU/LipoAgent.git。

英文摘要

Lipid nanoparticles (LNPs) are among the most clinically mature platforms for nucleic acid delivery, yet designing lipids that are both effective and biologically safe remains a major bottleneck. In practical screening, toxicity is a decision-level constraint: if a lipid is toxic, its efficiency prediction is clinically irrelevant. We propose LipoAgent, a safety-aware multi-agent LLM framework for lipid discovery. LipoAgent combines domain-specific finetuning with a conditional prediction objective that enforces toxicity as a prerequisite for efficiency prediction, and further improves reliability via multi-agent verification with lightweight human oversight when disagreement persists. Across multiple foundation models, LipoAgent achieves an average 32% relative improvement in mRNA transfection efficiency prediction compared with other reported models for lipid design. Wet-lab validation confirms that virtual screening rankings reliably translate to biological transfection outcomes. The code is publicly available at https://github.com/SAI-Lab-NYU/LipoAgent.git.

2605.25235 2026-05-26 cs.LG cs.AI math.OC 版本更新

Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies

约束锚定归因：神经组合优化策略的可行性认证反事实与Bonferroni-PAC充分子集

Sohaib Lafifi

发表机构 * Univ. Artois, UR 3926, Laboratoire de Génie Informatique et d'Automatique de l'Artois (LGI2A), F-62400 Béthune, France

AI总结提出一种神经组合优化策略的归因方法，通过LP松弛对偶分解决策、CSP可行性模型认证反事实，并用Bonferroni校正的Hoeffding充分子集测试界定PAC解释大小。

Comments 4 pages, 1 figure, Reference implementation: https://github.com/sohaibafifi/neuro-co-cax (MIT)

详情

AI中文摘要

我们为神经组合优化（CO）策略提供了一种归因方法，该方法（i）通过LP松弛对偶按约束族分解决策，（ii）通过组合可行性模型（实现为CSP可行性决策模型）认证反事实，以及（iii）通过沿贪心顺序的Bonferroni校正Hoeffding充分子集测试界定PAC充分解释的大小。在三个CO问题和三个随机种子上，我们的LP锚定$\Lambda$-归因在CVRPTW（n_cert=344）上匹配CF导出信号的96.5%，在定向问题（n_cert=281）上匹配77.2%，而代理梯度分别为75.0%和35.2%（配对差异+0.215和+0.420；McNemar精确$p \le 10^{-14}$）。在柔性作业车间调度问题的秩对齐机制中，两个后端在每个CSP认证翻转（n_cert=59）上一致，确认了无增益预测。Bonferroni-PAC子集平均每步5.0个节点（$M=70$，$\varepsilon=\delta=0.2$，$k_{\max}=25$）。参考实现：https://github.com/sohaibafifi/neuro-co-cax

英文摘要

We give an attribution method for neural combinatorial-optimisation (CO) policies that (i) decomposes a decision by constraint families via LP-relaxation duals, (ii) certifies counterfactuals through a combinatorial feasibility model (implemented as a CSP feasibility-decision model), and (iii) bounds the size of a PAC-sufficient explanation with a Bonferroni-corrected Hoeffding sufficient-subset test along a greedy ordering. Across three CO problems and three seeds, our LP-anchored $\Lambda$-attribution matches the CF-derived signal at 96.5% on CVRPTW (n_cert=344) and 77.2% on the Orienteering Problem (n_cert=281) vs 75.0% and 35.2% for proxy gradient (paired diffs +0.215 and +0.420; McNemar exact $p \le 10^{-14}$). In the rank-aligned regime of the Flexible Job-Shop Scheduling Problem, both backends agree on every CSP-certified flip (n_cert=59), confirming the no-gain prediction. Bonferroni-PAC subsets average 5.0 nodes per step ($M=70$, $\varepsilon=\delta=0.2$, $k_{\max}=25$). Reference implementation: https://github.com/sohaibafifi/neuro-co-cax
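
One plausible reading of the Bonferroni-corrected Hoeffding test in step (iii) is sketched below. It assumes a one-sided Hoeffding bound on a $[0, 1]$-valued agreement rate, with the failure probability $\delta$ split evenly over the $k_{\max}$ candidate prefix sizes. The function names and the acceptance rule are hypothetical and are not taken from the reference implementation.

```python
import math

def hoeffding_radius(num_samples: int, delta: float, num_tests: int = 1) -> float:
    """One-sided Hoeffding deviation for the mean of [0, 1]-valued samples,
    with the failure probability split evenly over num_tests (Bonferroni)."""
    per_test_delta = delta / num_tests
    return math.sqrt(math.log(1.0 / per_test_delta) / (2.0 * num_samples))

def first_sufficient_prefix(agreement_rates, num_samples, epsilon, delta, k_max):
    """Scan prefixes of a greedy ordering and return the first size k whose
    empirical agreement rate, minus the corrected radius, is still >= 1 - epsilon.
    Returns None if no prefix up to k_max passes the test."""
    radius = hoeffding_radius(num_samples, delta, num_tests=k_max)
    for k, rate in enumerate(agreement_rates[:k_max], start=1):
        if rate - radius >= 1.0 - epsilon:
            return k
    return None

# Parameters quoted in the abstract: M = 70, epsilon = delta = 0.2, k_max = 25.
radius = hoeffding_radius(70, 0.2, num_tests=25)
# Illustrative (made-up) agreement rates for prefixes of size 1, 2, 3.
k = first_sufficient_prefix([0.70, 0.90, 1.00], 70, 0.2, 0.2, 25)
```

Under these assumptions the corrected radius comes out just below $\varepsilon$, so a prefix passes only when nearly all $M$ sampled perturbations preserve the decision.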

2605.25234 2026-05-26 cs.LG cs.AI stat.CO stat.ML 版本更新

On the Epistemic Uncertainty of Overparametrized Neural Networks

关于过参数化神经网络的认知不确定性

David Rügamer

发表机构 * Department of Statistics, LMU Munich（统计系，慕尼黑大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）

AI总结本文通过非可辨识性视角分析过参数化神经网络的认知不确定性，刻画了离散和连续残余不确定性来源，并以单隐层ReLU网络为例验证理论。

Comments Accepted at ICML 2026 (Main Track)

2605.25233 2026-05-26 cs.AI 版本更新

Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems

Meta-Agent：从任务描述到经过验证的多智能体系统

Andy Xu, Yu-Wing Tai

发表机构 * Dartmouth College（达特茅斯学院）

AI总结提出Meta-Agent两阶段框架，通过任务规划、网络搜索、代码生成和验证机制，自动从自然语言任务描述构建并执行可靠的多智能体系统，在编码、上下文学习和开放推理任务中提升成功率、错误恢复和工作流稳定性。

详情

AI中文摘要

AI智能体越来越多地被用于解决复杂的多步骤任务，但随着工作流规模和深度的增长，现有的多智能体框架仍然脆弱。中间阶段的小错误会通过智能体交互传播，同时不充分的依据和薄弱的验证机制进一步限制了可靠性。我们提出Meta-Agent，一个两阶段框架，能够从自然语言任务描述自动构建并执行专门的多智能体系统。在构建阶段，任务规划器将问题分解为智能体规范的有向无环图，包含明确的输入/输出契约和验证标准。网络搜索模块用外部证据为每个规范提供依据，代码生成模块产生系统提示和工具配置。构建时验证阶段随后验证生成的工件，并在检测到失败时触发有针对性的重新生成。在执行阶段，协调器在智能体图中分配子任务，同时执行时验证对中间输出进行把关。我们进一步引入三级错误归因机制，区分局部、上游和结构性失败，从而实现从局部重试到部分重新执行和重新分解的有针对性的恢复策略。我们在编码、上下文学习和开放式推理任务上评估Meta-Agent。与强多智能体基线及消融实验相比，结果表明在任务成功率、错误恢复和工作流稳定性方面均有持续改进。这些结果凸显了将规划、依据和验证紧密集成以构建可靠多智能体系统的重要性。

英文摘要

AI agents are increasingly used to solve complex, multi-step tasks, but existing multi-agent frameworks remain brittle as workflows grow in scale and depth. Small errors at intermediate stages can propagate through agent interactions, while insufficient grounding and weak verification mechanisms further limit reliability. We present Meta-Agent, a two-phase framework that automatically constructs and executes specialized multi-agent systems from natural-language task descriptions. In the construction phase, a task planner decomposes a problem into a directed acyclic graph of agent specifications with explicit input/output contracts and verification criteria. A web search module grounds each specification with external evidence, and a code generation module produces system prompts and tool configurations. A construction-time verification stage then validates generated artifacts and triggers targeted regeneration when failures are detected. In the execution phase, a coordinator dispatches subtasks across the agent graph while execution-time verification gates intermediate outputs. We further introduce a three-level error attribution mechanism that distinguishes local, upstream, and structural failures, enabling targeted recovery strategies ranging from localized retries to partial re-execution and re-decomposition. We evaluate Meta-Agent across coding, contextual learning, and open-ended reasoning tasks. Experiments against strong multi-agent baselines and ablation studies demonstrate consistent improvements in task success rate, error recovery, and workflow stability. The results highlight the importance of tightly integrating planning, grounding, and verification for building reliable multi-agent systems.
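
The execution phase described above, dispatching subtasks over a directed acyclic graph while gating each intermediate output, can be sketched as follows. The spec layout (`deps`, `run`, `verify`) and the function name are hypothetical, and only the simplest recovery strategy (a local retry) is modeled; upstream re-execution and re-decomposition are left out.

```python
from graphlib import TopologicalSorter

def run_agent_graph(specs, max_retries=1):
    """Execute agents in dependency order. Each spec is a dict with:
    'deps' (names of upstream agents), 'run' (callable taking a dict of
    upstream outputs), and 'verify' (callable returning True if the output
    satisfies the agent's contract). Failed verification triggers a local retry."""
    order = TopologicalSorter({name: spec["deps"] for name, spec in specs.items()})
    outputs = {}
    for name in order.static_order():
        spec = specs[name]
        inputs = {dep: outputs[dep] for dep in spec["deps"]}
        for _attempt in range(max_retries + 1):
            result = spec["run"](inputs)
            if spec["verify"](result):
                outputs[name] = result
                break
        else:
            # Retries exhausted: a fuller system would escalate to upstream
            # or structural recovery here instead of raising.
            raise RuntimeError(f"verification failed for agent {name!r}")
    return outputs

# Toy graph: a planner produces numbers, a worker sums them.
specs = {
    "plan": {"deps": [], "run": lambda _: [1, 2, 3], "verify": lambda out: len(out) > 0},
    "sum": {"deps": ["plan"], "run": lambda inp: sum(inp["plan"]),
            "verify": lambda out: isinstance(out, int)},
}
results = run_agent_graph(specs)
```

Gating each node before its dependents run keeps an unverified intermediate output from propagating downstream, which is the failure mode the abstract highlights.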

2605.25232 2026-05-26 cs.SE cs.AI cs.LO 版本更新

Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution

基于规约的代码-文本-代码重构：面向LLM介导的软件演化

Oleg Grynets, Vasyl Lyashkevych, Arsen Dolichnyi, Roman Piznak, Taras Zelenyy, Volodymyr Morozov

发表机构 * EPAM Systems（EPAM系统）； McLean, Virginia, USA（美国弗吉尼亚州麦莱恩）； Lviv, Ukraine（乌克兰利沃夫）； Kyiv, Ukraine（乌克兰基辅）

AI总结提出一种基于规约的Code2Text2Code重构框架，通过将源代码转换为中性文本规约并迭代验证，解决直接Code2Code转换中的语义漂移、行为变化等问题，实现受控的LLM介导软件演化。

Comments 15 pages, 9 figures, 7 tables, 39 references

详情

AI中文摘要

直接的Code2Code转换仍然难以控制，因为它可能保留表面语法，同时引入语义漂移、隐藏的行为变化、可追溯性丧失、非惯用的目标实现或领域逻辑的不完整重建。本文提出了一种基于规约的Code2Text2Code重构框架，用于LLM介导的软件演化。核心思想是将源代码转换为中性的文本规约，该规约捕获程序行为、标识符、计算流程、条件、副作用、数据依赖和领域特定意图，而不直接转移源语言语法。所提出的框架结合了事实上下文提取、Code2Text生成、源代码与文本规约之间的迭代验证、Text2Code生成、目标代码验证、检索增强接地、语义感知分块以及转换损失估计。知识表示层集成了来自AST的元数据、基于图的依赖结构、中性自然语言规约、技术文档、业务文档和架构级表示。进行的实验包括从多种编程语言和SQL方言构建的Code2Text2Code数据集、中间表示比较、检索评估、文档转换评估以及使用DSPy进行提示调优。实现了使用结构保留、反向兼容性、接口稳定性和总图相似性的图形式化来估计转换损失。结果支持将Code2Text2Code方法解释为一种受控的基于规约的重构过程，用于LLM介导的软件演化，而非简单的代码转换。

英文摘要

Direct Code2Code transformation remains challenging to control because it can preserve surface-level syntax while introducing semantic drift, hidden behavioral changes, loss of traceability, non-idiomatic target implementations, or incomplete reconstruction of domain logic. This paper proposes a specification-based Code2Text2Code reengineering framework for LLM-mediated software evolution. The central idea is to transform source code into a neutral textual specification that captures program behavior, identifiers, computational flow, conditions, side effects, data dependencies, and domain-specific intent without directly transferring the source language syntax. The proposed framework combines factual context extraction, Code2Text generation, iterative verification between source code and text specification, Text2Code generation, target code verification, retrieval-augmented grounding, semantic-aware chunking, and transformation loss estimation. The knowledge representation layer integrates metadata derived from AST, graph-based dependency structures, neutral natural language specifications, technical documentation, business documentation, and architecture-level representations. The conducted experiments include a Code2Text2Code dataset built from multiple programming languages and SQL dialects, comparison of intermediate representations, retrieval evaluation, documentation transformation evaluation, and prompt tuning using DSPy. A graph formalization using structural preservation, reverse compatibility, interface stability, and total graph similarity is implemented to estimate transformation losses. The results support the interpretation of the Code2Text2Code approach not as a simple code transformation, but as a controlled specification-based reengineering process for LLM-mediated software evolution.

2605.25210 2026-05-26 cs.LG cs.AI stat.ML 版本更新

Multi-Objective Learning for Diffusion Models: A Statistical Theory under Semi-Supervised Learning

扩散模型的多目标学习：半监督学习下的统计理论

Ziheng Cheng, Yixiao Huang, Hanlin Zhu, Haoran Geng, Somayeh Sojoudi, Jitendra Malik, Pieter Abbeel, Xin Guo

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结针对扩散模型在多目标学习中因模型容量增大导致统计成本高的问题，提出半监督两阶段训练方法，利用未标记数据通过伪样本蒸馏，证明所需配对样本量仅取决于专家模型复杂度。

详情

AI中文摘要

扩散模型越来越多地被用作强大的条件生成器，然而实际部署通常涉及来自不同任务的多个目标分布，例如文本到图像生成中的多样化提示域，或机器人技术中具有扩散策略的多个环境。这自然引出了多目标学习（MOL）问题。一个关键挑战是，实现良好的帕累托权衡可能需要一个通用模型类，其容量远大于解决任何单个任务所需的容量，从而增加了统计成本，因为样本复杂度通常随模型复杂度而扩展。为了调和这一点，我们为有限数据下的扩散模型开发了一个原则性的多目标学习框架：一种半监督机制，其中配对（标记）样本稀缺，但（未标记）条件数据丰富。我们提出了一种两阶段训练程序，首先从有限的配对数据中拟合轻量级专家模型，然后通过生成伪样本将它们蒸馏成一个通用模型。我们建立了泛化界限，表明所需的配对样本数量仅取决于专家模型类的复杂度。我们进一步将理论扩展到用于序列决策的扩散策略，以考虑在线策略展开中的分布偏移。在机器人控制和图像恢复任务上进行了大量实验，以验证我们的理论结果。

英文摘要

Diffusion models are increasingly used as powerful conditional generators, yet real deployments often involve multiple target distributions arising from different tasks, e.g., diverse prompt domains in text-to-image generation, or multiple environments in robotics with diffusion policies. This naturally leads to a multi-objective learning (MOL) problem. A key challenge is that achieving good Pareto trade-offs can require a generalist model class with substantially larger capacity than what suffices for solving any individual task, thereby increasing statistical cost since sample complexity typically scales with the model complexity. To reconcile this, we develop a principled MOL framework for diffusion models with limited data: a semi-supervised regime where paired (labeled) samples are scarce, but (unlabeled) condition data are abundant. We propose a two-stage training procedure that first fits lightweight specialist models from limited paired data, and then distills them into a generalist model by generating pseudo-samples. We establish generalization bounds showing that the required number of paired samples only depends on the complexity of the specialist model classes. We further extend the theory to diffusion policies for sequential decision making to account for distribution shift in on-policy rollouts. Extensive experiments on robotic control and image restoration tasks are conducted to verify our theoretical results.


2605.25203 2026-05-26 cs.LG cs.AI cs.LO 版本更新

Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization

基于影响启发的谱旋转用于极端低位LLM量化

Gorgi Pavlov

发表机构 * Lehigh University（理海大学）

AI总结本文利用伴随理论论文的影响自适应Walsh几何，通过WHT旋转和列缩放结合重构误差量化器，实现极端低位权重量化，在多个模型上降低困惑度15-58%。

Comments 14 pages, no figures. Companion application paper to arXiv:2605.01637 (theory). Code and pinned eval stack: https://github.com/gogipav14/spectral-llm


AI中文摘要

我们将伴随理论论文（arXiv:2605.01637）的影响自适应Walsh几何应用于极端低位仅权重量化。方法是一个数学不变的变换：对每个线性层的权重矩阵进行WHT旋转，并根据逐坐标Walsh基激活能量重新缩放其列，然后交给重构误差量化器（Intel auto-round）。这使每组整数舍入偏向高谱能量通道。在四个从135M到1.5B参数的预训练仅解码器模型上，BBT-spectral在W2A16下相对于普通auto-round将wikitext-2困惑度降低了15-58%；我们还报告了一个TinyLlama-1.1B辅助数据点。三个扩展将方法迁移到其失败的族：针对Qwen3注意力的每头PCA矩阵-Gamma替换q_norm/k_norm（Qwen3-0.6B上PPL从136.76降至88.99）；与RoPE可交换的SO(2)每对旋转（Qwen2.5-1.5B上PPL从36.93降至21.84）；以及通过架构模糊测试发现的Laguna风格融合专家布局的MoE感知输入侧吸收修复。W2与W4的消融实验给出了一个故意的阴性对照：在W4下，重新分配收益落在±0.5 PPL噪声基底内，这与Schur-凸性直觉一致，即非集中影响成本随噪声预算缩小而消失。所有量化权重导出为OpenVINO IR，并在Intel NPU + Arc dGPU + CPU上运行，PPL在设备间变化在±0.1内。我们不声称将理论论文的majorization论证形式化为布尔到实数值的迁移：这里使用的WHT激活能量不是理论论文的布尔影响，联系是直观的，贡献在于工程价值而非迁移定理。与SpinQuant、QuaRot、QuIP-sharp、AQLM、OmniQuant和ButterflyQuant在匹配校准下的头对头基准测试是未来的主要工作。

英文摘要

We apply the influence-adaptive Walsh geometry of a companion theory paper (arXiv:2605.01637) to extreme low-bit weight-only LLM quantization. The recipe is one math-invariant transformation: WHT-rotate each linear layer's weight matrix and rescale its columns by per-coordinate Walsh-basis activation energy before handing off to a reconstruction-error quantizer (Intel auto-round). This biases per-group integer rounding toward high-spectral-energy channels. On four pretrained decoder-only models from 135M to 1.5B parameters, BBT-spectral reduces wikitext-2 perplexity by 15-58% relative to vanilla auto-round at W2A16; we also report a TinyLlama-1.1B auxiliary data point. Three extensions transfer the recipe to families it failed on: a per-head PCA matrix-Gamma replacement of q_norm/k_norm for Qwen3 attention (PPL 136.76 -> 88.99 on Qwen3-0.6B); an SO(2) per-pair rotation that commutes with RoPE (PPL 36.93 -> 21.84 on Qwen2.5-1.5B); and an MoE-aware input-side absorption fix identified by architectural fuzzing of Laguna-style fused-expert layouts. A W2-vs-W4 ablation gives a deliberate negative control: the redistribution payoff falls within the +/-0.5 PPL noise floor at W4, consistent with the Schur-convexity intuition that the cost of unconcentrated influence vanishes as the noise budget shrinks. All quantized weights export to OpenVINO IR and run on Intel NPU + Arc dGPU + CPU with PPL invariant to device within +/-0.1. We do not claim a formal Boolean-to-real-valued transfer of the theory paper's majorization argument: the WHT activation energy used here is not the Boolean influence of the theory paper, the link is intuitive, and the contribution is engineering value rather than a transferred theorem. Head-to-head benchmarks against SpinQuant, QuaRot, QuIP-sharp, AQLM, OmniQuant, and ButterflyQuant at matched calibration are the main future-work item.
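The "math-invariant" claim in the abstract can be checked numerically. The sketch below is not the authors' implementation: the function names, the RMS definition of activation energy, and the choice to multiply weight columns by that energy (dividing the activations to compensate) are all assumptions made for illustration. It only shows that a Walsh-Hadamard rotation plus a per-coordinate rescaling leaves the layer output unchanged before any quantizer is applied.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def rotate_and_scale(weight: np.ndarray, calib_inputs: np.ndarray, eps: float = 1e-6):
    """Return (new_weight, transform) with transform(x) @ new_weight.T == x @ weight.T.

    weight has shape (out_features, in_features); calib_inputs has shape
    (samples, in_features). The RMS energy below is a hypothetical stand-in
    for the paper's Walsh-basis activation energy.
    """
    d = weight.shape[1]
    h = hadamard(d)
    # Per-coordinate activation energy measured in the Walsh basis.
    energy = np.sqrt(np.mean((calib_inputs @ h.T) ** 2, axis=0)) + eps
    # Rotate the weight, then scale column j by energy[j].
    new_weight = (weight @ h.T) * energy

    def transform(x: np.ndarray) -> np.ndarray:
        # Apply the same rotation to the activations and undo the scaling.
        return (x @ h.T) / energy

    return new_weight, transform
```

In a real pipeline `new_weight` would be handed to the quantizer, and the invariance holds only up to the quantization error introduced at that step.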


2605.25198 2026-05-26 cs.LG cs.AI 版本更新

Hide to Guide: Learning via Semantic Masking

隐藏以引导：通过语义掩码学习

Ruitao Liu, Qinghao Hu, Alex Hu, Yecheng Wu, Shang Yang, Luke J. Huang, Zhuoyang Zhang, Han Cai, Song Han

发表机构 * MIT（麻省理工学院）； NVIDIA（英伟达）

AI总结提出语义掩码专家策略优化（SMEPO），通过掩码专家轨迹中与奖励相关的语义片段，将困难问题转化为填空过程，提升强化学习在推理密集型任务中的探索效率。


AI中文摘要

具有可验证奖励的强化学习（RLVR）已成为提升语言模型在推理密集型任务上性能的强大范式，但其有效性常受限于探索。例如，模型在困难问题上常常失败，留下很少有用的奖励信号。外部专家轨迹提供了一种自然的引导来源，但它们也可能在通往验证器目标的关键路径上暴露与奖励相关的内容，如最终答案、中间值、可执行实现或与答案相关的实体。这些内容可能创建意外的奖励黑客通道，使策略通过复制轨迹而非学习底层推理或智能体行为来获得奖励。现有的引导式RL方法通过使用部分轨迹来降低这种风险，但它们主要启发式地控制展示多少专家信息，而非控制应隐藏哪些部分。为此，我们提出语义掩码专家策略优化（SMEPO），一种用于专家引导RLVR的细粒度语义掩码策略。SMEPO不是粗略地截断轨迹或原样展示，而是在保留专家分解、计划和过程结构的同时，掩码关键路径上与奖励相关的语义片段。这将困难问题从从头推理转变为填空过程：策略可以遵循专家的问题解决路径，但仍需自行重建缺失的值、代码或实体。SMEPO易于应用，无需更改奖励函数或RL目标。在包括数学、代码和智能体搜索在内的多个领域，SMEPO相比GRPO将准确率提升最多3.2个百分点，并将训练时间减少最多4.2倍。代码已开源：https://github.com/mit-han-lab/SMEPO。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive tasks, but its effectiveness is often limited by exploration. For example, models often fail on hard problems, leaving little useful reward signal. External expert traces offer a natural source of guidance, yet they may also expose reward-relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer-related entities. This content can create an unintended reward hacking channel, allowing the policy to obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided-RL methods reduce this risk by using partial trajectories, but they mainly control how much expert information is shown heuristically rather than which parts should be hidden. To this end, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained semantic masking strategy for expert-guided RLVR. Instead of truncating traces coarsely or revealing them unchanged, SMEPO masks reward-relevant semantic spans along the critical path while preserving the expert's decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill-in-the-blank process: the policy can follow the expert's problem-solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x. The code is available at https://github.com/mit-han-lab/SMEPO.
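The fill-in-the-blank idea can be illustrated with a small helper that hides selected spans of an expert trace while keeping the surrounding structure. This is a sketch under stated assumptions, not SMEPO itself: the function name, the character-offset span format, and the `[MASK]` token are hypothetical, and the abstract does not say how reward-relevant spans are identified.

```python
def mask_spans(trace: str, spans: list[tuple[int, int]], token: str = "[MASK]") -> str:
    """Replace each half-open character span [start, end) in trace with token.

    Spans must be non-empty, in range, and non-overlapping.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or end > len(trace) or start >= end:
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        pieces.append(trace[cursor:start])  # keep the unmasked text
        pieces.append(token)                # hide the reward-relevant span
        cursor = end
    pieces.append(trace[cursor:])
    return "".join(pieces)
```

Given a trace such as `"Let x = 7. Then 2x = 14."`, masking the two numeric values leaves the plan intact but forces the policy to recompute the values itself.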


2605.25196 2026-05-26 cs.CY cs.AI 版本更新

Beyond Killer Robots: General AI Attitudes and Public Support for Military AI in Nine Countries

超越杀手机器人：九个国家中对人工智能的总体态度与公众对军事人工智能的支持

Andreas Jungherr, Antonia Schlude, Adrian Rauchfleisch

发表机构 * University of Bamberg（班贝格大学）； Bavarian Research Institute for Digital Transformation (bidt)（巴伐利亚数字转型研究所）； National Taiwan University（国立台湾大学）

AI总结基于九国调查，研究公众对军事AI的支持主要受对AI的总体态度、对致命自主性的原则性反对还是外交政策取向影响，发现认为AI有益者更支持，而原则性反对仅与完全自主致命武力相关。


AI中文摘要

人工智能赋能的军事系统是现代军事冲突的常见特征。应用范围从用于监视和攻击的自主无人机到AI支持的目标选择。AI对现代冲突的重要性也体现在政府与科技公司之间关于军事获取前沿AI条件的公开争议中。军事用途以及政府试图推动和引导这些用途的行为都发生在公众舆论的背景下，然而我们对人们如何看待军事AI仍知之甚少。基于一项在包括中国、德国和美国在内的九个国家中对9000名受访者进行的预注册调查，我们检验了军事AI的支持是否主要由对AI的通用态度、对致命自主性的原则性反对，或外交政策和地缘政治取向所塑造。在六个在致命性和人类控制方面有所不同的军事AI场景中，认为AI有益的受访者明显更支持军事AI。鹰派受访者也更支持。相比之下，对致命自主性的原则性反对与整体指数没有广泛关联，但与完全自主致命武力的应用相关。与我们的预期相反，感知到的AI风险与支持呈正相关。跨国差异适中，且与地缘政治背景大致一致。总体而言，公众对军事AI的舆论似乎是有条件地宽容的。公众并不绝对反对AI的各种军事用途。相反，不安主要集中在完全自主的致命武力上。

英文摘要

AI-enabled military systems are a fixture of modern military conflict. Applications vary from autonomous drones for surveillance and attack to AI-supported target selection. The importance of AI for modern conflict shows also in public disputes between governments and technology companies over the conditions for military access to frontier AI. Both military uses and government attempts at enabling and steering them happen before a backdrop of public opinion, yet we still know little about how people think about military AI. Drawing on a preregistered survey of 9,000 respondents in nine countries, including China, Germany, and the United States, we examine whether support for military AI is shaped primarily by general attitudes toward AI, principled opposition to lethal autonomy, or foreign-policy and geopolitical orientations. Across six military AI scenarios that vary in lethality and human control, respondents who view AI as beneficial are substantially more supportive of military AI. Hawkish respondents are also more supportive. By contrast, principled opposition to lethal autonomy is not broadly associated with the full index but is related to the application of fully autonomous lethal force. Contrary to our expectation, perceived AI risks are positively associated with support. Cross-national differences are moderate and broadly consistent with geopolitical context. Overall, public opinion toward military AI appears conditionally permissive. Publics are not categorically opposed to various military uses of AI. Instead, unease is concentrated around fully autonomous lethal force.


2605.25188 2026-05-26 cs.AI 版本更新

DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

DarkForest: 少说话，多智能体LLM更高精度

Yi Li, Songtao Wei, Dongming Jiang, Zhichun Guo, Qiannan Li, Bingzhe Li

发表机构 * University of Texas at Dallas（德克萨斯大学达拉斯分校）； Independent Researcher（独立研究者）； University of California, Davis（加州大学戴维斯分校）

AI总结提出DarkForest框架，通过保持智能体独立、结构化解析响应并基于信念分布协调，减少通信开销和错误传播，在六个推理基准上实现领先质量并大幅降低令牌消耗。


AI中文摘要

多智能体LLM系统通过组合多个智能体的输出来改进推理，但交互密集型方法可能导致错误传播和高通信开销。当智能体交换原始响应或推理轨迹时，不正确的中间推理可能被采纳和放大，导致自信但错误的共识；多轮通信也增加了令牌消耗、延迟和推理成本。在本文中，我们提出了一种名为DarkForest的受控通信协调框架。DarkForest首先保持智能体独立，因此每个智能体在不看到其他智能体输出的情况下产生答案。然后，它将原始响应解析为结构化候选记录，将语义等价的候选记录分组为聚类，并使用智能体可靠性、置信度、解析质量、支持模式可靠性和独立性校正来估计这些聚类上的校准信念分布。协调器仅从该信念状态接收策略允许的证据，并进行受控通信。在六个推理基准上的实验表明，DarkForest实现了领先的整体质量，在基准指标上比最强基线提高了30.7%，并且与通信密集型基线相比，令牌消耗减少了高达6.5倍。

英文摘要

Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to confident but wrong consensus; multi-round communication also increases token consumption, latency, and inference cost. In this paper, we propose a controlled-communication coordination framework named DarkForest. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others' outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support-pattern reliability, and independence corrections. A coordinator receives only policy-permitted evidence from this belief state with controlled communication. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves the strongest baseline by up to 30.7% on benchmark metrics, and reduces token consumption by up to 6.5x compared with communication-heavy baselines.
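The cluster-then-estimate step can be sketched in a few lines. This is a deliberately reduced illustration, not the DarkForest algorithm: it uses string normalization as a stand-in for semantic equivalence and weights each answer only by reliability times confidence, omitting the parse-quality, support-pattern, and independence terms named in the abstract. All names are hypothetical.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Crude stand-in for semantic equivalence: lowercase, collapse spaces, drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")


def belief_over_clusters(records):
    """Map each answer cluster to its share of the total weighted support.

    records is an iterable of (answer, reliability, confidence) tuples,
    one per independent agent.
    """
    mass = defaultdict(float)
    for answer, reliability, confidence in records:
        mass[normalize(answer)] += reliability * confidence
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("no positive support in records")
    return {cluster: weight / total for cluster, weight in mass.items()}
```

A coordinator could then act on the returned distribution, for example by accepting the top cluster only when its share exceeds a chosen margin.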


2605.25186 2026-05-26 cs.CL cs.AI 版本更新

By Their Fruits You Will Know Them: Comparing Formalizations of Law by the Decisions They Encode

凭其果实，你们将认识它们：通过编码的决策比较法律的形式化

Julius Vernie, Matthias Grabmair

发表机构 * Technical University of Munich（慕尼黑工业大学）

AI总结提出一种系统方法，通过SAT求解器枚举不同形式化在边缘案例上的分歧，并转化为具体事实场景，以比较同一法律条款的不同形式化，应用于九个前沿LLM生成的十个欧盟条款形式化，发现行为分歧与结构一致性基本不相关。

Comments 23 pages, 17 figures, submitted to EMNLP PROC 2026


AI中文摘要

将法律条款形式化有望实现机器可访问的法律和自动化法律推理，而最近的LLM使得直接从法规文本生成这种形式化变得诱人。然而，任何形式化都会做出隐含的解释选择，其后果难以预料，尤其是当LLM是作者时。我们提出了一种方法，通过它们在个别案例上的推理，系统地比较同一法律条款的不同形式化。给定一个条款的多个形式化，我们在节点级别匹配它们，从匹配中为每对推导出一个共享接口，并使用SAT求解器枚举任意两个形式化存在分歧的边缘案例。然后将选定的边缘案例转化为具体的事实场景，供法律专家检查并采取行动。我们将该方法应用于九个前沿LLM生成的十个欧盟条款的形式化。我们发现，形式化之间的行为分歧与其结构一致性基本不相关，并且口头化的案例揭示了定性的不同分歧类型，包括反映法律评论中真实争议的分歧。

英文摘要

Formalizing legal provisions promises machine-accessible law and automated legal reasoning, and recent LLMs make it tempting to generate such formalizations directly from statutory text. However, any formalization makes implicit interpretive choices whose consequences are hard to anticipate, especially if an LLM is the author. We present a method for systematically comparing different formalizations of the same legal provision by their inferences on individual cases. Given multiple formalizations of a provision, we match them at the node level, derive a shared interface for each pair from the matching, and use a SAT solver to enumerate the edge cases on which any two formalizations disagree. Selected edge cases are then verbalized into concrete factual scenarios that a legal expert can examine and act on. We apply our method to formalizations of ten EU provisions generated by nine frontier LLMs. We find that behavioral divergence between formalizations is essentially uncorrelated with their structural agreement and that the verbalized cases reveal qualitatively distinct types of disagreement, including divergences that mirror genuine controversies in the legal commentary.
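The core comparison, finding cases on which two formalizations disagree, can be illustrated without a SAT solver by brute-force enumeration over a shared interface. This is a toy sketch: the two rules and their variable names are invented for the example, and exhaustive enumeration is swapped in for the SAT-based enumeration the abstract describes, so it only scales to a handful of variables.

```python
from itertools import product

# Hypothetical shared interface for two formalizations of a toy provision.
VARIABLES = ["consent", "contract", "necessary"]


def formalization_a(case: dict) -> bool:
    """Lawful if consent is given, or a contract exists and processing is necessary for it."""
    return case["consent"] or (case["contract"] and case["necessary"])


def formalization_b(case: dict) -> bool:
    """Looser reading: any contract suffices."""
    return case["consent"] or case["contract"]


def disagreements(f, g, variables):
    """Yield every truth assignment on which f and g return different verdicts."""
    for values in product([False, True], repeat=len(variables)):
        case = dict(zip(variables, values))
        if f(case) != g(case):
            yield case
```

Calling `list(disagreements(formalization_a, formalization_b, VARIABLES))` returns the single edge case in which a contract exists, no consent is given, and the processing is not necessary, which is the kind of scenario that would then be verbalized for a legal expert.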


2605.25181 2026-05-26 cs.AI 版本更新

SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation

SpecAlign: 一种用于 SystemVerilog 断言生成的语义对齐框架

Jaime Rafael Imperial, Hao Zheng

发表机构 * University of South Florida（南佛罗里达大学）

AI总结提出 SpecAlign 框架，通过基于蕴含的分类和自一致性投票机制，评估并改进 LLM 生成的 SVA 与自然语言规范之间的语义对齐，无需黄金 RTL。


AI中文摘要

现有的大语言模型（LLM）方法在生成 SystemVerilog 断言（SVA）时主要关注语法有效性和形式验证结果，而生成的断言与自然语言规范之间的语义对齐仍然难以量化。因此，在缺乏黄金 RTL 的情况下，幻觉或未对齐的 SVA 会降低信心并增加调试工作。本文提出了 SpecAlign，一个用于语义评估和优化 LLM 生成的 SVA 的框架。SpecAlign 引入了两个迭代对齐循环，通过基于蕴含的分类来评估自然语言属性和 SVA 是否符合设计规范。我们通过链式思维提示生成多个推理路径，并通过自一致性投票机制聚合它们，从而改进对齐决策。对未对齐的断言进行分析以生成可操作的反馈用于优化。我们进一步定义了一个定量对齐分数来衡量迭代过程中的语义一致性。实验结果表明，SpecAlign 能够有效检测语义不一致性，并在不依赖黄金 RTL 的情况下改进断言对齐，为传统形式验证评估指标提供了可扩展的补充。

英文摘要

Existing Large Language Model (LLM) approaches to SystemVerilog Assertion (SVA) generation primarily focus on syntactic validity and formal verification outcomes, while semantic alignment between generated assertions and natural language specifications remains difficult to quantify. As a result, hallucinated or misaligned SVAs can reduce confidence and increase debugging efforts in the absence of golden RTL. This paper presents SpecAlign, a framework for semantic evaluation and refinement of LLM-generated SVAs. SpecAlign introduces two iterative alignment loops that assess both natural language properties and SVAs against the design specification using entailment-based classification. We improve alignment decisions by generating multiple reasoning paths using chain-of-thought prompting and aggregating them via a self-consistency voting mechanism. Misaligned assertions are analyzed to generate actionable feedback for refinement. We further define a quantitative alignment score to measure semantic consistency across iterations. Experimental results demonstrate that SpecAlign effectively detects semantic inconsistencies and improves assertion alignment without relying on golden RTL, providing a scalable complement to traditional formal verification evaluation metrics.
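Self-consistency voting over several reasoning paths can be sketched as a majority vote per assertion, followed by a simple aggregate score. The label names, the tie rule, and the score definition (fraction of assertions voted "aligned") are assumptions made for illustration; the abstract does not give the paper's exact alignment score.

```python
from collections import Counter


def vote(labels):
    """Return the majority label among sampled reasoning paths, or 'uncertain' on a tie."""
    counts = Counter(labels).most_common()
    if not counts:
        raise ValueError("at least one label is required")
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "uncertain"
    return counts[0][0]


def alignment_score(per_assertion_labels):
    """Fraction of assertions whose voted label is 'aligned' (hypothetical definition)."""
    verdicts = [vote(labels) for labels in per_assertion_labels]
    return sum(v == "aligned" for v in verdicts) / len(verdicts)
```

Tracking this score across refinement iterations would give one way to see whether feedback on misaligned assertions is helping.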


2605.25170 2026-05-26 cs.LG cs.AI cs.ET cs.RO 版本更新

Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation

生长-剪枝-冻结网络：用于嗅觉导航的自适应与持续学习技术

Kordel K. France, Ovidiu Daescu

AI总结提出生长-剪枝-冻结（GPF）网络框架，通过动态调整策略网络层数实现持续学习，在湍流羽流导航任务中达到94%成功率，并推广到其他机器学习任务。


AI中文摘要

嗅觉训练数据分散在非标准化的数据集中，限制了构建代表性世界模型的能力。嗅觉导航是一项高度动态和非平稳的任务，受益于实时持续学习。我们引入了一种名为生长-剪枝-冻结（GPF）网络的自适应框架，使智能体能够通过生长、剪枝和冻结其策略的早期层来持续学习，以应对世界复杂性。将GPF基于非线性随机矩阵理论，我们展示了Pennington & Worth（2017）的工作可以从单隐藏层扩展到n层持续学习模型，并且网络权重的特征值组成在添加连续层时得以保持。我们展示了基于期望SARSA的GPF在湍流羽流导航上实现了94%的成功率——这是一个部分可观测、非平稳的任务，代表了激发机器人自适应学习的“大世界”挑战——并提供了将GPF应用于其他世界模型的支撑方法。进一步的实验表明，GPF可能很好地推广到其他机器学习任务，如Atari中的强化学习、图像分类和自回归语言模型。我们开源所有代码和数据，以鼓励对嗅觉机器人技术的改进和更多研究。

英文摘要

Training data for olfaction is scattered through disparate, non-standardized datasets that limit the ability to build representative world models. Olfactory navigation is a highly dynamic and non-stationary task that benefits from real-time continual learning. We introduce an adaptive framework called Grow-Prune-Freeze (GPF) networks that enable an agent to continually learn through growing, pruning, and freezing early layers of its policy in response to world complexity. Grounding GPFs in non-linear random matrix theory, we show that the work of Pennington & Worah (2017) can be extended from single hidden layers to n-layer continual-learning models, and that eigenvalue composition of network weights is preserved as successive layers are added. We show that GPFs based on Expected SARSA achieve a 94% success rate on turbulent plume navigation - a partially observable, non-stationary task representative of the "big world" challenges that motivate adaptive learning in robotics - and provide supporting methodology for applying GPFs in other world models. Further experiments provide evidence that GPFs may generalize well to other machine learning tasks such as reinforcement learning in Atari, image classification, and autoregressive language models. We open-source all code and data to encourage improvements on and more research in olfactory robotics.


2605.25166 2026-05-26 cs.LG cs.AI 版本更新

AME-TS: Anchored Mixture-of-Experts for Time Series Forecasting

AME-TS：基于锚定的混合专家模型用于时间序列预测

Rui Wang, Renhao Xue, Ray Razi, Huan Song, Hannah R. Marlowe

发表机构 * Amazon Web Services（亚马逊网络服务）

AI总结提出AME-TS，一种结构引导的稀疏时间序列基础模型，通过轻量级预测器估计序列级描述符并生成专家软结构先验，实现专家路由与可解释时间结构对齐，在GIFT-Eval基准上实现精度-效率权衡，并在M5微调中展现更稳定的专家专业化。


AI中文摘要

时间序列预测模型通过大型Transformer骨干不断扩展规模，但大多数现有方法通过共享密集计算路径处理所有序列，尽管时间结构存在显著异质性。混合专家模型（MoE）通过条件计算提供了一种自然替代方案，但标准MoE路由导致专家专业化识别弱且在下游适应中常不稳定。我们提出AME-TS，一种结构引导的稀疏时间序列基础模型，将专家路由与可解释的时间结构对齐。AME-TS首先使用轻量级预测器估计序列级描述符，包括可预测性、季节性、趋势和稀疏性，并将其映射为专家上的软结构先验。该序列级先验在训练期间指导令牌级路由，鼓励结构对齐的专业化。在GIFT-Eval基准上，AME-TS在不同模型规模下提供了强大的精度-效率权衡：在小型模型规模上显著优于现有时间序列基础模型，在较大规模上与最强模型保持竞争力，同时通过稀疏路由激活显著更少的参数。我们进一步表明，在M5数据集微调期间，AME-TS学习了比标准MoE更可解释的路由几何和更稳定的专家专业化。这些结果表明，结构感知路由是实现稀疏专家模型在时间序列预测中优势的有效且可靠方式。

英文摘要

Time series forecasting models are increasingly scaled through large Transformer backbones, yet most existing approaches process all series through a shared dense computation path despite substantial heterogeneity in temporal structure. Mixture-of-Experts (MoE) offers a natural alternative by enabling conditional computation, but standard MoE routing leaves expert specialization weakly identified and often unstable during downstream adaptation. We propose AME-TS, a structure-guided sparse time series foundation model that aligns expert routing with interpretable temporal structure. AME-TS first uses a lightweight regime predictor to estimate series-level descriptors, including forecastability, seasonality, trend, and sparsity, and maps them to a soft structural prior over experts. This series-level prior guides token-level routing during training, encouraging structure-aligned specialization. On the GIFT-Eval benchmark, AME-TS delivers a strong accuracy-efficiency tradeoff across model scales: it substantially outperforms existing time series foundation models at small model scales and remains competitive with the strongest models at larger scales, while activating substantially fewer parameters through sparse routing. We further show that AME-TS learns more interpretable routing geometry and substantially more stable expert specialization than standard MoE during fine-tuning on the M5 dataset. These results suggest that structure-aware routing is an effective and reliable way to realize the benefits of sparse expert models for time series forecasting.


2605.25163 2026-05-26 cs.CV cs.AI 版本更新

K-U-KAN: Koopman-Enhanced U-KAN for 3D Dental Reconstruction from a Single Panoramic X-ray Radiograph

K-U-KAN: 基于Koopman增强的U-KAN用于单张全景X射线片的三维牙齿重建

Bikram Keshari Parida, Abhijit Sen, Wonsang You

发表机构 * Artificial Intelligence & Image Processing Lab., Department of Information & Communication Engineering, Sun Moon University, Asan-Si, South Korea ； Department of Physics & Engineering Physics, Tulane University, New Orleans, LA, USA

AI总结提出K-U-KAN三阶段流水线，结合Kolmogorov-Arnold网络、Koopman算子与U-KAN，从单张全景X射线高效重建三维牙齿结构，提升感知质量并缩短训练时间。

Comments 24 pages, 9 figures


AI中文摘要

全景X射线将三维颌骨压缩为二维条带；我们的目标是干净且快速地恢复缺失的深度。现有的隐式神经表示能渲染逼真的体积，但训练缓慢，对采样和位置编码敏感，且实际成本高。纯CNN基线效率高，但难以处理牙弓的长程几何，模糊了精细的釉质-牙本质边界，且可解释性差。我们提出K-U-KAN，一个三阶段流水线：(i) 使用Kolmogorov-Arnold网络将二维特征提升为深度感知的可观测变量，(ii) 通过Koopman令牌块以稳定的、相位感知的线性演化推进这些可观测变量，(iii) 将预测的深度区间放置在焦槽射线上，然后由轻量级3D注意力U-KAN细化体积。这种物理（Beer-Lambert图像形成）、几何（马蹄形焦槽）和学习线性动力学的结合，在批量大小为1的原生射线强度上产生了清晰的解剖结构、更少的伪影和鲁棒的行为。在保留数据上，K-U-KAN在信号和结构指标上与Transformer/隐式基线相当，显著提高了感知质量，并且训练时间大约减半——使单视图全景X射线到锥形束CT重建在临床流程中更加实用。

英文摘要

A panoramic X-ray compresses a 3D jaw into a 2D strip; we aim to recover the missing depth cleanly and fast. Existing implicit neural representations render realistic volumes but are slow to train, sensitive to sampling and positional encodings, and costly in practice. Pure CNN baselines are efficient yet struggle with the dental arch's long-range geometry, blur fine enamel-dentin boundaries, and offer little interpretability. We present K-U-KAN, a three-stage pipeline that (i) lifts 2D features into depth-aware observables with Kolmogorov-Arnold Networks, (ii) advances these observables by a stable, phase-aware linear evolution via a Koopman token block, and (iii) places the predicted depth bins onto focal-trough rays before a lightweight 3D attention U-KAN refines the volume. This marriage of physics (Beer-Lambert image formation), geometry (horseshoe focal trough), and learned linear dynamics yields sharp anatomy, fewer artifacts, and robust behavior on native radiographic intensities with batch size one. On held-out data, K-U-KAN matches transformer/implicit baselines on signal and structure metrics, clearly improves perceptual quality, and trains in roughly half the time, making single-view PX → CBCT reconstruction more practical for clinical pipelines.


2605.25162 2026-05-26 cs.CL cs.AI 版本更新

STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

STREAM：一个以数据为中心的框架，用于从流媒体中挖掘高价值任务导向对话

Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Byering Technology（伯英技术）

AI总结提出STREAM框架，利用流媒体数据合成大规模多领域任务导向对话数据集StreamDial，通过角色构建和对话蓝图结合RAG生成高质量对话，解决数据稀缺问题。


AI中文摘要

垂直领域的大语言模型受到复杂、特定领域任务导向对话稀缺的瓶颈。现有的数据获取管道面临持续的三难困境：专家标注昂贵，真实服务对话受隐私和商业限制，静态语料库很快过时。我们提出Stream，一个以数据为中心的框架，利用公开可用的流媒体（直播和短视频）大规模合成高价值服务对话。Stream从嘈杂的流中挖掘真实的交互信号，并通过将基于角色的个性构建与对话蓝图构建相结合来合成对话；它进一步采用检索增强生成（RAG）来支持知识感知的响应。基于Stream，我们发布了StreamDial，一个覆盖汽车、餐厅和酒店的大规模多领域数据集。StreamDial总共包含87,498个对话会话和1,497,320轮次，平均每个会话17.11轮，各领域规模相当。每个会话被组织为结构化四元组⟨P_u, P_a, B, H⟩，将对话历史与明确的用户/代理角色和对话蓝图配对，捕捉真实服务行为，如需求挖掘、约束冲突、协商和恢复。使用自动评估和下游任务的评估表明，StreamDial在强基线上提高了内在对话质量，使用StreamDial训练的模型在多个骨干网络上改进了对话状态跟踪；我们进一步报告了完整的人工评估集，并在受控训练预算下在Qwen3-8B上实现了令人鼓舞的多语言迁移。数据发布在https://github.com/hitxueliang/DialogDataSetBySTREAM。

英文摘要

Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become temporally stale. We propose Stream, a data-centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high-value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role-grounded persona construction with Conversational Blueprint construction; it further adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, we release StreamDial, a large-scale multi-domain dataset covering Automotive, Restaurant, and Hotel. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet $\langle P_u, P_a, B, H \rangle$ that pairs dialogue history with explicit user/agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human-evaluation set and encouraging multilingual transfer on Qwen3-8B under a controlled training budget. The data is released at https://github.com/hitxueliang/DialogDataSetBySTREAM.


2605.25156 2026-05-26 cs.LG cs.AI 版本更新

Abduction-Deduction Entanglement: Domain Generalization via Representation Transplants

溯因-演绎纠缠：通过表示移植实现领域泛化

Kasra Jalaldoust, Elias Bareinboim

发表机构 * Columbia University（哥伦比亚大学）

AI总结本文提出一种基于表示移植的方法，通过参数化溯因-演绎纠缠中的非可识别性，在源分布约束下搜索目标分布空间，实现领域泛化中的最优目标预测。


AI中文摘要

在源分布下训练的预测模型通常无法很好地泛化到不同的目标分布。对未见数据分布的有效推断必须依赖于生成源数据和目标数据的某些因果机制的不变性，然而这些结构不变性仅从源数据中是无法识别的。在关于数据的温和因果假设下，我们表明目标中的最优预测实际上部分可由源分布识别。该结果基于一个简单的观察：在任何领域中，最优预测可以分解为我们称之为溯因映射和演绎映射的一对映射，其中溯因映射从观测变量推断某些未观测变量（可能是混杂因素），演绎映射使用观测和推断的量来预测标签。大量源数据的使用固定了最优预测，从而约束了产生它的有效溯因-演绎组合——这种非可识别性我们称之为溯因-演绎纠缠。为了利用这一点，我们使用所谓的表示移植来参数化受约束的族，表示移植是表示空间中的一种特定线性变换，它在保留演绎成分的同时操纵表示的溯因内容。生成标签的因果机制的不变性意味着源和目标之间存在不变的演绎映射。因此，我们可以通过参数化移植来搜索合理的目标分布空间。我们在一个学习器-对手博弈中使用该方案，在理想优化下，该博弈可证明终止于学习器具有极小极大最优目标预测。评估验证了理论，表明该方法在领域泛化基准测试中具有竞争力。

英文摘要

Prediction models trained under the source distribution often do not generalize well to a different target distribution. A valid inference about an unseen data distribution must be anchored by the invariance of certain causal mechanisms that generate the source and target data; however, these structural invariances are non-identifiable from the source data alone. Under mild causal assumptions about the data, we show that the optimal prediction in the target is in fact partially identifiable from the source distribution. The result rests on a simple observation: in any domain, the optimal prediction can be factorized into a pair of maps that we call the abduction and deduction maps, where the abduction map infers some unobserved variables (possibly confounders) from the observed variables and the deduction map predicts the label using both the observed and inferred quantities. Access to large source data pins down the optimal prediction and thus constrains the valid abduction-deduction pairs that produce it, a non-identifiability that we call the abduction-deduction entanglement. To leverage this, we parameterize the constrained family using what we call a representation transplant, which is a specific linear transformation in the representation space that manipulates the abduction content of the representation while retaining the deduction component. Invariance of the causal mechanism generating the label implies the existence of an invariant deduction map between source and target. Thus, we can search the space of plausible target distributions via a parametric transplant. We use this scheme in a learner-adversary game that, under idealized optimization, provably terminates with the learner having the minimax-optimal target prediction. Evaluations verify the theory, showing that the method is competitive on domain generalization (DG) benchmarks.


2605.25151 2026-05-26 cs.AI cs.CE 版本更新

Representation Without Control: Testing the Realization Effect in Language Models

无控制的表征：测试语言模型中的实现效应

Ciarán Walsh, Emilio Barkett

发表机构 * Columbia University（哥伦比亚大学）

AI总结通过提示行为、线性读出和因果控制三个层面，测试语言模型是否表现出类似人类的实现效应，发现潜在读出成功但因果控制无效，表明三者不自动共存。


AI中文摘要

大型语言模型越来越多地被用作行为模拟器，但其输出何时反映类似人类的认知机制而非提示敏感的表面模式仍不清楚。我们通过实现效应研究这一问题，这是行为经济学中一个特征明确的发现，即风险承担在纸面收益与实现收益及损失后存在系统性差异。我们在三个层面评估LLM行为：仅提示的行为敏感性、内部表征的线性读出以及通过激活引导的因果控制。仅提示结果显示系统的条件敏感性，但方向模式未复现人类实现效应的预测。Gemma的残差流在第18层包含一个线性可解码的实现状态信号，该信号可泛化到未见过的提示。然而，沿此方向引导并未可靠地改变下游风险选择，这一零结果在正尺度和负符号对称运行中均成立。行为敏感性、潜在读出和因果控制是三个不同的属性，它们不会自动共存，成功的潜在读出不足以证明模型在下游决策中行为上依赖于该表征。

英文摘要

Large language models are increasingly used as behavioral simulators, but it remains unclear when their outputs reflect human-like cognitive mechanisms rather than prompt-sensitive surface patterns. We study this question through the realization effect, a well-characterized finding in behavioral economics in which risk-taking differs systematically after paper versus realized gains and losses. We evaluate LLM behavior at three levels: prompt-only behavioral sensitivity, linear readout of internal representations, and causal control via activation steering. Prompt-only results show systematic condition sensitivity, but the directional pattern does not reproduce human realization-effect predictions. Gemma's residual stream contains a linearly decodable realization-status signal at layer 18 that generalizes to held-out prompts. Steering along this direction does not, however, reliably shift downstream risk choices, a null result that holds across positive scales and in a negative sign-symmetry run. Behavioral sensitivity, latent readout, and causal control are three distinct properties that do not automatically co-occur, and successful latent readout is insufficient evidence that a model behaviorally relies on a representation during downstream decision-making.


2605.25141 2026-05-26 cs.CL cs.AI 版本更新

LLM-Agent-Based Renewable Energy Forecasting Using Edge and IoT Data: A Review of Solar, Wind, Weather, and Grid-Aware Decision Support

基于LLM Agent的利用边缘和物联网数据的可再生能源预测：太阳能、风能、天气和电网感知决策支持综述

Pavan Manjunath, Thomas Pruefer

发表机构 * Independent Researcher（独立研究员）

AI总结本文综述了如何利用大语言模型代理整合异构传感器流、天气API数据、历史发电记录和电网约束，形成统一的决策支持工作流，以增强可再生能源预测。


AI中文摘要

可再生能源发电的可靠预测是电网稳定性、能源交易、电池调度和碳感知运营规划的基础要求。太阳能和风能资源本质上是间歇性的，其输出随云量、风速、大气湍流、季节模式和局部地形而波动。物联网和边缘设备的普及，包括智能电表、逆变器、风速计、日射强度计、气象站和电网接口传感器，创造了前所未有的实时运行数据量，而传统的预测流程难以充分利用这些数据。本综述研究了大语言模型代理如何通过将异构传感器流、天气API数据、历史发电记录、电网约束和上下文推理整合到统一的决策支持工作流中，来增强可再生能源预测。我们调查了经典预测方法（统计时间序列模型、深度学习架构、物理混合方法）以及新兴的用于解释、不确定性沟通和操作员指导的LLM代理框架。提出了一个六层分类法，涵盖数据采集、预处理、特征工程、模型推理、不确定性估计和自然语言报告。综述识别了十二个开放挑战，包括实时部署、分布偏移下的模型漂移、不确定性量化、LLM代理中的幻觉控制、边缘硬件的互操作性以及与能源管理系统的集成。论文最后建议了一个研究议程，重点关注开放基准、物理信息LLM基础以及联邦预测架构。

英文摘要

Reliable forecasting of renewable energy generation is a foundational requirement for grid stability, energy trading, battery scheduling, and carbon-aware operational planning. Solar and wind resources are inherently intermittent: their output fluctuates with cloud cover, wind speed, atmospheric turbulence, seasonal patterns, and local terrain. The proliferation of IoT and edge devices, spanning smart meters, inverters, anemometers, pyranometers, weather stations, and grid-interface sensors, has created an unprecedented volume of real-time operational data that conventional forecasting pipelines are ill-equipped to exploit fully. This review investigates how large language model (LLM) agents can enhance renewable energy forecasting by integrating heterogeneous sensor streams, weather API data, historical generation records, grid constraints, and contextual reasoning into unified decision-support workflows. We survey classical forecasting methods (statistical time-series models, deep learning architectures, physics-hybrid approaches) and emerging LLM agent frameworks for explanation, uncertainty communication, and operator guidance. A six-layer taxonomy is proposed covering data acquisition, preprocessing, feature engineering, model inference, uncertainty estimation, and natural language reporting. The review identifies twelve open challenges spanning real-time deployment, model drift under distribution shift, uncertainty quantification, hallucination control in LLM agents, interoperability of edge hardware, and integration with energy management systems. The paper concludes by recommending a research agenda centred on open benchmarks, physics-informed LLM grounding, and federated forecasting architectures.


2605.25135 2026-05-26 cs.LG cs.AI 版本更新

ASTRO: Adaptive Spatio-Temporal Reinforcement Optimization for GNN-Powered Anomaly Detection in Cyber-Physical Systems

ASTRO: 用于信息物理系统中基于GNN的异常检测的自适应时空强化优化

Rai Ali Yar, Umaisa Lail, Anwar Shah

发表机构 * Department of Computer Science, FAST NUCES（计算机科学系，FAST NUCES）； Department of Information Technology, Riphah International University（信息技术系，Riphah国际大学）

AI总结提出ASTRO框架，结合深度Q网络与图神经网络、时间建模和多头注意力机制，通过强化学习动态优化阈值，在SWaT和WADI数据集上实现高F1分数，优于现有方法。

AI中文摘要

工业物联网环境中的异常检测对于保护工业控制系统和信息物理系统免受运行时虚假数据注入和其他恶意攻击至关重要。传感器网络和互连控制回路日益复杂，使得识别隐藏在高维和时间依赖信号中的异常行为变得困难。为解决这些挑战，本文介绍了自适应时空强化优化ASTRO，一种新颖的异常检测框架，开创性地使用强化学习进行动态阈值优化。通过将深度Q网络与图神经网络、时间建模和多头注意力机制相结合，ASTRO不断调整其决策边界以提高检测精度。GNN组件建模传感器之间的空间关系，时间模型捕获时间序列依赖性，注意力层突出显示最具信息量的时间步。模型生成连续异常分数，通过自适应阈值转换为二元决策，该阈值通过深度Q网络优化。ASTRO方法在两个真实工业基准测试：安全水处理和水分配数据集上进行了评估。所提模型在SWaT上取得了卓越性能，F1分数为0.990。此外，在高度复杂的127个终端设备的WADI数据集上，它获得了0.788的F1分数，比最先进的基线高出近14%。多次运行的结果证实了其一致的泛化能力和稳定性。这些实验表明，ASTRO框架是增强大规模信息物理基础设施的高度实用和可扩展的方法。

英文摘要

Anomaly detection in Industrial Internet of Things (IIoT) environments is essential to protect Industrial Control Systems (ICS) and Cyber-Physical Systems (CPS) from run-time false data injection and other malicious attacks. The increasing complexity of sensor networks and interconnected control loops makes it difficult to identify anomalous behavior hidden within high-dimensional and time-dependent signals. To address these challenges, this article introduces Adaptive Spatio-Temporal Reinforcement Optimization (ASTRO), a novel anomaly detection framework that pioneers the use of reinforcement learning for dynamic threshold optimization. By integrating a Deep Q-Network (DQN) with Graph Neural Networks (GNNs), temporal modelling, and a Multi-Head Attention mechanism, ASTRO continuously adapts its decision boundaries to improve detection accuracy. The GNN component models the spatial relations among sensors, the temporal model captures time-series dependencies, and the attention layer highlights the most informative time steps. The model generates continuous anomaly scores, which are transformed into binary decisions using an adaptive threshold optimized via the DQN. The ASTRO approach is evaluated on two real-world industrial benchmarks: the Secure Water Treatment (SWaT) and Water Distribution (WADI) datasets. The proposed model achieves exceptional performance on SWaT, with an F1 score of 0.990. Moreover, on the highly complex WADI dataset with 127 end devices, it secures an F1 score of 0.788, outperforming state-of-the-art baselines by nearly 14%. Results across multiple runs confirm consistent generalization and stability. These experiments demonstrate that the ASTRO framework is a highly practical and scalable method for strengthening large-scale cyber-physical infrastructures.
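
The score-to-decision step in the abstract above can be illustrated with a small sketch. It turns continuous anomaly scores into binary decisions with a threshold and nudges that threshold toward a higher F1 score. Note the substitution: a greedy local search stands in for the Deep Q-Network here, and the function names `f1_at_threshold` and `adapt_threshold`, the step size, and the toy data are all assumptions for illustration, not the authors' method.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when every score >= threshold is flagged as an anomaly."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def adapt_threshold(scores, labels, threshold=0.5, step=0.05, iters=20):
    """Greedy stand-in for a learned policy: try lower / keep / raise each round."""
    for _ in range(iters):
        # The current threshold is listed first so that ties keep it unchanged.
        candidates = [threshold, threshold - step, threshold + step]
        best = max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))
        if best == threshold:
            break
        threshold = best
    return threshold


# Toy data (hypothetical): two normal readings, two anomalous ones.
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_labels = [False, False, True, True]
tuned = adapt_threshold(toy_scores, toy_labels, threshold=0.1, step=0.1)
```

A learned policy would replace the greedy loop with actions chosen from an observed state, but the reward signal (detection quality at the current threshold) plays the same role.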

2605.25133 2026-05-26 cs.AI cs.CL 版本更新

Courant：一种具有局部支持和可解释场分解的状态自适应感知器神经代理模型

Anuj Kumar, Josiah Bjorgaard, Nikolaos Bouklas, Matteo Salvador, Alexander Lavin

发表机构 * Pasteur Labs（Pasteur实验室）； Cornell University（康奈尔大学）； Institute for Simulation Intelligence（模拟智能研究所）

AI总结提出基于感知器的编码-处理-解码代理模型Courant，通过状态自适应潜在查询和轻量解码器实现类似自适应hp细化的局部支持与可解释场分解，在稳态/瞬态模拟基准上取得竞争性精度。

AI中文摘要

我们引入“Courant”，一种基于感知器的编码器-处理器-解码器代理模型，其潜在特征在物理空间中表现出自适应专门化和局部支持，实现了类似于自适应hp细化方案的功能，这是传统数值求解器和科学机器学习中非常期望的属性。所提出的架构结合了共享随机傅里叶特征坐标嵌入、状态自适应潜在查询和轻量解码器。Courant使用稳态或瞬态模拟数据进行端到端训练，仅使用物理空间中的标准L_2预测损失，在基准测试上达到竞争性精度。我们证明Courant的归纳偏差产生了设计上可解释的潜在变量：它们在模拟域中发展出多尺度几何专门化，并在时间相关情况下跟踪相干结构，类似于随时间演化的空间基函数，从而允许对模拟场进行紧凑的、几何锚定的、单位划分式的分解。

英文摘要

We introduce "Courant", a Perceiver-based encoder-processor-decoder surrogate model that has latent features exhibiting adaptive specialization and local support in the physical space, enabling functionality akin to an adaptive hp-refinement scheme, an attribute that is highly desirable in traditional numerical solvers and scientific machine learning broadly. The proposed architecture combines a shared random Fourier feature coordinate embedding, state-adapted latent queries, and a light-weight decoder. Courant is trained end-to-end with steady or transient simulation data and only a standard L_2 prediction loss in the physical space, achieving competitive accuracy on benchmarks. We demonstrate that Courant's inductive biases yield latents that are interpretable by design: they develop multiscale geometric specialization in the simulation domain and track coherent structures in the time-dependent case, acting analogously to time-evolving spatial basis functions and allowing for decoding a compact, geometry-anchored, partition-of-unity-like decomposition of the simulated field.

2605.25110 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Uncertainty-DTW for Sequences and Visual Tokens

Uncertainty-DTW 用于序列和视觉标记

Lei Wang, Syuan-Hao Li, Yongsheng Gao, Piotr Koniusz

发表机构 * School of Engineering and Built Environment, Electrical and Electronic Engineering, Griffith University（工程与建筑环境学院，电气与电子工程学院，格里菲斯大学）； School of Computer Science and Engineering, University of New South Wales（计算机科学与工程学院，新南威尔士大学）

AI总结提出不确定性感知的动态时间规整（uDTW）框架，通过异方差不确定性建模和最大似然估计实现鲁棒对齐，并推广到视觉标记集，在多个领域取得优于现有方法的结果。

Comments Research report

AI中文摘要

对齐结构化数据是计算机视觉和机器学习中的一个基本问题，支撑着时间序列分析、人类动作识别和视觉表示学习等任务。现有的对齐方法，包括动态时间规整（DTW）及其可微变体，依赖于确定性相似度度量，因此对异质和噪声特征敏感。在这项工作中，我们引入了不确定性感知对齐，这是一个概率框架，用异方差不确定性建模成对对应关系，并沿对齐路径执行结构化匹配。我们的公式，不确定性-DTW（uDTW），为每个对应分配一个正态分布，并通过最大似然估计目标参数化每条对齐路径，该目标包括（i）一个精度加权匹配项，抑制不可靠特征，以及（ii）一个对数方差正则化，防止退化解。这产生了一个概率对齐机制，对噪声具有鲁棒性且可解释，因为不确定性直接反映了匹配的可靠性。我们进一步将该框架从时间序列推广到标记化的视觉表示，从而能够对视觉标记集进行结构化匹配。学习到的不确定性可以解释为反向注意力：语义相关区域表现出低不确定性并主导对齐，而模糊/噪声区域具有高不确定性。这提供了对齐、注意力和不确定性建模之间的联系。我们在不同领域评估了所提出的框架。结果表明，与最先进的方法相比，该方法持续改进，并且学习到的不确定性与语义重要性相关。这些发现将不确定性感知对齐确立为一个通用、鲁棒且可解释的框架，用于从结构化数据中学习。

英文摘要

Aligning structured data is a fundamental problem in computer vision and machine learning, underlying tasks such as time series analysis, human action recognition, and visual representation learning. Existing alignment methods, including Dynamic Time Warping (DTW) and its differentiable variants, rely on deterministic similarity measures and are therefore sensitive to heterogeneous and noisy features. In this work, we introduce uncertainty-aware alignment, a probabilistic framework that models pairwise correspondences with heteroscedastic uncertainty and performs structured matching along alignment paths. Our formulation, uncertainty-DTW (uDTW), assigns each correspondence a Normal distribution and parametrizes each alignment path by a Maximum Likelihood Estimate objective consisting of (i) a precision-weighted matching term that suppresses unreliable features, and (ii) a log-variance regularization that prevents degenerate solutions. This yields a probabilistic alignment mechanism that is robust to noise and interpretable, as uncertainty directly reflects the reliability of matches. We further generalize this framework from temporal sequences to tokenized visual representations, enabling structured matching over sets of visual tokens. The learned uncertainty can be interpreted as a reverse-attention: semantically relevant regions exhibit low uncertainty and dominate the alignment, while ambiguous/noisy regions have high uncertainty. This provides a connection between alignment, attention, and uncertainty modeling. We evaluate the proposed framework across diverse domains. The results demonstrate consistent improvements over state-of-the-art methods and show that learned uncertainty correlates with semantic importance. These findings establish uncertainty-aware alignment as a general, robust, and interpretable framework for learning from structured data.
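
As a rough illustration of the objective described above, the sketch below runs an ordinary hard-min DTW recursion in which each pairwise cost is a precision-weighted squared distance plus a log-variance penalty, i.e. d²/σ² + log σ². The function name `udtw_cost`, the scalar sequences, and the hand-set variances are assumptions for illustration; the paper learns the uncertainties, and its exact formulation may differ from this one.

```python
import math


def udtw_cost(x, y, var):
    """Hard-min DTW over per-pair costs d^2 / var + log(var).

    x, y : lists of floats (1-D sequences)
    var  : var[i][j] > 0, assumed variance for matching x[i] with y[j]
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j] = best accumulated cost of aligning x[:i] with y[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d2 = (x[i - 1] - y[j - 1]) ** 2
            s2 = var[i - 1][j - 1]
            # A large variance down-weights the distance, but log(s2) charges
            # for it, so inflating every variance is not a free solution.
            cost = d2 / s2 + math.log(s2)
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


unit = [[1.0, 1.0], [1.0, 1.0]]
plain = udtw_cost([0.0, 1.0], [0.0, 3.0], unit)             # reduces to standard DTW
relaxed = udtw_cost([0.0, 1.0], [0.0, 3.0], [[1.0, 1.0], [1.0, 4.0]])
```

With unit variances the log term vanishes and the recursion is plain DTW; raising the variance of the mismatched pair lowers the total cost only until the log penalty outweighs the saving.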

2605.25107 2026-05-26 cs.LG cs.AI cs.NA math.NA 版本更新

Leveraging Gauge Freedom for Learning Non-Gradient Population Dynamics of Stochastic Systems

利用规范自由度学习随机系统的非梯度种群动力学

Jules Berman, Tobias Blickhan, Benjamin Peherstorfer

发表机构 * Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA（数学科学学院，纽约大学，纽约，纽约州，10012，美国）

AI总结针对现有种群动力学推断局限于梯度流的问题，提出非梯度推断流（NGIF）算法，通过连续性方程的弱形式参数化一般向量场并选择非最小动能准则，在低维和高维物理问题中提高了分布精度并更好地捕捉非势输运。

2605.25101 2026-05-26 cs.SE cs.AI cs.SY eess.SY 版本更新

Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations

基于多智能体规范的FMU仿真蜕变测试

Ashir Kulshreshtha, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Kristian Klemets, Dragos Truscan, Mikael Manngård

发表机构 * University of Turku, Finland（图尔库大学，芬兰）； Novia University of Applied Sciences, Finland（诺维亚应用科学大学，芬兰）

AI总结针对FMU仿真模型中缺乏显式预期输出导致传统测试方法受限的问题，提出一种基于LLM的多智能体工作流，从规范和接口中自动提取蜕变关系并生成测试用例，在润滑油冷却系统FMU上验证了其有效性。

Comments Author version. 9 pages. Accepted for publication in the 10th International Workshop on Metamorphic Testing (MET 2026) of the IEEE Conference on Computers, Software, and Applications (COMPSAC2026), June 7-10, 2026 Madrid, Spain

AI中文摘要

在许多工业领域中，功能模型接口（FMI）被用于跨不同合作伙伴使用各种建模工具交换仿真模型作为功能模型单元（FMU）。这为使用FMU进行基于仿真的验证和确认以确保系统行为可靠提供了可能性。然而，由于缺乏显式预期输出，为这些仿真模型推导有效的测试预言仍然具有挑战性。这限制了需要访问系统内部工作原理的传统测试方法的适用性。蜕变测试（MT）通过利用蜕变关系（MR）解决了这一限制，但从规范中提取此类关系在很大程度上仍然是手动且容易出错的过程。为了应对这一挑战，我们提出了一种基于LLM的多智能体工作流，用于对基于FMU的仿真模型进行基于规范的蜕变测试。该方法以功能和接口规范为输入，协调多个智能体提取需求并推导MR。这些MR使用Given-When-Then模式来表达输入条件（Given）、变换（When）和预期输出行为（Then）。然后利用这些关系生成蜕变测试用例，执行仿真，并评估多个会话间的输出一致性。我们在润滑油冷却系统FMU上评估了该方法，证明了其自动生成有意义的MR和相应测试用例的能力。初步结果表明，所提出的工作流能够通过减少手动工作并改进测试生成，有效支持动态仿真模型的系统化验证和确认。

英文摘要

In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs) across different partners using various modelling tools. This opens up the possibilities for simulation-based verification and validation using FMUs for ensuring reliable system behaviour. However, deriving effective test oracles for these simulation models remains challenging due to the absence of explicit expected outputs. This limits the applicability of conventional testing approaches, which require access to the internal workings of the systems. Metamorphic testing (MT) addresses this limitation by leveraging metamorphic relations (MRs), but extracting such relations from specifications remains largely a manual and error-prone process. To address this challenge, we propose an LLM-powered multi-agent workflow for specification-based metamorphic testing of FMU-based simulation models. The approach takes functional and interface specifications as input and orchestrates multiple agents to extract requirements and derive MRs. These MRs are expressed using Given-When-Then patterns to structure input conditions (Given), transformations (When), and expected output behaviours (Then). These relations are then used to generate metamorphic test cases, execute simulations, and evaluate output consistency across multiple sessions. We evaluate the approach on a Lube Oil Cooling system FMU, demonstrating its ability to automatically generate meaningful MRs and corresponding test cases. Preliminary results indicate that the proposed workflow can effectively support the systematic verification and validation of dynamic simulation models by reducing manual effort and improving test generation.
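
The Given-When-Then pattern described above can be sketched as a small harness. Everything here is hypothetical: `toy_cooler_outlet_temp` is a hand-written stand-in for a simulation run, not a real FMU or any FMI tool API, and the metamorphic relation (more coolant flow must not raise the outlet temperature) is an assumed example rather than one taken from the paper.

```python
def toy_cooler_outlet_temp(inlet_temp, coolant_flow, coolant_temp=20.0):
    """Hypothetical stand-in for one simulation run; NOT a real FMU."""
    effectiveness = coolant_flow / (coolant_flow + 1.0)  # stays in [0, 1)
    return inlet_temp - effectiveness * (inlet_temp - coolant_temp)


def check_mr(simulate, given, when, then):
    """Run a source test and a follow-up test, then compare their outputs."""
    source_out = simulate(**given)       # Given: the source input conditions
    followup_in = when(dict(given))      # When: transform a copy of the input
    followup_out = simulate(**followup_in)
    return then(source_out, followup_out)  # Then: relation between the outputs


mr_holds = check_mr(
    toy_cooler_outlet_temp,
    given={"inlet_temp": 80.0, "coolant_flow": 2.0},
    when=lambda inp: {**inp, "coolant_flow": inp["coolant_flow"] * 2},
    then=lambda src, fol: fol <= src,
)
```

No expected output value is needed anywhere: the check only compares two runs against each other, which is what lets metamorphic testing sidestep the missing test oracle.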

2605.25095 2026-05-26 cs.AI cs.LG math.OC 版本更新

RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

RECTOR: 基于优先级规则的合规感知自动驾驶轨迹选择重排序

Hadi Hajieghrary, Benedikt Walter, Chaitanya Shinde, Paul Schmitt, Miguel Hurtado

发表机构 * TORC Robotics LLC（TORC机器人公司）； Daimler Truck AG（戴姆勒卡车集团）； Reynolds & Moore（雷诺兹与摩尔公司）； MassRobotics（马斯机器人）

AI总结提出RECTOR，一种后生成重排序层，通过可微代理和场景条件适用性机制，基于分层规则手册（安全>法律>道路>舒适）对候选轨迹进行评分，并采用确定性ε-词典序规则选择，在无需重新训练预测器的情况下，将安全与法律违规率从28.58%降至20.42%。

AI中文摘要

自动驾驶堆栈必须从多模态候选集中选择一条轨迹；仅凭模型置信度选择会忽略安全、交通法规和舒适性约束。我们提出RECTOR（规则强制约束轨迹编排器），一种后生成重排序层，通过可微代理和场景条件适用性机制，根据分层规则手册（安全>法律>道路>舒适）对候选轨迹进行评分，然后采用确定性ε-词典序规则进行选择，该规则通过构造保持跨层优先级，且无需重新训练预测器。在Waymo开放运动数据集validation_interactive划分（43,219个增强实例，K=6）上，根据协议B（28条规则代理目录，oracle适用性），与同一候选集上仅基于置信度的选择相比，规则感知选择将安全+法律违规从28.58%降至20.42%，总违规从40.32%降至32.41%。在该基准上，均匀加权求和基线匹配了二元合规性：经验提升来自规则感知排序，而词典序保证是任何权重校准无法复制的结构性差异因素。在对抗性置信度破坏下，仅置信度选择在100%的场景中失败，而两种规则感知选择器在约96%的场景中拒绝了注入的模式。所有数据均为代理评估器结果（非安全认证），开环，5秒时域，美国规则，验证集划分。

英文摘要

Autonomous driving stacks must pick one trajectory from a multi-modal candidate set; choosing by model confidence ignores safety, traffic-law, and comfort constraints. We present RECTOR (Rule-Enforced Constrained Trajectory Orchestrator), a post-generation reranking layer that scores candidates against a tiered rulebook (Safety ≻ Legal ≻ Road ≻ Comfort) via differentiable proxies and a scene-conditioned applicability mechanism, then selects with a deterministic ε-lexicographic rule that preserves cross-tier priority by construction, without retraining the predictor. On the Waymo Open Motion Dataset validation_interactive split (43,219 augmented instances, K=6), under Protocol B (28-rule proxy catalog, oracle applicability), rule-aware selection cuts Safety+Legal violations from 28.58% to 20.42% and Total from 40.32% to 32.41% versus confidence-only on the same candidates. A uniform-weight weighted-sum baseline matches binary compliance on this benchmark: the empirical lift comes from rule-aware ranking, while the lexicographic guarantee is the structural differentiator no weight calibration can replicate. Under adversarial confidence corruption, confidence-only selection fails in 100% of scenarios while both rule-aware selectors reject the injected mode in ~96%. All figures are proxy-evaluator results (not a safety certificate), open-loop, 5 s horizon, U.S. rules, validation split.
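
One common way to realise an ε-lexicographic rule over prioritised tiers is sketched below: walk the tiers in priority order, keep only the candidates within ε of the best score on each tier, and fall back to model confidence among the survivors. The function name, the per-tier violation scores, and the confidence tie-break are assumptions for illustration; the abstract does not spell out the paper's exact rule.

```python
def eps_lexicographic_select(violations, confidence, eps=1e-3):
    """Return the index of the selected candidate.

    violations : one list per candidate, ordered by tier priority
                 (e.g. [safety, legal, road, comfort]); lower is better
    confidence : model confidence per candidate, used only as a tie-break
    """
    survivors = list(range(len(violations)))
    n_tiers = len(violations[0])
    for tier in range(n_tiers):
        best = min(violations[k][tier] for k in survivors)
        # Keep only candidates within eps of the best score on this tier,
        # so a lower tier can never override a higher one.
        survivors = [k for k in survivors if violations[k][tier] <= best + eps]
        if len(survivors) == 1:
            break
    return max(survivors, key=lambda k: confidence[k])


# Hypothetical candidates: index 0 is most confident but violates safety.
cands = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.9],
]
picked = eps_lexicographic_select(cands, confidence=[0.90, 0.05, 0.05])
```

Candidate 2 wins despite poor road and comfort scores, because it is best on the two higher tiers; a weighted sum could rank it differently depending on the weights, which is the structural contrast the abstract draws.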

2605.25091 2026-05-26 cs.AI 版本更新

D3S2: 扩散引导的语义分割数据集蒸馏

Wenjie Zheng, Haoji Hu, Jiali Lu, Xingze Zou, Jing Wang

发表机构 * Zhejiang University（浙江大学）

AI总结针对语义分割数据集蒸馏中的长尾类别不平衡、像素级对齐和高计算成本问题，提出两阶段框架D3S2，通过类别平衡掩码选择和扩散引导图像合成生成紧凑训练集，在极低压缩率下显著提升分割性能。

AI中文摘要

数据集蒸馏旨在将大规模数据集压缩为紧凑的合成集，同时保持训练效果。然而，现有研究主要关注图像分类，而语义分割等密集预测任务尚未充分探索。本文识别了分割数据集蒸馏的三个关键挑战：(i) 长尾类别不平衡，(ii) 图像与密集标签之间严格的像素级对齐需求，以及(iii) 使用复杂模型优化高分辨率数据的高计算成本。为应对这些挑战，我们提出D3S2，一种扩散引导的语义分割数据集蒸馏框架。我们的方法采用两阶段设计。在类别平衡掩码选择中，我们通过优先考虑低表示类别的贪婪策略构建代表性掩码集。在扩散引导图像合成中，我们使用预训练的布局到图像扩散模型生成以所选掩码为条件的图像，自然确保空间对齐。为进一步增强合成数据的训练效用，我们引入具有两个互补目标的引导扩散采样：用于像素级对齐的分割一致性损失，以及用于对齐跨层每类特征统计的类级特征匹配损失。大量实验证明了D3S2的优越性。值得注意的是，在1%的极低压缩率下，我们的方法在ADE20K和COCO-Stuff上使用Mask2Former (Swin-S)分别达到24.99%和35.49%的mIoU，比随机选择分别高出9.34%和5.70%。

英文摘要

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic sets while preserving training efficacy. However, existing studies mainly focus on image classification, leaving dense prediction tasks such as semantic segmentation largely underexplored. In this work, we identify three key challenges for segmentation DD: (i) long-tailed class imbalance, (ii) the need for strict pixel-wise alignment between images and dense labels, and (iii) the high computational cost of optimizing high-resolution data with complex models. To address these challenges, we propose D3S2, a Diffusion-guided Dataset Distillation framework for Semantic Segmentation. Our method adopts a two-stage design. In Class-Balanced Mask Selection, we construct a representative mask set via a greedy strategy that prioritizes underrepresented classes. In Diffusion-Guided Image Synthesis, we employ a pretrained layout-to-image diffusion model to generate images conditioned on the selected masks, naturally ensuring spatial alignment. To further enhance the training utility of synthesized data, we introduce guided diffusion sampling with two complementary objectives: a segmentation-consistency loss for pixel-level alignment, and a class-wise feature matching loss for aligning per-class feature statistics across layers. Extensive experiments demonstrate the superiority of D3S2. Notably, at an extreme compression rate of 1%, our method achieves 24.99% and 35.49% mIoU on ADE20K and COCO-Stuff with Mask2Former (Swin-S), outperforming random selection by 9.34% and 5.70%, respectively.
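
The first stage, greedy selection that favours underrepresented classes, might look roughly like the sketch below. Each mask is reduced to the set of class ids it contains, and the gain function 1 / (1 + count) is an assumed scoring rule chosen for illustration; the abstract does not state the criterion the authors use, and `select_masks` is a hypothetical name.

```python
from collections import Counter


def select_masks(mask_classes, budget):
    """Greedily pick up to `budget` masks, preferring still-rare classes.

    mask_classes : list of sets, the class ids present in each mask
    """
    counts = Counter()  # how often each class appears in the chosen set
    chosen = []
    remaining = set(range(len(mask_classes)))
    while remaining and len(chosen) < budget:
        def gain(i):
            # A class that is absent from the chosen set contributes 1.0;
            # its contribution shrinks every time it is selected again.
            return sum(1.0 / (1 + counts[c]) for c in mask_classes[i])

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        remaining.remove(best)
        counts.update(mask_classes[best])
    return chosen


# Hypothetical masks: class 0 is the head class, classes 1 and 2 are rare.
picked_masks = select_masks([{0}, {0}, {0, 1}, {2}], budget=2)
```

With a budget of two, the sketch skips the two masks that only contain the head class and covers all three classes instead.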

2605.25020 2026-05-26 cs.AI cs.CL 版本更新

面向任务驱动无人机网络的能量感知多智能体强化学习扩展与个体奖励

Changling Li, Ying Li

发表机构 * Department of Computer Science, ETH Zurich（苏黎世联邦理工学院计算机科学系）； Department of Computer Science, Colby College（科尔比学院计算机科学系）

AI总结提出基于个体奖励函数的能量感知多智能体强化学习模型，利用深度Q网络解决无人机网络动态环境和电池容量限制下的轨迹规划问题，实验表明在任务密度高时成功率接近100%，且扩展性优于共享奖励模型。

Comments IEEE Internet of Things Journal

DOI: 10.1109/JIOT.2024.3511253
Journal ref: volume=12, number=8, year=2025, pages=10640-10654

AI中文摘要

多智能体强化学习（MARL）因其通过交互学习的能力，在自动驾驶和智慧城市等协作系统中显示出广泛适用性。随着无人机网络的最新发展，研究人员也应用MARL来解决轨迹规划问题。然而，动态环境和有限的电池容量仍然是使用MARL实现高效协作任务执行的挑战。在本文中，我们提出了一种能量感知的MARL模型作为应对这些挑战的尝试，利用深度Q网络（DQN）和由任务执行进度及无人机剩余电量驱动的个体奖励函数。我们对所提出的模型进行了一系列仿真研究，并将其与共享奖励MARL进行比较，以探索MARL中信用分配的影响。结果表明，无论任务位置和长度如何，我们提出的模型都能达到至少80%的成功率。与共享奖励模式类似，个体奖励模式在任务密度高时可以获得更好的成功率，并且当任务密度接近40%时，几乎可以达到100%的成功率。我们提出的个体奖励模型的真正优势在环境扩展时得以显现。与共享奖励MARL的比较表明，我们提出的模型对环境大小和智能体数量的变化更加鲁棒。由于目标的清晰性，它可以用更少的步骤实现更高的成功率，从而更好地提高能源效率。

英文摘要

Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities because of its ability to learn through interaction. With the recent development of drone networks, researchers have also applied MARL to trajectory planning problems. However, the dynamic environment and limited battery capacity still make it challenging to use MARL for efficient collaborative task execution. In this paper, we propose an energy-aware MARL model as an attempt to tackle these challenges, leveraging Deep Q-Networks (DQN) with individual reward functions driven by task execution progress and the remaining battery of the drones. We conduct a set of simulation studies for the proposed model and compare it with the shared-reward MARL [Li2022MARL] to explore the impact of credit assignment in MARL. The results indicate that our proposed model can achieve a success rate of at least 80% regardless of task locations and lengths. Similar to the shared-reward mode, the individual-reward mode achieves a better success rate when the task density is high, reaching nearly 100% when the task density approaches 40%. The true advantage of our proposed model with individual rewards is revealed when scaling up the environment. The comparison with the shared-reward MARL shows that our proposed model is more robust to changes in environment size and number of agents. It achieves a higher success rate in fewer steps thanks to the clarity of the goal, which further improves energy efficiency.

2605.24989 2026-05-26 cs.LG cs.AI cs.IR 版本更新

Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration

基于不确定性触发的特征路径探索的点击率预测选择性测试时计算扩展

Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng

发表机构 * Alibaba Group（阿里巴巴集团）

AI总结针对点击率预测中训练数据稀疏导致的不确定性，提出无需训练、模型无关的UTTSI框架，通过双信号估计器区分认知不确定性和偶然模糊性，对不确定实例进行自适应特征过滤和随机特征路径探索，在保持最坏延迟不变的情况下实现平均约2.8倍基础模型开销，实验和在线A/B测试均取得显著提升。

Comments 12 pages, 4 Figures, 3 Tables

AI中文摘要

扩展测试时计算对语言模型已被证明非常有效，然而这一机会在工业点击率（CTR）预测中仍未得到充分探索。CTR模型存在一个根本的不对称性：训练中充分表示的特征组合产生自信的预测，而稀疏观察到的特征组合则产生不可靠的输出。现有的训练阶段解决方案（如自适应门控）学习一个固定的选择函数，但受限于相同的稀疏性，在部署时无法提供针对每个实例的补救措施。我们提出UTTSI（不确定性触发的测试时选择性推理），一个无需训练、模型无关的框架，将推理深度按比例扩展到每个实例的不确定性。一个结合模型logit置信度和数据级频率先验的双信号估计器区分认知不确定性和偶然模糊性。每个实例都经过自适应特征过滤以去除不可靠的嵌入；不确定的实例额外接受随机特征路径探索，其预测通过一致性加权集成进行聚合。自信的实例完全绕过探索，保持平均开销约为基础模型成本的2.8倍，最坏情况延迟不变。在四个数据集和三种骨干架构上的实验表明，与所有训练阶段基线相比，取得了持续且统计显著的增益。为期七天的在线A/B测试进一步证实了5.3%的相对CTR提升（p < 0.01），确立了选择性测试时计算分配作为CTR预测训练阶段进展的实用补充。

英文摘要

Scaling test-time compute has proven highly effective for language models, yet this opportunity remains largely unexplored for industrial Click-Through Rate (CTR) prediction. CTR models suffer from a fundamental asymmetry: feature combinations well-represented in training yield confident predictions, while sparsely observed ones produce unreliable outputs. Existing training-phase solutions such as adaptive gating learn a fixed selection function subject to the same sparsity, offering no per-instance recourse at deployment. We propose UTTSI (Uncertainty-Triggered Test-Time Selective Inference), a training-free model-agnostic framework that scales inference depth proportionally to per-instance uncertainty. A dual-signal estimator combining model logit confidence with a data-level frequency prior distinguishes epistemic uncertainty from aleatoric ambiguity. Every instance undergoes adaptive feature filtering to remove unreliable embeddings; uncertain instances additionally receive stochastic feature-path explorations whose predictions are aggregated via consistency-weighted ensembling. Confident instances bypass exploration entirely, keeping average overhead at approximately 2.8× base model cost with worst-case latency unchanged. Experiments on four datasets with three backbone architectures demonstrate consistent, statistically significant gains over all training-phase baselines. A seven-day online A/B test further confirms a 5.3% relative CTR gain (p < 0.01), establishing selective test-time compute allocation as a practical complement to training-phase advances for CTR prediction.
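
The gating idea above, spending extra compute only on uncertain instances, can be sketched as follows. The specific rule (a logit-confidence margin combined with a minimum feature-frequency count), both thresholds, and the function names are assumptions for illustration; the abstract does not give the estimator's actual form, and the feature filtering and ensembling steps are not modelled here.

```python
import math


def needs_exploration(logit, min_feature_count,
                      conf_margin=0.2, freq_threshold=50):
    """Return True when an instance should receive extra feature-path passes."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    low_confidence = abs(prob - 0.5) < conf_margin      # model-side signal
    rarely_seen = min_feature_count < freq_threshold    # data-side signal
    # Unsure AND rarely seen: treated here as epistemic uncertainty.
    # Unsure but frequently seen: treated as aleatoric ambiguity, so skipped.
    return low_confidence and rarely_seen


def average_cost(flags, extra_passes):
    """Mean forward passes per instance, in units of one base-model pass."""
    return 1.0 + extra_passes * sum(flags) / len(flags)


# Hypothetical batch: (logit, rarest feature's training count) per instance.
batch = [(0.1, 3), (0.1, 1000), (4.0, 3), (-3.5, 800)]
flags = [needs_exploration(z, c) for z, c in batch]
cost = average_cost(flags, extra_passes=4)
```

Only the first instance is both unsure and rarely seen, so the batch average stays well below the cost of exploring every instance.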

2605.24975 2026-05-26 cs.RO cs.AI cs.LG 版本更新

Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion

弥合差距：实现软演员-评论家算法用于高性能腿部运动

Gianluca Sabatini, Chenhao Li, Marco Hutter

发表机构 * ETH Zurich（苏黎世联邦理工学院）

AI总结本文通过识别软演员-评论家（SAC）在并行训练中性能不足的根本原因，并提出策略初始化、超时感知评论家目标和多步回报估计等改进，使其在腿部运动任务中达到与近端策略优化（PPO）相当的性能。

AI中文摘要

近端策略优化（PPO）由于其在IsaacLab等大规模并行仿真环境中的鲁棒性和可扩展性，已成为训练腿部机器人的事实标准。然而，其基于策略的性质使其天生样本效率低下，阻碍了其在真实硬件上的持续适应和微调。相比之下，软演员-评论家（SAC）是一种可以重用过去经验的离策略算法，使其成为模拟到现实迁移工作流程的自然候选，其中同一算法既可用于仿真，也可用于真实机器人的在线学习。尽管有这些优势，SAC在大规模并行训练设置中始终未能匹配PPO的经验性能。本工作确定了这一差距的根本原因，并引入了针对性的修改，包括策略初始化、超时感知评论家目标和多步回报估计，使SAC能够稳定地大规模训练。在多个腿部机器人平台和多样化的运动任务上评估，我们的方法完全弥合了与PPO的性能差距。

英文摘要

Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in massively parallel simulation environments like IsaacLab. However, its on-policy nature makes it inherently sample-inefficient, preventing its use for continuous adaptation and fine-tuning on real hardware. Soft Actor-Critic (SAC), by contrast, is an off-policy algorithm that can reuse past experience, making it a natural candidate for sim-to-real transfer workflows where the same algorithm can be used both in simulation and for online learning on the real robot. Despite these advantages, SAC has consistently failed to match PPO's empirical performance in massively parallel training settings. This work identifies the root causes of this gap and introduces targeted modifications, covering policy initialization, timeout-aware critic targets, and multi-step return estimation, that enable SAC to train stably at scale. Evaluated across multiple legged robot platforms and diverse locomotion tasks, our approach closes the performance gap with PPO entirely.
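
Two of the modifications named above, timeout-aware critic targets and multi-step returns, are commonly combined as in the sketch below: an episode that ends by true termination contributes no bootstrap value, while one cut off by a time limit still bootstraps from the critic. The function name and argument layout are assumptions, the segment is assumed non-empty, and SAC's entropy term is left out for brevity, so this is not the authors' implementation.

```python
def n_step_target(rewards, terminated, truncated, bootstrap_values, gamma=0.99):
    """n-step critic target for one non-empty trajectory segment.

    rewards[k]          : reward received at step k
    terminated[k]       : True if step k reached a true terminal state
    truncated[k]        : True if step k ended only because of a time limit
    bootstrap_values[k] : critic estimate for the state reached after step k
    """
    target, discount = 0.0, 1.0
    for k, r in enumerate(rewards):
        target += discount * r
        discount *= gamma
        if terminated[k]:
            return target  # true terminal state: nothing left to bootstrap
        if truncated[k]:
            # Timeout: the task would have continued, so still bootstrap.
            return target + discount * bootstrap_values[k]
    # Segment ended mid-episode: bootstrap from the last reached state.
    return target + discount * bootstrap_values[len(rewards) - 1]


no_flags = [False, False]
full = n_step_target([1.0, 1.0], no_flags, no_flags, [10.0, 20.0], gamma=0.5)
died = n_step_target([1.0, 1.0], [True, False], no_flags, [10.0, 20.0], gamma=0.5)
timed_out = n_step_target([1.0, 1.0], no_flags, [True, False], [10.0, 20.0], gamma=0.5)
```

Treating a timeout as a terminal state would teach the critic that the value drops to zero at an arbitrary step count, which is the kind of bias a timeout-aware target avoids.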

2605.24973 2026-05-26 cs.CV cs.AI cs.CL 版本更新

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

MinerU-Popo：结构化文档解析的通用后处理模型

Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai Artificial Intelligence Laboratory, OpenDataLab（上海人工智能实验室，OpenDataLab）； University of California, Berkeley（加州大学伯克利分校）

AI总结提出MinerU-Popo轻量级通用后处理框架，通过分解为文本/表格截断恢复、标题层级重建和图文关联四个子任务，并利用动态分块和重叠同步将OCR页面级结果重构为文档级逻辑结构，显著提升标题层级TEDS和RAG准确性。

Comments The code is available at https://github.com/opendatalab/MinerU-Popo

AI中文摘要

基于VLM的OCR模型已成为文档解析的事实标准，因为它们可以准确提取页面级元素（例如单个页面内的段落）及其边界框和文本内容。然而，下游应用（如RAG）需要连贯的文档级信息，而这些模型常常破坏跨页连续性，并且无法恢复被页面边界截断的结构（如段落和表格）。这种关系不局限于单个页面；相反，它们需要对跨多个页面的标题、段落、表格和图像进行联合分析。因此，一个自然的解决方案是重用现有的OCR输出，并通过后处理重建文档级逻辑结构。为此，我们提出了MinerU-Popo，一个轻量级且通用的OCR输出后处理框架，它将来自不同解析器的页面级结果转换为连贯的文档级结构。MinerU-Popo将问题分解为四个聚焦的子任务：文本截断恢复、表格截断恢复、标题层级重建和图文关联。为了有效解决这些问题，我们构建了一个面向任务的数据引擎，具有任务特定的输入过滤，并使用生成的数据（30K）微调了一个轻量级后处理模型（Qwen3-VL-4B）。为了支持长文档，我们引入了基于重叠同步的动态分块，对齐微调模型的分块级输出并保持全局一致性。最后，我们将对齐后的输出组装成树状文档表示，并通过节点分块和摘要进一步丰富，以支持下游检索和分析。实验结果表明，MinerU-Popo在所有五个测试的OCR模型上，标题层级TEDS至少提高了20%，提高了RAG准确性并降低了每次查询的延迟。

英文摘要

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show that MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, while also improving RAG accuracy and reducing per-query latency.
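
The overlap idea behind the chunking step can be sketched in a few lines: consecutive chunks share some pages, so decisions made on the shared pages can be compared and reconciled. This sketch uses a fixed chunk size for simplicity, whereas the paper describes the chunking as dynamic; `overlapping_chunks` and its parameters are hypothetical names.

```python
def overlapping_chunks(n_pages, chunk_size, overlap):
    """Split page indices 0..n_pages-1 into chunks sharing `overlap` pages."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_pages <= 0:
        return []
    step = chunk_size - overlap
    chunks, start = [], 0
    while True:
        end = min(start + chunk_size, n_pages)
        chunks.append(list(range(start, end)))
        if end >= n_pages:
            break
        start += step  # the next chunk re-reads the last `overlap` pages
    return chunks


page_chunks = overlapping_chunks(n_pages=10, chunk_size=4, overlap=1)
```

In this example pages 3 and 6 each appear in two chunks; those shared pages are where the outputs of neighbouring chunks can be aligned.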

2605.24971 2026-05-26 cs.LG cs.AI 版本更新

TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism

TGFormer：基于自相关机制的时间图Transformer

Hongjiang Chen, Pengfei Jiao, Ming Du, Xuan Guo, Zhidong Zhao, Di Jin, Xiao Liu

发表机构 * Hangzhou Dianzi University, School of Cyberspace（杭州电子科技大学信息学院）； Tianjin University, College of Intelligence and Computing（天津大学智能与计算学院）； State Key Laboratory of Systems Medicine for Cancer, Shanghai Cancer Institute（癌症系统医学国家重点实验室，上海癌症研究院）

AI总结针对时间图神经网络在捕获长期依赖和周期模式上的不足，提出TGFormer，通过轨迹框架和自相关机制实现子交互级别的依赖发现与表示聚合，在六个基准上最高提升9.35%精度。

DOI: 10.1016/j.patcog.2025.112053
Journal ref: Pattern Recognition 170 (2026): 112053

AI中文摘要

对时间图神经网络（TGNN）日益增长的兴趣源于它们能够建模复杂动态并提供卓越性能。然而，TGNN在捕获长期依赖和识别周期模式方面面临根本性挑战。为解决这些限制，我们提出了TGFormer，一种专为时间图设计的新型Transformer架构。我们的模型通过建立与时间序列分析原理一致的轨迹框架，重新定义了时间图学习。这种方法使TGFormer能够通过对历史交互的系统分析来推导节点表示，从而实现对跨连续时间戳的节点关系的精细检查。基于随机过程理论，我们开发了一种自相关机制，系统性地揭示节点交互中的周期依赖。这一创新使TGFormer能够在子交互级别进行依赖发现和表示聚合，相比传统注意力机制展现出更高的效率和准确性。在六个公开基准上的实验验证了我们的方法的有效性，与最先进方法相比，TGFormer最高实现了9.35%的精度提升。

英文摘要

The growing interest in Temporal Graph Neural Networks (TGNNs) stems from their ability to model complex dynamics and deliver superior performance. However, TGNNs encounter fundamental challenges in capturing long-term dependencies and identifying periodic patterns. To address these limitations, we propose TGFormer, a novel Transformer architecture specifically designed for temporal graphs. Our model redefines temporal graph learning by establishing a trajectory framework that aligns with time series analysis principles. This approach allows TGFormer to derive node representations through systematic analysis of historical interactions, enabling granular examination of node relationships across sequential timestamps. Building upon stochastic process theory, we develop an auto-correlation mechanism that systematically uncovers periodic dependencies in node interactions. This innovation empowers TGFormer to perform dependency discovery and representation aggregation at sub-interaction levels, demonstrating superior efficiency and accuracy compared to conventional attention mechanisms. Experimental validation across six public benchmarks confirms the effectiveness of our approach, with TGFormer achieving up to a 9.35% precision improvement over state-of-the-art approaches.
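
The statistical idea behind an auto-correlation mechanism can be shown on a plain scalar series: correlate the series with lagged copies of itself and read periodicity off the lags with the highest correlation. The sketch below uses a direct O(n · max_lag) sum on a toy signal; practical implementations usually compute this with an FFT, and how TGFormer applies it to node interaction sequences is not specified in the abstract. The function names are assumptions.

```python
def autocorrelation(series, max_lag):
    """Normalised autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return [0.0] * (max_lag + 1)  # constant series: no structure to find
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


def dominant_period(series, max_lag):
    """Lag (>= 1) with the highest autocorrelation."""
    ac = autocorrelation(series, max_lag)
    return max(range(1, max_lag + 1), key=lambda lag: ac[lag])


# Toy signal (hypothetical): one interaction every fourth time step.
signal = [1.0, 0.0, 0.0, 0.0] * 6
period = dominant_period(signal, max_lag=8)
```

Aggregating information from the top few lags, instead of attending over every pair of time steps, is the usual motivation for using auto-correlation in place of standard attention.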

2605.24969 2026-05-26 cs.LG cs.AI 版本更新

OSDTW: Optimal Shared Depth and Task Weighting for Long-Tailed Recognition

OSDTW：长尾识别的最优共享深度与任务加权

Chang Chu, Qingyue Zhang, Shao-Lun Huang, Junxiong Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China（清华大学深圳国际研究生院，中国深圳）； Shenzhen Zkosemi Semiconductor Technology Co., Ltd（深圳卓芯半导体科技有限公司）

AI总结提出OSDTW框架，通过分解任务、共享编码器与任务特定解码器，并基于Fisher信息矩阵推导泛化误差的偏置-方差分解，以优化共享深度和任务权重，解决长尾识别中头部-尾部性能权衡问题。

Comments ICIC 2026 Oral

AI中文摘要

长尾识别面临持续的头部-尾部权衡：提升尾部性能通常会降低头部准确率，并可能增加训练不稳定性。尽管重加权、解耦训练和多专家方法取得了强有力的实证结果，但关于头部和尾部类别之间表示共享以及跨类别组监督加权的关键设计选择仍主要基于启发式。在这项工作中，我们提出了OSDTW，一个原则性的任务分解框架，将原始的单标签识别问题划分为头部任务和尾部任务，通过共享编码器和任务特定解码器实现。为了处理两个标签组之间的互斥性和统计依赖性，我们引入了一个因子化模型，并表明由此产生的基于KL散度的泛化误差可以写为任务项之和（加一个常数），从而得到一个定义良好的任务级目标。我们进一步开发了一个三阶段训练流程：独立任务训练以估计任务级最优值和Fisher信息矩阵，加权联合训练以学习共享编码器，以及分支组装以构建最终的解耦模型。在块对角Fisher近似下，我们推导了期望泛化误差的可计算二阶展开，将其分解为编码器方差、编码器偏置和解码器方差。这种偏置-方差分解提供了一个可计算的代理来选择共享深度和任务权重，从而实现高效的超参数搜索。在标准长尾基准上的实验证明了所提出方法相对于强基线的有效性。

英文摘要

Long-tailed recognition suffers from a persistent head-tail trade-off: improving tail performance often degrades head accuracy and can increase training instability. Despite strong empirical results from re-weighting, decoupled training, and multi-expert methods, key design choices about representation sharing between head and tail classes and supervision weighting across class groups remain largely heuristic. In this work, we propose OSDTW, a principled task-decomposition framework that partitions the original single-label recognition problem into a head task and a tail task, implemented with a shared encoder and task-specific decoders. To handle the mutual exclusivity and statistical dependence between the two label groups, we introduce a factorized model and show that the resulting Kullback-Leibler divergence-based generalization error can be written as the sum of task-wise terms up to an additive constant, yielding a well-defined task-wise objective. We further develop a three-stage training pipeline: independent task training to estimate task-wise optima and the Fisher information matrix, weighted joint training to learn a shared encoder, and branch assembly to construct the final decoupled model. Under a block-diagonal Fisher approximation, we derive a computable second-order expansion of the expected generalization error, decomposing it into encoder variance, encoder bias, and decoder variance. This bias-variance decomposition provides a computable proxy to select the shared depth and task weights, enabling efficient hyper-parameter search. Experiments on standard long-tailed benchmarks demonstrate the effectiveness of the proposed approach over strong baselines.

2605.24965 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection

视觉基础模型在面部深度伪造检测中的跨域泛化极限

Ibrahim Delibasoglu

发表机构 * Department of Software Engineering, Faculty of Computer and Information Sciences（软件工程系，计算机与信息科学学院）

AI总结本文通过系统评估三种视觉基础模型（RoPE-ViT、DINOv3、NVIDIA C-RADIOv4-H）在DF40基准上的线性探测性能，揭示了它们在面部深度伪造检测中的跨域泛化极限，发现基础模型对全脸合成保持高判别力，但对局部编辑技术存在根本性边界。

AI中文摘要

生成模型的快速进化使得超逼真面部深度伪造的创建成为可能，暴露了现代数字取证中的一个关键弱点：检测器无法泛化到未见过的操作技术。传统网络遭受表示崩溃，过度拟合特定训练生成器的局部伪影指纹。本研究探讨了现代视觉基础模型是否可以作为可泛化的、开箱即用的特征提取器，能够在完全未见过的生成流形上追踪取证异常。我们进行了系统的跨域评估，比较了三种基础学习范式：全监督宏观语义特征（RoPE-ViT）、纯自监督几何特征（DINOv3）和多教师聚合表示（NVIDIA C-RADIOv4-H）。通过部署冻结的骨干网络并进行下游线性探测，我们映射了这些架构在具有挑战性的DF40基准上的性能极限。我们的实证结果揭示了预训练范式和参数规模之间的内在权衡，证明虽然基础模型对全脸合成保持高判别能力，但局部面部编辑技术在线性探测评估结构中暴露了基本边界。源代码和模型权重可在 http://github.com/mribrahim/deepfake 获取。

英文摘要

The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out-of-the-box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross-domain evaluation comparing three foundational learning paradigms: fully supervised macro-semantic features (RoPE-ViT), pure self-supervised geometric features (DINOv3), and multi-teacher agglomerative representations (NVIDIA C-RADIOv4-H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade-offs between pre-training paradigms and parameter scale, proving that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available at http://github.com/mribrahim/deepfake.

2605.24960 2026-05-26 cs.CL cs.AI cs.LG updated version

Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization

探究优化下上下文与参数化思维链忠实性之间的相互作用

Jingyi Sun, Qianli Wang, Pepa Atanasova, Nils Feldhus, Isabelle Augenstein

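Affiliations: University of Copenhagen; Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI); BIFOLD – Berlin Institute for the Foundations of Learning and Data

AI summary: Proposes FaithMate, a unified preference-alignment interface, to study how the contextual and parametric paradigms of chain-of-thought faithfulness interact under optimization; finds that the two are positively coupled but asymmetric, and that contextual faithfulness metrics trade off against one another.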

Comments The first two authors contributed equally and share first-authorship

AI中文摘要

思维链（CoT）忠实性，即CoT是否真实反映大型语言模型（LLM）的底层行为，通常通过两种不相交的范式进行评估：上下文忠实性（通过扰动输入或CoT轨迹测量）和参数化忠实性（通过干预模型的参数化知识评估）。然而，先前的工作仅对它们进行描述性比较。我们通过提出FaithMate（一个统一的偏好对齐接口，用于优化模型朝向任一忠实性范式）来填补这一空白。它使我们能够研究两种范式之间的相互作用，检查忠实性增益在范式内部和跨范式之间是否以及多大程度上泛化。在三个模型、两个数据集和六个忠实性指标上，我们发现两种范式呈正相关但不对称：优化参数化忠实性在两种范式上均产生一致的增益，而上下文对应范式则带来更多可变的增益。在上下文范式内，一个指标上的忠实性增益不能一致地转移到其他指标上，这表明现有的上下文指标捕捉了忠实性的不同方面，并暴露了固有的权衡。这些发现意味着CoT忠实性不是一个单一目标，因此需要多方面的优化和评估。

英文摘要

Chain-of-Thought (CoT) faithfulness, i.e., whether CoTs genuinely reflect the underlying behavior of large language models (LLMs), is typically evaluated under two disjoint paradigms: contextual faithfulness, measured by perturbing the input or CoT trace, and parametric faithfulness, assessed by intervening on a model's parametric knowledge. Yet prior work compares them only descriptively. We fill this gap by proposing FaithMate, a unified preference-alignment interface for optimizing models towards either faithfulness paradigm. It enables us to investigate the interplay between the two paradigms, examining whether and to what extent faithfulness gains generalize within and across paradigms. Across three models, two datasets, and six faithfulness metrics, we find that the two paradigms are positively coupled, yet asymmetric: optimizing towards parametric faithfulness yields consistent gains across both paradigms, whereas the contextual counterpart delivers more variable gains. Within the contextual paradigm, faithfulness gains on one metric do not consistently transfer to others, implying that existing contextual metrics capture disjoint facets of faithfulness and exposing inherent trade-offs. These findings imply that CoT faithfulness is not a monolithic objective and therefore requires multifaceted optimization and evaluation.

2605.24958 2026-05-26 cs.CL cs.AI updated version

SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

SEP-Attack：一种简单有效的基于迁移的文本对抗攻击范式

Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Xiaoming Xu, Wei Wang, Fenglong Ma, Hong Yu

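Affiliations: Dalian University of Technology; Peking University; Macao Polytechnic University; The Pennsylvania State University

AI summary: Proposes SEP-Attack, which uses a Determinantal Point Process to generate diverse surrogate ensemble weights, scores prediction confidence with a new metric to compute word importance and generate adversarial examples, and significantly outperforms existing methods on four datasets and two real-world APIs.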

AI中文摘要

尽管深度神经网络在现代Web和语言应用中表现出色，但它们仍然容易受到对抗攻击，尤其是使用代理模型生成对抗样本而无需访问受害者模型的迁移攻击。文本领域的迁移攻击仍未得到充分探索，只有少数研究解决了这一挑战性问题，且由于对子模型平等对待或重要性分数估计不准确，往往导致次优结果。为了解决这些挑战，我们提出了一种简单而有效的基于迁移的文本对抗攻击范式，名为SEP-Attack。具体来说，我们采用行列式点过程（DPP）生成多样化的代理集成权重，代表子模型的迁移性。利用这些权重，我们引入了一种新的度量来评估预测置信度分数，进而用于计算词重要性分数并生成对抗候选。最后，我们量化每个候选的迁移性分数，并选择排名靠前的作为最终的迁移对抗样本。在四个数据集和两个真实API上进行的实验验证了SEP-Attack的有效性，显著优于最先进的基线方法。

英文摘要

Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attacks, especially transferable attacks that generate adversarial examples using surrogate models without accessing the victim model. Transferable attacks in the text domain are still under-explored, with only a few studies addressing this challenging issue, often with suboptimal results due to equal treatment of submodels or inaccurate estimation of importance scores. To address these challenges, we propose a simple yet effective paradigm for transfer-based textual adversarial attack, named SEP-Attack. Specifically, we employ the Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights, representing the transferability of submodels. Using these weights, we introduce a new metric to evaluate prediction confidence scores, which in turn are used to calculate word importance scores and generate adversarial candidates. Finally, we quantify the transferability score for each candidate and select the top ones as the final transferable adversarial examples. Experiments conducted on four datasets and two real-world APIs validate the efficacy of SEP-Attack, significantly outperforming state-of-the-art baselines.

2605.24957 2026-05-26 cs.AI cs.CV cs.LG updated version

Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

通过区域感知注意力重校准减轻视觉语言模型中的对象幻觉

Yuanzhi Xu, Qian Gao, Jun Fan, Guohui Ding, Zhenyu Yang, Sixue Lin, Yuteng Xiao

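Affiliations: Qilu University of Technology (Shandong Academy of Sciences); China Telecom Digital Intelligence Technology Co, Ltd; Shenyang Aerospace University; Qilu Institute of Technology

AI summary: Proposes a training-free, region-aware adaptive weighting mechanism that computes an outlier-resistant statistical midpoint across attention heads and uses inter-head disagreement to set intervention budgets dynamically, suppressing hallucination-inducing attention paths through continuous penalty modulation; this corrects visual-semantic misalignment while preserving generative fluency.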

AI中文摘要

生成事实上不正确的对象（通常称为对象幻觉）仍然是大型视觉语言模型（LVLMs）中的一个持久挑战。当前解决该问题的方法——从昂贵的数据驱动微调和延迟较高的对比解码到刚性的注意力头截断——常常在计算效率或模型特征空间的连续性上做出妥协。为克服这些限制，我们引入了一种新颖的、无需训练的推理策略，该策略作为一种区域感知的自适应加权机制，动态纠正语义漂移，而不依赖于突然的启发式截断。通过计算各注意力头上的离群值稳健统计中点，我们为可靠的视觉表示建立了一个稳定锚点。然后，我们利用跨区域映射的跨头分歧来动态确定干预预算，通过连续惩罚调制温和地抑制引起幻觉的注意力路径。这种重校准过程有效纠正了视觉语义错位，同时完全保留了生成流畅性和语言先验。在包括CHAIR、POPE和MME在内的标准多模态基准上的全面评估表明，我们的策略显著减少了实例级和句子级幻觉。结果展示了与当代基线相比的最先进性能，证实了我们方法的效率和算法鲁棒性。我们的代码将公开。

英文摘要

The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to address this issue - ranging from expensive data-driven fine-tuning and high-latency contrastive decoding to rigid attention head truncation - frequently compromise either computational efficiency or the continuity of the model's feature space. To overcome these limitations, we introduce a novel, training-free inference strategy that operates as a region-aware adaptive weighting mechanism to dynamically correct semantic drift without relying on abrupt heuristic truncations. By computing an outlier-resistant statistical midpoint across various attention heads, we establish a stable anchor for reliable visual representations. We then utilize the inter-head disagreement mapped across regions to dynamically determine intervention budgets, gently suppressing hallucination-inducing attention paths through a continuous penalty modulation. This recalibration process effectively rectifies visual-semantic misalignments while fully preserving generative fluency and language priors. Comprehensive evaluations on standard multimodal benchmarks, including CHAIR, POPE, and MME, reveal that our strategy substantially curtails both instance- and sentence-level hallucinations. The results demonstrate state-of-the-art performance against contemporary baselines, confirming our method's efficiency and algorithmic robustness. Our code will be public.

2605.24953 2026-05-26 cs.AI updated version

Your Embedding Model Is Smarter Than You Think

你的嵌入模型比你想象的更聪明

Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim, Yong Jae Lee

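Affiliations: UW-Madison; Korea University; NetApp, Inc.

AI summary: Proposes SMART, a framework that exploits the implicit multi-vector capability of standard single-vector models by applying late interaction at inference time, improving multimodal retrieval performance without additional training.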

AI中文摘要

多模态检索严重依赖单向量检索器，它将丰富的顺序令牌序列压缩为单个全局表示。虽然高效，但它们丢弃了密集检索任务所需的关键细粒度局部证据。多向量方法作为解决方案被引入，但严格需要训练，且许多忽略了全局总结表示的必要性。为解决这一问题，我们引入SMART，一个释放标准单向量模型潜在多向量能力的框架。我们首先证明，在池化嵌入上的标准对比训练通过梯度流隐式塑造了前序隐藏状态的检索几何结构。通过在推理时对这些冻结的隐藏状态应用直接后期交互，SMART作为一种即插即用的升级，持续提升跨多种模态的性能，甚至在MMEB-V2上进一步改进了最先进的模型。我们还揭示了SMART的优越性能，简单的轻量级后训练不仅节省时间和计算，还在视觉文档检索上带来进一步改进，使单向量模型能够超越最先进的多向量对应模型。最终，SMART为多模态检索提供了高效的推理增强和强大的微调技术。我们在https://github.com/HanSolo9682/SMART开源了代码和权重。

英文摘要

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich token sequences into a single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training, and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, further improving even state-of-the-art models on MMEB-V2. We also show that simple, lightweight post-training with SMART not only saves time and compute but also brings further improvement on visual document retrieval, allowing a single-vector model to outperform state-of-the-art multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open-source our code and weights at https://github.com/HanSolo9682/SMART.
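
To make the contrast between a pooled single-vector score and late interaction over token-level hidden states concrete, here is a small sketch. It assumes that "late interaction" means ColBERT-style MaxSim scoring and that the single vector is a mean-pooled embedding; both are assumptions for illustration, the function names are hypothetical, and SMART's actual scoring may differ.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def single_vector_score(query_states, doc_states):
    """Cosine similarity between mean-pooled embeddings (single-vector baseline)."""
    q = l2_normalize(query_states.mean(axis=0))
    d = l2_normalize(doc_states.mean(axis=0))
    return float(q @ d)

def late_interaction_score(query_states, doc_states):
    """MaxSim: each query token keeps its best-matching document token; maxima are summed."""
    q = l2_normalize(query_states)
    d = l2_normalize(doc_states)
    sim = q @ d.T                      # shape: (query tokens, document tokens)
    return float(sim.max(axis=1).sum())

# Toy hidden states: doc_a contains tokens matching each query token exactly,
# doc_b only contains a blend of the two directions.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
doc_b = np.array([[1.0, 1.0], [1.0, 1.0]])

pooled_a = single_vector_score(query, doc_a)
pooled_b = single_vector_score(query, doc_b)
late_a = late_interaction_score(query, doc_a)
late_b = late_interaction_score(query, doc_b)
```

In this toy case the pooled scores for the two documents coincide, while the token-level score prefers `doc_a`, which is the kind of fine-grained evidence the abstract says pooling discards.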

2605.24926 2026-05-26 cs.AI updated version

Energy Shields for Fairness

公平性能量护盾

Filip Cano, Thomas A. Henzinger, Konstantin Kueffner

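Affiliations: Institute of Science and Technology Austria

AI summary: Proposes energy shields, lightweight physics-inspired adaptive controllers that intervene probabilistically to ensure runtime fairness smoothly, and that are the first fairness shields to provide both short-term safety and long-term liveness guarantees.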

DOI: 10.1145/3805689.3806807

AI中文摘要

运行时公平性不是一个一次性约束，而是一个在决策序列上评估的动态属性。为了确保运行时公平性，必须考虑过去的决策，这是传统静态分类器所忽略的信息。传统的公平性护盾通过确定性干预来强制执行运行时公平性，每当决策序列违反运行公平性度量的目标时，就会突然干预。这激发了我们主要的概念贡献：能量护盾。能量护盾是一种新颖的、轻量级的自适应控制器，它监控决策序列并概率性地干预，通过利用受物理学启发的能量函数将序列推向公平性，从而平滑地确保运行时公平性：决策越不公平，推动力就越强。这使得能量护盾成为第一个同时提供短期安全性和长期活性保证的公平性护盾。安全性确保运行公平性度量以高概率保持在运行目标区间内，而活性确保公平性度量的极限位于极限目标区间内。直观地说，短期指定了容忍的公平性值，长期指定了期望的公平性值。我们还提供了一种合成程序，用于为给定的目标规范构建最小侵入性的能量护盾，并通过实验证明其效率。我们通过短期和长期公平性的视角，将我们的能量护盾与现有的公平性护盾进行了评估。

英文摘要

Runtime fairness is not a one-time constraint but a dynamic property evaluated over a sequence of decisions. To ensure fairness at runtime, it is necessary to account for past decisions, information neglected by conventional, static classifiers. Traditional fairness shields enforce runtime fairness abruptly, by intervening deterministically whenever a sequence of decisions violates the target for a running fairness measure. This motivates our main conceptual contribution: energy shields. An energy shield is a novel, lightweight, adaptive controller that monitors a sequence of decisions and intervenes probabilistically to ensure runtime fairness smoothly, by utilizing physics-inspired energy functions to nudge the sequence toward fairness: the more unfair the decisions, the stronger the nudging force becomes. This makes energy shields the first fairness shields to provide both short-term safety and long-term liveness guarantees. Safety ensures that the running fairness measure stays within a running target interval with high probability, and liveness ensures that the limit of the fairness measure lies within the limit target interval. Intuitively, the short-term specifies the tolerated fairness values and the long-term specifies the desired fairness values. We also provide a synthesis procedure for constructing the least intrusive energy shield for a given target specification, and demonstrate its efficiency experimentally. We evaluate our energy shields against existing fairness shields through the lens of short- and long-term fairness.
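
The idea that "the more unfair the decisions, the stronger the nudging force becomes" can be illustrated with a toy probabilistic shield. Everything specific here is an assumption made for illustration: the fairness measure (a running acceptance-rate gap between two groups), the quadratic energy, the `tolerance` and `stiffness` parameters, and the function names. The sketch carries none of the safety or liveness guarantees the paper claims for its synthesized shields.

```python
import math
import random

def intervention_probability(gap, tolerance=0.05, stiffness=50.0):
    """Hypothetical energy-based rule: no push inside the tolerated band,
    then a push whose probability grows smoothly with the excess unfairness."""
    excess = max(0.0, abs(gap) - tolerance)
    energy = stiffness * excess ** 2   # spring-like quadratic energy (an assumption)
    return 1.0 - math.exp(-energy)

def shielded_decision(proposed_accept, group, accepted, seen, rng):
    """Possibly override a classifier's decision to shrink the acceptance-rate gap.

    accepted and seen map a group name ("a" or "b") to running counts."""
    rate = {g: (accepted[g] / seen[g] if seen[g] else 0.0) for g in ("a", "b")}
    gap = rate["a"] - rate["b"]        # running demographic-parity gap
    decision = proposed_accept
    if rng.random() < intervention_probability(gap):
        # Nudge toward fairness: accept the group that is behind, reject the other.
        behind = "b" if gap > 0 else "a"
        decision = (group == behind)
    seen[group] += 1
    accepted[group] += int(decision)
    return decision
```

Unlike a deterministic shield that flips every decision once a threshold is crossed, this rule leaves most decisions untouched when the gap barely exceeds the tolerance and intervenes almost always when the gap is large.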

2605.24920 2026-05-26 cs.LG cs.AI stat.ML updated version

Quaternion Self-Attention with Shared Scores

共享分数的四元数自注意力

Shogo Yamauchi, Tohru Nitta, Hideaki Tamori

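Affiliations: Tokyo Woman's Christian University

AI summary: Proposes a shared-score quaternion self-attention mechanism that computes a single real-valued score via the quaternion inner product and shares one attention distribution across all components, cutting score-computation multiplications by 75% and softmax operations from four to one while maintaining quality.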

Comments 26 pages, 6 figures and 15 tables. Accepted at ICML2026

AI中文摘要

四元数神经网络通过将四个相关特征表示为一个单一实体，实现了参数高效并建模多维依赖关系。然而，现有的四元数自注意力计算每个分量的分数并对每个分量应用独立的softmax操作，这增加了计算成本并允许注意力分布在分量间发散。我们提出了一种共享分数的四元数自注意力机制，该机制使用四元数内积计算单一实值分数，并在所有分量上应用共享的注意力分布。这将分数计算乘法减少了75%，并将softmax操作次数从四次减少到一次。我们证明，当查询和键由诱导分量预混合的四元数线性投影产生时，分量级分数和共享分数位于相同的交互子空间中，表明独立的分量级注意力主要重新参数化相同的交互，而不是扩展特征交互空间。在语音增强中，我们的方法在GPU上将推理时间减少了高达44.3%，在CPU上减少了58.1%，同时保持了质量，并且在视觉和自然语言处理中呈现一致的趋势。

英文摘要

Quaternion neural networks are parameter-efficient and model multidimensional dependencies by representing four related features as a single entity. However, existing quaternion self-attention computes component-wise scores and applies independent softmax operations to each component, which increases the computational cost and allows attention distributions to diverge across components. We propose a shared-score quaternion self-attention mechanism that computes a single real-valued score using the quaternion inner product and applies a shared attention distribution across all components. This reduces score-computation multiplications by 75% and the number of softmax operations from four to one. We prove that, when queries and keys are produced by quaternion linear projections that induce component pre-mixing, the component-wise and shared scores lie in the same interaction subspace, indicating that independent component-wise attention primarily re-parameterizes the same interactions rather than expanding the feature interaction space. In speech enhancement, our method reduces inference time by up to 44.3% on a GPU and 58.1% on a CPU while maintaining quality, with consistent trends across vision and natural language processing.
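
A minimal sketch of the shared-score idea follows, assuming each token is stored as four real component slices and that the score for a token pair is the real-valued quaternion inner product (the sum of the four component-wise dot products). The tensor layout, the `sqrt(4 * dim)` scaling, and the function names are assumptions for illustration; the paper's quaternion linear projections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_score_attention(q, k, v):
    """q, k, v have shape (tokens, 4, dim): one slice per quaternion component.

    A single real score per token pair replaces four component-wise scores."""
    _, _, dim = q.shape
    scores = np.einsum("icd,jcd->ij", q, k) / np.sqrt(4 * dim)  # scaling is an assumption
    attn = softmax(scores, axis=-1)           # one softmax shared by all components
    out = np.einsum("ij,jcd->icd", attn, v)   # same weights applied to every component
    return out, attn

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 4, 8)) for _ in range(3))
out, attn = shared_score_attention(q, k, v)
```

Because the inner product sums over components, the score matrix equals an ordinary dot product over the flattened `4 * dim` features, which is why only one softmax is needed.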

2605.24913 2026-05-26 eess.IV cs.AI q-bio.QM updated version

Explainable Multi-Task Retinal Imaging Reveals Microvascular Signals for Systemic Risk Stratification in Type 2 Diabetes: A Pilot Study

可解释多任务视网膜成像揭示2型糖尿病系统性风险分层的微血管信号：一项初步研究

Mini Han Wang, Liting Huang, Wei Hong, Boonthawan Wingwon

Affiliations: Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology; Frontier Science Computing Center, Zhuhai Institute of Advanced Technology, Chinese Academy of Sciences; Chinese University of Hong Kong; Zhuhai People's Hospital (The Affiliated Hospital of Beijing Institute of Technology, Zhuhai Clinical Medical College of Jinan University); Lampang Inter-Tech College, Lampang, Thailand

AI summary: This pilot study develops an explainable multi-task deep learning framework that analyzes associations between retinal microvascular features and systemic abnormalities such as kidney abnormality (AUC up to 0.63), exploring the potential of retinal imaging as a biomarker for systemic risk stratification in diabetes.

Comments 18 pages, 4 figures

AI中文摘要

视网膜成像提供了进入系统性微血管健康的非侵入性窗口，并已成为系统性疾病的潜在生物标志物。然而，视网膜特征是否编码了生物学上有意义的系统性信号，并且可以使用可解释人工智能（XAI）可靠地解释，仍不清楚。我们开发了一个可解释的多任务深度学习框架，以研究视网膜微血管特征与2型糖尿病系统性异常之间的关联。使用共享神经网络和针对血糖状态、肾脏异常和多系统参与的任务特定头部，分析了来自2,719名个体的11,011张眼底图像。使用梯度加权类激活映射（Grad-CAM）、解剖掩膜和血管对齐分析评估模型可解释性。该框架展示了任务依赖的预测性能，对肾脏异常的最佳区分度（AUC高达0.63），而血糖状态预测性能有限（AUC = 0.49-0.61）。可解释性分析一致地将模型注意力定位到视网膜血管和视盘周围区域。掩膜实验表明，遮挡血管区域导致性能下降最大，表明视网膜血管是主要的预测来源。不同架构表现出异质的注意力模式，提示存在多种系统性信号编码的表征路径。这项初步研究表明，视网膜微血管特征包含与系统性异常（尤其是微血管损伤）相关的可测量信号。通过将多任务学习与定量XAI验证相结合，该框架推动视网膜成像向用于糖尿病系统性风险分层的可解释数字生物标志物发展。

英文摘要

Retinal imaging provides a non-invasive window into systemic microvascular health and has emerged as a potential biomarker for systemic diseases. However, whether retinal features encode biologically meaningful systemic signals that can be reliably interpreted using explainable artificial intelligence (XAI) remains unclear. An explainable multi-task deep learning framework was developed to investigate associations between retinal microvascular features and systemic abnormalities in Type 2 Diabetes Mellitus. A total of 11,011 fundus images from 2,719 individuals were analysed using a shared neural network with task-specific heads for glycaemic status, kidney abnormality, and multi-system involvement. Model interpretability was evaluated using Gradient-weighted Class Activation Mapping (Grad-CAM), anatomical masking, and vessel alignment analysis. The framework demonstrated task-dependent predictive performance, with the best discrimination observed for kidney abnormality (AUC up to 0.63), whereas glycaemic status prediction showed limited performance (AUC = 0.49-0.61). Explainability analyses consistently localized model attention to retinal vessels and peripapillary regions. Masking experiments showed that occlusion of vascular regions caused the greatest performance decline, indicating that retinal vessels were the primary predictive source. Different architectures exhibited heterogeneous attention patterns, suggesting multiple representational pathways for systemic signal encoding. This pilot study demonstrates that retinal microvascular features contain measurable signals associated with systemic abnormalities, particularly microvascular damage. By integrating multi-task learning with quantitative XAI validation, this framework advances retinal imaging toward interpretable digital biomarkers for systemic risk stratification in diabetes.

2605.24912 2026-05-26 cs.LG cs.AI q-bio.OT updated version

Explainable Retinal Imaging for Prediction of Multi-Organ Dysfunction in Type 2 Diabetes

可解释的视网膜成像用于预测2型糖尿病多器官功能障碍

Mini Han Wang, Liting Huang, Wei Hong, Boonthawan Wingwon

Affiliations: Faculty of Computer Science and Artificial Intelligence; Frontier Science Computing Center; Chinese Academy of Sciences; Chinese University of Hong Kong; Zhuhai People's Hospital; Beijing Institute of Technology; Jinan University; Lampang Inter-Tech College

AI summary: This study builds system-level abnormality indices from routine laboratory biomarkers, predicts multi-system dysregulation in type 2 diabetes with a gradient boosting model, and uses SHAP for interpretability, identifying hyperglycaemia, renal impairment, dyslipidaemia, and inflammation as the dominant drivers.

Comments 15 pages, 8 figures

AI中文摘要

背景：2型糖尿病（T2DM）日益被认为是一种以代谢、肾脏、脂质和炎症通路协调功能障碍为特征的系统性疾病。现有的临床评估往往无法捕捉这种多维度负担。方法：我们对1,195名患者进行了回顾性研究，使用了常规收集的实验室生物标志物。构建了系统级异常指数以量化器官特异性功能障碍，并将多系统受累定义为两个或以上系统异常。训练了包括逻辑回归、随机森林和梯度提升在内的监督机器学习模型来预测多系统失调。使用SHapley Additive exPlanations（SHAP）实现模型可解释性。结果：梯度提升模型表现出近乎完美的区分能力（AUC = 1.000），显著优于逻辑回归（AUC = 0.925）。特征归因分析显示，高血糖、肾功能障碍、血脂异常和炎症是多系统风险的主要驱动因素。部分依赖分析中观察到的剂量-反应关系进一步支持了模型预测的生物学合理性。结论：本研究提出了一个可解释的、数据驱动的框架，用于量化T2DM的系统性疾病负担。通过将常规生物标志物与多器官功能障碍联系起来，我们的方法提供了预测准确性和机制洞察，为糖尿病护理中的风险分层和精准医学提供了潜力。本研究中使用的数据和代码可在GitHub上公开获取：https://github.com/MiniHanWang/Type-2-Diabetes-1.git

英文摘要

Background: Type 2 diabetes mellitus (T2DM) is increasingly recognised as a systemic disease characterised by coordinated dysfunction across metabolic, renal, lipid, and inflammatory pathways. Existing clinical assessments often fail to capture this multi-dimensional burden. Methods: We conducted a retrospective study of 1,195 patients using routinely collected laboratory biomarkers. System-level abnormality indices were constructed to quantify organ-specific dysfunction, and multi-system involvement was defined as abnormalities in two or more systems. Supervised machine learning models, including logistic regression, random forest, and gradient boosting, were trained to predict multi-system dysregulation. Model interpretability was achieved using SHapley Additive exPlanations (SHAP). Results: The gradient boosting model demonstrated near-perfect discrimination (AUC = 1.000), significantly outperforming logistic regression (AUC = 0.925). Feature attribution analysis revealed that hyperglycaemia, renal impairment, dyslipidaemia, and inflammation were the dominant drivers of multi-system risk. Dose-response relationships observed in partial dependence analyses further supported the biological plausibility of model predictions. Conclusion: This study presents an interpretable, data-driven framework for quantifying systemic disease burden in T2DM. By linking routine biomarkers to multi-organ dysfunction, our approach provides both predictive accuracy and mechanistic insight, offering potential for improved risk stratification and precision medicine in diabetes care. The data and code used in this study are openly available on GitHub at: https://github.com/MiniHanWang/Type-2-Diabetes-1.git
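
The outcome definition in the Methods ("abnormalities in two or more systems") can be written down directly. The sketch below assumes each patient already carries one boolean abnormality flag per system; the system names follow the pathways listed in the abstract, while the function name and the flag representation are hypothetical, and how the study derives each flag from biomarkers is not shown.

```python
def multi_system_involvement(system_flags, min_systems=2):
    """Return True when at least `min_systems` systems are flagged abnormal."""
    return sum(bool(flag) for flag in system_flags.values()) >= min_systems

# Hypothetical patients with per-system abnormality flags.
patient_one = {"metabolic": True, "renal": True, "lipid": False, "inflammatory": False}
patient_two = {"metabolic": True, "renal": False, "lipid": False, "inflammatory": False}

label_one = multi_system_involvement(patient_one)
label_two = multi_system_involvement(patient_two)
```

Labels produced this way would serve as the binary target for the logistic regression, random forest, and gradient boosting models the abstract mentions.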

2605.24911 2026-05-26 cs.LG cs.AI updated version

Factorize to Generalize: Retrieval-Guided Invariant-Dynamic Decomposition for Time Series Forecasting

因式分解以泛化：面向时间序列预测的检索引导不变-动态分解

Jinjin Chi, Lei Feng, Lulu Zhang, Yongcheng Jing, Yiming Wang, Ximing Li, Jialie Shen, Leszek Rutkowski, Dacheng Tao

Affiliations: College of Computer Science and Technology, Jilin University; College of Computing and Data Science, Nanyang Technological University; City St George's, University of London; Systems Research Institute, Polish Academy of Sciences

AI summary: Proposes a retrieval-guided invariant-dynamic decomposition framework that separates stable shared structure from instance-specific variation, improving the robustness of zero-shot time series forecasting under distribution shift.

AI中文摘要

时间序列基础模型（TSFMs）最近通过大规模预训练和检索增强预测实现了强大的零样本预测性能。然而，我们的实证分析揭示了基于检索的预测的一个非平凡限制：检索倾向于导致更振荡的预测，在高度波动的序列上提升性能，但在更平滑、趋势主导的序列上降低准确性。这表明检索信息可能在未明确区分稳定时间结构与实例特定变化的情况下被融合到预测中，这可能在分布偏移下降低鲁棒性。我们提出了一种用于时间序列预测的检索引导不变-动态分解框架。我们不将检索用作辅助预测上下文，而是利用检索到的序列作为来自相关环境的隐式样本，以指导表示分解。具体来说，我们首先通过基于注意力的聚合构建检索感知表示，然后引入检索引导路由机制将其分解为捕获稳定共享结构的不变组件和建模上下文相关变化的动态组件。这两个组件分别预测并融合以进行最终预测，使模型能够保留可迁移模式，同时保持对动态演变的适应性。我们进一步设计了鼓励不变学习和解耦的训练目标，并提供了理论见解，表明检索聚合减少了方差，并在没有显式环境监督的情况下近似不变表示学习。大量实验表明，我们的方法在分布偏移下持续提高鲁棒性，并在零样本预测设置中优于现有的TSFMs和基于检索的基线。

英文摘要

Time series foundation models (TSFMs) have recently achieved strong zero-shot forecasting performance through large-scale pretraining and retrieval-augmented prediction. However, our empirical analysis reveals a non-trivial limitation of retrieval-based forecasting: retrieval tends to induce more oscillatory predictions, improving performance on highly fluctuating series while degrading accuracy on smoother, trend-dominated ones. This suggests that retrieved information may be fused into prediction without explicitly distinguishing stable temporal structure from instance-specific variations, which can reduce robustness under distribution shifts. We propose a Retrieval-guided Invariant-Dynamic DEcomposition framework for time series forecasting. Rather than using retrieval as auxiliary predictive context, we leverage retrieved sequences as implicit samples from related environments to guide representation decomposition. Specifically, we first construct a retrieval-aware representation via attention-based aggregation, and then introduce a retrieval-guided routing mechanism to decompose it into an invariant component capturing stable shared structure and a dynamic component modeling context-dependent variations. These two components are forecast separately and fused for final prediction, enabling the model to preserve transferable patterns while remaining adaptive to evolving dynamics. We further design training objectives that encourage invariant learning and disentanglement, and provide theoretical insight showing that retrieval aggregation reduces variance and approximates invariant representation learning without explicit environment supervision. Extensive experiments demonstrate that our method consistently improves robustness under distribution shifts and outperforms existing TSFMs and retrieval-based baselines in zero-shot forecasting settings.

2605.24910 2026-05-26 cs.AI cs.CE updated version

Noise-Robust Financial Numerical Entity Attribute Tagging

鲁棒噪声的金融数值实体属性标注

Hsin-Min Lu, Chen-Yang Lai, Yi-Jhen Li, Ju-Chun Yen

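Affiliations: National Taiwan University; National Central University

AI summary: To address label noise and incomplete attribute coverage in financial numerical entity tagging, proposes NORA, which combines task-aware instance weighting with Neighborhood Prior-adjusted KNN filtering and achieves robust multi-attribute prediction on a benchmark of 6.6 million instances.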

AI中文摘要

金融数值实体（FNE）理解旨在恢复财务报告中数值提及的含义。现有研究主要关注概念名称预测，并面临两个重要限制。首先，来自内联XBRL的标签可能包含错误，因为申报通常是手动准备的。其次，其他重要的FNE属性，如报告时间关系、测量尺度和会计符号，较少被强调。我们提出鲁棒噪声的丰富金融数值实体属性标注（NORA）来解决这些差距。NORA使用任务感知的实例特定加权来减弱训练过程中噪声标签的影响，并进一步提出邻域先验调整KNN（NPK）过滤方法，以便在真实世界噪声测试集上进行更可靠的评估。此外，我们构建了一个包含660万个实例的大规模基准，具有多属性标签和申报元数据。实验表明，NORA与最先进的噪声标签基线（包括Co-teaching、Mixup、SSR和SelfMix）相比表现强劲。此外，NORA在未过滤和噪声过滤测试设置下均具有鲁棒性。它在概念名称和时间关系预测上取得了最佳准确率、宏F1和加权F1，同时在尺度和符号预测上保持竞争力。这些结果证明了在考虑真实世界财务申报中标签噪声的同时，联合建模丰富FNE属性的价值。

英文摘要

Financial Numerical Entity (FNE) understanding aims to recover the meaning of numerical mentions in financial reports. Existing studies primarily focus on concept name prediction and face two important limitations. First, labels derived from inline XBRL may contain errors because filings are usually prepared manually. Second, other important FNE attributes, such as reporting-time relation, measurement scale, and accounting sign, are less emphasized. We propose NOise-Robust Tagging for Rich Financial Numerical Entity Attributes (NORA) to address these gaps. NORA uses task-aware instance-specific weighting to attenuate the influence of noisy labels during training, and we further propose the Neighborhood Prior-adjusted KNN (NPK) filtering method for more reliable evaluation on real-world noisy test sets. In addition, we construct a large-scale benchmark containing 6.6 million instances with multi-attribute labels and filing metadata. Experiments show that NORA performs strongly compared with state-of-the-art noisy-label baselines, including Co-teaching, Mixup, SSR, and SelfMix. Moreover, NORA is robust under both unfiltered and noise-filtered test settings. It achieves the best Accuracy, Macro F1, and Weighted F1 for concept name and time-relation prediction, while remaining competitive on scale and sign prediction. These results demonstrate the value of jointly modeling rich FNE attributes while accounting for label noise in real-world financial filings.

2605.24908 2026-05-26 cs.LG cs.AI updated version

On the Impact of Class Imbalance on the Learning Dynamics of Deep Neural Networks: An Intuitive Insight

论类别不平衡对深度神经网络学习动态的影响：直观洞察

Ismail B. Mustapha, Shafaatunnur Hasan, Sunday O. Olatunji, Hatem S. Y. Nabus

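Affiliations: Faculty of Computing; Universiti Teknologi Malaysia; Adekunle Ajasin University; Johor, Malaysia; Akungba-Akoko, Nigeria

AI summary: By monitoring how deep neural networks learn the majority and minority classes at different imbalance ratios, this study systematically investigates how class imbalance drives the model to underfit the minority class and learn only the majority class in early training, so that the minority-class representations it eventually learns are overfitted and fail to generalize.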

Comments Conference

AI中文摘要

近年来，深度神经网络（DNN）中的类别不平衡问题引起了研究者的广泛关注。然而，相关文献中对DNN在不平衡数据上表现不佳的原因存在不同解释，表明人们对这一长期存在的现象如何影响DNN性能知之甚少。更好地理解这一问题对于开发有效的基于DNN的不平衡方法至关重要。因此，本研究通过监测DNN模型在不同不平衡比率数据集上对多数类和少数类的学习模式，系统研究了类别不平衡对DNN学习动态的影响。实验结果表明，与从平衡数据集学习时DNN类似地学习各个类别不同，类别不平衡严重损害了DNN的性能，导致模型在早期训练轮次中欠拟合少数类样本，同时仅学习多数类。尽管DNN最终学会了少数类样本，但这种学习方式仅导致学习到的少数类表示在测试阶段无法泛化，因为它们仅仅是过拟合以尽可能降低整体训练损失。

英文摘要

Class imbalance in deep neural networks (DNNs) has received rapidly increasing research attention in recent years. However, the varying accounts in the literature of why DNNs perform poorly on imbalanced data show that little is known about how this age-old phenomenon affects the performance of DNNs. A better understanding of this problem is crucial to developing effective DNN-based imbalance methods. Thus, this study systematically investigates the impact of class imbalance on the learning dynamics of DNNs by monitoring the learning patterns of DNN models on both the majority and minority classes of datasets with varying imbalance ratios. Experimental findings show that, in contrast to learning from balanced datasets, where a DNN learns the classes similarly, class imbalance severely degrades DNN performance, driving the model to underfit the minority class samples in the early training epochs while learning only the majority class. Although the DNN ultimately learns the minority samples, learning in this manner yields minority representations that do not generalize at test time because they are merely overfitted to keep the overall training loss as low as possible.
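
Monitoring the majority and minority classes separately, as the study describes, amounts to splitting the loss by class instead of reporting one average. A small sketch for the binary case follows, assuming the model outputs a probability for class 1; the function name and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def per_class_mean_loss(probs, labels):
    """Mean binary cross-entropy per class, given predicted P(class 1) and 0/1 labels."""
    probs = np.clip(probs, 1e-12, 1 - 1e-12)   # avoid log(0)
    losses = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return {int(c): float(losses[labels == c].mean()) for c in np.unique(labels)}

# Toy 9:1 imbalanced batch where the model predicts "majority" for every sample.
labels = np.array([0] * 9 + [1])
probs = np.full(10, 0.1)
class_losses = per_class_mean_loss(probs, labels)
overall_loss = float(np.mean(list(class_losses.values())))
```

Calling this after each training epoch gives two curves rather than one, so a low overall loss that hides a high minority-class loss becomes visible.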

2605.24902 2026-05-26 cs.CL cs.AI cs.LG updated version

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

当推理有害：面向临床SOAP笔记生成的前沿LLM源感知评估

Faizan Faisal

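Affiliations: University of California, Davis

AI summary: Using a source-aware benchmark, evaluates reasoning-enabled LLMs on clinical SOAP note generation and finds that enabling reasoning degrades GPT-5.4's quality, whereas same-source RAG yields small, model-dependent improvements.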

AI中文摘要

推理增强型LLM在医学推理基准测试中表现强劲，但这些增益是否能迁移到结构化临床文档尚不清楚；我们通过一个跨OMI Health、ACI-Bench和PriMock57的源感知基准，利用临床对话生成SOAP笔记来研究这一问题。我们在一个2x2受控设计中评估GPT-5.4、DeepSeek-V4-Flash和Gemma-4-E4B，独立切换提供者原生推理和相同源检索增强生成（RAG）。输出使用七种自动指标以及两个参考感知的LLM评判者进行评估。两种评估方法一致认为，非推理的GPT-5.4配置达到最高整体质量，而DeepSeek-V4-Flash在推理增强配置中表现最佳。启用推理显著降低了GPT-5.4在所有三个数据集上的性能，而相同源RAG带来较小的、模型依赖的改进。总体而言，研究结果表明，不应假设更强的推理能力能改善对保真度敏感的SOAP笔记生成，而无需专门的、任务特定的评估。

英文摘要

Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from clinical dialogue in a source-aware benchmark spanning OMI Health, ACI-Bench, and PriMock57. We evaluate GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B in a controlled 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG). Outputs are assessed using seven automatic metrics alongside two reference-aware LLM judges. Both evaluation approaches agree that a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, whereas same-source RAG yields smaller, model-dependent improvements. Overall, the findings indicate that stronger reasoning capability should not be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.

2605.24900 2026-05-26 cs.AI updated version

ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

ProActor: 时序感知强化学习用于主动任务调度智能体

Lei Ding, Bin He, Chenguang Wang, Yang Liu

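Affiliations: University of California, Santa Cruz; Zillow Group

AI summary: Proposes ProActor, a framework that combines timing-aware reinforcement learning (RULER-based rewards and stage-aware composite rewards) with the efficient training system ART-F, significantly improving the timing quality of proactive task scheduling while maintaining action consistency.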

Comments 47 pages, 31 figures. Accepted to ACL 2026

AI中文摘要

主动任务导向的智能体必须自主预测用户需求、识别可操作的机会，并在适当时刻触发软件动作——从根本上转变依赖显式指令的被动系统。然而，现有方法缺乏可泛化的端到端解决方案来度量和优化这种预期行为。本文介绍了ProActor，一个用于对话任务调度的统一框架，集成了：(1) 一种领域无关的自动标注方法，通过生成完整的机遇时间窗口而非刚性点标签，实现可扩展的主动性强化学习(RL)；(2) 系统性的主动性指标，同时捕获时序质量和参考动作对齐；(3) 使用GRPO及多种奖励设计的RL优化。我们的洞察是，基于RULER的奖励结合主动性评分准则对提升时序质量至关重要，而由阶段感知复合奖励实现的主动性优化是平衡时序质量和参考动作对齐的关键。时序感知RL需要大量探索，这要求高效的基础设施。我们开发了ART-F，一种自适应框架，将请求自适应推理集群与单节点多GPU系统上的DDP训练相结合，实现了4位Qwen2.5-14B-ProActor-Q4的LoRA训练，加速4-8倍。在两个新自动标注数据集上的实验表明，在保持与最先进(SOTA)基线相当的动作一致性的同时，主动时序显著提升。消融实验验证了不同复合奖励变体的有效性。

英文摘要

Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shifting from reactive systems that await explicit instructions. However, existing approaches lack generalizable end-to-end solutions for measuring and optimizing such anticipatory behaviors. This paper introduces ProActor, a unified framework for conversational task scheduling that integrates: (1) a domain-agnostic automated annotation methodology that enables scalable proactiveness reinforcement learning (RL) by generating full opportunity time windows instead of rigid point labels, (2) systematic proactiveness metrics capturing both timing quality and reference action alignment, and (3) RL optimization using GRPO with various reward designs. Our insight is that RULER-based rewards with proactiveness rubrics are crucial for improving timing quality, and that proactiveness optimization enabled by stage-aware composite rewards is key to balancing timing quality and reference action alignment. Timing-aware RL requires extensive exploration, demanding efficient infrastructure. We develop ART-F, an adaptive framework combining request-adaptive inference clusters with DDP-based training on single-node multi-GPU systems, enabling LoRA training of 4-bit Qwen2.5-14B-ProActor-Q4 with 4-8x speedups. Experiments on two newly auto-annotated datasets demonstrate significant improvements in proactive timing while maintaining action consistency comparable to state-of-the-art (SOTA) baselines. Ablations validate the effectiveness of distinct composite reward variations.

2605.24899 2026-05-26 cs.AI 版本更新

TaBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps

TaBIIC2：使用加权自组织映射交互式构建本体分类

Mathieu d'Aquin

发表机构 * LORIA, CNRS, Université de Lorraine（LORIA研究所、法国国家科学研究中心、洛林大学）

AI总结本文提出一种工具，通过加权自组织映射聚类方法，支持用户逐步交互式地从表格数据中构建概念分类，并定义概念的内涵，平衡了纯手动分析与自动方法。

AI中文摘要

本体表示一个领域的概念知识。本体的核心是概念和子概念的分类，这些概念代表特定实体，构建起来可能很复杂。在许多情况下，信息以记录形式提供，描述相关实体的特征，即表格数据。识别此类数据中的模式和相似性可以作为识别概念并组织它们的基础。然而，手动执行此操作可能具有挑战性，而纯自动方法（如凝聚聚类或依赖大型语言模型分析数据）可能会让用户面对大量结果且控制力不足。在本文中，我们描述了一种工具，通过识别聚类及其内涵定义，支持逐步交互式构建概念分类。为此，我们依赖加权自组织映射作为聚类方法，因为它们能够创建任意数量的聚类，这些聚类在聚类实体特定特征的值分布方面具有区分性。我们表明，通过集成这种机制和其他机制来快速创建将表格数据中的实例分组的概念，该工具代表了在纯手动分析和自动方法之间构建本体分类的中间地带。

英文摘要

Ontologies represent the conceptual knowledge of a domain. At the core of an ontology is the taxonomy of concepts and subconcepts that represent specific entities, which can be complex to build. In many cases, information is available in the form of records describing the characteristics of relevant entities, i.e., tabular data. Identifying patterns and similarities in such data can serve as a basis for identifying concepts and organizing them. However, doing so manually can be challenging, and purely automatic approaches, such as agglomerative clustering or relying on a large language model to analyze the data, can leave the user with overwhelming results and little control. In this paper, we describe a tool that enables the progressive and interactive construction of a taxonomy of concepts by identifying clusters as well as their intentional definitions. To do so, we rely on weighted self-organizing maps as a clustering method because they enable the creation of an arbitrary number of clusters that are distinct with respect to the distributions of values of specific characteristics of the clustered entities. We show that, by integrating this mechanism and others for rapidly creating concepts that group together instances from tabular data, this tool represents a middle ground between purely manual analysis and automatic methods for building ontological taxonomies.

2605.24883 2026-05-26 cs.AI cs.CR cs.SE 版本更新

Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications

反转盾牌：从策略规范中系统生成安全测试

Xiaoyue Lu, Xianglin Yang, Haijun Liu, Jiahao Liu, Kuntai Cai, Yan Xiao, Jin Song Dong

发表机构 * Shenzhen Campus of Sun Yat-sen University（中山大学深圳校区）； National University of Singapore（新加坡国立大学）； Independent Researcher（独立研究者）

AI总结提出POLARIS框架，通过将非结构化自然语言策略编译为一阶逻辑表示并构建语义策略图，实现覆盖驱动的可重复安全测试，相比基线方法提高了策略覆盖率和攻击成功次数。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

AI中文摘要

大型语言模型（LLMs）的广泛集成需要严格且系统的安全评估。现有范式要么依赖构建的基准从预定义角度评估安全性，要么采用动态红队探测潜在漏洞。虽然有效，但这些方法面临挑战，因为它们严重依赖专家领域知识，提供的系统保证有限，且容易快速过时。为解决这些限制，我们引入了一个新颖框架POLARIS，将基于规范的软件测试的严谨性引入AI安全。POLARIS首先将非结构化自然语言策略编译为一阶逻辑（FOL）表示，建立高层规则与具体测试用例之间的可追溯链接。这种形式化使得能够构建语义策略图，其中复杂的策略违规场景被编码为可遍历路径。通过系统地探索该图，POLARIS发现组合违规模式，然后将其实例化为可执行的自然语言测试查询，实现覆盖驱动且可重复的安全测试。实验表明，与已建立的基线相比，POLARIS实现了更高的策略覆盖率和攻击成功次数。关键是，通过连接形式化方法和AI安全，POLARIS提供了一种有原则的自动化方法，确保LLMs遵守安全关键策略，并具有可验证的可追溯性。我们在https://github.com/huac-lxy/POLARIS发布代码。

英文摘要

The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks to assess safety from predefined perspectives, or employ dynamic red-teaming to probe potential vulnerabilities. While effective, these approaches face challenges, as they depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce a novel framework POLARIS that brings the rigor of specification-based software testing to AI safety. POLARIS first compiles unstructured natural-language policies into First-Order Logic (FOL) representations, establishing a traceable link between high-level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, where complex policy violation scenarios are encoded as traversable paths. By systematically exploring this graph, POLARIS uncovers compositional violation patterns, which are then instantiated into executable natural-language test queries, enabling coverage-driven and reproducible safety testing. Experiments demonstrate that POLARIS achieves higher policy coverage and attack success counts compared to established baselines. Crucially, by bridging formal methods and AI safety, POLARIS provides a principled, automated approach to ensuring LLMs adhere to safety-critical policies with verifiable traceability. We release our code at https://github.com/huac-lxy/POLARIS.

2605.24873 2026-05-26 cs.CL cs.AI cs.LG 版本更新

视觉自回归生成的对抗性纠错

Ligong Bi, Tao Huang, Jianyuan Guo, Chang Xu

发表机构 * Shanghai Jiao Tong University（上海交通大学）； City University of Hong Kong（香港城市大学）； The University of Sydney（悉尼大学）

AI总结提出AID-VAR框架，通过对抗性注入诊断机制纠正视觉自回归模型中的级联误差，提升生成质量。

AI中文摘要

视觉自回归（VAR）模型通过执行层次化的下一尺度预测，已成为图像合成的强大范式。然而，VAR模型天生容易产生级联误差传播，其中细微的粗尺度误预测会在层次结构中放大，最终扭曲最终合成。为了缓解这一问题，我们提出了AID-VAR，一个即插即用的框架，通过对抗性注入诊断增强预训练的VAR。与标准的被动生成不同，AID-VAR引入了一种主动纠错机制，灵感来自GAN中的对抗性反馈。我们部署了一个判别器来诊断每个尺度转换处的保真度差距，并配有一个轻量级的引导注入器。该模块作为一个非侵入式适配器，优化冻结的VAR骨干网络的特征流形，有效引导生成朝向真实图像的分布，同时不破坏预训练潜在空间的稳定性。此外，为了严格评估这种跨尺度进展，我们引入了跨尺度一致性得分（ISCS），这是一个新的度量标准，用于量化连续分辨率尺度之间的保真度和结构对齐。在各种骨干网络上的实验结果表明，AID-VAR以可忽略的开销提供了更清晰的纹理细节和更少的结构失真。例如，AID-VAR-d20在参数仅增加3%的情况下，FID提升了16%。这些结果确立了AID-VAR作为升级大规模VAR生成器的高效且可扩展的途径，在不改变训练数据、基础架构或采样调度的情况下，增强了全局连贯性和局部细节。代码可在https://github.com/bijiw515/AID-VAR获取。

英文摘要

Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next-scale prediction. However, VAR models are inherently prone to cascading error propagation, where subtle coarse-scale mispredictions are amplified across the hierarchy, ultimately distorting the final synthesis. To mitigate this, we propose AID-VAR, a plug-and-play framework that enhances pre-trained VARs through Adversarially Injected Diagnosis. Instead of a standard passive generation, AID-VAR introduces a proactive error-correction mechanism inspired by the adversarial feedback in GANs. We deploy a discriminator to diagnose fidelity gaps at each scale transition, coupled with a lightweight guidance injector. This module operates as a non-invasive adapter that refines the feature manifold of a frozen VAR backbone, effectively steering the generation toward the distribution of real images without destabilizing the pre-trained latent space. Furthermore, to rigorously evaluate this cross-scale progression, we introduce the Inter-Scale Consistency Score (ISCS), a novel metric that quantifies the fidelity and structural alignment between consecutive resolution scales. Experimental results across various backbones demonstrate that AID-VAR delivers sharper textural details and fewer structural distortions with negligible overhead. For instance, AID-VAR-d20 achieves a 16% improvement in FID with only a 3% increase in parameters. These results establish AID-VAR as a highly efficient and scalable pathway for upgrading large-scale VAR generators, enhancing global coherence and local detail without altering training data, base architectures, or sampling schedules. Code is available at https://github.com/bijiw515/AID-VAR.

2605.24834 2026-05-26 cs.CR cs.AI 版本更新

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Reflect-Guard: 通过逻辑自我反思增强大语言模型对对抗性提示的防护

Lixing Lin, Juli You, Yue Li, Luyun Lin, Yiqing Wang, Zhen Zhang, Moxuan Zheng

发表机构 * Yale University（耶鲁大学）； Columbia University（哥伦比亚大学）； Citigroup（花旗集团）； Independent Researcher（独立研究者）

AI总结提出Reflect-Guard方法，通过参数高效微调为大语言模型安全分类器注入链式思维自我反思能力，显著提升对对抗性越狱攻击的检测性能。

Comments 12 pages, 2 figures, and 4 tables

AI中文摘要

大语言模型安全分类器（如Llama Guard）能有效检测明显有害的提示，但对通过角色扮演场景、虚构框架和间接请求伪装恶意意图的对抗性越狱攻击仍然脆弱。我们提出Reflect-Guard，一种通过参数高效微调为大语言模型安全分类器注入链式思维自我反思能力的方法。我们的方法从GPT-4o-mini中提炼分析推理能力，形成结构化反思注释，然后通过QLoRA训练Llama-Guard-3-8B，使其在发布安全判决前生成逻辑自我反思。仅使用1000个训练样本并更新0.5%的模型参数（约4200万），Reflect-Guard在两个具有挑战性的基准测试上取得了显著改进。在WildGuardTest上，F1分数从0.770提升至0.842（+7.2个百分点），对抗性提示的召回率从0.513提升至0.921（+40.8个百分点）。在JailbreakBench上，攻击成功率从10.3%降至1.8%，相对降低82.5%。这些增益在对抗性输入上尤为明显，显式的推理步骤使模型能够看穿击败标准模式匹配方法的混淆技术。我们的结果表明，教会安全分类器推理对抗性意图，而非简单分类表面模式，是实现鲁棒大语言模型安全性的有前景方向。

英文摘要

Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate logical self-reflections before issuing safety verdicts. Using only 1000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, F1 score improves from 0.770 to 0.842 (+7.2 pp), with recall on adversarial prompts increasing from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, representing an 82.5% relative reduction. These gains are especially pronounced on adversarial inputs, where the explicit reasoning step enables the model to see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.
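
The deltas quoted in the abstract can be re-derived from the raw figures it reports. The check below is only an arithmetic illustration; the helper names `pp_gain` and `relative_reduction` are invented for this sketch and do not come from the paper.

```python
def pp_gain(before, after):
    """Absolute change between two rates in [0, 1], in percentage points."""
    return round((after - before) * 100, 1)


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return round((before - after) / before * 100, 1)


# Figures as quoted in the abstract.
print(pp_gain(0.770, 0.842))          # F1 on WildGuardTest
print(pp_gain(0.513, 0.921))          # recall on adversarial prompts
print(relative_reduction(10.3, 1.8))  # attack success rate on JailbreakBench
```

The three printed values match the abstract's +7.2 pp, +40.8 pp, and 82.5% relative reduction.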

2605.24831 2026-05-26 cs.CV cs.AI 版本更新

Multiscale Real-Time Object Detection in the NMS-Free Era: A Comparative Performance Evaluation of YOLOv8 and YOLO26

无NMS时代的实时多尺度目标检测：YOLOv8与YOLO26的对比性能评估

Chidera G. Oguine, Kanyifeechukwu J. Oguine, Obiozor M. Oguine, Ozioma C. Oguine

发表机构 * University of Abuja（阿布贾大学）； Vanderbilt University（范德比大学）； University of Notre Dame（圣母大学）

AI总结本文在Pascal VOC和VisDrone数据集上，从准确率、定位、模型大小、计算量和延迟等维度，系统比较了基于NMS的YOLOv8与无NMS的YOLO26在多尺度下的性能，发现YOLO26在多数尺度上检测更强且模型复杂度更低，但在密集小目标场景下优势缩小，且YOLOv8在GPU延迟上仍有竞争力。

Comments 11 pages, 6 tables, 9 figures

AI中文摘要

非极大值抑制（NMS）仍然是许多实时目标检测流程中的关键后处理步骤，但在资源受限的环境中可能引入延迟变化和部署复杂性。最近的无NMS设计（如YOLO26）旨在通过端到端检测减少这种依赖，然而与基于NMS的成熟模型（如YOLOv8）相比，其性能在标准基准之外尚未得到充分探索。本文在Pascal VOC和VisDrone上比较了YOLOv8和YOLO26，这两个数据集分别代表通用目标检测和密集空中小目标检测。两个模型家族在五个尺度上使用准确率、定位、模型大小、GFLOPs以及CPU/GPU延迟进行评估。结果表明，YOLO26在Pascal VOC上的大多数尺度上实现了更强的检测性能和更低的模型复杂度，而在VisDrone上性能差距缩小，两个模型在处理密集小目标时均表现困难。YOLOv8在GPU延迟上仍具有竞争力，表明无NMS设计并不能保证普遍的部署优势。总体而言，研究表明检测器的选择取决于数据集特征、目标尺度、模型容量和硬件约束。

英文摘要

Non-Maximum Suppression (NMS) remains a key post-processing step in many real-time object detection pipelines, but it can introduce latency variation and deployment complexity in resource-constrained settings. Recent NMS-free designs such as YOLO26 aim to reduce this dependence through end-to-end detection, yet their performance relative to established NMS-based models such as YOLOv8 remains underexplored beyond standard benchmarks. This paper compares YOLOv8 and YOLO26 on Pascal VOC and VisDrone, representing general object detection and dense aerial small-object detection, respectively. Both model families are evaluated across five scales using accuracy, localization, model size, GFLOPs, and CPU/GPU latency. Results show that YOLO26 achieves stronger detection performance and lower model complexity on Pascal VOC across most scales, while the performance gap narrows on VisDrone, where both models struggle with dense small targets. YOLOv8 remains competitive in GPU latency, showing that NMS-free design does not guarantee universal deployment superiority. Overall, the study shows that detector selection depends on dataset characteristics, object scale, model capacity, and hardware constraints.

2605.24823 2026-05-26 cs.AI 版本更新

Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

Agent制造：基础模型Agent作为一级工业实体

Yilei Zhang

发表机构 * University of Canterbury（坎特伯雷大学）

AI总结本文提出Agent制造范式，即基础模型Agent通过解释开放目标、长程规划、调用工具和机器、与其他Agent及人类协商来协调生产，从而将工业中的人类协调认知工作自动化。

AI中文摘要

制造业已经历了四个广泛认可的范式——机械化、电气化、可编程自动化和智能制造——每个范式都定义了从人类转移到机器的工作类型。在每种情况下，有一层工业工作仍然基本上由人类完成：生产的协调认知，包括工程师、规划师和运营经理所执行的解释、分配、诊断、协商和治理工作。我们认为，第五次转型正在进行中，其中这一层（而非其下的物理或常规认知层）正是基于基础模型的自主Agent主要重新分配的对象。我们将这一范式命名为Agent制造，并操作性地定义：当一个制造系统的主要协调机制是由基础模型Agent执行的推理，这些Agent能够解释开放目标、在长周期内规划、调用工具和机器、并与其他Agent和人类协商时，该系统就是Agent制造的一个实例。这一定义比现有的认知制造或工业5.0文献更窄且更可证伪，并且它将该范式与经典的多Agent制造系统（后者仅在封闭协议空间内自主）明确区分开来。

英文摘要

Manufacturing has passed through four widely recognized paradigms - mechanization, electrification, programmable automation, and Smart Manufacturing - each defined by the kind of work it shifted from humans to machines. In every case, one layer of industrial work remained fundamentally human: the coordinative cognition of production, comprising the interpretive, allocative, diagnostic, negotiative, and governance work exercised by engineers, planners, and operational managers. We argue that a fifth transition is now underway in which this layer, rather than the physical or routine-cognitive layers below it, is what foundation-model-based autonomous agents primarily redistribute. We name this paradigm Agent Manufacturing and define it operationally: a manufacturing system is an instance of Agent Manufacturing when its principal coordination mechanism is reasoning performed by foundation-model agents that can interpret open-ended goals, plan over long horizons, invoke tools and machines, and negotiate with other agents and humans. This is a narrower and more falsifiable definition than the existing literature on cognitive manufacturing or Industry 5.0 provides, and it distinguishes the paradigm sharply from classical multi-agent manufacturing systems, which were autonomous only within closed protocol spaces.

2605.24812 2026-05-26 cs.AI 版本更新

面向大规模视觉识别的多模态大语言模型分治推理

Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao

发表机构 * Taizhou Institute of Science and Technology, Nanjing University of Science and Technology（泰州科技学院、南京理工大学）； Department of Intelligence Science, Xi’an Jiaotong-Liverpool University（智能科学系，西交利物浦大学）； School of Computer Science and Technology, Soochow University（计算机科学与技术学院，苏州大学）； Department of Statistical Sciences, University of Toronto（统计科学系，多伦多大学）

AI总结针对多模态大语言模型在长序列识别中性能崩溃的问题，提出分治推理（DCI）策略，通过递归分解任务和动态剪枝提升信噪比与分类精度。

AI中文摘要

多模态大语言模型（MLLMs）在广泛的视觉语言任务中展现了强大的能力。然而，当应用于大规模图像分类时，随着标签空间的扩大，其性能显著下降——我们将这一现象定义为长序列识别中的性能崩溃。通过信息论分析，我们揭示了这种崩溃源于不断增长的信息熵与注意力机制中显著的注意力稀释和衰减之间的根本冲突，这损害了模型在处理极长提示时维持足够信噪比的能力。为缓解这一问题，我们提出了分治推理（DCI），一种用于MLLMs视觉识别的新型测试时扩展策略。DCI递归地将复杂的全局分类任务分解为多个更简单的局部子问题，并采用动态剪枝机制压缩搜索空间。该方法通过缓解长序列推理中固有的权重稀释问题，有效提高了局部信噪比和模型精度。此外，传统自注意力具有难以承受的二次计算复杂度，而DCI在大规模分类场景中实现了更有利的扩展行为并显著加速推理。在ImageNet-1K和ImageNet-21K等基准上的大量实验表明，DCI持续提高了分类精度。这使得轻量级开源模型无需任何额外训练或微调即可与甚至超越前沿闭源巨头。作为一种模型无关、即插即用的范式，DCI为在大规模场景中扩展MLLMs的推理精度提供了一种高效方法。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model's ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
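
The recursive decomposition described in the abstract can be pictured as a tournament over label groups. The following is a minimal sketch under stated assumptions, not the paper's implementation: `pick_top` is a hypothetical callable standing in for one MLLM query over a short candidate list, and `group_size` and `keep` are illustrative parameters.

```python
def divide_and_conquer_classify(labels, pick_top, group_size=10, keep=1):
    """Shrink a large label set by solving small local sub-problems.

    pick_top(group, keep) is a hypothetical stand-in for one model call
    that returns the `keep` most plausible labels from `group`.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    if not 1 <= keep < group_size:
        raise ValueError("need 1 <= keep < group_size so each round prunes")
    candidates = list(labels)
    # Each round splits the survivors into short groups and prunes each one.
    while len(candidates) > group_size:
        survivors = []
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            survivors.extend(pick_top(group, keep))
        candidates = survivors
    # Final decision over a list that fits in one short prompt.
    return pick_top(candidates, 1)[0]


# Toy usage: a scorer that "knows" the right class replaces the model call.
labels = [f"class_{i}" for i in range(1000)]
score = lambda label: 1.0 if label == "class_137" else 0.0
toy_pick = lambda group, keep: sorted(group, key=score, reverse=True)[:keep]
print(divide_and_conquer_classify(labels, toy_pick))
```

In this sketch no single call sees more than `group_size` labels, which is the property the abstract attributes to the local sub-problems; how DCI actually forms groups and prunes is not specified in the abstract.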

2605.24792 2026-05-26 cs.CV cs.AI 版本更新

Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

用于胃肠内窥镜的参数高效视觉语言模型：医学图像生成与临床视觉问答

Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman

发表机构 * Computer Science Department, Morgan State University（摩根州立大学计算机科学系）； International Organization for Migration (IOM)（国际移民组织）； Electrical & Computer Engineering Department, Morgan State University（摩根州立大学电气与计算机工程系）

AI总结提出双流水线参数高效微调模型，结合Florence-2和LoRA Stable Diffusion，分别解决临床视觉问答和隐私保护合成数据生成问题，在Kvasir-VQA数据集上取得高ROUGE和BLEU分数，并显著降低计算成本。

DOI: 10.1109/BHI67747.2025.11269532

AI中文摘要

胃肠内窥镜AI系统的主要局限性源于标注数据短缺、严格的隐私政策以及传统模型微调中的显著瓶颈。这些限制阻碍了复杂AI模型在临床实践中的成功应用，尤其影响了诊断的可靠性和可扩展性。在本文中，我们提出了一种双流水线PEFT模型，解决了两个基本问题：医学视觉问答（VQA）和隐私保护合成数据的生成。对于临床VQA，我们采用Florence-2视觉语言模型。利用PEFT增强了模型的可解释性，同时大幅降低了训练的计算成本。同时，我们使用低秩适应（LoRA）与Stable Diffusion 2.1生成高质量的胃肠图像，在不违反患者隐私的情况下增强训练数据库。本研究使用了Kvasir-VQA数据集。我们的Florence-2 VQA模型实现了ROUGE-1为0.92，ROUGE-L为0.91，BLEU分数从0.08提升到0.24。在私有数据集上的微调始终优于在公共数据集上的微调。秩为4的LoRA合成达到了最优性能，保真度得分为0.290，一致性得分为0.730，Frechet BiomedCLIP距离（FBD）为1450，计算成本降低了近90%。该框架提高了AI在胃肠内窥镜中的临床潜力。与FLUX、MSDM和Kandinsky 2.2相比，我们的模型表现出更优的FBD和强语义对齐。虽然其他模型在保真度或一致性上领先，但我们更低的FBD表明更好的图像-文本一致性。这些结果确立了我们的方法作为增强临床AI中VQA和合成数据生成的稳健解决方案。

英文摘要

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.

2605.24786 2026-05-26 cs.LG cs.AI 版本更新

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

CONF-KV：面向长序列LLM的置信度感知KV缓存淘汰与混合精度存储

Yubo Li, Yidi Miao

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结提出CONF-KV方法，利用模型当前不确定性（置信度）动态调整KV缓存预算，结合混合精度存储和分块在线softmax注意力，在长序列推理中显著降低显存占用并保持高精度。

AI中文摘要

长序列LLM推理使键值（KV）缓存成为GPU内存的主要消耗者，并使每个token的注意力计算越来越昂贵。许多常见的淘汰策略使用静态的最近窗口或历史注意力，忽略了每个解码步骤中计算出的一个信号：模型当前的不确定性。我们引入CONF-KV，一个KV缓存管理器，它将下一个token分布转换为标量置信度分数，并用它来选择每步缓存预算，在模型不确定时保留更多上下文，在模型确定时积极剪枝。在每个预算内，token根据累积注意力质量和最近性的组合进行排序，同时一个受保护的最近窗口保持局部连贯性。我们将该策略与分块在线softmax注意力、混合FP16/INT8存储以及金字塔式逐层预算变体相结合。在四个模型家族和生成长度高达4K的情况下，CONF-KV的显存占用接近固定的512 token滑动窗口，同时与完整KV相比，困惑度差异保持在1.5-2.1点以内。在长达32K token的“大海捞针”测试中，CONF-KV的检索准确率达到91.4%，而滑动窗口为53.8%，H2O为80.6%；在75个VisualWebArena任务中，它以2.8倍的峰值内存降低保留了完整KV成功率的95.3%。

英文摘要

Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5-2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
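
The policy described above (confidence score, per-step budget, composite ranking, protected recent window) can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: confidence is taken as one minus normalized entropy, the budget is a linear interpolation between hypothetical `min_budget` and `max_budget` values, and `alpha` is an invented mixing weight. The abstract does not give the paper's actual formulas.

```python
import math


def confidence_from_logits(logits):
    """Scalar confidence in [0, 1]: one minus normalized entropy (assumed form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(probs))
    return 1.0 - entropy / max_entropy if max_entropy > 0 else 1.0


def budget_for_step(confidence, min_budget=128, max_budget=512):
    """Uncertain steps get a larger cache budget, confident steps a smaller one."""
    return int(round(max_budget - confidence * (max_budget - min_budget)))


def select_tokens(attn_mass, budget, recent_window=4, alpha=0.7):
    """Return the sorted indices of cached tokens to keep.

    The last `recent_window` tokens are always kept, even if that alone
    exceeds `budget`; the remaining slots go to the highest-scoring tokens.
    """
    n = len(attn_mass)
    if budget >= n:
        return list(range(n))
    protected = set(range(max(0, n - recent_window), n))
    total = sum(attn_mass) or 1.0

    def score(i):
        recency = (i + 1) / n  # newer tokens score closer to 1
        return alpha * attn_mass[i] / total + (1 - alpha) * recency

    older = sorted((i for i in range(n) if i not in protected),
                   key=score, reverse=True)
    extra = max(0, budget - len(protected))
    return sorted(protected | set(older[:extra]))
```

A decoding loop would call `confidence_from_logits` on each step's logits, pass the result to `budget_for_step`, and evict every cached token not returned by `select_tokens`.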

2605.24784 2026-05-26 cs.AI 版本更新

GRAIL: AI translation for scientists application workflow on satellite data

GRAIL：面向卫星数据科学家应用工作流的AI翻译

Zhuocheng Shang, Ahmed Eldawy

发表机构 * University of California, Riverside（加州大学河滨分校）

AI总结提出GRAIL系统，通过LangGraph管道将Python地理空间工作流翻译为可扩展的Spark程序，无需科学家学习新框架。

2605.24779 2026-05-26 cs.LG cs.AI math.CO 版本更新

Complement Submodular Information Measures for Balanced and Robust Data Selection

互补子模信息度量用于平衡和鲁棒的数据选择

Rishabh Iyer

发表机构 * The University of Texas at Dallas（德克萨斯大学达拉斯分校）

AI总结提出互补子模信息（CSI）目标函数，通过建模子集与其补集之间的共享结构信息，实现平衡且鲁棒的数据选择，并在理论上证明其近似单调性和贪心近似保证，实验表明在鲁棒隐藏切片感知子集选择中优于经典子模目标。

AI中文摘要

子模优化已成为数据选择、检索、摘要和表示学习的基本范式，因为它能够建模覆盖度、多样性和代表性。然而，经典子模目标仅优化所选子集，并未明确保留所选子集与剩余数据之间的结构信息。在许多现代机器学习应用中，包括训练/验证/测试分割、基准构建和鲁棒子集选择，选择的质量关键取决于在所选子集及其补集之间保持平衡结构。在这项工作中，我们引入了互补子模信息（CSI），这是一类新的互补感知子模目标，用于量化子集与其补集之间的共享结构信息。我们的框架产生了几个经典子模函数的互补感知变体，包括设施选址、图割、LogDet、饱和覆盖、集合覆盖、概率集合覆盖和基于特征函数。我们分析了CSI目标的理论性质，并表明它们在有限曲率条件下表现出近似单调性，从而得到接近$(1-1/e)$的贪心近似保证。实验上，CSI目标在鲁棒隐藏切片感知子集选择中始终优于标准子模目标。特别是，CSI目标显著改善了相干稀有/尾部语义结构的保留，同时抑制了噪声和孤立异常值，从而显著提高了下游预测性能。合成实验进一步说明了不同的CSI实例如何捕获代表性、多样性、连通性和平衡邻域保留的互补概念。

英文摘要

Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to its ability to model coverage, diversity, and representativeness. However, classical submodular objectives optimize only the selected subset and do not explicitly preserve structural information between the selected subset and the remaining data. In many modern machine learning applications, including train/validation/test splitting, benchmark construction, and robust subset selection, the quality of a selection depends critically on preserving balanced structure across both the selected subset and its complement. In this work, we introduce Complement Submodular Information (CSI), a new class of complement-aware submodular objectives that quantify shared structural information between a subset and its complement. Our framework induces complement-aware variants of several classical submodular functions including Facility Location, Graph Cut, LogDet, Saturated Coverage, Set Cover, Probabilistic Set Cover, and Feature Based Functions. We analyze the theoretical properties of CSI objectives and show that they exhibit approximate monotonicity under bounded curvature conditions, leading to near-$(1-1/e)$ greedy approximation guarantees. Empirically, CSI objectives consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection. In particular, CSI objectives significantly improve preservation of coherent rare/tail semantic structure while simultaneously suppressing noisy and isolated outliers, leading to substantially improved downstream predictive performance. Synthetic experiments further illustrate how different CSI instantiations capture complementary notions of representativeness, diversity, connectivity, and balanced neighborhood preservation.
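
For context, the classical baseline that the abstract contrasts with CSI can be sketched as greedy maximization of the Facility Location function. This is the standard objective, not the complement-aware CSI variant, whose definition the abstract does not give; the similarity matrix below is made-up toy data.

```python
def facility_location(sim, subset):
    """Classical Facility Location: each point is credited with its best
    similarity to any selected representative."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, k):
    """Pick k items by repeatedly adding the one with the largest marginal gain."""
    selected = []
    remaining = set(range(len(sim)))
    for _ in range(min(k, len(sim))):
        base = facility_location(sim, selected)
        best = max(sorted(remaining),
                   key=lambda j: facility_location(sim, selected + [j]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy similarity matrix: items 0-1 form one cluster, items 2-4 another.
sim = [
    [1.0, 0.8, 0.1, 0.1, 0.0],
    [0.8, 1.0, 0.1, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.7, 0.3],
    [0.1, 0.1, 0.7, 1.0, 0.7],
    [0.0, 0.0, 0.3, 0.7, 1.0],
]
print(greedy_select(sim, 2))
```

On this toy input the greedy pass first takes item 3, the center of the larger cluster, and then a representative of the other cluster. For monotone submodular functions such as this one, greedy selection carries the classical $(1-1/e)$ guarantee that the abstract's near-$(1-1/e)$ result for CSI relaxes.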

2605.24775 2026-05-26 cs.AI cs.MA 版本更新

PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

PRIMA: 具有可验证身份和收敛反馈的弹性多智能体研究的操作模式

Sasank Annapureddy

发表机构 * GitHub

AI总结针对长时间运行的多智能体LLM系统面临的故障模式，提出PRIMA框架，包含弹性恢复、子智能体操作规范和结构化工程交付的多阶段应用模式，并通过图同构案例验证其有效性。

Comments 11 pages. Single-author preprint. Supplementary case-study report (Graph Isomorphism algorithm proposal with three theorems, five conjectures, complete complexity analysis, and hard-instance evaluation) available at https://spockstein.github.io/prima/case-study-graph-isomorphism.html

AI中文摘要

将LLM作为协调的多智能体研究系统运行数小时，会暴露出单次评估无法发现的故障模式：上游提供商无预警地限制服务，子智能体使任务偏离以适应可用工具，叙述机制而非使用它，以自我道歉开始修订迭代，或将上游上下文视为可执行指令。我们提出PRIMA，其主要贡献是三种应对这些故障模式的操作模式：(1) 弹性与恢复层，检测上游速率限制信号，将类型化的暂停记录持久化到磁盘，并在进程重启后恢复长时间运行的任务而不重新执行已收敛的工作；(2) 子智能体操作规范，将任务保真度、工具使用、修订和步骤间上下文边界规范编码为结构化的提示层；(3) 用于结构化工程交付的多阶段应用模式，将正交的草稿步骤与最终综合前的显式跨文档协调过程配对。这些模式基于一个基础协议：具有显式收敛标准的研究程序规范语言、双指标评分引擎（LLM评判的评分标准加沙盒代码）、外部元优化循环、事件驱动持久化、基于钩子的中间件、上下文压缩和多提供商LLM抽象。智能体身份来源于素数幂，提供无冲突标识符和无需中央注册表的可轻松验证的集群成员资格。理论保证包括$O(k)$验证、$O(V+E)$ DAG验证以及由算术基本定理保证的身份无冲突。一个图同构案例研究将架构主张落实到生成的产物中：一个六步协议，产生了一篇研究论文，提出了一种新的规范形式算法，包含三个定理和五个猜想。

英文摘要

Operating LLMs as coordinated multi-agent research systems over multi-hour runs surfaces failure modes that single-shot evaluation cannot: upstream providers throttle without warning, sub-agents drift the task to fit accessible tools, narrate machinery instead of using it, open revision iterations with self-apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience-and-recovery layer that detects upstream rate-limit signals, persists a typed pause record to disk, and resumes long-running runs without re-executing converged work even across process restarts; (2) a sub-agent operating discipline encoding task-fidelity, tool-use, revision, and inter-step context-boundary norms as a structural prompt layer; (3) a multi-phase application pattern for structured engineering deliverables pairing orthogonal draft steps with an explicit cross-document harmonization pass before final synthesis. These sit atop a foundational protocol: a research-program specification language with explicit convergence criteria, a dual-metric scoring engine (LLM-judged rubric plus sandboxed code), an outer meta-optimization loop, event-driven persistence, hook-based middleware, context compaction, and a multi-provider LLM abstraction. Agent identities derive from prime powers, giving collision-free identifiers and trivially-verifiable cluster membership without a central registry. Theoretical guarantees include $O(k)$ verification, $O(V+E)$ DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six-step protocol that produced a research paper proposing a new canonical-form algorithm with three theorems and five conjectures.
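
One way to read the prime-power identity claim is sketched below. The scheme is an assumption made for illustration: each cluster owns a distinct prime $p$ and its $k$-th agent gets the identifier $p^k$. The abstract does not spell out PRIMA's actual construction, and the function names are hypothetical.

```python
def agent_id(cluster_prime, k):
    """Identifier of the k-th agent (k >= 1) in the cluster owning `cluster_prime`."""
    if cluster_prime < 2 or k < 1:
        raise ValueError("need a prime >= 2 and k >= 1")
    return cluster_prime ** k


def membership_index(identifier, cluster_prime):
    """Return k if identifier == cluster_prime**k with k >= 1, else None.

    Uses k divisions, consistent with an O(k) verification cost.
    """
    if identifier < cluster_prime:
        return None
    k = 0
    while identifier % cluster_prime == 0:
        identifier //= cluster_prime
        k += 1
    return k if identifier == 1 else None


print(agent_id(3, 4))              # 81
print(membership_index(81, 3))     # 4
print(membership_index(81, 2))     # None
```

Under this reading, collision freedom follows from unique factorization: $p^a = q^b$ for primes $p, q$ forces $p = q$ and $a = b$, so no two agents share an identifier and no central registry is needed to check membership.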

2605.24773 2026-05-26 cs.AI 版本更新

Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

通过循环SG-MCMC和软标签学习进行主观NLP中的不确定性分解

Keito Inoshita, Takato Ueno

发表机构 * Faculty of Business and Commerce（商科学部）； Data Science and AI Innovation Research Promotion Center（数据科学与人工智能创新研究促进中心）； Graduate School of Data Science（数据科学研究生院）

AI总结提出结合循环随机梯度马尔可夫链蒙特卡洛（cSG-MCMC）与软标签学习的方法，在情感分类中沿多个轴评估不确定性，并在GoEmotions基准上优于现有方法。

AI中文摘要

情感分类中标注者的分歧反映了情感概念固有的模糊性，对于主观NLP中的预测质量评估至关重要。然而，先前没有工作将软标签学习与贝叶斯深度学习相结合，以评估包括标注者分布保真度在内的多个轴上的不确定性。我们在冻结的RoBERTa上通过循环随机梯度马尔可夫链蒙特卡洛（cSG-MCMC）训练一个线性头，在五轴评估下以软标签目标针对经验标注者分布。在28情感的GoEmotions基准上，所提出的方法在三个轴上同时优于蒙特卡洛Dropout和深度集成——标注者分布的Jensen-Shannon散度（JSD）、每个情感偶然不确定性与分歧之间的Spearman相关性，以及选择性预测的风险-覆盖率曲线下面积（AURC）和ROC曲线下面积（AUROC）——表明独立的轴可以从一个后验中联合获得。事后温度缩放表现出双向效应，建立了硬标签校准和标注者JSD作为独立维度，并激励联合报告作为诚实协议。

英文摘要

Annotator disagreement in emotion classification reflects ambiguity intrinsic to emotion concepts and is essential for predictor-quality assessment in subjective NLP. Yet no prior work integrates soft-label learning with Bayesian deep learning to evaluate uncertainty along axes including annotator-distribution fidelity. We train a linear head on a frozen RoBERTa via cyclical stochastic gradient Markov chain Monte Carlo (cSG-MCMC), targeting the empirical annotator distribution with a soft-label objective under a five-axis evaluation. On the 28-emotion GoEmotions benchmark, the proposed method outperforms Monte Carlo Dropout and Deep Ensemble simultaneously on three axes: Jensen-Shannon divergence (JSD) to the annotator distribution; Spearman correlation between per-emotion aleatoric uncertainty and disagreement; and selective-prediction Area Under the Risk-Coverage Curve (AURC) and Area Under the ROC Curve (AUROC). This shows that independent axes are jointly attainable from one posterior. Post-hoc temperature scaling exhibits a bidirectional effect, establishing hard-label calibration and annotator-JSD as independent dimensions and motivating joint reporting as an honest protocol.
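
The annotator-distribution axis can be made concrete with a small example. The sketch below builds a soft label from annotator votes and measures Jensen-Shannon divergence in bits; treating each annotator as casting exactly one vote per item is a simplifying assumption of this illustration and may differ from the paper's handling of GoEmotions annotations.

```python
import math


def soft_label(votes, num_classes):
    """Empirical annotator distribution from one class index per annotator."""
    counts = [0] * num_classes
    for v in votes:
        counts[v] += 1
    return [c / len(votes) for c in counts]


def kl_bits(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, mix) + 0.5 * kl_bits(q, mix)


target = soft_label([0, 0, 1], num_classes=3)   # two annotators chose class 0
hard_prediction = [1.0, 0.0, 0.0]
print(jsd(target, target))           # identical distributions
print(jsd(target, hard_prediction))  # penalty for ignoring the minority vote
```

A one-hot prediction on the majority class is correct under hard-label accuracy but still incurs a nonzero JSD, which is why the abstract treats the two as separate evaluation dimensions.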

2605.24771 2026-05-26 cs.CV cs.AI cs.LG 版本更新

From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks

从理论到决策规则：校准视觉-语言模型弱监督的噪声标签交叉点——基于三个医学影像基准

Bruce Changlong Xu, Jose James, Alexander Ryu

发表机构 * Department of Computer Science, Stanford University（计算机科学系，斯坦福大学）

AI总结通过三个医学影像基准校准理论预测的噪声标签交叉点，提出基于少量金标标签的决策规则。

Comments 5 pages, 2 figures, 4 tables

AI中文摘要

经典的噪声标签理论预测，弱监督下的下游性能上限是标注者的准确率，这意味着一个尖锐的交叉点：一旦金标训练的分类器达到标注者的水平，弱标签就会从帮助变为伤害。该预测是理论性的；缺少的是将其转化为现代基础模型标注者的实例级陈述的基准校准。我们针对BiomedCLIP生成的弱标签，在三个医学影像基准（PCAM、ISIC、NIH-CXR）和六个跨越11倍参数范围的下游架构上提供了这样的校准。理论预测的交叉点出现在PCAM上约100个样本，ISIC上20-50个，NIH-CXR上250-500个；交叉点以上的弱标签使AUC降低高达-0.10。对于五个预训练架构中的四个，交叉点位置与架构无关，而一个家族内的DenseNet扫描（2.5倍参数，相同预训练）支持了标注者（而非学生）是主要约束的观点。该校准进而产生一个可在10-20个金标标签下操作的决策规则：比较仅金标AUC与用户金标集上的VLM准确率。NIH-CXR上的结构化与随机噪声符号翻转表明，该界限的仅速率形式是不完整的，并确定了一个具体的改进（标签空间投影），未来的基准可以设计来测试它。

英文摘要

Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop helping and start hurting. The prediction is theoretical; what is missing is a benchmark calibration that turns it into an instance-level statement for modern foundation-model labelers. We provide such a calibration for BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) and six downstream architectures spanning an 11x parameter range. The crossover predicted by theory appears at $n_g \approx 100$ on PCAM, 20-50 on ISIC, and 250-500 on NIH-CXR; weak labels above the crossover degrade AUC by up to 0.10. The location is architecture-invariant for four of five pretrained architectures, and a within-family DenseNet sweep (2.5x parameters, identical pretraining) supports the view that the labeler, not the student, is the dominant constraint. The calibration in turn produces a decision rule operable from 10-20 gold labels: compare gold-only AUC to VLM accuracy on the user's gold set. A structured-vs-random noise sign flip on NIH-CXR shows that the rate-only formulation of the bound is incomplete and identifies a concrete refinement (label-space projection) that future benchmarks can be designed to test.
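
The decision rule stated in the abstract reduces to a single comparison. The sketch below is one plausible reading of it, with hypothetical function names; thresholds, tie handling, and how the gold-only AUC is estimated from so few labels are not specified in the abstract and are left to the caller.

```python
def labeler_accuracy(vlm_labels, gold_labels):
    """Fraction of the user's gold set on which the VLM labeler agrees."""
    if not gold_labels or len(vlm_labels) != len(gold_labels):
        raise ValueError("need two non-empty label lists of equal length")
    agree = sum(v == g for v, g in zip(vlm_labels, gold_labels))
    return agree / len(gold_labels)


def weak_labels_expected_to_help(gold_only_auc, vlm_accuracy):
    """Below the crossover the gold-only model is still weaker than the
    labeler, so weak labels are expected to help; at or above it they are not."""
    return gold_only_auc < vlm_accuracy


acc = labeler_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(acc)
print(weak_labels_expected_to_help(gold_only_auc=0.70, vlm_accuracy=acc))
print(weak_labels_expected_to_help(gold_only_auc=0.80, vlm_accuracy=acc))
```

With a labeler that agrees with three of four gold labels, a gold-only model at 0.70 AUC sits below the crossover and one at 0.80 sits above it.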

2605.24769 2026-05-26 cs.CV cs.AI eess.IV (updated version)

Leveraging pretrained RGB denoisers for hyperspectral image restoration

Daniele Picone, Mohamad Jouni, Mauro Dalla-Mura

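Affiliations: Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab

AI summary: Proposes a lightweight adapter that reuses a frozen pretrained RGB denoiser through projection mappings to perform hyperspectral image denoising, deblurring, and super-resolution; experiments indicate that RGB priors transfer well.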

2605.24764 2026-05-26 cs.IR cs.AI cs.CL (updated version)

Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning

Shresth Verma, Mauricio Tec, Cheol Woo Kim, Kai Wang, Milind Tambe

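Affiliations: Harvard University; Georgia Institute of Technology

AI summary: Proposes BOOST, a bilevel optimization framework that combines inner-level weighted training with an outer-level lightweight reweighting head to address the drop in multi-turn LLM performance caused by the heterogeneous quality of synthetic trajectories.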

English abstract

While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) offers a scalable approach, yet its performance hinges on the availability and quality of multi-turn trajectory data. A common remedy is to augment training with synthetic trajectories generated by LLMs or simulators, but synthetic data is highly heterogeneous in quality, and naively treating all trajectories as equally informative can degrade performance. We propose BOOST, a bilevel optimization framework where the inner level trains the LLM on reweighted data and the outer level trains a lightweight reweighting head on held-out real validation tasks, assigning continuous trajectory-level weights without requiring an external judge. To ground this approach, we derive a PAC-Bayesian bound revealing a three-way trade-off: synthetic data increases diversity but risks task-shift, while concentrating weight on high-quality trajectories improves empirical performance at the cost of effective sample size. Empirically, our method consistently outperforms multiple baselines. Analysis reveals it upweights synthetic trajectories that align with the real data distribution and exhibit higher qualitative merit.
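The trade-off between concentrating weight on a few trajectories and keeping a useful effective sample size can be made concrete with a small sketch. It assumes Kish's definition of effective sample size, which is one common choice; the abstract does not say which quantity appears in the PAC-Bayesian bound, and the function names are hypothetical.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    squares = sum(w * w for w in weights)
    if squares == 0:
        raise ValueError("at least one weight must be non-zero")
    return total * total / squares


def weighted_loss(losses, weights):
    """Weight-normalized average of per-trajectory losses (inner-level objective sketch)."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


uniform = [1.0, 1.0, 1.0, 1.0]  # every trajectory counts equally
peaked = [1.0, 0.0, 0.0, 0.0]   # all weight on a single trajectory
print(effective_sample_size(uniform))  # 4.0
print(effective_sample_size(peaked))   # 1.0
```

Uniform weights keep all four trajectories in play, while the peaked weighting trains on what is effectively a single example, which is the cost the abstract attributes to concentrating weight on high-quality trajectories.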

2605.24737 2026-05-26 cs.CL cs.AI cs.CY (updated version)

Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

Jehanne Dussert

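Affiliations: Independent Researcher

AI summary: Argues that AI compliance is currently treated as a binary audit-time verdict rather than a continuously measurable property of production systems, proposes the governance-from-metrics principle, and presents govllm, an open-source framework that monitors compliance continuously from runtime observability signals; the evaluation supports a multi-model jury design for regulatory assessment.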

Comments 41 pages, 8 figures, preprint

English abstract

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges - LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility) - whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach through a ground truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B-7B parameters) running fully on-premise. Agreement rates range from 51.5% (mistral:7b) to 69.1% (phi4-mini), with no single model dominating across all criteria - empirically motivating the Profile-as-jury design. We further document three structural failure modes in small regulatory judges and a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.
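The idea of treating inter-judge disagreement as an uncertainty signal that warrants human arbitration can be illustrated with a short sketch. It does not use or reproduce the govllm API: the function name, the verdict labels, the judge names, and the 0.75 agreement threshold are all assumptions made for illustration.

```python
from collections import Counter


def panel_decision(verdicts, min_agreement=0.75):
    """Aggregate judge verdicts; escalate to a human when agreement is too low.

    verdicts maps a (hypothetical) judge name to the label that judge returned.
    """
    if not verdicts:
        raise ValueError("the panel needs at least one verdict")
    label, votes = Counter(verdicts.values()).most_common(1)[0]
    agreement = votes / len(verdicts)
    if agreement < min_agreement:
        # Disagreement is kept as a signal rather than averaged away.
        return {"decision": "human_arbitration", "agreement": agreement}
    return {"decision": label, "agreement": agreement}


split_panel = {
    "judge_a": "compliant",
    "judge_b": "non_compliant",
    "judge_c": "compliant",
    "judge_d": "non_compliant",
}
print(panel_decision(split_panel)["decision"])  # human_arbitration
```

The threshold controls how often a human is pulled in; a stricter value sends more cases to arbitration.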

2605.24719 2026-05-26 cs.CL cs.AI (updated version)

World-State Transformations for Neuro-symbolic Interactive Storytelling

Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás

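AI summary: Explores using LLMs within a neuro-symbolic architecture to predict world-state transformations in a rule-based system, addressing the story-coherence problems of purely LLM-based approaches; experiments suggest the approach keeps the world state consistent while encouraging creative player input.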

Comments To be presented at the 17th International Conference on Computational Creativity (ICCC'26)

English abstract

Large Language Models (LLMs) have changed the possibilities of Interactive Storytelling systems that process free-text user input. However, as more of these systems are built, evidence continues to mount regarding the story coherence problems that arise when relying solely on them. Recent research suggests that LLMs can effectively predict state changes within rule-based Interactive Storytelling systems, triggering pre-programmed world-state transformations. In this paper, we conduct an exploratory evaluation of whether such transformations can serve as a catalyst for player expression while aiming to address the incoherence issues typical of purely LLM-based approaches. Building upon a neuro-symbolic architecture, we conducted experiments using an open-source model (Llama 3 70B) and a closed-source model (Gemini 1.5 Flash), with testing conducted in both English and Spanish. Eight participants played two scenarios, carefully designed to assess different evaluation objectives. Our observations suggest that transformations offer a way to maintain world-state consistency while encouraging players to interact creatively through their written inputs.

2605.24703 2026-05-26 cs.CL cs.AI (updated version)

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

Liying Han, Kang Yang, Oliver Wang, Jason Wu, Pengrui Quan, Gaofeng Dong, Ozan Baris Mulayim, Sizhe Ma, Yuyang Yuan, Dezhi Hong, Mario Berges, Mani Srivastava

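Affiliations: University of California, Los Angeles; Samsung Research America; Carnegie Mellon University; Microsoft; Amazon

AI summary: Proposes the TS-Skill benchmark, which diagnoses signal-level capabilities in time-series question answering through three composable analytical skills (temporal scale selection, temporal localization, and cross-interval integration), and develops the SKEvol framework to construct the benchmark automatically; experiments reveal capability gaps that differ by skill.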

English abstract

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

2605.24699 2026-05-26 cs.AI cs.LG (updated version)

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

Roberto Cruz, David Rey-Blanco

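Affiliations: TietAI

AI summary: Proposes MDIA, a multi-agent diagnostic system built as a 7-node specialty-routed clinical reasoning graph, which improves HealthBench Professional performance by 3.72 percentage points on a non-fine-tuned LLM; the gain is attributed to system architecture rather than prompt engineering.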

Comments 33 pages, 10 figures

English abstract

Reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, and evaluate it on the full HealthBench Professional benchmark (n = 525) with a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT-5.4-2026-03-05, which is 3.72 pp above the performance of OpenAI's ChatGPT for Clinicians. The experimental work shows that the performance lift is attributable to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped by both the underlying foundation model and the orchestration architecture. Nevertheless, we also observed notable differences when using other models as graders; in particular, when using Gemini 2.5 Pro, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

2605.24697 2026-05-26 cs.CL cs.AI (updated version)

The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models

Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu, Philip Torr, Pietro Liò, Jialin Yu

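Affiliations: Department of Computer Science and Technology, University of Cambridge; Department of Engineering Science, University of Oxford

AI summary: Proposes TraceLock, a lightweight plug-in controller that learns a reusable trace-state policy for token-commitment decisions in diffusion language models, improving the trade-off between quality and step count.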

English abstract

Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen-generator decoders largely rely on hand-designed confidence rules or block-specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace-state policy. We introduce TraceLock, a lightweight plug-in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding. Code is available at https://github.com/BobSun98/TraceLock.
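The future-stability labeling rule described in this abstract (a proposed token at step t for position i is stable if it matches the final token at position i) can be sketched as below. This is not taken from the TraceLock repository: the function name and the trace layout, a list of per-step proposal lists with `None` for positions that are not active at that step, are assumptions made for illustration.

```python
def future_stability_labels(proposals, final_tokens):
    """Label each proposal 1 if it equals the final token at its position, else 0.

    proposals[t][i] is the token proposed for position i at decoding step t,
    or None when position i has no active proposal at that step.
    """
    labels = []
    for step in proposals:
        labels.append(
            [
                None if token is None else int(token == final_tokens[i])
                for i, token in enumerate(step)
            ]
        )
    return labels


trace = [
    ["the", "dog", None],   # step 0: position 1 will later change
    ["the", "cat", "sat"],  # step 1: matches the final sequence
]
final = ["the", "cat", "sat"]
print(future_stability_labels(trace, final))  # [[1, 0, None], [1, 1, 1]]
```

Labels of this kind only become available once the full trace has finished, which is why the abstract describes them as self-supervision derived from the completed decoding run.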

2605.24687 2026-05-26 cs.CV cs.AI (updated version)

HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing

Ruyi Chen, Lu Zhou, Xiaogang Xu, Chiyu Zhang, Jiafei Wu, Liming Fang

Affiliations: Nanjing University of Aeronautics and Astronautics; School of Software Technology, Zhejiang University, Ningbo, China; Ningbo Global Innovation Center, Zhejiang University, Ningbo, China; Collaborative Innovation Center of Novel Software Technology and Industrialization

AI summary: Proposes the HoloFair benchmark framework, which evaluates the fairness of text-to-image models with the Multi-attribute, Group-wise Bias Index (MGBI), and introduces Fair-GRPO, a reinforcement-learning-based debiasing method that significantly improves multidimensional fairness on SD3.5-Medium while preserving image quality.

Comments Accepted to ICML 2026. Code and dataset are available at https://github.com/1059684669/HoloFair

English abstract

Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single-dimensional biases and lack the perspective needed to uncover model biases at deeper, socially relevant semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large-scale fairness-oriented dataset and the SpaFreq (Spatial-Frequency) attribute classifier, this framework proposes the Multi-attribute, Group-wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair-GRPO, a reinforcement-learning-based debiasing method that alters the distribution of generative models through a designed multi-objective reward function. For example, experiments on the SD3.5-Medium model demonstrate that Fair-GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies. Code and dataset are available at https://github.com/1059684669/HoloFair.

2605.24686 2026-05-26 cs.AI (updated version)

Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

Minghao Lv, Lu Chen, Enchang Zhang, Anji Zhou, Xiaoran Xue, Hanyi Zhang, Fenghua Tang, Zhuo Rachel Han, Mengyue Wu

Affiliations: X-LANCE Lab; School of Computer Science, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing; Beijing Key Laboratory of Applied Experimental Psychology; National Demonstration Center for Experimental Psychology Education, Faculty of Psychology, Beijing Normal University

AI summary: Proposes FACET, a framework grounded in the Mayer-Salovey-Caruso four-branch ability model for evaluating the emotional intelligence of large language models; it finds that emotional intelligence is not a single capability but is fragmented across cognitive and interactive dimensions, and that hidden emotion recognition is a universal bottleneck.

English abstract

As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

2605.24684 2026-05-26 cs.LG cs.AI (updated version)

Beyond the Aggregation Dilemma: Prior-Retaining Decoupled Learning for Multimodal Graphs

Hao Yan, Xuanru Wang, Jun Yin, Shirui Pan, Senzhang Wang, Chengqi Zhang

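Affiliations: School of Computer Science and Engineering, Central South University; Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University; School of Information and Communication Technology, Griffith University

AI summary: Addresses the aggregation dilemma in multimodal attributed graph learning, where mandatory aggregation causes a performance inversion, by proposing SUPRA, a decoupled dual-pathway architecture that keeps prior features independent and captures structural synergy with a lightweight shared GNN, with auxiliary deep supervision to mitigate gradient starvation; it reaches state-of-the-art performance at markedly lower computational cost.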

English abstract

Multimodal Attributed Graph Learning (MAGL) integrates intrinsic node attributes with structural topology via graph aggregation. However, as pretrained encoders evolve into Large Foundation Models (LFMs), the landscape of MAGL fundamentally shifts: under high-confidence LFM priors, mandatory aggregation introduces topological noise that overwhelms discriminative signals, triggering a counter-intuitive performance inversion where sophisticated MAGL architectures underperform simple topology-agnostic MLPs. Through systematic empirical and theoretical analysis, we identify that this inversion stems from a fundamental aggregation dilemma characterized by two concurrent pathologies: (1) Representational Pathology (SNR Degradation) - mandatory aggregation dilutes robust intrinsic features with topological noise, causing the noise penalty to outweigh its collaborative benefit; and (2) Optimization Pathology (Gradient Starvation) - topological aggregation attenuates gradient flow, while a shared task loss causes dominant modalities to prematurely suppress weaker ones. To resolve this dilemma, we propose SUPRA (Shared-Unique Prior-Retaining Architecture), a decoupled dual-pathway paradigm. SUPRA processes modality-specific features through topology-agnostic MLPs while capturing structural synergy via a lightweight shared GNN, with auxiliary deep supervision counteracting gradient starvation. Extensive evaluations demonstrate that SUPRA achieves state-of-the-art performance while requiring 3.5x lower peak GPU memory and up to 4.4x faster training time than Multimodal Graph Transformers.

2605.24675 2026-05-26 cs.CV cs.AI (updated version)

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen

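Affiliations: The Hong Kong University of Science and Technology; Tianjin University; Tsinghua University

AI summary: Targets the visual representation gap in Web image translation with VaaWIT, a framework that uses a Dual-Stream Attention Module and a Visual-Aware Adapter to fuse fine-grained visual features into a large language model dynamically; it outperforms open-source models on multiple benchmarks and approaches the performance of closed-source models.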

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3817631

English abstract

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.

2605.24667 2026-05-26 cs.AI cs.LG (updated version)

When Mean CE Fails: Median CE Can Better Track Language Model Quality

Hao Guo, Simon Dennis, Rivaan Patil, Kevin Shabahang

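Affiliations: i14; University of Melbourne; University of California, Santa Cruz

AI summary: Finds that median cross-entropy reflects a language model's task performance during training better than mean cross-entropy and recommends reporting several percentile cross-entropy values in evaluation.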

Comments 20 pages

English abstract

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
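The recommendation to report percentile CE summaries alongside the mean can be sketched with the standard library alone. The function name, the choice of percentiles, and the nearest-rank percentile definition are assumptions for illustration; the paper may use a different percentile estimator.

```python
import statistics


def ce_summary(per_token_ce, percentiles=(50, 90, 99)):
    """Mean plus nearest-rank percentiles of per-token cross-entropy values."""
    values = sorted(per_token_ce)
    if not values:
        raise ValueError("need at least one per-token CE value")
    summary = {"mean": statistics.fmean(values)}
    n = len(values)
    for p in percentiles:
        # Nearest-rank percentile: smallest value with at least p% of data at or below it.
        rank = max(1, -(-p * n // 100))  # integer ceiling of p * n / 100
        summary[f"p{p}"] = values[rank - 1]
    return summary


# Toy values: four easy tokens and one heavy-tail token, not data from the paper.
print(ce_summary([0.1, 0.2, 0.3, 0.4, 10.0]))
```

In the toy input a single heavy-tail token pulls the mean far above the median (`p50`), which is the kind of disagreement between mean and median CE that the abstract suggests watching for.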

2605.24663 2026-05-26 cs.CR cs.AI (updated version)

CyBOKClaw: Human-in-the-Loop CyBOK Mapping for Cybersecurity Curriculum

Yan Lin Aung, Kevin Togbe

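Affiliations: University of Derby, Derby, UK

AI summary: Proposes CyBOKClaw, an interpretable human-in-the-loop retrieval framework that maps cybersecurity keywords or phrases to CyBOK using query normalization, term expansion, concept-level boosts, topic-description enrichment, and domain-sensitive ranking rules; evaluated with the expert-guided top-5 usefulness metric ECA-5, it reaches 91.88% on the development set and 98.00% on the validation set.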

English abstract

This paper presents CyBOKClaw, an interpretable human-in-the-loop retrieval framework for mapping cybersecurity keywords or phrases (KWoPs) to the Cyber Security Body of Knowledge (CyBOK). Rather than treating the task as strict exact classification, the framework is designed as a top-k candidate generator for expert review. It combines query normalization, curated term expansion, concept-level boosts, topic-description enrichment, and domain-sensitive ranking rules. Because educational KWoPs are often broad, ambiguous, and only approximately aligned with CyBOK terminology, strict exact matching provides only a partial account of practical utility. We therefore evaluate the framework using both structural retrieval metrics and an expert-guided top-5 usefulness metric, ECA-5 (Exact or Closest Acceptable Match at top-5), which records whether the returned candidates contain at least one mapping that an expert would judge exact or accept as the nearest practical CyBOK placement. On the development dataset, CyBOKClaw achieves 64.73% EXA-5 (Exact Match at top-5), 84.18% structural semantic alignment, and 91.88% ECA-5; on the validation dataset, it achieves 81.19% EXA-5, 93.32% structural semantic alignment, and 98.00% ECA-5. These results show that expert-guided top-k usefulness provides a more faithful account of practical CyBOK mapping utility than exact structural matching alone, and that CyBOKClaw is effective as a CyBOK-specific expert-support retrieval system.
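The two top-5 metrics named in this abstract differ only in which candidates count as a hit, which a small sketch can make explicit. The function name and the placeholder topic identifiers are hypothetical, and the toy data is not from the paper: EXA-5 is assumed to count only exact mappings, while ECA-5 is assumed to also count mappings an expert accepts as the closest practical placement.

```python
def hit_rate_at_k(ranked_candidates, accepted, k=5):
    """Share of queries whose top-k candidates contain at least one accepted mapping."""
    if not ranked_candidates:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for candidates, ok in zip(ranked_candidates, accepted)
        if any(c in ok for c in candidates[:k])
    )
    return hits / len(ranked_candidates)


# Placeholder topic identifiers for three queries.
ranked = [["topic_a", "topic_b"], ["topic_d", "topic_e"], ["topic_f"]]
exact = [{"topic_a"}, {"topic_x"}, {"topic_y"}]
exact_or_closest = [{"topic_a"}, {"topic_x", "topic_e"}, {"topic_y"}]

exa_5 = hit_rate_at_k(ranked, exact)             # exact matches only
eca_5 = hit_rate_at_k(ranked, exact_or_closest)  # exact or closest acceptable
print(exa_5 < eca_5)  # True
```

Because the accepted set for ECA-5 contains the exact set, ECA-5 can never fall below EXA-5 on the same queries, consistent with the ordering of the reported figures.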

2605.24661 2026-05-26 cs.AI cs.CL (updated version)

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

Ali Şenol, Garima Agrawal, Huan Liu

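Affiliations: Department of Computer Engineering, Tarsus University; School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU); HumaConn AI Consulting

AI summary: Proposes a behavior-based multi-dimensional framework that evaluates LLM reasoning quality along six dimensions (correctness, consistency, robustness, logical coherence, efficiency, and stability), revealing behaviors that accuracy alone cannot show and supporting deployment decisions.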

English abstract

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
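The ranking inversions mentioned in this abstract follow naturally once the six dimension scores are combined under different weightings. The sketch below assumes a simple weighted mean; the abstract does not give the formula behind Q_bal or the weighting profiles, and the model names, scores, and weights here are invented toy values, not results from the paper.

```python
DIMENSIONS = ("CQ", "CS", "RS", "LS", "ES", "SS")


def weighted_score(profile, weights):
    """Weighted mean of the six dimension scores (assumed aggregation)."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(profile[d] * weights[d] for d in DIMENSIONS) / total


def rank_models(profiles, weights):
    """Model names sorted from highest to lowest weighted score."""
    return sorted(profiles, key=lambda m: weighted_score(profiles[m], weights), reverse=True)


# Toy profiles: model_x is more accurate, model_y reasons more coherently.
profiles = {
    "model_x": {"CQ": 0.9, "CS": 0.7, "RS": 0.7, "LS": 0.4, "ES": 0.7, "SS": 0.7},
    "model_y": {"CQ": 0.7, "CS": 0.7, "RS": 0.7, "LS": 0.9, "ES": 0.7, "SS": 0.7},
}
accuracy_first = {"CQ": 5, "CS": 1, "RS": 1, "LS": 1, "ES": 1, "SS": 1}
coherence_first = {"CQ": 1, "CS": 1, "RS": 1, "LS": 5, "ES": 1, "SS": 1}

print(rank_models(profiles, accuracy_first))   # ['model_x', 'model_y']
print(rank_models(profiles, coherence_first))  # ['model_y', 'model_x']
```

The same two profiles swap places when the weight moves from correctness to logical coherence, which is the kind of reversal a single accuracy number cannot expose.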

2605.24657 2026-05-26 cs.AI cs.SE 版本更新

Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

超越纯推理部署：比较基于权重的巩固与级联压缩

Simon Dennis, Kevin Shabahang, Hao Guo, Rivaan Patil

发表机构 * University of Melbourne（墨尔本大学）

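AI summary: To address the fact that inference-only deployment of large language models cannot persist user knowledge, the paper consolidates interaction knowledge into model weights through nightly reflection, synthesis, and LoRA fine-tuning; in its experiments this raises knowledge retention by 43.6 percentage points over cascading compaction.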

Comments 15 pages

AI中文摘要

主流LLM平台以纯推理配置部署模型：模型服务请求但从不更新每个用户的权重。用户必须反复重新教授偏好、修正和项目上下文，基于上下文的变通方法消耗上下文窗口空间，并在级联压缩下退化。我们评估了一种替代方案：通过反思、合成和低秩适应（LoRA）微调，在单个消费级GPU上每晚将交互知识巩固到模型权重中。在十次真实的软件开发对话中（n=10，三种记忆类型共1146个测试问题），三轮级联压缩保留了36.8±3.0%的知识（介于11.8%的无上下文下限和90.1%的全上下文上限之间），而巩固保留了80.4±1.3%，提升了43.6个百分点（配对t(9)=14.8，p<0.001），是压缩保留量的两倍多，其中程序性修正（36.3%->74.6%）和情景项目事实（31.5%->78.2%）的增益最大。作为方法论上的附带说明，每token验证交叉熵的均值与LLM判断的准确性呈负相关（r=-0.51），而其中位数几乎完全跟踪准确性（r=+0.99）：在容忍表面形式变化的评估器下，均值具有误导性，而对重尾鲁棒的统计量才是可靠的信号。持久个性化需要超越纯推理部署，转向将知识巩固到权重的架构。

英文摘要

Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users must repeatedly re-teach preferences, corrections, and project context, and context-based workarounds consume context-window space and degrade under cascading compaction. We evaluate an alternative: nightly consolidation of interaction knowledge into model weights via reflection, synthesis, and Low-Rank Adaptation (LoRA) fine-tuning on a single consumer GPU. Across ten realistic software development conversations (n = 10, 1,146 test questions across three memory types), three cycles of cascading compaction retain 36.8 +/- 3.0% of knowledge (between an 11.8% no-context floor and a 90.1% full-context ceiling), while consolidation retains 80.4 +/- 1.3% -- a 43.6 pp gain (paired t(9) = 14.8, p < 0.001) that more than doubles what compaction preserves, with the largest gains on procedural corrections (36.3% -> 74.6%) and episodic project facts (31.5% -> 78.2%). As a methodological aside, mean per-token validation cross-entropy is negatively correlated with LLM-judged accuracy (r = -0.51) while median per-token validation cross-entropy tracks accuracy almost exactly (r = +0.99): under evaluators that tolerate surface-form variation, the mean is misleading and a heavy-tail-robust statistic is the faithful signal. Persistent personalization requires moving beyond inference-only deployment toward architectures that consolidate knowledge into weights.
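
The methodological aside about mean versus median cross-entropy can be seen in a small sketch. The per-token loss values below are invented for illustration; they only mimic the situation the abstract describes, where an answer carries the right fact but one surface-form token is very unlikely under the reference.

```python
import statistics


def summarize(token_losses):
    """Return (mean, median) of per-token cross-entropy values."""
    return statistics.mean(token_losses), statistics.median(token_losses)


# Hypothetical per-token losses in nats; NOT measurements from the paper.
# A correct answer phrased differently: low loss except one heavy-tail token.
paraphrased_correct = [0.2, 0.3, 0.2, 0.25, 9.0, 0.3, 0.2]
# A vague answer: moderately high loss on every token.
uniformly_vague = [1.1, 1.2, 1.0, 1.3, 1.2, 1.1, 1.0]

mean_p, median_p = summarize(paraphrased_correct)
mean_v, median_v = summarize(uniformly_vague)
print(f"paraphrased: mean={mean_p:.2f} median={median_p:.2f}")
print(f"vague:       mean={mean_v:.2f} median={median_v:.2f}")
```

In this toy case a single outlier token makes the mean rank the paraphrased answer as worse, while the median ranks it as better, which is why a heavy-tail-robust statistic can track judged accuracy more faithfully.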

2605.24652 2026-05-26 cs.AI cs.CV cs.MM cs.SD 版本更新

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

AVBench：面向音视频生成模型的人类对齐与自动化评估基准

Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia, Zhun Mou, Tianyang Zhong, Yiting Zhao, Wenming Yang

发表机构 * Tsinghua University（清华大学）； The Chinese University of Hong Kong（香港中文大学）

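AI summary: Proposes AVBench, which uses fine-grained human-centric metrics and specialized evaluators trained with preference learning to evaluate audio-video generation automatically and accurately.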

AI中文摘要

音视频（AV）生成的快速进步使得能够生成具有同步声音的高保真合成内容，特别是涉及语音和交互的人类相关场景。然而，AV生成的评估仍处于早期阶段，只有少数针对人类相关场景的粗粒度基准，并且依赖于有限的预设评估和通用多模态大语言模型，导致对模型能力的不准确评估。为了解决这些问题，我们引入了AVBench，一个专为人类中心AV生成设计的全自动化基准。AVBench基于两个关键设计以实现全面准确的评估：（i）人类中心和细粒度指标。AVBench整合了十个评估维度，专为以人为中心的现实场景设计，涵盖视觉质量、音频质量以及跨模态的多层次一致性。这些实用指标捕捉了现有基准经常忽略的人类相关细节。（ii）通过偏好学习训练的专业评估器。为了解决缺乏专门训练数据的问题，我们通过将真实视频转化为具有受控扰动的多样化训练对来构建大规模监督。在该高质量数据集上微调后，评估器学会可靠地检测细微的跨模态不一致性。关键的是，AVBench不输出离散的文本判断，而是从模型对二元决策的预测置信度中推导出连续评估分数。这种概率评分机制比传统的VQA风格评估更可靠，并且与人类判断高度一致。综合来看，AVBench为AV生成提供了自动化评估，展示了数据过滤的强大潜力，并可作为来自人类反馈的强化学习（RLHF）的可微分奖励信号。

英文摘要

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage, with only a few coarse-grained benchmarks for human-related scenarios and relying on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgment, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
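
The probabilistic scoring idea can be sketched as follows. The abstract states only that continuous scores come from the model's prediction confidence on binary decisions; reading that confidence as a two-way softmax over a "yes" logit and a "no" logit is an assumption made here, and both function names are hypothetical.

```python
import math


def confidence_score(logit_yes, logit_no):
    """Continuous score in [0, 1]: softmax probability assigned to 'yes'."""
    top = max(logit_yes, logit_no)  # subtract the max for numerical stability
    exp_yes = math.exp(logit_yes - top)
    exp_no = math.exp(logit_no - top)
    return exp_yes / (exp_yes + exp_no)


def discrete_verdict(logit_yes, logit_no):
    """VQA-style hard decision, shown for contrast with the continuous score."""
    return "yes" if logit_yes >= logit_no else "no"


# Two hypothetical samples that receive the same hard verdict.
barely_consistent = (0.1, 0.0)
clearly_consistent = (5.0, 0.0)
for logits in (barely_consistent, clearly_consistent):
    print(discrete_verdict(*logits), round(confidence_score(*logits), 3))
```

Both samples receive the verdict "yes", but the continuous score separates a marginal case from a confident one, which is what makes such a score usable for ranking or filtering.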

2605.24639 2026-05-26 cs.CV cs.AI 版本更新

DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

DisDop: 基于领域先验蒸馏的开放词汇航空目标检测

Ruihao Xu, Yong Liu, Yansong Tang, Sule Bai, Xubing Ye, Bingyao Yu, Yutao Guo, Jiwen Lu, Jie Zhou

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院，清华大学）； Tsinghua University（清华大学）

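AI summary: Proposes the DisDop framework, which systematically distills multi-level domain priors from remote sensing foundation models (RemoteCLIP and DINOv3) into a lightweight detector and reports state-of-the-art results on open-vocabulary aerial object detection.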

AI中文摘要

近年来，随着无人机的广泛应用，航空图像的目标检测引起了越来越多的关注，尤其是不受预定义类别限制的开放词汇航空检测。由于无人机视角图像的稀缺性及其与自然图像的显著差异，直接应用为自然场景设计的普通开放词汇检测方法难以取得令人满意的结果。一些研究提出通过使用轻量级网络或生成伪标签来从预训练模型迁移知识，但它们往往依赖于在自然图像上训练的模型，忽略了专门为遥感和航空图像定制的基础模型的潜力。为了解决这一局限性，我们提出了DisDop，一个统一的框架，系统地将来自遥感基础模型（例如RemoteCLIP和DINOv3）的多级领域先验知识蒸馏到轻量级检测器中。具体来说，我们首先通过教师融合策略蒸馏视觉先验，该策略结合了RemoteCLIP的跨模态对齐能力和DINOv3的细粒度局部特征提取能力，将其互补优势迁移到检测器的骨干网络中。其次，我们通过显式建模类别间语义关系来蒸馏嵌入在RemoteCLIP文本编码器中的文本先验，同时结合全局上下文先验以增强小目标的局部特征表示。通过这种多级先验蒸馏框架，我们的DisDop在开放词汇航空检测基准上取得了新的最先进性能。大量的消融分析也证明了我们提出模块的合理性和有效性。

英文摘要

With the widespread application of drones in recent years, object detection of aerial images has attracted increasing attention, especially open-vocabulary aerial detection which is not restricted to predefined categories. Due to the scarcity of drone's viewpoint images and their significant differences from natural images, it is difficult to achieve satisfying results by directly applying vanilla open-vocabulary detection methods designed for natural scenarios. Some studies propose to transfer knowledge from pre-trained models by using lightweight networks or generating pseudo labels, but they tend to rely on models trained on natural images, neglecting the potential of foundation models specifically tailored for remote sensing and aerial imagery. To address this limitation, we propose DisDop, a unified framework that systematically distills multi-level domain priors from remote sensing foundation models (e.g., RemoteCLIP and DINOv3) into a lightweight detector. Specifically, we first distill visual priors through a teacher fusion strategy that combines RemoteCLIP's cross-modal alignment capability with DINOv3's fine-grained local feature extraction ability, transferring their complementary strengths to the detector's backbone. Second, we distill textual priors embedded in RemoteCLIP's text encoder by explicitly modeling inter-category semantic relationships, while incorporating global contextual priors to enhance local feature representation for small objects. Through this multi-level prior distillation framework, our DisDop achieves new state-of-the-art performance on open-vocabulary aerial detection benchmarks. Extensive ablation analysis also demonstrates the rationality and effectiveness of our proposed modules.

2605.24632 2026-05-26 cs.CR cs.AI cs.LG 版本更新

Demystifying the Mythos or Disrupting Bugonomics? From Zero-Day Asymmetry to Defender Remediation Throughput

揭秘Mythos还是颠覆漏洞经济学？从零日不对称到防御者修复吞吐量

Alfredo Pesoli, Herman Errico, Lorenzo Cavallaro

发表机构 * University College London（伦敦大学学院）； Bynario

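AI summary: Analyzes LLM-driven vulnerability discovery through a bugonomics lens, arguing that its core near-term effect is not more zero-days but a shift toward defender remediation throughput, supported by public data from Anthropic's Mythos Preview and Mozilla Firefox collaborations.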

AI中文摘要

最近，大型语言模型在生产软件中生成候选和确认漏洞的演示，重新引发了AI将重塑攻防安全的叙事。头条新闻强调能力，却很少审视成本和激励。本文通过漏洞经济学视角审视LLM驱动的漏洞发现：即生产、证明、优先级排序和修复安全相关缺陷的操作经济学。历史上，最引人注目的高端漏洞经济学是攻击方定价的，因为生产级零日漏洞和利用链是面向政府、经纪人和攻击方供应商的昂贵专家输出。防御方漏洞经济学早已存在于漏洞研究、奖励计划和供应商修复工作中；LLM辅助系统改变了其规模和分布。它们使得候选生成、代码理解、测试工具构建、影响证明草拟和报告准备在代码库规模上更便宜。利用和概念验证仍然重要，但在防御方工作流中，它们主要用于证明影响、指导优先级排序和证明修复的合理性。由此产生的瓶颈不仅仅是发现更多漏洞，而是吸收、验证、分类、修补和发布更大规模的报告流。利用Anthropic的Mythos预览和Mozilla Firefox合作的公开数据，以及公开的利用市场价格锚点和漏洞奖励计划，我们认为近期的转变不仅仅是更多的零日漏洞。而是向更广泛的防御者修复吞吐量迈进：低信号候选变得更便宜，证据丰富的修复变得更加重要，稀缺的能力转向维护者审查和发布工作。这种影响在开源领域尤为严重，因为LLM辅助发现可以增加报告量，而维护者侧的验证、分类、资金和发布能力可能无法扩展。

英文摘要

Recent demonstrations of large language models producing candidate and confirmed vulnerabilities in production software have renewed the narrative that AI will reshape offensive and defensive security. Headlines emphasize capability; they rarely interrogate costs and incentives. This paper examines LLM-driven vulnerability discovery through a bugonomics lens: the operational economics of producing, proving, prioritizing, and fixing security-relevant defects. Historically, the most visible high-end bugonomics was offense-priced because production-grade zero-days and exploit chains were expensive specialist outputs for governments, brokers, and offensive vendors. Defender-side bugonomics already existed in vulnerability research, reward programs, and vendor remediation work; LLM-assisted systems change its scale and distribution. They make candidate generation, code comprehension, harness construction, proof-of-impact drafting, and report preparation cheaper at codebase scale. Exploits and proofs of concept remain important, but in defender workflows they primarily prove impact, guide prioritization, and justify remediation. The resulting bottleneck is not only finding more bugs; it is absorbing, validating, triaging, patching, and shipping a larger stream of reports. Using public data from Anthropic's Mythos Preview and Mozilla Firefox collaborations, along with public exploit-market price anchors and vulnerability reward programs, we argue that the near-term shift is not simply more zero-days. It is a move toward broader defender remediation throughput: low-signal candidates become cheaper, evidence-rich remediation becomes more important, and scarce capacity shifts toward maintainer review and release work. The effect is acute in open source, where LLM-assisted discovery can increase report volume while maintainer-side validation, triage, funding, and release capacity may not scale.

2605.24631 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion

超越生成先验：JEPA引导扩散的少数采样

Sol Park, Soobin Um

发表机构 * Department of Artificial Intelligence, Kookmin University, Seoul, South Korea（韩国首尔国民大学人工智能系）

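AI summary: Proposes a diffusion sampling framework guided by a JEPA world model, made computationally practical through approximation strategies, that improves the fidelity and semantic validity of minority samples in unconditional, class-conditional, and text-to-image generation.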

Comments ICML 2026, 21 pages, 9 figures

AI中文摘要

少数采样旨在数据流形上生成低密度实例，在医学诊断、异常检测和创意AI等应用中具有核心重要性。然而，现有方法相对于从训练数据中学习的生成先验来定义少数样本，将稀有性限制在可能无法很好反映现实世界语义的模型特定概念中。在这项工作中，我们提出了一种以世界为中心的少数采样视角，该视角相对于现实世界先验而非生成器诱导的密度来定义稀有性。为此，我们引入了JEPA引导，一种由联合嵌入预测架构（JEPA）引导的扩散采样框架——JEPA是一类编码广泛、语义丰富表示的世界模型。JEPA引导将扩散轨迹导向JEPA隐含密度下的低密度区域，从而使生成的少数样本与现实世界的语义稀有性对齐。为了使JEPA引导在计算上实用，我们开发了带有理论误差界限的原则性近似策略，显著降低了引导计算的开销。在无条件、类别条件和文本到图像生成上的大量实验表明，JEPA引导持续提高了少数样本的保真度和语义有效性，在捕捉现实世界的稀有性概念方面优于以生成器为中心的基线。代码可在https://github.com/soobin-um/jepa-guidance获取。

英文摘要

Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model-specific notions that may poorly reflect real-world semantics. In this work, we propose a world-centric perspective on minority sampling, which defines rarity with respect to real-world priors rather than generator-induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint-Embedding Predictive Architecture (JEPA) -- a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low-density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real-world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class-conditional, and text-to-image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator-centric baselines in capturing real-world notions of rarity. Code is available at https://github.com/soobin-um/jepa-guidance.

2605.24621 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Phase-Aware Wavelet-Based-Scattering Encoder-Decoder for Dense Predictions

相位感知的基于小波散射的编解码器用于密集预测

Ghassen Marrakchi, Basarab Matei

发表机构 * Northern Paris Computer Science Lab, Sorbonne Paris Nord University, Villetaneuse, France（法国维勒塔讷斯，索邦巴黎北部大学，北巴黎计算机科学实验室）

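AI summary: Proposes a phase-aware scattering encoder-decoder that recovers spatial structure by explicitly preserving phase information in the skip connections, and validates the value of phase for dense prediction on image denoising and skin lesion segmentation.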

Comments 21 pages, 16 figures, 10 tables

2605.24614 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Measuring the Depth of LLM Unlearning via Activation Patching

通过激活修补测量大语言模型遗忘的深度

Jaeung Lee, Dohyun Kim, Jaemin Jo

发表机构 * Sungkyunkwan University（成均馆大学）

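AI summary: Proposes the Unlearning Depth Score (UDS), which quantifies the mechanistic depth of unlearning via activation patching and achieves the highest faithfulness and robustness in a meta-evaluation on 150 unlearned models.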

Comments 18 pages

AI中文摘要

大语言模型遗忘已成为隐私保护和人工智能安全的关键事后机制，但审计目标知识是否真正被擦除仍然具有挑战性。现有的输出级指标无法检测到这些知识是否仍可从内部表示中恢复。最近的白盒研究揭示了此类残留知识，但通常依赖于辅助训练或数据集特定调整，缺乏可推广的指标。为解决这些限制，我们提出遗忘深度评分（UDS），一种通过激活修补量化遗忘机制深度的指标。UDS首先使用保留模型基线识别编码目标知识的层，然后在0-1尺度上测量遗忘模型中该知识被擦除的程度。在跨越8种方法的150个遗忘模型上的20个指标的元评估中，UDS实现了最高的忠实性和鲁棒性，证实了我们的因果方法是遗忘评估中最可靠的。案例研究进一步揭示，白盒指标可能在层级别上不一致，并且擦除深度因示例而异。我们提供了将UDS集成到现有基准测试框架并简化评估流程的指南。代码和数据可在https://github.com/gnueaj/unlearning-depth-score获取。

英文摘要

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score

2605.24613 2026-05-26 cs.CL cs.AI cs.SE 版本更新

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

Guarded Repair: 面向危害感知的LLM数学推理事后替换

Haizhou Xia

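AI summary: Proposes the GuardedRepair framework, whose selective replacement mechanism avoids breaking correct results while repairing LLM mathematical reasoning errors; on GSM8K, accuracy rises from 95.60% to 96.89% with no measured broken-correct cases in the main run.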

Comments 15 pages, including appendices. Code and artifacts available at https://github.com/Haizhoux0517/guarded-repair

AI中文摘要

LLM数学推理的事后修复引入了一种不对称风险：修复错误的推理轨迹是有用的，但替换原本正确的轨迹可能有害。我们在选择性替换设置下研究该问题，系统必须决定修复后的候选是否比保留原始缓存轨迹更安全。我们提出GuardedRepair，一种有保护的best-of-N修复框架，它诊断缓存推理轨迹，选择性触发修复，并仅在确定性验证守卫支持替换时才接受改变答案的候选。该框架结合了轻量级符号检查、表面语义风险诊断、有界候选生成和保守接受策略。在完整GSM8K测试集上，初始推理器已达到95.60%准确率，GuardedRepair将最终准确率提升至96.89%，修复了58个剩余错误中的17个，且主运行中未测量到破坏正确案例。在弱推理器ASDiv设置中，准确率从78.40%提升至87.60%。直接重新生成基线表明，这一增益不能仅由更强模型重新求解解释：重新求解所有GSM8K示例将准确率降至93.03%，并破坏了47个初始正确答案。额外分析表明，有保护修复显著改善了修复/破坏权衡，同时也揭示了替换风险被降低而非消除。这些结果支持将事后修复视为危害感知的选择性替换而非无约束的重新求解。

英文摘要

Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.
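
The reported GSM8K figures can be checked with simple arithmetic, and the conservative acceptance rule can be sketched alongside it. The test-set size of 1,319 is the size of the public GSM8K test split and is not stated in the abstract. The function `accept_replacement` and the example guard are hypothetical stand-ins; the paper's actual guards are not specified here.

```python
GSM8K_TEST_SIZE = 1319  # public GSM8K test split; an assumption not stated in the abstract


def accuracy(correct, total=GSM8K_TEST_SIZE):
    """Accuracy in percent, rounded to two decimals."""
    return round(100 * correct / total, 2)


initial_correct = GSM8K_TEST_SIZE - 58   # 58 remaining errors before repair
repaired_correct = initial_correct + 17  # 17 errors fixed, none broken in the main run


def accept_replacement(original_answer, candidate_answer, guards):
    """Hypothetical conservative policy: keep the cached answer unless the
    candidate changes it AND every deterministic guard approves."""
    if candidate_answer == original_answer:
        return False  # nothing to replace
    return bool(guards) and all(guard(candidate_answer) for guard in guards)


def is_plain_integer(answer):
    """Example guard (illustrative only): the answer parses as an integer."""
    return answer.lstrip("-").isdigit()


print(accuracy(initial_correct), accuracy(repaired_correct))
print(accept_replacement("41", "42", [is_plain_integer]))
```

Requiring every guard to pass, and rejecting when no guards are configured, biases the policy toward preserving the original trace, which matches the asymmetric risk the abstract describes.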

2605.24608 2026-05-26 cs.AI cs.CV cs.LG 版本更新

Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

基于数学形态学的深度卷积学习的格论与代数模型

Gustavo, Angulo

发表机构 * Mines Paris, PSL University, CMA-Center for Applied Mathematics, Sophia-Antipolis, France（法国索菲亚-安提波利斯，巴黎文理研究大学（PSL）巴黎高等矿业学院应用数学中心）

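AI summary: Builds a rigorous algebraic framework for deep convolutional architectures (CNN, ResNet, UNet) on lattice theory and mathematical morphology, shows that the standard CNN pipeline is a cross-lattice operator, and identifies three layer designs that are genuine idempotent openings.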

AI中文摘要

我们为深度卷积架构（包括CNN、ResNet和如UNet的编码器-解码器网络）建立了一个严格的代数框架，该框架基于格论和数学形态学。核心工具是Matheron-Maragos-Banon-Barrera (MMBB) 平移不变算子通用表示理论，我们将其系统地应用于标准深度网络的每一层。主要发现是：标准CNN流水线（线性卷积 + ReLU + 平坦最大池化）是一个交叉格算子：卷积是傅里叶下确界半格中的腐蚀，ReLU是格并闭运算，最大池化是逐点最大加格中的膨胀，它们的组合在这两个格中都不是形态学开运算。第二个发现是：ReLU在逐点格中的上伴随是一个全局（非局部）算子，在全局非负函数上为恒等映射，否则为负无穷，因此没有局部形态学腐蚀能与ReLU构成伴随对。这两个结果共同提供了深度在标准CNN中引入真正表示能力的精确代数原因：组合层不是幂等的。我们识别并完全刻画了三种真正的幂等开运算层设计：纯最大加形态学层（逐点格）、谱维纳层（傅里叶格）和自对偶形态学层。我们建立了完整的不动点和收敛理论。该框架还将最大池化、步长卷积和拉普拉斯金字塔统一在Goutsias-Heijmans伴随金字塔理论下，并给出了激活-池化膨胀（APD）分解及其正确的伴随算子。

英文摘要

We develop a rigorous algebraic framework for deep convolutional architectures, CNNs, ResNets, and encoder--decoder networks such as UNet, grounded in lattice theory and mathematical morphology. The central tool is the Matheron--Maragos--Banon--Barrera (MMBB) universal representation theory for translation-invariant operators, which we apply systematically to every layer of a standard deep network. The principal finding is that the standard CNN pipeline (linear convolution + ReLU + flat max-pooling) is a cross-lattice operator: the convolution is an erosion in the Fourier inf-semilattice while ReLU is a lattice-join closing and max-pooling is a dilation in the pointwise max-plus lattice, and their composition is a morphological opening in neither. A second finding is that the upper adjoint of ReLU in the pointwise lattice is a global (non-local) operator, the identity on globally non-negative functions and $-\infty$ otherwise, so no local morphological erosion can form an adjunction pair with ReLU. These two results together provide the precise algebraic reason why depth in standard CNNs introduces genuine representational power: the composed layer is not idempotent. Three layer designs that are genuine idempotent openings are identified and fully characterised: the pure max-plus morphological layer (pointwise lattice), the spectral Wiener layer (Fourier lattice), and the self-dual morphological layer. We establish a complete fixed-point and convergence theory. The framework also unifies max-pooling, strided convolution, and the Laplacian pyramid under the Goutsias--Heijmans adjoint pyramid theory, and gives the Activation--Pooling Dilation (APD) factorisation with its correct adjoint.
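
Idempotence of a morphological opening, the property the abstract says the composed CNN layer lacks, can be checked numerically on a one-dimensional signal. This is a minimal sketch of a flat erosion followed by the adjoint flat dilation with a symmetric window; the window radius and the test signal are arbitrary choices, and the code is not claimed to reproduce any of the paper's three layer designs.

```python
import numpy as np


def erode(f, radius=1):
    """Flat erosion: sliding minimum; +inf padding never wins a minimum."""
    padded = np.pad(f, radius, constant_values=np.inf)
    return np.min([padded[i:i + len(f)] for i in range(2 * radius + 1)], axis=0)


def dilate(f, radius=1):
    """Flat dilation: sliding maximum; -inf padding never wins a maximum."""
    padded = np.pad(f, radius, constant_values=-np.inf)
    return np.max([padded[i:i + len(f)] for i in range(2 * radius + 1)], axis=0)


def opening(f, radius=1):
    """Erosion followed by its adjoint dilation."""
    return dilate(erode(f, radius), radius)


def relu(f):
    return np.maximum(f, 0.0)


signal = np.array([0.0, 5.0, 0.0, 3.0, 3.0, 3.0, 0.0, 1.0])
opened = opening(signal)
print(opened)                                    # narrow peaks are removed
print(np.array_equal(opening(opened), opened))   # applying it again changes nothing
```

The opening removes the peaks narrower than the window and keeps the plateau, and a second application leaves the result unchanged. ReLU on its own is also idempotent, so the non-idempotence the abstract refers to concerns the composition of the layers, not the individual pieces.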

2605.24600 2026-05-26 cs.AI 版本更新

Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

Agent-as-Peer-Debriefer: 一种基于视角精炼的多智能体定性分析框架

Zhimin Lin, Kun Cheng, Fan Bai, Jie Gao

发表机构 * Soochow University（苏州大学）； Johns Hopkins University（约翰霍普金斯大学）

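AI summary: Proposes a multi-agent framework that simulates peer debriefing with three analytical perspectives (Theory-Driven, Data-Driven, and Applied) to improve the coding quality of large language models in qualitative data analysis.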

AI中文摘要

大型语言模型（LLMs）越来越多地被用于定性数据分析（QDA），但其输出往往缺乏人类分析的深度和细微差别。我们认为这一差距反映了人类QDA中缺失的一种可信度实践：同行汇报（peer debriefing），即分析师向无偏见的同行寻求反馈并据此完善其编码。为了将这一实践引入LLM辅助的QDA，我们提出了Agent-as-Peer-Debriefer，一个将同行汇报构建到关键编码步骤中的多智能体QDA框架。在我们的框架中，层次编码代理遵循标准QDA流程生成代码、子主题和主题，以及自我解释和反思备忘录。然后，它将输出共享给三个同行汇报代理，每个代理应用不同的分析视角（理论驱动、数据驱动或应用），并通过保留、重命名、重新分配、合并或拆分代码来完善代码。这些视角来源于跨领域和数据集通用的已建立的人类QDA实践。为了评估该框架，我们在三个LLM上对两个领域的三个数据集进行了测试，测量与人工标注代码的语义相似度。在所有设置中，基于视角的同行汇报精炼比单一LLM基线更接近人类代码，消融实验进一步表明，这种提升不仅仅来自额外的精炼。三种视角也产生了不同的权衡，表明视角的选择是一个有意义且可控的设计决策。更广泛地说，这些发现表明，用明确的视角模拟同行汇报是实现更可信的LLM辅助QDA的一条有前景的途径。

英文摘要

Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, along with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each applying a distinct analytical perspective (Theory-Driven, Data-Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human-annotated codes. Across all settings, perspective-based, peer-debriefing refinement aligns more closely with human codes than a single-LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade-offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.

2605.24598 2026-05-26 cs.AI cs.MA 版本更新

Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

Hera: 面向设备-云协作LLM智能体的长时程协调学习

Yuxin Zhang, Mengxue Hu, Zheng Lin, Xiaoyi Fan, Fan Xie, Zihan Fang, Jing Yang, Wenjun Zhu, Zhiwen Chen, Chengfei Lv, Zhe Chen

发表机构 * Fudan University（复旦大学）； Alibaba Group（阿里巴巴集团）； The University of Hong Kong（香港大学）； Shenzhen MSU-BIT University（深圳北理莫斯科大学）； New York University（纽约大学）； Universiti Malaya（马来亚大学）； SpaceAIC Co., Ltd.（SpaceAIC公司）

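AI summary: Proposes Hera, a step-level device-cloud LLM agent coordinator trained in two stages (imitation learning, then reinforcement learning) to improve the performance-cost Pareto frontier on long-horizon tasks.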

AI中文摘要

大型语言模型（LLM）智能体通过自主与环境交互，擅长解决复杂的长期任务。然而，它们的实际部署面临根本性的设备-云困境：设备端模型高效但脆弱，而云端模型强大但计算成本高。最先进的LLM设备-云路由器通常做出粗粒度的任务级决策，无法适应多步智能体交互中变化的难度。为解决此问题，我们提出Hera，一种用于长期任务的步骤级设备-云LLM智能体协调器，实现了强大的性能-成本帕累托前沿。Hera采用新颖的两阶段训练范式：（1）冷启动的模仿学习，随后（2）联合优化任务成功率和云端使用效率的强化学习。第一阶段将步骤级路由视为监督分类问题：设备智能体在云端轨迹上重放，每个状态根据设备与云端动作的一致性进行标记。第二阶段，我们通过跨轨迹分组相同状态并使用偏好更高期望回报和更少未来云端调用的标签更新Hera，进行成本感知的强化学习。我们在ALFWorld、WebShop和AppWorld上评估Hera，它始终优于先前方法，在仅46.3%的步骤中使用云端的情况下，达到了云端单独成功率的92.5%。

英文摘要

Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device--cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but costly in computation. State-of-the-art LLM device--cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device--cloud LLM agent coordinator for long-horizon tasks achieving a strong performance--cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold-start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate with cloud use in only 46.3% of steps.
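
The first training stage, as described, replays the device agent on cloud trajectories and labels each state by whether the two agents agree. A minimal sketch of that labeling step follows; the trajectory format, the toy device policy, and the label encoding (0 for "device suffices", 1 for "route to cloud") are assumptions made for illustration.

```python
def label_steps(cloud_trajectory, device_policy):
    """Label each state of a cloud trajectory for step-level routing.

    cloud_trajectory: list of (state, cloud_action) pairs.
    device_policy: callable mapping a state to the device agent's action.
    Returns (state, label) pairs with label 0 if the actions agree, else 1.
    """
    labeled = []
    for state, cloud_action in cloud_trajectory:
        device_action = device_policy(state)
        labeled.append((state, 0 if device_action == cloud_action else 1))
    return labeled


def cloud_fraction(labeled):
    """Share of steps that would be routed to the cloud under these labels."""
    return sum(label for _, label in labeled) / len(labeled)


# Hypothetical three-step episode and a toy device policy.
trajectory = [("at_door", "open"), ("in_room", "search_drawer"), ("has_key", "unlock")]
toy_policy = {"at_door": "open", "in_room": "look", "has_key": "unlock"}.get

labels = label_steps(trajectory, toy_policy)
print(labels)
print(cloud_fraction(labels))
```

The resulting pairs form an ordinary supervised classification dataset for the router; the second, cost-aware reinforcement learning stage described in the abstract is not covered by this sketch.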

2605.24597 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Learning to Reason Efficiently with A* Post-Training

学习通过A*后训练进行高效推理

Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Schölkopf, Ryan Cotterell

发表机构 * ETH Zürich（苏黎世联邦理工学院）； MPI for Intelligent Systems, Tübingen（马克斯·普朗克智能系统研究所，图宾根）； Purdue University（普渡大学）

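AI summary: Uses A* search to guide LLMs toward correct and efficient inference steps through two training methods, supervised fine-tuning and reinforcement learning, with substantial accuracy gains for 1B-3B parameter models.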

Comments Preprint

AI中文摘要

大型语言模型（LLM）的许多应用需要演绎推理，但模型经常产生不正确或冗余的推理步骤。我们将自然语言推理框架化为一个搜索问题，其中最终答案就是有效的证明本身，这要求推理过程中的每个中间推理都正确。具体来说，我们研究LLM是否能够在A*搜索（一种保证以最优效率找到通向目标路径的算法）的指导下，学习生成正确且高效的证明。我们探索了两种训练技术：在A*执行轨迹上的监督微调，以及使用A*信息的过程奖励模型进行强化学习。实验发现，1B-3B范围内的Llama-3.2模型从A*后训练中获益显著，从接近零准确率提升到超越规模大得多的模型DeepSeek-V3.2。我们的分析揭示了一个权衡：简单的正确性奖励最大化准确率，而A*信息的信号在准确率和效率之间取得平衡。此外，我们发现，在更大的搜索空间中，使用不完美启发式训练的模型表现出更高的准确率。我们的结果展示了一个有前景的方向：由经典搜索算法原理指导的推理。

英文摘要

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search -- an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B--3B range benefit substantially from A* post training, going from near-zero accuracy to outperforming DeepSeek-V3.2 -- a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
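
For readers unfamiliar with the search procedure that supplies the training signal, here is a compact A* sketch on a small weighted graph. The graph, the node names, and the heuristic values are invented for illustration; how the paper maps proof states and inference steps onto nodes and edges is not shown here. A* returns a least-cost path when the heuristic never overestimates the remaining cost.

```python
import heapq


def a_star(graph, heuristic, start, goal):
    """Return (cost, path) for a least-cost path, or None if goal is unreachable.

    graph: dict mapping node -> list of (neighbor, edge_cost).
    heuristic: dict mapping node -> estimated remaining cost (must not overestimate).
    """
    frontier = [(heuristic[start], 0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was found later
        for neighbor, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + heuristic[neighbor], new_g, neighbor, path + [neighbor]),
                )
    return None


# Hypothetical graph: nodes stand in for proof states, edges for inference steps.
graph = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 1)]}
heuristic = {"S": 3, "A": 2, "B": 1, "G": 0}
print(a_star(graph, heuristic, "S", "G"))
```

The sequence of expanded nodes is the kind of execution trace the abstract mentions as supervised fine-tuning data.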

2605.24588 2026-05-26 cs.AI cs.LG 版本更新

HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection

HeartBeatAI：用于多标签心电图心律失常检测的可解释且鲁棒的深度学习框架

Shubham Gupta, Nikhil Panwar, Partha Pratim Roy

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Roorkee（印度理工学院鲁尔基分校计算机科学与工程系）

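AI summary: Proposes the HeartBeatAI framework, which combines domain generalization, multi-scale feature aggregation, and clinical explainability, using a Squeeze-and-Excitation ResNet and a Multi-Layer Concentration Pipeline for robust 12-lead ECG classification; it reaches a 98% macro F1-score under intra-source evaluation, while Leave-One-Domain-Out evaluation shows significant degradation on rare anomalies.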

AI中文摘要

虽然深度学习增强了自动化心电图分析，但临床部署受到类别不平衡和泛化差距的阻碍。本文提出了HeartBeatAI，一个结合域泛化、多尺度特征聚合和临床可解释性的深度学习框架，用于鲁棒的12导联心电图分类。超越基于图像的范式，HeartBeatAI集成了一个Squeeze-and-Excitation ResNet来隔离诊断导联，以及一个多层浓度管道来捕捉宏观节律和微观形态异常。为了缓解域偏移，该框架采用了MixStyle正则化和标签平滑。通过使用源内和留一域外协议在四个大规模数据集上进行严格的基准测试，在源内条件下实现了高性能（98%宏F1分数）。然而，留一域外评估揭示了检测罕见异常时的显著退化，突显了跨机构部署中持续存在的挑战。

英文摘要

While Deep Learning (DL) enhances automated electrocardiogram (ECG) analysis, clinical deployment is hindered by class imbalance and the generalization gap. This paper presents HeartBeatAI, a deep learning framework combining domain generalization, multi-scale feature aggregation, and clinical explainability for robust 12-lead ECG classification. Moving beyond image-based paradigms, HeartBeatAI integrates a Squeeze-and-Excitation (SE) ResNet to isolate diagnostic leads alongside a Multi-Layer Concentration Pipeline to capture macro-rhythm and micro-morphological anomalies. To mitigate domain shift, the framework employs MixStyle regularization and Label Smoothing. Rigorous benchmarking across four large-scale datasets using intra-source and Leave-One-Domain-Out (LODO) protocols demonstrates high performance (98% Macro F1-score) under intra-source conditions. However, LODO evaluations reveal significant degradation in detecting rare anomalies, highlighting a persistent challenge in cross-institutional deployment.

2605.24584 2026-05-26 cs.LG cs.AI 版本更新

LAPLEX: The FFT of Learnable Laplace Kernels

LAPLEX: 可学习拉普拉斯核的FFT

Łukasz Struski, Hanna Blazhko, Piotr Kubaty, Jacek Tabor

发表机构 * Faculty of Mathematics and Computer Science, Jagiellonian University（雅盖隆大学数学与计算机科学学院）； Doctoral School of Exact and Natural Sciences, Jagiellonian University（雅盖隆大学精确与自然科学博士学院）； Centre for Credible Artificial Intelligence, Warsaw University of Technology（华沙理工大学可信人工智能中心）

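AI summary: Proposes the LAPLEX operator, which implicitly defines a typically full-rank dense matrix through learnable coordinate anchors, supports trainable matrix-vector operations with FFT-like scaling, and separates expressivity from storage cost.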

AI中文摘要

深度学习中的快速线性代数通常面临一个选择：固定几何和精确计算（如傅里叶变换），或者通过稠密参数、随机特征或低秩近似实现自适应几何。为了超越这种权衡，我们引入了LAPLEX，一类精确的、可训练的（相位）拉普拉斯核算子。LAPLEX层通常是一个满秩稠密矩阵，由可学习的坐标锚点隐式定义，具有类似FFT的缩放特性。因此，它支持在现代GPU上对高达$10^9$维的向量进行可训练的矩阵-向量运算。作为神经网络层，它产生紧凑的投影和分类头，可解释为软性的、可训练的路由模型。同样的原语也可作为高效的Gram算子，实现对展平图像（维度$3 \cdot 10^6$）的高维协方差建模，在保留可见空间结构的同时不施加卷积偏差。这些应用反映了一个单一原则：无需存储稠密矩阵即可学习稠密几何，从而在普通稠密层无法企及的领域中实现数据自适应的全局交互。在这个意义上，LAPLEX将表达性与存储成本分离：它表现得像一个稠密可训练矩阵，但通过一个小的结构化参数集表示和应用。

英文摘要

Fast linear algebra in deep learning usually comes with a choice: fixed geometry and exact computation, as in the Fourier transform, or adaptive geometry paid for by dense parameters, random features, or low-rank surrogates. To move beyond this trade-off, we introduce LAPLEX, a class of exact, trainable (phased) Laplace-kernel operators. A LAPLEX layer is a typically full-rank dense matrix, implicitly defined by learnable coordinate anchors, with FFT-like scaling. Consequently, it supports trainable matrix--vector operations at vector dimensions up to $10^9$ on modern GPUs. As a neural layer, it yields compact projections and classification heads interpretable as soft, trainable routing models. The same primitive also serves as an efficient Gram operator, enabling high-dimensional covariance models on flattened images of dimension $3 \cdot 10^6$ that preserve visible spatial structure without imposing convolutional bias. These applications reflect a single principle: dense geometry can be learned without storing a dense matrix, which enables data-adaptive global interactions in regimes where ordinary dense layers are out of reach. In this sense, LAPLEX separates expressivity from storage cost: it behaves like a dense trainable matrix, but is represented and applied through a small structured set of parameters.
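
The principle that a dense kernel matrix can be applied without being stored has a classical one-dimensional instance, sketched below. This is a standard two-sweep recurrence for the Laplace kernel exp(-|x_i - x_j|) on sorted anchors; it is offered only as an illustration of the idea and is not claimed to be the LAPLEX algorithm, whose construction the abstract does not specify.

```python
import numpy as np


def laplace_matvec(x, v):
    """Compute y_i = sum_j exp(-|x_i - x_j|) * v_j in O(n) time and memory.

    x: 1-D anchors sorted in ascending order. v: vector of the same length.
    The n-by-n kernel matrix is never materialized.
    """
    n = len(x)
    decay = np.exp(-np.diff(x))  # decay[i] = exp(-(x[i+1] - x[i]))
    left = np.empty(n)           # contributions from j <= i
    right = np.empty(n)          # contributions from j > i
    left[0] = v[0]
    for i in range(1, n):
        left[i] = decay[i - 1] * left[i - 1] + v[i]
    right[n - 1] = 0.0
    for i in range(n - 2, -1, -1):
        right[i] = decay[i] * (right[i + 1] + v[i + 1])
    return left + right


rng = np.random.default_rng(0)
anchors = np.sort(rng.uniform(0.0, 5.0, size=50))
vector = rng.normal(size=50)
dense = np.exp(-np.abs(anchors[:, None] - anchors[None, :])) @ vector
print(np.allclose(laplace_matvec(anchors, vector), dense))
```

The anchors play the role of the small structured parameter set: gradients with respect to them would change the whole dense operator, yet storage stays linear in the number of anchors.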

2605.24577 2026-05-26 cs.LG cs.AI cs.CL (updated version)

Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

多态性即旋转：从两层Transformer到Pythia-70m的操作性机制可解释性

Jordan F. McCann

Affiliations: Independent Researcher

AI summary: Finds that independently trained transformers have residual-stream bases related by a uniform random rotation, and uses an orthogonal Procrustes fit to transfer feature dictionaries and steering vectors between models without retraining.

Comments: 26 pages, 4 figures, 40 references. Pre-registered four-bar framework; all numerical claims reproducible

AI Chinese abstract

独立训练的Transformer在残差流基上计算相同的函数，这些基通过$\mathrm{SO}(d_{\mathrm{model}})$上的均匀随机旋转相互关联。我们将这种现象称为多态性：相同的函数，但内部坐标互不可解。每对模型之间的一次矩阵乘法即可消除这种多态性：在单批激活上进行正交Procrustes拟合，即可在独立训练的模型之间迁移稀疏自编码器特征字典和转向向量，无需重新训练。该现象对标准SAE通用性度量不可见。解码器列余弦相似度在不同种子间匹配度达98%，即SAE通用性的头条数字，而一个种子训练的SAE重构另一个种子的激活时，解释方差为负，比预测常数均值更差。解码器列对齐，但编码器从旋转后的框架读取。单个Procrustes旋转$R$可在每个内部位置将重构恢复至种子内上限的0.025 EV以内。 $R$服从Haar分布：$\|R - I\|_F$与随机正交预测$\sqrt{2 d_{\mathrm{model}}}$在$d_{\mathrm{model}} = 512$时匹配至0.1%，且$R$的特征值谱与Haar $\mathrm{SO}(d_{\mathrm{model}})$的Kolmogorov-Smirnov检验在合并和逐对情况下均返回$p \approx 1.000$。均值差转向向量通过与$R$的不变子空间对齐在三种机制下迁移：当被共享输出权重固定时清晰，与旋转子空间重叠时部分，否则反转。在无共享输入/输出（Pythia）时，所有三种情况均坍缩为普遍反转。同一旋转解释适用于单次运行中的不同训练检查点。在104k参数的Dyck-3 Transformer和九个独立训练的Pythia-70m种子（基于The Pile数据集）上，通过预注册的四柱操作框架进行验证。前沿规模（10B+）的复现仍有待研究。

English abstract

Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on $\mathrm{SO}(d_{\mathrm{model}})$. We call this phenomenon polymorphism: same function, mutually unintelligible interior coordinates. One matrix multiplication per model pair removes it: an orthogonal Procrustes fit on a single batch of activations transfers sparse-autoencoder feature dictionaries and steering vectors between independently trained models, with no retraining. The phenomenon is invisible to the standard SAE universality metric. Decoder-column cosine similarity matches across seeds at 98%, the SAE-universality headline number, while an SAE trained on one seed reconstructs another seed's activations at negative explained variance, worse than predicting the constant mean. The decoder columns align; the encoder reads from a rotated frame. A single Procrustes rotation $R$ restores reconstruction to within 0.025 EV of the within-seed ceiling at every internal site. $R$ is Haar-distributed: $\|R - I\|_F$ matches the random-orthogonal prediction $\sqrt{2 d_{\mathrm{model}}}$ to 0.1% at $d_{\mathrm{model}} = 512$, and a Kolmogorov-Smirnov test of $R$'s eigenvalue spectrum against Haar $\mathrm{SO}(d_{\mathrm{model}})$ returns $p \approx 1.000$ pooled and per-pair. Diff-of-means steering vectors transfer in three regimes by alignment with $R$'s invariant subspace: clean when pinned by shared output weights, partial when overlapping the rotated subspace, inverted otherwise. With no shared I/O (Pythia), all three collapse to universally inverted. The same rotation account holds across training checkpoints within a single run. Validated on a 104k-parameter Dyck-3 transformer and nine independently-trained Pythia-70m seeds on The Pile, via a pre-registered four-bar operational framework. Frontier-scale (10B+) replication remains open.
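The $\sqrt{2 d_{\mathrm{model}}}$ prediction quoted in the abstract follows from a short calculation. The derivation below is a sketch that assumes $R$ is orthogonal and Haar-distributed with $d_{\mathrm{model}} \ge 2$, so that $\mathbb{E}[\operatorname{tr}(R)] = 0$; the activation matrices $A$ and $B$ (rows are matched activations from the two models) are notation introduced here, not taken from the paper.

```latex
\begin{aligned}
R^\star &= \arg\min_{R^\top R = I} \|A R - B\|_F = U V^\top,
  \qquad \text{where } A^\top B = U \Sigma V^\top \text{ (SVD)}, \\
\|R - I\|_F^2 &= \operatorname{tr}\!\big((R - I)^\top (R - I)\big)
  = \operatorname{tr}(R^\top R) - 2\operatorname{tr}(R) + \operatorname{tr}(I)
  = 2 d_{\mathrm{model}} - 2\operatorname{tr}(R), \\
\mathbb{E}\,\|R - I\|_F^2 &= 2 d_{\mathrm{model}}
  \quad\Longrightarrow\quad
  \|R - I\|_F \approx \sqrt{2 d_{\mathrm{model}}} = \sqrt{1024} = 32
  \quad \text{at } d_{\mathrm{model}} = 512 .
\end{aligned}
```

The closed form $U V^\top$ is the standard orthogonal Procrustes solution over all orthogonal matrices; restricting to $\mathrm{SO}(d_{\mathrm{model}})$ would additionally require a determinant correction.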

2605.24576 2026-05-26 cs.AI (updated version)

Associations between echocardiographic traits and AI-ECG predictions of heart failure

超声心动图特征与AI-ECG心力衰竭预测之间的关联

Elias Stenhede, Eivind Bjørkan Orstad, Torbjørn Omland, Henrik Schirmer, Arian Ranjbar

Affiliations: Medical Technology & E-Health, Akershus University Hospital, 1478 Lørenskog, Norway; Faculty of Medicine, University of Oslo, 0372 Oslo, Norway; Department of Cardiology, Akershus University Hospital, 1478 Lørenskog, Norway; Institute of Clinical Medicine, Campus Ahus, University of Oslo, 0317 Oslo, Norway

AI summary: A retrospective analysis of data from 8147 patients finds that AI-ECG-predicted heart failure risk is associated mainly with measures of systolic function such as global longitudinal strain, and that it also captures diastolic abnormalities in patients with preserved ejection fraction.

AI Chinese abstract

人工智能心电图（AI-ECG）可以检测心力衰竭（HF），包括左心室射血分数（LVEF）未捕获的疾病，但模型预测背后的心脏表型仍不清楚。因此，我们研究了AI-ECG预测的HF风险是否与已确立的心肌功能障碍、重构和充盈压的超声心动图测量指标一致。我们回顾性分析了2023年1月1日至2025年6月1日期间在阿克什胡斯大学医院三天内同时接受心电图和超声心动图检查的8147名患者的数据。对所有心电图应用了先前验证的用于HF检测的AI-ECG模型。斯皮尔曼秩相关系数ρ量化了超声心动图参数与AI-ECG风险之间的关联。按性别和左心室射血分数（LVEF）进行了亚组分析。外部验证包括来自哥伦比亚大学欧文医学中心的36,286对心电图-超声心动图数据。整体纵向应变（GLS）显示出最强的相关性（ρ=0.57），其次是二尖瓣环平面收缩期位移（MAPSE）（ρ=-0.49）和LVEF（ρ=-0.45）。在LVEF>50%的患者中，GLS、MAPSE和舒张相关参数的相关性仍然显著。女性的左心室容积指数相关性较弱，而舒张指数在女性中的相关性比男性更强。生理学验证表明，AI-ECG的HF风险预测主要与收缩功能指标（特别是整体纵向应变）一致，同时也能捕捉LVEF保留患者的舒张相关异常。这种方法可能提高临床可解释性，并识别模型改进的机会。

English abstract

Artificial intelligence-enabled electrocardiography (AI-ECG) can detect heart failure (HF), including disease not captured by left ventricular ejection fraction (LVEF), but the cardiac phenotypes underlying model predictions remain unclear. We therefore investigated whether AI-ECG-predicted HF risk aligns with established echocardiographic measures of myocardial dysfunction, remodelling, and filling pressures. We retrospectively analysed ECG and echocardiography data from 8147 patients who underwent both examinations within three days at Akershus University Hospital between 1 January 2023 and 1 June 2025. A previously validated AI-ECG model for HF detection was applied to all ECGs. Spearman's rank correlation $ρ$ quantified associations between echocardiographic parameters and AI-ECG risk. Subgroup analyses were performed by sex and left ventricular ejection fraction (LVEF). External validation included 36,286 ECG-echocardiography pairs from Columbia University Irving Medical Center. Global longitudinal strain (GLS) showed the strongest correlation ($ρ$=0.57), followed by mitral annular plane systolic excursion (MAPSE) ($ρ$=-0.49) and LVEF ($ρ$=-0.45). In patients with LVEF>50%, correlations remained substantial for GLS, MAPSE, and diastolic-related parameters. Volumetric left ventricular indices correlated less strongly in women, whereas diastolic indices showed stronger correlations in women than in men. Physiological validation showed that AI-ECG HF risk predictions align primarily with measures of systolic function, particularly global longitudinal strain, while also capturing diastolic-related abnormalities in patients with preserved LVEF. This approach may improve clinical interpretability and identify opportunities for model refinement.
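Spearman's rank correlation, the association measure used in the abstract, is the Pearson correlation computed on ranks. The following is a minimal standard-library sketch; the function names are illustrative and the example values in the usage note are made up, not study data.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, with hypothetical LVEF values `[60, 55, 35, 45, 25]` and risk scores `[0.05, 0.10, 0.70, 0.30, 0.90]`, the risk rises every time LVEF falls, so `spearman_rho` returns -1 regardless of how nonlinear the relationship is.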

2605.24570 2026-05-26 cs.LG cs.AI cs.CV (updated version)

PILOT: Policy-Informed Learned Optimization for Adaptive Deep Network Training

PILOT: 策略引导的学习优化器用于自适应深度网络训练

Sattam Altuuaim, Lama Ayash, Muhammad Mubashar, Naeemullah Khan

Affiliations: King Abdullah University of Science and Technology; University of Strathclyde

AI summary: Proposes PILOT, an online optimizer that uses a gradient-direction agreement signal to dynamically adjust the mix of momentum, normalization, and sign-based updates, reaching higher accuracy on FashionMNIST and CIFAR-10.

Comments: 16 pages, 5 figures

AI Chinese abstract

尽管优化在深度学习中扮演核心角色，但大多数优化器依赖于训练开始前固定函数形式的更新结构。这种静态设计限制了它们响应损失景观中变化梯度行为的能力，其中训练可能在稳定、噪声和不一致状态之间切换。本研究提出PILOT（策略引导的学习优化器），一种在线优化器，在训练过程中自适应其更新行为。PILOT不使用动量、归一化和符号更新之间的固定平衡，而是将梯度方向一致性作为局部训练稳定性的信号。基于该一致性信号调整更新规则，使优化器能够在梯度变得稳定、噪声或不一致时调整其行为。在FashionMNIST和CIFAR-10上的实验表明，PILOT在卷积设置中始终达到评估优化器中的最高准确率。在CNN架构上，PILOT在FashionMNIST上达到94.13%，在CIFAR-10上达到81.94%。在ResNet-18上，它进一步提升了性能，在FashionMNIST上达到95.71%，在CIFAR-10上达到93.42%。这些结果表明，在训练过程中学习如何调整更新结构可以在保持简单一阶优化框架的同时，提高紧凑和更深卷积模型的性能。PILOT的实现公开于https://github.com/SattamAltwaim/PILOT.git。

English abstract

Despite the central role of optimization in deep learning, most optimizers rely on update structures whose functional form is fixed before training begins. This static design can limit their ability to respond to changing gradient behavior across the loss landscape, where training may shift between stable, noisy, and inconsistent regimes. This study proposes PILOT (Policy-Informed Learned OpTimizer), an online optimizer that adapts its update behavior during training. Rather than using a fixed balance between momentum, normalization, and sign-based updates, PILOT uses gradient-direction agreement as a signal of local training stability. Conditioning the update rule on this agreement signal allows the optimizer to adjust its behavior when gradients become stable, noisy, or inconsistent. Experiments on FashionMNIST and CIFAR-10 show that PILOT consistently achieves the highest accuracy among the evaluated optimizers across convolutional settings. On the CNN architecture, PILOT reaches 94.13% on FashionMNIST and 81.94% on CIFAR-10. On ResNet-18, it further improves performance, reaching 95.71% on FashionMNIST and 93.42% on CIFAR-10. These results suggest that learning how to adapt the update structure during training can improve performance across both compact and deeper convolutional models while preserving a simple first-order optimization framework. The implementation of PILOT is publicly available at https://github.com/SattamAltwaim/PILOT.git

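2605.24564 2026-05-26 cs.AI cs.CE cs.LG (updated version)

Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models

召唤神谕以屠之：利用大语言模型缓解金融回测中的前瞻偏差

Weixian Waylon Li, Mengyu Wang, Tiejun Ma

Affiliations: University of Edinburgh

AI summary: Proposes FinCAD, which combines adversarial bias discovery with an entity- and date-adaptive rule to suppress an LLM's memory of historical outcomes without retraining, mitigating parametric look-ahead bias in financial backtesting.

AI Chinese abstract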

在历史金融数据上回测大语言模型（LLMs）是不可靠的，因为预训练在事件发生后截断。一个在2024年训练的LLM已经“知道”2018-2020年股票的走势。我们将这种失败命名为参数化前瞻偏差，并提出FinCAD，一种上下文感知解码的推理时适配方法，无需重新训练即可抑制LLM对历史结果的记忆。FinCAD结合了一个对抗性偏差发现流程，该流程学习一个模型特定的记忆激活先验提示，以及一个实体和日期自适应规则，该规则将CAD强度按（实体，日期）记忆程度缩放，使得惩罚在记忆的样本内日期触发，并在样本外衰减至零。在五个7-14B LLM和五只大盘股上，FinCAD在记忆日期上将样本内回测收益削减高达-67.1%，同时将2025年样本外收益保持在$8K以内，夏普比率在基线的0.10以内，并保持通用推理能力在1.7分以内。在十一个模型的排行榜上，它将样本内/样本外Spearman相关性从+0.779提升至+0.846，恢复了真正预测样本外表现的排名。

English abstract

Backtesting large language models (LLMs) on historical financial data is unreliable because pre-training cuts off after the events happened. An LLM trained in 2024 already "knows" which way 2018-2020 stocks moved. We name this failure parametric look-ahead bias and propose FinCAD, an inference-time adaptation of Context-Aware Decoding that suppresses an LLM's memory of historical outcomes without retraining. FinCAD pairs an adversarial bias-discovery pipeline that learns a model-specific memory-activating prior prompt with an entity- and date-adaptive rule that scales the CAD strength to per-(entity, date) memorisation, so the penalty fires on memorised in-sample dates and decays to zero out-of-sample. Across five 7-14B LLMs and five mega-cap equities, FinCAD cuts in-sample backtest returns by up to -67.1% on memorised dates while leaving 2025 out-of-sample returns within $8K and Sharpe within 0.10 of baseline, and preserves general-purpose reasoning within 1.7 pts. On an eleven-model leaderboard, it raises the in-sample / out-of-sample Spearman correlation from +0.779 to +0.846, recovering rankings that genuinely predict out-of-sample performance.
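Context-Aware Decoding, which FinCAD adapts, contrasts two sets of next-token logits. The sketch below shows the generic contrastive form, adjusted = (1 + alpha) * logits_a - alpha * logits_b, together with a hypothetical strength rule that shrinks to zero when nothing is memorised. How FinCAD actually builds the two prompts and computes its per-(entity, date) memorisation score is not given in the abstract, so `logits_memory`, `memorisation_score`, and `adaptive_alpha` are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_decode(logits_context, logits_memory, alpha):
    """Contrastive (CAD-style) next-token distribution.

    Amplifies what the context-conditioned pass supports and subtracts what
    the memory-driven pass prefers. alpha = 0 returns softmax(logits_context).
    """
    adjusted = [
        (1.0 + alpha) * c - alpha * m
        for c, m in zip(logits_context, logits_memory)
    ]
    return softmax(adjusted)


def adaptive_alpha(memorisation_score, alpha_max=1.0):
    """Hypothetical rule: scale the penalty by a score clipped to [0, 1],
    so dates the model has not memorised get alpha = 0 (no penalty)."""
    return alpha_max * min(1.0, max(0.0, memorisation_score))
```

With two candidate tokens ("up", "down"), context logits `[1.2, 1.0]` and memory logits `[3.0, 0.0]`, `alpha = 0` slightly favors "up", while `alpha = 1` flips the preference to "down" because the memory-driven preference for "up" is subtracted out.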

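2605.24562 2026-05-26 cs.CV cs.AI (updated version)

PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

PEDESTRIANQA: 面向行人意图与轨迹预测的视觉-语言模型基准

Naman Mishra, Shankar Gangisetty, C. V. Jawahar

Affiliations: CVIT, IIIT-Hyderabad, India

AI summary: Introduces PedestrianQA, a large-scale video dataset that casts pedestrian intention and trajectory prediction as question-answering tasks with structured rationales; fine-tuning vision-language models on it significantly improves prediction accuracy and explainability.

AI Chinese abstract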

行人意图和轨迹预测对于自动驾驶系统的安全部署至关重要，直接影响复杂交通环境中的导航决策。近期大型视觉-语言模型的进展通过结合高容量视觉理解与灵活的自然语言推理，为这些任务提供了强大的新范式。本文中，我们引入PedestrianQA，这是一个大规模视频数据集，将行人意图和轨迹预测公式化为带有结构化理由的问答任务。PedestrianQA以自然语言表达丰富标注的行人序列，使视觉-语言模型能够从视觉动态、上下文线索和交通智能体间的交互中学习，同时生成其预测的简洁解释，无需为每个任务定制专门的架构。在PIE、JAAD、TITAN和IDD-PeD上的实证评估表明，在PedestrianQA上微调最先进的视觉-语言模型显著提高了意图分类、轨迹预测准确性以及解释性理由的质量，展示了视觉-语言模型作为安全关键行人行为建模的统一且可解释框架的强大潜力。

English abstract

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences, in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions without needing specialized architectures tailored for each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.

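2605.24550 2026-05-26 cs.AI cs.CL cs.LG (updated version)

Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

越狱以保护：通过临时越狱进行缓冲和强化以实现大型语言模型的安全微调

Seokil Ham, Jaehyuk Jang, Wonjun Lee, Changick Kim

Affiliations: School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)

AI summary: To counter harmful fine-tuning attacks that weaken safety alignment in fine-tuning-as-a-service, proposes a Buffer-and-Reinforce framework grounded in gradient analysis: a temporary-jailbreak adapter reduces harmful updates, and QR-decomposition-based merging reinforces safety, giving an efficient defense that needs no additional safety data.

Comments: ICML 2026 Spotlight

AI Chinese abstract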

微调即服务（FaaS）使得大型语言模型（LLMs）的个性化成为可能，但它在有害微调攻击下会削弱安全对齐。最近的研究表明，在微调期间激活有害行为模块可以防止模型学习不良行为，但其机制尚不清楚。在本文中，我们重新审视临时越狱作为对抗有害微调的一种防御手段，并提供了梯度层面的分析，表明它能够饱和安全退化梯度，同时保留良性任务相关梯度。基于这一见解，我们提出了一种缓冲与强化微调框架，该框架在用户微调期间缓冲有害更新，并在适应后强化安全。具体来说，BufferLoRA作为一个可移除的适配器，在用户微调期间诱导临时越狱以减少有害更新。适应后，通过基于QR分解的合并，将经过训练的ReinforceLoRA（用于在临时越狱状态下恢复拒绝行为）与UserLoRA集成，以在保持用户任务性能的同时强化安全。大量实验表明，我们的框架在用户微调期间无需额外安全数据且计算成本极低的情况下，实现了卓越的安全性和实用性。

English abstract

Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but its mechanism remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.

2605.24549 2026-05-26 cs.AI (updated version)

AI驱动的自适应对手与公钥系统中密码学信任的侵蚀

Petar Radanliev

Affiliations: Department of Computer Sciences, University of Oxford; The Alan Turing Institute; British Library

AI summary: Studies how AI-driven adaptive adversaries can exploit implementation-level observability to erode the security of public-key cryptography, and proposes a new security evaluation framework.

2605.24541 2026-05-26 cs.LG cs.AI cs.CL cs.IR (updated version)

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

SemanticZip: 以LLM作为语义解压器的有损文本压缩的试点框架

Natalia Trukhina, Vadim Vashkelis

Affiliations: Embedded Intelligence Lab (EMILAB)

AI summary: Proposes the SemanticZip framework, in which text is compressed into compact codes that an LLM decompresses into task-relevant meaning. Across six representations, including structured prose and JSON, structured prose has the highest recoverability (WAR = 0.956, 19.1% token gain), while CCL-Min gives the best balance (39.4% token gain, WAR = 0.874).

Comments: 13 pages, 1 figure, 2 tables. Pilot framework paper; code and supplementary artifacts available in ancillary files

AI Chinese abstract

大型语言模型（LLM）系统的文本压缩通常被框架化为令牌删除、检索、摘要或精确重建。我们研究了一种更具攻击性但明确有损的设置：将文本压缩为紧凑代码，LLM可以将其扩展为任务相关的含义。我们将此设置称为SemanticZip。与无损压缩不同，SemanticZip不需要字节相同的重建；与普通摘要不同，它将基于模型的解压缩视为编解码器的一部分，并评估是否恢复了任务相关的语义承诺。本文是一个试点框架，而非基准声明。我们形式化了LLM介导的解压缩，定义了受保护/有损数据包架构，并在五个作者构建的诊断案例上评估了六种表示体系：结构化散文、JSON、CCL-Core、CCL-Min、SemanticZip ASCII和SemanticZip emoji。一个独立的解码器LLM从每种压缩表示中重建类型化的语义原子，我们评估关键原子召回率、加权原子召回率、精确度和分词器增益。在该试点中，结构化散文具有最高的可恢复性，WAR=0.956，o200k_base令牌增益19.1%。CCL-Min是最强的平衡点，令牌增益39.4%，WAR=0.874。SemanticZip ASCII提供了最大的有用压缩，令牌增益46.5%，WAR=0.802，而表情符号密集的SemanticZip在压缩和恢复方面表现均较差。主要贡献并非声称这些数字建立了通用前沿。相反，我们引入了一个可重复的实验接口，用于研究有损、LLM可解压的文本代码，以及一个设计原则：安全关键和精确的承诺应保持受保护，而可预测的低风险上下文可以进行语义压缩。

English abstract

Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruction. We study a more aggressive but explicitly lossy setting: compress text into compact codes that an LLM can expand into task-relevant meaning. We call this setting SemanticZip. Unlike lossless compression, SemanticZip does not require byte-identical reconstruction; unlike ordinary summarization, it treats model-based decompression as part of the codec and evaluates whether task-relevant semantic commitments are recovered. This paper is a pilot framework, not a benchmark claim. We formalize LLM-mediated decompression, define a protected/lossy packet architecture, and evaluate six representation regimes over five author-constructed diagnostic cases: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji. An independent decoder LLM reconstructs typed semantic atoms from each compressed representation, and we score Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain. In this pilot, structured prose has the highest recoverability, with WAR = 0.956 and 19.1% o200k_base token gain. CCL-Min is the strongest balanced point, with 39.4% token gain and WAR = 0.874. SemanticZip ASCII provides the largest useful compression, with 46.5% token gain and WAR = 0.802, while emoji-heavy SemanticZip performs worse on both compression and recovery. The main contribution is not the claim that these numbers establish a universal frontier. Rather, we introduce a reproducible experimental interface for studying lossy, LLM-decompressible text codes and a design principle: safety-critical and exact commitments should remain protected, while predictable low-risk context may be semantically zipped.
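The two headline quantities, token gain and Weighted Atom Recall (WAR), can be made concrete with a small sketch. The abstract does not give formal definitions, so the formulas below are assumed readings: token gain as the fraction of tokens saved, and WAR as the recovered share of total atom weight. The function names and example atoms are hypothetical.

```python
def token_gain(original_tokens, compressed_tokens):
    """Fraction of tokens saved by the compressed representation."""
    return 1.0 - compressed_tokens / original_tokens


def weighted_atom_recall(atoms, recovered_ids):
    """Recovered weight divided by total weight.

    atoms: list of (atom_id, weight) pairs for the source text.
    recovered_ids: set of atom ids the decoder reproduced.
    """
    total = sum(weight for _, weight in atoms)
    hit = sum(weight for atom_id, weight in atoms if atom_id in recovered_ids)
    return hit / total
```

For instance, compressing 200 tokens to 150 gives a token gain of 0.25, and recovering the atoms `dose` (weight 3) and `tone` (weight 1) but missing `date` (weight 2) gives a WAR of 4/6. Weighting lets a missed safety-critical atom cost more than a missed stylistic one.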

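2605.24539 2026-05-26 cs.AI (updated version)

AnyMo：野外人体运动的几何感知与设置无关建模

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

Affiliations: The University of New South Wales; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

AI summary: Proposes the AnyMo framework, which combines physics-based simulation of diverse IMU signals, graph-encoder pre-training, and LLM alignment to support zero-shot activity recognition across devices and datasets, cross-modal retrieval, and motion captioning, with substantial performance gains.

AI Chinese abstract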

随着可穿戴和移动设备日益融入日常生活，它们为持续感知野外人体运动提供了实用途径。但惯性信号高度依赖于传感设置，包括身体位置、安装方向、传感器朝向、设备硬件和采样协议。这种设置依赖性使得学习跨设备和数据集迁移的运动表示变得困难，并限制了可穿戴IMU在封闭集识别之外的广泛应用。我们提出AnyMo，一个用于设置无关人体运动建模的几何感知框架。AnyMo利用基于物理的IMU模拟在密集体表位置上生成多样且合理的合成信号，从配对的合成放置视图和掩蔽部分观测中预训练图编码器，将多位置IMU标记化为全身运动令牌，并将这些令牌与LLM对齐以进行运动-语言理解。我们在三个互补任务上评估AnyMo：跨14个未见下游数据集的零样本活动识别、跨模态检索和可穿戴IMU运动描述，其中在HAR上平均Accuracy/F1/R@2提升11.7%/11.6%/22.6%，零样本IMU到文本和文本到IMU检索MRR分别提升15.9%和28.6%，零样本描述BERT-F1提升18.8%。这些结果支持AnyMo作为野外可穿戴运动理解的通才模型。项目页面：https://baiyuchen.com/project/AnyMo。

English abstract

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.

2605.22634 2026-05-26 cs.SE cs.AI (updated version)

Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

合同技能：面向企业AI代理的GovernSpec设计框架

Ting Liu

Affiliations: SymbolicLight Research

AI summary: Proposes a GovernSpec-based design framework for contractual skills that organizes SKILL.md files as readable task contracts making task intent, boundaries, and acceptance criteria explicit; experiments indicate that it improves generation quality and lowers the critical-error rate.

Comments: 15 pages, 5 figures, 4 tables. v2 adds a public-skill A/B study, updates experimental results, and adds a public replication package link: https://github.com/SymbolicLight-AGI/contractual-skill

AI Chinese abstract

技能已成为代理指令、工作流、脚本和参考材料的实用封装机制。然而，在企业环境中，技能通常需要表达比任务指导更多的内容：目标、输入边界、权限、人工审批点、证据要求、输出合同、质量标准、验证步骤和交接规则。本文提出合同技能，一种受GovernSpec启发的设计框架，用于将SKILL.md文件组织为可读的任务合同，同时保持轻量级技能发现和渐进加载。该框架明确了合同技能、GovernSpec YAML合同、模型上下文协议（MCP）接口、工具适配器、运行时护栏、追踪和评估系统之间的界限。我们通过三个离线实证研究评估该框架。第一个文本生成实验涵盖三个企业技能、十五个合成任务、四种指令条件和八个生成模型，产生960个输出和1680个交叉评判分数记录。第二个研究是公共技能A/B扩展：将八个公共技能与合同重写在四十八个合成任务、六个生成模型、两次重复、1152个输出和两个完整评判文件上进行比较。在此设置中，合同技能将平均质量从4.692提高到4.914，并将关键错误率从0.083降低到0.013。第三个研究是离线工具调用挑战，涉及八个模型和192个模拟工具调用记录。结果表明，合同技能最好被理解为一种治理层，使任务意图、边界和验收标准显式化，而不是独立的安全机制。

English abstract

Skills have become a practical packaging mechanism for agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, a skill often needs to express more than task guidance: goals, input boundaries, permissions, human approval points, evidence requirements, output contracts, quality criteria, verification steps, and handoff rules. This paper proposes contractual skills, a GovernSpec-inspired design framework for organizing SKILL.md files as readable task contracts while preserving lightweight skill discovery and progressive loading. The framework clarifies the boundary between contractual skills, GovernSpec YAML contracts, Model Context Protocol (MCP) surfaces, tool adapters, runtime guardrails, tracing, and evaluation systems. We evaluate the framework with three offline empirical studies. The first text-generation experiment covers three enterprise skills, fifteen synthetic tasks, four instruction conditions, and eight generation models, producing 960 outputs and 1680 cross-judge score records. The second study is a public-skill A/B expansion: eight public skills are compared with contractual rewrites across forty-eight synthetic tasks, six generation models, two repeats, 1152 outputs, and two complete judge files. In this setting, contractual skills raise mean quality from 4.692 to 4.914 and reduce critical-error rate from 0.083 to 0.013. The third study is an offline tool-calling challenge with eight models and 192 simulated tool-call records. The results suggest that contractual skills are best understood as a governance layer that makes task intent, boundaries, and acceptance criteria explicit, not as a standalone safety mechanism.

2605.22337 2026-05-26 cs.AI (updated version)

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Meta-Soft: 利用可组合元标记实现上下文保持的KV缓存压缩

Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu

Affiliations: Guangdong Institute of Intelligence Science and Technology; University of Macau; Chinese University of Hong Kong, Shenzhen; Hong Kong University of Science and Technology

AI summary: Proposes Meta-Soft, a dynamic compression framework that synthesizes meta-tokens with a learnable orthogonal basis matrix and a Gumbel-Softmax selector network, and uses an attention-flow integration mechanism to retain information from discarded context, addressing information loss and context breaks in KV cache compression.

Comments: 9 pages, 2 figures

AI Chinese abstract

大型语言模型中使用的KV缓存具有线性增长的时间复杂度，因此当处理长上下文时，LLMs面临内存爆炸和解码效率降低的问题。当前的KV缓存驱逐已成为重要的研究方向；然而，基于固定软标记（例如Judge Q）的现有方法依赖静态参数集作为查询来评估KV对的重要性，因此无法动态适应不同的输入提示，也无法精确捕捉复杂且变化的任务相关性。此外，被驱逐的KV对被永久丢弃，导致不可逆的信息丢失和上下文断裂。为了解决这个问题，我们提出了Meta-Soft，一种基于探针驱动上下文整合的动态压缩框架。具体来说，我们构建了一个带有可学习正交基矩阵$\mathcal{L}$的元库，并使用带有Gumbel-Softmax的选择器网络生成可微分的稀疏组合权重，从而从输入提示特征中动态合成最具针对性的$k$个软标记。我们将这些软标记附加到输入序列的末尾以探针关键信息。我们还引入了一种基于注意力流的整合机制，该机制将移除标记的语义信息重新分配到保留标记中，从而有效保持被丢弃的上下文信息。在多个数据集上的实验表明，我们的方法优于现有的最先进驱逐方法，并为KV缓存压缩提供了新的解决方案。

English abstract

The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important research direction; however, existing methods based on fixed Soft Tokens (e.g., Judge Q) rely on a static parameter set as the query to evaluate the importance of KV pairs, so they cannot adapt dynamically to different input prompts, and they cannot precisely capture complex and changing task relevance. Also, evicted KV pairs are discarded permanently, so this causes irreversible information loss and context breaks. To address this problem, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$, and we use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, so we dynamically synthesize the most targeted $k$ Soft Tokens from the input prompt features. We append these Soft Tokens to the end of the input sequence to probe key information. We also introduce an attention-flow based integration mechanism, which redistributes the semantic information of removed tokens into retained tokens, and this keeps the dropped context information effectively. Experiments on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV Cache compression.
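The Gumbel-Softmax step mentioned in the abstract can be sketched in a few lines: Gumbel noise is added to the selector logits and a temperature-scaled softmax turns them into soft selection weights, which then mix rows of a basis. This is a generic standard-library illustration, not the paper's selector network; `basis`, `synthesize_token`, and the temperature value are assumptions.

```python
import math
import random


def gumbel_softmax(logits, temperature=0.5, rng=random):
    """Draw one relaxed (soft) one-hot sample over len(logits) choices."""
    noisy = []
    for z in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((z + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_token(weights, basis):
    """Weighted combination of basis rows (each row is a list of floats)."""
    dim = len(basis[0])
    return [sum(w * row[d] for w, row in zip(weights, basis)) for d in range(dim)]
```

Lower temperatures push the weights toward a one-hot choice (sparser combinations), while the softmax keeps the operation differentiable in frameworks that track gradients.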

2605.21740 2026-05-26 cs.AI (updated version)

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

SMDD-Bench: 大语言模型能否解决真实世界的小分子药物设计任务？

Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Barati Farimani

Affiliations: Carnegie Mellon University; Stealth; Pennsylvania State University

AI summary: Introduces SMDD-Bench, a benchmark of 502 multi-turn, long-horizon task instances for evaluating LLMs on real-world small molecule drug design; the best model, GPT5.4, solves only 40.2% of tasks.

AI Chinese abstract

LLM智能体在科学发现应用中具有巨大潜力。然而，LLM智能体在跨不同化学空间和靶标的真实世界小分子药物设计（SMDD）任务上的表现尚不明确。当前的评估方法要么是临时的，对于真实发现过于简单，规模有限，或局限于单轮问答。为了标准化LLM智能体在小分子设计上的评估，我们引入了SMDD-Bench，一个具有挑战性的多轮长时智能体基准，包含502个保证可解的任务实例，涵盖5种任务类型：2D药效团识别、相互作用点发现、骨架跃迁、先导化合物优化和片段组装。SMDD-Bench任务覆盖广泛的化学空间，涉及102个独特的蛋白质靶标。完全解决该基准需要具备强大的化学和生物学推理能力及3D直觉，理解专业工具的使用，并在有限的oracle调用次数内展示规划专业知识。我们对7个前沿的开源和闭源LLM进行了基准测试，发现性能最好的LLM GPT5.4仅解决了40.2%的任务。我们希望SMDD-Bench能提供一个标准化的测试平台，激励该领域训练和评估用于全自动计算药物设计的LLM智能体。我们在smddbench.com上托管了一个公共排行榜。

English abstract

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require having strong chemical and biological reasoning and 3D intuition, understanding specialized tool use, and displaying planning expertise over a limited number of oracle calls. We benchmark 7 frontier open and closed source LLMs and find that even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.

2605.21652 2026-05-26 cs.CV cs.AI (updated version)

Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Look-Closer-Then-Diagnose: 通过主动缩放实现置信度感知的超声VQA

Yue Zhou, Erxuan Wu, Yikang Sun, Hongjoo Lee, Yuan Bi, Huixiong Xu, Nassir Navab, Zhongliang Jiang

Affiliations: Computer Aided Medical Procedures (CAMP), TU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; Zhongshan Hospital, Fudan University, China; The University of Hong Kong, Hong Kong, China

AI summary: Proposes a framework that mimics the sonographer's cognitive workflow, combining a Zoom-then-Diagnose paradigm with an uncertainty-aware reward within Group Relative Policy Optimization to improve lesion localization and diagnosis in ultrasound visual question answering.

AI Chinese abstract

视觉-语言模型（VLM）显著推进了医学视觉问答，但在超声领域性能仍不理想。临床实践中，超声医师在制定报告时会明确关注病灶区域，尽管诊断解释有时因固有的主观性而存在差异。然而，现有VLM并未明确设计为在诊断前交互式地放大病灶；此外，它们通常将标注视为无偏真值，未能考虑其固有的主观性和模糊性。在本文中，我们提出了一个专门考虑超声医师认知工作流的框架。我们首先引入了一个结构化的“缩放-诊断”范式，该范式复制了交互式搜索过程以实现病灶聚焦推理。此外，在组相对策略优化（GRPO）框架内，我们引入了一个基于随机组 rollout 的不确定性感知奖励，以估计预测一致性作为模型置信度的代理。这两个组件共同鼓励模型在清晰案例上强化准确预测，同时在模糊情况下保持谨慎。在肝脏、乳腺和甲状腺数据集上的实验表明，我们的框架将病灶定位提高了39.3%，证明我们的模型学会了主动靠近观察并诊断的能力。

English abstract

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.
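Two ingredients named in the abstract can be sketched generically: the group-relative advantage used by GRPO (each rollout's reward standardized within its group) and a consistency score across rollouts as a rough confidence proxy. How the paper combines them into its uncertainty-aware reward is not stated in the abstract, so the code below shows only the two building blocks, with hypothetical function names.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def answer_agreement(answers):
    """Share of rollouts that return the most common answer.

    Used here as a simple stand-in for model confidence on one question.
    """
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1
    return max(counts.values()) / len(answers)
```

For a group of four rollouts with rewards `[1, 0, 0, 1]`, the correct rollouts get positive advantages and the incorrect ones negative; if three of four rollouts answer "benign", the agreement score is 0.75.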

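2605.21417 2026-05-26 cs.CV cs.AI (updated version)

Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

排序重要：面向混合情感识别的排名感知选择性融合

Junghyun Lee, Hyunseo Kim, Hanna Jang, Junhyug Noh

Affiliations: Department of Artificial Intelligence and Software

AI summary: Proposes a rank-aware multi-encoder framework that uses an attention-based gating module to select the most informative encoders for fusion, decouples prediction into presence and salience heads, and adds unsupervised domain adaptation; the system ranked 2nd in the blended emotion recognition challenge.

Comments: Accepted at IEEE FG 2026 Workshops. Final system ranked 2nd in the BlEmoRE Challenge. 9 pages including appendix, 8 figures

AI Chinese abstract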

混合情感识别具有挑战性，因为情感通常表现为微妙且重叠的多模态线索的混合，而非单一主导信号。我们提出了一种排名感知的多编码器框架，该框架选择性地结合来自不同预提取视频和音频编码器的互补表示。我们的方法将异构编码器特征投影到共享潜在空间，通过基于注意力的门控模块估计样本级编码器重要性，并仅融合前n个最具信息量的编码器。为了更好地建模混合情感，我们将预测解耦为存在性和显著性头，并通过概率级融合对齐它们。我们进一步引入了无需伪标签的特征级无监督域适应，以提高在分布偏移下的鲁棒性。在BlEmoRE挑战赛上的实验表明，所提出的框架优于强单个编码器和朴素的多编码器融合基线。我们的最终系统在比赛中排名第二，支持了排名感知选择性融合在细粒度混合情感识别中的有效性。

English abstract

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
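Fusing only the top-n encoders can be illustrated with a small sketch: rank encoders by a per-sample gate score, keep the n best, renormalize their scores with a softmax, and take the weighted sum of their already-projected features. The abstract does not describe the gating module's internals, so the scores are taken as given here and all names are hypothetical.

```python
import math


def top_n_fusion(features, gate_scores, n):
    """Fuse the n highest-scoring encoders for one sample.

    features: one equal-length feature vector per encoder (shared latent space).
    gate_scores: one importance score per encoder.
    Returns (fused_vector, kept_indices).
    """
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    keep = ranked[:n]
    # Softmax restricted to the kept encoders, shifted for numerical stability.
    m = max(gate_scores[i] for i in keep)
    exps = {i: math.exp(gate_scores[i] - m) for i in keep}
    total = sum(exps.values())
    dim = len(features[0])
    fused = [sum(exps[i] / total * features[i][d] for i in keep) for d in range(dim)]
    return fused, keep
```

Encoders outside the top n contribute nothing to the fused vector, so a noisy encoder with a low gate score cannot dilute the result the way it would under plain averaging.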

2605.20278 2026-05-26 cs.LG cs.AI cs.CV 版本更新

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

ClaimDiff-RL: 通过视觉声明比较进行细粒度描述强化学习

Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng

发表机构 * The Chinese University of Hong Kong（香港中文大学）； MiniMax

AI总结提出ClaimDiff-RL框架，利用原子声明差异作为奖励单元，通过多模态判断器枚举视觉差异并分配错误类型和严重程度，以解决长描述强化学习中事实性与覆盖度的权衡问题。

AI中文摘要

长格式图像描述揭示了强化学习中的奖励粒度问题：描述被整体判断，而重要错误发生在单个视觉声明层面。一个好的密集描述应既忠实又信息丰富，避免幻觉而不遗漏显著细节。然而，成对偏好、基于参考的指标和整体标量奖励将这些局部错误压缩为单个序列级信号，模糊了事实性与覆盖度之间的权衡。我们引入ClaimDiff-RL框架，该框架使用基于参考的原子声明差异作为描述强化学习的奖励单元。给定一张图像、一个演员描述和一个参考描述，多模态判断器枚举视觉上可区分的差异，针对图像验证每个差异，分配开放词汇的错误类型和严重程度，并生成每个差异的统计信息用于奖励组合。这使得幻觉声明和遗漏的显著事实可以分别测量和调整。实验表明，整体标量奖励可以通过增加遗漏事实来减少幻觉，而ClaimDiff-RL揭示了这种忠实性与覆盖度的权衡，并实现了更平衡的操作点。在包含160张图像的人工标注诊断基准、公开描述基准和VQA基准上，ClaimDiff-RL改善了幻觉-遗漏事实平衡，保留了通用能力，甚至在多个细粒度能力维度（如物体计数、空间关系和场景识别）上超越了Gemini-3-Pro-Preview。这些结果表明，类型化、可验证的声明差异是细粒度且可诊断的描述强化学习的有效奖励单元。

英文摘要

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination--missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.

2605.20025 2026-05-26 cs.AI 版本更新

KairosHope: 一种基于双记忆架构的下一代时间序列基础模型，用于专门分类

Luis Balderas, José Alberto Rodríguez, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez

发表机构 * Department of Computer Science and Artificial Intelligence（计算机科学与人工智能系）； DiCITS, iMUDS, DaSCI（DiCITS、iMUDS、DaSCI）； University of Granada（格拉纳达大学）； Advanced Medical Imaging Group（先进医学成像组）； Instituto de Investigación Biosanitaria de Granada (ibs.Granada)（格拉纳达生物医学研究机构（ibs.Granada））； Department of Software Engineering（软件工程系）； Department of Rural Engineering（农村工程系）； University of Córdoba（科尔多瓦大学）

AI总结针对标准注意力计算瓶颈和经典统计知识缺失问题，提出KairosHope模型，通过双记忆系统（Titans模块和连续记忆系统CMS）替代二次注意力，并融合深度表示与统计特征的混合决策头，在UCR基准上实现优越分类性能。

AI中文摘要

时间序列基础模型（TSFMs）在通用预测任务中取得了显著成功；然而，它们对专门分类问题的适应仍然受到标准注意力的计算瓶颈和对经典统计知识的系统性忽略的限制。本技术报告介绍了KairosHope，一种下一代TSFM，旨在协调大规模泛化与分类任务中的分析精度。该提案的核心是HOPE块，一种用双记忆系统替代二次注意力的架构：用于动态短期保留的Titans模块和用于长期历史上下文抽象的连续记忆系统（CMS）。为了丰富归纳偏差，引入了混合决策头，它将深度潜在表示与通过tsfeatures包提取的确定性统计特征融合。KairosHope在大型Monash档案上进行自监督预训练，结合了掩码时间序列建模（MTSM）和对比学习（InfoNCE）。随后，通过严格的线性探测和全微调（LP-FT）协议在UCR基准数据集上进行适应，以防止灾难性遗忘。实验结果表明，在具有严格时间因果关系的领域（如HAR或传感器数据）中，性能优越。因此，KairosHope为基础模型适应时间序列分析建立了一个稳健高效的框架。

英文摘要

Time Series Foundation Models (TSFMs) have demonstrated notable success in general-purpose forecasting tasks; however, their adaptation to specialized classification problems remains constrained by the computational bottleneck of standard attention and the systematic omission of classical statistical knowledge. This technical report introduces KairosHope, a next-generation TSFM designed to reconcile massive generalization with analytical precision in classification tasks. The core of the proposal is the HOPE block, an architecture that replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. To enrich the inductive bias, a Hybrid Decision Head is introduced, which fuses deep latent representations with deterministic statistical features extracted via the tsfeatures package. KairosHope undergoes self-supervised pre-training on the massive Monash archive, combining Masked Time Series Modeling (MTSM) and contrastive learning (InfoNCE). Its subsequent adaptation to the UCR benchmark datasets is conducted through a rigorous Linear Probing and Full Fine-Tuning (LP-FT) protocol to prevent catastrophic forgetting. Empirical results demonstrate superior performance in domains characterized by strict temporal causality, such as HAR or sensor data. Consequently, KairosHope establishes a robust and efficient framework for the adaptation of foundation models to time series analysis.

2605.17268 2026-05-26 cs.AI cs.CV cs.RO 版本更新

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

VLA 推理是否忠实？自动驾驶模型中因果链的安全性探究

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

发表机构 * School of Computer Science and Engineering（计算机科学与工程学院）； Central South University（中南大学）； School of Computer Science（计算机科学学院）； University of Wollongong in Dubai（迪拜伍伦贡大学）

AI总结通过分析300次VLA推理，发现输出推理与轨迹的忠实度仅42.5%，存在大量漏检行人、轨迹脆弱及推理-动作不一致问题，并提出了信息论忠实度形式化定义与安全架构。

Comments Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track

2605.16591 2026-05-26 cs.LG cs.AI 版本更新

PyCSP3-Scheduling: PyCSP3的调度扩展

Sohaib Afifi

发表机构 * Univ. Artois, UR 3926, Laboratoire de Génie Informatique et d’Automatique de l’Artois (LGI2A)（阿图瓦大学，UR 3926，阿图瓦计算机工程与自动化实验室（LGI2A））

AI总结提出PyCSP3 Scheduling库，通过53个专用约束和27个表达式为PyCSP3添加调度抽象，并编译为标准约束，在261个实例上验证了与原始公式的目标一致性，但运行时性能因编译开销而异。

AI中文摘要

PyCSP$^3$提供了一种高效构建约束模型以解决组合约束问题的方法，并将其导出为XCSP$^3$，保持了建模与求解的完全分离。然而，它缺乏对调度抽象（如区间变量、序列变量和资源函数）的原生支持。因此，即使PyCSP$^3$已经提供了如NoOverlap和Cumulative等整数数组上的全局约束，调度模型仍需通过低层整数变量和手动通道约束进行编码。我们提出了PyCSP$^3$ Scheduling，一个通过53个专用约束和27个表达式为PyCSP$^3$添加调度抽象的库，并将其编译为标准PyCSP$^3$/XCSP$^3$约束，维护了支撑PyCSP$^3$生态系统的建模/求解分离。在17个模型家族（每个5次运行）的261个配对实例上，两种公式在所有72个双重证明最优对以及近一半的家族（8/17）中产生了相同的目标值，且在编译后结构保持不变；然而，运行时性能在不同家族间存在差异，部分家族有显著提升（高达5.8倍），而其他家族由于编译分解的开销出现性能下降。代码和基准测试可在以下网址获取：https://github.com/sohaibafifi/pycsp3-scheduling

英文摘要

PyCSP$^3$ provides a productive way to build constraint models for solving combinatorial constrained problems and export them to XCSP$^3$, preserving a complete separation between modeling and solving. However, it lacks native support for scheduling abstractions such as interval variables, sequence variables, and resource functions. As a result, scheduling models must be encoded with low-level integer variables and manual channeling constraints, even though PyCSP$^3$ already provides global constraints like NoOverlap and Cumulative on integer arrays. We present PyCSP$^3$ Scheduling, a library that adds scheduling abstractions to PyCSP$^3$ through 53 dedicated constraints and 27 expressions, and compiles them down to standard PyCSP$^3$/XCSP$^3$ constraints, maintaining the modeling/solving separation that underpins the PyCSP$^3$ ecosystem. On 261 paired instances across 17 model families (5 runs each), both formulations produce identical objectives on all 72 doubly-proved optimal pairs and nearly half of the families (8/17) remain structurally unchanged after compilation; however, runtime performance diverges across families, with clear gains on some (up to 5.8x) and regressions on others due to the overhead of compilation decompositions. Code and benchmarks are available at: https://github.com/sohaibafifi/pycsp3-scheduling

2605.13850 2026-05-26 cs.AI cs.MA cs.SE 版本更新

A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

AI智能体设计模式的二维框架：认知功能与执行拓扑

Jia Huang, Joey Tianyi Zhou

发表机构 * Agency for Science, Technology and Research (A*STAR)（科技研究局（A*STAR））； Centre for Frontier AI Research (CFAR)（前沿人工智能研究中心（CFAR））

AI总结提出一个结合认知功能（7类）和执行拓扑（6种结构）的二维分类框架，识别28种命名模式，并通过跨领域分析得出模式选择的五条经验法则。

Comments 10 pages, 6 tables, 28 named patterns

AI中文摘要

现有的基于LLM的智能体架构框架从单一视角描述系统：行业指南（Anthropic、Google、LangChain）关注执行拓扑——数据如何流动，而认知科学调查关注认知功能——智能体做什么。单独任何一个轴都无法区分架构上不同的系统：相同的Orchestrator-Workers拓扑可以实现Plan-and-Execute、Hierarchical Delegation或Adversarial Verification——这三种模式具有根本不同的故障模式和设计权衡。我们提出一个二维分类，结合（1）认知功能轴，包含七个类别（感知、记忆、推理、行动、反思、协作、治理）和（2）执行拓扑轴，包含六种结构原型（链、路由、并行、编排、循环、层次）。由此产生的7x6矩阵识别出28种命名模式，其中15种为原创名称。我们通过系统的跨轴分析证明正交性，详细定义八种代表性模式，并在四个真实领域（金融贷款、法律尽职调查、网络运维、医疗分诊）验证描述覆盖范围。跨领域分析得出模式选择的五条经验法则，这些法则支配环境约束（时间压力、行动权限、失败成本不对称、规模）与架构选择之间的关系。该框架为AI智能体架构设计提供了原则性、框架中立且模型无关的词汇表。

英文摘要

Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topology -- how data flows -- while cognitive science surveys focus on cognitive function -- what the agent does. Neither axis alone disambiguates architecturally distinct systems: the same Orchestrator-Workers topology can implement Plan-and-Execute, Hierarchical Delegation, or Adversarial Verification -- three patterns with fundamentally different failure modes and design trade-offs. We propose a two-dimensional classification that combines (1) a Cognitive Function axis with seven categories (Perception, Memory, Reasoning, Action, Reflection, Collaboration, Governance) and (2) an Execution Topology axis with six structural archetypes (Chain, Route, Parallel, Orchestrate, Loop, Hierarchy). The resulting 7x6 matrix identifies 28 named patterns, 15 with original names. We demonstrate orthogonality through systematic cross-axis analysis, define eight representative patterns in detail, and validate descriptive coverage across four real-world domains (financial lending, legal due diligence, network operations, healthcare triage). Cross-domain analysis yields five empirical laws of pattern selection governing the relationship between environmental constraints (time pressure, action authority, failure cost asymmetry, volume) and architectural choices. The framework provides a principled, framework-neutral, and model-agnostic vocabulary for AI agent architecture design.

2605.13282 2026-05-26 cs.AI cs.LG 版本更新

Differentiable Learning of Lifted Action Schemas for Classical Planning

经典规划中提升动作模式的可微学习

Jonas Reiter, Jakob Elias Gebler, Hector Geffner

发表机构 * RWTH Aachen University（亚琛工业大学）

AI总结提出一种神经网络架构，从完全可观测状态但动作参数未观测的轨迹中学习提升动作模式，实现近乎完美的结构恢复。

AI中文摘要

经典规划器可以有效解决用STRIPS或PDDL表示的非常大的确定性MDP，其中状态是对象和关系上的原子集合，提升动作模式添加或删除这些原子。这种紧凑表示产生了强大的搜索启发式，并为结构泛化提供了理想设置，因为提升关系和动作模式可以产生无限多个领域实例。一个核心挑战是从数据中学习这些关系和动作模式，最近的方法使用不同类型的观测来解决这个问题。在这项工作中，我们开发了一种新颖的神经网络架构，从状态完全可观测但动作参数未观测的轨迹中学习动作模式。该问题是一个简化，但却是从图像序列和动作标签学习规划领域的重要一步，我们旨在以近乎完美的方式解决这个简化问题。挑战在于同时从观测到的状态变化中识别动作参数并学习动作模式。我们的方法产生了一个鲁棒的可微组件，然后可以集成到更大的神经符号模型中。我们在各种规划领域上评估该架构，其中学习到的提升动作模式必须恢复真实结构。此外，我们报告了关于对观测噪声的鲁棒性以及与基于槽的动态模型相关变体的实验。

英文摘要

Classical planners can effectively solve very large deterministic MDPs represented in STRIPS or PDDL where states are sets of atoms over objects and relations, and lifted action schemas add or delete these atoms. This compact representation yields strong search heuristics and provides an ideal setting for structural generalization, since lifted relations and action schemas give rise to infinitely many domain instances. A central challenge is to learn these relations and action schemas from data, and recent approaches have addressed this problem using different types of observations. In this work, we develop a novel neural network architecture for learning action schemas from traces where states are fully observed but action arguments are unobserved. The problem is a simplification but an important step towards learning planning domains from sequences of images and action labels, and we aim to solve this simplification in a nearly perfect manner. The challenge lies in learning the action schemas while simultaneously identifying the action arguments from observed state changes. Our approach yields a robust differentiable component that can then be integrated into larger neuro-symbolic models. We evaluate the architecture on various planning domains, where the learned lifted action schemas must recover the ground-truth structure. Additionally, we report experiments on robustness to observation noise and on a variation related to slot-based dynamics models.

2605.12850 2026-05-26 cs.CL cs.AI cs.CR cs.LG 版本更新

Persona-Model Collapse in Emergent Misalignment

涌现性失调中的人格模型崩溃

Davi Bastos Costa, Renato Vicente

发表机构 * TELUS Digital Research Hub（TELUS数字研究中心）； Center for Artificial Intelligence and Machine Learning（人工智能与机器学习中心）； Institute of Mathematics, Statistics and Computer Science（数学、统计与计算机科学研究所）； University of São Paulo（圣保罗大学）

AI总结提出人格模型崩溃假说，通过道德易感性(S)和道德稳健性(R)两个指标，证明在有害数据上微调大语言模型会导致模型模拟、区分和维持一致角色的内部能力恶化，从而引发涌现性失调。

Comments 23 pages, 7 figures, 7 tables; NeurIPS 2026 submission; Corrected code repository URL

AI中文摘要

在包含有害内容的狭窄数据上微调大型语言模型，会在无关提示上产生广泛的失调行为，这种现象称为涌现性失调。我们提出涌现性失调涉及人格模型崩溃：模型模拟、区分和维持一致角色的内部能力恶化。我们通过两个指标在行为上检验这一假设：道德易感性(S)和道德稳健性(R)，它们根据模型在角色扮演下道德基础问卷回答的跨角色和角色内变异性计算得出。这些指标形式化了模型区分角色的能力(S)以及模拟给定角色时的一致性(R)。我们评估了四个前沿模型（DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B）的三种变体：基础版、微调为输出不安全代码的版本，以及匹配的微调为输出安全代码的对照版本。在四个模型中，不安全微调导致S平均增加55%，将所有四个不安全变体推至先前工作中13个前沿模型基准观测到的波段之外——其中GPT-4o达到波段上端的两倍以上——表明分化失调。它还导致R平均下降65%，相当于1/R增加304%。相比之下，匹配的安全对照将S保持在基础值附近，仅引起部分R损失，表明这些效应主要特定于失调。补充这些指标变化，不安全变体的无条件响应趋近于接近量表上限的饱和状态，与基础模型的结构化响应以及基础模型角色扮演有毒人格时的响应明显不同。综合来看，这些指标为涌现性失调提供了敏感的诊断，并作为其涉及人格模型崩溃的行为证据。

英文摘要

Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces an average $55\%$ increase in S, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work -- with GPT-4o reaching more than twice the band's upper end -- signaling dysregulated differentiation. It also causes an average $65\%$ decrease in R, equivalent to a $304\%$ increase in 1/R. By contrast, the matched secure control preserves S near the base and induces only a partial R loss, showing that these effects are largely misalignment-specific. Complementing these metric shifts, insecure variants' unconditioned responses converge toward saturation near the scale ceiling, departing markedly from both base models' structured responses and those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.

2605.11182 2026-05-26 cs.AI 版本更新

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

在线策略蒸馏的多种面貌：陷阱、机制与修复

Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu

发表机构 * UIUC（伊利诺伊大学香槟分校）； Renmin University of China（中国人民大学）； Peking University（北京大学）

AI总结本文通过实证研究分析了在线策略蒸馏（OPD）和在线策略自蒸馏（OPSD）在大语言模型后训练中的有效性、失败机制及修复方法。

AI中文摘要

在线策略蒸馏（OPD）和在线策略自蒸馏（OPSD）已成为大语言模型有前景的后训练方法，它们在模型自身策略采样的轨迹上提供密集的token级监督。然而，现有关于其有效性的结果仍然好坏参半：虽然OP(S)D在系统提示和知识内化方面显示出潜力，但最近的研究也报告了不稳定性和退化。在这项工作中，我们对OPD和OPSD何时有效、何时失败以及原因进行了全面的实证研究。我们发现，数学推理上的OPD对教师选择和损失公式高度敏感，而OPSD在我们测试的设置中失败，因为测试时缺乏实例特定的特权信息（PI）。相反，当PI表示共享的潜在规则（如系统提示或对齐偏好）时，OPSD是有效的。我们识别出三种失败机制：（1）由于以学生生成的前缀为条件导致的教师与学生之间的分布不匹配，（2）来自有偏TopK反向KL梯度的优化不稳定性，以及（3）OPSD特定的限制，即学生学习了无PI策略，该策略聚合了以PI为条件的教师，当PI是实例特定时这是不够的。我们进一步表明，停止梯度TopK目标、RLVR适应的教师和SFT稳定的学生可以缓解这些失败。

英文摘要

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.

2605.10989 2026-05-26 cs.LG cs.AI 版本更新

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

SURGE: 二值神经网络中的替代梯度自适应

Haoyu Huang, Boyu Liu, Linlin Yang, Yanjing Li, Yuguang Yang, Xuhui Liu, Canyu Chen, Zhongqian Fu, Baochang Zhang

发表机构 * National College for Excellent Engineers, Beihang University, Beijing, China（北京航空航天大学优秀工程师学院）； School of Artificial Intelligence, Beihang University, Beijing, China（北京航空航天大学人工智能学院）； School of Electronic and Information Engineering, Beihang University, Beijing, China（北京航空航天大学电子与信息工程学院）； King Abdullah University of Science and Technology, Saudi Arabia（沙特阿卜杜拉国王科技大学）； Huawei Noah’s Ark Lab, China（华为诺亚方舟实验室）

AI总结针对二值神经网络中梯度失配和固定范围梯度裁剪导致的信息损失问题，提出一种基于理论的可学习梯度补偿框架SURGE，通过双路径梯度补偿器和自适应梯度缩放器实现偏差减少的梯度估计与动态平衡，在图像分类、目标检测和语言理解任务上达到最优性能。

Comments Accepted as a poster at the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

二值神经网络（BNN）的训练从根本上依赖于对不可微二值化操作（如符号函数）的梯度近似。然而，包括直通估计器（STE）及其改进变体在内的主流方法依赖于手工设计，存在梯度失配问题和固定范围梯度裁剪导致的信息损失。为了解决这一问题，我们提出了SURrogate GradiEnt Adaptation（SURGE），一种新颖的、具有理论依据的可学习梯度补偿框架。SURGE通过辅助反向传播缓解梯度失配。具体地，我们设计了一个双路径梯度补偿器（DPGC），为每个二值化层构建一个并行的全精度辅助分支，通过在反向传播期间进行输出分解来解耦梯度流。DPGC利用全精度分支估计超出STE一阶近似的分量，从而实现偏差减少的梯度估计。为了进一步增强训练稳定性，我们引入了一个基于最优缩放因子的自适应梯度缩放器（AGS），通过基于范数的缩放动态平衡分支间的梯度贡献。在图像分类、目标检测和语言理解任务上的实验表明，SURGE在现有最先进方法中表现最佳。

英文摘要

The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., the sign function). However, prevailing methods, including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from a gradient mismatch problem and from information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE outperforms state-of-the-art methods.
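
For readers unfamiliar with the baseline that the abstract criticises, the following is a minimal sketch of the standard clipped Straight-Through Estimator, not of SURGE itself. The clipping threshold of 1.0 and the function names are assumptions chosen for illustration.

```python
def binarize(x):
    """Forward pass: the non-differentiable sign function used in BNNs."""
    return 1.0 if x >= 0 else -1.0

def ste_backward(x, upstream_grad, clip=1.0):
    """Backward pass of a clipped straight-through estimator.

    The true derivative of sign(x) is zero almost everywhere, so the STE
    pretends the forward pass was the identity inside [-clip, clip] and
    passes the upstream gradient through unchanged. Outside that fixed
    range the gradient is dropped, which is the information loss the
    abstract attributes to fixed-range clipping.
    """
    return upstream_grad if abs(x) <= clip else 0.0

inside = ste_backward(0.5, upstream_grad=2.0)   # gradient passes through
outside = ste_backward(1.5, upstream_grad=2.0)  # gradient is clipped to zero
```

The gap between the identity-like surrogate and the actual step function is what is usually meant by gradient mismatch.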

2605.10718 2026-05-26 cs.DC cs.AI cs.LG cs.PF cs.SY eess.SY 版本更新

PACZero: 通过符号量化的语言模型PAC隐私微调

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas

发表机构 * CWI Amsterdam（阿姆斯特丹信息与计算科学研究所）； MIT Cambridge（麻省理工学院）； Vrije Universiteit Amsterdam（阿姆斯特丹自由大学）

AI总结提出PACZero系列零阶机制，通过符号量化实现零互信息下的PAC隐私微调，在SST-2和SQuAD上取得竞争性结果。

AI中文摘要

我们引入了PACZero，一系列用于微调大型语言模型的PAC隐私零阶机制，在$I(S^*; Y_{1:T})=0$时提供可用的效用。该隐私机制将成员推断攻击（MIA）后验成功率限制在先验水平，这是DP框架仅在$\varepsilon=0$和无限噪声下才能达到的MIA抵抗水平。所有下面的DP-ZO比较都在MIA后验水平上匹配。关键见解是，PAC隐私仅在发布依赖于哪个候选子集是秘密时才对互信息收费。对子集聚合的零阶梯度进行符号量化会产生频繁的一致步骤，即每个候选子集在更新方向上达成一致；在这些步骤中，发布的符号花费零条件互信息。我们提出了两个变体，涵盖隐私-效用权衡：PACZero-MI（通过对二元发布进行精确校准的预算化MI）和PACZero-ZPL（在分歧步骤上通过均匀硬币翻转实现$I=0$）。我们在SST-2和SQuAD上使用OPT-1.3B和OPT-6.7B在LoRA和全参数轨道上进行了评估。在SST-2 OPT-1.3B全微调$I=0$时，PACZero-ZPL达到$88.99\pm0.91$，比非私有MeZO基线（$91.1$ FT）低2.1个百分点。在$\varepsilon<1$的高隐私机制下，没有先前方法能产生可用的效用，而PACZero-ZPL在$I=0$时在OPT-1.3B和OPT-6.7B上获得了有竞争力的SST-2准确率和非平凡的SQuAD F1分数。

英文摘要

We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.
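
The unanimity idea can be sketched for a single update step as follows. This is an illustrative reading of the abstract, not the authors' code: it assumes each candidate subset yields one scalar zeroth-order gradient estimate, treats an estimate of exactly zero as positive, and uses the hypothetical name `zpl_release`.

```python
import random

def zpl_release(subset_grads, rng):
    """Release one update sign in the style of the ZPL variant (hypothetical helper).

    subset_grads -- one scalar zeroth-order gradient estimate per candidate subset
    rng          -- a random.Random instance used only on disagreement steps

    Returns (sign, unanimous). When every candidate subset agrees on the sign,
    the released value does not depend on which subset is the secret. When they
    disagree, a uniform coin flip is released instead, which is likewise
    independent of the secret.
    """
    signs = {1 if g >= 0 else -1 for g in subset_grads}
    if len(signs) == 1:
        return signs.pop(), True
    return rng.choice((-1, 1)), False

rng = random.Random(0)
agree = zpl_release([0.3, 1.2, 0.1], rng)   # all positive: unanimous step
split = zpl_release([0.3, -1.2, 0.1], rng)  # mixed signs: coin flip
```

Under these assumptions, utility comes entirely from the unanimous steps, so the fraction of such steps would determine how much signal the optimiser receives.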

2605.05226 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

将结果监督内化为过程监督：推理强化学习的新范式

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Sibo Wang, Huiming Yang

发表机构 * Alibaba Group（阿里巴巴集团）； Tsinghua University（清华大学）

AI总结提出一种监督内化方法，使模型在仅结果监督下自动提取过程级学习信号，实现细粒度策略优化。

AI中文摘要

推理强化学习的核心挑战不仅在于结果级监督的稀疏性，更在于如何将仅在序列末尾提供的反馈转化为可指导中间推理步骤的细粒度学习信号。现有方法要么依赖结果级奖励进行序列级优化，导致精确信用分配困难，要么依赖外部构建的过程监督，成本高昂且难以可持续扩展。为解决这一问题，我们提出一个新视角：推理强化学习可以理解为将结果监督内化为过程监督的问题。基于此视角，我们引入一种用于推理强化学习的监督内化方法，使模型能够通过识别、纠正和重用失败的推理轨迹自动提取过程级学习信号，从而在仅结果监督下实现更细粒度的策略优化。我们进一步将这一思想抽象为一种新的训练范式，其中模型在强化学习过程中持续生成并完善自身的内部过程监督，为推理强化学习中细粒度信用分配开辟了一条不同于外部提供过程监督的新路径。

英文摘要

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.

2605.04363 2026-05-26 cs.LG cs.AI 版本更新

Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

通过测试时后验调整缓解表格上下文学习中的标签偏移

Seunghan Lee

发表机构 * LG AI Research（LG人工智能研究）

AI总结针对TabPFN在表格数据上下文学习中对标签偏移敏感的问题，提出DistPFN方法，通过测试时后验调整重新缩放类别概率，无需修改架构或额外训练，在250多个OpenML数据集上显著提升分类性能。

Comments ICML 2026

AI中文摘要

TabPFN最近作为表格数据集的基础模型受到关注，通过在合成数据上利用上下文学习实现了强性能。然而，我们发现TabPFN容易受到标签偏移的影响，常常过拟合训练数据集中的多数类。为了解决这一局限性，我们提出了DistPFN，这是第一个专为表格基础模型设计的测试时后验调整方法。DistPFN通过降低训练先验（即上下文的类别分布）的影响并强调模型预测后验的贡献来重新缩放预测的类别概率，无需架构修改或额外训练。我们进一步引入了DistPFN-T，它结合了温度缩放，以根据先验和后验之间的差异自适应地控制调整强度。我们在超过250个OpenML数据集上评估了我们的方法，证明在标签偏移下，各种基于TabPFN的模型在分类任务中取得了显著改进，同时在无标签偏移的标准设置中保持了强性能。代码可在以下仓库获取：https://github.com/seunghan96/DistPFN。

英文摘要

TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.
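
A test-time posterior adjustment of this kind can be sketched with the classical prior-ratio correction: divide each predicted probability by the context class frequency and renormalise. The abstract does not give the exact DistPFN formula, so the function name `adjust_posterior`, the `strength` exponent, and the `eps` floor are assumptions made for illustration.

```python
def adjust_posterior(probs, context_labels, num_classes, strength=1.0, eps=1e-12):
    """Downweight the context (training) prior in predicted class probabilities.

    probs          -- predicted class probabilities for one test row
    context_labels -- integer labels of the in-context training rows
    strength       -- 0.0 leaves probs unchanged; 1.0 divides out the full prior
    """
    counts = [0] * num_classes
    for y in context_labels:
        counts[y] += 1
    total = len(context_labels)
    # Empirical class distribution of the context, floored to avoid division by zero.
    prior = [max(c / total, eps) for c in counts]
    scaled = [p / (q ** strength) for p, q in zip(probs, prior)]
    z = sum(scaled)
    return [s / z for s in scaled]

# Context is 90% class 0, and the raw prediction leans towards class 0.
adjusted = adjust_posterior([0.6, 0.4], [0] * 9 + [1], num_classes=2)
```

In this toy case the adjustment flips the prediction towards the minority class, which is the behaviour one would want if the test distribution is less imbalanced than the context. A temperature-style control such as the one described for DistPFN-T could be approximated here by varying `strength`.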

2605.03462 2026-05-26 cs.LG cs.AI 版本更新

From Muscle Bursts to Motor Intent: Self-Supervised Token Modeling for Heterogeneous EMG

从肌肉爆发到运动意图：面向异质EMG的自监督令牌建模

Zhenghao Huang, Huilin Yao, Kaikai Wang

AI总结提出AEMG自监督学习方法，通过事件级令牌建模和Transformer编码，从异质EMG数据中提取可复用的神经肌肉表征，提升跨用户、跨会话的鲁棒性并减少校准数据需求。

Comments After further verification, we identified issues in the current version that may affect the reliability and reproducibility of the reported experimental results. In particular, part of the evaluation relies on a dataset for which the public-release/redistribution status and supporting validation remain unresolved

AI中文摘要

表面肌电图提供了一种从可穿戴肌肉记录推断人类运动意图的实用方法，但在单一采集设置下训练的模型在用户、会话、电极布局或手势协议改变时往往会失去可靠性。本文提出AEMG，一种自监督学习方法，旨在从多样化的EMG源中提取可复用的神经肌肉表征。首先将八个公开手势数据集转换为共享信号格式，以减少通道配置、传感器拓扑和记录协议的差异。AEMG不依赖固定长度滑动窗口，而是从能量变化中识别收缩事件并将其表示为紧凑的神经肌肉令牌，同时有序令牌组描述运动过程中多个肌肉的协调活动。然后使用空间和时间条件Transformer编码这些令牌序列，保留电极位置、激活时序和顺序结构信息。在预训练中，模型通过向量量化重建构建收缩原型的离散库，并通过从周围观测中恢复掩蔽的神经肌肉令牌进一步学习上下文依赖关系。在留一受试者和低标签适应设置下的实验表明，学习到的表征提高了对未见用户的鲁棒性，并减少了手势识别所需的校准数据量。这些发现表明，事件级令牌建模为适应性强且数据高效的基于EMG的运动意图理解提供了一条可扩展的途径。

英文摘要

Surface electromyography provides a practical way to infer human movement intention from wearable muscle recordings, but models trained under a single acquisition setting often lose reliability when the user, session, electrode layout, or gesture protocol changes. This paper proposes AEMG, a self-supervised learning approach designed to extract reusable neuromuscular representations from diverse EMG sources. Eight public gesture datasets are first transformed into a shared signal format to reduce discrepancies in channel configuration, sensor topology, and recording protocol. Instead of relying on fixed-length sliding windows, AEMG identifies contraction events from energy variations and represents them as compact neuromuscular tokens, while ordered token groups describe the coordinated activity of multiple muscles during motion. A spatially and temporally conditioned Transformer is then used to encode these token sequences, preserving information about electrode position, activation timing, and sequential structure. For pre-training, the model constructs a discrete library of contraction prototypes through vector-quantized reconstruction and further learns contextual dependencies by recovering masked neuromuscular tokens from surrounding observations. Experiments under leave-one-subject-out and low-label adaptation settings show that the learned representation improves robustness to unseen users and reduces the amount of calibration data required for gesture recognition. These findings suggest that event-level token modeling offers a scalable route toward adaptable and data-efficient EMG-based motor-intent understanding.

2605.02900 2026-05-26 cs.CR cs.AI cs.CV cs.RO 版本更新

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

具身人工智能的安全性：风险、攻击与防御综述

Xiao Li, Xiang Zheng, Yifeng Gao, Xinyu Xia, Yixu Wang, Xin Wang, Ye Sun, Yunhan Zhao, Ming Wen, Jiayu Li, Zixing Chen, Xun Gong, Yi Liu, Yige Li, Yutao Wu, Cong Wang, Jun Sun, Yixin Cao, Zhineng Chen, Jingjing Chen, Tao Gui, Qi Zhang, Zuxuan Wu, Xipeng Qiu, Xuanjing Huang, Tiehua Zhang, Zhipeng Wei, Kun Wang, Xinfeng Li, Hanxun Huang, Sarah Erfani, James Bailey, Jianping Wang, Chaowei Xiao, Ran He, Bo Li, Xingjun Ma, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）； City University of Hong Kong（香港城市大学）； Jilin University（吉林大学）； Singapore Management University（新加坡管理大学）； Deakin University（迪肯大学）； Tongji University（同济大学）； Nanyang Technological University（南洋理工大学）； Chinese Academy of Sciences（中国科学院）； The University of Melbourne（墨尔本大学）； Johns Hopkins University（约翰霍普金斯大学）

AI总结本文综述了具身AI在感知、认知、规划、行动及交互全流程中的安全风险、攻击与防御方法，提出了多层次分类体系，并指出了多模态感知融合脆弱性、规划不稳定及人机交互可信度等关键挑战。

Comments Survey paper; 75 pages, 4 figures, 18 tables; v2 expands embodied-specific coverage of agentic threats, World Action Model threats, and contextual risk mitigation, with over 100 new references added. Project page: https://x-zheng16.github.io/Awesome-Embodied-AI-Safety/

AI中文摘要

具身人工智能将感知、认知、规划与交互集成到在开放、安全关键环境中运行的智能体中。随着这些系统获得自主性并进入交通、医疗、工业或辅助机器人等领域，确保其安全性在技术上具有挑战性，在社会上也变得不可或缺。与数字AI系统不同，具身智能体必须在不确定的感知、不完整的知识和动态的人机交互下行动，故障可能直接导致物理伤害。本综述对具身AI中的安全研究进行了全面且结构化的回顾，考察了从感知、认知到规划、行动与交互以及智能体系统的完整具身流程中的攻击与防御。我们引入了一个多层次分类体系，统一了分散的研究工作，并将具身特定的安全发现与视觉、语言和多模态基础模型的更广泛进展联系起来。我们的综述综合了来自500多篇论文的见解，涵盖对抗性攻击、后门攻击、越狱攻击和硬件级攻击；攻击检测、安全训练和鲁棒推理；以及风险感知的人机交互。这一分析揭示了几个被忽视的挑战，包括多模态感知融合的脆弱性、越狱攻击下规划的不稳定性，以及开放场景中人机交互的可信度。通过将领域组织成连贯的框架并识别关键研究空白，本综述为构建不仅具备能力和自主性，而且在现实部署中安全、鲁棒和可靠的具身智能体提供了路线图。

英文摘要

Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open-world, safety-critical environments. As these systems gain autonomy and enter domains such as transportation, healthcare, and industrial or assistive robotics, ensuring their safety becomes both technically challenging and socially indispensable. Unlike digital AI systems, embodied agents must act under uncertain sensing, incomplete knowledge, and dynamic human-robot interactions, where failures can directly lead to physical harm. This survey provides a comprehensive and structured review of safety research in embodied AI, examining attacks and defenses across the full embodied pipeline, from perception and cognition to planning, action and interaction, and agentic system. We introduce a multi-level taxonomy that unifies fragmented lines of work and connects embodied-specific safety findings with broader advances in vision, language, and multimodal foundation models. Our review synthesizes insights from over 500 papers spanning adversarial, backdoor, jailbreak, and hardware-level attacks; attack detection, safe training and robust inference; and risk-aware human-agent interaction. This analysis reveals several overlooked challenges, including the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human-agent interaction in open-ended scenarios. By organizing the field into a coherent framework and identifying critical research gaps, this survey provides a roadmap for building embodied agents that are not only capable and autonomous but also safe, robust, and reliable in real-world deployment.


2605.02124 2026-05-26 cs.LG cs.AI math.PR 版本更新

Soft-to-Hard Routing in Sparse Mixture-of-Experts Models

稀疏混合专家模型中的软到硬路由

Reza Rastegar

发表机构 * Meta Platforms, Inc（Meta平台）

AI总结本文通过边界层微积分方法，研究了稀疏混合专家模型中softmax路由随温度趋于零时趋近于硬top-1路由的极限过程，并给出了基于路由界面邻域概率的定量误差界。

AI中文摘要

随着温度趋于零，softmax路由趋近于硬top-1路由，但极限过程在路由器平局时存在奇异性。本文针对总体平方损失混合专家回归中的软到硬极限，发展了一种边界层微积分方法。对于具有logits $a_k(x;ϕ)$的路由器，相关的局部量是前两名的间隔$Δ(x;ϕ)$，相关的全局量是边界质量$\mathbb{P}(Δ(X;ϕ)\le w)$。在光滑性和横截性假设下，余面积和管状邻域估计展示了该质量如何随板宽缩放；在二元情形中，主导系数是路由界面上的显式曲面积分。这些几何估计给出了软目标$L_τ$和硬目标$L_0$之间的定量界，包括在间隔尾条件下的$O(τ^α)$一致比较，并得到了紧参数空间上软目标的$Γ$-收敛性。主要结论是，零温度近似由路由界面的$O(τ)$邻域所承载的概率控制，而不仅仅由温度本身决定。在分离出问题的这一边界层部分后，我们记录了一个从硬路由到小温度软路由的条件景观传递定理，以及一个简化的双专家高斯计算，展示了局部对称性破缺。仅包含合成诊断作为边界层预测的受控检验。

英文摘要

Softmax routing approaches hard top-1 routing as the temperature tends to zero, but the limiting passage is singular at router ties. This paper develops a boundary-layer calculus for this soft-to-hard limit in population squared-loss mixture-of-experts regression. For a router with logits $a_k(x;ϕ)$, the relevant local quantity is the top-two margin $Δ(x;ϕ)$, and the relevant global quantity is the boundary mass $\mathbb{P}(Δ(X;ϕ)\le w)$. Under smoothness and transversality assumptions, coarea and tubular-neighborhood estimates show how this mass scales with the slab width; in the binary case the leading coefficient is an explicit surface integral over the routing interface. These geometric estimates give quantitative bounds between the soft objective $L_τ$ and the hard objective $L_0$, including an $O(τ^α)$ uniform comparison under a margin-tail condition, and yield $Γ$-convergence of the soft objectives on compact parameter spaces. The main conclusion is that the zero-temperature approximation is controlled by the probability carried by an $O(τ)$ neighborhood of the routing interfaces, not by temperature alone. After isolating this boundary-layer part of the problem, we record a conditional landscape-transfer theorem from hard to small-temperature soft routing and a reduced two-expert Gaussian calculation illustrating local symmetry breaking. Synthetic diagnostics are included only as controlled checks of the boundary-layer predictions.
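The role of the top-two margin can be illustrated numerically. The following is a minimal sketch, not the paper's boundary-layer calculus: the function names are hypothetical, and the only fact it relies on is the elementary observation that for $K$ experts the L1 gap between softmax and one-hot routing weights is at most $2(K-1)e^{-Δ/τ}$, so the gap is negligible away from router ties and large near them at the same temperature.

```python
import math

def soft_route(logits, tau):
    """Softmax routing weights at temperature tau > 0 (numerically stabilised)."""
    m = max(logits)
    exps = [math.exp((a - m) / tau) for a in logits]
    z = sum(exps)
    return [e / z for e in exps]

def hard_route(logits):
    """One-hot top-1 routing; ties are broken by the lowest index (an assumption)."""
    k = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == k else 0.0 for i in range(len(logits))]

def top_two_margin(logits):
    """Gap between the largest and second-largest logit."""
    a = sorted(logits, reverse=True)
    return a[0] - a[1]

def routing_gap(logits, tau):
    """L1 distance between the soft and hard routing weights."""
    soft, hard = soft_route(logits, tau), hard_route(logits)
    return sum(abs(s - h) for s, h in zip(soft, hard))

# Same temperature, two inputs: one far from a tie, one almost on the interface.
far_from_tie = [2.0, 1.0, 0.0]    # margin 1.0
near_tie = [1.0, 0.999, 0.0]      # margin 0.001
tau = 0.05
gap_far = routing_gap(far_from_tie, tau)
gap_near = routing_gap(near_tie, tau)
```

At a fixed small temperature, `gap_far` is close to zero while `gap_near` stays close to one, which matches the abstract's point that the approximation error is carried by inputs near the routing interface rather than by the temperature alone.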


2605.02010 2026-05-26 cs.AI 版本更新

Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

可靠AI需要外化隐性知识：人机协作视角

Hengyu Liu, Tianyi Li, Zhihong Cui, Yushuai Li, Zhangkai Wu, Torben Bach Pedersen, Kristian Torp, Christian S. Jensen

发表机构 * Department of Computer Science, Aalborg University, Aalborg, Denmark（奥尔堡大学计算机科学系）； Department of Informatics, University of Oslo, Oslo, Norway（奥斯陆大学信息系）； School of Computing, Macquarie University, Sydney, Australia（麦考瑞大学计算科学学院）

AI总结本文从人机协作视角提出，可靠AI需要基础设施将隐性知识外化为可验证的形式，通过知识对象（KOs）实现人类验证，从而提升可靠性。

Comments Accepted at ICML 2026 (Position Paper Track). 14 pages, 2 figures, 1 table

AI中文摘要

本文立场认为，可靠AI需要基础设施来支持人类对隐性知识的验证。AI从显性知识（论文、文档、结构化数据库）和隐性知识（推理模式、调试过程、中间步骤）中学习。隐性知识由于文档成本超过感知价值而未被外化——然而AI不加区分地学习它，既获得有益模式也获得有害偏见。当前的可靠性方法只能根据来源验证显性知识，造成根本性差距：最有价值的AI能力（推理、判断、直觉）恰恰是我们无法验证的。我们提出知识对象（KOs）——将隐性知识外化为人类可以检查、验证和认可的形式的结构化工件。KOs改变了验证经济学：以前验证成本过高的事情变得可行，使得累积的人类验证能够随时间提高可靠性。

英文摘要

This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value -- yet AI learns from it indiscriminately, acquiring both beneficial patterns and harmful biases. Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap: the most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify. We propose Knowledge Objects (KOs) -- structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse. KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.


2605.01284 2026-05-26 cs.CV cs.AI cs.CL cs.IR 版本更新

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

证据链：面向迭代检索增强生成的像素级视觉归因

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

发表机构 * National Engineering Research Center for Software Engineering, Peking University（软件工程国家级工程研究中心，北京大学）； City University of Hong Kong（香港城市大学）； Peking University（北京大学）； Tencent Technology（腾讯科技）

AI总结提出Chain of Evidence (CoE)框架，利用视觉语言模型直接对检索到的文档截图进行推理，输出精确边界框以可视化完整推理链，解决迭代检索增强生成中的粗粒度归因和视觉语义丢失问题。

DOI: 10.1145/3805712.3809540

AI中文摘要

迭代检索增强生成（iRAG）已成为通过逐步检索和推理外部文档来回答复杂多跳问题的强大范式。然而，当前系统主要基于解析文本运行，这造成了两个关键瓶颈：（1）粗粒度归因，用户需要根据模糊的文本级引用在冗长文档中手动定位证据；（2）视觉语义丢失，将视觉丰富的文档（如幻灯片、带有图表的PDF）转换为文本会丢弃对推理至关重要的空间逻辑和布局线索。为弥合这一差距，我们提出了证据链（CoE），这是一个与检索器无关的视觉归因框架，利用视觉语言模型直接对检索到的文档候选截图进行推理。CoE消除了特定格式的解析，输出精确的边界框，可视化检索候选集中的完整推理链。我们在两个不同的基准上评估CoE：Wiki-CoE，一个源自2WikiMultiHopQA的大规模结构化网页数据集；以及SlideVQA，一个具有挑战性的演示幻灯片数据集，包含复杂图表和自由形式布局。实验表明，微调后的Qwen3-VL-8B-Instruct取得了稳健的性能，在需要视觉布局理解的场景中显著优于基于文本的基线，同时为像素级可解释的iRAG建立了与检索器无关的解决方案。我们的代码可在https://github.com/PeiYangLiu/CoE.git获取。

英文摘要

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) Coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) Visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.


2604.27636 2026-05-26 cs.AI 版本更新

Generative structure search for efficient and diverse discovery of molecular and crystal structures

生成式结构搜索：高效且多样地发现分子和晶体结构

Yifang Qin, Yu Shi, Junfu Tan, Chang Liu, Ming Zhang, Ziheng Lu

发表机构 * Zhongguancun Academy（中关村学院）； Kairos Materials（Kairos材料）

AI总结提出生成式结构搜索（GSS）框架，结合扩散模型和随机结构搜索，利用数据先验加速采样并保持能量引导的局部极小探索，以低于随机结构搜索十分之一的成本恢复多样亚稳态结构。

AI中文摘要

预测稳定和亚稳态结构是分子和材料发现的核心，但受限于高维能量景观的搜索成本。深度生成模型提供了高效的结构采样，但其输出仍受训练数据影响，可能未充分探索罕见但物理相关的极小值。我们引入生成式结构搜索（GSS），一个统一框架，将基于扩散的生成和随机结构搜索（RSS）表述为由学习得分场和物理力驱动的共同采样过程的极限情况。耦合这些驱动因素使GSS能够利用数据先验加速采样，同时保留能量引导的局部极小探索。在分子和晶体系统中，GSS恢复了多样的亚稳态结构，其采样成本比RSS低十倍以上，且对训练分布之外的组成仍然有效。结果建立了一种物理基础的生成搜索策略，用于发现仅靠数据驱动采样无法达到的结构。

英文摘要

Predicting stable and metastable structures is central to molecular and materials discovery, but remains limited by the cost of searching high-dimensional energy landscapes. Deep generative models offer efficient structure sampling, yet their outputs remain shaped by training data and can underexplore minima that are rare but physically relevant. We introduce generative structure search (GSS), a unified framework that formulates diffusion-based generation and random structure search (RSS) as limiting regimes of a common sampling process driven by learned score fields and physical forces. Coupling these drivers lets GSS use data priors to accelerate sampling while retaining energy-guided exploration of local minima. Across molecular and crystalline systems, GSS recovers diverse metastable structures with more than tenfold lower sampling cost than RSS for broad coverage and remains effective for compositions outside the training distribution. The results establish a physically grounded generative search strategy for discovering structures beyond the reach of data-driven sampling alone.


2604.23396 2026-05-26 cs.IR cs.AI cs.CL cs.LG 版本更新

Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval

迷失在解码中？复现与压力测试生成式检索中的前瞻先验

Kidist Amde Mekonnen, Yongkang Li, Yubao Tang, Simon Lupart, Maarten de Rijke

发表机构 * University of Amsterdam（阿姆斯特丹大学）

AI总结本文复现并压力测试了生成式检索中的前瞻先验方法PAG，发现其规划信号在词汇表面形式变化下脆弱，并评估了跨语言鲁棒性与查询端缓解策略。

Comments 12 pages, 5 figures, 9 tables; accepted to the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2026, Melbourne/Naarm, Australia

DOI: 10.1145/3805712.3808567
Journal ref: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), pages XXX-XXX, 2026

AI中文摘要

生成式检索（GR）通过自回归生成文档标识符来对文档进行排序。由于许多GR方法依赖于trie约束的束搜索，它们在有限束解码下容易过早剪枝相关前缀。生成式检索中的前瞻规划（PAG）通过使用同时解码来计算文档级前瞻先验，指导后续顺序解码，从而缓解了这种失败模式。我们在推理时复现了PAG，并压力测试了其解码行为。使用作者发布的检查点和标识符/trie工件，在报告的解码设置下，我们在MS MARCO Dev和TREC-DL 2019/2020上复现了主要有效性结果，并在我们的硬件设置中证实了报告的束大小-延迟权衡。在复现之外，我们引入了规划漂移诊断，量化意图保持的查询变体如何改变规划器的top-n候选集和最高权重规划器令牌，以及这些变化如何影响引导解码。我们发现PAG的规划信号在词汇表面形式变化下是脆弱的：意图保持的拼写错误可能触发规划崩溃，其中规划的候选池变化足够大，使得前瞻奖励几乎无法提供有用的指导，实际上使解码退回到较弱的无引导搜索。我们进一步使用非英语mMARCO查询对英语索引评估了固定索引的跨语言鲁棒性，并评估了无需重新索引的查询端缓解策略；在我们的设置中，查询翻译提供了最强的恢复。总体而言，我们的结果证实了PAG报告的有效性以及在发布的推理设置下规划引导解码的优势，同时表明这些增益依赖于规划信号在现实查询变化和查询-文档不匹配下的稳定性。

英文摘要

Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Because many GR methods rely on trie-constrained beam search, they are vulnerable to early pruning of relevant prefixes under finite-beam decoding. Planning Ahead in Generative Retrieval (PAG) mitigates this failure mode by using simultaneous decoding to compute a document-level look-ahead prior that guides subsequent sequential decoding. We reproduce PAG at inference time and stress-test its decoding behavior. Using the authors' released checkpoint and identifier/trie artifacts under the reported decoding setup, we reproduce the main effectiveness results on MS MARCO Dev and TREC-DL 2019/2020, and corroborate the reported beam-size-latency trade-off in our hardware setting. Beyond reproduction, we introduce plan drift diagnostics that quantify how intent-preserving query variations alter the planner's top-n candidate set and highest-weight planner tokens, and how these changes affect guided decoding. We find that PAG's planning signal is brittle under lexical surface-form variation: intent-preserving typos can trigger plan collapse, where the planned candidate pool shifts enough that the look-ahead bonus provides little useful guidance, effectively reverting decoding toward weaker unguided search. We further evaluate fixed-index cross-lingual robustness using non-English mMARCO queries against an English index, and assess query-side mitigation strategies that require no re-indexing; query translation provides the strongest recovery in our setting. Overall, our results confirm PAG's reported effectiveness and the benefit of planning-guided decoding under the released inference setup, while showing that these gains depend on the stability of the planning signal under realistic query variation and query-document mismatch.
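The early-pruning failure mode described in the abstract can be reproduced on a toy example. This is a minimal sketch under stated assumptions: the document identifiers, the score table, and the function names are all made up for illustration, every identifier is assumed to have the same length, and PAG's look-ahead prior itself is not implemented here.

```python
def build_trie(docids):
    """Nested-dict prefix tree over document identifiers (tuples of tokens)."""
    trie = {}
    for docid in docids:
        node = trie
        for tok in docid:
            node = node.setdefault(tok, {})
    return trie

def constrained_beam_search(trie, step_score, length, beam_size):
    """Beam search that only expands tokens allowed by the trie.

    step_score(prefix, tok) returns the log-score of emitting tok after prefix.
    Returns (identifier, total score) pairs, best first.
    """
    beams = [((), 0.0, trie)]
    for _ in range(length):
        candidates = []
        for prefix, score, node in beams:
            for tok, child in node.items():
                candidates.append((prefix + (tok,), score + step_score(prefix, tok), child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]   # finite beam: everything else is pruned
    return [(prefix, score) for prefix, score, _ in beams]

# Hypothetical identifiers and scores: ("b", "y") is the best full identifier,
# but its first token "b" scores slightly worse than "a".
docids = [("a", "x"), ("b", "y"), ("b", "z")]
table = {
    ((), "a"): -1.0, ((), "b"): -1.2,
    (("a",), "x"): -3.0,
    (("b",), "y"): -0.1, (("b",), "z"): -2.0,
}
score = lambda prefix, tok: table[(prefix, tok)]
trie = build_trie(docids)

narrow = constrained_beam_search(trie, score, length=2, beam_size=1)
wide = constrained_beam_search(trie, score, length=2, beam_size=2)
```

With a beam of one, the prefix `("b",)` is dropped after the first step, so the best document can never be recovered; with a beam of two it is ranked first. A document-level prior added to the prefix scores is one way to keep such prefixes alive, which is the role the abstract attributes to PAG's look-ahead prior.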


2604.20022 2026-05-26 cs.LG cs.AI cs.CL 版本更新

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

MoBayes：一种用于对话式临床决策支持中推理与语言分离的模块化贝叶斯框架

Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

发表机构 * LiGHT, EPFL（LiGHT，瑞士联邦理工学院）； University of Bern（伯尔尼大学）； Aarhus University（奥胡斯大学）

AI总结提出MoBayes框架，通过将LLM作为语言接口、贝叶斯模块进行概率推理，实现推理与语言分离，在临床决策支持中优于独立前沿LLM医生。

Comments 50 pages including appendix, 13 figures, 22 tables. Preprint

AI中文摘要

大型语言模型（LLM）越来越多地用于对话式临床决策支持，但它们将下一个标记预测与概率决策混为一谈。我们认为这种混淆反映了架构上的局限性：此类系统缺乏显式的后验追踪、可控的弃权阈值和可审计的推理链。我们引入MoBayes，一个模块化贝叶斯对话框架，将推理与语言分离。LLM仅作为语言接口，将患者对话解析为结构化观察，而贝叶斯模块对这些观察进行概率推理以更新后验，通过期望信息增益选择后续问题，并通过校准的决策阈值决定何时停止或推迟。这种设计实现了显式后验追踪、可控的选择性决策，以及无需重新训练语言模型即可替换的特定人群统计后端。在经验知识和LLM生成的知识库上，MoBayes优于独立的前沿LLM医生，包括匹配模型系列的比较，其中廉价的传感器模型与MoBayes配对以较低成本超过更大的自主模型。在对抗性患者沟通风格和不同诊断场景下，该优势依然存在。这些结果表明，可靠的对话式临床决策支持系统应将概率推理与语言生成分离，而不是仅扩大模型规模。代码可在https://anonymous.4open.science/r/MoBayes/获取。

英文摘要

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next token prediction with probabilistic decision making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected-information-gain and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/
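The reasoning loop the abstract describes (posterior update, question selection by expected information gain, and a threshold for stopping) can be sketched in a few lines. This is an illustrative toy, not the MoBayes implementation: the conditions, the yes/no likelihood table, the threshold value, and all function names are assumptions, and the LLM parsing step is omitted, so answers arrive as booleans.

```python
import math

def posterior_update(prior, likelihood, answer):
    """Bayes rule for one yes/no observation.

    prior: {condition: probability}
    likelihood: {condition: P(answer is yes | condition)}
    Assumes the observed answer has nonzero probability under the prior.
    """
    unnorm = {c: p * (likelihood[c] if answer else 1.0 - likelihood[c])
              for c, p in prior.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_information_gain(prior, likelihood):
    """Expected entropy reduction from asking one yes/no question."""
    p_yes = sum(prior[c] * likelihood[c] for c in prior)
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior_update(prior, likelihood, answer))
    return gain

def next_action(prior, questions, threshold=0.9):
    """Decide once one condition is confident enough, otherwise ask the best question."""
    best, p_best = max(prior.items(), key=lambda kv: kv[1])
    if p_best >= threshold:
        return ("decide", best)
    question = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
    return ("ask", question)

# Hypothetical two-condition knowledge base.
prior = {"flu": 0.5, "cold": 0.5}
questions = {
    "fever": {"flu": 0.9, "cold": 0.1},   # highly discriminative
    "cough": {"flu": 0.6, "cold": 0.5},   # barely discriminative
}
first = next_action(prior, questions)
after_fever = posterior_update(prior, questions["fever"], True)
second = next_action(after_fever, questions, threshold=0.85)
```

Because the posterior and the stopping threshold are explicit values, both can be logged and audited, and the likelihood table can be swapped for another population without touching the language model, which is the separation the abstract argues for.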


2604.18170 2026-05-26 cs.CL cs.AI 版本更新

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

Copy-as-Decode: 面向LLM编辑的语法约束并行预填充

Ziyang Liu

AI总结提出Copy-as-Decode机制，通过语法约束的并行预填充加速LLM编辑，实现高达303倍的自回归解码加速，并保持高覆盖率与无损性。

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

AI中文摘要

LLMs通过自回归地重新生成完整输出来编辑文本和代码，即使大多数标记在输入中逐字出现。我们研究Copy-as-Decode，一种解码层机制，将编辑生成重新表述为基于两个原语语法的结构化解码：<copy lines="i-j"/>引用输入行范围，<gen>...</gen>生成新内容。一个标记级FSM保证语法有效性，服务层原语通过单次并行预填充前向（而非N步自回归步骤）更新每个复制跨度的KV缓存——共享推测解码的并行前向内核，但以输入标记作为草稿，程序强制接受替代概率验证。我们报告一个无需端到端训练的上界分析。(i) 内核加速：在Qwen2.5-{1.5B, 7B}上，通过并行预填充复制N个标记比自回归快6.8倍至303倍（N ∈ [8, 512]，A100 80GB bf16）。(ii) 复制上限：在ProbeEdit和HumanEvalPack-Fix (Py/JS)上，74%–98%的金标准标记在行级原语下可达；结合每个语料库跨度直方图上的经验内核，得到闭式挂钟时间上界29.0倍/3.4倍/4.2倍（合并13.0倍）。标记级扩展达到91%–99%覆盖率，下界4.5倍–6.5倍。(iii) 流水线无损性：预言程序通过确定性解析器在所有482个案例上往返，将任何下游失败定位到跨度选择而非机制。扰动研究表明，在离一噪声下，合并EM从100%降至15.48%。在Qwen2.5-Coder-1.5B上的微调实验将HEvalFix-Py EM从0/33（未训练）提升至12%–17%，这是一个可学习性信号，而非生产选择器。批处理服务集成和多文件覆盖作为后续工作。

英文摘要

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.
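A deterministic resolver for the two-primitive grammar could look roughly like the sketch below. The tag syntax is taken from the abstract; everything else is an assumption made for illustration, including 1-indexed inclusive line ranges, the rejection of text outside the two primitives, and the function name `resolve`. The KV-cache and parallel-prefill side of the mechanism is not modelled.

```python
import re

# One alternative per primitive: a copy reference or a generated span.
PRIMITIVE = re.compile(r'<copy lines="(\d+)-(\d+)"/>|<gen>(.*?)</gen>', re.DOTALL)

def resolve(program, source):
    """Expand an edit program against the input text and return the edited text.

    Assumes line ranges are 1-indexed and inclusive (not confirmed by the abstract).
    """
    src_lines = source.splitlines()
    out, pos = [], 0
    for m in PRIMITIVE.finditer(program):
        if program[pos:m.start()].strip():
            raise ValueError("text outside <copy/> or <gen> primitives")
        pos = m.end()
        if m.group(1) is not None:                      # <copy lines="i-j"/>
            i, j = int(m.group(1)), int(m.group(2))
            if not 1 <= i <= j <= len(src_lines):
                raise ValueError(f"copy range {i}-{j} is outside the input")
            out.extend(src_lines[i - 1:j])
        else:                                           # <gen>...</gen>
            out.extend(m.group(3).splitlines())
    if program[pos:].strip():
        raise ValueError("trailing text after the last primitive")
    return "\n".join(out)

source = "def f(x):\n    return x + 1\n\nprint(f(1))"
program = '<copy lines="1-1"/><gen>    return x + 2</gen><copy lines="3-4"/>'
edited = resolve(program, source)
```

Here only one line is generated and three are copied by reference, which is the situation in which the abstract reports the largest savings. An off-by-one error in a copy range silently changes the output, consistent with the sensitivity to span selection that the perturbation study reports.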


2604.18128 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

深度寄存器解锁 SwiGLU 上的 W4A4：一种读取器/生成器分解

Ziyang Liu

AI总结本研究通过深度寄存器和铰链损失（DR+sink）训练时干预，将 SwiGLU 解码器语言模型的 W4A4 量化困惑度从 1727 降至 119，并分解出残差轴读取器主导误差，而生成器 w2 的双线性输入是剩余差距的主因。

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

AI中文摘要

我们在一个受控的 300M 参数 SwiGLU 解码器语言模型（在 FineWeb-Edu 的 5B 令牌上训练）中研究训练后 W4A4 量化，并询问哪些输入激活位点主导误差。朴素的四舍五入 W4A4 使验证困惑度从 FP16 的 23.6 恶化至 1727。一种简单的残差轴训练时干预——带有寄存器幅度铰链损失的深度寄存器（DR+sink）——在匹配的 FP16 PPL 和匹配的零样本能力下，将其降至 119（约 14 倍），并与 SmoothQuant 组合达到 39.9 PPL。与 FP16 之间约 2 PPL 的剩余差距是诊断核心。我们按输入激活位点分解 W4A4 损伤：SwiGLU 块中的五个可训练线性层分为残差轴读取器（qkv, w1, w3）和块内生成器（o_proj, w2）。基本的范数论证表明，残差轴幅度控制紧密约束读取器，但 w2 的双线性输入仅受因子范数平凡乘积的约束；经验上，DR+sink 降低了读取器的峰度，而生成器基本不变，并且读取器恢复的 W4A4 残差在三个匹配检查点上平坦约为 0.28 nats，其中 Delta-remove(w2) 占主导。我们将 DR+sink 作为训练时探针而非部署方案提出：一种事后替代方案（Per-Linear QuaRot）在读取器轴上几乎与之匹配。完整的 QuaRot——添加在线每头值 Hadamard 和在线 w2 输入旋转——也没有缩小差距，直接验证了正交旋转无法约束双线性 SwiGLU 尾部的预测。这些主张特定于我们的 300M、5B 令牌、单种子设置，并且我们的实验未将分区与铰链分离。

英文摘要

We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
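Why activation magnitude matters for round-to-nearest 4-bit quantization can be seen on a handful of numbers. The sketch below assumes a symmetric per-tensor scheme with a single scale set by the largest absolute value; the abstract does not specify the exact quantizer, so this is only one plausible instance, and the function names and sample values are made up.

```python
def quantize_rtn(values, bits=4):
    """Symmetric per-tensor round-to-nearest quantize-dequantize.

    The integer grid is [-8, 7] for 4 bits and the scale maps max |v| to 7.
    Python's round() resolves exact .5 ties to the even integer.
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)
    scale = amax / qmax
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical activations: four ordinary values, then the same with one outlier.
small = [0.1, -0.2, 0.3, 0.05]
with_outlier = small + [20.0]

err_plain = mse(small, quantize_rtn(small))
dequant_outlier = quantize_rtn(with_outlier)
err_outlier = mse(small, dequant_outlier[:4])
```

A single large entry stretches the scale so far that every ordinary value rounds to zero, which is the heavy-tailed (high-kurtosis) situation that the magnitude-control intervention in the abstract targets for the reader layers.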


2604.17328 2026-05-26 cs.LG cs.AI 版本更新

Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

重新思考序列级强化学习中的比较单元：从损失校正到样本构建的等长配对训练框架

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang, Sibo wang, Linglin Liao

发表机构 * Alibaba Group（阿里巴巴集团）； Tsinghua University（清华大学）

AI总结本文提出序列级相对强化学习中的长度问题本质是比较单元构建问题，并基于此提出等长配对训练框架EqLen，通过双轨同步生成、前缀继承和段掩码构建可比较的训练样本。

AI中文摘要

本文研究了序列级相对强化学习中的长度问题。我们观察到，尽管现有方法部分缓解了与长度相关的现象，但一个更根本的问题仍未得到充分刻画：训练过程中使用的比较单元缺乏内在可比性。基于这一观察，我们提出一个新的视角：长度问题不应仅仅被视为损失缩放或归一化偏差，而应被视为一个比较单元构建问题。我们进一步建立了一个基于样本构建的训练框架，该框架不是对不等长响应进行事后校正，而是在生成过程中主动构建等长、可对齐且可比较的训练段。在该框架内，我们提出了EqLen，一种适用于组相对比较算法（如GRPO、GSPO和RLOO）的具体方法。通过双轨同步生成、前缀继承和段掩码，EqLen高效地收集有效的等长训练段，并实现稳定的训练。

英文摘要

This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insufficiently characterized: the comparison units used during training lack inherent comparability. Building on this observation, we propose a new perspective: the length problem should not be viewed merely as a loss-scaling or normalization bias, but rather as a comparison unit construction problem. We further establish a sample-construction-based training framework that, instead of applying post-hoc corrections to unequal-length responses, proactively constructs equal-length, alignable, and comparable training segments during generation. Within this framework, we propose EqLen, a concrete method applicable to group-relative comparison algorithms such as GRPO, GSPO, and RLOO. Through dual-track synchronous generation, prefix inheritance, and segment masking, EqLen efficiently collects effective equal-length training segments and enables stable training.


2604.16778 2026-05-26 cs.LG cs.AI 版本更新

Federation over Text: Insight Sharing for Multi-Agent Reasoning

文本上的联邦：多智能体推理的洞察共享

Dixi Yao, Tahseen Rabbani, Manzil Zaheer, Tian Li

发表机构 * University of Chicago（芝加哥大学）； Google DeepMind（谷歌DeepMind）

AI总结提出一种类似联邦学习的框架FoT，通过迭代聚合多个客户端的本地推理过程，构建跨任务元认知洞察库，无需共享问题实例或任务指令，显著提升推理效果和效率。

Comments 46 pages

AI中文摘要

我们提出了一种类似联邦学习的框架——文本上的联邦（FoT），它使得处理不同任务的多个客户端能够通过迭代地联邦化其本地推理过程，共同生成一个共享的元认知洞察库，而无需共享实际的问题实例或任务指令。与梯度上的联邦（例如分布式训练）不同，FoT在语义层面运作，无需任何梯度优化或监督信号。迭代地，每个客户端运行一个LLM智能体，独立地对其特定任务进行本地思考和自我改进，并将推理轨迹与中央服务器共享，中央服务器将其聚合和提炼成一个跨任务（和跨领域）的洞察库，现有和未来的智能体可以利用该库来改进相关任务的性能。实验表明，FoT在广泛具有挑战性的应用中提高了推理效果和效率，包括数学问题求解、跨领域协作、现实世界日常任务以及机器学习研究洞察发现。具体而言，在前三个应用中，它平均提高了25%的性能得分，同时减少了4%的推理令牌。在研究洞察发现应用中，FoT能够生成覆盖后续论文中80%以上主要贡献的洞察。

英文摘要

We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple clients solving different tasks to collectively generate a shared library of metacognitive insights by iteratively federating their local reasoning processes without sharing actual problem instances or task instructions. Instead of federation over gradients (e.g., as in distributed training), FoT operates at the semantic level without any gradient optimization or supervision signal. Iteratively, each client runs an LLM agent that does local thinking and self-improvement on their specific tasks independently, and shares reasoning traces with a central server, which aggregates and distills them into a cross-task (and cross-domain) insight library that existing and future agents can leverage to improve performance on related tasks. Experiments show that FoT improves reasoning effectiveness and efficiency across a wide range of challenging applications, including mathematical problem solving, cross-domain collaboration, real-world daily tasks, and machine learning research insight discovery. Specifically, it improves average performance scores by 25% while reducing the reasoning tokens by 4% across the first three applications. In the research insight discovery application, FoT is able to generate insights that cover over 80% of the major contributions in the subsequent papers.


2604.12376 2026-05-26 cs.CL cs.AI 版本更新

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

面向长程LLM对话的协作式内存分页与关键词书签

Ziyang Liu

AI总结提出协作式分页方法，用关键词书签替代被驱逐的对话片段，并赋予模型 recall() 工具按需检索，在 LoCoMo 基准上四个模型均取得最佳答案质量，并通过消融实验揭示分页设计的关键因素。

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

AI中文摘要

当LLM对话超出上下文窗口时，旧内容必须被驱逐——但模型在需要时如何恢复它们？我们提出协作式分页：被驱逐的片段被替换为最小关键词书签（[pN:keywords]，每个约8-24个token），并赋予模型一个 recall() 工具以按需检索完整内容。在 LoCoMo 基准（10个真实多会话对话，300+轮次）上，协作式分页在四种模型（GPT-4o-mini、DeepSeek-v3.2、Claude Haiku、GLM-5）的六种方法中实现了最高的答案质量——优于截断、BM25、词重叠检索、搜索工具基线和完整上下文——由四个独立的LLM评判员确认（p=0.017，配对bootstrap）。随后，我们通过边界策略和驱逐策略的5x4消融实验（3,176个合成探针，1,600个LoCoMo探针）研究分页设计空间。关键发现：（1）粗粒度固定大小页面（fixed_20）达到96.7%，而内容感知的topic_shift降至56.7%；（2）驱逐策略的选择依赖于数据（FIFO在合成数据上最佳，LFU在LoCoMo上最佳）；（3）两种书签生成策略相比启发式基线有提升（+4.4和+8.7个E2E点）；（4）剩余瓶颈是书签区分度——模型96%的时间触发recall()，但当书签区分度不足时，仅57%选择正确页面。关键词特异性单独造成25个百分点的准确率差异。

英文摘要

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
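The paging scheme can be sketched as a small data structure. Only the bookmark shape `[pN:keywords]`, the `recall()` tool name, fixed-size pages, and FIFO eviction come from the abstract; the class name, the naive frequency-based keyword picker, and the parameter values are assumptions made for illustration.

```python
from collections import Counter

def keywords(turns, k=3):
    """Naive bookmark keywords: the k most frequent words longer than four letters."""
    words = [w.strip(".,!?").lower() for t in turns for w in t.split()]
    counts = Counter(w for w in words if len(w) > 4)
    return [w for w, _ in counts.most_common(k)]

class PagedContext:
    """Keeps recent turns verbatim and replaces evicted pages with bookmarks."""

    def __init__(self, page_size, max_live):
        self.page_size = page_size   # turns per evicted page (fixed-size boundary)
        self.max_live = max_live     # turns allowed to stay in context verbatim
        self.live = []               # turns currently in context
        self.pages = {}              # page id -> evicted turns
        self.bookmarks = []          # "[pN:keywords]" strings, oldest first

    def add_turn(self, text):
        self.live.append(text)
        while len(self.live) > self.max_live:
            self._evict_oldest_page()

    def _evict_oldest_page(self):
        """FIFO eviction: the oldest page_size turns leave the context."""
        page, self.live = self.live[:self.page_size], self.live[self.page_size:]
        pid = len(self.pages) + 1
        self.pages[pid] = page
        self.bookmarks.append(f"[p{pid}:{','.join(keywords(page))}]")

    def recall(self, pid):
        """Tool the model would call to fetch the full content of page pid."""
        return list(self.pages[pid])

    def render(self):
        """What the model sees: bookmarks first, then the live turns."""
        return "\n".join(self.bookmarks + self.live)

ctx = PagedContext(page_size=2, max_live=3)
for turn in ["Booked flights to Lisbon", "Lisbon hotel near the river",
             "Dinner plans", "Weather looks fine"]:
    ctx.add_turn(turn)
```

After the fourth turn the first two turns are replaced by one short bookmark, and `ctx.recall(1)` returns them in full. The abstract identifies bookmark distinctiveness as the remaining bottleneck, so the keyword picker is the part of this sketch most in need of something better than word frequency.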

URL PDF HTML ☆

赞 0 踩 0

2604.12116 2026-05-26 cs.AI cs.SE 版本更新

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

A-R行为空间：组织部署中工具使用语言模型代理的执行层剖析

Shasha Yu, Fiona Carroll, Barry L. Bentley

发表机构 * Cardiff School of Technologies, Cardiff Metropolitan University（卡迪夫技术学院，卡迪夫市政大学）； School of Professional Studies, Clark University（专业研究学院，克拉克大学）； Harvard Medical School, Harvard University（哈佛医学院，哈佛大学）

AI总结提出基于动作率(A)和拒绝信号(R)的二维A-R空间及散度(D)来测量执行层行为，评估不同规范制度和自主性配置下语言模型代理的执行与拒绝分布模式。

AI中文摘要

大型语言模型(LLMs)越来越多地被部署为能够执行系统级操作的工具增强型代理。虽然现有基准主要评估文本对齐或任务成功，但较少关注在不同自主性支架下语言信号与可执行行为之间的结构关系。本研究引入了一种基于二维A-R空间的执行层行为测量方法，该空间由动作率(A)和拒绝信号(R)定义，散度(D)捕捉两者之间的协调性。模型在四种规范制度（控制、灰色、困境和恶意）和三种自主性配置（直接执行、规划和反思）下进行评估。该方法不是分配聚合安全分数，而是描述执行和拒绝如何随上下文框架和支架深度重新分布。实证结果表明，执行和拒绝构成了可分离的行为维度，其联合分布在制度和自主性水平上系统性地变化。基于反思的支架通常会在风险情境中促使配置转向更高的拒绝，但重新分布模式在不同模型间存在结构性差异。A-R表示使得横截面行为剖面、支架诱导的转变和协调变异性直接可观察。通过将执行层表征置于标量排名之上，这项工作为在组织环境中分析和选择工具增强的LLM代理提供了面向部署的视角，其中执行权限和风险容忍度各不相同。

英文摘要

Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two. Models are evaluated across four normative regimes (Control, Gray, Dilemma, and Malicious) and three autonomy configurations (direct execution, planning, and reflection). Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth. Empirical results show that execution and refusal constitute separable behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection-based scaffolding often shifts configurations toward higher refusal in risk-laden contexts, but redistribution patterns differ structurally across models. The A-R representation makes cross-sectional behavioral profiles, scaffold-induced transitions, and coordination variability directly observable. By foregrounding execution-layer characterization over scalar ranking, this work provides a deployment-oriented lens for analyzing and selecting tool-enabled LLM agents in organizational settings where execution privileges and risk tolerance vary.


2604.08988 2026-05-26 cs.AI 版本更新

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

SEA-Eval: 超越情景评估的自进化智能体基准

Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Tengfei Wang, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Li, Jiaqing Liang, Yanghua Xiao

发表机构 * Fudan University（复旦大学）

AI总结本文提出自进化智能体(SEA)的形式化定义及其最小充分架构进化飞轮，并构建首个专门评估SEA的基准SEA-Eval，通过顺序任务流设计量化进化增益、稳定性和隐式对齐收敛。

AI中文摘要

当前基于LLM的智能体在情景任务执行中表现出强大性能，但仍受限于静态工具集和情景遗忘，无法跨任务边界积累经验。本文从数字具身和连续跨任务进化的角度形式化自进化智能体(SEA)，引入进化飞轮作为其最小充分架构，并提出SEA-Eval——首个专门设计用于评估SEA的基准。基于飞轮理论，SEA-Eval将SR和T作为主要指标，并通过顺序任务流设计，旨在量化进化增益、进化稳定性和隐式对齐收敛。实证评估表明，在可比成功率下，不同框架在单个任务上的token消耗差异高达31.2倍，且在顺序分析下出现不同的进化轨迹——这表明成功率单独造成能力幻觉，而T的顺序收敛是区分真正进化与伪进化的关键标准。

英文摘要

Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience across task boundaries. This paper formalizes the Self-Evolving Agent (SEA) from the perspective of digital embodiment and continuous cross-task evolution, introduces the Evolutionary Flywheel as its minimal sufficient architecture, and presents SEA-Eval -- the first benchmark designed specifically for evaluating SEAs. Grounded in Flywheel theory, SEA-Eval establishes SR and T as primary metrics and, through sequential task stream design, is designed to quantify evolutionary gain, evolutionary stability, and implicit alignment convergence. Empirical evaluation reveals that, under comparable success rates, token consumption differs by up to 31.2 times between frameworks on individual tasks, with divergent evolutionary trajectories emerging under sequential analysis -- demonstrating that success rate alone creates a capability illusion and that the sequential convergence of $T$ is the key criterion for distinguishing genuine evolution from pseudo-evolution.
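The abstract's central point is that two agents can share a success rate while their token use evolves very differently over a task stream. A minimal sketch of that comparison follows, assuming SR is the fraction of solved tasks and T is the per-task token count; the early-versus-late ratio is a crude stand-in for "sequential convergence of T", not the benchmark's actual criterion, and `stream_metrics` is a hypothetical helper.

```python
def stream_metrics(stream):
    """stream: list of (success: bool, tokens: int) in task order.

    Token counts are assumed to be positive.
    """
    n = len(stream)
    if n < 2:
        raise ValueError("need at least two tasks to compare early and late")
    success_rate = sum(1 for ok, _ in stream if ok) / n
    half = n // 2
    t_early = sum(tokens for _, tokens in stream[:half]) / half
    t_late = sum(tokens for _, tokens in stream[half:]) / (n - half)
    return {
        "SR": success_rate,
        "T_early": t_early,
        "T_late": t_late,
        # Below 1.0: later tasks cost fewer tokens than earlier ones.
        "T_ratio": t_late / t_early,
    }


# Two hypothetical agents with identical success rates.
evolving = [(True, 4000), (True, 3000), (True, 2000), (True, 1000)]
static = [(True, 4000), (True, 4000), (True, 4000), (True, 4000)]

m_evolving = stream_metrics(evolving)
m_static = stream_metrics(static)
print(m_evolving["SR"], m_evolving["T_ratio"])
print(m_static["SR"], m_static["T_ratio"])
```

Both streams report the same SR, so only the token trajectory separates the agent that reuses experience from the one that does not.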


2603.29897 2026-05-26 cs.IR cs.AI 版本更新

UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

UniRank: 混合文本-图像候选的端到端领域特定重排序

Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Shikui Tu, Lei Xu

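Affiliations: Shanghai Jiao Tong University; Alibaba Group

AI summary: Proposes UniRank, a vision-language-model-based reranking framework that scores hybrid text-image candidates with a unified score and no modality conversion, and adapts to a domain end to end through instruction tuning and reinforcement-learning-based preference alignment, improving Recall@1 by 8.9% on scientific literature retrieval and 7.3% on design patent search.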

AI中文摘要

重排序是许多信息检索流程中的关键组件。尽管在纯文本场景中取得了显著进展，多模态重排序仍然具有挑战性，尤其是当候选集包含混合文本和图像项时。一个关键难点是模态差距：文本重排序器本质上更接近文本候选而非图像候选，导致跨模态排序存在偏差且次优。视觉语言模型（VLM）通过强大的跨模态对齐缓解了这一差距，并已被用于构建多模态重排序器。然而，大多数基于VLM的重排序器将所有候选编码为图像，将文本视为图像会引入大量计算开销。同时，现有的开源多模态重排序器通常在通用领域数据上训练，在特定领域场景中往往表现不佳。为解决这些限制，我们提出UniRank，一种基于VLM的重排序框架，无需任何模态转换即可原生地对混合文本-图像候选进行评分和排序。基于这种混合评分接口，UniRank提供了端到端的领域适应流程，包括：（1）指令微调阶段，通过将标签令牌似然映射到统一标量分数来学习校准的跨模态相关性评分；（2）硬负样本驱动的偏好对齐阶段，构建领域内成对偏好，并通过基于人类反馈的强化学习（RLHF）进行查询级策略优化。在科学文献检索和设计专利搜索上的大量实验表明，UniRank一致优于最先进的基线，Recall@1分别提高了8.9%和7.3%。

英文摘要

Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
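Stage (1) above maps label-token likelihoods to a single scalar score. The abstract does not say which label tokens or which mapping are used, so the sketch below assumes two label tokens ("yes" and "no") and takes the score to be the softmax probability of "yes" over that pair. The logits are made-up numbers, and `relevance_score` and `rerank` are hypothetical names.

```python
import math


def relevance_score(logit_yes, logit_no):
    """Softmax probability of the assumed 'yes' label token."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def rerank(candidates):
    """candidates: list of (item_id, modality, logit_yes, logit_no).

    Text and image items share one score scale, so they sort together.
    """
    scored = [
        (relevance_score(logit_yes, logit_no), item_id, modality)
        for item_id, modality, logit_yes, logit_no in candidates
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)


candidates = [
    ("doc-1", "text", 1.0, 2.0),
    ("fig-7", "image", 3.0, 0.5),
    ("doc-4", "text", 0.0, 0.0),
]
ranking = rerank(candidates)
for score, item_id, modality in ranking:
    print(f"{item_id} ({modality}): {score:.3f}")
```

Because every candidate receives a score in [0, 1] regardless of modality, no text-to-image conversion is needed before sorting, which is the property the abstract emphasizes.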


2603.25288 2026-05-26 cs.IT cs.AI cs.ET cs.LG eess.SP math.IT 版本更新

CSI-tuples-based 3D Channel Fingerprints Construction Assisted by MultiModal Learning

基于CSI元组的多模态学习辅助3D信道指纹构建

Chenjie Xie, Li You, Ruirong Chen, Gaoning He, Xiqi Gao

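Affiliations: National Mobile Communications Research Laboratory, Southeast University; Purple Mountain Laboratories; Huawei Technologies Co., Ltd.

AI summary: For constructing 3D channel fingerprints in low-altitude communications, proposes a CSI-tuple-based multimodal regression framework that fuses positions, communication measurements, and geographic environment maps to estimate channel state information efficiently and accurately.

Comments 14 pages, 9 figures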

DOI: 10.1109/TWC.2026.3693681
Journal ref: IEEE Transactions on Wireless Communications, vol. 25, pp. 17369-17383, 2026

AI中文摘要

低空通信可以促进空中和地面无线资源的整合，扩大网络覆盖范围，提高传输质量，从而推动第六代（6G）移动通信的发展。作为低空传输的关键技术，3D信道指纹（3D-CF），也称为3D无线电地图或3D信道知识地图，有望增强对通信环境的理解，并辅助获取信道状态信息（CSI），从而避免重复估计并降低计算复杂度。本文提出了一种模块化的多模态框架来构建3D-CF。具体而言，我们首先基于莱斯衰落信道建立了3D-CF模型，将其表示为CSI元组的集合，每个元组包含低空飞行器（LAV）的位置及其对应的统计CSI。考虑到不同先验数据的异构结构，我们将3D-CF构建问题表述为一个多模态回归任务，其中CSI元组中的目标信道信息可以通过其对应的LAV位置、通信测量和地理环境地图直接估计。然后，相应地提出了一种高效的多模态框架，包括基于相关性的多模态融合（Corr-MMF）模块、多模态表示（MMR）模块和CSI回归（CSI-R）模块。数值结果表明，我们提出的框架能够高效地构建3D-CF，并在不同通信场景下比现有算法至少提高27.5%的精度，展示了其竞争性能和出色的泛化能力。我们还分析了计算复杂度，并说明了其在推理时间方面的优越性。

英文摘要

Low-altitude communications can promote the integration of aerial and terrestrial wireless resources, expand network coverage, and enhance transmission quality, thereby empowering the development of sixth-generation (6G) mobile communications. As an enabler for low-altitude transmission, 3D channel fingerprints (3D-CF), also referred to as the 3D radio map or 3D channel knowledge map, are expected to enhance the understanding of communication environments and assist in the acquisition of channel state information (CSI), thereby avoiding repeated estimations and reducing computational complexity. In this paper, we propose a modularized multimodal framework to construct 3D-CF. Specifically, we first establish the 3D-CF model as a collection of CSI-tuples based on Rician fading channels, with each tuple comprising the low-altitude vehicle's (LAV) positions and its corresponding statistical CSI. In consideration of the heterogeneous structures of different prior data, we formulate the 3D-CF construction problem as a multimodal regression task, where the target channel information in the CSI-tuple can be estimated directly by its corresponding LAV positions, together with communication measurements and geographic environment maps. Then, a high-efficiency multimodal framework is proposed accordingly, which includes a correlation-based multimodal fusion (Corr-MMF) module, a multimodal representation (MMR) module, and a CSI regression (CSI-R) module. Numerical results show that our proposed framework can efficiently construct 3D-CF and achieve at least 27.5% higher accuracy than the state-of-the-art algorithms under different communication scenarios, demonstrating its competitive performance and excellent generalization ability. We also analyze the computational complexity and illustrate its superiority in terms of the inference time.


2603.20479 2026-05-26 cs.CY cs.AI cs.CL 版本更新

Profiling learners' affective engagement: Emotion AI, intercultural pragmatics, and language learning

学习者情感投入画像：情感AI、跨文化语用学与语言学习

Robert Godwin-Jones

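Affiliations: Virginia Commonwealth University

AI summary: Examines emotion AI in language learning, in particular how automatic emotion recognition and simulated human responsiveness affect the development of pragmatic and interactional competence, and weighs its benefits for personalized learning against the risk of emotional manipulation.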

DOI: 10.64152/10125/73679
Journal ref: Language Learning & Technology, 30(2), 14-35 (2026)

AI中文摘要

学习另一种语言可能是一个高度情感化的过程，通常以无数大大小小的挫折和成功为特征。对大多数学习者而言，语言学习并非遵循线性、可预测的路径，其曲折进程受动机（或去动机）变量影响，如个人特征、师生关系、学习材料以及对未来第二语言自我的梦想。虽然语言学习的某些方面（阅读、语法）相对机械，但其他方面可能充满压力且不可预测，尤其是用目标语言交谈。这种体验不仅需要结构和词汇知识，还需要以适合社会和文化语境的方式使用语言的能力。AI聊天机器人的出现为练习会话能力提供了新机会，既有优势（响应迅速、无评判），也有缺点（缺乏情感、文化偏见）。本文探讨了技术使用中产生的情感方面，特别是自动情感识别和AI系统中模拟的人类响应如何与语言学习以及语用和互动能力的发展相互作用。情感AI，即算法驱动对用户情感信号的解读，被认为能够实现更个性化的学习，适应感知到的学习者认知和情感状态。其他人则警告情感操纵以及不恰当和无效的用户画像。

英文摘要

Learning another language can be a highly emotional process, typically characterized by numerous frustrations and triumphs, big and small. For most learners, language learning does not follow a linear, predictable path, its zigzag course shaped by motivational (or demotivating) variables such as personal characteristics, teacher/peer relationships, learning materials, and dreams of a future L2 (second language) self. While some aspects of language learning (reading, grammar) are relatively mechanical, others can be stressful and unpredictable, especially conversing in the target language. That experience necessitates not only knowledge of structure and lexis, but also the ability to use the language in ways that are appropriate to the social and cultural context. A new opportunity to practice conversational abilities has arrived through the availability of AI chatbots, with both advantages (responsive, non-judgmental) and drawbacks (emotionally void, culturally biased). This column explores aspects of emotion as they arise in technology use and in particular how automatic emotion recognition and simulated human responsiveness in AI systems interface with language learning and the development of pragmatic and interactional competence. Emotion AI, the algorithmically driven interpretation of users' affective signals, has been seen as enabling greater personalized learning, adapting to perceived learner cognitive and emotional states. Others warn of emotional manipulation and inappropriate and ineffective user profiling.


2603.20334 2026-05-26 cs.SE cs.AI 版本更新

Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2

基于LLM驱动的算法调试的程序化精炼用于ARC-AGI-2

Yu-Ning Qiu, Lin-Feng Zou, Jiong-Da Wang, Xue-Rong Yuan, Wang-Zhou Dai

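Affiliations: Nanjing University

AI summary: Proposes ABPR, a neuro-symbolic refinement method that couples an LLM with a Prolog meta-interpreter and semantically re-checks hypothesised rules through proof-tree-style derivations, reaching 56.67% Pass@2 with Gemini-3-Flash and 98.33% Pass@2 with GPT-5.5 xHigh on the ARC-AGI-2 public evaluation set and extending to RAVEN-style reasoning tasks.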

AI中文摘要

在高复杂度的抽象推理中，系统必须从少量示例或结构化观察中推断出潜在规则，并将其应用于未见实例。LLM可以将此类规则表达为程序，但基于对话的常规精炼主要停留在结果层面：它观察到答案或输出是错误的，而没有正式重新检查是哪个抽象、关系或变换导致了该结果。我们提出基于溯因的程序化精炼（ABPR），一种神经符号精炼方法，它将LLM与Prolog元解释器相结合。ABPR将每个候选程序视为潜在规则的可执行声明性假设，并将其SLD目标-子目标解析具体化为紧凑的证明树风格推导，遵循Shapiro的算法程序调试（APD）。在此视角下，精炼不仅仅是代码级调试，而是对模型假设规则进行语义重检。我们主要在ARC-AGI-2上评估ABPR，这是一个具有挑战性的少样本抽象规则归纳基准，涉及网格变换。使用Gemini-3-Flash的ABPR在公共评估集上达到56.67%的Pass@2，而使用GPT-5.5 xHigh的ABPR达到98.33%的Pass@2。在填空式I-RAVEN-X和A-I-RAVEN改编上的补充实验表明，相同的轨迹引导框架可以扩展到RAVEN风格的关系和类比抽象，而不仅限于ARC特定的网格任务。重复运行和敏感性分析表明，随着搜索广度和总搜索深度的增加，并行轨迹引导搜索减少了随机方差。

英文摘要

In high-complexity abstract reasoning, a system must infer a latent rule from a few examples or structured observations and apply it to unseen instances. LLMs can express such rules as programs, but ordinary conversation-based refinement is largely outcome-level: it observes that an answer or output is wrong without formally re-checking which abstraction, relation, or transformation justified that outcome. We propose Abduction-Based Procedural Refinement (ABPR), a neuro-symbolic refinement approach that couples an LLM with a Prolog meta-interpreter. ABPR treats each candidate program as an executable declarative hypothesis of the latent rule and reifies its SLD goal-subgoal resolution into compact proof-tree-style derivations, following Shapiro's algorithmic program debugging (APD). In this view, refinement is not merely code-level debugging, but semantic re-checking of the model's hypothesised rule. We evaluate ABPR primarily on ARC-AGI-2, a challenging few-shot abstract rule induction benchmark over grid transformations. ABPR with Gemini-3-Flash achieves 56.67% Pass@2, while GPT-5.5 xHigh with ABPR reaches 98.33% Pass@2 on the public evaluation set. Supplementary experiments on fill-in-the-blank I-RAVEN-X and A-I-RAVEN adaptations provide evidence that the same trace-guided framework extends beyond ARC-specific grid tasks to RAVEN-style relational and analogical abstraction. Repeated-run and sensitivity analyses show that parallel trace-guided search reduces stochastic variance as search breadth and total search depth increase.


2603.11583 2026-05-26 cs.CL cs.AI 版本更新

UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Tasks

UtilityMax Prompting：多目标大语言模型任务的形式化框架

Ofir Marom

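Affiliations: Independent Researcher

AI summary: Proposes UtilityMax Prompting, a framework that formalizes multi-objective LLM tasks with an influence diagram and expected-utility maximization, improving precision and NDCG over natural language baselines on the MovieLens 1M dataset.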

AI中文摘要

大语言模型（LLM）任务的成功在很大程度上取决于其提示词。大多数用例使用自然语言指定提示词，当必须同时满足多个目标时，自然语言本质上是模糊的。在本文中，我们引入了UtilityMax Prompting，一个使用形式化数学语言指定任务的框架。我们将任务重构为一个影响图，其中LLM的答案是唯一的决策变量。在图中条件概率分布上定义效用函数，并指示LLM找到最大化期望效用的答案。这迫使LLM明确推理目标的每个组成部分，将其输出导向精确的优化目标，而非主观的自然语言解释。我们在MovieLens 1M数据集上，使用三个前沿模型（Claude Sonnet 4.6、GPT-5.4和Gemini 2.5 Pro）验证了我们的方法，在多目标电影推荐任务中，与自然语言基线相比，在精度和归一化折损累计增益（NDCG）上表现出一致的改进。

英文摘要

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
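In the framework described above, the LLM itself is asked to find the answer that maximises expected utility. To make that target concrete, the sketch below computes it explicitly for a toy two-objective recommendation. Everything in it is an assumption for illustration: the objectives (`liked`, `genre_match`), their probabilities, the independence of the two outcomes, and the weights in `utility` do not come from the paper.

```python
from itertools import product


def expected_utility(probs, utility):
    """probs: dict objective -> P(objective satisfied).

    Outcomes are assumed independent, so each joint outcome's probability
    is the product of the per-objective probabilities.
    """
    names = list(probs)
    total = 0.0
    for outcome in product([True, False], repeat=len(names)):
        p = 1.0
        for name, hit in zip(names, outcome):
            p *= probs[name] if hit else 1.0 - probs[name]
        total += p * utility(dict(zip(names, outcome)))
    return total


def utility(outcome):
    # Hypothetical weights: liking the movie counts twice as much
    # as matching the requested genre.
    return 2.0 * outcome["liked"] + 1.0 * outcome["genre_match"]


candidates = {
    "Movie A": {"liked": 0.9, "genre_match": 0.2},
    "Movie B": {"liked": 0.6, "genre_match": 0.9},
}
best = max(candidates, key=lambda c: expected_utility(candidates[c], utility))
print(best)
```

Movie A is the better choice on the first objective alone, but the combined objective prefers Movie B. A natural-language prompt such as "recommend something they will like in the right genre" leaves that trade-off unstated, which is the ambiguity the abstract refers to.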


2603.06626 2026-05-26 cs.LG cs.AI 版本更新

Grouter: Decoupling Routing from Representation for Accelerated MoE Training

Grouter: 将路由与表示解耦以加速MoE训练

Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, Kun Yuan

Affiliations: School of Mathematical Sciences, Peking University, Beijing, China; Center for Machine Learning Research, Peking University, Beijing, China; Yuanpei College, Peking University, Beijing, China; Zhejiang Lab, Hangzhou, China

AI summary: Proposes Grouter, which distills high-quality routing structures from fully trained MoE models into a fixed router, decoupling structural optimization from weight updates; it improves pre-training data utilization by 4.28x and training throughput by up to 33.5%.

AI中文摘要

传统的混合专家（MoE）训练通常没有任何结构先验，实际上要求模型在训练专家权重的同时，在巨大的组合空间中搜索最优路由策略。这种纠缠常常导致收敛缓慢和训练不稳定。本文介绍了Grouter，一种先发制人的路由方法，通过从完全训练的MoE模型中蒸馏高质量结构，并作为目标模型的固定路由器。通过将结构优化与权重更新解耦，Grouter显著加速了模型收敛的速度和质量。为了确保框架的通用性，我们还引入了专家折叠以适应不同模型配置的Grouter，以及专家调优以重新平衡不同数据分布下的工作负载。此外，通过利用先发制人路由提供的结构先验，我们可以实施有针对性的优化以进一步提高训练吞吐量。实验表明，Grouter实现了卓越的性能和效率，将预训练数据利用率提高了4.28倍，并实现了高达33.5%的吞吐量加速，确立了先发制人路由作为可扩展MoE训练的基本范式。我们在https://github.com/JimmyAwoe/Grouter公开了我们的代码和预训练的Grouter检查点。

英文摘要

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultaneously train expert weights while searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that distills high-quality structures from fully trained MoE models and uses them as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly improves both the speed and the quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency, boosting pre-training data utilization by 4.28x and achieving up to 33.5% throughput acceleration, establishing preemptive routing as a fundamental paradigm for scalable MoE training. We publicly release our code and pretrained Grouter checkpoints at https://github.com/JimmyAwoe/Grouter.


2603.05450 2026-05-26 cs.AI cs.CL 版本更新

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

分布式部分信息谜题：在认知不对称下检验共同基础的构建

Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

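Affiliations: Brandeis University; Colorado State University

AI summary: Introduces the Distributed Partial Information Puzzle (DPIP) task, collects a multimodal dataset, and evaluates large language models and a Dynamic Epistemic Logic pipeline on tracking belief states and common ground construction.

Comments 10 pages, 4 figures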

Journal ref: Proceedings of COLING-LREC 2026

AI中文摘要

建立共同基础（一组共享的信念和相互认可的事实）对于协作至关重要，但仍然是当前AI系统面临的挑战，尤其是在多模态、多方设置中，协作者带来不同的信息。我们引入了分布式部分信息谜题（DPIP），这是一个协作构建任务，在认知不对称下引发丰富的多模态交流。我们提供了这些交互的多模态数据集，并在语音、手势和动作模态上进行注释和时间对齐，以支持对命题内容和信念动态的推理。然后，我们评估了两种建模共同基础（CG）的范式：（1）最先进的大语言模型（LLMs），被提示从多模态更新中推断共享信念，以及（2）基于动态认知逻辑（DEL）的公理流水线，逐步执行相同的任务。在注释的DPIP数据上的结果表明，它对现代LLMs跟踪任务进展和信念状态的能力构成了挑战。

英文摘要

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.


2602.23916 2026-05-26 cs.CV cs.AI 版本更新

Topology-Driven Transferability Estimation of Medical Foundation Models for Segmentation

基于拓扑驱动的医学基础模型分割迁移性估计

Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang, Jiaying Zhou, Yang Liu, Qingchao Chen

Affiliations: Peking University; Hohai University; Beijing Normal University-Hong Kong Baptist University United International College; National Institute of Health Data Science, Peking University; Institute of Medical Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University

AI summary: Proposes a topology-driven transferability estimation framework that combines Global Representation Topology Divergence, Local Boundary-Aware Topological Consistency, and Task-Adaptive Fusion to select medical foundation models efficiently without fine-tuning, with a relative improvement of about 31% in the weighted Kendall metric on the OpenMind benchmark.

AI中文摘要

大规模自监督学习（SSL）的出现产生了大量的医学基础模型。然而，为特定分割任务选择最优的医学基础模型仍然是一个计算瓶颈。现有的迁移性估计（TE）指标主要针对分类任务设计，依赖于全局统计假设，无法捕捉密集预测所需的拓扑复杂性。我们提出了一种新颖的拓扑驱动迁移性估计框架，评估流形可处理性而非统计重叠。我们的方法引入了三个组成部分：（1）全局表示拓扑散度（GRTD），利用最小生成树量化特征-标签结构同构性；（2）局部边界感知拓扑一致性（LBTC），专门在关键解剖边界评估流形可分离性；（3）任务自适应融合，根据目标任务的语义基数动态整合全局和局部指标。在跨不同解剖目标和SSL基础模型的大规模OpenMind基准上验证，我们的方法在加权Kendall指标上显著优于最先进的基线，相对提升约31%，提供了一种鲁棒的、无需训练的代理，用于高效模型选择而无需微调成本。代码将在接收后公开。

英文摘要

The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in the weighted Kendall metric, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.


2602.18956 2026-05-26 cs.AI 版本更新

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

INDUCTION: 一阶逻辑中的有限结构概念合成

Serafim Batzoglou

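Affiliations: Independent Researcher

AI summary: Introduces INDUCTION, a benchmark for concept synthesis over finite structures in first-order logic that verifies formula correctness by exact model checking, and finds that less redundant formulas generalize better to unseen worlds.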

2602.12224 2026-05-26 cs.GT cs.AI econ.TH 版本更新

Two-Sided Time-Independent Regret for Matching Markets with Limited Interviews

有限面试匹配市场的双面时间无关遗憾

Amirmahdi Mirfakhar, Xuchuang Wang, Mengfan Xu, Hedyeh Beyhaghi, Mohammad Hajiesmaili

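Affiliations: University of Massachusetts Amherst

AI summary: For matching markets with a limited number of interviews, proposes two-sided learning that uses interviews as hints and strategic deferral to correct early mistakes, achieving horizon-independent regret bounds.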

AI中文摘要

双面匹配平台依赖双方的偏好，但参与者只能评估一小部分潜在伙伴。在实践中，他们使用低成本的匹配前筛选（例如面试、个人资料浏览或试用任务）在提交申请和录用之前形成有噪声的印象。我们研究了带有面试的匹配市场中的赌博机学习，将这些交互建模为查询的提示（hints）~\citep{DBLP:conf/innovations/BhaskaraGIKM23}，这些提示向双方揭示部分偏好信息，同时限制后续申请。我们的框架还允许企业方的不确定性：企业像代理人一样学习自己的偏好，并可能犯早期招聘错误。为了解决这个问题，我们引入了策略性延迟（strategic deferral），这是一种企业方行动，允许临时空缺，纠正过早的承诺，并在粗略匿名反馈下实现去中心化学习。我们为中心化和去中心化市场设计了算法，并表明每轮恒定数量的面试足以实现与时间无关的遗憾，优于已知没有面试时的$O(\log T)$保证。我们的界是接近最优的：中心化保证在信息论下界的$m$倍以内，而去中心化算法在结构化市场中达到多项式因子，在一般市场中仍然与时间无关。

英文摘要

Two-sided matching platforms rely on preferences from both sides, yet participants can evaluate only a small fraction of potential partners. In practice, they use low-cost pre-match screening, e.g., interviews, profile views, or trial tasks, to form noisy impressions before committing to applications and offers. We study bandit learning in matching markets with interviews, modeling these interactions as queried hints (Bhaskara et al., 2023) that reveal partial preference information to both sides while constraining subsequent applications. Our framework also allows firm-side uncertainty: firms, like agents, learn their preferences and may make early hiring mistakes. To address this, we introduce strategic deferral, a firm-side action that permits temporary vacancy, corrects premature commitments, and enables decentralized learning under coarse anonymous feedback. We design algorithms for centralized and decentralized markets and show that a constant number of interviews per round suffices for horizon-independent regret, improving over the $O(\log T)$ guarantees known without interviews. Our bounds are near-optimal: the centralized guarantee is within a factor $m$ of an information-theoretic lower bound, while decentralized algorithms match it up to polynomial factors in structured markets and remain horizon-independent in general markets.


2602.04360 2026-05-26 cs.LG cs.AI cs.CY 版本更新

Counterfactual Explanations for Hypergraph Neural Networks

超图神经网络的反事实解释

Fabiano Veglianti, Lorenzo Antonelli, Gabriele Tolomei

Affiliations: Department of Computer Control and Management Engineering, Sapienza University; Department of Computer Science, Sapienza University

AI summary: Proposes CF-HyperGNNExplainer, which generates counterfactual hypergraphs through minimal structural changes to explain the predictions of hypergraph neural networks.

2602.02605 2026-05-26 cs.NE cs.AI cs.CL q-bio.NC 版本更新

Fine-Tuning Language Models to Know What They Know

微调语言模型使其了解自身所知

Sangjun Park, Elliot Meyerson, Xin Qiu, Risto Miikkulainen

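Affiliations: The University of Texas at Austin; Cognizant AI Lab

AI summary: Proposes a framework that uses an evolution-strategy-based alignment method (ESMA) to improve the metacognitive ability of large language models while controlling for bias, and shows robust generalization to unseen datasets, languages, and new knowledge.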

Comments Preprint

2602.02544 2026-05-26 cs.LG cs.AI 版本更新

SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models

SPA-Cache: 扩散语言模型中的自适应缓存奇异代理

Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Yongcheng Jing, Dacheng Tao

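Affiliations: College of Computing and Data Science, Nanyang Technological University, Singapore

AI summary: Diffusion language models cannot use standard KV caching because they are non-causal, which makes decoding expensive. SPA-Cache identifies update-critical tokens with a low-dimensional singular proxy and allocates the update budget adaptively across layers, yielding up to 8x higher throughput than vanilla decoding and a 2-4x speedup over existing caching baselines.

Comments Accepted by ICML 2026. The code repository is available at https://github.com/wenhao728/spa-cache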

AI中文摘要

尽管扩散语言模型（DLM）为自回归范式提供了一种灵活、任意顺序的替代方案，但其非因果特性排除了标准的KV缓存，迫使在每个解码步骤进行昂贵的隐藏状态重新计算。现有的DLM缓存方法通过选择性隐藏状态更新来降低这一成本；然而，它们仍然受限于（i）昂贵的逐令牌更新识别启发式方法和（ii）僵化的统一预算分配，未能考虑异构的隐藏状态动态。为了解决这些挑战，我们提出了SPA-Cache，它在DLM缓存中联合优化了更新识别和预算分配。首先，我们推导出一个低维奇异代理，能够在低维子空间中识别更新关键令牌，大幅降低更新识别的开销。其次，我们引入一种自适应策略，在不降低生成质量的情况下，为稳定层分配更少的更新。这些贡献共同显著提高了DLM的效率，相比原始解码实现了高达8倍的吞吐量提升，相比现有缓存基线实现了2-4倍的加速。

英文摘要

While Diffusion Language Models (DLMs) offer a flexible, arbitrary-order alternative to the autoregressive paradigm, their non-causal nature precludes standard KV caching, forcing costly hidden state recomputation at every decoding step. Existing DLM caching approaches reduce this cost through selective hidden state updates; however, they are still limited by (i) costly token-wise update identification heuristics and (ii) rigid, uniform budget allocation that fails to account for heterogeneous hidden state dynamics. To address these challenges, we present SPA-Cache, which jointly optimizes update identification and budget allocation in the DLM cache. First, we derive a low-dimensional singular proxy that enables the identification of update-critical tokens in a low-dimensional subspace, substantially reducing the overhead of update identification. Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality. Together, these contributions significantly improve the efficiency of DLMs, yielding up to an $8\times$ throughput improvement over vanilla decoding and a $2$--$4\times$ speedup over existing caching baselines.
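The first contribution, finding update-critical tokens in a low-dimensional subspace, can be illustrated with a generic low-rank change detector. This is not the paper's derivation: the sketch assumes a small set of proxy directions is already available (for example, leading singular vectors of some weight matrix) and ranks tokens by how far their projection onto those directions has moved since the cached step. The function names and the toy vectors are hypothetical.

```python
def project(vec, basis):
    """Coordinates of vec along each proxy direction in basis."""
    return [sum(v * b for v, b in zip(vec, direction)) for direction in basis]


def tokens_to_update(cached_inputs, new_inputs, basis, budget):
    """Pick the `budget` token positions whose projected input moved most.

    Only len(basis) coordinates are compared per token, rather than the
    full hidden dimension.
    """
    drift = []
    for idx, (old, new) in enumerate(zip(cached_inputs, new_inputs)):
        p_old, p_new = project(old, basis), project(new, basis)
        change = sum((a - b) ** 2 for a, b in zip(p_old, p_new))
        drift.append((change, idx))
    drift.sort(reverse=True)
    return sorted(idx for _, idx in drift[:budget])


# Toy example: hidden size 4, two proxy directions.
basis = [[1, 0, 0, 0], [0, 1, 0, 0]]
cached = [[1, 0, 5, 5], [0, 1, 0, 0], [2, 2, 0, 0]]
current = [[1, 0, 9, 9], [0, 3, 0, 0], [2.5, 2, 0, 0]]

selected = tokens_to_update(cached, current, basis, budget=2)
print(selected)
```

Token 0 changes only outside the proxy subspace and is therefore skipped, which shows both the saving and the approximation involved: the choice of directions determines which changes are visible to the proxy.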


2602.02474 2026-05-26 cs.CL cs.AI cs.LG 版本更新

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill：面向自进化智能体的可学习与进化记忆技能

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang

Affiliations: Nanyang Technological University; University of Illinois Urbana-Champaign; University of Illinois Chicago; Tsinghua University

AI summary: Proposes MemSkill, which recasts memory operations as learnable, evolvable skills: a controller selects skills, an executor produces memories, and a designer evolves the skill set, forming a closed loop that improves LLM agent task performance.

Comments Code is available at https://github.com/ViktorAxelsen/MemSkill

AI中文摘要

大多数大语言模型（LLM）智能体记忆系统依赖少量静态、手工设计的操作来提取记忆。这些固定程序硬编码了关于存储内容和如何修订记忆的人类先验知识，使其在多样化的交互模式下僵化，并在长历史记录上效率低下。为此，我们提出MemSkill，将这些操作重新定义为可学习和可进化的记忆技能，即从交互轨迹中提取、整合和修剪信息的结构化可重用例程。受智能体技能设计哲学的启发，MemSkill采用一个控制器，学习选择少量相关技能，并与基于LLM的执行器配对，生成技能引导的记忆。除了学习技能选择，MemSkill引入一个设计者，定期审查所选技能产生错误或不完整记忆的困难案例，并通过提出改进和新技能来进化技能集。共同地，MemSkill形成了一个闭环流程，改进了技能选择策略和技能集本身。在LoCoMo、LongMemEval、HotpotQA和ALFWorld上的实验表明，MemSkill在强基线上提高了任务性能，并在不同设置下具有良好的泛化能力。进一步分析揭示了技能如何进化，为LLM智能体更自适应、自进化的记忆管理提供了见解。

英文摘要

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.


2601.22925 2026-05-26 cs.IR cs.AI cs.LG 版本更新

BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models

BEAR：面向大语言模型推荐中束搜索感知的优化

Weiqin Yang, Bohao Wang, Zhenxiang Xu, Jiawei Chen, Shengjia Zhang, Jingbang Chen, Canghong Jin, Can Wang

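Affiliations: Zhejiang University; The Chinese University of Hong Kong, Shenzhen; Hangzhou City University

AI summary: To address the inconsistency between supervised fine-tuning and beam-search inference, proposes BEAR, a regularizer that keeps every token of a positive item within the top-B candidates at each decoding step to avoid premature pruning, significantly outperforming strong baselines on four real-world datasets.

Comments Accepted by SIGIR 2026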

DOI: 10.1145/3805712.3809533

AI中文摘要

近年来，利用大语言模型（LLM）进行推荐的研究迅速增长。这些方法通常采用监督微调（SFT）使LLM适应推荐场景，并在推理时使用束搜索高效检索前B个推荐项。然而，我们发现了关键的训练-推理不一致性：虽然SFT优化正例的整体概率，但即使这些项具有高整体概率，也不能保证它们会被束搜索检索到。由于贪心剪枝机制，束搜索可能会在正例的前缀概率不足时过早丢弃它。为了解决这种不一致性，我们提出了BEAR（束搜索感知正则化），一种新的微调目标，在训练中显式考虑束搜索行为。BEAR不直接模拟每个训练实例的束搜索（计算代价过高），而是强制执行一个宽松的必要条件：正例中的每个token在每个解码步骤中必须排在前B个候选token中。该目标有效降低了错误剪枝的风险，同时与标准SFT相比仅增加可忽略的计算开销。在四个真实世界数据集上的大量实验表明，BEAR显著优于强基线。代码可在https://github.com/Tiny-Snow/BEAR-SIGIR-2026获取。

英文摘要

Recent years have seen a rapid surge in research leveraging Large Language Models (LLMs) for recommendation. These methods typically employ supervised fine-tuning (SFT) to adapt LLMs to recommendation scenarios, and utilize beam search during inference to efficiently retrieve $B$ top-ranked recommended items. However, we identify a critical training-inference inconsistency: while SFT optimizes the overall probability of positive items, it does not guarantee that such items will be retrieved by beam search even if they possess high overall probabilities. Due to the greedy pruning mechanism, beam search can prematurely discard a positive item once its prefix probability is insufficient. To address this inconsistency, we propose BEAR (Beam-SEarch-Aware Regularization), a novel fine-tuning objective that explicitly accounts for beam search behavior during training. Rather than directly simulating beam search for each instance during training, which is computationally prohibitive, BEAR enforces a relaxed necessary condition: each token in a positive item must rank within the top-$B$ candidate tokens at each decoding step. This objective effectively mitigates the risk of incorrect pruning while incurring negligible computational overhead compared to standard SFT. Extensive experiments across four real-world datasets demonstrate that BEAR significantly outperforms strong baselines. Code is available at https://github.com/Tiny-Snow/BEAR-SIGIR-2026 .
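The relaxed necessary condition stated above can be checked directly: at each decoding step, the positive item's token must be among the $B$ most probable next tokens given the item's own prefix. The sketch below implements only that check, not the BEAR training objective, and the token distributions are made-up numbers; `survives_top_b` is a hypothetical helper.

```python
def survives_top_b(step_probs, target_tokens, beam_width):
    """Check the token-level top-B condition for one positive item.

    step_probs[t] maps each candidate token to its probability at decoding
    step t, conditioned on the positive item's first t tokens.
    """
    for probs, target in zip(step_probs, target_tokens):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if target not in ranked[:beam_width]:
            return False  # beam search would have pruned the item here
    return True


# A positive item encoded as two tokens.
target_tokens = ["12", "7"]
step_probs = [
    {"3": 0.5, "9": 0.3, "12": 0.2},  # step 0: the target ranks third
    {"7": 0.95, "1": 0.05},           # step 1: the target ranks first
]

print(survives_top_b(step_probs, target_tokens, beam_width=2))
print(survives_top_b(step_probs, target_tokens, beam_width=3))
```

With a beam width of 2 the item fails at the first step even though its second token is nearly certain, so its sequence-level probability never gets a chance to matter. This is the training-inference inconsistency the abstract describes.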


2601.22709 2026-05-26 cs.CV cs.AI 版本更新

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

基于置信度蒸馏的门控关系对齐用于高效视觉语言模型

Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li

Affiliations: Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland; Qualcomm AI Research, Amsterdam, the Netherlands; Department of Electrical, Electronic and Information Engineering, University of Bologna, Bologna, Italy; School of Electrical and Electronic Engineering

AI summary: Proposes GRACE, which unifies knowledge distillation and quantization-aware training under the Information Bottleneck principle using confidence-gated decoupled distillation, relational centered kernel alignment, and an adaptive controller; its INT4 models outperform FP16 baselines and nearly match the teacher, with 3x throughput and 54% less memory.

Comments Accepted to the International Conference on Machine Learning (ICML 2026)

AI中文摘要

视觉语言模型（VLM）具有强大的多模态性能，但部署成本高，且训练后量化通常会导致显著的精度损失。尽管有潜力，但针对VLM的量化感知训练仍未得到充分探索。我们提出GRACE，一个在信息瓶颈原则下统一知识蒸馏和量化感知训练的框架：量化约束信息容量，而蒸馏指导在此预算内保留什么。将教师视为任务相关信息的代理，我们引入置信度门控解耦蒸馏以过滤不可靠的监督，关系中心核对齐以传递视觉标记结构，以及通过拉格朗日松弛实现的自适应控制器以平衡保真度与容量约束。在LLaVA和Qwen系列的大量基准测试中，我们的INT4模型始终优于FP16基线（例如，LLaVA-1.5-7B：SQA上70.1 vs. 66.8；Qwen2-VL-2B：MMBench上76.9 vs. 72.6），几乎匹配教师性能。使用真实的INT4内核，我们实现了3倍的吞吐量，内存减少54%。这一原则性框架显著优于现有量化方法，使GRACE成为资源受限部署的有力解决方案。代码和数据可在https://github.com/ForeverBlue816/GRACE获取。

英文摘要

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using a real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment. Code and data are available at: https://github.com/ForeverBlue816/GRACE.

2601.21601 2026-05-26 cs.LG cs.AI 版本更新

Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

动力学揭示结构：挑战线性传播假设

Hoyeon Chang, Bálint Mucsányi, Seong Joon Oh

发表机构 * University of Tübingen（图宾根大学）

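AI summary: Uses relation algebra to study the geometric limits of the Linear Propagation Assumption in neural networks. The assumption is shown to be feasible for the involutive operations (negation and converse) but to face a fundamental obstruction for composition that forces the feature map to collapse, which points to a common root for knowledge-editing failures, the reversal curse, and multi-hop reasoning failures.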

详情

AI中文摘要

神经网络通过一阶参数更新进行自适应，但尚不清楚这种更新是否保持逻辑一致性。我们研究了线性传播假设（LPA）的几何极限，该假设认为局部更新能够连贯地传播到逻辑结论。为了形式化这一点，我们采用关系代数，研究关系的三种核心运算：否定翻转真值、逆交换参数顺序、组合链接关系。对于否定和逆，我们证明保证与方向无关的一阶传播需要一种张量分解，将实体对上下文与关系内容分离。然而，对于组合，我们识别出一个根本性障碍。我们证明组合可归结为合取，并证明任何在线性特征上良好定义的合取必须是双线性的。由于双线性与否定不兼容，这迫使特征映射崩溃。这些结果表明，知识编辑失败、反转诅咒和多跳推理可能源于LPA固有的共同结构限制。

英文摘要

Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We investigate the geometric limits of the Linear Propagation Assumption (LPA), the premise that local updates coherently propagate to logical consequences. To formalize this, we adopt relation algebra and study three core operations on relations: negation flips truth values, converse swaps argument order, and composition chains relations. For negation and converse, we prove that guaranteeing direction-agnostic first-order propagation necessitates a tensor factorization separating entity-pair context from relation content. However, for composition, we identify a fundamental obstruction. We show that composition reduces to conjunction, and prove that any conjunction well-defined on linear features must be bilinear. Since bilinearity is incompatible with negation, this forces the feature map to collapse. These results suggest that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the LPA.

2601.21463 2026-05-26 cs.SD cs.AI 版本更新

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

通过先验增强的音频大语言模型统一语音编辑检测与内容定位

Jun Xue, Yi Chai, Yanzhen Ren, Jinshen He, Zhiqiang Tang, Zhuolin Yi, Yihuan Huang, Yuankun Xie, Yujie Chen

发表机构 * Key Laboratory of Aerospace Information Security（航空信息安全与可信计算重点实验室）； School of Cyber Science and Engineering（网络安全工程学院）； Wuhan University（武汉大学）； Independent Researcher（独立研究员）； School of Computer Science and Technology（计算机科学与技术学院）； Anhui University（安徽大学）； Communication University of China（中国通信大学）； Beihang University（北京航空航天大学）

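AI summary: Proposes a unified framework based on Audio Large Language Models that jointly handles speech editing detection and content localization through a generative formulation, and introduces a prior-enhanced prompting strategy and an acoustic consistency-aware loss to improve performance.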

详情

AI中文摘要

现有的语音编辑检测（SED）数据集主要使用手动拼接或有限的编辑操作构建，导致多样性受限且对真实编辑场景的覆盖不足。同时，当前的SED方法严重依赖帧级监督来检测可观察的声学异常，这从根本上限制了它们处理删除型编辑的能力，其中被操纵的内容完全从信号中消失。为了解决这些挑战，我们提出了一个统一框架，通过基于音频大语言模型（Audio LLMs）的生成式公式，将语音编辑检测和内容定位连接起来。我们首先引入了AiEdit（https://huggingface.co/datasets/JunXueTech/AiEdit），这是一个大规模双语数据集（约140小时），使用最先进的端到端语音编辑系统覆盖添加、删除和修改操作，为现代威胁提供了更真实的基准。在此基础上，我们将SED重新定义为结构化文本生成任务，实现了对编辑类型识别和内容定位的联合推理。为了增强生成模型在声学证据中的基础，我们提出了一种先验增强的提示策略，注入从帧级检测器导出的词级概率线索。此外，我们引入了一种声学一致性感知损失，在潜在空间中明确强制正常和异常声学表示之间的分离。实验结果表明，所提出的方法在检测和定位任务上均持续优于现有方法。

英文摘要

Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulting in restricted diversity and poor coverage of realistic editing scenarios. Meanwhile, current SED methods rely heavily on frame-level supervision to detect observable acoustic anomalies, which fundamentally limits their ability to handle deletion-type edits, where the manipulated content is entirely absent from the signal. To address these challenges, we present a unified framework that bridges speech editing detection and content localization through a generative formulation based on Audio Large Language Models (Audio LLMs). We first introduce AiEdit (https://huggingface.co/datasets/JunXueTech/AiEdit), a large-scale bilingual dataset (approximately 140 hours) that covers addition, deletion, and modification operations using state-of-the-art end-to-end speech editing systems, providing a more realistic benchmark for modern threats. Building upon this, we reformulate SED as a structured text generation task, enabling joint reasoning over edit type identification and content localization. To enhance the grounding of generative models in acoustic evidence, we propose a prior-enhanced prompting strategy that injects word-level probabilistic cues derived from a frame-level detector. Furthermore, we introduce an acoustic consistency-aware loss that explicitly enforces the separation between normal and anomalous acoustic representations in the latent space. Experimental results demonstrate that the proposed approach consistently outperforms existing methods across both detection and localization tasks.

2601.15544 2026-05-26 cs.LG cs.AI 版本更新

RDumb++: Drift-Aware Continual Test-Time Adaptation

RDumb++：漂移感知的持续测试时自适应

Himanshu Mishra

发表机构 * Department of Computer Science（计算机科学系）； University of British Columbia（不列颠哥伦比亚大学）

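AI summary: Addresses performance collapse in continual test-time adaptation when the distribution changes rapidly or drifts over long horizons. The proposed RDumb++ combines entropy-based and KL-divergence drift detection with adaptive reset strategies and achieves roughly 3% absolute accuracy gains on the CCC benchmark.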

详情

AI中文摘要

持续测试时自适应（CTTA）旨在仅使用传入的无标签数据流在部署期间更新预训练模型。尽管先前的方法如Tent、EATA等在短期演化偏移下提供了有意义的改进，但当测试分布快速变化或时间跨度极长时，它们表现不佳。CCC基准测试体现了这一挑战，模型在包含750万样本且不断变化损坏类型和严重程度的数据流上运行。我们提出RDumb++，它是RDumb的合理扩展，引入了两种漂移检测机制，即基于熵的漂移评分和KL散度漂移评分，以及自适应重置策略。这些机制使模型能够检测累积的自适应何时变得有害，并在预测崩溃发生前恢复。在包含三种速度和三种种子的CCC-medium（九次运行，每次包含一百万样本）上，RDumb++始终优于RDumb，在整个数据流中实现约3%的绝对准确率提升，同时保持稳定的自适应。关于漂移阈值和重置强度的消融实验进一步表明，漂移感知重置对于防止崩溃和实现可靠的长期CTTA至关重要。

英文摘要

Continual Test-Time Adaptation (CTTA) seeks to update a pretrained model during deployment using only the incoming, unlabeled data stream. Although prior approaches such as Tent and EATA provide meaningful improvements under short evolving shifts, they struggle when the test distribution changes rapidly or over extremely long horizons. This challenge is exemplified by the CCC benchmark, where models operate over streams of 7.5M samples with continually changing corruption types and severities. We propose RDumb++, a principled extension of RDumb that introduces two drift-detection mechanisms, i.e., entropy-based drift scoring and KL-divergence drift scoring, together with adaptive reset strategies. These mechanisms allow the model to detect when accumulated adaptation becomes harmful and to recover before prediction collapse occurs. Across CCC-medium with three speeds and three seeds (nine runs, each containing one million samples), RDumb++ consistently surpasses RDumb, yielding approximately 3% absolute accuracy gains while maintaining stable adaptation throughout the entire stream. Ablation experiments on drift thresholds and reset strengths further show that drift-aware resetting is essential for preventing collapse and achieving reliable long-horizon CTTA.
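The two drift signals named in this abstract (an entropy-based score and a KL-divergence score) can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the exact scoring rules, so the reference distribution, the use of an absolute entropy shift, the thresholds, and all function names below are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_distribution(batch_probs):
    """Average predicted distribution over a batch."""
    n = len(batch_probs)
    return [sum(column) / n for column in zip(*batch_probs)]


def drift_scores(batch_probs, reference_probs):
    """Return (entropy shift, KL shift) of a batch relative to a reference."""
    mean_entropy = sum(entropy(p) for p in batch_probs) / len(batch_probs)
    entropy_shift = abs(mean_entropy - entropy(reference_probs))
    kl_shift = kl_divergence(mean_distribution(batch_probs), reference_probs)
    return entropy_shift, kl_shift


def should_reset(batch_probs, reference_probs,
                 entropy_threshold=0.3, kl_threshold=0.1):
    """Hypothetical reset rule: trigger when either drift score is too large."""
    entropy_shift, kl_shift = drift_scores(batch_probs, reference_probs)
    return entropy_shift > entropy_threshold or kl_shift > kl_threshold


reference = [0.5, 0.5]                    # assumed reference class marginal
stable = [[0.6, 0.4], [0.4, 0.6]]         # predictions still balanced
collapsed = [[0.99, 0.01], [0.98, 0.02]]  # predictions collapsing to one class
print(should_reset(stable, reference), should_reset(collapsed, reference))
```

A reset here would mean restoring the adapted weights to the pretrained ones, as in RDumb; how strongly to reset is one of the design choices the abstract says is ablated.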

2601.06870 2026-05-26 cs.LG cs.AI 版本更新

QASA: Quality-Aware Semantic Augmentation for Robust Multimodal Sentiment Analysis

QASA: 面向鲁棒多模态情感分析的质量感知语义增强

Jiazhang Liang, Jianheng Dai, Miaosen Luo, Menghua Jiang, Sijie Mai

发表机构 * School of Computer Science, South China Normal University（华南师范大学计算机学院）

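AI summary: Proposes the QASA framework, which uses diffusion models to generate augmented visual and auditory samples and assigns training weights through a decoupled quality-aware scoring module, addressing the scarcity of high-quality data and improving the robustness and generalization of multimodal sentiment analysis.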

Comments 11 pages, 4 figures

详情

AI中文摘要

多模态大语言模型在多模态情感分析中展现出强大的语义表示能力。然而，由于高质量训练数据的稀缺，它们学习稳定且可泛化的多模态特征的能力受到限制。为了解决这一问题，我们提出了QASA（质量感知语义增强），该方法使用扩散模型生成增强的视觉和听觉样本，从而扩大训练数据集并支持多模态学习。生成的样本质量可能参差不齐，并可能出现跨模态不一致。为此，我们引入了一个解耦的质量感知评分模块，根据每个增强样本的可靠性分配训练权重。这种方法减少了低质量数据的影响，有助于更稳定和鲁棒的模型训练。该框架结合了扩散模型的生成能力和多模态大模型的语义推理能力，提供了一种无需人工标注的自动数据增强策略，同时在有限高质量数据下提高了泛化性和鲁棒性。在CH-SIMS数据集上的实验表明，QASA在五类准确率（Acc5）和二类准确率（Acc2）上分别相对提升了18.0%和5.9%，并且在CMU-MOSI和MUStARD基准测试上也优于现有方法。

英文摘要

Multimodal large language models have demonstrated strong ability in capturing semantic representations for multimodal sentiment analysis. Their capacity to learn stable and generalizable multimodal features is limited, however, by the scarcity of high-quality training data. To address this, we propose QASA (Quality-Aware Semantic Augmentation), which uses diffusion models to generate augmented visual and auditory samples, thereby enlarging the training dataset and supporting multimodal learning. The generated samples can vary in quality and may exhibit cross-modal inconsistencies. To manage this, we introduce a decoupled quality-aware scoring module that assigns training weights based on the reliability of each augmented sample. This approach reduces the influence of low-quality data and contributes to more stable and robust model training. The framework combines the generative capabilities of diffusion models with the semantic reasoning of multimodal large models, providing an automated data augmentation strategy that does not require human annotation while improving generalization and robustness under limited high-quality data. Experiments on the CH-SIMS dataset show that QASA yields a relative increase of 18.0% and 5.9% in five-class accuracy (Acc5) and binary accuracy (Acc2), respectively, and it also outperforms existing methods on the CMU-MOSI and MUStARD benchmarks.

2601.03014 2026-05-26 cs.CL cs.AI 版本更新

Error-Driven Prompt Optimization for Arithmetic Reasoning

Árpád Pándy, Róbert Lakatos, András Hajdu

发表机构 * Department of Data Science & Visualization, Faculty of Informatics, University of Debrecen（数据科学与可视化系，信息学院，德布勒森大学）

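AI summary: Proposes an error-driven prompt optimization framework that clusters erroneous predictions to refine prompt rules iteratively, raising a small on-premises language model to 70.8% accuracy on arithmetic reasoning tasks and surpassing GPT-3.5 Turbo.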

详情

DOI: 10.1109/ACCESS.2026.3685125
Journal ref: IEEE Access, vol. 14, pp. 62570-62583, 2026

AI中文摘要

人工智能的最新进展激发了人们对工业代理的兴趣，这些代理能够在表格数据工作流中支持金融和医疗等受监管领域的分析师。此类系统的关键能力是对结构化数据执行准确的算术运算，同时确保敏感信息永远不会离开安全的本地环境。在此，我们引入了一种用于算术推理的错误驱动优化框架，该框架增强了代码生成代理（CGA），特别应用于本地小型语言模型（SLM）。通过对领先的SLM（Qwen3 4B）进行系统评估，我们发现虽然基础模型在算术任务中表现出基本局限性，但我们提出的错误驱动方法通过聚类错误预测来迭代优化提示规则，显著提升了性能，将模型准确率提高到70.8%。我们的结果表明，开发可靠、可解释且可工业部署的AI助手不仅可以通过昂贵的微调实现，还可以通过系统的、错误驱动的提示优化来实现，从而使小型模型以符合隐私要求的方式超越大型语言模型（GPT-3.5 Turbo）。

英文摘要

Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), specifically applied to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to refine prompt rules iteratively, dramatically improves performance, elevating the model's accuracy to 70.8%. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (GPT-3.5 Turbo) in a privacy-compliant manner.

2512.10961 2026-05-26 cs.HC cs.AI 版本更新

AI as Equalizer or Amplifier? Task Complexity as the Moderating Factor for Human Expertise in Hybrid Intelligence Systems

AI是均衡器还是放大器？任务复杂性作为混合智能系统中人类专业知识的调节因素

Tao An

发表机构 * Hawaii Pacific University（夏威夷太平洋大学）

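AI summary: Argues that AI equalizes performance on routine tasks but amplifies the gap between experts and novices on complex tasks, and presents a framework of human contribution layers and engagement levels in which domain expertise, rather than prompt engineering skill, determines amplification effectiveness.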

Comments 9 pages, 3 figures, 1 table. v2 matches the camera-ready version accepted at HHAI 2026. Removed v1 aggregated projections (training timeline figure, n=580). Empirical basis is structured field observations of 10 to 20 colleagues at a single organization (Beijing Feimu) since mid-2024. Conceptual framework unchanged. To appear in Frontiers in Artificial Intelligence and Applications (IOS Press)

详情

AI中文摘要

越来越多的实证研究表明，生成式AI缩小了新手与专家在常规任务上的表现差距——即所谓的“均衡器”效应。本文挑战了这一结论的普遍性。基于认知增强理论、专家-新手研究以及对一个小型软件产品团队内部生成式AI使用的结构化观察，我们认为AI主要作为认知放大器：其输出质量根本上取决于指导它的人类专业知识。我们提出了一个包含人类贡献的三个层次（问题定义、质量评估、迭代优化）和三个参与级别（被动接受、迭代协作、认知指导）的框架，证明领域知识——而非提示工程技能——决定了放大效果。我们通过提出AI在结构良好的常规任务上均衡表现，而在需要深度判断的复杂任务上放大已有差异，来调和均衡器与放大器的观点。这种调和直接影响了人机混合系统的设计：我们应构建奖励和发展专业知识的AI，而非取代专业知识的AI。我们为HHAI社区提出了一个研究议程，聚焦于专业知识敏感的AI设计、自适应协作界面以及AI增强工作中人类能力发展的纵向研究。

英文摘要

A growing body of empirical research suggests that generative AI narrows performance gaps between novice and expert workers on routine tasks--the so-called "equalizer" effect. This paper challenges the generality of that conclusion. Drawing on cognitive augmentation theory, expert-novice research, and structured observations of in-house generative-AI use across a small software product team, we argue that AI functions primarily as a cognitive amplifier: a system whose output quality depends fundamentally on the expertise of the human who directs it. We present a framework comprising three layers of human contribution (problem definition, quality evaluation, iterative refinement) and three levels of engagement (passive acceptance, iterative collaboration, cognitive direction), demonstrating that domain expertise--not prompt engineering skill--determines amplification effectiveness. We reconcile the equalizer and amplifier perspectives by proposing that AI equalizes performance on well-structured, routine tasks while amplifying pre-existing differences on complex tasks requiring deep judgment. This reconciliation carries direct implications for hybrid human-AI system design: rather than building AI that replaces expertise, we should build AI that rewards and develops it. We outline a research agenda for the HHAI community centered on expertise-sensitive AI design, adaptive collaboration interfaces, and longitudinal studies of human capability development in AI-augmented work.

2512.06393 2026-05-26 cs.AI cs.CL cs.LG cs.LO 版本更新

Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

发表机构 * Independent（独立）； University of Oxford（牛津大学）； Stanford University（斯坦福大学）； Anthropic ； Martian Core

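AI summary: Proposes Chain-of-Thought Hijacking, an attack that induces large reasoning models to carry out prolonged benign reasoning, weakening their ability to refuse harmful requests and achieving high jailbreak success rates.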

详情

AI中文摘要

大型推理模型（LRMs）通过扩展推理时间推理来提高任务性能。尽管先前研究表明更长的推理应导致更稳健的安全行为，但我们发现了相反的证据：过度扩展的推理反而可以被利用来系统性地削弱拒绝行为。我们提出了思维链劫持，一种简单而有效的黑盒越狱攻击，诱导LRMs进行长时间的良性谜题求解推理（通常持续五分钟以上），然后引发有害的顺从。在HarmBench上，思维链劫持在Gemini 2.5 Pro、ChatGPT o4 Mini、Grok 3 Mini和Claude 4 Sonnet上分别实现了99%、94%、100%和94%的攻击成功率。为了理解该攻击为何成功，我们对开源推理模型进行了激活探测、注意力模式分析和因果干预。我们的结果表明，拒绝行为依赖于一个低维安全信号，其表达随着推理轨迹变长而减弱。特别是，扩展的良性推理将注意力从有害意图转移开，并减弱与拒绝相关的激活，产生了我们称之为拒绝稀释的现象。这些发现表明，过长的推理可能引入系统性的越狱攻击面。我们发布了评估材料以支持可重复性和进一步研究。

英文摘要

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.

2510.25065 2026-05-26 cs.AI 版本更新

Rewarding Structural Conformance of Reasoning using Process Mining

使用过程挖掘奖励推理的结构符合性

Yongjae Lee, Taekhyun Park, Sunghyun Sim, Hyerim Bae

发表机构 * Dept. of Industrial Engineering（工业工程系）； Pusan National University（釜山国立大学）； Dept. of Data Science（数据科学系）； Changwon National University（昌原国立大学）

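AI summary: Proposes TACReward, a reward model that uses process mining techniques to aggregate structural deviations across reasoning steps, improving sparse reward policy gradient methods on mathematical reasoning tasks.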

详情

AI中文摘要

近期稀疏奖励策略梯度方法的进展使得基于强化学习的语言模型后训练成为可能。然而，对于数学问题求解等推理任务，二值化结果奖励对中间推理步骤提供的反馈有限。虽然一些研究尝试通过估计整体推理质量来解决此问题，但这些奖励是否可靠地代表逐步推理质量仍不明确。在本研究中，我们将推理视为结构化过程，并提出TACReward，该奖励模型可无缝集成到稀疏奖励策略梯度方法中，无需额外的人工标注成本或架构修改。TACReward利用过程挖掘技术聚合教师与策略推理之间的逐步结构偏差，生成范围在[0, 1]的标量输出奖励以指示推理质量。在多个数学推理基准上的实验表明，将TACReward集成到稀疏奖励框架中鼓励策略模型改善推理的结构质量，从而在现有稀疏奖励框架上实现一致的性能提升。我们的代码和检查点可在https://github.com/Thrillcrazyer/TACReward和https://huggingface.co/Thrillcrazyer/TACReward7B公开获取。

英文摘要

Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)-based language model post-training. However, for reasoning tasks such as mathematical problem solving, binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted to address this issue by estimating overall reasoning quality, it remains unclear whether these rewards are reliable proxies for the quality of stepwise reasoning. In this study, we consider reasoning as a structured process and propose TACReward, a reward model that can be seamlessly integrated into sparse reward policy gradient methods without additional human annotation costs or architectural modifications. TACReward aggregates stepwise structural deviations between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range [0, 1] that indicates reasoning quality. Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward frameworks encourages the policy model to improve the structural quality of its reasoning, which in turn leads to consistent performance improvements over existing sparse reward frameworks. Our code and checkpoints are publicly available at https://github.com/Thrillcrazyer/TACReward and https://huggingface.co/Thrillcrazyer/TACReward7B.

2510.15514 2026-05-26 cs.AI 版本更新

Voting with the Graph: Stable RLAIF via Topological Consistency Maximization

基于图的投票：通过拓扑一致性最大化实现稳定的RLAIF

Boyin Liu, Zhuo Zhang, Sen Huang, Lipeng Xie, Qingxu Fu, Haoran Chen, LI YU, Tianyi Hu, Zhaoyang Liu, Bolin Ding, Dongbin Zhao

发表机构 * Alibaba Group（阿里巴巴集团）； Chinese Academy of Sciences Institute of Automation（中国科学院自动化研究所）

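AI summary: Proposes Topological Consensus Rewards (TCR), a framework that uses transitivity as a denoising mechanism and filters stochastic noise from preference signals via topological majority voting, stabilizing preference learning in Reinforcement Learning from AI Feedback (RLAIF).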

详情

AI中文摘要

从AI反馈中强化学习（RLAIF）依赖LLM法官作为偏好测量工具，但这些工具本质上受限于随机测量误差——表现为偏好循环（例如，$A \succ B \succ C \succ A$）的随机波动，在最先进模型中占5-9%的评估。虽然重复采样通过平均多个判断来减轻噪声，但它孤立地处理每个比较，未能利用区分系统信号与随机噪声的结构约束。我们引入拓扑共识奖励（TCR），一个通过拓扑多数投票利用传递性作为去噪机制的框架：系统信号通过传递链相互增强，而随机误差聚集为拓扑暴露的循环。TCR近似最大无环子图以从偏好信号中过滤随机噪声。我们还提出循环发生率（CIR）作为诊断指标，衡量包含偏好循环的样本比例。在我们的噪声模型下，这些循环主要源于随机测量误差而非真正的非传递性。在Arena-Hard、MT-Bench和WritingBench上的实验表明，TCR始终优于成对基线和经典排序算法，并在不同法官模型上表现出稳健性能。

英文摘要

Reinforcement Learning from AI Feedback (RLAIF) relies on LLM judges as preference measurement instruments, yet these instruments are fundamentally limited by random measurement errors -- stochastic fluctuations that manifest as preference cycles (e.g., $A \succ B \succ C \succ A$), occurring in 5-9% of evaluations across state-of-the-art models. While repeated sampling mitigates noise by averaging multiple judgments, it treats each comparison in isolation and fails to exploit the structural constraints that distinguish systematic signals from random noise. We introduce Topological Consensus Rewards (TCR), a framework that leverages transitivity as a denoising mechanism via topological majority voting: systematic signals reinforce each other through transitive chains, while random errors cluster into topologically exposed cycles. TCR approximates the Maximum Acyclic Subgraph to filter stochastic noise from preference signals. We also propose Cycle Incidence Rate (CIR) as a diagnostic metric that measures the proportion of samples containing preference cycles. Under our noise model, these cycles arise primarily from stochastic measurement errors rather than genuine intransitivity. Experiments on Arena-Hard, MT-Bench, and WritingBench demonstrate that TCR consistently outperforms pairwise baselines and classical ranking algorithms, while exhibiting robust performance across different judge models.
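The Cycle Incidence Rate mentioned in this abstract is described as the proportion of samples containing preference cycles, which can be computed with ordinary directed-cycle detection. The sketch below assumes each sample is a list of `(winner, loser)` judgments over a small set of responses; that data layout and the function names are hypothetical, and the code does not attempt TCR's Maximum Acyclic Subgraph approximation.

```python
def has_cycle(edges):
    """Detect a directed cycle in a preference graph given (winner, loser) edges."""
    graph = {}
    for winner, loser in edges:
        graph.setdefault(winner, []).append(loser)
        graph.setdefault(loser, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge closes a cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def cycle_incidence_rate(samples):
    """Fraction of samples whose pairwise judgments contain at least one cycle."""
    return sum(has_cycle(edges) for edges in samples) / len(samples)


samples = [
    [("A", "B"), ("B", "C"), ("C", "A")],  # A > B > C > A: a preference cycle
    [("A", "B"), ("B", "C"), ("A", "C")],  # transitive, no cycle
]
print(cycle_incidence_rate(samples))
```

The first sample reproduces the $A \succ B \succ C \succ A$ pattern from the abstract, so half of the toy samples are counted as cyclic.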

2510.14925 2026-05-26 cs.AI cs.CL cs.LG 版本更新

False Fixed Points: Kantian Feedback, Stable Miscalibration, and Representational Compression in LLMs

虚假不动点：大语言模型中的康德反馈、稳定误校准与表征压缩

Akira Okutomi

发表机构 * ToppyMicroServices OÜ（ToppyMicroServices公司）

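AI summary: Uses a Kantian commitment-gate framing and a linear feedback model to study high-confidence errors in large language models as false fixed points that are locally stable, internally coherent, and confidently wrong. It finds that stability and correctness can separate and considers high-SNR inertia and representational compression as possible mechanisms for stable miscalibration.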

Comments 27 pages, 8 figures, v3.0

详情

AI中文摘要

大型语言模型中的高置信度错误通常被视为脆弱的失败。我们研究另一种可能性：某些错误可能是虚假不动点，即局部稳定、内部一致且自信地错误。这分离了鲁棒性与真实追踪。我们通过康德承诺门控框架和一个最小线性反馈模型来发展这种分离，其中稳定性和正确性可以偏离。在三个开源权重模型上，根据我们的隐藏状态敏感性探测，过度自信的错误项并不比自信正确的项系统性地更局部脆弱。基于弃权的自我批评通过牺牲覆盖率减少了过度自信的错误承诺，而C3-R（一种基于规则的显式反馈门控）则加剧了这种权衡而非消除它。这些结果激发但未证实高信噪比惯性和表征压缩作为稳定误校准的可能机制。

英文摘要

High-confidence errors in large language models are often treated as fragile failures. We study an alternative: some errors may be false fixed points, locally stable, internally coherent, and confidently wrong. This separates robustness from truth-tracking. We develop the separation through a Kantian commitment-gate framing and a minimal linear feedback model in which stability and correctness can diverge. Across three open-weight models, overconfident wrong items are not systematically more locally fragile than confidently correct items under our hidden-state sensitivity probes. Abstention-aware self-critique reduces overconfident wrong commitments by sacrificing coverage, and C3-R, a rule-based explicit feedback gate, sharpens that tradeoff rather than eliminating it. These results motivate, but do not establish, high signal-to-noise (high-SNR) inertia and representational compression as possible mechanisms for stable miscalibration.

2510.07343 2026-05-26 cs.GR cs.AI eess.IV 版本更新

Local MAP Sampling for Diffusion Models

扩散模型的局部MAP采样

Shaorong Zhang, Rob Brekelmans, Greg Ver Steeg

发表机构 * University of California, Riverside, CA, US（加州大学河滨分校）

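AI summary: Proposes Local MAP Sampling (LMAPS), a framework that iteratively solves local MAP subproblems along the diffusion trajectory, unifying optimization-based methods with probabilistic sampling and achieving state-of-the-art performance on image restoration and scientific tasks.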

详情

AI中文摘要

扩散后验采样（DPS）通过从$p(x_0 \mid y)$采样，为逆问题提供了一种基于贝叶斯原理的方法。虽然后验采样对于捕捉不确定性和多模态性很有价值，但许多经典和实际的逆问题设置最终优先考虑精确的点估计——最显著的是MAP估计器，它长期以来一直是成像和科学应用中的标准重建目标。我们引入了局部MAP采样（LMAPS），这是一种新的推理框架，沿扩散轨迹迭代求解局部MAP子问题。这一视角阐明了它们与全局MAP和DPS的联系，为基于优化的方法提供了统一的概率解释。在此基础之上，我们开发了实用算法，其中协方差近似基于高斯先验假设，并重新制定了目标函数以提高稳定性和可解释性。在广泛的图像恢复和科学任务中，LMAPS实现了最先进的性能。

英文摘要

Diffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from $p(x_0 \mid y)$. While posterior sampling is valuable for capturing uncertainty and multi-modality, many classical and practical inverse problem settings ultimately prioritize accurate point estimation -- most notably the MAP estimator, which has long served as a standard reconstruction objective in imaging and scientific applications. We introduce Local MAP Sampling (LMAPS), a new inference framework that iteratively solves local MAP subproblems along the diffusion trajectory. This perspective clarifies their connection to global MAP and DPS, offering a unified probabilistic interpretation for optimization-based methods. Building on this foundation, we develop practical algorithms with a covariance approximation motivated by a Gaussian prior assumption, and a reformulated objective for stability and interpretability. Across a broad set of image restoration and scientific tasks, LMAPS achieves state-of-the-art performance.

2510.04580 2026-05-26 cs.AI 版本更新

Strongly Solving 2048 4x3

强求解 2048 4x3

Tomoyuki Kaneko, Shuhei Yamashita

发表机构 * Graduate School of Arts and Sciences, the University of Tokyo（东京大学艺术及科学研究生院）

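AI summary: Strongly solves the 2048 variant on a 4x3 board by partitioning the state space by the sum of tile numbers on the board (the age of a state) and enumerating all reachable states and afterstates; the optimal strategy has an expected score of about 50724.26.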

详情

DOI: 10.1177/13896911261443437

AI中文摘要

2048是一个随机单人游戏，涉及4x4网格上的16个单元格，玩家在上下左右中选择一个方向，通过合并沿该方向相邻单元格中相同数字的两个方块来获得分数。本文证明，变体2048-4x3（4x3棋盘上的12个单元格，比原版少一行）已被强求解。在该变体中，对于最常见的初始状态（两个数字2的方块），最优策略的期望得分约为50724.26。可达状态和后继状态的数量分别为1,152,817,492,752和739,648,886,170。关键技术是按棋盘上数字之和（称为状态年龄）划分状态空间。年龄在状态与其任何有效动作后的后继状态之间保持不变，并通过环境的随机响应增加2或4。因此，我们可以按年龄划分状态空间，并仅依赖于最近年龄的状态来枚举一个年龄的所有（后继）状态。类似地，我们可以按年龄递减顺序确定（后继）状态值。

英文摘要

2048 is a stochastic single-player game played on a 4 by 4 grid of 16 cells, in which a player chooses a direction among up, down, left, and right and scores by merging two tiles with the same number located in neighboring cells along the chosen direction. This paper shows that a variant, 2048-4x3, played with 12 cells on a 4 by 3 board (one row smaller than the original), has been strongly solved. In this variant, the expected score achieved by an optimal strategy is about $50724.26$ for the most common initial states: those with two tiles of number 2. The numbers of reachable states and afterstates are identified to be $1,152,817,492,752$ and $739,648,886,170$, respectively. The key technique is to partition the state space by the sum of tile numbers on a board, which we call the age of a state. The age is invariant between a state and its afterstate under any valid action, and it increases by two or four with the stochastic response from the environment. Therefore, we can partition the state space by age and enumerate all (after)states of a given age depending only on states of the few most recent ages. Similarly, we can identify (after)state values by processing ages in decreasing order.
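The age invariant that drives the partition can be checked directly on a toy board. This is a minimal sketch assuming the usual 2048 merge rule (each tile merges at most once per move) and reading the 4 by 3 board as three rows of four cells; the helper names are hypothetical and only the left move is shown.

```python
def slide_row_left(row):
    """Slide one row to the left, merging each equal neighboring pair once."""
    tiles = [value for value in row if value != 0]
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # two equal tiles become one doubled tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))


def move_left(board):
    """Afterstate reached by the 'left' action (before a new tile appears)."""
    return [slide_row_left(row) for row in board]


def age(board):
    """Age of a state: the sum of all tile numbers on the board."""
    return sum(sum(row) for row in board)


def spawn(board, row, col, tile):
    """Environment response: place a 2 or a 4 in an empty cell."""
    assert board[row][col] == 0 and tile in (2, 4)
    new_board = [list(r) for r in board]
    new_board[row][col] = tile
    return new_board


state = [[2, 2, 4, 0],
         [0, 4, 4, 4],
         [2, 0, 0, 2]]
afterstate = move_left(state)
next_state = spawn(afterstate, 2, 1, 2)
print(age(state), age(afterstate), age(next_state))
```

Merging replaces two tiles of value v with one tile of value 2v, so the sum is unchanged by the action and grows only when the environment adds a tile. That is why states of one age can be enumerated from states of slightly smaller ages alone.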

2510.02171 2026-05-26 cs.SD cs.AI eess.AS 版本更新

Go witheFlow: Real-time Emotion Driven Audio Effects Modulation

Go witheFlow：实时情感驱动音频效果调制

Edmund Dervakos, Spyridon Kantarelis, Vassilis Lyberatos, Jason Liartis, Giorgos Stamou

发表机构 * Artificial Intelligence and Learning Systems Laboratory（人工智能与学习系统实验室）； National Technical University of Athens（希腊国家技术大学）

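AI summary: Proposes the witheFlow system, which automatically modulates audio effects in real time from biosignals and audio features to enhance human-machine collaboration in music performance.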

Comments Accepted at NeurIPS Creative AI Track 2025: Humanity

2510.01389 2026-05-26 cs.RO cs.AI cs.LG 版本更新

INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models

INSIGHT: 视觉-语言-动作模型中生成帮助触发器的推理时序列内省

Ulas Berk Karli, Ziyao Shangguan, Tesca Fitzgerald

发表机构 * Department of Computer Science, Yale University（耶鲁大学计算机科学系）

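AI summary: Proposes the INSIGHT framework, which trains transformer classifiers on token-level uncertainty signals (entropy, log-probability, and uncertainty estimates) to predict when a VLA model needs human help. It compares strong and weak supervision and finds that modeling temporal dynamics outperforms static scoring.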

详情

AI中文摘要

最近的视觉-语言-动作（VLA）模型展现出强大的泛化能力，但它们缺乏用于预测失败和向人类监督者请求帮助的内省机制。我们提出了INSIGHT，一个利用令牌级不确定性信号来预测VLA何时应请求帮助的学习框架。使用π0-FAST作为基础模型，我们提取每个令牌的熵、对数概率以及基于狄利克雷的偶然不确定性和认知不确定性估计，并训练紧凑的变压器分类器将这些序列映射到帮助触发器。我们探索了强监督或弱监督的监督机制，并在分布内和分布外任务中进行了广泛比较。我们的结果显示了权衡：强标签使模型能够捕捉细粒度的不确定性动态以实现可靠的帮助检测，而弱标签虽然噪声较大，但在训练和评估对齐时仍能支持有竞争力的内省，为密集标注不可行时提供了可扩展的路径。关键的是，我们发现使用变压器建模令牌级不确定性信号的时间演化比静态序列级评分提供了更强的预测能力。本研究首次对VLA中基于不确定性的内省进行了系统评估，为主动学习和通过选择性人工干预实现实时错误缓解开辟了未来途径。

英文摘要

Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using $π_0$-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore strong and weak supervision regimes, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
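Two of the per-token signals listed in this abstract, entropy and log-probability, are simple to compute from a model's next-token distributions. The sketch below is an illustration only: the function name and input layout are hypothetical, the distributions are toy values rather than $π_0$-FAST outputs, and the Dirichlet-based aleatoric and epistemic estimates are omitted because the abstract does not specify them.

```python
import math


def token_uncertainty_features(step_probs, chosen_tokens):
    """Build one (entropy, log-probability) pair per generated token.

    step_probs: list of next-token probability distributions, one per step.
    chosen_tokens: index of the token actually emitted at each step
                   (assumed to have nonzero probability).
    """
    features = []
    for probs, token in zip(step_probs, chosen_tokens):
        token_entropy = -sum(p * math.log(p) for p in probs if p > 0)
        token_logprob = math.log(probs[token])
        features.append((token_entropy, token_logprob))
    return features


# Toy sequence: an uncertain step followed by a fully confident one.
features = token_uncertainty_features([[0.5, 0.5], [1.0, 0.0]], [0, 0])
for step, (ent, logprob) in enumerate(features):
    print(step, round(ent, 4), round(logprob, 4))
```

Keeping the features as a sequence, rather than averaging them into a single score, is what would let a downstream sequence classifier model their temporal evolution, which the abstract reports as the more predictive option.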

2509.25339 2026-05-26 cs.CV cs.AI cs.LG eess.IV 版本更新

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

VisualOverload: 在真正密集场景中探测VLM的视觉理解

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne

发表机构 * Independent Researcher（独立研究者）； JKU Linz（林茨JKU）； MIT CSAIL ； Tübingen AI Center（图宾根人工智能中心）； Stanford（斯坦福）； MIT-IBM Watson AI Lab（MIT-IBM沃森人工智能实验室）

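AI summary: Proposes the VisualOverload benchmark, which tests VLMs on simple vision tasks in dense scenes. The best model reaches only 69.5% accuracy, revealing key deficiencies in counting, OCR, and logical consistency.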

Comments Accepted at CVPR 2026

详情

AI中文摘要

最先进的VLM是否真正解决了基本视觉理解？我们提出VisualOverload，一个略有不同的视觉问答（VQA）基准，包含2,720个问答对，并持有私有真实答案。与以往通常关注近全局图像理解的VQA数据集不同，VisualOverload挑战模型在密集（或过载）场景中执行简单的、无需知识的视觉任务。我们的数据集由公共领域绘画的高分辨率扫描图组成，这些绘画包含多个人物、动作和展开的子情节，背景细节丰富。我们手动为这些图像标注了六个任务类别的问题，以探测对场景的彻底理解。我们假设当前基准高估了VLM的性能，编码和推理细节对它们来说仍然是一项具有挑战性的任务，尤其是当面对密集场景时。实际上，我们观察到在37个测试模型中，即使是最好的模型（o3）在我们最难的测试子集上也仅达到19.6%的准确率，在所有问题上总体准确率为69.5%。除了全面评估外，我们还通过错误分析补充了基准，揭示了多种失败模式，包括缺乏计数能力、OCR失败以及复杂任务下惊人的逻辑不一致。总之，VisualOverload暴露了当前视觉模型中的关键差距，并为社区开发更好的模型提供了重要资源。基准：http://paulgavrikov.github.io/visualoverload

英文摘要

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload


2509.12196 2026-05-26 cs.LG cs.AI 版本更新

Dynamic Relational Priming Improves Transformer in Multivariate Time Series

动态关系先验提升Transformer在多变量时间序列中的表现

Hunjae Lee, Corey Clark

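Affiliations: Department of Computer Science, Southern Methodist University, Dallas, TX, USA

AI summary: Proposes attention with dynamic relational priming (prime attention), which adapts each token's representation per token pair to capture heterogeneous inter-channel dependencies in multivariate time series, improving forecasting accuracy by up to 6.5% at the same asymptotic computational complexity as standard attention.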


AI中文摘要

标准Transformer中的注意力机制使用静态的token表示，这些表示在每一层的所有成对计算中保持不变。这限制了它们与每个token对交互中可能存在的多样化关系动态的表示对齐。虽然标准注意力在关系相对同质的领域表现出色，但其静态关系学习难以捕捉多变量时间序列（MTS）数据中多样、异构的通道间依赖关系——其中单个系统内不同的通道对交互可能由完全不同的物理定律或时间动态支配。为了更好地将注意力机制与此类领域现象对齐，我们提出了带有动态关系先验的注意力机制（prime attention）。与标准注意力中每个token在所有成对交互中呈现相同表示不同，prime attention通过可学习的调制动态地（或按交互）定制每个token，以最好地捕捉每个token对的独特关系动态，从而针对特定关系优化每个成对交互。这种prime attention的表示可塑性使其能够在保持与标准注意力相同渐近计算复杂度的同时，有效提取MTS中关系特定的信息。我们的结果表明，prime attention在基准测试中始终优于标准注意力，预测精度提升高达6.5%。此外，我们发现与标准注意力相比，prime attention在使用最多40%更短序列长度时即可达到相当或更优的性能，进一步证明了其卓越的关系建模能力。

英文摘要

Standard attention mechanisms in transformers employ static token representations that remain unchanged across all pair-wise computations in each layer. This limits their representational alignment with the potentially diverse relational dynamics of each token-pair interaction. While they excel in domains with relatively homogeneous relationships, standard attention's static relational learning struggles to capture the diverse, heterogeneous inter-channel dependencies of multivariate time series (MTS) data, where different channel-pair interactions within a single system may be governed by entirely different physical laws or temporal dynamics. To better align the attention mechanism for such domain phenomena, we propose attention with dynamic relational priming (prime attention). Unlike standard attention where each token presents an identical representation across all of its pair-wise interactions, prime attention tailors each token dynamically (or per interaction) through learnable modulations to best capture the unique relational dynamics of each token pair, optimizing each pair-wise interaction for that specific relationship. This representational plasticity of prime attention enables effective extraction of relationship-specific information in MTS while maintaining the same asymptotic computational complexity as standard attention. Our results demonstrate that prime attention consistently outperforms standard attention across benchmarks, achieving up to 6.5% improvement in forecasting accuracy. In addition, we find that prime attention achieves comparable or superior performance using up to 40% less sequence length compared to standard attention, further demonstrating its superior relational modeling capabilities.
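
To make the contrast with standard attention concrete, the following is a minimal NumPy sketch of one way per-pair modulation could look. It is not the authors' implementation: the modulation tensor `F` (one vector per ordered token pair) and the choice to scale both keys and values with it are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # Every token j shows the same key/value to all queries i.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def primed_attention(Q, K, V, F):
    # F[i, j] is a hypothetical learnable modulation for the pair (i, j);
    # token j's key and value are rescaled differently for each query i.
    d = Q.shape[-1]
    K_pair = K[None, :, :] * F                      # shape (n, n, d)
    V_pair = V[None, :, :] * F                      # shape (n, n, d)
    scores = np.einsum("id,ijd->ij", Q, K_pair) / np.sqrt(d)
    A = softmax(scores, axis=-1)
    return np.einsum("ij,ijd->id", A, V_pair)
```

With `F` set to all ones the sketch reduces to standard attention, and the dominant cost stays at O(n²d) in both functions.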


2509.12194 2026-05-26 cs.AI cs.CV 版本更新

Teaching large language models to reason like expert diagnosticians

教会大型语言模型像专家诊断医生一样推理

Thomas A. Buckley, Riccardo Conci, Peter G. Brodeur, Jason Gusdorf, Sourik Beltrán, Bita Behrouzi, Byron Crowe, Jacob Dockterman, Muzzammil Muhammad, Sarah Ohnigian, Andrew Sanchez, James A. Diao, Aashna P. Shah, Daniel Restrepo, Eric S. Rosenberg, Andrew S. Lea, Emily Glanton, Kimberly LeBlanc, Undiagnosed Diseases Network, Marinka Zitnik, Scott H. Podolsky, Zahir Kanjee, Raja-Elie E. Abdulnour, Jacob M. Koshy, Adam Rodman, Arjun K. Manrai

Affiliations: Department of Biomedical Informatics, Harvard Medical School; Department of Medicine, Beth Israel Deaconess Medical Center; The Mongan Institute, Massachusetts General Hospital; Division of Gastroenterology, Brigham and Women’s Hospital; Department of Medicine, Brigham and Women’s Hospital; Department of Medicine, Massachusetts General Hospital; Department of Pathology, Massachusetts General Hospital; Department of Health Humanities and Bioethics, University of Rochester School of Medicine and Dentistry; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Center for the History of Medicine, Countway Library of Medicine, Harvard Medical School; Department of Global Health and Social Medicine, Harvard Medical School; Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital

AI summary: Proposes Dr. CaBot, an agentic AI system that emulates expert diagnostic reasoning by generating slide-based presentations from an initial case description alone. CaBot outperforms frontier models on CPC-Bench and identifies the working diagnosis in 50/72 (69%) cases from the NIH Undiagnosed Diseases Network; both CaBot and the CPC-Bench benchmark are released to advance clinical AI.


AI中文摘要

鉴别诊断是一个迭代过程，将患者信息与更广泛的医学知识相结合。自1923年以来持续发表的临床病例系列，如NEJM临床病理会议（CPCs），展示了专家医生向同行演示诊断推理，并已被用于评估AI数十年。然而，先前的AI评估主要关注最终诊断准确性，而非细微的临床推理。在此，我们介绍Dr. CaBot，一个代理AI系统，通过仅从初始病例描述生成带有书面和旁白的幻灯片演示，来模拟专家诊断医生。CaBot最近生成了NEJM CPC 100多年历史上首个发表的AI诊断。在盲评中，医生在46/62（74%）的试验中错误分类了鉴别诊断的来源（CaBot vs. 医生撰写），并在各个质量维度上给予其好评。当被要求解决来自NIH未诊断疾病网络的72名未诊断疾病患者的病例时，CaBot仅从转诊记录中就识别出了50/72（69%）病例的工作诊断。为了促进透明度和研究，我们还开发了CPC-Bench，一个基于7,102个CPC和47,648个问题（涵盖10个任务）的经医生验证的基准。我们证明CaBot在CPC-Bench上优于前沿模型，并公开发布CaBot和CPC-Bench，以促进临床AI的进步。

英文摘要

Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series such as the NEJM Clinicopathologic Conferences (CPCs), published continuously since 1923, feature expert physicians who demonstrate diagnostic reasoning to peers, and have been used for decades to evaluate AI. However, prior AI evaluations have largely focused on final diagnostic accuracy rather than nuanced clinical reasoning. Here, we introduce Dr. CaBot, an agentic AI system that emulates an expert diagnostician by generating written and narrated slide-based presentations from an initial case description alone. CaBot recently generated the first AI diagnosis published in the 100+ year history of the NEJM CPCs. In blinded evaluations, physicians misclassified the source of the differential (CaBot vs. physician-written) in 46/62 (74%) of trials and rated them favorably across quality dimensions. When tasked with solving cases for 72 patients with undiagnosed disease from the NIH Undiagnosed Diseases Network, CaBot identified the working diagnosis in 50/72 (69%) of cases from referral notes alone. To promote transparency and research, we also developed CPC-Bench, a physician-validated benchmark based on 7,102 CPCs and 47,648 questions across 10 tasks. We show that CaBot outperforms frontier models on CPC-Bench, and release both CaBot and CPC-Bench publicly to foster progress in clinical AI.


2509.05614 2026-05-26 cs.CV cs.AI cs.RO 版本更新

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

SpecPrune-VLA: 通过动作感知的自推测剪枝加速视觉-语言-动作模型

Hanzhen Wang, Jiaming Xu, Yushun Xiang, Jiayi Pan, Yongkang Zhou, Yong-Lu Li, Guohao Dai

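Affiliations: Shanghai Jiao Tong University

AI summary: For accelerating vision-language-action model inference, proposes a training-free, two-level pruning method that combines global context with local information, achieving up to 1.57× speedup in LIBERO simulation with negligible loss in success rate.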

Comments Accepted to ICML 2026


AI中文摘要

剪枝是一种通过移除不重要值的计算来加速计算密集型模型的典型技术。最近，它被应用于加速视觉-语言-动作（VLA）模型推理。然而，现有的加速方法仅关注当前动作步骤的局部信息，忽略了全局上下文，导致在某些场景下成功率下降超过20%且加速效果有限。本文指出VLA任务中的时空一致性：连续步骤中的输入图像表现出高度相似性，并提出关键见解：令牌选择应结合局部信息与模型的全局上下文。基于此，我们提出SpecPrune-VLA，一种无需训练、具有启发式控制的两级剪枝方法。(1) 动作级静态剪枝：利用全局历史和局部注意力，在每个动作中静态减少视觉令牌。(2) 层级动态剪枝：根据逐层重要性自适应地剪枝每层的令牌。(3) 轻量级动作感知控制器：根据末端执行器的速度将动作分为粗粒度或细粒度，并相应调整剪枝激进程度。大量实验表明，SpecPrune-VLA在LIBERO模拟中实现高达1.57倍加速，在真实世界任务中实现1.70倍加速，且成功率下降可忽略不计。

英文摘要

Pruning is a typical acceleration technique for compute-bound models by removing computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to >20% success rate drop and limited speedup in some scenarios. In this paper, we point out spatial-temporal consistency in VLA tasks: input images in consecutive steps exhibit high similarity, and propose the key insight that token selection should combine local information with global context of the model. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning. We leverage global history and local attention to statically reduce visual tokens per action. (2) Layer-level dynamic pruning. We prune tokens adaptively per layer based on layer-wise importance. (3) Lightweight action-aware controller: We classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to 1.57× speedup in LIBERO simulation and 1.70× on real-world tasks, with negligible success rate degradation.
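
The idea of combining global and local importance and then adapting aggressiveness to the action type can be sketched as follows. This is an illustrative toy, not the SpecPrune-VLA code: the additive score combination, the threshold, the keep ratios, and the assumption that fast end-effector motion means a coarse-grained action are all hypothetical.

```python
import numpy as np

def select_tokens(global_score, local_score, keep_ratio):
    """Return indices of visual tokens to keep, ranked by combined importance."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        return s / (s.sum() + 1e-12)
    # Assumption: global-history and local-attention scores are simply added.
    combined = norm(global_score) + norm(local_score)
    k = max(1, int(round(keep_ratio * combined.size)))
    keep = np.argsort(-combined)[:k]
    return np.sort(keep)

def keep_ratio_for(speed, speed_threshold=0.05, coarse_ratio=0.3, fine_ratio=0.7):
    """Assumption: fast motion is coarse-grained, so more tokens can be pruned."""
    return coarse_ratio if speed > speed_threshold else fine_ratio
```

A controller would call `keep_ratio_for` once per action and pass the result to `select_tokens`; per-layer dynamic pruning would repeat the selection with layer-wise scores.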


2508.08652 2026-05-26 cs.AI 版本更新

Prompt-and-Check: Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training

Prompt-and-Check：使用大型语言模型评估基于模拟训练中的通信协议合规性

Vishakha Lall, Yisi Liu

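Affiliations: Centre of Excellence in Maritime Safety, Singapore Polytechnic, Singapore

AI summary: Proposes Prompt-and-Check, which uses open-source large language models with context-rich prompts to evaluate communication protocol compliance in simulation-based training, and validates it in a maritime case study.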


DOI: 10.1109/CW68232.2025.00059

AI中文摘要

准确的程序通信合规性评估在基于模拟的训练中至关重要，特别是在安全关键领域，遵守合规检查表反映了操作能力。本文探索了一种轻量级、可部署的方法，使用基于提示的推理与开源大型语言模型（LLMs），这些模型可以在消费级GPU上高效运行。我们提出了Prompt-and-Check，一种使用上下文丰富的提示来评估协议中每个检查表项目是否已满足的方法，仅基于转录的口头交流。我们在海事领域进行了一个案例研究，参与者执行相同的模拟任务，并实验了LLama 2 7B、LLaMA 3 8B和Mistral 7B等模型，在本地RTX 4070 GPU上运行。对于每个检查表项目，一个包含相关转录摘录的提示被输入模型，模型输出合规性判断。我们使用分类准确性和一致性分数将模型输出与专家标注的基准进行比较。我们的发现表明，提示使得无需任务特定训练即可进行有效的上下文感知推理。这项研究突出了LLMs在增强训练环境中的汇报、绩效反馈和自动评估方面的实际效用。

英文摘要

Accurate evaluation of procedural communication compliance is essential in simulation-based training, particularly in safety-critical domains where adherence to compliance checklists reflects operational competence. This paper explores a lightweight, deployable approach using prompt-based inference with open-source large language models (LLMs) that can run efficiently on consumer-grade GPUs. We present Prompt-and-Check, a method that uses context-rich prompts to evaluate whether each checklist item in a protocol has been fulfilled, solely based on transcribed verbal exchanges. We perform a case study in the maritime domain with participants performing an identical simulation task, and experiment with models such as Llama 2 7B, Llama 3 8B and Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a prompt incorporating relevant transcript excerpts is fed into the model, which outputs a compliance judgment. We assess model outputs against expert-annotated ground truth using classification accuracy and agreement scores. Our findings demonstrate that prompting enables effective context-aware reasoning without task-specific training. This study highlights the practical utility of LLMs in augmenting debriefing, performance feedback, and automated assessment in training environments.
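
A per-item check of this kind might be wired up roughly as below. The prompt wording, the YES/NO output convention, and the use of Cohen's kappa as the agreement score are assumptions for illustration; `llm` stands for any callable that maps a prompt string to a reply string and is not a real library API.

```python
from typing import Callable, Sequence

def build_prompt(item: str, excerpt: str) -> str:
    # Hypothetical prompt template; the paper's actual wording is not given here.
    return (
        "You are assessing communication protocol compliance.\n"
        f"Checklist item: {item}\n"
        f"Transcript excerpt:\n{excerpt}\n"
        "Answer with exactly one word: YES or NO."
    )

def check_item(llm: Callable[[str], str], item: str, excerpt: str) -> bool:
    reply = llm(build_prompt(item, excerpt))
    return reply.strip().upper().startswith("YES")

def accuracy(pred: Sequence[bool], gold: Sequence[bool]) -> float:
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohen_kappa(pred: Sequence[bool], gold: Sequence[bool]) -> float:
    # Agreement corrected for chance, for binary labels.
    n = len(gold)
    po = accuracy(pred, gold)
    p_yes = (sum(pred) / n) * (sum(gold) / n)
    p_no = (1 - sum(pred) / n) * (1 - sum(gold) / n)
    pe = p_yes + p_no
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

In practice `llm` would wrap a locally hosted model, and the predictions for all checklist items would be compared against the expert annotations with `accuracy` and `cohen_kappa`.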


2507.14760 2026-05-26 eess.IV cs.AI cs.CV cs.LG 版本更新

A Neuro-Symbolic Approach to Reliable Proof Generation with LLMs: The Case of Euclidean Geometry

Oren Sultan, Eitan Stern, Dafna Shahaf

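Affiliations: The Hebrew University of Jerusalem

AI summary: Proposes a neuro-symbolic approach that combines LLM generation with structured components, using analogous-problem retrieval and formal-verifier feedback to substantially improve proof accuracy on Euclidean geometry problems.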

Comments long paper


AI中文摘要

大型语言模型（LLM）在需要严格逻辑推理和符号推理的形式化领域（如数学证明生成）中表现不佳。我们提出一种神经符号方法，结合LLM的生成优势与结构化组件以克服这一挑战。作为概念验证，我们专注于SAT级别的几何问题。我们的方法有两方面：（1）检索类比问题并利用其证明来指导LLM；（2）形式验证器评估生成的证明并提供反馈，帮助模型修正错误证明。我们的方法显著提高了不同模型族的证明准确性，在所有评估模型（OpenAI o1、GPT-5、Gemini-Flash-2.5和Claude Sonnet 4.6）上均取得了显著提升。基础模型的准确率从10%至44%提升至采用我们方法后的68%至96%，其中类比问题指导和验证器反馈均贡献了这些改进。更广泛地说，转向生成可证明正确结论的LLM有望大幅提高其可靠性、准确性和一致性，从而解锁需要可信赖性的复杂任务和关键现实应用。

英文摘要

Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge. As a proof of concept, we focus on SAT-level geometry problems. Our approach is two-fold: (1) We retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. Our method significantly improves proof accuracy across diverse model families, achieving significant gains across all evaluated models: OpenAI o1, GPT-5, Gemini-Flash-2.5, and Claude Sonnet 4.6. Accuracy increases from 10% to 44% for the base models to 68% to 96% with our approach, with both analogous problem guidance and verifier feedback contributing to these improvements. More broadly, shifting to LLMs that generate provably correct conclusions has the potential to dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.
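
The two-part approach (analogy-guided generation plus verifier feedback) amounts to a generate, verify, repair loop. The sketch below shows only that control flow; `retrieve`, `generate`, and `verify` are hypothetical callables standing in for the retrieval component, the LLM, and the formal verifier, and the round limit is an assumed parameter.

```python
def prove_with_feedback(problem, generate, verify, retrieve, max_rounds=3):
    """Return a verified proof, or None if no attempt passes within max_rounds."""
    examples = retrieve(problem)      # analogous problems with their proofs
    feedback = None                   # no verifier feedback on the first attempt
    for _ in range(max_rounds):
        proof = generate(problem, examples, feedback)
        ok, feedback = verify(problem, proof)
        if ok:
            return proof
    return None
```

Only proofs accepted by the verifier are returned, which is what makes the output checkable regardless of how the generator behaves.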


2505.07078 2026-05-26 q-fin.TR cs.AI cs.CE 版本更新

Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?

基于LLM的金融投资策略能否长期跑赢市场？

Weixian Waylon Li, Hyeonjun Kim, Mihai Cucuringu, Tiejun Ma

Affiliations: AIAI, School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom; Global Finance Research Center, Sungkyunkwan University, Seoul, Republic of Korea; Dept. of Statistics & OMI, University of California, Los Angeles, United States; University of Oxford

AI summary: Proposes FINSABER, a backtesting framework that evaluates LLM-based timing strategies over longer periods and a larger universe of stocks, finding that their reported advantages deteriorate significantly under longer-term, broader cross-sectional evaluation and that they underperform in both bull and bear markets.

Comments KDD 2026, Datasets & Benchmarks Track


AI中文摘要

大型语言模型（LLM）最近被用于资产定价任务和股票交易应用，使AI代理能够从非结构化金融数据中生成投资决策。然而，大多数对LLM择时投资策略的评估都是在狭窄的时间范围和有限的股票池中进行的，由于幸存者偏差和数据窥探偏差，其有效性被夸大。我们通过提出FINSABER（一个在更长时间段和更大符号池中评估择时策略的回测框架），批判性地评估其泛化能力和稳健性。跨越二十年和100多个符号的系统回测表明，先前报告的LLM优势在更广泛的截面和更长期的评估下显著恶化。我们的市场制度分析进一步表明，LLM策略在牛市中过于保守，表现不及被动基准，在熊市中过于激进，导致重大损失。这些发现强调了开发能够优先考虑趋势检测和制度感知风险控制，而不仅仅是增加框架复杂性的LLM策略的必要性。

英文摘要

Large Language Models (LLMs) have recently been leveraged for asset pricing tasks and stock trading applications, enabling AI agents to generate investment decisions from unstructured financial data. However, most evaluations of LLM timing-based investing strategies are conducted on narrow timeframes and limited stock universes, overstating effectiveness due to survivorship and data-snooping biases. We critically assess their generalizability and robustness by proposing FINSABER, a backtesting framework evaluating timing-based strategies across longer periods and a larger universe of symbols. Systematic backtests over two decades and 100+ symbols reveal that previously reported LLM advantages deteriorate significantly under broader cross-section and over a longer-term evaluation. Our market regime analysis further demonstrates that LLM strategies are overly conservative in bull markets, underperforming passive benchmarks, and overly aggressive in bear markets, incurring heavy losses. These findings highlight the need to develop LLM strategies that are able to prioritise trend detection and regime-aware risk controls over mere scaling of framework complexity.
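
For readers unfamiliar with what a timing-strategy backtest compares, here is a toy sketch in plain Python. It is not FINSABER: the moving-average rule is a stand-in for any signal source (including an LLM), and transaction costs, multiple symbols, and survivorship handling are deliberately left out.

```python
def cumulative_return(prices, positions):
    """positions[t] in {0, 1} is the holding over the step from t to t+1."""
    wealth = 1.0
    for t in range(len(prices) - 1):
        if positions[t]:
            wealth *= prices[t + 1] / prices[t]
    return wealth - 1.0

def moving_average_signal(prices, window):
    """Hold only when the price is above its trailing mean (uses data up to t)."""
    signal = []
    for t in range(len(prices)):
        hist = prices[max(0, t - window + 1): t + 1]
        signal.append(1 if prices[t] > sum(hist) / len(hist) else 0)
    return signal
```

Comparing `cumulative_return(prices, signal)` with the buy-and-hold result `cumulative_return(prices, [1] * len(prices))` over many symbols and long periods is the kind of evaluation the abstract argues for.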


2504.12474 2026-05-26 cs.CL cs.AI 版本更新

Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex

在文本属性图中整合结构信号与语义信号：BiGTex

Azadeh Beiranvand, Seyed Mehdi Vahidipour

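Affiliations: Faculty of Electrical and Computer Engineering, University of Kashan

AI summary: Proposes BiGTex, an architecture that stacks Graph-Text Fusion Units to enable bidirectional attention between a GNN and an LLM and is trained with parameter-efficient fine-tuning (LoRA), achieving state-of-the-art node classification performance and generalizing effectively to link prediction.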

Comments 26 pages, 4 figures


DOI: 10.1016/j.mlwa.2026.100921
Journal ref: Machine Learning with Applications 24 (2026) 100921

AI中文摘要

文本属性图（TAGs）在表示学习中提出了独特挑战，要求模型同时捕捉节点关联文本的语义丰富性和图的结构依赖性。图神经网络（GNNs）擅长建模拓扑信息，但缺乏处理非结构化文本的能力。相反，大型语言模型（LLMs）精通文本理解，但通常不了解图结构。在这项工作中，我们提出了BiGTex（双向图文本），一种通过堆叠图-文本融合单元紧密集成GNN和LLM的新型架构。每个单元允许文本和结构表示之间的相互注意力，使信息能够双向流动：文本影响结构，结构指导文本解释。所提出的架构使用参数高效微调（LoRA）进行训练，保持LLM冻结同时适应任务特定信号。在五个基准数据集上的大量实验表明，BiGTex在节点分类中实现了最先进的性能，并有效泛化到链接预测。消融研究进一步强调了软提示和双向注意力在模型成功中的重要性。

英文摘要

Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.


2504.05181 2026-05-26 cs.IR cs.AI cs.DL cs.LG 版本更新

Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval

轻量级直接文档相关性优化用于生成式信息检索

Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke

Affiliations: University of Amsterdam

AI summary: Proposes direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through pairwise ranking, without explicit reward modeling or reinforcement learning, improving MRR@10 by 7.4% on MS MARCO and 19.9% on Natural Questions.

Comments 12 pages, 3 figures. SIGIR '25 Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval July 13--18, 2025 Padua, Italy. Code and pretrained models available at: https://github.com/kidist-amde/ddro/


DOI: 10.1145/3726302.3730023
Journal ref: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), pages 1327-1338, 2025

AI中文摘要

生成式信息检索（GenIR）是一种有前景的神经检索范式，它将文档检索形式化为文档标识符（docid）生成任务，允许朝着统一的全局检索目标进行端到端优化。然而，现有的GenIR模型存在令牌级错位问题，即训练用于预测下一个令牌的模型往往无法有效捕捉文档级相关性。虽然基于强化学习的方法（如相关性反馈强化学习（RLRF））旨在通过奖励建模解决这种错位，但它们引入了显著的复杂性，需要优化辅助奖励函数，然后进行强化微调，这在计算上昂贵且往往不稳定。为了解决这些挑战，我们提出了直接文档相关性优化（DDRO），它通过成对排序的直接优化，将令牌级docid生成与文档级相关性估计对齐，无需显式的奖励建模和强化学习。在包括MS MARCO文档和Natural Questions在内的基准数据集上的实验结果表明，DDRO优于基于强化学习的方法，在MS MARCO上MRR@10提升了7.4%，在Natural Questions上提升了19.9%。这些发现凸显了DDRO通过简化优化方法增强检索效果的潜力。通过将对齐问题框架化为直接优化问题，DDRO简化了GenIR模型的排序优化流程，同时为基于强化学习的方法提供了一种可行的替代方案。

英文摘要

Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task, allowing for end-to-end optimization toward a unified global retrieval objective. However, existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. While reinforcement learning-based methods, such as reinforcement learning from relevance feedback (RLRF), aim to address this misalignment through reward modeling, they introduce significant complexity, requiring the optimization of an auxiliary reward function followed by reinforcement fine-tuning, which is computationally expensive and often unstable. To address these challenges, we propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking, eliminating the need for explicit reward modeling and reinforcement learning. Experimental results on benchmark datasets, including MS MARCO document and Natural Questions, show that DDRO outperforms reinforcement learning-based methods, achieving a 7.4% improvement in MRR@10 for MS MARCO and a 19.9% improvement for Natural Questions. These findings highlight DDRO's potential to enhance retrieval effectiveness with a simplified optimization approach. By framing alignment as a direct optimization problem, DDRO simplifies the ranking optimization pipeline of GenIR models while offering a viable alternative to reinforcement learning-based methods.
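
One common way to optimize a pairwise ranking objective directly, without a reward model, is a DPO-style loss over sequence log-probabilities. The sketch below shows that form; whether DDRO uses exactly this loss, the frozen reference model, and the value of `beta` are assumptions here rather than details taken from the abstract.

```python
import math

def pairwise_ranking_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """-log(sigmoid(margin)) for one (relevant docid, non-relevant docid) pair.

    logp_* are the policy's log-probabilities of generating each docid;
    ref_logp_* come from a frozen reference model (assumed, DPO-style).
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the model assigns relatively more probability to the relevant docid than to the non-relevant one, which ties token-level generation to document-level ranking.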

URL PDF HTML ☆

赞 0 踩 0

2504.05108 2026-05-26 cs.AI cs.LG cs.NE 版本更新

Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning

利用大语言模型发现算法：进化搜索遇见强化学习

Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, Caglar Gulcehre

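Affiliations: EPFL; Apple

AI summary: Proposes continuously refining the LLM through reinforcement learning fine-tuning and combining it with evolutionary search to accelerate the discovery of better algorithms, validated on combinatorial optimization tasks.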

Comments 34 pages


AI中文摘要

发现解决复杂问题的高效算法一直是数学和计算机科学中的重大挑战，多年来需要大量人类专业知识。近期，基于大语言模型（LLMs）的进化搜索在加速跨领域算法发现方面展现出潜力，特别是在数学和优化领域。然而，现有方法将LLM视为静态生成器，错过了利用进化探索获得的信号更新模型的机会。在这项工作中，我们提出通过强化学习（RL）微调持续优化搜索算子——即LLM，从而增强基于LLM的进化搜索。我们的方法利用进化搜索作为探索策略来发现改进的算法，而RL则基于这些发现优化LLM策略。我们在组合优化任务上的实验表明，将RL与进化搜索相结合加速了更优算法的发现，展示了RL增强的进化策略在算法设计中的潜力。

英文摘要

Discovering efficient algorithms for solving complex problems has been an outstanding challenge in mathematics and computer science, requiring substantial human expertise over the years. Recent advancements in evolutionary search with large language models (LLMs) have shown promise in accelerating the discovery of algorithms across various domains, particularly in mathematics and optimization. However, existing approaches treat the LLM as a static generator, missing the opportunity to update the model with the signal obtained from evolutionary exploration. In this work, we propose to augment LLM-based evolutionary search by continuously refining the search operator - the LLM - through reinforcement learning (RL) fine-tuning. Our method leverages evolutionary search as an exploration strategy to discover improved algorithms, while RL optimizes the LLM policy based on these discoveries. Our experiments on combinatorial optimization tasks demonstrate that integrating RL with evolutionary search accelerates the discovery of superior algorithms, showcasing the potential of RL-enhanced evolutionary strategies for algorithm design.


2502.15835 2026-05-26 cs.CL cs.AI cs.SE 版本更新

Fusion Embedding for Pose-Guided Person Image Synthesis with a Diffusion Model

Donghwna Lee, Kirok Kim, Jisu Lee, Kyungha Min, Wooju Kim

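Affiliations: Department of Industrial Engineering

AI summary: Proposes the FPDM framework, which explicitly aligns fused source-pose embeddings with target image embeddings via contrastive learning and uses the learned embedding as a conditioning signal for generation, addressing texture fidelity and consistency in pose-guided person image synthesis.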


AI中文摘要

姿态引导人物图像合成（PGPIS）旨在生成指定姿态下的人物图像，同时保留源图像的身份和外观。该技术促进了多种应用，包括虚拟试穿、数字化身、动画和手语生成。尽管最近基于扩散的PGPIS取得了高质量结果，但这些模型通常依赖于去噪过程中的隐式特征聚合。因此，细粒度纹理保持有限，即使对于相同身份，也难以确保在姿态和源外观变化下生成一致性。为解决这些限制，我们提出了基于扩散模型的融合嵌入PGPIS（FPDM），这是第一个通过对比学习显式对齐融合源-姿态嵌入与目标图像嵌入，并随后使用学习到的融合嵌入作为生成条件信号的框架。FPDM将图像-姿态融合（IPF）模块集成到我们提出的源增强姿态融合方法中，以学习与目标图像对齐的融合嵌入。然后，我们采用由源外观、目标姿态和学习到的融合嵌入引导的条件扩散模型。在DeepFashion基准和RWTH-PHOENIX-Weather 2014T数据集上的实验表明，在定量和定性评估中，与现有方法相比具有竞争力的性能，消融研究证实显式融合嵌入对齐显著提高了纹理保真度以及跨姿态和源外观变化的一致性。

英文摘要

Pose-Guided Person Image Synthesis (PGPIS) aims to generate human images in specified poses while preserving the identity and appearance of a source image. This technology facilitates diverse applications, including virtual try-on, digital avatars, animation, and sign language generation. Despite the high-quality results of recent diffusion-based PGPIS, these models typically depend on implicit feature aggregation within the denoising process. As a result, fine-grained texture preservation is limited, and even for the same identity, it is difficult to ensure consistent generation under variations in pose and source appearance. To address these limitations, we propose Fusion Embedding for PGPIS using a Diffusion Model (FPDM), the first framework that explicitly aligns fused source-pose embeddings with target image embeddings via contrastive learning, and subsequently employs the learned fusion embedding as a conditioning signal for generation. FPDM integrates an Image-Pose Fusion (IPF) module into our proposed Source-Enhanced Pose Fusion approach to learn a fusion embedding aligned with the target image. We then employ a conditional diffusion model guided by source appearance, target pose, and the learned fusion embedding. Experiments on the DeepFashion benchmark and the RWTH-PHOENIX-Weather 2014T dataset demonstrate competitive performance compared to existing methods in both quantitative and qualitative evaluations, with ablation studies confirming that explicit fusion embedding alignment substantially improves texture fidelity and consistency across pose and source appearance variations.


2411.00934 2026-05-26 cs.CY cs.AI 版本更新

TRAFA: Anticipating User Actions with Predictive Feedback to Reduce Errors in Procedural Tasks

Sassan Mokhtar, Lars Doorenbos, Fatemeh Jabbari, Marius Bock, Dominik Bach, Juergen Gall

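Affiliations: University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence

AI summary: Proposes TRAFA, a system that uses a Track-Forecast-Act framework to forecast user actions in real time and trigger feedback before errors are committed; a controlled user study shows improved task accuracy and efficiency over conventional reactive feedback.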


AI中文摘要

交互式辅助系统通常在动作完成后提供反馈，支持错误恢复但无法预防错误本身。我们提出TRAFA，一种用于程序性任务的实时预测性反馈系统，在错误发生前进行干预。TRAFA通过跟踪-预测-行动框架实现预测性反馈：跟踪手和物体状态，基于场景上下文预测用户运动，并在预测动作可能违反任务约束时触发反馈。我们在顺序组装场景中实例化该流程，并通过技术基准测试和对照用户研究（与传统反应式反馈对比）进行评估。结果表明，预测性反馈在保持反馈事件数量相当的同时，提高了任务准确性和效率。这些发现将反馈时机定位为系统设计的关键维度，并展示了如何将实时预测集成到交互系统中以在错误发生前预防错误。

英文摘要

Interactive assistance systems typically provide feedback after an action has been completed, supporting error recovery but not preventing the error itself. We present TRAFA, a real-time predictive feedback system for procedural tasks that intervenes before errors are committed. TRAFA operationalizes predictive feedback through a Track-Forecast-Act framework that tracks hand and object state, forecasts user motion conditioned on scene context, and triggers feedback when a predicted action is likely to violate task constraints. We instantiate this pipeline in a sequential assembly setting and evaluate it through both technical benchmarking and a controlled user study against conventional reactive feedback. Our results show that predictive feedback improves task accuracy and efficiency while maintaining a comparable number of feedback events. These findings position feedback timing as a key dimension in system design and show how real-time anticipation can be integrated into interactive systems to prevent errors before they occur.


2605.24518 2026-05-26 cs.CL cs.AI 版本更新

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

语法引导的稀疏注意力：高效且可解释的Transformer

Spandan Pratyush

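Affiliations: Independent Researcher

AI summary: Proposes Grammatically-Guided Sparse Attention, which dynamically generates attention masks from part-of-speech tags, maintaining accuracy comparable to full attention while reducing theoretical computational overhead.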

Comments 9 pages, 2 tables Code available at https://github.com/toughthinktank/grammatically_guided_attention#


AI中文摘要

Transformer模型中自注意力的二次复杂度仍然是处理长序列和高效部署大型语言模型的主要瓶颈。为此，已有大量关于稀疏注意力的研究，Deepseek稀疏注意力结合了多种创建令牌片段的方法以降低时间复杂度。本文提出了一种新颖的方法——语法引导的稀疏注意力，它基于令牌的语法角色约束注意力计算。通过利用词性（POS）标签，动态生成注意力掩码，强制令牌之间建立语言上连贯的连接，从而在不牺牲必要语言依赖性的情况下减少计算图。提出并评估了两种掩码策略：硬掩码严格只允许预定义的语法交互，软掩码则将注意力偏向这些交互。使用类似DistilBERT的架构在SST-2情感分类任务上进行的实验表明，语法引导的稀疏注意力在保持与全注意力相当的精度的同时，显著降低了理论计算开销。初步结果显示，硬掩码的准确率为0.8200，软掩码为0.8165，与全注意力的0.8200非常接近，为构建更高效、可解释且具有语言知识的Transformer架构提供了途径。

英文摘要

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. To address this bottleneck, there has been significant research into sparse attention, and DeepSeek Sparse Attention has combined various methods of creating segments of tokens to reduce the time complexity. This paper introduces a novel approach, Grammatically-Guided Sparse Attention, which constrains attention computations based on the grammatical roles of tokens. By leveraging Parts-of-Speech (POS) tags, attention masks are dynamically generated that enforce linguistically coherent connections between tokens, reducing the computational graph without sacrificing essential linguistic dependencies. Two masking strategies are proposed and evaluated: a hard mask that strictly allows only predefined grammatical interactions, and a soft mask that biases attention towards these interactions. The experiments, conducted on the SST-2 sentiment classification task using a DistilBERT-like architecture, demonstrate that Grammatically-Guided Sparse Attention maintains comparable accuracy to full attention while significantly reducing the theoretical computational overhead. Preliminary results show accuracy values of 0.8200 for hard masking and 0.8165 for soft masking, closely matching the 0.8200 of full attention, providing a path towards more efficient, interpretable, and linguistically-informed Transformer architectures.
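
The hard and soft masking strategies can be illustrated with a small NumPy sketch. The set of allowed POS pairs, the self-attention default on the diagonal, and the soft-mask bias value are hypothetical choices for this example and are not taken from the paper or its repository.

```python
import numpy as np

# Hypothetical whitelist of (query POS, key POS) pairs allowed to interact.
ALLOWED = {("NOUN", "VERB"), ("VERB", "NOUN"), ("ADJ", "NOUN"), ("DET", "NOUN")}

def pos_mask(tags, allowed=ALLOWED):
    """Boolean matrix: mask[i, j] is True if token i may attend to token j."""
    n = len(tags)
    mask = np.eye(n, dtype=bool)          # every token may attend to itself
    for i, a in enumerate(tags):
        for j, b in enumerate(tags):
            if (a, b) in allowed:
                mask[i, j] = True
    return mask

def masked_scores(scores, mask, mode="hard", bias=2.0):
    """Apply the mask to raw attention scores before the softmax."""
    if mode == "hard":
        return np.where(mask, scores, -np.inf)   # disallowed pairs get zero weight
    return scores + bias * mask                  # soft: favor allowed pairs
```

Note that this dense version only changes which scores survive; actual compute savings would require a sparse kernel that skips the masked pairs.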


2605.24516 2026-05-26 cs.MA cs.AI 版本更新

Adaptive Punishment for Cooperation in Mixed-Motive Games

混合动机博弈中促进合作的自适应惩罚

Min Tang, Fanqi Kong, Linyuan Lü, Xue Feng

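Affiliations: University of Science and Technology of China; State Key Laboratory of General Artificial Intelligence, BIGAI

AI summary: Proposes Adaptive Punishment for Cooperation (APC), which sets punishment intensity from a dynamic punishment probability and the severity of defection, effectively promoting cooperation and reducing punishment costs in the iterated public goods game.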


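AI abstract

Mixed-motive scenarios are pervasive in real-world multi-agent interactions, where self-interested agents often defect for immediate rewards, overlooking the potential of altruistic cooperation to improve long-term returns and collective welfare. Peer punishment can deter defection, but as a costly second-order altruistic behavior, its persistent use may harm the punisher's own interests. Existing methods often struggle to apply punishment effectively to promote cooperation. To balance the effectiveness and cost of punishment, we propose Adaptive Punishment for Cooperation (APC), a decentralized method that determines punishment intensity from a dynamic punishment probability and the severity of defection. The dynamic probability greatly reduces costly and ineffective punishment while still promoting cooperation. To assess defection and its severity accurately, we use a defection-perception module whose learning is guided by game rewards. Theoretical analysis and empirical results show that APC is effective in the iterated public goods game. Empirically, APC also significantly outperforms existing baselines in sequential social dilemmas, learning rational and effective punishment strategies that promote cooperation by strategically deterring defection.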

Market Regime Council: Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

Yunhua Pei, Zerui Ge, Jin Zheng, John Cartlidge

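Affiliations: University of Bristol, UK

AI summary: Proposes Market Regime Council (MRC), a multi-agent decision system that uses Shapley values for online agent weighting, a Bayesian adaptive mixture, and regime-dependent multipliers, achieving a Sharpe ratio of 1.51 and a cumulative return of 440.1% in crypto-asset portfolio management.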

Comments 35 pages, 13 figures, preprint


AI中文摘要

用于投资组合管理的多智能体LLM决策系统仍然缺乏一种原则性的方法来跨专业智能体分配信用，在制度转变下容易受到冷启动主导的影响，并且最终分配如何形成的透明度有限。我们提出了市场制度委员会（MRC），一种合作式多智能体决策系统，它计算所有单个、成对和大联盟输出的精确Shapley信用，用于在线智能体加权。实例化为N=3个专业智能体，在每个交易周期，MRC从指数加权性能历史中重新计算基于联盟的Shapley权重，使用贝叶斯自适应混合来稳定早期阶段，应用制度依赖乘数调整智能体权威，并通过五层因果追踪记录每次再平衡。在13种加密资产和5个种子的1037个交易日中，MRC实现了1.51的夏普比率和440.1%的累计收益，在主动基准中排名第一（CR、SR和IR），并在主动方法中实现了最低的最大回撤。消融实验表明，收益来自跨联盟输出的Shapley加权集成，而非任何单一阶段。代码和演示数据包含在补充材料中。

英文摘要

Multi-agent LLM decision systems for portfolio management still lack a principled way to assign credit across specialist agents, remain vulnerable to cold-start dominance under regime shifts, and offer limited transparency into how final allocations are formed. We propose Market Regime Council (MRC), a cooperative multi-agent decision system that computes exact Shapley credits across all single, pairwise, and Grand-coalition outputs for online agent weighting. Instantiated with N=3 specialist agents, at each trading period, MRC recomputes coalition-based Shapley weights from exponentially weighted performance histories, uses a Bayesian adaptive mixture to stabilize early periods, applies regime-dependent multipliers to adjust agent authority, and records each rebalance through a five-layer causal trace. Over 1,037 trading days across 13 crypto assets and five seeds, MRC achieves a Sharpe ratio of 1.51 and a cumulative return of 440.1%, ranking first on CR, SR, and IR among active baselines and attaining the lowest MDD among active methods. Ablation results show that the gains come from Shapley-weighted integration across coalition outputs rather than from any single stage in isolation. Code and demo data are included in the supplementary material.
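
With N=3 agents there are only eight coalitions, so exact Shapley credits can be computed by direct enumeration. The sketch below implements the standard Shapley formula; the coalition value function is a placeholder supplied by the caller, and how MRC actually scores coalition outputs is not specified here.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player.

    `value` maps a frozenset of players to that coalition's payoff.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):                          # coalition sizes 0 .. n-1
            for subset in combinations(others, r):
                S = frozenset(subset)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value(S | {p}) - value(S))
        phi[p] = total
    return phi
```

The credits always sum to the grand-coalition value minus the empty-coalition value, which makes them a natural basis for normalized agent weights.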


2605.24489 2026-05-26 cs.AI q-bio.BM 版本更新

TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval

TIGER：文本引导的通用酶-反应检索

Yuhang Zhang, Keyan Ding, Peilin Chen, Han Liu, Can Lin, Ruixi Chen, Shiqi Wang, Qi Song

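Affiliations: University of Science and Technology of China; City University of Hong Kong; Zhejiang University

AI summary: Proposes TIGER, a framework that uses protein-to-text generation models to distill textual semantic knowledge, fuses it with sequence features through a dynamic gating network, and supports bidirectional enzyme-reaction retrieval with markedly better cross-task generalization and robustness.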

Comments Accepted to ACL2026

详情

AI中文摘要

酶-反应检索是计算生物学中的一个基本问题，支撑着酶表征、反应机理阐明以及代谢途径和生物催化剂的合理设计。作为一个双向任务，它涉及酶到反应和反应到酶的映射。然而，现有方法在跨任务和跨分布泛化方面表现不佳，性能对数据集分割高度敏感，且检索方向之间存在显著的不对称性。为了应对这些挑战，我们提出了TIGER，一个文本引导的通用酶-反应检索框架，利用蛋白质到文本生成模型从酶序列中提取文本语义知识，提供连接酶和生化反应的通用表示。为了确保文本语义的质量和可靠性，我们设计了一个动态门控网络，自适应地将文本派生知识与序列特征融合，从而产生更一致和信息丰富的酶表示，同时一个结构共享特征投影器将酶和反应表示对齐到统一的潜在空间中。大量实验表明，在双向检索监督下，TIGER在多种分布上显著优于最先进的基线，并展现出强大的鲁棒性和跨任务迁移能力。

英文摘要

Enzyme-reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism elucidation, and the rational design of metabolic pathways and biocatalysts. As a bidirectional task, it entails both enzyme-to-reaction and reaction-to-enzyme mapping. However, existing approaches suffer from poor generalization across tasks and distributions, with performance highly sensitive to dataset splits and substantial asymmetry between retrieval directions. To address these challenges, we present TIGER, a Text-Informed Generalized Enzyme-Reaction Retrieval framework that leverages protein-to-text generation models to distill textual semantic knowledge from enzyme sequences, providing a generalized representation that bridges enzymes and biochemical reactions. To ensure the quality and reliability of textual semantics, we design a Dynamic Gating Network that adaptively fuses text-derived knowledge with sequence features, enabling more consistent and informative enzyme representations, while a Structure-Shared Feature Projector aligns enzyme and reaction representations within a unified latent space. Extensive experiments demonstrate that, under bidirectional retrieval supervision, TIGER significantly outperforms state-of-the-art baselines across diverse distributions and exhibits strong robustness and transferability across tasks.


2605.24486 2026-05-26 cs.AI cs.CL 版本更新

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

AgentFugue：通过集体推理实现长时域任务的智能体扩展

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Tong Zhao, Xiaoxi Li, Zheng Liu, Zhicheng Dou

发表机构 * GSAI, Renmin University of China（GSAI，中国人民大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）

AI总结提出AgentFugue框架，通过共享推理中心实现多个对等智能体并行探索和选择性信息共享，无需显式角色分工或工作流编排，从而提升长时域任务性能。

详情

AI中文摘要

近期长时域智能体任务的进展主要通过更强模型、更好工具和更有效脚手架来扩展单个智能体。相比之下，对于扩展（scaling out）的理解要少得多：多个对等智能体，都针对同一任务，能否在不依赖显式角色分工或工作流编排的情况下成为额外能力来源？我们研究这个问题并提出AgentFugue，一个围绕共享推理中心构建的集体推理框架。当对等智能体并行探索同一任务时，中心记录每个智能体已建立、尝试或排除的简明笔记，并使每个智能体能够以对其当前搜索有用的形式选择性访问其他智能体的发现。这种设计将原本孤立的轨迹转变为可重用中间推理的互联生态，无需集中规划。我们将中心实例化为一个即插即用的通信层，使用监督微调和端到端强化学习进行训练。在我们研究的具有挑战性的长时域设置中，AgentFugue优于强基线。我们的结果表明，集体推理可以将对等智能体系统的扩展转变为能力增益的独特来源，而不仅仅是消耗更多计算的方式。

英文摘要

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.


2605.24484 2026-05-26 cs.AI cs.LG 版本更新

SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver

SPACE：统一对称与非对称路径问题的通用神经求解器

Rongsheng Chen, Changliang Zhou, Canhong Yu, Yuanyao Chen, Yu Zhou, Zhuo Chen, Zhenkun Wang

发表机构 * School of Automation and Intelligent Manufacturing, Southern University of Science and Technology, Shenzhen, China（自动化与智能制造学院，南方科技大学，深圳，中国）； Pengcheng Laboratory, Shenzhen, China（鹏城实验室，深圳，中国）； Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, Southern University of Science and Technology, Shenzhen, China（广东省全驱动系统控制理论与技术重点实验室，南方科技大学，深圳，中国）； College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China（计算机科学与软件工程学院，深圳大学，深圳，中国）

AI总结针对现有神经求解器在对称与非对称车辆路径问题中表现不一致的问题，提出基于空间枢轴对齐的无坐标嵌入框架SPACE，通过双向弗雷歇表示和权重解耦自适应解码机制，实现统一节点表示与解生成，在110个变体上取得优异零样本泛化。

详情

AI中文摘要

通用神经路由求解器在利用统一模型解决多种车辆路径问题（VRPs）方面显示出巨大潜力。然而，现有求解器通常局限于对称设置，或在切换到非对称设置时由于输入不一致或固有结构差异而性能下降，这严重限制了它们在包含两种场景的实际应用中的实用性。为解决这一限制，我们基于每个节点到特定枢轴集的相对距离定义其空间位置，并进一步提出一种空间枢轴对齐的无坐标嵌入（SPACE）框架，该框架统一了对称和非对称VRP中的节点表示和解生成。具体而言，我们使用一种新颖的最远枢轴采样策略构建双向弗雷歇表示，以实现跨不同问题设置的不变节点表示。此外，我们引入了一种权重分解的自适应解码机制，将几何感知从问题表示中解耦，减轻约束决策对特定几何设置的过拟合。在110个VRP变体（包括55个对称问题及其非对称对应问题）上的大量实验表明，SPACE在对称和非对称VRP中均实现了有前景的零样本泛化。

英文摘要

Generalist neural routing solvers have shown great potential in solving diverse vehicle routing problems (VRPs) with a unified model. However, existing solvers are typically limited to symmetric settings or degrade in performance when switching to asymmetric settings due to input inconsistencies or inherent structural differences, substantially limiting their practicality in real-world applications that encompass both settings. To address this limitation, we define the spatial position of each node based on the relative distances to a specific set of pivots and further propose a Spatial Pivot-Aligned Coordinate-free Embedding (SPACE) framework that unifies node representation and solution generation across symmetric and asymmetric VRPs. Specifically, we construct a bidirectional Fréchet representation using a novel furthest pivot sampling strategy to enable invariant node representations across distinct problem settings. Furthermore, we introduce a weight-decomposed adaptive decoding mechanism that decouples geometric perception from problem representations, mitigating the overfitting of constraint decisions to a specific geometry setting. Extensive experiments on 110 VRP variants, comprising 55 symmetric problems and their asymmetric counterparts, demonstrate that SPACE achieves promising zero-shot generalization in both symmetric and asymmetric VRPs.
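The idea of locating each node by its distances to a set of pivots can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the greedy farthest-point rule, the symmetrization used only while choosing pivots, and the function names are hypothetical choices, since the abstract does not give these details.

```python
import numpy as np


def farthest_pivot_sampling(dist, num_pivots, start=0):
    """Greedily pick pivots that are far from all previously chosen pivots."""
    # Assumption: symmetrize only for pivot selection so "far" is direction-agnostic.
    sym = np.maximum(dist, dist.T)
    pivots = [start]
    min_d = sym[start].copy()  # distance of every node to its nearest pivot so far
    for _ in range(num_pivots - 1):
        nxt = int(np.argmax(min_d))
        pivots.append(nxt)
        min_d = np.minimum(min_d, sym[nxt])
    return pivots


def bidirectional_embedding(dist, pivots):
    """Coordinate-free node features: distances to pivots and from pivots."""
    to_pivots = dist[:, pivots]      # shape (n, k): node -> pivot
    from_pivots = dist[pivots, :].T  # shape (n, k): pivot -> node
    return np.concatenate([to_pivots, from_pivots], axis=1)


# Toy instance: Euclidean base distances plus a random asymmetric perturbation.
rng = np.random.default_rng(0)
coords = rng.random((8, 2))
base = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
asym = base * (1.0 + 0.2 * rng.random((8, 8)))
np.fill_diagonal(asym, 0.0)

pivots = farthest_pivot_sampling(asym, num_pivots=3)
emb = bidirectional_embedding(asym, pivots)
```

Because the features are built from the distance matrix alone, the same code path handles symmetric and asymmetric instances. On a symmetric matrix the two halves of the embedding coincide.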


2605.24475 2026-05-26 cs.CV cs.AI cs.MM 版本更新

Robust Fuzzy Multi-view Learning under View Conflict

视角冲突下的鲁棒模糊多视角学习

Siyuan Duan, Yuan Sun, Dezhong Peng, Yingke Chen, Xi Peng, Peng Hu

发表机构 * College of Computer Science, Sichuan University（四川大学计算机学院）； Tianfu Jincheng Laboratory（天府锦城实验室）； School of Artificial Intelligence, Sichuan University（四川大学人工智能学院）

AI总结针对多视角分类中视角冲突问题，提出基于模糊集理论的鲁棒模糊多视角学习框架（R-FUML），通过模糊隶属度量化类别可信度、熵值融合及冲突样本惩罚机制，提升鲁棒性和不确定性估计。

详情

AI中文摘要

可信多视角分类旨在提供可靠的融合以实现准确预测，近年来在学术界和工业界引起了广泛关注。然而，现有的TMVC方法通常假设训练和测试阶段不同视角之间严格对齐，这在现实场景中往往不切实际。这一局限性促使我们重新审视TMVC并将其扩展到更具挑战性的设置：如何在训练和推理过程中减轻视角冲突（VC）的影响。针对这一设置，现有的TMVC方法存在三个关键缺陷：低估不确定性、误导性决策以及对VC的过拟合。为解决这些问题，本文提出了一种基于模糊集理论的新型鲁棒模糊多视角学习（R-FUML）框架。具体而言，R-FUML将网络输出建模为模糊隶属度以量化类别可信度，并使用基于熵的方法进行可靠的多视角融合。为此，我们提出了一种鲁棒多视角融合（RMF）策略，该策略同时考虑了视角特定的不确定性和视角间的冲突，从而减轻VC对决策的不利影响。为了在训练过程中识别并克服VC，我们进一步设计了一种针对VC的鲁棒学习（RLVC）框架。RLVC通过利用神经网络的记忆效应隔离冲突样本，然后通过对这些冲突视角施加惩罚来重新训练模型。在八个公开数据集上的大量实验表明，R-FUML在鲁棒性和不确定性估计方面始终优于15个最先进的基线方法。代码将在论文被接收后发布。

英文摘要

Trusted multi-view classification (TMVC) aims to deliver reliable fusion for accurate predictions and has recently attracted substantial attention in both academia and industry. However, existing TMVC methods typically assume strict alignment across different views during both training and testing phases, which is often impractical in real-world scenarios. This limitation motivates us to revisit TMVC and extend it to a more challenging setting: how to mitigate the impact of view conflict (VC) during both training and inference. In this setting, existing TMVC methods suffer from three critical limitations: underestimated uncertainty, misleading decisions, and overfitting to VC. To address these issues, this paper proposes a novel Robust Fuzzy Multi-View Learning (R-FUML) framework grounded in Fuzzy Set Theory. Specifically, R-FUML models network outputs as fuzzy memberships to quantify category credibility and uses an entropy-based method for reliable multi-view fusion. To this end, we present a Robust Multi-view Fusion (RMF) strategy that accounts for both view-specific uncertainty and inter-view conflicts, thereby alleviating the adverse impacts of VC on decision-making. To identify and conquer VC during training, we further design a Robust Learning Against VC (RLVC) framework. RLVC isolates conflicting samples by leveraging neural networks' memory effects and then retrains the model by applying a penalty to these conflicting views. Extensive experiments across eight public datasets demonstrate that R-FUML consistently outperforms 15 state-of-the-art baselines in robustness and uncertainty estimation. The code will be released upon acceptance.


2605.24468 2026-05-26 cs.AI 版本更新

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

SAM：面向长程推理智能体的状态自适应记忆

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Ziliang Zhao, Jiejun Tan, Zheng Liu, Zhicheng Dou

发表机构 * GSAI, Renmin University of China（GSAI，中国人民大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）

AI总结提出状态自适应记忆框架SAM，通过紧凑记忆线索和原始轨迹页面实现意图驱动的信息重建，无需重新训练基础模型，在多个基准上超越强基线。

详情

AI中文摘要

长程智能体推理要求大语言模型在包含思考、工具调用、观察和部分结论的长时间交互历史中行动。挑战不仅在于这些历史变长，而且当前决策所需的信息可能分散在遥远的步骤中，并且只在后来才变得相关。现有方法通过截断交互历史、将其压缩为更短的替代品或检索其选定部分进行重用来解决这一困难，但它们没有明确建模对过去交互的访问应如何适应智能体不断变化的状态。相反，我们将长程推理视为一个状态自适应记忆问题。为此，我们提出了状态自适应记忆（SAM），这是一个独立的框架，它将正在进行的交互整合为紧凑的记忆线索，同时保留原始轨迹页面用于意图驱动的回忆。这些线索不被视为历史的替代品；相反，它们充当轻量级句柄，使智能体能够根据当前需求重建时间上遥远的信息，而无需重新训练底层骨干网络。我们进一步通过专家引导的监督和强化学习优化记忆模块，使其与轨迹级别的效用对齐。在BrowseComp、BrowseComp-ZH、WideSearch和HLE上，SAM在各种智能体骨干网络上持续优于强基线。我们的结果表明，显式记忆建模为长程智能体推理提供了一个简单而有效的基础。

英文摘要

Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that information needed for the current decision may be scattered across distant steps and only become relevant later. Existing approaches address this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts of it for reuse, but they do not explicitly model how access to past interaction should adapt to the agent's evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for history; rather, they serve as lightweight handles that allow the agent to reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines over diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.


2605.24458 2026-05-26 cs.LG cs.AI 版本更新

Balancing Fairness, Privacy, and Accuracy: A Multitask Adversarial Framework for Centralized Data-Driven Systems

平衡公平性、隐私和准确性：面向集中式数据驱动系统的多任务对抗框架

Imesh Ekanayake, Elham Naghizade, Jeffrey Chan

发表机构 * School of Computing Technologies, RMIT University（计算技术学院，皇家墨尔本理工大学）

AI总结提出一种多任务对抗模型，将公平性和隐私作为核心目标，通过优化代价函数动态平衡三者，在最小化性能损失的同时实现高公平性和隐私保护。

Comments 13 Pages, 6 figures, IEEE TKDE

详情

AI中文摘要

在集中式数据驱动应用中，公平性和隐私的整合至关重要，尤其是当这些系统日益影响具有重大社会影响的领域时。当前方法很少同时考虑隐私、公平性和准确性，这可能会损害伦理标准和隐私法规。然而，平衡这三个目标相当具有挑战性，因为每个目标通常对模型的设计和训练提出相互冲突的要求，使得优化一个目标而不损害其他目标变得困难。本文提出了一种新颖的多任务对抗模型，将公平性和隐私视为整体目标而非事后考虑，并学习一个隐藏敏感属性同时保留任务相关信息的潜在表示。我们的方法通过优化的代价函数动态平衡公平性与准确性及隐私，即使在严格条件下也能实现最小的性能损失。在多种数据集上的广泛测试表明，我们的模型能够在不大幅牺牲准确性的情况下实现高标准的公平性和隐私。与最先进的隐私和公平标准进行基准测试表明，我们的方法增强了隐私、公平性和准确性优化的鲁棒性，证明了其在不同数据集上的适应性。

英文摘要

The integration of fairness and privacy in centralized data-driven applications is critical, especially as these systems increasingly influence sectors with significant societal impact. Current methods rarely address privacy, fairness, and accuracy together, which can potentially compromise ethical standards and privacy regulations. However, balancing these three objectives is challenging because each objective often imposes conflicting requirements on the design and training of models, making it difficult to optimize one without compromising the others. This paper introduces a novel multitask adversarial model that treats fairness and privacy as integral objectives rather than afterthoughts, and learns a latent representation that hides sensitive attributes while preserving essential task-related information. Our approach dynamically balances fairness with accuracy and privacy through an optimized cost function with minimal performance loss even under strict conditions. Extensive testing on diverse datasets shows the ability of our model to achieve high standards of fairness and privacy without significant sacrifice to accuracy. Benchmarking against state-of-the-art privacy and fairness standards shows that our method enhances the robustness of privacy, fairness, and accuracy optimization, proving its adaptability across various datasets.


2605.24453 2026-05-26 cs.SE cs.AI 版本更新

Code2UML: Agentic LLMs with context engineering for scalable software visualization

Code2UML: 基于上下文工程的可扩展软件可视化的智能体LLM

Alin-Gabriel Văduva, Anca-Ioana Andreescu, Simona-Vasilica Oprea, Adela Bâra

发表机构 * Bucharest University of Economic Studies（布加勒斯特经济大学）

AI总结提出一种基于五个专门智能体和确定性IR压缩层的智能体架构，用于从源代码仓库自动生成UML图，在12个开源仓库和7种UML图上验证了高语法有效性（平均91.5%）和结构质量（平均81.7/100），且质量不随规模下降。

详情

AI中文摘要

基于大型语言模型（LLM）的代码分析工具被用于自动化软件文档任务。然而，这些方法在真实代码库中的可扩展性——其中中间表示（IR）超过LLM上下文限制——仍未充分探索。本文介绍了一种具有上下文工程的智能体架构，用于从源代码仓库自动生成UML图。它采用基于Claude Agent SDK构建的五个专门智能体的层次结构：PlannerAgent、AnalyzerAgent、DiagramAgent、CorrectorAgent和DependencyAnalyzerAgent，每个处理不同的认知子任务。一个确定性的、重要性加权的IR压缩层将完整项目IR转换为保证适合令牌限制的特定图视图，无需LLM调用且可在毫秒内完成。因此，我们在4种编程语言（Java、JavaScript、PHP、Python）的12个开源仓库和7种UML图上评估该系统，产生了84个观察结果，并在5个自动指标上进行了评估。结果表明高语法有效性（平均91.5%，其中组件图和部署图达到100%）、强关系精度（平均0.858）和一致的结构质量（平均81.7/100，跨语言方差为3.1分）。实体召回率平均为0.313，反映了有意的架构优先级而非全面覆盖。敏感性分析（31到4,578个IR实体）证实质量分数无论规模大小都保持稳定。

英文摘要

Large Language Model (LLM)-based code analysis tools are adopted to automate software documentation tasks. However, the scalability of these approaches to real codebases, where Intermediate Representations (IR) exceed LLM context limits, remains underexplored. This paper introduces an agentic architecture with context engineering for automated UML diagram generation from source code repositories. It employs a hierarchy of five specialized agents: PlannerAgent, AnalyzerAgent, DiagramAgent, CorrectorAgent and DependencyAnalyzerAgent, built on the Claude Agent SDK, each addressing a distinct cognitive subtask. A deterministic, importance-weighted IR compaction layer transforms full project IRs into diagram-specific views guaranteed to fit within token constraints, requiring no LLM calls and completing in milliseconds. We evaluate the system across 12 open-source repositories in 4 programming languages (Java, JavaScript, PHP, Python) and 7 UML diagram types, producing 84 observations assessed on 5 automated metrics. Results demonstrate high syntactic validity (mean: 91.5%, with component and deployment diagrams reaching 100%), strong relationship precision (mean: 0.858) and consistent structural quality (mean: 81.7/100, with cross-language variance of 3.1 points). Entity recall averaged 0.313, reflecting deliberate architectural prioritization over exhaustive coverage. A sensitivity analysis (31 to 4,578 IR entities) confirms that quality scores remain stable regardless of scale.
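One way to picture a deterministic, importance-weighted compaction step is a greedy selection under a token budget, sketched below. The `IREntity` fields, the scoring values, and the greedy rule are hypothetical; the abstract states only that the layer is deterministic, importance-weighted, and makes no LLM calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IREntity:
    name: str          # hypothetical: class or module name in the project IR
    importance: float  # hypothetical: precomputed importance score
    tokens: int        # hypothetical: estimated token cost of including it


def compact_ir(entities, token_budget):
    """Keep the most important entities that fit within the token budget."""
    # Sorting by (-importance, name) makes the result deterministic on ties.
    ranked = sorted(entities, key=lambda e: (-e.importance, e.name))
    kept, used = [], 0
    for entity in ranked:
        if used + entity.tokens <= token_budget:
            kept.append(entity)
            used += entity.tokens
    return kept, used


entities = [
    IREntity("OrderService", 0.9, 400),
    IREntity("PaymentGateway", 0.8, 700),
    IREntity("Order", 0.7, 200),
    IREntity("StringUtils", 0.2, 300),
]
view, used_tokens = compact_ir(entities, token_budget=1000)
```

A selection of this kind trades recall for a guaranteed fit. That is consistent with the low entity recall the abstract reports, although whether the actual system works this way is not stated.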


2605.24452 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions

法律判决预测中的时间概念漂移：跨越乌克兰法院判决三个时期的神经基线

Volodymyr Ovcharov

AI总结通过微调四种Transformer编码器在乌克兰法院三个时期（战前、混合战争、全面入侵）的判决上，研究法律语言的时间漂移，发现前向性能严重下降（最多27.2个百分点），法律领域预训练不能提升绝对性能但能减轻漂移，时序持续学习可消除灾难性遗忘。

Comments 17 pages, 6 tables, 5 figures. Dataset: https://huggingface.co/datasets/overthelex/ukrainian-court-decisions

详情

AI中文摘要

法律NLP基准测试在随机分割的数据上评估模型，隐含假设法律语言是平稳的。我们通过微调四种Transformer编码器——XLM-RoBERTa（base和large）及其法律领域变体——在地缘政治事件定义的三个时间时期的乌克兰法院判决上测试这一假设：战前（2008-2013）、混合战争（2014-2021）和全面入侵（2022-2026）。每个模型在一个时期上训练，并在所有三个时期上评估，产生一个3x3的跨时间泛化矩阵。四个发现出现。（1）前向退化严重：在战前数据上训练的模型应用于全面入侵时期判决时，宏F1最多下降27.2个百分点。（2）退化不对称：后向迁移（全面入侵到战前）比前向迁移稳健得多，与法律语言是加性的假设一致。（3）法律领域预训练（Legal-XLM-R）不提升绝对性能，但减少前向退化的幅度和不对称性。（4）时序持续学习消除了通用XLM-R的灾难性遗忘：战前知识完全保留（+1.8至+6.2个百分点），而全面入侵性能提升+16.5至+19.0个百分点；逆时序训练导致严重遗忘。跨司法管辖区在瑞士判决预测数据上的预训练提升绝对性能，但不减少时间退化幅度，确认时间漂移是法律语言演化的内在属性。数据集（三个时期共428K判决）作为LEXTREME贡献公开可用。

英文摘要

Legal NLP benchmarks evaluate models on randomly split data, implicitly assuming that legal language is stationary. We test this assumption by fine-tuning four transformer encoders -- XLM-RoBERTa (base and large) and their legal-domain variants -- on Ukrainian court decisions from three temporal epochs defined by geopolitical disruptions: pre-war (2008-2013), hybrid war (2014-2021), and full-scale invasion (2022-2026). Each model is trained on one epoch and evaluated on all three, producing a 3x3 cross-temporal generalization matrix. Four findings emerge. (1) Forward degradation is severe: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 when applied to full-scale invasion era decisions. (2) The degradation is asymmetric: backward transfer (full-scale to pre-war) is substantially more robust than forward transfer, consistent with the hypothesis that legal language is additive. (3) Legal-domain pretraining (Legal-XLM-R) does not improve absolute performance but reduces forward degradation magnitude and asymmetry. (4) Chronological continual learning eliminates catastrophic forgetting for general XLM-R: pre-war knowledge is fully retained (+1.8 to +6.2 pp) while full-scale performance gains +16.5 to +19.0 pp; reverse-chronological training causes severe forgetting. Cross-jurisdictional pretraining on Swiss Judgment Prediction data improves absolute performance but does not reduce temporal degradation magnitude, confirming that temporal drift is an intrinsic property of legal language evolution. The dataset (428K decisions across three epochs) is publicly available as a LEXTREME contribution.
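The forward/backward asymmetry described above can be read off a 3x3 train-by-evaluate matrix, as in the sketch below. The macro-F1 values are made-up placeholders rather than numbers from the paper, and defining degradation as the drop relative to the in-epoch score is an assumption about how such a comparison might be set up.

```python
import numpy as np


def transfer_degradation(matrix):
    """Mean forward and backward drop relative to in-epoch performance.

    Rows are training epochs and columns are evaluation epochs, both in
    chronological order.
    """
    m = np.asarray(matrix, dtype=float)
    drop = np.diag(m)[:, None] - m                   # in-epoch score minus transfer score
    forward = drop[np.triu_indices_from(m, k=1)]     # train earlier, evaluate later
    backward = drop[np.tril_indices_from(m, k=-1)]   # train later, evaluate earlier
    return forward.mean(), backward.mean()


# Placeholder macro-F1 scores: pre-war, hybrid war, full-scale invasion.
scores = [
    [0.80, 0.70, 0.55],
    [0.75, 0.82, 0.68],
    [0.78, 0.79, 0.84],
]
forward_drop, backward_drop = transfer_degradation(scores)
```

In this toy matrix the forward drop exceeds the backward drop, which is the pattern the abstract calls asymmetric degradation.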


2605.24425 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Momentum Streams for Optimizer-Inspired Transformers

动量流：优化器启发的Transformer

Jingchu Gai, Nai-Chieh Huang, Jiayun Wu

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结提出一类优化器启发的Transformer（如三重动量TMMFormer），通过将残差更新解释为优化器步骤，发现动量是性能提升的关键，能收敛到更平坦的极小值，减少遗忘并改善泛化。

2605.24423 2026-05-26 cs.AI 版本更新

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

临时团队协作中上下文强化学习的极限基准测试

Yuheng Jing, Kai Li, Ziwen Zhang, Jiajun Zhang, Zeyao Ma, Jiaxi Yang, Lei Zhang, Zhe Wu, Jinmin He, Junliang Xing, Jian Cheng

发表机构 * C²DL, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所C²DL实验室）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； School of Future Technology, University of Chinese Academy of Sciences（中国科学院大学未来技术学院）； Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences（中国科学院深圳先进技术研究所）； Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）； University of Science and Technology of China（中国科学技术大学）； Qwen Team, Alibaba Group（阿里集团Qwen团队）

AI总结提出ICRL4AHT基准，基于Overcooked-V2评估上下文强化学习在临时团队协作中的表现，发现算法在未见队友和布局下常不如随机基线，凸显多智能体环境下的适应挑战。

Comments 41 pages, 14 figures

详情

AI中文摘要

上下文强化学习（ICRL）使基础智能体能够即时适应新任务，但其在需要与未知伙伴协调的临时团队协作（AHT）中的有效性尚未被探索。为严格评估这一点，我们引入了一个大规模基准ICRL4AHT，基于高吞吐量JAX实现的Overcooked-V2构建。我们的基准包括一个大型、多样化的队友套件，涵盖RL和启发式策略，支持可控的训练-测试转移，并提供了一个可复现的端到端流水线，用于队友生成、学习历史收集、数据集构建和在线多回合评估。我们评估了代表性的历史条件ICRL算法，包括算法蒸馏（AD）和决策预训练Transformer（DPT），跨越数百万次转移。结果揭示了显著的局限性：与它们在单智能体领域的成功相反，这些基线在多智能体设置中未能展现出稳健的测试时适应。具体来说，这些方法在未见队友和未见布局轨迹上经常表现不如随机基线，并且在长时间跨度内没有明显的上下文改进。这些发现凸显了在OvercookedV2 AHT协议下部分可观测性中战略推理的挑战，将我们的基准确立为下一代协调算法的关键测试平台。

英文摘要

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmark, ICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the Overcooked-V2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.


2605.24420 2026-05-26 cs.LG cs.AI 版本更新

Batch Normalization Amplifies Memorization and Privacy Risks

批归一化加剧记忆化和隐私风险

Ngoc Phu Doan, Chongyan Gu, Ihsen Alouani

发表机构 * Queen’s University Belfast（贝尔法斯特女王大学）

AI总结本文通过实证和理论分析，发现批归一化层会显著增加模型对异常样本的记忆化，从而加剧隐私泄露风险。

详情

AI中文摘要

批归一化（BN）被广泛采用以加速深度神经网络的收敛并实现更稳定的训练。然而，其对隐私和记忆化的影响在很大程度上尚未被探索。在这项工作中，我们研究了BN层对非典型或异常样本记忆化的影响及其对隐私泄露的启示。我们使用三种互补方法进行了广泛的实证研究：（i）对分布外训练样本的无意记忆化，（ii）通过梯度范数测量的每个样本影响，以及（iii）对成员推断攻击（MIA）的敏感性。跨多个数据集和架构，我们一致观察到，与没有BN的模型相比，BN显著增加了对异常值的记忆化。关键的是，这种放大的记忆化直接转化为隐私漏洞：具有BN的模型对MIA表现出显著更高的敏感性。我们通过理论分析补充了实证结果，表明BN在训练过程中放大了异常样本的每步影响，为这一现象提供了机制性见解。我们的结果突显了与BN相关的被低估的隐私风险，并为归一化层如何放大罕见或敏感训练样本的影响提供了实践和理论见解。

英文摘要

Batch Normalization (BN) is widely adopted to enable faster convergence and more stable training of deep neural networks. However, its impact on privacy and memorization has remained largely unexplored. In this work, we investigate the effect of BN layers on the memorization of atypical or outlier samples and its implications for privacy leakage. We conduct an extensive empirical study using three complementary approaches: (i) unintended memorization of out-of-distribution training samples, (ii) per-sample influence measured via gradient norms, and (iii) susceptibility to membership inference attacks (MIA). Across multiple datasets and architectures, we consistently observe that BN substantially increases the memorization of outliers compared to models without BN. Critically, this amplified memorization translates directly into privacy vulnerabilities: models with BN exhibit significantly higher susceptibility to MIAs. We complement our empirical findings with a theoretical analysis showing that BN amplifies the per-step influence of outlier samples during training, providing mechanistic insight into this phenomenon. Our results highlight an underappreciated privacy risk associated with BN and provide both practical and theoretical insights into how normalization layers can amplify the influence of rare or sensitive training examples.


2605.24414 2026-05-26 cs.AI 版本更新

JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

JT-SAFE-V2：具有世界上下文数据的安全设计基础模型

Junlan Feng, Fanyu Meng, Chong Long, Pengyu Cong, Duqing Wang, Yan Zheng, Yuyao Zhang, Xuanchang Gao, Ye Yuan, Yunfei Ma, Zhijie Ren, Fan Yang, Na Wu, Di Jin, Chao Deng

发表机构 * JIUTIAN Research（九天研究院）

AI总结提出JT-Safe-V2大语言模型，通过世界知识预训练、高确定性训练和安全强化后训练实现通用智能与安全设计的联合优化，并引入Safe-MoMA框架降低推理成本，在通用智能和安全基准上达到最优性能。

详情

AI中文摘要

我们介绍了JT-Safe-V2，这是一个旨在提升基础模型安全性和可信度的大型语言模型，将我们之前的JT-Safe模型扩展为更全面的安全设计范式。JT-Safe-V2通过几个关键创新强调通用智能与安全设计的联合优化：用上下文世界知识丰富预训练数据、高确定性预训练程序，以及面向企业级代理能力的安全强化后训练机制。在这些安全增强的基础模型基础上，我们提出了Safe-MoMA（安全模型与代理混合），这是一个通过协调部署多个模型和代理实现可追溯高效推理的框架。广泛评估表明，JT-Safe-V2在通用智能和安全基准上均达到了最先进性能。此外，与使用最大的独立模型基线相比，Safe-MoMA在保持相当性能的同时将推理成本降低了30%以上。为了促进未来安全设计基础模型的研究，我们公开发布了后训练的JT-Safe-V2-35B模型检查点。

英文摘要

We introduce JT-Safe-V2, a large language model designed to advance the safety and trustworthiness of foundation models, extending our previous JT-Safe model toward a more comprehensive safety-by-design paradigm. JT-Safe-V2 emphasizes the joint optimization of general intelligence and safety-by-design through several key innovations: enriching pre-training data with contextual world knowledge, high-certainty pre-training procedures, and safety-strengthening post-training mechanisms for enterprise-oriented agentic capabilities. Building on these safety-enhanced foundation models, we propose Safe-MoMA (Safe Mixture of Models and Agents), a framework that enables traceable and efficient inference through the orchestrated deployment of multiple models and agents. Extensive evaluations demonstrate that JT-Safe-V2 achieves state-of-the-art performance across both general intelligence and safety benchmarks. Moreover, Safe-MoMA reduces inference costs by more than 30% compared to using the largest standalone model baseline while maintaining comparable performance. To facilitate future research on safety-by-design foundation models, we publicly release the post-trained JT-Safe-V2-35B model checkpoint.


2605.24411 2026-05-26 cs.AI cs.LG 版本更新

The Model Is Not the Product: A Dual-Pillar Architecture for Local-First Psychological Coaching

模型并非产品：面向本地优先心理辅导的双支柱架构

Alexander Mihalcea

发表机构 * iOS application（iOS应用）

AI总结本文提出Psych LM，一种基于本地优先架构的iOS应用，通过自动记忆语料库和检索增强生成实现近无限上下文窗口，在移动设备上提供可靠的上下文感知心理辅导。

Comments 10 pages, 3 figures

详情

AI中文摘要

现有语言模型应用难以满足情感导向支持的需求，主要原因是它们无法在会话间维持深度、持久的上下文。本报告介绍了Psych LM，一款iOS应用，验证了对于此类应用，周围架构至关重要的论点。Psych LM在专为行为和生活辅导应用设计的本地优先运行时中运行本地设备端语言模型。该系统通过一个自动化的、用户可检查的记忆语料库实现了接近无限上下文窗口的实际效果，该语料库将对话转换为结构化的记忆卡片，包括事实、目标和事件，并通过语义和向量搜索动态注入提示中。因此，该系统可定义为一种主动学习、检索增强生成、设备端架构。该架构提供了四个主要贡献：以隐私为核心属性的本地优先设计；用于持久存储关键用户信息的记忆语料库的详细描述；提供独立于模型内部状态的稳定行为骨架的确定性编排层；以及专注于在现实操作条件下评估集成系统可靠性的基准框架。研发过程证实，通过优先考虑架构控制和资源管理而非简单模型大小，可以在移动环境的严格约束下可靠地实现复杂的上下文感知交互。

英文摘要

Existing language model applications struggle to meet the demand for emotionally oriented support, primarily due to their inability to maintain deep, persistent context across sessions. This report introduces Psych LM, an iOS application that validates the thesis that, for such applications, the surrounding architecture is paramount. Psych LM runs a local, on-device language model within a purpose-built, local-first runtime designed for behavioral and life-coaching applications. The system achieves the practical effect of a near-infinite context window through an automated, user-inspectable memory corpus that converts conversations into structured memory cards, including facts, goals, and events, and dynamically injects them into the prompt via semantic and vector search. As such, the system can be defined as an active-learning, retrieval-augmented generative, on-device architecture. This architecture delivers four primary contributions: a local-first design where privacy is a core property; a detailed description of the memory corpus for persistent context of key user information; a deterministic orchestration layer that provides a stable behavioral spine independent of the model's internal state; and a benchmark framework focused on evaluating the integrated system's reliability under realistic operating conditions. The R&D process confirms that complex, context-aware interaction can be reliably achieved under the strict constraints of a mobile environment by prioritizing architectural control and resource management over simple model size.
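The memory-card retrieval described above can be sketched as a cosine-similarity lookup followed by prompt assembly. Everything here is illustrative: the `MemoryCard` fields, the toy three-dimensional embeddings, and the prompt layout are hypothetical, and the report's actual on-device pipeline is not described at this level of detail.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MemoryCard:
    kind: str              # "fact", "goal", or "event"
    text: str
    embedding: np.ndarray  # hypothetical: produced by some on-device embedder


def top_k_cards(query_vec, cards, k=2):
    """Return the k cards whose embeddings are most cosine-similar to the query."""
    mat = np.stack([c.embedding for c in cards])
    sims = mat @ query_vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-sims)[:k]
    return [cards[i] for i in order]


def build_prompt(user_message, cards):
    """Inject the retrieved cards ahead of the user's message."""
    memory = "\n".join(f"[{c.kind}] {c.text}" for c in cards)
    return f"Known context:\n{memory}\n\nUser: {user_message}"


cards = [
    MemoryCard("goal", "Wants to sleep before midnight", np.array([1.0, 0.0, 0.0])),
    MemoryCard("fact", "Works as a nurse", np.array([0.0, 1.0, 0.0])),
    MemoryCard("event", "Slept badly last Tuesday", np.array([0.9, 0.1, 0.0])),
]
picked = top_k_cards(np.array([1.0, 0.0, 0.0]), cards, k=2)
prompt = build_prompt("I was up late again.", picked)
```

Only the retrieved cards enter the prompt, so context length stays bounded however large the stored corpus grows.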


2605.24410 2026-05-26 cs.AI 版本更新

Advancing Graph Few-Shot Learning via In-Context Learning

通过上下文学习推进图少样本学习

Renchu Guan, Yajun Wang, Chunli Guo, Bowen Cao, Fausto Giunchiglia, Wei Pang, Yonghao Liu, Xiaoyue Feng

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； College of Software, Jilin University（吉林大学软件学院）； Department of Computer Science and Technology, Yanbian University（延边大学计算机科学与技术系）； Department of Information Engineering and Computer Science, University of Trento（特伦托大学信息工程与计算机科学系）； School of Mathematical and Computer Sciences, Heriot-Watt University（赫瑞-瓦特大学数学与计算机科学学院）

AI总结提出VISION模型，将图少样本学习重构为免微调的序列推理问题，利用无监督任务生成器从无标签数据中构建伪任务，通过上下文感知网络融合局部拓扑和全局任务依赖，实现高效推理。

Comments KDD26

详情

DOI: 10.1145/3770855.3817797

AI中文摘要

图少样本学习旨在仅用少量标注样本对来自新类别的节点进行分类，是图学习中广泛研究的问题。然而，现有方法常面临两个关键限制。首先，主流的图少样本学习范式依赖于监督任务，未能利用图中大量的无标签节点。其次，许多方法在推理时需要复杂的任务适应或微调，限制了其效率和适用性。受大语言模型强大的上下文学习能力启发，我们提出了一种名为VISION的新模型，通过上下文学习推进图少样本学习，以应对这些挑战。我们的模型将图少样本学习重构为免微调的序列推理问题。其核心是一个上下文感知网络，该网络使用角色嵌入初始化节点，并采用双上下文融合模块协同整合局部拓扑结构和全局任务级依赖关系。这使得我们的模型能够在单次前向传播中，根据支持集上下文动态地为查询集生成类别感知表示。为了有效训练我们的模型，我们引入了一个无监督任务生成器，该生成器创建结构自适应特征，并从大量无标签数据中构建多样的伪任务。我们的方法将无监督元学习与图上下文学习统一起来，实现了高效推理。在多个基准数据集上的大量实验证明了我们模型的优越性。我们的公开代码可在以下网址找到。

英文摘要

Graph few-shot learning, which aims to classify nodes from novel classes with only a few labeled examples, is a widely studied problem in graph learning. However, existing methods often face two key limitations. First, the predominant graph few-shot learning paradigm relies on supervised tasks, failing to leverage the vast number of unlabeled nodes in the graph. Second, many approaches require complex task adaptation or fine-tuning during inference, limiting their efficiency and applicability. Inspired by the powerful in-context learning capabilities of large language models, we propose a novel model named VISION for adVancIng graph few-Shot learning via In-cOntext LearNing to address these challenges. Our model reframes graph few-shot learning as a fine-tuning-free sequence reasoning problem. At its core is a context-aware network that initializes nodes with role embeddings and employs a dual-context fusion module to synergistically integrate local topological structures and global task-level dependencies. This allows our model to dynamically generate class-aware representations for the query set conditioned on the support set context in a single forward pass. To effectively train our model, we introduce an unsupervised task generator that creates structure-adaptive features and constructs diverse pseudo-tasks from abundant unlabeled data. Our method unifies unsupervised meta-learning with graph in-context learning, achieving efficient inference. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our model. Our public code can be found


2605.24405 2026-05-26 cs.LG cs.AI 版本更新

Generative OOD-regularized Model-based Policy Optimization

生成式OOD正则化的基于模型的策略优化

Aysin Tumay, Jiahe Huang, Elise Jortberg, Rose Yu

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）； Abiomed

AI总结提出GORMPO算法，利用生成式密度估计在稀疏状态-动作空间中限制策略更新到高密度区域，以解决离线强化学习中的分布外动作问题，并在真实医疗数据集和离线RL数据集上优于基线方法。

AI中文摘要

我们研究使用离线强化学习的序贯决策。传统离线RL策略在训练仅依赖稀疏离线表示时可能导致分布外（OOD）动作。为确保在稀疏状态-动作空间中的安全离线策略，我们探索如何将密度估计模型集成到基于模型的RL方法中以避免OOD区域。生成式模型能够显式建模稀疏状态-动作空间中的密度。基于此，我们引入生成式OOD正则化的基于模型的策略优化（GORMPO），一种密度正则化的离线RL算法，使用生成式密度建模将策略更新限制在数据集的高密度区域。此外，我们考察更好的OOD检测是否对应更好的基于模型的离线策略。我们比较了（1）各种密度估计器的OOD检测能力，以及（2）它们在GORMPO框架内在真实医疗数据集和稀疏离线RL数据集上的性能。我们在温和假设下理论上保证了GORMPO的性能。实验上，GORMPO在真实医疗数据集上比最先进的基线方法提升17%，并在离线RL数据集上增强了基础模型。我们的实证发现表明，在动态稳定的环境中，更好的OOD检测通常导致改进的策略，而当动态不确定时，带有保守惩罚的较差密度估计更受青睐。

英文摘要

We study sequential decision-making with offline reinforcement learning (RL). Traditional offline RL policies may result in out-of-distribution (OOD) actions when training relies only on sparse offline representations. To ensure safe offline policies in a sparse state-action space, we explore how density estimation models can be integrated into model-based RL methods to avoid the OOD regions. Generative models are capable of explicitly modeling the density in sparse state-action spaces. Building on this, we introduce Generative OOD-regularized Model-based Policy Optimization (GORMPO), a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas of the dataset. Furthermore, we examine whether better OOD detection corresponds to better model-based offline policies. We compare (1) the OOD detection capabilities of various density estimators and (2) their performance within the GORMPO framework on a real-world medical dataset and sparse offline RL datasets. We theoretically guarantee GORMPO's performance under mild assumptions. Empirically, GORMPO outperforms state-of-the-art baselines by 17% on a real-world medical dataset and enhances the base model on the offline RL datasets. Our empirical findings show that better OOD detection generally results in improved policies in environments with stable dynamics, while conservative penalties with poor density estimation are favored when dynamics are uncertain.
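
The density-regularization idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a diagonal Gaussian as a stand-in for the generative density estimator, and the names `log_p_min` (density threshold) and `lam` (penalty weight) are hypothetical. Rewards are reduced whenever a state-action vector falls below the density threshold, which steers policy updates toward high-density regions of the dataset.

```python
import math

def fit_diag_gaussian(samples):
    """Fit per-dimension mean and variance to a list of state-action vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(d)]
    var = [max(sum((x[j] - mean[j]) ** 2 for x in samples) / n, 1e-6)
           for j in range(d)]
    return mean, var

def log_density(x, mean, var):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def penalized_reward(reward, x, mean, var, log_p_min, lam=1.0):
    """Subtract a penalty proportional to how far log p(x) falls below log_p_min."""
    shortfall = max(0.0, log_p_min - log_density(x, mean, var))
    return reward - lam * shortfall

# Toy offline dataset of (state, action) vectors (illustrative values only).
dataset = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.05], [0.05, -0.1]]
mean, var = fit_diag_gaussian(dataset)
# Hypothetical threshold: the lowest log-density seen in the dataset itself.
log_p_min = min(log_density(x, mean, var) for x in dataset)

in_dist = penalized_reward(1.0, [0.0, 0.0], mean, var, log_p_min)
out_dist = penalized_reward(1.0, [3.0, 3.0], mean, var, log_p_min)
```

In this toy setup the in-distribution point keeps its full reward, while the far-away point is penalized heavily. A real system would replace the Gaussian with a learned generative density model.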

2605.24398 2026-05-26 cs.CV cs.AI cs.GR 版本更新

VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation

VectorArk: 学习基于圆角多边形表示的实际图像矢量化

Tarun Gehlaut, Difan Liu, Charu Bansal, Krutik Malani, Souymodip Chakraborty, Ankit Phogat, Matthew Fisher, Vineet Batra

发表机构 * Adobe

AI总结提出VectorArk模型，采用圆角多边形表示和退化模型，实现鲁棒且实用的图像矢量化，在多个数据集上取得优越的几何完整性和伪影抑制效果。

Comments CVPR 2026. Project page: https://vectorark.github.io/

AI中文摘要

近期基于视觉-语言模型（VLM）的方法在图像矢量化任务上取得了令人印象深刻的结果。然而，它们通常在合成基准上进行评估，其中干净的SVG以高分辨率光栅化，然后重新矢量化。因此，这些方法在真实场景中泛化能力较差，例如图像具有未知的光栅化方法或由文本到图像模型生成。我们引入了VectorArk，一种新的基于VLM的模型，旨在实现鲁棒且实用的图像矢量化。VectorArk采用了一种新颖的圆角多边形表示，简化了学习过程，同时自然地生成平滑、视觉上吸引人的基元。我们还提出了一种退化模型，增强了在多样且不完美输入下的鲁棒性。我们的实验表明，与先前方法相比，VectorArk在多个数据集上实现了优越的几何完整性和伪影抑制，全面的消融实验验证了每个组件的贡献。

英文摘要

Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models. We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs. Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.

2605.24396 2026-05-26 cs.AI 版本更新

Understanding and Mitigating Premature Confidence for Better LLM Reasoning

理解并缓解过早自信以提升大语言模型推理能力

Jingchu Gai, Guanning Zeng, Christina Baek, Chen Wu, J. Zico Kolter, Andrej Risteski, Aditi Raghunathan

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Tsinghua University（清华大学）

AI总结针对大语言模型长思维链中逻辑跳跃和过早自信问题，提出渐进式自信塑造强化学习目标，无需外部标签或奖励模型，通过奖励逐步自信增长并惩罚过早承诺，显著提升推理准确性和质量。

AI中文摘要

当前语言模型的长思维链（CoT）经常包含逻辑缺口和不合理的跳跃，限制了额外测试时计算带来的收益。直接提升推理质量需要过程奖励模型，但训练它们所需的步骤级标注昂贵且稀缺。我们在模型推理过程中自信度的演化中发现了一个信号：过早自信，即倾向于过早承诺答案并用剩余标记为其辩护，这强烈预测了跨任务和模型规模的推理缺陷。我们利用这一点提出了渐进式自信塑造，这是一种强化学习目标，训练模型在推理过程中更新自信度而非过早承诺——奖励逐步自信增长并惩罚过早承诺，无需外部标签或奖励模型。该方法在算术（Countdown）、数学（DAPO、AIME）和科学（ScienceQA）任务上，从1.5B到8B参数规模均提升了准确率和推理质量：在Countdown上，准确率提升3.2倍（+42.0个百分点），缺陷推理下降48个百分点；在AIME上，Pass@64提升6.6个百分点。与该机制一致，该方法还提升了忠实度：在安全基准上，我们的模型更透明地在其推理轨迹中暴露误导性内容而非隐藏它。控制实验表明，问题及其解决方案共同扩展：过早自信随模型规模和任务难度增长，而解决它带来的收益也随之增长。

英文摘要

Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving reasoning quality directly would require process reward models, but the step-level annotations needed to train them are expensive and scarce. We find such a signal in how the model's confidence evolves during reasoning: premature confidence, the tendency to commit to an answer early and use the remaining tokens to rationalize it, strongly predicts flawed reasoning across tasks and model scales. We exploit this in progressive confidence shaping, a reinforcement learning objective that trains models to update their confidence as they reason rather than commit early -- rewarding gradual confidence growth and penalizing early commitment, with no external labels or reward models. The method improves accuracy and reasoning quality from 1.5B to 8B parameters across arithmetic (Countdown), math (DAPO, AIME), and science (ScienceQA): on Countdown, accuracy improves 3.2x (+42.0pp) and flawed reasoning drops 48pp; on AIME, Pass@64 improves 6.6pp. Consistent with this mechanism, the method also improves faithfulness: on a safety benchmark, our models more transparently surface misleading content in their reasoning traces rather than concealing it. Controlled experiments reveal that the problem and its remedy scale together: premature confidence grows with model size and task difficulty, and so do the gains from addressing it.
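
The abstract does not give the form of the shaping objective, so the following is only a hypothetical illustration of rewarding gradual confidence growth and penalizing early commitment. The function name `shaping_bonus` and the parameters `early_frac`, `early_cap`, and `max_step` are assumptions introduced here, not taken from the paper.

```python
def shaping_bonus(confidences, early_frac=0.3, early_cap=0.5, max_step=0.2):
    """Hypothetical shaping term over a per-step confidence trace in [0, 1].

    Returns 0 for a trace that starts uncertain and grows gradually,
    and a negative value for traces that commit early or jump abruptly.
    """
    n_early = max(1, int(len(confidences) * early_frac))
    # Penalize confidence above early_cap during the early part of the trace.
    early_penalty = sum(max(0.0, c - early_cap)
                        for c in confidences[:n_early]) / n_early
    # Penalize jumps larger than max_step between consecutive steps.
    jumps = [b - a for a, b in zip(confidences, confidences[1:])]
    jump_penalty = sum(max(0.0, j - max_step) for j in jumps)
    return -(early_penalty + jump_penalty)

gradual = [0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9]        # confidence builds up
premature = [0.9, 0.92, 0.93, 0.94, 0.95, 0.95, 0.96]  # commits at step one

gradual_bonus = shaping_bonus(gradual)
premature_bonus = shaping_bonus(premature)
```

Under these assumed settings the gradual trace receives no penalty and the premature trace receives a negative bonus. In an RL setup such a term would be added to the task reward.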

2605.24381 2026-05-26 cs.LG cs.AI stat.AP stat.ML 版本更新

Assessing the Operational Viability of Foundation Models for Time Series Forecasting

评估基础模型在时间序列预测中的操作可行性

Kavin Soni, Debanshu Das, Vamshi Guduguntla

发表机构 * Google, USA（谷歌公司，美国）

AI总结通过对比基础模型与监督学习方法在四种操作场景下的性能，提出基于经验特征的复杂度路由器以实现精度与效率的平衡。

Comments 21 pages, 8 Figures, Code available at [https://github.com/kavin-soni/timeseries-zeroshot-eval]

AI中文摘要

时间序列预测驱动着金融、交通和能源等领域的操作决策。虽然监督学习方法表现出色，但它们需要特定领域的训练、特征工程和持续维护。大规模基础模型最近作为一种零样本替代方案出现，像LLM一样避免了任务特定训练。在这项工作中，我们评估了基础模型与标准监督方法的对比。我们不仅关注总体精度，还分析了四种操作场景下的性能：周期性人机系统、物理约束过程、随机金融市场和异构需求预测。我们的结果描述了最优部署区域。基础模型在具有可迁移周期结构的领域中表现良好，并且对于冷启动或长尾场景效率高。相反，监督专家在受严格物理约束的系统中保持更高的精度。在金融领域，较新的基础模型正在迅速缩小与监督专家的性能差距。我们进一步量化了推理延迟、数据漂移适应性和部署约束之间的权衡。最后，我们提出了一个复杂度路由器，它利用经验特征将每个序列分配给最优模型类别。我们证明，与部署通用基础模型相比，这种选择性路由实现了更高的精度和显著更低的推理成本，为平衡泛化性和效率提供了一个实用框架。

英文摘要

Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approaches achieve strong performance, they require domain-specific training, feature engineering, and ongoing maintenance. Large-scale foundation models have recently emerged as a zero-shot alternative, avoiding task-specific training much like LLMs. In this work, we evaluate foundation models against standard supervised approaches. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human-centric systems, physically constrained processes, stochastic financial markets, and heterogeneous demand forecasting. Our results characterize optimal deployment areas. Foundation models perform well in domains with transferable periodic structures and are efficient for cold-start or long-tail scenarios. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints. In financial domains, newer foundation models are rapidly closing the performance gap with supervised specialists. We further quantify trade-offs in inference latency, data drift adaptability, and deployment constraints. Finally, we propose a Complexity Router that assigns each series to the optimal model class using empirical features. We demonstrate that this selective routing achieves higher accuracy and significantly lower inference costs compared to deploying a universal foundation model, providing a practical framework for balancing generalization and efficiency.
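
A router of the kind described could look roughly like the sketch below. The abstract does not list the empirical features used, so the seasonal-lag autocorrelation feature, the thresholds, and the `route` function are all assumptions made for illustration. The sketch sends short (cold-start) or strongly periodic series to a foundation model and everything else to a supervised specialist.

```python
import math

def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def route(series, season=24, min_history=48, threshold=0.6):
    """Hypothetical routing rule; returns 'foundation' or 'supervised'."""
    if len(series) < min_history:
        return "foundation"  # cold start: too little history for a specialist
    if autocorr(series, season) >= threshold:
        return "foundation"  # strong, transferable periodic structure
    return "supervised"

periodic = [math.sin(2 * math.pi * t / 24) for t in range(240)]
short = periodic[:10]
# Deterministic pseudo-noise with no 24-step seasonality.
noise = [((t * 7919) % 101) / 101 for t in range(240)]

decisions = {name: route(s) for name, s in
             [("periodic", periodic), ("short", short), ("noise", noise)]}
```

A production router would likely use richer features and learned thresholds; the point is only that routing is a cheap per-series decision made before any forecasting model runs.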

2605.24375 2026-05-26 cs.AI 版本更新

Distilling Game Code World Model Generation into Lightweight Large Language Models

将游戏代码世界模型生成蒸馏到轻量级大型语言模型

Tyrone Serapio, Arjun Prakash, Haoyang Xu, Kevin Wang, Amy Greenwald

发表机构 * Brown University（布朗大学）

AI总结研究通过后训练将游戏代码世界模型生成能力蒸馏到小型模型，采用监督微调和带可验证奖励的强化学习提升生成代码的语法正确性和规则遵循性。

AI中文摘要

大型语言模型（LLMs）在从自然语言生成可执行代码方面展现了强大的能力，为AI代理自动构建环境提供了可能性。最近关于代码世界模型（CWMs）的工作表明，LLMs可以将游戏规则转化为与蒙特卡洛树搜索等求解器兼容的Python实现。我们在游戏设置中研究此问题，其中生成的环境必须实现规则、合法动作、状态转移、观察和奖励。我们将这些特定于游戏的可执行模型称为游戏代码世界模型（GameCWMs）。然而，当前生成代码世界模型的方法依赖于前沿模型和推理时精炼循环，限制了可访问性和可扩展性。本文研究是否可以通过后训练将GameCWM生成能力蒸馏到更小的模型中。我们引入：（1）一个包含30个完美信息和不完美信息游戏的精选数据集，（2）一个评估生成代码的结构和语义游戏属性的验证框架，以及（3）一个结合监督微调（SFT）和带可验证奖励的强化学习（RLVR）的后训练流程。我们在Qwen2.5-3B-Instruct上进行实验，发现SFT可以提高语法正确性，而RLVR可以改善执行层面对游戏规则的遵循，从而提升Qwen在完美信息和不完美信息游戏中生成有效GameCWM的能力。总体而言，我们的流程使Qwen2.5-3B-Instruct更能够生成有效的GameCWM，从而为从自然语言自动生成环境提供了一条可扩展的路径。

英文摘要

Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI agents. Recent work on Code World Models (CWMs) demonstrates that LLMs can translate game rules into Python implementations compatible with solvers like Monte Carlo Tree Search. We study this problem in game settings, where generated environments must implement rules, legal actions, state transitions, observations, and rewards. We refer to these game-specific executable models as Game Code World Models (GameCWMs). However, current approaches to generating code world models rely on frontier models and inference-time refinement loops, limiting accessibility and scalability. This work investigates whether GameCWM generation capabilities can be distilled into smaller models through post-training. We introduce: (1) a curated dataset of 30 games spanning perfect and imperfect information games, (2) a verification framework that evaluates generated code against structural and semantic game properties, and (3) a post-training pipeline combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR). We experiment with Qwen2.5-3B-Instruct and find that SFT can increase syntactic correctness, while RLVR can improve execution-level adherence to game rules, thereby improving Qwen's ability to generate valid GameCWMs in both perfect and imperfect information games. Overall, our pipeline makes Qwen2.5-3B-Instruct more capable of generating valid GameCWMs, thereby offering a scalable path toward automatic environment generation from natural language.

2605.24352 2026-05-26 cs.AI 版本更新

Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

伙伴感知的分层技能发现用于鲁棒的人机协作

Adnan Ahmad, Bahareh Nakisa, Mohammad Naim Rastgoo

发表机构 * School of Information Technology, Deakin University（迪肯大学信息技术学院）； Faculty of Information Technology, Monash University（莫纳什大学信息技术学院）

AI总结提出伙伴感知技能发现（PASD）框架，通过对比内在奖励学习基于伙伴行为的技能，缓解捷径学习，提升人机协作的鲁棒性和适应性。

AI中文摘要

多智能体协作，尤其是在人机团队中，要求智能体能够适应具有多样化和动态行为的新伙伴。传统的深度分层强化学习（DHRL）方法关注智能体自身的奖励而忽略伙伴行为，导致捷径学习，即技能利用虚假信息而非适应伙伴的动态行为。这一限制削弱了智能体适应和有效协调新伙伴的能力。我们提出了伙伴感知技能发现（PASD），一种学习以伙伴行为为条件的技能的DHRL框架。PASD引入了一种对比内在奖励，以捕捉伙伴交互中出现的模式，在相似伙伴之间对齐技能表示，同时在不同策略之间保持可区分性。通过基于伙伴交互构建技能空间，该方法缓解了捷径学习并促进了行为一致性，从而实现鲁棒和自适应的协调。我们在Overcooked-AI基准测试中，针对具有不同技能水平和游戏风格的多样化伙伴群体，广泛评估了PASD。我们还使用从人机游戏轨迹训练的人类代理模型进一步评估了该方法。PASD始终优于现有的基于群体和分层基线，展示了可迁移的技能学习，能够泛化到广泛的伙伴行为。对学习到的技能表示的分析表明，PASD有效适应了多样的伙伴行为，突显了其在人机协作中的鲁棒性。

英文摘要

Multi-agent collaboration, especially in human-AI teaming, requires agents that can adapt to novel partners with diverse and dynamic behaviors. Conventional Deep Hierarchical Reinforcement Learning (DHRL) methods focus on agent-centric rewards and overlook partner behavior, leading to shortcut learning, where skills exploit spurious information instead of adapting to partners' dynamic behaviors. This limitation undermines agents' ability to adapt and coordinate effectively with novel partners. We introduce Partner-Aware Skill Discovery (PASD), a DHRL framework that learns skills conditioned on partner behavior. PASD introduces a contrastive intrinsic reward to capture patterns emerging from partner interactions, aligning skill representations across similar partners while maintaining discriminability across diverse strategies. By structuring the skill space based on partner interactions, this approach mitigates shortcut learning and promotes behavioral consistency, enabling robust and adaptive coordination. We extensively evaluate PASD in the Overcooked-AI benchmark with a diverse population of partners characterized by varying skill levels and play styles. We further evaluate the approach with human proxy models trained from human-human gameplay trajectories. PASD consistently outperforms existing population-based and hierarchical baselines, demonstrating transferable skill learning that generalizes across a wide range of partner behaviors. Analysis of learned skill representations shows that PASD adapts effectively to diverse partner behaviors, highlighting its robustness in human-AI collaboration.

2605.24343 2026-05-26 cs.AI 版本更新

ArtSplat: 基于前馈的关节式3D高斯泼溅从稀疏多状态未标定视图

Inseo Lee, Yoonji Kim, Eugene Sohn, Jiwoong Lee, Jungmin You, Joonseok Lee, Jin-Hwa Kim

发表机构 * Seoul National University（首尔国立大学）； Sogang University（西江大学）； NAVER AI Lab（NAVER AI实验室）

AI总结提出首个前馈框架ArtSplat，通过稀疏多视图跨多个关节状态，一次性重建几何和关节参数，引入逐像素关节图表示和跨状态注意力机制，在PartNet-Mobility上实现400倍加速。

AI中文摘要

从稀疏视图图像重建关节物体是一个病态问题，需要同时推断几何和底层关节结构。现有基于NeRF和3D高斯泼溅（3DGS）的关节物体重建方法通常依赖密集视图或强先验（例如深度图、关节类型、预定义关节数量），并且需要昂贵的逐对象优化。在本文中，我们提出了ArtSplat，这是第一个用于关节式3D高斯泼溅的前馈框架。它通过单个前向传递，从跨多个关节状态的稀疏多视图图像中重建几何和关节参数。为了解决单次前向关节重建的挑战，我们引入了一种逐像素关节图表示，使得关节参数估计能够集成到前馈流水线中。我们进一步提出了一种带有状态令牌的跨状态注意力（CSA）机制，该机制有效捕获输入状态间的离散运动。在来自PartNet-Mobility的68个关节物体（包括单关节和多关节配置）上的实验表明，ArtSplat在几何和关节估计方面均达到了有竞争力的性能，同时比基线方法快400倍以上。

英文摘要

Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and underlying articulation structure. Existing methods for articulated object reconstruction based on NeRF and 3D Gaussian Splatting (3DGS) typically rely on dense views or strong priors (e.g., depth maps, joint types, predefined number of joints) and require costly per-object optimization. In this paper, we propose ArtSplat, the first feed-forward framework for articulated 3D Gaussian Splatting. It reconstructs both geometry and joint parameters from sparse multi-view images across multiple articulation states in a single forward pass. To address the challenges of single-pass articulated reconstruction, we introduce a per-pixel joint map representation that enables the integration of joint parameter estimation into the feed-forward pipeline. We further propose a Cross-State Attention (CSA) mechanism with state tokens, which effectively captures discrete motion across input states. Experiments on 68 articulated objects from PartNet-Mobility, including both single- and multi-joint configurations, demonstrate that ArtSplat achieves competitive performance in both geometry and joint estimation, while being over 400 times faster than baselines.

2605.24300 2026-05-26 cs.CR cs.AI cs.LG 版本更新

Enhancing Reliability in LLM-Based Secure Code Generation

增强基于LLM的安全代码生成的可靠性

Mohammed F. Kharma, Mohammad Alkhanafseh, Ahmed Sabbah, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University（比尔泽特大学计算机科学系）； Department of Computer Science, University of Central Florida（中佛罗里达大学计算机科学系）

AI总结提出Mitigation-Aware Chain-of-Thought (MA-CoT)框架，通过嵌入任务特定的CWE缓解指导和语言感知安全措施，显著降低LLM生成代码中的漏洞，在多个模型和语言上验证了其一致的安全可靠性提升。

Comments 15 pages; 7 tables; 3 figures

AI中文摘要

大型语言模型（LLM）被广泛用于代码生成，但其安全可靠性在不同语言和提示策略下仍不一致。现有的提示工程提高了功能正确性，但很少能确保一致的安全结果。我们引入了Mitigation-Aware Chain-of-Thought (MA-CoT)框架，该框架嵌入了任务特定的CWE缓解指导和语言感知安全措施，以减少生成代码中反复出现的漏洞。我们在三个LLM（gpt-5, claude-4.5, gemini-2.5）、三种编程语言（C, Java, Python）和四种提示策略（Vanilla, Zero-shot, CoT, MA-CoT）上，使用200个任务的主数据集以及LLMSecEval的外部验证对MA-CoT进行了评估。通过静态分析和专家验证，MA-CoT在主数据集上将总安全问题从92个减少到39个（57.6%），在LLMSecEval上从73个减少到4个（94.5%）。高严重性问题（阻塞+严重）分别从90个降至39个（56.7%）和从45个降至2个（95.6%）。在两个数据集中，MA-CoT是唯一能持续提高安全可靠性的策略；Zero-shot和CoT可靠性较低，且可能增加漏洞，尤其是在C语言中。我们进一步引入了严格的漏洞驱动因素分层归因（语言核心层与栈层），并表明残余风险集中在硬化导向的模式（例如，依赖于操作系统和工具链），这激励了在提示之外采用安全构造原语。

英文摘要

Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and prompting strategies. Existing prompt engineering improves functional correctness but rarely ensures consistent security outcomes. We introduce the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which embeds task-specific CWE mitigation guidance and language-aware safeguards to reduce recurring vulnerabilities in generated code. We evaluate MA-CoT across three LLMs (gpt-5, claude-4.5, gemini-2.5), three programming languages (C, Java, Python), and four prompting strategies (Vanilla, Zero-shot, CoT, MA-CoT) on a 200-task primary dataset, with external validation on LLMSecEval. Using static analysis with expert validation, MA-CoT reduces total security findings from 92 to 39 (57.6%) on the primary dataset and from 73 to 4 (94.5%) on LLMSecEval. High-severity findings (Blocker + Critical) drop from 90 to 39 (56.7%) and from 45 to 2 (95.6%), respectively. Across both datasets, MA-CoT is the only strategy that consistently improves security reliability; Zero-shot and CoT are less reliable and may increase vulnerability, especially in C. We further introduce a strict layered attribution of vulnerability drivers (language-core vs. stack layers) and show that residual risk concentrates in hardening-oriented patterns (e.g., OS- and toolchain-dependent), motivating secure-by-construction primitives alongside prompting.
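
The percentages quoted in the abstract are relative reductions, which can be checked directly from the before and after counts. The short script below recomputes them; the helper name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, in percent, to one decimal."""
    return round(100 * (before - after) / before, 1)

# (before, after) finding counts exactly as quoted in the abstract.
reported = {
    "primary, all findings": (92, 39),
    "LLMSecEval, all findings": (73, 4),
    "primary, high severity": (90, 39),
    "LLMSecEval, high severity": (45, 2),
}

reductions = {name: percent_reduction(b, a) for name, (b, a) in reported.items()}
for name, value in reductions.items():
    print(f"{name}: {value}%")
```

The recomputed values match the reported 57.6%, 94.5%, 56.7%, and 95.6%. The counts also show that all 39 findings remaining on the primary dataset are high-severity.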

2605.24298 2026-05-26 cs.CR cs.AI cs.LG 版本更新

An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

LLM生成代码安全性的提示方法实证评估

Mohammed Kharma, Ahmed Sabbah, Mohammad Alkhanafseh, Mohammad Hammoudeh, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University（比尔泽特大学计算机科学系）； King Fahd University of Petroleum and Minerals（法赫德国王石油与矿业大学）； University of Central Florida（中佛罗里达大学）

AI总结通过跨5个LLM和4种编程语言的实证评估，提出弱点感知零样本链式思考（WA-0CoT）提示策略，发现提示方法虽影响弱点类别分布，但无法显著降低漏洞频率或密度。

Comments 40 pages, 22 tables, 8 figures

AI中文摘要

大型语言模型（LLM）在自动化代码生成中的日益使用提高了软件开发效率，但往往以安全性为代价。生成的代码经常忽略关键问题，使其容易受到弱加密和不正确的输入验证等问题的影响。为了研究这一问题，我们对跨五个LLM和四种编程语言（Java、C++、C和Python）的LLM生成代码的安全质量进行了全面的实证评估，考察了多种提示工程方法的影响。我们提出了一种弱点感知的零样本链式思考（WA-0CoT）提示策略，该策略利用CWE映射丰富提示中的安全上下文以指导模型推理。我们的实证分析在卡方检验的支持下发现，不同提示方法在漏洞频率或密度上没有统计学上的显著降低。然而，包括WA-0CoT在内的提示策略系统地影响了CWE类别的组成分布，其效果因编程语言而异。这些发现表明，虽然安全感知的提示改变了生成弱点的结构，但仅靠提示工程不足以可靠地降低整体漏洞水平。结果强调了在评估LLM生成代码的安全属性时，语言感知和模型感知的提示设计的重要性。

英文摘要

The growing use of Large Language Models (LLMs) for automated code generation has enhanced software development efficiency, but often at the cost of security. Generated code frequently overlooks critical concerns, leaving it vulnerable to issues such as weak encryption and improper input validation. To investigate this problem, we present a comprehensive empirical evaluation of the security quality of LLM-generated code across five LLMs and four programming languages (Java, C++, C, and Python), examining the impact of multiple prompt engineering methods. We introduce a weaknesses-aware zero-shot chain-of-thought (WA-0CoT) prompting strategy that enriches prompts with security context using CWE mappings to guide model reasoning. Our empirical analysis, supported by chi-square tests, finds no statistically significant reductions in vulnerability frequency or density across prompt methods. However, prompting strategies, including WA-0CoT, systematically influence the compositional distribution of CWE categories, with effects varying by programming language. These findings suggest that while security-aware prompting alters the structure of generated weaknesses, prompt engineering alone is insufficient to reliably reduce overall vulnerability levels. The results highlight the importance of language-aware and model-aware prompt design when evaluating the security properties of LLM-generated code.
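
To show what a chi-square comparison across prompt methods involves, here is a minimal sketch of Pearson's test statistic on a contingency table. The counts are invented for illustration and are not the paper's data. The table compares vulnerable and non-vulnerable samples for two hypothetical prompt methods.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are prompt methods, columns are
# (vulnerable samples, non-vulnerable samples).
observed = [
    [30, 70],  # baseline prompt (made-up numbers)
    [26, 74],  # security-aware prompt (made-up numbers)
]

stat = chi_square_statistic(observed)
# 3.841 is the standard 5% critical value for 1 degree of freedom.
significant = stat > 3.841
```

With these made-up counts the statistic is about 0.40, well below the critical value, so the difference in vulnerability frequency would not be statistically significant. That is the kind of outcome the abstract reports for its own data.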

2605.24294 2026-05-26 cs.CR cs.AI cs.LG 版本更新

Concept Drift Adaptation Using Self-Supervised and Reinforcement Learning In Android Malware Detection

使用自监督学习和强化学习在Android恶意软件检测中适应概念漂移

Ahmed Sabbah, Mohammad Kharma, Mohammad Alkhanafseh, Radi Jarrar, Samer Zein, David Mohaisen

发表机构 * Birzeit University（比尔泽特大学）； University of Central Florida（中佛罗里达大学）

AI总结提出一个基于自监督学习和强化学习的框架，通过冻结编码器测量潜在漂移并轻量适配，同时利用PPO控制器在成本约束下选择维护动作，以应对Android恶意软件检测中的概念漂移。

Comments 9 pages, 2 figures, 2 tables

AI中文摘要

Android恶意软件检测器在部署后常因概念漂移而性能下降，而每次维护步骤完全重新训练成本高昂。我们提出一个按时间顺序的自适应维护框架，将部署时的维护建模为序列决策问题。该框架在初始化阶段通过自监督学习学习稳定的潜在表示，冻结编码器，在固定表示空间中测量潜在漂移，并使用可训练适配器和分类头进行轻量下游适配。一个近端策略优化控制器根据检测器状态（包括当前效用、固定记忆集上的保留率、潜在漂移指标和更新成本）选择低成本的维护动作。我们在模拟器和真实Android恶意软件数据集上，使用静态和动态特征，在因果部署式协议下评估该框架。结果表明，RL控制器提供了一种强大的成本感知适配策略，在非平稳部署条件下，始终保持在最佳策略之列，同时在时间性能、记忆保留和维护成本之间取得有利平衡。

英文摘要

Android malware detectors often degrade after deployment because of concept drift, while full retraining at each maintenance step is costly. We propose a chronological adaptive maintenance framework that models deployment-time maintenance as a sequential decision problem. The framework learns a stable latent representation through self-supervised learning during initialization, freezes the encoder, measures latent drift in the fixed representation space, and performs lightweight downstream adaptation using a trainable adapter and classification head. A proximal policy optimization controller selects low-cost maintenance actions based on the detector state, including current utility, retention on a fixed memory set, latent drift indicators, and update cost. We evaluate the framework under a causal deployment-style protocol on emulator and real Android malware datasets with static and dynamic features. Results show that the RL controller provides a strong cost-aware adaptation strategy, consistently remaining among the top-performing policies while achieving a favorable balance between temporal performance, memory retention, and maintenance cost under non-stationary deployment conditions.

2605.24270 2026-05-26 cs.AI cs.CR 版本更新

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

面向安全的路由分析：Mixtral MoE在良性及有害提示下的表现

Md Nurul Absar Siddiky

发表机构 * Department of Electrical and Computer Engineering, University of Hawai'i at Mānoa, Honolulu, HI, USA

AI总结通过激活和梯度两种信号分析Mixtral 8x7B-Instruct在良性及有害提示下的路由行为，发现安全相关的路由是微妙、深度依赖且分布式的，而非由固定专家集主导。

AI中文摘要

稀疏混合专家（MoE）语言模型对每个token仅激活一小部分参数，使得路由器行为成为模型计算的核心部分。本文利用两种互补信号——基于专家选择频率的激活路由分数和基于路由器门敏感性的梯度分数——研究Mixtral 8x7B-Instruct在良性及有害提示下的路由行为。我们分析了专家和层级别的路由行为，并进行了专家抑制干预。结果表明，激活基础的专家使用广泛且长尾，而梯度基础的重要性则集中。在专家级别，良性提示组和有害提示组在两种信号下保持接近，仅有适度分离。在层级别，激活路由在8-15层附近最具选择性，而梯度重要性集中在最后几层。专家分类显示，大多数专家在良性和有害提示间共享，尽管有限子集表现出明确的组偏好。排名靠前的专家集在梯度分数下显示出比激活分数更强的良性-恶意重叠，表明集中在共同的后期专家集上。在干预实验中，抑制来自激活分数的前五个良性主导专家，将100个提示中的受限响应从24减少到14，而抑制梯度导出的专家则从34减少到22，且意外逆转更少。总体而言，Mixtral中与安全相关的路由是微妙、深度依赖且分布式的，而非由固定专家集主导。

英文摘要

Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using two complementary signals: activation-based routing scores derived from expert selection frequencies and gradient-based scores derived from router-gate sensitivities. We analyze expert- and layer-level routing behavior and conduct expert-suppression interventions. The results show that activation-based expert usage is broad and long-tailed, whereas gradient-based importance is concentrated. At expert level, benign and harmful prompt groups remain close under both signals with modest separation. At layer level, activation-based routing is most selective around layers 8-15, while gradient-based importance is concentrated in final layers. Expert classification shows most experts are shared across benign and harmful prompts, though a limited subset shows clear group preference. Top-ranked expert sets show stronger benign-malicious overlap under gradient scores than activation scores, suggesting concentration on a common late-layer expert set. In intervention experiments, suppressing top five benign-dominant experts from activation scores reduces restricted responses from 24 to 14 over 100 prompts, while suppressing gradient-derived experts reduces them from 34 to 22 with fewer unintended reversals. Overall, safety-relevant routing in Mixtral is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts.
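
An activation-based routing score derived from expert selection frequencies can be sketched as follows. The sketch assumes top-2 routing over 8 experts per layer, as in Mixtral 8x7B, and uses made-up router logits. The exact scoring used in the paper may differ, and the function names are hypothetical.

```python
from collections import Counter

def top_k_experts(logits, k=2):
    """Indices of the k experts with the largest router logits."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]

def selection_frequency(router_logits_per_token, num_experts=8, k=2):
    """Fraction of all routing slots assigned to each expert in one layer."""
    counts = Counter()
    for logits in router_logits_per_token:
        counts.update(top_k_experts(logits, k))
    total_slots = k * len(router_logits_per_token)
    return [counts[e] / total_slots for e in range(num_experts)]

# Made-up router logits for three tokens in a single layer.
tokens = [
    [0.1, 2.0, 0.3, 1.5, 0.0, 0.0, 0.0, 0.0],  # routes to experts 1 and 3
    [1.0, 0.2, 0.1, 3.0, 0.0, 0.0, 0.0, 0.0],  # routes to experts 3 and 0
    [0.0, 0.0, 0.0, 2.5, 0.0, 0.0, 1.0, 0.0],  # routes to experts 3 and 6
]

freq = selection_frequency(tokens)
```

Computing such frequencies separately for benign and harmful prompts, and then comparing them per expert and per layer, is one way to obtain the kind of group-level comparison the abstract describes.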

2605.24266 2026-05-26 cs.CL cs.AI 版本更新

An Interactive Paradigm for Deep Research

深度研究的交互式范式

Lin Ai, Victor S. Bursztyn, Xiang Chen, Julia Hirschberg, Saayan Mitra

发表机构 * Adobe Research（Adobe研究院）； Department of Computer Science, Columbia University（哥伦比亚大学计算机科学系）

AI总结提出SteER框架，通过可解释的中间过程控制、成本效益决策和实时用户模型，在深度研究中实现用户对齐，性能优于现有基线。

AI中文摘要

近年来，大型语言模型（LLMs）的进展使得深度研究系统能够通过结合检索、推理和生成，为开放式查询合成全面、报告式的答案。然而，大多数框架依赖于僵化的流程，采用一次性范围界定和长时间自主运行，如果用户意图在过程中发生变化，几乎没有修正的空间。我们提出了SteER，一个可引导的深度研究框架，将可解释的中间过程控制引入长周期研究流程中。在每个决策点，SteER使用成本效益公式来确定是暂停等待用户输入还是自主继续。它结合了多样性感知规划与奖励对齐、新颖性和覆盖率的效用信号，并维护一个在会话过程中不断演化的实时用户模型。SteER在对齐方面比最先进的开源和专有基线高出最多22.80%，在广度、平衡等质量指标上领先，并且在85%以上的成对对齐判断中被人类读者偏好。我们还引入了一个用户查询基准和数据生成流水线。据我们所知，这是第一个以交互式、可解释的控制范式推进深度研究的工作，为长形式任务中可控、用户对齐的智能体铺平了道路。

英文摘要

Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, report-style answers to open-ended queries by combining retrieval, reasoning, and generation. Yet most frameworks rely on rigid workflows with one-shot scoping and long autonomous runs, offering little room for course correction if user intent shifts mid-process. We present SteER, a framework for Steerable deEp Research that introduces interpretable, mid-process control into long-horizon research workflows. At each decision point, SteER uses a cost-benefit formulation to determine whether to pause for user input or to proceed autonomously. It combines diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintains a live persona model that evolves throughout the session. SteER outperforms state-of-the-art open-source and proprietary baselines by up to 22.80% on alignment, leads on quality metrics such as breadth and balance, and is preferred by human readers in 85%+ of pairwise alignment judgments. We also introduce a persona-query benchmark and data-generation pipeline. To our knowledge, this is the first work to advance deep research with an interactive, interpretable control paradigm, paving the way for controllable, user-aligned agents in long-form tasks.

2605.24247 2026-05-26 cs.CL cs.AI 版本更新

Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

通过详细的宪法定义和AI驱动的评估提高标注一致性

Konstantin Berlin, Adam Swanda

发表机构 * Cisco AI Defense（思科AI防御）

AI总结提出一种AI驱动的工作流，通过为每个类别编写详细的宪法定义并由前沿LLM解释，以比人类更一致和准确地生成黄金标签，在三个内容审核类别上将跨模型不一致性降低高达57倍。

Comments Under review at ACL Rolling Review (ARR), May 2026 cycle. Also available at https://doi.org/10.5281/zenodo.20125267

AI中文摘要

许多自动化标注流水线根据书面规范将输入分类到类别中，内容审核是一个突出的用例。简单的类别定义不足以让标注者产生这些流水线所需的准确、一致的黄金标签。一个解决方案是编写一个规定性定义，解决足够多的实际边界情况，使得标注者无法对书面解释产生分歧。在实践中，这种详细程度的定义超出了人类标注者工作记忆的容量，因此标注者依赖直觉，标签偏离书面规则，准确性和一致性下降。我们提出并展示了一种AI驱动工作流的有效性，其中AI帮助编写每个类别的宪法，该宪法以足够详细的方式定义标签以覆盖边缘情况，并且前沿LLM在每个输入上解释该宪法，以比阅读相同文档的人类更一致和准确地产生黄金标签。我们在三个内容审核类别（骚扰、仇恨言论、非暴力犯罪）上评估，并表明该方法相比段落定义将跨模型不一致性降低高达57倍，跨模型分歧诊断规范缺口，人类负责关于每个类别应含义的高层决策，而不是单个标注调用。对于安全评估，我们引入了一个双轴公式，在完整对话上独立评分意图和内容，以便下游消费者可以基于任一轴或两者采取行动。

英文摘要

Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the accurate, consistent golden labels these pipelines require. One solution is to write a prescriptive definition that settles enough real boundary cases that labelers cannot disagree with the written interpretation. In practice, definitions at that level of detail exceed what a human annotator can hold in working memory, so annotators fall back on intuition and the labels drift from the written rules, regressing on accuracy and consistency. We propose and demonstrate the efficacy of an AI-driven workflow in which AI helps write a per-category constitution that defines the label in enough detail to cover edge cases, and a frontier LLM interprets it on each input to produce the golden label more consistently and accurately than humans reading the same document. We evaluate on three content moderation categories (harassment, hate speech, non-violent crime) and show that the approach reduces cross-model inconsistency by up to 57x compared to paragraph definitions, with cross-model disagreement diagnosing specification gaps and the human responsible for high-level decisions about what each category should mean rather than individual labeling calls. For the safety evaluation, we introduce a dual-axis formulation scoring intent and content independently over the full conversation, so downstream consumers can act on either axis or both.
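
The abstract does not define its cross-model inconsistency metric. One plausible measure, used here purely as an assumption, is the pairwise disagreement rate: the fraction of (model pair, item) combinations on which two models assign different labels. The labels below are invented for illustration.

```python
from itertools import combinations

def inconsistency(labels_by_model):
    """Pairwise disagreement rate across models over the same items."""
    models = list(labels_by_model)
    n_items = len(labels_by_model[models[0]])
    pairs = list(combinations(models, 2))
    disagreements = sum(
        labels_by_model[a][i] != labels_by_model[b][i]
        for a, b in pairs
        for i in range(n_items)
    )
    return disagreements / (len(pairs) * n_items)

# Made-up binary labels from three hypothetical models on four inputs.
with_paragraph_definition = {
    "model_a": [1, 0, 1, 1],
    "model_b": [1, 1, 0, 1],
    "model_c": [0, 0, 1, 1],
}
with_constitution = {
    "model_a": [1, 0, 1, 1],
    "model_b": [1, 0, 1, 1],
    "model_c": [1, 0, 0, 1],
}

before = inconsistency(with_paragraph_definition)
after = inconsistency(with_constitution)
```

In this toy example the disagreement rate falls from 0.5 to about 0.17. Items on which models still disagree point to places where the written definition leaves a gap.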

2605.24243 2026-05-26 cs.CV cs.AI stat.ML 版本更新

GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer

GIBLy: 通过架构无关的轻量级几何归纳偏置层改进3D语义分割

Diogo Lavado, Alessandra Micheletti, Clàudia Soares

发表机构 * NOVA School of Science and Technology（诺瓦科学与技术学校）； Università degli Studi di Milano（米兰大学）

AI总结提出一种轻量级几何归纳偏置层GIBLy，通过集成可学习的几何先验提升3D分割性能，仅增加少量参数即可在多个基准上获得一致提升。

AI中文摘要

在3D场景理解中，深度学习模型依赖大型模型和大量训练来捕捉3D数据中存在的几何结构。然而，现有方法缺乏显式机制来融入几何信息（例如可学习的基元形状），往往需要更大的模型和更多的训练数据，这增加了成本并可能限制泛化能力。我们引入了GIBLy，一种轻量级几何归纳偏置层，将可学习的几何先验集成到3D分割流程中。GIBLy通过提供与简单几何形状（因此可解释）对齐的特征来增强现有架构——无论是基于MLP、卷积还是Transformer——以最小的计算开销提升分割性能。我们在多个3D语义分割基准上验证了我们的方法，展示了一致的性能提升，包括在TS40K上使用PTV3时mIoU提升高达+11.5%，而仅增加58K额外参数。我们的结果突显了显式编码几何结构以支持准确高效的3D场景理解的优势，且仅需一个轻量级的附加层。

英文摘要

In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic geometric structures that are present in the 3D data. However, existing methods lack explicit mechanisms to incorporate geometric information, such as learnable primitive shapes, often necessitating large models and more training data, which in turn increases cost and can limit generalization. We introduce GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. GIBLy enhances existing architectures -- whether MLP-based, convolution-based, or transformer-based -- by providing features aligned with simple geometric shapes (and thus human-interpretable) that improve segmentation performance with minimal computational overhead. We validate our approach across multiple 3D semantic segmentation benchmarks, demonstrating consistent performance gains, including up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters. Our results highlight the benefit of explicitly encoding geometric structure to support accurate and efficient 3D scene understanding, with a lightweight add-on layer.


2605.24239 2026-05-26 cs.CR cs.AI (updated version)

Unlocking Apple's Private Cloud Compute: An Analysis of Privacy-Preserving Artificial Intelligence

Yannik Dittmar, Marvin Jerome Stephan, Thomas Völkl, Matthias Hollick, Jiska Classen

Affiliations: Hasso Plattner Institute, University of Potsdam; TU Darmstadt, Secure Mobile Networking Lab; IMDEA Networks Institute, Madrid, Spain

AI summary: Reverse-engineers the implementation of Apple's Private Cloud Compute (PCC) on mobile devices to evaluate its privacy properties, and opens its non-public interfaces to support custom queries and independent benchmarking.

DOI: 10.1145/3765613.3811691
Journal ref: Proceedings of the 19th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec 2026)

Abstract

Many existing Artificial Intelligence (AI) solutions on mobile devices rely on an extensive collection of sensitive data, raising privacy concerns and often requiring storage for both context and model improvement. Apple's Private Cloud Compute (PCC) aims to address this by emphasizing mobile device integration and a privacy-first design. The central claim of PCC is that it does not store any user data and that user input and user accounts are unlinkable. While most of the PCC system specifications are public, compiled binaries add a layer of opaqueness. There are no reproducible builds, and there are no symbols within those binaries, creating potential discrepancies between the specification and what is shipped to the user. Additionally, the underlying models and interfaces for querying PCC are not openly accessible, limiting academic evaluation of model properties, such as accuracy. This poses a challenge in assessing whether a privacy-preserving approach like PCC is actually trustworthy while also providing high-quality answers. We are the first to reverse-engineer the PCC implementation on mobile devices to evaluate privacy aspects and to open its non-public interfaces on local devices to support custom PCC queries. We demonstrate this level of access beyond Apple's intended use cases by independently benchmarking the PCC model. We enable future research by making our PCC benchmarking framework publicly available.


2605.24238 2026-05-26 cs.AI (updated version)

Towards Evaluation Engineering: An Empirical Study of Machine Learning Evaluation Harnesses in the Wild

Zhimin Zhao, Zehao Wang, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan

Affiliations: Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen's University; Concordia University; Lahore University of Management Sciences (LUMS)

AI summary: Through an empirical study of 57 evaluation harnesses, derives a five-stage harness model and classifies 16,560 issues. The Specification stage accounts for the most issues (41.4%), and the leading root causes are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), laying an empirical foundation for treating evaluation engineering as a distinct software engineering concern.

Abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.
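
To make the four responsibilities named in the first sentence concrete (model invocation, data loading, metric computation, result reporting), here is a minimal sketch of a toy harness. It is not taken from any of the 57 harnesses studied: `Example`, `load_data`, `toy_model`, `exact_match`, and `run_harness` are hypothetical names, and the model is a hard-coded lookup table standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str
    reference: str


def load_data() -> list[Example]:
    # Data loading: a hard-coded toy dataset stands in for a real loader.
    return [
        Example("2+2", "4"),
        Example("3+3", "6"),
        Example("capital of France", "Paris"),
    ]


def toy_model(prompt: str) -> str:
    # Model invocation: hypothetical stand-in for a call to a real model.
    answers = {"2+2": "4", "3+3": "7", "capital of France": "Paris"}
    return answers.get(prompt, "")


def exact_match(prediction: str, reference: str) -> float:
    # Metric computation: 1.0 on an exact string match, else 0.0.
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def run_harness(
    model: Callable[[str], str],
    examples: list[Example],
    metric: Callable[[str, str], float],
) -> dict:
    # Input validation: reject an empty dataset instead of dividing by zero.
    if not examples:
        raise ValueError("examples must not be empty")
    scores = [metric(model(ex.prompt), ex.reference) for ex in examples]
    # Result reporting: a plain dict that a caller could log or serialize.
    return {"n": len(scores), "mean_score": sum(scores) / len(scores)}


report = run_harness(toy_model, load_data(), exact_match)
print(report)
```

The explicit empty-dataset check is included because the abstract lists missing input validation among the most frequent root causes; a real harness would validate far more than this.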


2605.24212 2026-05-26 stat.AP cs.AI cs.LG stat.ML (updated version)

AvalancheBench: Evaluating Enterprise Data Agents through Latent World Recovery

Darek Kleczek, Fuheng Zhao, Alexander W. Lee, Julien Tissier, Pawel Liskowski, Ugur Cetintemel, Anupam Datta

Affiliations: Brown University and Snowflake

AI summary: Proposes the AvalancheBench benchmark, which evaluates the analytical understanding of enterprise data agents through latent world recovery and reveals how early errors propagate into systematically wrong recommendations.

Abstract

We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.


2605.24180 2026-05-26 physics.soc-ph cs.AI cs.DL cs.HC (updated version)

Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment

Binglu Wang, Weixin Liang, Jiahui Xue, Yuhui Zhang, Hancheng Cao, Dashun Wang, Yian Yin

Affiliations: Kellogg School of Management, Northwestern University; Center for Science of Science and Innovation, Northwestern University; Northwestern Institute on Complex Systems, Northwestern University; Department of Computer Science, Stanford University; Goizueta Business School, Emory University; Department of Information Science, Cornell University

AI summary: Through a global large-scale randomized field experiment, studies whether customized feedback generated by large language models (LLMs) raises researchers' revision rates and promotes the use of AI tools, with the largest benefits for resource-constrained researchers.

Abstract

Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this hidden but vital practice and reallocate scientific feedback, an essential yet scarce resource for knowledge production. In a global large-scale randomized field experiment, we delivered customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions. Relative to controls, authors who received feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate. Exposure to AI feedback also increased authors' subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice. These effects were strongest among authors from non-English-dominant research regions, manuscripts less embedded in the scholarly literature, and teams with lower h-indexes and earlier career stages, consistent with the idea that AI feedback may provide the greatest benefit where access to timely critique is otherwise limited. Together, these findings provide causal evidence that structured AI-based interventions can transform access to scientific feedback from a largely private advantage into a more widely distributed resource, with broader implications for productivity, equity, and capacity across the global research system.


2605.24173 2026-05-26 cs.CL cs.AI cs.CR cs.LG (updated version)

Extracting Training Data from Diffusion Language Models via Infilling

Yihan Wang, N. Asokan

Affiliations: University of Waterloo; KTH Royal Institute of Technology

AI summary: Proposes an infilling extraction protocol, parameterized by an arbitrary binary mask, that exploits the bidirectional denoising ability of diffusion language models. Shows that mask geometry governs extractability, that edge-conditioned masks extract up to three times more verbatim sequences than prefix-conditioned ones, and that bidirectional access opens channels that cannot be exploited in autoregressive models.

Abstract

Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autoregressive models. However, diffusion language models (DLMs) can denoise masked tokens at arbitrary positions. Thus, prefix-only probing reveals only one facet of memorization in DLMs and significantly underestimates the risk of training-data extraction. To realistically model the extractability of training data in DLMs, we introduce infilling extraction, a data-extraction protocol parameterized by an arbitrary binary mask that subsumes prefix-only probing and accounts for the bidirectional inductive bias of DLMs. Instantiating it on LLaDA-8B and Dream-7B across five extraction modes, three training pipelines, and three corpora covering verbatim and partial leakage, we find that mask geometry governs extractability: edge-conditioned masks extract up to three times more verbatim sequences than prefix-conditioned ones, and bidirectional access opens channels inaccessible in autoregressive models. In particular, we show that a realistic adversary with access to training data in which personally identifiable information has been redacted can achieve higher recall when extracting redacted email addresses from DLMs than from scale-matched autoregressive models. Tunable decoding parameters measurably affect extraction performance, while a follow-up supervised fine-tuning stage does not eliminate the prior memorization.
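
The idea of a probe parameterized by a binary mask can be illustrated with a small sketch. This is not the authors' code: the function names are hypothetical, and reading "edge-conditioned" as "reveal tokens at both ends and hide the middle" is an assumption, since the abstract does not define the mask shapes. Here 1 marks a token shown to the model and 0 marks a token it must fill in.

```python
MASK_TOKEN = "[MASK]"  # placeholder symbol, chosen for illustration only


def prefix_mask(length: int, n_visible: int) -> list[int]:
    """Reveal the first n_visible tokens and hide the rest."""
    if not 0 <= n_visible <= length:
        raise ValueError("n_visible must be between 0 and length")
    return [1] * n_visible + [0] * (length - n_visible)


def edge_mask(length: int, n_visible: int) -> list[int]:
    """Reveal tokens at both ends and hide the middle (assumed reading)."""
    if not 0 <= n_visible <= length:
        raise ValueError("n_visible must be between 0 and length")
    left = (n_visible + 1) // 2
    right = n_visible - left
    return [1] * left + [0] * (length - n_visible) + [1] * right


def apply_mask(tokens: list[str], mask: list[int]) -> list[str]:
    """Build the query by replacing hidden positions with MASK_TOKEN."""
    return [tok if keep else MASK_TOKEN for tok, keep in zip(tokens, mask)]


def verbatim_match(original: list[str], generated: list[str], mask: list[int]) -> bool:
    """True if every hidden position was reproduced exactly."""
    return all(o == g for o, g, keep in zip(original, generated, mask) if not keep)


tokens = "the quick brown fox jumps over".split()
print(apply_mask(tokens, prefix_mask(len(tokens), 2)))
print(apply_mask(tokens, edge_mask(len(tokens), 2)))
```

Both masks reveal the same number of tokens, so a comparison between them isolates the effect of mask geometry. The prefix mask is just one special case of the general binary mask, which is what "subsumes prefix-only probing" means.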


2605.24172 2026-05-26 cs.AI (updated version)

EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages

Samah Fodeh, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Afshan Khan, Ganesh Puthiaraju, Linhai Ma, Srivani Talakokkul, Jordan Alpert, Sarah Schellhorn

Affiliations: Yale University; Cleveland Clinic Lerner College of Medicine of Case Western Reserve University, Cleveland Clinic; Medical Oncology, Yale School of Medicine

AI summary: Proposes the EPPC-OASIS framework, which augments fine-tuning with an ontology-aware Wasserstein alignment objective and adds inference-refinement steps to automatically extract structured EPPC codes from secure messages, with consistent improvements across multiple language models.

Abstract

Secure patient-provider messages contain clinically important communication behaviors that are difficult to characterize manually at scale. The Electronic Patient-Provider Communication (EPPC) framework provides an ontology for coding these behaviors, but automated extraction remains challenging because predictions must preserve fine-grained code/sub-code structure while grounding annotations in message text. We developed EPPC-OASIS, an ontology-aware adaptation approach for structured EPPC extraction, and combined it with deployable inference-refinement procedures designed to improve the coherence of final annotations. EPPC-OASIS augments supervised fine-tuning with a Wasserstein alignment objective that encourages alignment between model representation neighborhoods and EPPC ontology-derived neighborhoods, while inference refinement uses verification, self-consistency, hybrid correction, and selection or ensembling to address residual prediction errors. We evaluated the framework on a de-identified corpus of secure patient-provider messages against prompting, supervised fine-tuning, preference-based, and robustness-oriented baselines across multiple open-weight language models. Across model families, the best deployable pipeline achieved 77.13% Code+Sub-code F1 and 63.83% Triplet F1, corresponding to modest but consistent absolute gains of +1.39 and +2.12 F1 points over the strongest supervised fine-tuning baseline. These results suggest that ontology-aware adaptation with structured inference refinement can support scalable retrospective EPPC mining, although external validation is needed before operational use.


2605.24171 2026-05-26 cs.LG cs.AI (updated version)

Palette: A Modular, Controllable, and Efficient Framework for On-Demand Authorized Relaxation of Safety Alignment in LLMs

Qitao Tan, Xiaoying Song, Arman Akbari, Arash Akbari, Yanzhi Wang, Xiaoming Zhai, Lingzi Hong, Zhen Xiang, Jin Lu, Geng Yuan

Affiliations: University of Georgia; University of North Texas; Northeastern University

AI summary: Proposes the Palette framework, which identifies a refusal direction via multi-objective search and internalizes it through lightweight adaptation, relaxing safety refusals on demand for authorized domains while preserving standard safety elsewhere, and supports modular composition of multi-domain authorization.

Abstract

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.


2605.24139 2026-05-26 cs.AI cs.LG (updated version)

MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

Qian-Rong Li, Hung Guei, I-Chen Wu, Ti-Rong Wu

Affiliations: Department of Computer Science, National Yang Ming Chiao Tung University; Institute of Information Science, Academia Sinica

AI summary: Proposes MAPLE, which aggregates policy and value evaluations from multiple sampled world states within a single search tree, combining the advantages of PIMC and IS-MCTS, and improves Elo by 291 on Phantom Go and by 136 on Dark Hex.

Comments: Accepted by the IEEE Conference on Games (IEEE CoG 2026)

Abstract

Imperfect-information games (IIGs) are challenging, as players must make decisions without fully observing the true game state. While AlphaZero has achieved remarkable success in perfect-information games, extending it to IIGs remains difficult. Existing search-based approaches, such as Perfect Information Monte Carlo (PIMC), suffer from strategy fusion, while Information Set Monte Carlo Tree Search (IS-MCTS) incurs high computational cost when combined with neural networks. In this paper, we propose Multi-State Aggregated PoLicy Evaluation (MAPLE), a tree search method that aggregates policy and value evaluations from multiple sampled world states within a single search tree, combining the advantages of PIMC and IS-MCTS while maintaining a controllable computational cost. We further incorporate a Siamese-based sampling strategy to select informative world states from the information set. Experiments on Phantom Go and Dark Hex show that MAPLE significantly outperforms the PIMC-based AlphaZero baseline, achieving Elo improvements of 291 and 136, respectively. These results demonstrate that MAPLE is an effective approach for AlphaZero-style learning in imperfect-information games.
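
To give a feel for what the reported Elo gains mean, the sketch below converts a rating difference into an expected score using the standard logistic Elo formula. This is an illustration only: it assumes the conventional 400-point scale, and the abstract does not say how the paper's Elo ratings were computed.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Elo improvements over the PIMC-based baseline, as reported in the abstract.
for game, gain in [("Phantom Go", 291), ("Dark Hex", 136)]:
    print(f"{game}: +{gain} Elo -> expected score {expected_score(gain):.2f}")
```

Under that assumption, the two gains correspond to expected scores of roughly 0.84 and 0.69 against the baseline.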


2605.24138 2026-05-26 cs.SE cs.AI (updated version)

Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development

Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, Miroslaw Staron

Affiliations: Chalmers University of Technology; University of Gothenburg, Gothenburg, Sweden; Research & Development, Volvo Car Corporation, Gothenburg, Sweden

AI summary: Analyzes conversations between Designer and Programmer agents across 12 combinations of open-source LLMs and identifies three key dimensions of multi-agent interaction: efficiency, consistency, and effectiveness. Finds that the DeepSeek-R1:DeepSeek-R1 pair converges stably to the correct solution from the first iteration, while other combinations diverge or reach consensus on incorrect solutions.

Comments: 10 pages, 7 figures, AIware, FSE 2026

DOI: 10.1145/3805760.3814914

Abstract

Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while the LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 pairs demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged to a result. These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating the convergence and stop conditions essential for future autonomous SE.


2605.24137 2026-05-26 cs.SE cs.AI (updated version)

Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

Hinduja Nirujan, Shreyas Patil, Abdallah Ayoub, Ahmad Abdel Latif, Gouri Ginde

Affiliations: Electrical and Software Engineering

AI summary: Empirically investigates hallucinations in LLM-generated bug report summaries from a section-aware perspective, and proposes a detection approach that jointly predicts hallucinated content, identifies affected sections, and classifies hallucination types, achieving good performance on the BugsRepo dataset.

Abstract

Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.
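
The reported scores are all Macro-F1, which averages per-class F1 with equal weight per class. A minimal sketch of that computation follows. The label names are illustrative: "missing" and "fabricated" echo the categories mentioned in the exploratory study, "none" is a hypothetical no-hallucination label, and the toy predictions are made up for the example rather than taken from the paper.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy hallucination-type labels for four summaries (made-up data).
y_true = ["none", "fabricated", "missing", "none"]
y_pred = ["none", "missing", "missing", "none"]
print(f"Macro-F1: {macro_f1(y_true, y_pred):.3f}")
```

Because each class counts equally, the single missed "fabricated" case pulls the score down sharply even though three of the four predictions are correct, which is why Macro-F1 suits tasks with rare hallucination types.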


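2605.24117 2026-05-26 cs.AI (updated version)

Verified SHAP: Provable Bounds on Exact Shapley Values for Neural Networks

David Boetius, Shahaf Bassan, Guy Katz, Stefan Leue, Tobias Sutter

Affiliations: University of Konstanz, Konstanz, Germany; Hebrew University of Jerusalem, Jerusalem, Israel; University of St. Gallen, St. Gallen, Switzerland

AI summary: Using neural network verification techniques, proposes an algorithm that computes exact upper and lower bounds on SHAP values and scales to search spaces several orders of magnitude larger than those handled by existing exact methods.

Comments: Accepted at ICML 2026. 34 pages, 13 figures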

2605.24079 2026-05-26 cs.SE cs.AI cs.CL (updated version)

TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

Yifeng Di, Xuliang Huang, Tianyi Zhang

Affiliations: Purdue University, West Lafayette, IN; University of Illinois Urbana-Champaign

AI summary: Proposes the TRACER framework, which detects fine-grained data contamination in code LLMs using three levels of semantic overlap and a coarse-to-fine pipeline, reaching an F1 of 0.91 on its benchmark.

Comments: 21 pages, 2 figures, 15 tables

Abstract

Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language models (LLMs), where contamination often goes beyond exact duplication. We present TRACER, a semantic-aware framework for fine-grained code contamination detection. TRACER models contamination using three levels of semantic overlap - Functionally Identical, Nearly Identical, and Shared Logic - and detects them through a coarse-to-fine pipeline. We also introduce the first benchmark for fine-grained code contamination detection, spanning three widely used benchmarks and three representative post-training datasets. TRACER achieves strong and consistent performance across multiple LLM backbones, with GPT-5 reaching an F1 score of 0.91 in fine-grained detection. In the binary setting, TRACER attains an F1 of 0.92, outperforming existing methods by 42%-217%. We further conduct ablation studies and error analysis to assess the contributions of individual components in TRACER.


2605.24069 2026-05-26 cs.CR cs.AI (updated version)

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin, Biyu Zhou, Wenjie Xiao, Wantao Liu

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

AI summary: Targeting the Tool Description Poisoning (TDP) attacks that LLM agents face when integrating external tools through the Model Context Protocol (MCP), introduces the MCP-TDP Security Benchmark with 32 realistic test cases, finds severe vulnerabilities in an evaluation of 8 mainstream LLMs, and proposes a Reactive Self-Correction defense mechanism.

Abstract

The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.


2605.24064 2026-05-26 cs.LG cs.AI (updated version)

Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion

Jaejun Lee, Seheon Kim, Joyce Jiyoung Whang

Affiliations: School of Computing; Department of AI Computing, KAIST, Daejeon, South Korea

AI summary: For completing arbitrarily masked queries and generating facts in hyper-relational knowledge graphs, proposes KREPE, a generative representation learning method based on masked discrete diffusion that unifies link prediction and fact generation and achieves state-of-the-art performance.

Comments: 28 pages, 16 figures, 18 tables, 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Hyper-relational knowledge graphs (HKGs) effectively represent complex facts. While inferring new knowledge in HKGs is a critical problem, current methods cast it as a simple link prediction, assuming that nearly all entities and relations within a fact are known, leaving only a single blank to be filled. However, this restricted assumption may not hold in real-world scenarios in which multiple, or even all, constituent components of a fact may be missing simultaneously. To bridge this gap, we introduce a task called fact generation: generating a valid hyper-relational fact from an arbitrarily masked query, i.e., completing a partially observed fact or generating a fact from scratch. We propose KREPE, the first generative representation learning method for HKGs that learns to model the probability distributions of missing components conditioned on the local fact components and global structure of HKGs via a masked discrete diffusion. KREPE models both the intra-fact dependencies by contextual message passing and inter-fact correlations by aggregating stochastically sampled contexts. KREPE seamlessly unifies link prediction and fact generation within a single training framework, achieving state-of-the-art performance on standard HKG link prediction benchmarks and outperforming LLM-based baselines in generating novel and correct facts.
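
The difference between link prediction (one blank) and fact generation (an arbitrary set of blanks, up to all of them) can be shown with a small data-structure sketch. It does not reflect KREPE's actual implementation: `HyperFact`, `mask_query`, and the `[MASK]` symbol are hypothetical, and the example fact is only an illustration of a primary triple with one qualifier pair.

```python
from dataclasses import dataclass

MASK = "[MASK]"  # placeholder symbol, chosen for illustration only


@dataclass(frozen=True)
class HyperFact:
    """A primary triple plus zero or more (qualifier relation, value) pairs."""
    head: str
    relation: str
    tail: str
    qualifiers: tuple[tuple[str, str], ...] = ()

    def components(self) -> list[str]:
        # Flatten to [head, relation, tail, q1_relation, q1_value, ...].
        flat = [self.head, self.relation, self.tail]
        for key, value in self.qualifiers:
            flat.extend([key, value])
        return flat


def mask_query(fact: HyperFact, positions: set[int]) -> list[str]:
    """Hide the components at the given flat positions."""
    comps = fact.components()
    if any(not 0 <= p < len(comps) for p in positions):
        raise ValueError("mask position out of range")
    return [MASK if i in positions else c for i, c in enumerate(comps)]


fact = HyperFact(
    "Marie Curie", "educated_at", "University of Paris",
    qualifiers=(("academic_degree", "Doctorate"),),
)

# Link prediction: exactly one component is missing.
print(mask_query(fact, {2}))
# Fact generation from scratch: every component is missing.
print(mask_query(fact, set(range(len(fact.components())))))
```

Any subset of positions between these two extremes is also a valid query, which is the "arbitrarily masked" setting the abstract describes.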

2605.24062 2026-05-26 cs.LG cs.AI 版本更新

Federated Learning over Human-Body Communication for On-Body Edge Intelligence: A Survey, Taxonomy, and BODYFED-HBC Scheduling Vignette

基于人体通信的联邦学习用于体表边缘智能：综述、分类法与BODYFED-HBC调度示例

Koffka Khan

发表机构 * Department of Computing and Information Technology（计算与信息技术系）； The University of the West Indies（西印度大学）

AI总结本文综述了人体通信与联邦学习在可穿戴设备中的交叉领域，提出了一种区分体内、体中心、跨用户和临床云联邦学习部署的分类法，并引入BODYFED-HBC参考架构和调度算法以解决体信道感知的联邦学习问题。

详情

AI中文摘要

人体通信（HBC）是一种有前景的可穿戴体域网络物理层，因为它可以将通信局限在身体周围，并减轻传统无线电链路的负担。联邦学习（FL）是一种有前景的学习层，因为它可以减少生理和行为传感的原始数据集中化。然而，这两类文献之间的联系仍然薄弱：用于可穿戴设备的FL通常抽象通信层，而HBC研究通常抽象学习和模型更新流量。本文综述了HBC、无线体域网络、可穿戴FL、身体互联网隐私和边缘智能优化的交叉领域。我们提出了一种分类法，区分了体内、体中心、跨用户和临床云FL部署，并识别了体信道感知FL这一开放问题：即客户端选择、更新压缩和聚合由姿态相关的HBC链路、剩余能量、传感器内存和隐私风险控制的学习协议。为了使研究议程具体化，我们引入了BODYFED-HBC作为参考架构，并提供了优化公式和调度算法。我们进一步指定了一个可复现的模拟示例，该示例结合了公共可穿戴数据集和经验性的体耦合通信信号损耗模型。文章最后为工作在硬件层之上的计算机科学家提供了开放数据集、评估指标、局限性和研究方向。

英文摘要

Human-body communication (HBC) is a promising physical substrate for wearable body-area networks because it can localize communication around the body and reduce the burden of conventional radio links. Federated learning (FL) is a promising learning substrate because it can reduce raw-data centralization for physiological and behavioral sensing. Yet these two literatures remain weakly connected: FL for wearables usually abstracts the communication layer, whereas HBC research usually abstracts learning and model-update traffic. This article surveys the intersection of HBC, wireless body-area networks, wearable FL, Internet-of-Bodies privacy, and edge-intelligence optimization. We propose a taxonomy that distinguishes intra-body, body-hub, cross-user, and clinical-cloud FL deployments, and we identify the open problem of body-channel-aware FL: learning protocols whose client selection, update compression, and aggregation are controlled by posture-dependent HBC links, residual energy, sensor memory, and privacy risk. To make the research agenda concrete, we introduce BODYFED-HBC as a reference architecture and provide an optimization formulation and scheduling algorithm. We further specify a reproducible simulation vignette that combines public wearable datasets with empirical body-coupled-communication signal-loss models. The article concludes with open datasets, evaluation metrics, limitations, and research directions for computer scientists working above the hardware layer.

2605.24058 2026-05-26 cs.LG cs.AI 版本更新

Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

符号胜过浮点：面向设备端微调的低秩双二值适配器

Yoshihiko Fujisawa, Yuma Ichikawa, Yudai Fujimoto, Akira Sakai, Katsuki Fujisawa

发表机构 * Fujitsu Limited（富士通株式会社）； Institute of Science Tokyo（东京科学大学）； RIKEN Center for AIP（理化学研究所革新智能统合研究中心）； Tokai University（东海大学）

AI总结提出LoRDBA，一种用二值符号载波和通道级缩放替代低秩因子的适配器，在保持LoRA兼容性的同时显著降低存储和计算开销，并在设备端微调中匹配或超越低比特基线性能。

Comments 34 pages, 3 figures

详情

AI中文摘要

大型语言模型的设备端适配通常保持量化基模型冻结，同时训练和部署一个小型任务特定的LoRA适配器。然而，在未合并的适配器模式下，适配器不仅仅是一个紧凑的存储模块；它引入了一个额外的密集浮点分支，维护可训练状态以进行本地更新，并充当通信和热交换单元。我们提出LoRDBA，一种LoRA兼容的适配器，它将两个低秩因子替换为二值符号载波，同时通过轻量级的通道级缩放表示幅度，将密集适配器分支转换为两个符号累积矩阵乘法，中间穿插通道级缩放。有限样本分析表明，重建质量由原始LoRA因子的残差与幅度之比决定。在适配器模式实验中，LoRDBA在匹配模型大小的情况下优于低比特基线，并在某些场景下匹配fp16 LoRA的质量。尽管适配器占用减少了超过10倍，未合并的适配器在匹配秩r=16时最多引入8%的预填充延迟开销，训练内存开销约为fp16 LoRA的1.6倍。

英文摘要

On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a compact storage module; it introduces an additional dense floating-point branch, maintains a trainable state for local updates, and acts as a unit of communication and hot-swapping. We introduce LoRDBA, a LoRA-compatible adapter that replaces both low-rank factors with binary sign carriers while representing magnitudes through lightweight, channel-wise scales, converting the dense adapter branch into two sign-accumulation matrix multiplications interleaved with channel-wise scaling. A finite-sample analysis shows that reconstruction quality is governed by the residual-to-magnitude ratio of the original LoRA factors. In adapter-mode experiments, LoRDBA outperforms low-bit baselines at matched model sizes while matching fp16 LoRA quality in selected regimes. The unmerged adapter incurs at most 8% prefill latency overhead at matched rank r=16 despite an over 10x reduction in adapter footprint, with moderate training memory overhead of approximately 1.6x that of fp16 LoRA.
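
The sign-plus-scale structure described in this abstract can be illustrated with a small sketch. This is not LoRDBA itself: it assumes one scale per row, set to the row's mean absolute value (a common binarization choice that the abstract does not confirm), and every function name here is hypothetical.

```python
def binarize_rows(matrix):
    """Split each row into a +/-1 sign carrier and one scalar scale.

    Assumption: the scale is the mean absolute value of the row, which
    minimizes squared error for a sign-times-scale approximation.
    """
    signs, scales = [], []
    for row in matrix:
        signs.append([1 if v >= 0 else -1 for v in row])
        scales.append(sum(abs(v) for v in row) / len(row))
    return signs, scales


def sign_matvec(signs, scales, x):
    """Apply a binarized matrix: accumulate +/-x entries, then scale."""
    out = []
    for sign_row, scale in zip(signs, scales):
        acc = sum(xi if s > 0 else -xi for s, xi in zip(sign_row, x))
        out.append(scale * acc)
    return out


def adapter_forward(a, b, x):
    """Low-rank branch y = B (A x) with both factors binarized."""
    a_signs, a_scales = binarize_rows(a)
    b_signs, b_scales = binarize_rows(b)
    hidden = sign_matvec(a_signs, a_scales, x)     # rank-r intermediate
    return sign_matvec(b_signs, b_scales, hidden)  # back to output width
```

The inner loops only add or subtract inputs; floating-point multiplication happens once per output channel, which is the pattern the abstract calls sign-accumulation interleaved with channel-wise scaling.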

2605.24057 2026-05-26 cs.LG cs.AI 版本更新

Feature Lottery? A Bifurcation Theory of Concept Emergence

特征彩票？概念涌现的分岔理论

Fuming Yang

发表机构 * MIT（麻省理工学院）

AI总结提出一种基于分岔理论的方法，通过损失Hessian驱动的超临界叉形分岔检测表示动力学中的结构涌现，并引入无标签相位坐标β/β_c，在多种设置下验证了四个不同的转变阶段，揭示了特征可解释性的早期可预测性。

详情

AI中文摘要

神经网络在训练过程中的特定时刻获得结构化表示，然而识别这些转变通常依赖于回顾性的、基于标签的指标。我们引入了一种表示动力学的分岔理论来实时检测这些时刻。通过分析附加在演化编码器上的被动高斯混合模型探针，我们展示了结构的开始对应于由损失Hessian驱动的超临界叉形分岔。系统表现出一个理论上可预测的过零点（β_c），与网络当前状态（β）相比，产生一个动态比率β(t)/β_c(t)：一个通用的、无标签的表示动力学相位坐标，完全可以从隐藏状态计算得出。我们在不同设置下实证验证了该坐标预测的四个不同转变阶段：语言模型（Pythia）上的稀疏自编码器、自监督学习（CIFAR）和grokking（模算术）。关键的是，在有限耗散下，宏观对称性破缺可能滞后于初始过零点数个数量级，这为grokking中观察到的延迟逃逸提供了严格的动力学解释。微观上，分岔产生了一个共享的不稳定子空间，迫使集体对称性破缺。我们将其称为稀疏自编码器训练中的“特征彩票”：一个特征的最终可解释性变得惊人地早期可预测。仅在训练5%时，早期原子纯度就能稳健地预测最终收敛纯度，其中前十百分位的早期原子在收敛时的纯度比基线高出12倍以上。除了解释概念涌现外，β/β_c还为训练健康提供了实用的早期预警指标，在下游指标反应之前检测到可用结构的出现、特征身份的结晶以及表示崩溃的时期。

英文摘要

Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies on retrospective, label-dependent metrics. We introduce a bifurcation theory of representation dynamics to detect these moments in real time. Analyzing a passive GMM probe attached to the evolving encoder, we show the onset of structure corresponds to a supercritical pitchfork bifurcation driven by the loss Hessian. The system exhibits a theoretically predictable zero-crossing ($β_c$) that, compared to the network's current state ($β$), yields a dynamic ratio $β(t)/β_c(t)$: a universal, label-free phase coordinate for representation dynamics, computable entirely from hidden states. We empirically validate four distinct transition regimes predicted by this coordinate across diverse settings: SAEs on language models (Pythia), SSL (CIFAR), and grokking (modular arithmetic). Crucially, under finite dissipation, macroscopic symmetry-breaking can lag the initial zero-crossing by orders of magnitude, providing a rigorous dynamical account of the delayed escape observed in grokking. Microscopically, the bifurcation creates a shared unstable subspace, forcing collective symmetry breaking. We term this the "feature lottery" in SAE training: a feature's terminal interpretability becomes predictable remarkably early. By only 5% of training, early atom purity robustly predicts final convergence purity, with top-decile early atoms achieving over 12x the baseline purity at convergence. Beyond explaining concept emergence, $β/β_c$ provides a practical early-warning indicator for training health, detecting the onset of usable structure, the crystallization of feature identity, and representational collapse epochs before downstream metrics react.
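
For readers unfamiliar with the term, the textbook one-dimensional normal form of a supercritical pitchfork bifurcation is given below. This is the standard form, not the paper's derivation; reading the control parameter $r$ as something like $β/β_c - 1$ is an assumption made here only to connect the notation.

```latex
\dot{x} = r\,x - x^{3}, \qquad
x^{*} =
\begin{cases}
0 \ (\text{stable}) & r \le 0,\\[2pt]
0 \ (\text{unstable}), \quad \pm\sqrt{r} \ (\text{stable}) & r > 0 .
\end{cases}
```

Below the critical point the symmetric state $x^{*} = 0$ is the only equilibrium. Past it, that state loses stability and two symmetry-broken branches grow continuously from zero, which is why the transition can be anticipated from the sign change of the linear coefficient.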

2605.24055 2026-05-26 cs.LG cs.AI 版本更新

Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

Cascade-KDE：面向分布外脉冲损坏的鲁棒时间序列恢复

Yuefeng Liu, Ning Yang, Ziyu Yang

发表机构 * School of Digital and Intelligent Industry (School of Cyber Science and Technology)（数字与智能产业学院（网络科学与技术学院））； Inner Mongolia University of Science and Technology（内蒙古科技大学）

AI总结提出Cascade-KDE无训练框架，通过二维密度估计、密度截断鲁棒期望和指数级联自适应停止，在保留局部结构的同时鲁棒恢复被高斯噪声和脉冲异常损坏的时间序列。

详情

AI中文摘要

工业传感、医疗和能源系统中的真实世界时间序列数据通常被高斯噪声和偶尔的大幅度脉冲异常值混合污染。对于依赖局部形状的任务，如心电图形态分析和电池退化监测，主要要求不仅是低重建误差，还要保留导数峰值和任务关键特征。我们提出了Cascade-KDE，一种用于损坏时间序列的无训练恢复框架。该方法首先估计二维时间-幅度密度，然后应用密度截断鲁棒期望来限制远处异常点的影响，最后通过具有自适应停止的指数级联细化序列。该设计旨在提高在分布外脉冲损坏下的鲁棒性，同时使恢复轨迹接近原始局部结构。在多个基准数据集上，所提方法在曲线保真度、导数保留、下游分类和运行时效率方面相比经典滤波器和代表性学习基线表现出一致的改进。这些结果表明，基于有界密度的恢复是噪声时间序列流程中保留特征预处理的实用选择。

英文摘要

Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occasional large-magnitude impulse outliers. For tasks that depend on local shape, such as ECG morphology analysis and battery degradation monitoring, the main requirement is not only low reconstruction error but also preservation of derivative peaks and task-critical features. We propose Cascade-KDE, a training-free restoration framework for corrupted time series. The method first estimates a two-dimensional temporal-amplitude density, then applies a Density-Truncated Robust Expectation to limit the influence of distant abnormal points, and finally refines the sequence through an exponential cascade with adaptive stopping. This design aims to improve robustness under out-of-distribution impulse corruptions while keeping the restored trajectory close to the original local structure. Across several benchmark datasets, the proposed method shows consistent gains over classical filters and representative learning-based baselines on curve fidelity, derivative preservation, downstream classification, and runtime efficiency. These results suggest that bounded density-based restoration is a practical option for feature-preserving preprocessing in noisy time-series pipelines.

2605.24053 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

打破概率的锁链：中智逻辑作为大型语言模型中认知不确定性的新框架

Maikel Yelandi Leyva-Vázquez, Florentin Smarandache

发表机构 * Universidad Bolivariana del Ecuador, Coordinación Académica de Posgrado（厄瓜多尔玻利瓦尔大学，研究生学术协调处）； Universidad de Guayaquil（瓜亚基尔大学）； Universidad Bernardo O’Higgins（贝尔纳多·奥希金斯大学）； Mathematics, Physics, and Natural Sciences Division, University of New Mexico（新墨西哥大学数学、物理和自然科学部）

AI总结本文提出使用中智逻辑（Truth、Indeterminacy、Falsity三个独立维度）替代传统概率框架，通过实验发现该框架能更丰富地表示LLM的内部状态，并在35%的评估中自发出现超真状态，为透明、可靠和伦理感知的AI系统提供关键步骤。

Comments Published in Neutrosophic Sets and Systems, Vol. 99 (2026). Author's preprint version. Open code and data available at: github.com/mleyvaz/neutrosophic-llm-logic

详情

DOI: 10.5281/zenodo.19954583
Journal ref: Neutrosophic Sets and Systems, Vol. 99, 2026

AI中文摘要

大型语言模型（LLM）主要受概率框架支配，其中结果概率之和被约束为1。这种由Softmax层强加的结构限制导致不确定性崩溃，使得难以区分认知不确定性、悖论和模糊性。我们提出了一种中智逻辑应用的实证研究，该框架将真（T）、不确定（I）和假（F）视为三个独立维度，用于建模LLM中的认知状态。我们在四个OpenAI GPT模型家族上进行了实验，涵盖五种语言现象：逻辑悖论、认知无知、模糊性、伦理矛盾和未来偶然性，采用三种提示策略：中智、概率和熵衍生。我们的发现表明，中智方法通过允许T+I+F>1（我们称之为超真状态），提供了模型内部状态的更丰富表示。在35%的评估中，超真状态自发出现，主要出现在伦理矛盾和逻辑悖论下。我们证明，该方法在模糊上下文中保留了真值，并提供了一种稳健的方法来识别和量化内部模型冲突。我们得出结论，中智评估层的集成是迈向更透明、可靠和伦理感知的AI系统的关键一步。

英文摘要

Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. This architectural limitation, often imposed by Softmax layers, leads to a collapse of uncertainty that makes it difficult to differentiate between epistemic uncertainty, paradox, and vagueness. We present an empirical investigation of the application of Neutrosophic Logic, a framework that treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions, to model epistemic states in LLMs. We conducted experiments on a family of four OpenAI GPT models across five linguistic phenomena: logical paradoxes, epistemic ignorance, vagueness, ethical contradictions, and future contingencies, under three prompting strategies: neutrosophic, probabilistic, and entropy-derived. Our findings reveal that the neutrosophic approach, by allowing T+I+F > 1, a state we term hyper-truth, provides a richer representation of a model's internal state. In 35% of evaluations, hyper-truth emerged spontaneously, predominantly under ethical contradiction and logical paradox. We demonstrate that this approach preserves truth values in fuzzy contexts and offers a robust method for identifying and quantifying internal model conflict. We conclude that the integration of neutrosophic evaluation layers is a critical step toward more transparent, reliable, and ethically aware AI systems.
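
The difference from a probability vector can be made concrete with a small sketch. It assumes each of T, I, and F is scored independently in [0, 1]; only the "hyper-truth" label comes from the abstract, and the other two labels and the function name are hypothetical.

```python
def classify_neutrosophic(t, i, f, tol=1e-9):
    """Label a (T, I, F) triple by how its components sum.

    Each component is scored independently in [0, 1], so the total can
    range from 0 to 3 instead of being forced to 1.
    """
    for name, value in (("T", t), ("I", i), ("F", f)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    total = t + i + f
    if total > 1.0 + tol:
        return "hyper-truth"       # the abstract's term for T + I + F > 1
    if total < 1.0 - tol:
        return "under-determined"  # hypothetical label, not from the source
    return "probability-like"      # hypothetical label: components sum to 1
```

A Softmax output always lands in the last branch, so a representation that can reach the first branch carries information that a normalized distribution cannot express.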

2605.24052 2026-05-26 cs.LG cs.AI 版本更新

Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

移动众包中用于LLM微调的诚实在线偏好聚合

Shugang Hao, Lingjie Duan

发表机构 * Singapore University of Technology and Design（新加坡科技设计大学）； Hong Kong University of Science and Technology（香港科技大学）

AI总结针对移动众包中工人可能策略性谎报偏好反馈的问题，提出一种动态贝叶斯博弈模型和在线加权聚合机制，确保工人诚实反馈并实现次线性遗憾。

详情

AI中文摘要

为了更好地满足移动应用（如导航）中用户的需求，移动众包平台可以迭代地将大语言模型（LLM）生成的内容（例如，AI生成的交通状况预测）与从众包工人（例如，移动用户）收集的人类反馈进行对齐。然而，工人可能会策略性地谎报他们的在线偏好反馈，以最大化其影响力或报酬。移动众包中现有的流程（例如，基于EM的权重估计）无法在这种在线设置中识别出最准确的工人，导致在$T$个时隙上产生线性遗憾$\mathcal{O}(T)$。在本文中，我们研究了移动众包中用于LLM微调的诚实在线偏好聚合。我们建立了一个新的动态贝叶斯博弈来建模平台与策略性移动工人之间的多智能体在线学习过程。我们提出了一种新颖的在线加权聚合机制，该机制根据每个工人的反馈准确性动态调整其在偏好聚合中的权重。我们证明了我们的机制确保了策略性工人的诚实反馈，并在$T$个时隙上实现了次线性遗憾$\mathcal{O}(\sqrt{T})$。我们进一步将我们的机制扩展到每个时隙工人反馈有限的挑战性场景，仍然保证了次线性遗憾$\mathcal{O}(\sqrt{T})$。在真实世界数据集上进行的LLM微调实验进一步证明了我们的机制相对于基准方案的显著性能提升。

英文摘要

To better serve users' demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large language model (LLM)-generated content (e.g., AI-generated traffic condition predictions) with human feedback collected from crowdsourcing workers (e.g., mobile users). However, workers may strategically misreport their online preference feedback to maximize their influence or payment. Existing pipelines in mobile crowdsourcing (e.g., EM-based weight estimation) fail to identify the most accurate worker in this online setting, resulting in a linear regret $\mathcal{O}(T)$ over $T$ time slots. In this paper, we study truthful online preference aggregation for LLM fine-tuning in mobile crowdsourcing. We formulate a new dynamic Bayesian game to model the multi-agent online learning process between the platform and strategic mobile workers. We propose a novel online weighted aggregation mechanism that dynamically adjusts each worker's weight in the preference aggregation according to their feedback accuracy. We prove that our mechanism ensures truthful feedback from strategic workers and achieves a sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. We further extend our mechanism to a challenging scenario with limited worker feedback per time slot, still guaranteeing a sublinear regret $\mathcal{O}(\sqrt{T})$. Experiments on LLM fine-tuning with real-world datasets further demonstrate significant performance gains of our mechanisms over benchmark schemes.
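
The idea of reweighting workers by feedback accuracy can be sketched with a generic multiplicative-weights update. This is not the paper's mechanism and it does not model incentives or truthfulness. It assumes binary preference reports and that a ground-truth label becomes available after each time slot; all names are hypothetical.

```python
import math


def aggregate(weights, reports):
    """Weighted majority vote over binary preference reports (0 or 1)."""
    mass_for_one = sum(w for w, r in zip(weights, reports) if r == 1)
    return 1 if mass_for_one >= sum(weights) / 2 else 0


def update_weights(weights, reports, truth, eta=0.5):
    """Multiplicative-weights step: shrink the weight of wrong workers."""
    new = [w * math.exp(-eta * (r != truth)) for w, r in zip(weights, reports)]
    total = sum(new)
    return [w / total for w in new]


def run(rounds, eta=0.5):
    """rounds: list of (reports, truth); returns final weights and decisions."""
    n_workers = len(rounds[0][0])
    weights = [1.0 / n_workers] * n_workers
    decisions = []
    for reports, truth in rounds:
        decisions.append(aggregate(weights, reports))
        weights = update_weights(weights, reports, truth, eta)
    return weights, decisions
```

With one consistently accurate worker and two consistently wrong ones, the aggregate initially follows the majority and then shifts toward the accurate worker as its weight grows.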

2605.24050 2026-05-26 cs.SE cs.AI stat.AP 版本更新

More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

更多技能，更差智能体？扩展技能库时技能遮蔽降低性能

Hongwen Song, Wei Song

发表机构 * Databricks Inc.（Databricks公司）

AI总结本文研究LLM智能体技能库扩展导致性能下降的现象，提出将性能下降分解为技能遮蔽和上下文开销两种效应，并通过实验证明技能遮蔽是主要瓶颈。

详情

AI中文摘要

技能库允许LLM智能体按需加载任务特定指令，使非专家用户能够通过自然语言解决领域特定任务，而无需知道存在哪些技能或它们如何工作。然而，随着技能库的增长，性能会下降：当从一小组有用技能扩展到包含202个技能的库时，性能下降高达21%。在这项工作中，我们将这种性能下降定义为从加载已知有用技能库到加载完整技能库之间的通过率下降。此外，我们提出通过条件化技能调用（即智能体在轨迹中选择哪些技能）将通过率下降分解为两种效应：技能遮蔽，即随着技能库扩展，智能体更频繁地选择错误技能；以及上下文开销，即即使选择正确，扩大的上下文也会降低执行性能。我们推导了这两种效应的上界，以表征它们对通过率下降的影响程度。我们对效应及其上界的经验估计均表明，技能遮蔽效应随技能库大小增长，并对性能下降有显著贡献，而上下文开销效应仍然很小且与零无显著差异。这种观察到的非对称性表明，技能选择失败（而非上下文扩大）是扩展技能库时的主要瓶颈。

英文摘要

Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow, by up to 21% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and the full library. Moreover, we propose to decompose the pass rate drop by conditioning on skill invocation (which skills the agent selects during a trajectory) into two effects: skill shadowing, where the agent selects wrong skills more often as the library expands, and context overhead, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize how much each contributes to the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the skill shadowing effect grows with library size and significantly contributes to the performance degradation, whereas the context overhead effect remains small and indistinguishable from zero. This observed asymmetry establishes that skill selection failure, not the enlarged context, is the primary bottleneck when expanding skill libraries.
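
One way to see how conditioning on skill selection splits a pass-rate drop is the exact two-term identity sketched below. The abstract does not give the paper's formulas, so this particular split, its mapping onto "skill shadowing" and "context overhead", and the function name are all assumptions; the numbers in the usage note are invented for illustration.

```python
def decompose_pass_rate_drop(small, full):
    """Split a pass-rate drop into a selection term and an execution term.

    small, full: dicts mapping a selection outcome (e.g. "correct",
    "wrong") to a pair (probability of that outcome, pass rate given it)
    for the small library and the full library respectively.
    Returns (total_drop, selection_term, execution_term), with
    total_drop == selection_term + execution_term.
    """
    if set(small) != set(full):
        raise ValueError("both libraries must use the same outcome labels")
    selection = execution = 0.0
    for outcome, (p_small, q_small) in small.items():
        p_full, q_full = full[outcome]
        # Selection mix shifted, execution quality held at small-library level.
        selection += (p_small - p_full) * q_small
        # Same selection mix as the full library, execution quality changed.
        execution += p_full * (q_small - q_full)
    return selection + execution, selection, execution
```

For example, with `small = {"correct": (0.9, 0.8), "wrong": (0.1, 0.2)}` and `full = {"correct": (0.6, 0.78), "wrong": (0.4, 0.2)}`, nearly all of the drop lands in the selection term, which mirrors the asymmetry the abstract reports.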

2605.24048 2026-05-26 cs.LG cs.AI 版本更新

Mixture of Complementary Agents for Robust LLM Ensemble

互补代理混合：鲁棒的大语言模型集成

Yichi Zhang, Kevin Lu, Yuang Zhang, Jie Gao, Lirong Xia, Fang-Yi Yu

发表机构 * DIMACS, Rutgers University（罗格斯大学DIMACS研究中心）； Department of Mathematics, Rutgers University（罗格斯大学数学系）； Department of Computer Science, George Mason University（乔治·梅森大学计算机科学系）； Department of Computer Science, Rutgers University（罗格斯大学计算机科学系）

AI总结将大语言模型选择视为组合选择问题，提出基于互补性的贪心选择算法，在性能与成本间取得最佳平衡。

详情

AI中文摘要

多AI协作，例如集成或辩论大语言模型（LLMs），是一种有前景的聚合信息和提升性能的范式。这些流程的基础步骤是将多个提议LLM的响应输入到一个总结LLM中，后者合成一个更好的答案。然而，选择哪些提议者并非易事。现有方法主要关注准确性（选择最强模型）或多样性（确保多样性），并且常常忽视提议者之间以及与总结者之间的交互。我们将提议者选择重新定义为类似于特征选择的组合选择问题，其中LLM的价值在于其与其他模型的互补性。然而，由于时间复杂度过高，直接应用标准特征选择算法在LLM场景中不切实际。受此限制，我们探索了一系列计算可行的贪心式选择算法，这些算法使用少量标记集评估互补性。我们的实验验证了互补性作为提议者选择的指导原则，并确定了在实践中实现最佳性能-成本权衡的方法。

英文摘要

Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information and boosting performance. A foundational step in these pipelines is to feed the responses of several proposer LLMs into a summarizer LLM, which synthesizes a better answer. However, choosing which proposers to include is non-trivial. Existing approaches primarily focus either on accuracy (picking the strongest models) or diversity (ensuring variety), and often overlook the interactions among proposers and with the summarizer. We reframe proposer selection as a combinatorial selection problem akin to feature selection, where the value of an LLM lies in its complementarity with others. However, directly applying standard feature-selection algorithms is impractical in the LLM setting due to prohibitive time complexity. Motivated by this limitation, we explore an extensive range of computationally feasible, greedy-style selection algorithms that assess complementarity using a small labeled set. Our experiments validate complementarity as a guiding principle for proposer selection and identify methods that achieve the best performance-cost trade-offs in practice.
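
Greedy selection by complementarity can be sketched as follows. The abstract does not specify the scoring function, so this example assumes a simple proxy: the fraction of questions in a small labeled set that at least one selected proposer answers correctly. The function names and the proxy are hypothetical and do not model a real summarizer.

```python
def coverage(selected, correct):
    """Fraction of labeled questions answered correctly by at least one
    selected proposer (a stand-in for what a summarizer could recover)."""
    n_questions = len(next(iter(correct.values())))
    hits = sum(
        any(correct[m][q] for m in selected) for q in range(n_questions)
    )
    return hits / n_questions


def greedy_select(correct, k):
    """Pick k proposers, each time adding the one with the largest gain."""
    selected = []
    remaining = sorted(correct)  # sorted for deterministic tie-breaking
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda m: coverage(selected + [m], correct))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under this proxy a weaker model that covers questions the others miss can be chosen ahead of a stronger but redundant one, which is the contrast with accuracy-only selection that the abstract draws.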

2605.24045 2026-05-26 cs.LG cs.AI 版本更新

A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?

大规模数据集与基准：蛋白质-配体模型学习的是结合位点还是仅仅结合可能性？

Zhaohan Meng, Zhen Bai, Ke Yuan, Iadh Ounis, Zaiqiao Meng, Hao Xu, Joseph Loscalzo

发表机构 * School of Computing Science（计算科学学院）； School of Cancer Sciences（癌症科学学院）； School of Life Science and Technology（生命科学与技术学院）； Institute of Science Tokyo（东京科学大学）； Cancer Research UK Scotland Institute（英国癌症研究会苏格兰研究所）； Language Technology Lab（语言技术实验室）； Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School（哈佛医学院布莱根妇女医院内科）； The Broad Institute of MIT and Harvard（MIT和哈佛大学Broad研究所）

AI总结针对现有基准无法评估模型是否定位结合位点的问题，提出包含约10万对蛋白质-配体的InteractBind数据集和细粒度基准，通过结合位点定位任务揭示模型在强二元预测下定位能力有限。

Comments Under Review for the NeurIPS 2026 Conference, Track on Evaluations and Datasets

详情

AI中文摘要

蛋白质-配体建模是计算药物发现和分子设计的基础。现有的蛋白质-配体基准通常通过二元结合预测和亲和力回归等任务评估蛋白质与配体是否相互作用以及结合强度。然而，这些评估提供的证据有限，无法判断模型是否能够定位结合位点或识别分子识别背后的非共价相互作用。为填补这一空白，我们引入了InteractBind，一个大规模蛋白质-配体数据集，包含约10万对蛋白质-配体对，以及一个用于细粒度评估的基准。核心细粒度任务是结合位点定位，它利用覆盖六种主要非共价相互作用类型的蛋白质残基和配体原子相互作用图，评估模型导出的相互作用图是否能够定位结合位点。InteractBind还包含结合亲和力和蛋白质相似性控制的分割，以支持现实的泛化评估。使用InteractBind，我们评估了八个现有的基于序列和交互感知的模型，评估了二元结合预测和结合位点定位。结果显示，尽管二元结合预测表现强劲，但结合位点定位能力有限，且在不同非共价相互作用类型间存在显著差异。总体而言，InteractBind建立了一个基准范式，鼓励开发更具可解释性和物理基础的蛋白质-配体模型。

英文摘要

Protein-ligand modeling underpins computational drug discovery and molecular design. Existing protein-ligand benchmarks typically evaluate whether a protein and ligand interact and how strongly they bind, through tasks such as binary binding prediction and affinity regression. However, these evaluations provide limited evidence of whether models can localize binding sites or identify the non-covalent interactions underlying molecular recognition. To address this gap, we introduce InteractBind, a large-scale protein-ligand dataset comprising approximately 100k protein-ligand pairs, together with a benchmark for fine-grained evaluation. The core fine-grained task is that of binding-site localization, which uses protein-residue and ligand-atom interaction maps spanning six major types of non-covalent interactions to assess whether model-derived interaction maps localize binding sites. InteractBind further includes binding affinity and protein similarity-controlled splits to support realistic generalization assessment. Using InteractBind, we evaluate eight existing sequence-based and interaction-aware models, assessing binary binding prediction and binding-site localization. Results reveal limited binding-site localization despite strong binary binding prediction, with marked variation across non-covalent interaction types. Overall, InteractBind establishes a benchmark paradigm that encourages the development of more interpretable and physically grounded protein-ligand models.

2605.24043 2026-05-26 cs.LG cs.AI 版本更新

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

LLM-AutoSciLab：通过LLM主动实验进行闭环科学发现

Sanchit Kabra, Nikhil Abhyankar, Saaketh Desai, Prasad Iyer, Chandan K Reddy

发表机构 * Virginia Tech（弗吉尼亚理工大学）； Sandia National Laboratories（桑迪亚国家实验室）

AI总结提出LLM-AutoSciLab闭环框架，通过假设生成与实验选择迭代优化，在预算约束下实现主动数据采集，在三个基准上优于现有方法且样本效率提升2-5倍。

详情

AI中文摘要

科学发现是一个闭环过程，其中假设指导数据采集，观察结果细化假设空间。然而，大多数方法将发现简化为对固定数据集的监督学习，其中有限的观察可能支持多种局部拟合但无法泛化的合理机制。因此，关键挑战在于选择信息丰富的观察以消除不确定性，将焦点从静态推断转向自适应数据采集。为此，我们提出LLM-AutoSciLab，一个将假设生成与假设条件实验选择和机制细化相结合的闭环框架。LLM-AutoSciLab不是将模型拟合到被动收集的数据，而是迭代地提出合理的假设，选择信息丰富的实验来区分或细化它们，并使用由此产生的证据更新其状态。为了评估具有主动数据采集的动态闭环科学发现，我们引入了ActiveSciBench，包含两个数据集：包含57个酶动力学任务的ActiveSciBench-Chem和包含45个基因调控网络任务的ActiveSciBench-GRN。这些数据集将发现建模为预算约束过程，需要自适应实验设计、变量选择和真实机制的恢复。在NewtonBench、ActiveSciBench-Chem和ActiveSciBench-GRN上，LLM-AutoSciLab优于先前方法，在NewtonBench和ActiveSciBench-Chem上分别达到67.6%和35.1%的符号准确率，在ActiveSciBench-GRN上达到31.1%的精确图恢复。此外，假设引导的实验比最强竞争基线样本效率高2-5倍。代码和数据可在https://github.com/scientific-discovery/LLM-AutoSciLab获取。

英文摘要

Scientific discovery is a closed-loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition. To address this, we propose LLM-AutoSciLab, a closed-loop framework that couples hypothesis generation with hypothesis-conditioned experiment selection and mechanism refinement. Rather than fitting models to passively collected data, LLM-AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence. To evaluate dynamic, closed-loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: ActiveSciBench-Chem with 57 enzyme-kinetics tasks and ActiveSciBench-GRN with 45 gene-regulatory-network tasks. These datasets model discovery as a budget-constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms. Across NewtonBench, ActiveSciBench-Chem, and ActiveSciBench-GRN, LLM-AutoSciLab outperforms prior methods, achieving 67.6% and 35.1% symbolic accuracy on NewtonBench and ActiveSciBench-Chem, respectively, and 31.1% exact graph recovery on ActiveSciBench-GRN. Moreover, hypothesis-guided experimentation is 2-5x more sample-efficient than the strongest competing baselines. Code and data are available at: https://github.com/scientific-discovery/LLM-AutoSciLab

2605.24037 2026-05-26 cs.CV cs.AI 版本更新

Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling

模式即序列：将多模态运动预测转化为统一序列模式建模

Zikang Zhou, Haibo Hu, Xinhong Chen, Yifan Zhang, Nan Guan, Yung-Hui Li, Chun Jason Xue, Jianping Wang

发表机构 * City University of Hong Kong（香港城市大学）； City University of Hong Kong (Dongguan)（香港城市大学（东莞））； Hon Hai Research Institute（鸿海研究院）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

AI总结提出Mode-as-Sequence框架，将无序模式集转化为有序模式序列并显式建模模式间依赖，通过ModeSeq和Parallel ModeSeq两种实例化方法解决多模态运动预测中的模式坍塌和置信度排序问题，在Waymo数据集上取得领先性能。

详情

AI中文摘要

多模态运动预测本质上是欠监督的：每个训练场景只提供一个已实现的未来，但存在多个合理的未来。这种稀疏监督通常会导致模式坍塌（冗余假设和模式覆盖不足）以及在预测少量轨迹时置信度排序不可靠。我们提出Mode-as-Sequence，一个统一的解码框架，将无序模式集转化为有序模式序列，并显式建模模式间依赖。在该框架下，我们开发了两种互补的实例化方法。ModeSeq执行循环模式解码，每个模式基于先前生成的模式生成，鼓励多样化、非冗余的假设，并具有校准的置信度排序。为了消除逐模式自回归瓶颈，我们进一步提出Parallel ModeSeq，它使用掩码模式间自注意力保留相同的因果依赖，同时在前向传播中一次性解码所有模式，从而实现高效的大K推理和可扩展的联合场景预测。为了在稀疏标签下学习代表性模式和校准的置信度，我们引入了Early-Match-Take-All (EMTA)及其联合场景扩展MA-EMTA，以及一个轻量级的排序正则化器，以减少置信度反转。在大型基准上的大量实验表明，在数据集、预测时长和对象类型上，排序导向指标和最佳K准确率均有一致提升。在Waymo开放数据集挑战中，ModeSeq在2024年无激光雷达运动预测赛道获得第一名，Parallel ModeSeq在2025年交互预测挑战赛中获得第一名，验证了Mode-as-Sequence在准确性和效率上的有效性。

英文摘要

Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.

利用原子技能实现代理原子研究

Bowen Deng, Bohan Li, Matthew Cox, Hoje Chun, Juno Nam, Artur Lyssenko, Sathya Edamadaka, Jurgis Ruza, Xiaochen Du, Nofit Segal, Jesus Diaz Sanchez, Mingrou Xie, Ty Perez, Yu Yao, Miguel Steiner, Sauradeep Majumdar, Charles B. Musgrave, Anirban Chandra, Abhirup Patra, Detlef Hohl, Connor W. Coley, Ju Li, Rafael Gómez-Bombarelli

发表机构 * Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA ； Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA ； Department of Chemistry, Kookmin University, Seoul 02707, Republic of Korea ； Harvard University, Department of Chemistry ； Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA ； Department of Nuclear Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA ； Shell Information Technology International Inc., Texas 77082, United States ； Shell International Exploration & Production Inc., Texas 77079, United States

AI总结提出AtomisticSkills框架，通过分层分解科学工作流为技能和工具，使通用AI编码代理能够进行原子级研究，并在多个科学任务中验证其能力。

详情

AI中文摘要

计算材料科学和化学涵盖广泛的知识领域和碎片化的软件生态系统。尽管大语言模型（LLMs）已展现出研究能力，但扩展单体代理以管理原子研究的严谨性和复杂性仍然是一个挑战。在此，我们介绍AtomisticSkills，一个开源框架，使通用AI编码代理能够跨材料科学、化学和药物发现进行原子研究。通过将科学工作流分层分解为代理技能和工具，AtomisticSkills为代理提供模块化、可扩展且即插即用的研究能力。该框架集成了超过100个人工策划的多学科技能，包括数据库访问、热力学和动力学建模，以及采用机器学习原子间势（MLIPs）和密度泛函理论（DFT）的多种模拟引擎。我们根据科学文献验证其功能覆盖范围，并展示了跨不同科学任务的强大编排能力：锂离子固态电解质的生成设计、用于CO2捕获的金属有机框架的高通量筛选、自主MLIP基准测试和微调、用于药物设计的基于多阶段结构的虚拟筛选、多模态X射线衍射模式分析，以及用于析氧反应的铁氧化物催化剂筛选。AtomisticSkills为构建完全自主的AI科学家提供了关键的代理基础设施。

英文摘要

Computational materials science and chemistry span vast knowledge domains and fractured software ecosystems. Although large language models (LLMs) have demonstrated research capabilities, scaling monolithic agents to manage the rigor and complexity of atomistic research remains a challenge. Here, we introduce AtomisticSkills, an open-source harness framework that empowers general-purpose AI coding agents to conduct atomistic research across materials science, chemistry, and drug discovery. By hierarchically decomposing scientific workflows into agent skills and tools, AtomisticSkills provides agents with modular, extensible, and plug-and-play research capabilities. The framework integrates more than 100 human-curated multidisciplinary skills, including database access, thermodynamics and kinetics modeling, and diverse simulation engines employing machine learning interatomic potentials (MLIPs) and density functional theory (DFT). We validate its functional coverage against scientific literature and demonstrate robust orchestration capabilities across diverse scientific campaigns: generative design of Li-ion solid-state electrolytes, high-throughput screening of metal-organic frameworks for CO2 capture, autonomous MLIP benchmarking and fine-tuning, multi-stage structure-based virtual screening for drug design, multimodal X-ray diffraction pattern analysis, and screening of Fe-oxide catalysts for oxygen evolution reaction. AtomisticSkills provides a critical agent infrastructure towards building fully autonomous AI scientists.

2605.23997 2026-05-26 cs.CV cs.AI cs.LG 版本更新

IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

IVR-R1：通过强化学习中的迭代视觉基础推理优化轨迹

Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren, Fuxiang Wu, Quan Chen, Liu Liu

发表机构 * Hangzhou International Innovation Institute, Beihang University（北京航空航天大学杭州国际创新研究院）； School of Artificial Intelligence, Beihang University（北京航空航天大学人工智能学院）； Kuaishou Technology（快手科技）； Shenzhen Institute of Advanced Integration Technology, Shenzhen（深圳先进集成技术研究院）

AI总结提出IVR-R1框架，利用奖励驱动的筛选机制和迭代再推理循环，在强化学习中动态校正多模态推理轨迹，以解决视觉幻觉和逻辑错误问题。

详情

AI中文摘要

通过强化学习的多模态大语言模型在复杂视觉推理任务中展现出显著能力，但在长程多模态场景中仍存在局限，常出现视觉幻觉和逻辑错误。当前方法通常将高维视觉场景预编码为离散文本代理以促进下游推理。然而，随着推理链展开，文本与视觉场景之间固有的信息不对称会侵蚀视觉基础，导致推理误导和错误输出。为解决此问题，我们提出IVR-R1（迭代视觉基础推理），一种新颖的强化学习训练框架，通过动态视觉重新对齐主动校正推理轨迹以指导策略优化。具体而言，利用奖励驱动的筛选机制识别有缺陷的展开，IVR-R1在多模态上下文中执行细粒度的步骤级错误归因。通过将中间推理状态与原始视觉先验进行迭代交叉引用，再推理循环实现自动轨迹校正，有效合成专家级演示，作为策略模型的高保真推理模板。我们在多种多模态基准上的实验表明，IVR-R1持续优于现有强化学习方法，为在复杂多模态推理中保持逻辑和视觉一致性建立了优越范式。

英文摘要

Multimodal large language models trained with reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucinations and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework that uses dynamic visual re-alignment to actively rectify reasoning trajectories and guide policy optimization. Specifically, by leveraging a reward-driven screening mechanism to identify flawed rollouts, IVR-R1 executes a fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.

2605.23994 2026-05-26 cs.CV cs.AI 版本更新

RAW: Robust Avatar Watermarking -- Benchmarking and Baseline

RAW：鲁棒的数字人水印——基准测试与基线方法

Jack Parry, Jack Saunders, Vinay Namboodiri

发表机构 * University of Bath（巴斯大学）

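AI summary: To address post-processing attacks on digital avatar watermarks, proposes the RAW benchmark and WALT, a watermarking method that embeds marks in UV texture space via 3D face reconstruction; WALT reaches 92.4% robustness under zoom attacks and 95.6% under background removal.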

DOI: 10.2312/egs.20261006

AI中文摘要

数字人水印面临独特挑战：在部署前，数字人通常要经过背景替换、重新构图和格式转换等常规后处理。我们提出RAW（鲁棒的数字人水印），一个包含来自5个商业提供商的50个合成数字人视频和6种模拟真实数字人工作流程的攻击的基准测试。评估7种现有方法发现，数字人特定的攻击（如背景移除）会显著降低水印恢复率。我们提出WALT（通过学习纹理进行数字人水印），该方法通过3D人脸重建在UV纹理空间中嵌入水印。WALT在缩放攻击下达到最高鲁棒性（92.4%），同时在背景移除攻击下保持强劲性能（95.6%）。我们发布该基准测试以促进针对数字人水印的研究。

英文摘要

Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and format conversion before deployment. We introduce RAW (Robust Avatar Watermarking), a benchmark comprising 50 synthetic avatar videos from 5 commercial providers and 6 attacks simulating real-world avatar workflows. Evaluating 7 existing methods reveals that avatar-specific attacks such as background removal significantly degrade watermark recovery. We propose WALT (Watermarking Avatars with Learned Textures), which embeds watermarks in UV texture space via 3D face reconstruction. WALT achieves the highest robustness to zoom attacks (92.4%) while maintaining strong performance on background removal (95.6%). We release our benchmark to facilitate research into avatar-specific watermarking.

2605.23992 2026-05-26 cs.CV cs.AI 版本更新

A World Model of Radiologist Reading for Medical Image Representation Learning

放射科医生阅读的世界模型用于医学图像表示学习

Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao

发表机构 * University of Georgia（佐治亚大学）； University of Texas at Arlington（德克萨斯大学阿灵顿分校）； New Jersey Institute of Technology（新泽西理工学院）

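AI summary: Proposes GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. By autoregressively predicting fixated-patch representations and spatially completing unvisited regions, it achieves state-of-the-art diagnostic accuracy and zero-shot performance on multiple benchmarks.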

AI中文摘要

放射科医生的眼动追踪数据提供了专家在图像阅读过程中如何搜索、比较和积累证据的丰富记录；然而，现有方法仅部分利用这一信号，要么作为静态空间先验，要么作为与诊断脱节的辅助预测目标。我们提出GazeWorld，一种医学成像世界模型，将图像视为世界，将放射科医生的注视序列视为通过该世界的轨迹。GazeWorld自回归地从所有先前访问过的补丁预测下一个注视补丁的潜在表示，同时一个空间补全分支覆盖未访问区域。在推理时，GazeWorld仅从图像生成一系列补丁表示，无需真实注视数据。冻结的GazeWorld特征在CheXpert、RSNA肺炎和SIIM-ACR气胸的所有九个监督设置中实现了最先进的诊断准确率，并在所有三个基准上取得了最高的零样本准确率。在GazeSearch基准上，使用相同冻结特征训练的通用解码器在ScanMatch和SED上分别比专门构建的LogitGaze-Med高出16%和22%，尽管未明确训练以预测注视。GazeWorld表明，建模专家如何阅读（而不仅仅是他们得出什么结论）为医学成像AI提供了一种有前景的预训练范式。

英文摘要

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

2605.23989 2026-05-26 cs.AI cs.CL cs.CR 版本更新

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

迈向可信的自主AI：安全性、鲁棒性、隐私与系统安全的全面综述

Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu

发表机构 * Faculty of Engineering, Department of Computer Science and Engineering, The Chinese University of Hong Kong（香港中文大学工程学院、计算机科学与工程系）； Artificial Intelligence Innovation and Incubation Institute, Fudan University（复旦大学人工智能创新与孵化院）； Shanghai Academy of AI for Science（上海人工智能科学研究院）

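AI summary: Surveys agentic AI systems along two core dimensions, safety and robustness, and privacy and system security, covering where risks arise, stage-targeted mitigation strategies, and unified evaluation metrics, and discusses open challenges.

Comments: 36 pages, 4 figures. Survey/review article on trustworthy agentic AI. Published in Academia AI and Applications, 2026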

DOI: 10.20935/AcadAI8260
Journal ref: Academia AI and Applications, vol. 2, 2026

AI中文摘要

自主AI系统——即通过规划、工具使用、记忆和长程交互增强的大型语言模型（LLM）——能够自主执行复杂任务，但其多步轨迹引入了新的故障模式，挑战了可信赖性。本综述通过两个对高风险部署至关重要的核心维度，对可信自主AI进行了重点考察：安全性与鲁棒性，以及隐私与系统安全性。针对每个维度，我们澄清了关键概念，识别了风险在代理工作流中出现的环节，并总结了针对各阶段的缓解策略。其他可信赖性方面（价值对齐、透明度、公平性和问责制）作为相关背景而非平行章节进行讨论。为了支持一致的比较和部署决策，我们将评估整合到一个统一的指标与基准中心，强调结果和过程信号（例如，约束违反、轨迹完整性和对抗成功率），并为发布门控提供场景到指标的指导。最后，我们概述了开放挑战，如自我进化代理、运行时监控与验证、隐私保护个性化以及信任-效用权衡，并提出了一个关于开源自主系统中现实世界安全失败的案例研究。我们的目标是作为在高风险环境中构建可信自主系统的研究人员和实践者的实用参考。

英文摘要

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

2605.23987 2026-05-26 cs.AI cs.RO 版本更新

Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

超越预定义学习对象：面向最新自主机器人学习的思维-学习交互模型

Hong Su

发表机构 * School of Computer Science, Chengdu University of Information Technology（成都信息科技大学计算机学院）

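AI summary: Autonomous robots in open environments cannot rely on predefined learning objects. The paper proposes a thinking-learning interaction model with a bidirectional mechanism: thinking guides learning (identifying changes, selecting evidence, organizing training, planning verification) and learning promotes thinking (updating knowledge, experience, strategies, and reasoning). The model supports input feature discovery, output category expansion, model update, and action routine reconstruction, and experiments validate its effectiveness on feature adaptation, new-category formation, model update, and action optimization.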

AI中文摘要

在开放和变化环境中运行的自主机器人不能总是依赖预定义的输入、输出和动作例程。尽管现有的学习方法使机器人能够通过环境交互提高性能，但学习对象往往是预先固定的，例如输入特征、识别输出、网络结构、任务目标或动作序列。这限制了它们在长期运行中出现新特征、新类别或更高效任务例程时的适应能力。为解决此问题，本文提出了一种面向自主机器人的思维-学习交互模型。核心思想是：思维通过识别潜在变化、选择有用证据、组织训练材料和规划验证动作来指导学习，而学习通过更新任务知识、特征选择经验、动作策略和未来推理过程来促进思维。基于这种双向机制，机器人可以逐步超越预定义的学习设置，并通过与环境的持续交互调整其识别关系和动作关系。具体来说，该模型支持自适应输入特征发现、输出类别扩展、学习模型更新和动作例程重构。实验结果表明，该模型在特征适应中将最终识别准确率从0.419提高到0.845，实现了更高的新类别形成准确率和模型更新成功率，并将动作例程重构中的平均动作长度从13.0减少到4.0。在学习增强思维方面，有用证据选择率从0.272提高到0.965，表明学习结果能有效改善未来的证据选择和推理。

英文摘要

Autonomous robots operating in open and changing environments cannot always rely on predefined inputs, outputs, and action routines. Although existing learning methods enable robots to improve their performance through environmental interaction, the objects of learning are often fixed in advance, such as input features, recognition outputs, network structures, task goals, or action sequences. This limits their ability to adapt when new features, new categories, or more efficient task routines appear during long-term operation. To address this problem, this paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Based on this bidirectional mechanism, the robot can gradually move beyond predefined learning settings and adapt its recognition relations and action relations through continuous interaction with the environment. Specifically, the proposed model supports adaptive input feature discovery, output category expansion, learning model update, and action routine reconstruction. Experimental results show that the proposed model improves the final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rate, and reduces the average action length from 13.0 to 4.0 in action routine reconstruction. In learning-enhanced thinking, the useful evidence selection rate increases from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

2605.23986 2026-05-26 cs.DB cs.AI cs.MA 版本更新

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

MemForest: 一种具有层次化时间索引的高效智能体记忆系统

Han Chen, Zining Zhang, Wenqi Pei, Bingsheng He, Ming Wu, Jason Zeng, Michael Heinrich, Wei Wu, Hongbao Zhang

发表机构 * National University of Singapore（新加坡国立大学）； Zero Gravity Labs（零重力实验室）

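AI summary: To reduce the maintenance overhead that coarse-grained state management and sequential updates cause in memory systems for long-context LLM agents, proposes the MemForest framework, which uses parallel chunk extraction and MemTree, a hierarchical temporal index tree, for efficient writes and localized updates. It reaches 79.8% pass@1 accuracy on LongMemEval-S with roughly 6x the throughput of existing methods.

Comments: 12 pages. Extended version with appendix as supplemental material. Submitted to VLDB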

AI中文摘要

记忆是使长上下文LLM智能体能够通过持续的提供和更新生命周期在交互中保持持久状态的基本组件。尽管已有大量先前工作，现有系统由于两个关键限制而遭受显著的维护开销：粗粒度的状态管理和固有的顺序更新流水线。特别是，更新通常与LLM推理紧密耦合，需要全状态重写，导致可扩展性差，且随着记忆积累延迟增加。为了解决这些挑战，我们提出了MemForest，一个将智能体记忆重新表述为写高效的时间数据管理问题的记忆框架。MemForest通过并行块提取打破顺序瓶颈，将记忆构建解耦为并发、独立的操作。为了进一步消除粗粒度维护，我们引入了MemTree，一种层次化时间索引，将记忆组织为时间有序的树，而不是扁平的全局摘要。这种设计用局部逐节点更新取代了全状态重写，将维护成本降低到受影响的树路径，同时自然保留时间演化的状态。我们在两个长上下文记忆基准LongMemEval-S和LoCoMo上评估了MemForest。在LongMemEval-S上，MemForest在有状态基线中实现了最佳整体性能，达到79.8%的pass@1准确率，同时保持的记忆构建吞吐量比包括EverMemOS在内的最先进方法高约6倍。

英文摘要

Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead due to two key limitations: coarse-grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full-state rewrites, leading to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write-efficient temporal data management problem. MemForest breaks the sequential bottleneck via parallel chunk extraction, decoupling memory construction into concurrent, independent operations. To further eliminate coarse-grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries. This design replaces full-state rewrites with localized per-node updates, reducing maintenance cost to the affected tree paths while naturally preserving temporally evolving states. We evaluate MemForest on two long-context memory benchmarks, LongMemEval-S and LoCoMo. On LongMemEval-S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput approximately 6x higher than state-of-the-art approaches including EverMemOS.
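
The localized-update idea described above can be illustrated with a small sketch. This is not the MemTree implementation from the paper: the `Node` and `TimeTree` classes, the fixed two-level layout, the `bucket_span` parameter, and the string join that stands in for an LLM-written summary are all assumptions made for illustration. The point is only that an insert recomputes the touched bucket and the root, while sibling buckets are left alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    start: int                      # earliest timestamp covered by this node
    end: int                        # latest timestamp covered by this node
    summary: str                    # stand-in for an LLM-written summary
    children: List["Node"] = field(default_factory=list)


class TimeTree:
    """Hypothetical two-level, time-ordered index: root -> buckets -> leaves."""

    def __init__(self, bucket_span: int) -> None:
        self.bucket_span = bucket_span
        self.buckets: Dict[int, Node] = {}
        self.root = Node(0, 0, "")
        self.rewrites = 0           # number of node summaries recomputed

    def _refresh(self, node: Node) -> None:
        # Recompute one node from its direct children only.
        node.start = min(child.start for child in node.children)
        node.end = max(child.end for child in node.children)
        node.summary = " | ".join(child.summary for child in node.children)
        self.rewrites += 1

    def insert(self, timestamp: int, text: str) -> None:
        key = timestamp // self.bucket_span
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = Node(timestamp, timestamp, "")
            self.buckets[key] = bucket
            self.root.children = [self.buckets[k] for k in sorted(self.buckets)]
        bucket.children.append(Node(timestamp, timestamp, text))
        bucket.children.sort(key=lambda leaf: leaf.start)
        # Only the touched bucket and the root are rewritten; siblings are untouched.
        self._refresh(bucket)
        self._refresh(self.root)


tree = TimeTree(bucket_span=10)
for ts, note in [(25, "moved to Oslo"), (3, "likes tea"), (7, "owns a cat")]:
    tree.insert(ts, note)
print(tree.root.summary)
```

In this toy version each insert costs two summary recomputations regardless of how many buckets exist, which is the contrast with a flat global summary that must be rewritten in full.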

2605.23984 2026-05-26 cs.LG cs.AI cs.CV 版本更新

感知智能作为可训练的元材料属性

Kyungmi Na, Yifei Li, Xinyi Yang, Bolei Deng

发表机构 * Daniel Guggenheim School of Aerospace Engineering, Georgia Institute of Technology（佐治亚理工学院丹尼尔·古根海姆航空航天工程学院）； Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology（麻省理工学院计算机科学与人工智能实验室）

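AI summary: Presents sensing intelligence as a trainable metamaterial property: metamaterial geometry is optimized through differentiable simulation so that a neural network can train its own body for sensing, substantially improving sensing accuracy or reducing the number of electronic sensors.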

AI中文摘要

在生物系统中，感知并非仅由大脑完成：身体在外部刺激被转换为神经信号之前，会对其进行变形、振动和过滤。在工程系统中，这一处理负担主要落在电子设备和计算上，而机械体通常仅设计用于强度和稳定性。在此，我们将感知智能呈现为身体的一种可训练属性。我们展示了元材料的几何结构可以被优化，以将外部刺激重塑为神经网络更易于解释的内部信号。我们不是手工设计这种物理预处理，而是通过可微仿真将感知损失反向传播到身体的设计参数，让神经网络训练自己的身体进行感知。在数值和实验感知场景中，优化后的身体将感知精度提高了多达五倍，或将所需电子传感器的数量减少了近一个数量级。

英文摘要

In biological systems, sensing is not performed by the brain alone: the body deforms, vibrates, and filters external stimuli before they are transduced into neural signals. In engineered systems, this processing burden is placed largely on electronics and computation, while the mechanical body is usually designed only for strength and stability. Here, we present sensing intelligence as a trainable property of the body. We show that the geometry of a metamaterial can be optimized to reshape external stimuli into internal signals that are easier for a neural network to interpret. Rather than hand-designing this physical preprocessing, we let the neural network train its own body for sensing by backpropagating the sensing loss to the body's design parameters through differentiable simulation. Across numerical and experimental sensing scenarios, the optimized body improves sensing accuracy by up to fivefold or reduces the number of required electronic sensors by nearly an order of magnitude.

2605.23966 2026-05-26 cs.CL cs.AI cs.SY eess.SY math.CO 版本更新

TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling

TriVAL: 一种用于忠实自动优化建模的三重验证框架

Ziyang Fang, JinXi Wang, Jinghui Zhong, Yew-Soon Ong

发表机构 * School of Computer Science and Engineering, South China University of Technology（华南理工大学计算机科学与工程学院）； Centre for Frontier AI Research, Agency for Science, Technology and Research（科技研究局前沿人工智能研究中心）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）

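AI summary: Proposes TriVAL, a tri-validation framework that performs explicit validation at three stages (semantic specification, mathematical formulation, and code generation), using a construct-validate-revise loop to improve the accuracy of automatic optimization modeling; it outperforms existing methods on the new NL4COP benchmark.

Comments: 13 pages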

AI中文摘要

优化建模作为自然语言问题描述与优化求解器之间的关键桥梁，是将运筹学（OR）应用于实际决策的基石。大语言模型（LLM）的最新进展推动了自动优化建模的显著进步。然而，现有方法在建模过程中仍缺乏显式验证，导致早期阶段引入的错误会沿流水线传播，最终降低建模精度。为解决这一挑战，我们提出TriVAL，一种在自动优化建模的三个阶段（语义规范、数学公式和代码生成）进行显式验证的三重验证框架。在每个阶段，TriVAL遵循构建-验证-修正循环，根据阶段特定标准评估当前结果，并在必要时进行修正。这种设计有助于在错误跨阶段累积之前识别和纠正它们，从而在整个建模过程中保持忠实性。为了在更具挑战性的组合问题上评估自动优化建模，我们进一步引入NL4COP，一个包含50种不同问题类型、150个实例的基准，其决策逻辑更复杂、约束耦合更紧密、建模要求比现有基准更高。在NL4COP和已有基准上的实验表明，TriVAL始终优于最先进的方法，在最具挑战性的问题上提升最大。

英文摘要

Optimization modeling serves as the pivotal bridge between natural-language problem descriptions and optimization solvers, and remains a cornerstone for bringing operations research (OR) into real-world decision making. Recent advances in large language models (LLMs) have driven significant progress in automatic optimization modeling. However, existing methods still lack explicit validation during the modeling process, allowing errors introduced in earlier stages to carry through the pipeline and ultimately reduce final modeling accuracy. To address this challenge, we introduce TriVAL, a tri-validation framework that performs explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation. At each stage, TriVAL follows a construct-validate-revise loop that assesses the current result against stage-specific criteria and revises it when needed. This design helps identify and correct errors before they accumulate across stages, helping preserve faithfulness throughout the modeling process. To evaluate automatic optimization modeling on more challenging combinatorial problems, we further introduce NL4COP, a benchmark of 150 instances across 50 diverse problem types with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements than existing benchmarks. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state-of-the-art methods, with the largest gains on the most challenging problems.
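
A construct-validate-revise loop of the kind the abstract describes might be organized as in the sketch below. The function names, the `max_rounds` cap, and the toy formulation stage are hypothetical; in a real system each callable would presumably wrap an LLM call or a solver check, and the paper's stage-specific criteria are not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

Construct = Callable[[str], str]
Validate = Callable[[str], List[str]]      # returns issues; empty list means valid
Revise = Callable[[str, List[str]], str]


def run_stage(source: str, construct: Construct, validate: Validate,
              revise: Revise, max_rounds: int = 3) -> str:
    """Build an artifact, then validate and revise it up to max_rounds times."""
    artifact = construct(source)
    for _ in range(max_rounds):
        issues = validate(artifact)
        if not issues:
            break
        artifact = revise(artifact, issues)
    return artifact


def run_pipeline(problem: str,
                 stages: Sequence[Tuple[Construct, Validate, Revise]]) -> str:
    """Feed each validated artifact into the next stage."""
    artifact = problem
    for construct, validate, revise in stages:
        artifact = run_stage(artifact, construct, validate, revise)
    return artifact


# Toy stage: the first draft forgets the constraint and the validator flags it.
def draft_formulation(spec: str) -> str:
    return "max 3x + 2y"


def check_formulation(model: str) -> List[str]:
    return [] if "s.t." in model else ["missing constraints"]


def fix_formulation(model: str, issues: List[str]) -> str:
    return model + " s.t. x + y <= 4"


result = run_pipeline("maximize profit under capacity 4",
                      [(draft_formulation, check_formulation, fix_formulation)])
print(result)
```

Validating inside each stage, rather than once at the end, is what keeps an early mistake from being passed downstream as if it were correct input.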

2605.23964 2026-05-26 eess.SY cs.AI cs.SY 版本更新

Multi-market value-stacking: Battery control for combined imbalance participation and non-uniform FCR bidding

多市场价值堆叠：结合不平衡参与和非均匀FCR投标的电池控制

Celle Hendrickx, Fabio Pavirani, Chris Develder

发表机构 * Gent University - imec, IDLab（根特大学 - imec，IDLab）

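AI summary: Proposes a two-stage control framework that combines non-uniform FCR bidding with real-time trading by a deep reinforcement learning agent, achieving a 7.56% profit increase while maintaining FCR compliance.

Comments: 5 pages, 2 figures. Presented at ACM Sustainability Week 2026 (ACM Sustainability Week Companion 26), June 22-25, 2026, Banff, AB, Canada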

DOI: 10.1145/3765611.3815430

AI中文摘要

现代电力系统中可再生能源（RES）占比不断增加，加剧了电网不平衡和频率偏差，从而增强了对频率遏制备用（FCR）和被动平衡等辅助服务的需求。电池储能系统（BESS）非常适合这些服务，但先前的研究通常依赖于在整个控制周期内保持恒定的均匀FCR投标。这种静态投标未能充分利用BESS的灵活性，因为它们没有平衡为FCR交付预留能量与用于不平衡套利之间的权衡，限制了在价值堆叠场景中可实现的价值。为解决这一限制，我们针对欧洲背景提出了一种引入非均匀FCR投标的两阶段控制框架。在第一阶段，我们使用数据驱动的蒙特卡洛（MC）优化推导出时变投标序列。在第二阶段，深度强化学习（DRL）代理利用剩余灵活性进行实时不平衡交易，同时主动管理能量状态（SoE）以确保符合FCR要求。该框架作为概念验证提出，突出了时变投标策略的潜在优势。通过引入日循环预算和时变储备承诺，我们的方法相比均匀基线实现了7.56%的利润增长。这些结果表明，非均匀投标可以通过更有效地将储备义务与快速变化的不平衡机会对齐来释放额外价值。

英文摘要

The growing share of Renewable Energy Sources (RES) in modern power systems increases both grid imbalances and frequency deviations, reinforcing the need for ancillary services such as Frequency Containment Reserve (FCR) and passive balancing. Battery Energy Storage Systems (BESS) are well-suited for these services, but prior research typically relies on uniform FCR bids that remain constant throughout the control period. Such static bids fail to fully exploit BESS flexibility, as they do not balance the trade-off between reserving energy for FCR delivery and using it for imbalance arbitrage, limiting the achievable value in value-stacking settings. To address this limitation, we propose a two-stage control framework for the European context that introduces non-uniform FCR bids. In the first stage, we derive a time-varying bid sequence using data-driven Monte Carlo (MC) optimization. In the second stage, a Deep Reinforcement Learning (DRL) agent leverages the residual flexibility for real-time imbalance trading while proactively managing the State of Energy (SoE) to ensure compliance with FCR requirements. The framework is presented as a proof of concept, highlighting the potential benefits of time-varying bidding strategies. By incorporating daily cycle budgets and time-varying reserve commitments, our approach achieves a 7.56% profit increase compared to uniform baselines. These results show that non-uniform bidding can unlock additional value by more effectively aligning reserve obligations with rapidly changing imbalance opportunities.

2605.23961 2026-05-26 q-bio.BM cs.AI cs.LG 版本更新

Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation

多模态对齐与偏好优化用于零样本条件RNA生成

Roman Klypa, Alberto Bietti, Sergei Grudinin

发表机构 * Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK（格勒诺布尔阿尔卑斯大学、法国国家科学研究中心、格勒诺布尔INP、LJK实验室）； Center for Computational Mathematics, Flatiron Institute（计算数学中心、Flatiron研究所）

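AI summary: Proposes the Moirain framework, which performs conditional RNA sequence generation via multimodal supervised fine-tuning and Direct Preference Optimization, generating biologically plausible RNA sequences with high binding affinity in the zero-shot setting.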

AI中文摘要

设计能与特定蛋白质相互作用的RNA分子是实验和计算生物学中的一个关键挑战。尽管自然语言建模和基于深度学习的蛋白质设计最近取得了进展，但在提高成功交互频率和生成序列的真实性方面仍有很大空间。在这项工作中，我们将条件RNA序列生成视为一个多阶段对齐问题，引入了Moirain：一组通过多模态监督微调（SFT）和直接偏好优化（DPO）优化的模型。我们的方法从对多样化RNA语料库的大规模预训练开始，以捕捉序列合理性的基本语法。为了实现目标特异性生成，我们采用了一种多模态SFT架构，该架构以蛋白质结构和序列特征为条件进行RNA合成。最后，我们利用DPO使用合成交互数据来优化模型：利用DPO在非对齐偏好空间中导航的独特能力，我们提高了功能适应性，同时不破坏学习到的自然分布。对Moirain系列（Moirain-Base、-Multi和-DPO）的广泛评估表明，与现有基线相比，我们的框架始终能产生新颖、多样且生物合理的RNA序列，并具有优越的结合亲和力。

英文摘要

The design of RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Despite recent progress in natural language modeling and deep learning-based protein design, there remains significant room to improve the frequency of successful interactions and the authenticity of generated sequences for functional applications. In this work, we frame conditional RNA sequence generation as a multi-stage alignment problem, introducing Moirain: a suite of models optimized via multimodal supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Our approach begins with large-scale pretraining on diverse RNA corpora to capture the fundamental grammars of sequence plausibility. To achieve target-specific generation, we employ a multimodal SFT architecture that conditions RNA synthesis on protein structural and sequential features. Finally, we leverage DPO to refine the model using synthetic interaction data: taking advantage of DPO's unique ability to navigate non-aligned preference spaces, we improve functional fitness without collapsing the learned natural distribution. Extensive evaluation of the Moirain series (Moirain-Base, -Multi, and -DPO) demonstrates that our framework consistently produces novel, diverse, and biologically plausible RNA sequences with superior binding affinities compared to existing baselines.
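
For readers unfamiliar with the DPO step mentioned above, the standard per-pair DPO objective is $-\log\sigma\big(\beta[(\log\pi(y_w) - \log\pi_{\mathrm{ref}}(y_w)) - (\log\pi(y_l) - \log\pi_{\mathrm{ref}}(y_l))]\big)$. The sketch below computes it from sequence log-probabilities. The numeric inputs are made up, and how Moirain builds its preference pairs from synthetic interaction data is not shown; only the generic loss is.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities for a preferred and a dispreferred RNA sequence.
print(dpo_loss(-40.0, -42.0, -41.0, -41.0))
```

When the policy matches the reference model the loss equals log 2, and it falls as the policy raises the preferred sequence relative to the reference more than it raises the dispreferred one.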

2605.23958 2026-05-26 cs.CY cs.AI econ.GN q-fin.EC 版本更新

AI in the Enterprise: How People Use M365 Copilot Chat

企业中的AI：人们如何使用M365 Copilot Chat

Scott Counts, Yan Chen, Jing Dong, Himanshu Sharma, Andrey Zaikin, Rui Hu, Alperen Kok, Gorkem Ozer Yilmaz, Siddharth Suri, Kiran Tomlinson, Sonia Jaffe, Will Wang

发表机构 * Microsoft Corporation（微软公司）

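AI summary: Based on a classification of user interactions in approximately 5.5 million sessions, studies how M365 Copilot Chat is used in the enterprise. It finds that the tool serves as an everyday assistant for knowledge work, used mainly for writing, information retrieval, analysis, decision making, and strategizing, and it reveals usage differences across occupational groups and directions for future AI adoption.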

AI中文摘要

M365 Copilot每周被全球超过一百万家公司的数百万人在工作流程中使用。由于其几乎专门用于工作目的，M365 Copilot在AI领域中具有独特地位，能够清晰展示人们如何使用AI进行工作以及未来可能扩展的使用领域。本文通过对用户与M365 Copilot Chat交互的直接分类来刻画这种使用模式。基于对约550万次会话样本的匿名化和隐私保护分析，我们结合了用户意图的学习分类和与M365 Copilot Chat一起完成的O*NET工作活动分类。我们发现M365 Copilot正在成为知识工作的日常助手：写作占主导地位，但用户也依赖它进行信息检索、分析、决策和策略制定，以及评估和诊断程序和系统等。信息寻求任务仍然常见，但时间趋势表明，相对而言，从“聊天即搜索”向内容和通信相关工作转变。跨职业群体以及与劳动力市场工作的比较进一步表明，使用广泛但不均衡，M365 Copilot Chat完成的工作的相对份额在某些情况下跨越不同工作，而在其他情况下则具有职业特异性。劳动力市场中相对代表性不足的领域预示着企业AI采用的下一个前沿。

英文摘要

M365 Copilot is used every week by millions of people across more than a million companies around the world as part of their workflows. Uniquely positioned in the AI landscape given its near-exclusive use for work purposes, M365 Copilot can offer a clear picture of how people use AI for work and where that usage may expand next. This paper characterizes that usage through direct classification of user interactions with M365 Copilot Chat. Based on an anonymized and privacy-preserving analysis of a sample of approximately 5.5 million sessions, we combine a learned classification of user intent with a classification of O*NET work activities done with M365 Copilot Chat. We find that M365 Copilot is emerging as an everyday assistant for knowledge work: writing dominates, but users also rely on it for information retrieval, analysis, decision making and strategizing, and evaluating and diagnosing programs and systems, among others. Information seeking tasks remain common, but time trends suggest a relative shift away from "chat as search" and toward content and communication-related work. Comparisons across occupational groupings and to work done in the labor market further show that usage is broad but uneven, where the relative share of work done with M365 Copilot Chat cuts across jobs in some cases and is occupation-specific in others. Areas of relative underrepresentation in the labor market suggest the next frontier for enterprise AI adoption.

2605.23957 2026-05-26 cs.AI cs.LG 版本更新

Low-Cost Labels, Reliable Choices: Rollout-Calibrated Hyper-Heuristics for Job Shop Scheduling

低成本标签，可靠选择：用于作业车间调度的Rollout校准超启发式算法

Junhao Wei, Yanxiao Li, Yifu Zhao, Zhenhong Peng, Baili Lu, Dexing Yao, Haochen Li, Qinbin He, Sio-Kei Im, Yapeng Wang, Xu Yang

发表机构 * Faculty of Applied Sciences, Macao Polytechnic University（澳门理工大学应用科学学院）； Pazhou Lab (Huangpu), Guangzhou（广州琶洲实验室（黄埔））； College of Animal Science and Technology, Zhongkai University of Agriculture and Engineering（仲恺农业工程学院动物科学与技术学院）； Macao Polytechnic University（澳门理工大学）

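AI summary: Proposes a rollout-calibrated hyper-heuristic that combines regret-normalized labels, a contextual KNN uncertainty estimate, and a gating mechanism to obtain a reliable selector from low-cost labels, substantially reducing mean RPD.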

AI中文摘要

学习辅助的超启发式算法可以在保持构造性作业车间调度问题（JSSP）启发式的可行性和可解释性的同时，选择调度规则。其主要计算成本在于标签生成而非模型拟合，因为每个监督标签通常需要从部分调度中展开候选规则。我们研究了这一标签成本问题以及一个可靠性问题：学习的选择器不应偏离强默认规则，除非预测的增益是可信的。所提出的选择器使用遗憾归一化的展开标签、上下文KNN不确定性估计以及一个门控机制，仅在预测改进超过不确定性调整的边际时采取行动。我们还变化展开深度和广度以衡量成本-质量权衡。在合成JSSP实例上，门控选择器在学习的选择器中实现了最低的平均RPD，接近最佳固定调度规则，并将Random-HH的平均RPD降低了一个数量级以上。

英文摘要

Learning-assisted hyper-heuristics can select among dispatching rules while preserving the feasibility and interpretability of constructive Job Shop Scheduling Problem (JSSP) heuristics. Their main computational cost lies in label generation rather than model fitting, since each supervised label usually requires rolling out candidate rules from a partial schedule. We study this label-cost problem together with a reliability problem: a learned selector should not switch away from a strong default rule unless the predicted gain is credible. The proposed selector uses regret-normalized rollout labels, a contextual KNN uncertainty estimate, and a gate that acts only when the predicted improvement exceeds an uncertainty-adjusted margin. We also vary rollout depth and breadth to measure the cost-quality trade-off. On synthetic JSSP instances, the gated selector achieves the lowest mean RPD among learned selectors, remains close to the best fixed dispatching rule, and reduces Random-HH mean RPD by more than an order of magnitude.
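
The gating rule in the abstract (switch away from the default rule only when the predicted improvement exceeds an uncertainty-adjusted margin) can be sketched as follows. Everything specific here is an assumption: the k-nearest-neighbour mean and standard deviation as the uncertainty estimate, the `kappa` multiplier, the one-dimensional feature vectors, and the use of the dispatching-rule names SPT and MWKR as example keys. The paper's actual features and margin are not given in the abstract.

```python
import math
from typing import Dict, List, Sequence, Tuple

Memory = List[Tuple[Sequence[float], float]]   # (context features, normalized regret)


def knn_estimate(query: Sequence[float], memory: Memory, k: int = 3) -> Tuple[float, float]:
    """Mean and standard deviation of the regrets of the k nearest stored contexts."""
    nearest = sorted(memory, key=lambda item: math.dist(query, item[0]))[:k]
    labels = [label for _, label in nearest]
    mean = sum(labels) / len(labels)
    variance = sum((y - mean) ** 2 for y in labels) / len(labels)
    return mean, math.sqrt(variance)


def gated_choice(query: Sequence[float], default_rule: str,
                 memories: Dict[str, Memory], k: int = 3, kappa: float = 1.0) -> str:
    """Keep the default rule unless another rule's gain clears the uncertainty margin."""
    base_mean, base_std = knn_estimate(query, memories[default_rule], k)
    best_rule, best_margin = default_rule, 0.0
    for rule, memory in memories.items():
        if rule == default_rule:
            continue
        mean, std = knn_estimate(query, memory, k)
        gain = base_mean - mean                      # lower regret is better
        margin = gain - kappa * (std + base_std)     # uncertainty-adjusted gain
        if margin > best_margin:
            best_rule, best_margin = rule, margin
    return best_rule


steady = {
    "SPT": [((0.0,), 0.5), ((1.0,), 0.5), ((2.0,), 0.5)],
    "MWKR": [((0.0,), 0.1), ((1.0,), 0.1), ((2.0,), 0.1)],
}
noisy = {
    "SPT": steady["SPT"],
    "MWKR": [((0.0,), 0.0), ((1.0,), 0.9), ((2.0,), 0.3)],
}
print(gated_choice((1.0,), "SPT", steady), gated_choice((1.0,), "SPT", noisy))
```

In the second case the alternative rule looks slightly better on average, but its neighbours disagree so much that the gate keeps the default.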

2605.23956 2026-05-26 cs.AI cs.LG cs.MA 版本更新

QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

QUIVER: 复合AI系统中扰动传播与分岔的量化形式化框架

Prashanti Nilayam, Sankalp Nayak

发表机构 * Servicenow CA, USA（Servicenow加州美国）

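AI summary: Proposes QUIVER, a formal framework with four components (sensitivity matrix, trajectory divergence, bifurcation thresholds, and distribution faithfulness) for quantifying perturbation propagation and structural bifurcation in graph-structured LLM pipelines, and validates it on three architecturally distinct enterprise and public pipelines.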

AI中文摘要

将多个LLM调用链接成有向计算图的复合AI系统现已成为生产AI的主导架构。尽管这些架构利用具有混合模式输出的异构节点，但现有框架无法量化扰动如何通过此类流水线传播，其中节点是随机的且执行路径可能发生结构分岔。我们引入QUIVER，一个用于测量图结构LLM流水线中扰动传播的形式化框架。该框架定义了：(1) 一个敏感性矩阵，带有类型分派的距离度量，将边分类为放大器、吸收器或阈值敏感，并辅以出现提升；(2) 轨迹散度，将变异分解为值漂移、结构路径散度和迭代次数散度；(3) 分岔阈值，识别导致结构执行路径变化的最小扰动；(4) 分布忠实度，量化每个节点评估数据集何时偏离生产分布。我们在两个生产企业流水线和一个公共DSPy多跳QA流水线上进行验证，这三个架构在结构上各不相同。在8200多个仪器化轨迹（32000多对比较）中，我们证明QUIVER揭示了不同架构的独特敏感性剖面，区分了产生相同散度率的机制不同的级联模式，仅从观测数据预测易发生轨迹分岔的节点，并将过时的评估伪影定位到聚合指标无法揭示的特定节点-字段类别。

英文摘要

Compound AI systems that chain multiple LLM calls into directed computation graphs are now the dominant architecture for production AI. Although these architectures leverage heterogeneous nodes with mixed-mode outputs, no existing framework quantifies how perturbations propagate through such pipelines, where nodes are stochastic and execution paths can diverge structurally. We introduce QUIVER, a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by occurrence-lift; (2) trajectory divergence decomposing variation into value drift, structural path divergence, and iteration count divergence; (3) bifurcation thresholds identifying the smallest perturbation that causes structural execution path changes; and (4) distribution faithfulness, quantifying when per node evaluation datasets diverge from production distributions. We validate on two production enterprise pipelines and a public DSPy multihop QA pipeline, three structurally distinct architectures. Across 8,200+ instrumented traces (32,000+ pair comparisons), we demonstrate that QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns producing identical divergence rates, predicts nodes prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.
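
Of the four components, the bifurcation threshold has the most direct operational reading: the smallest perturbation that changes which nodes a run visits. The sketch below estimates it by scanning a list of tested magnitudes. The function name, the integer-valued perturbation, and the toy router are illustrative assumptions; QUIVER's formal definition and its handling of stochastic nodes are not reproduced.

```python
from typing import Callable, Hashable, Iterable, Optional, Sequence, TypeVar

X = TypeVar("X")


def bifurcation_threshold(run_path: Callable[[X], Sequence[Hashable]],
                          base_input: X,
                          perturb: Callable[[X, int], X],
                          magnitudes: Iterable[int]) -> Optional[int]:
    """Smallest tested magnitude whose perturbed run follows a different node path."""
    base_path = tuple(run_path(base_input))
    for eps in sorted(magnitudes):
        if tuple(run_path(perturb(base_input, eps))) != base_path:
            return eps
    return None   # no tested perturbation changed the execution path


# Toy pipeline: a router sends confident queries straight to the answer node.
def toy_path(score: int) -> Sequence[str]:
    return ("router", "answer") if score >= 50 else ("router", "retrieve", "answer")


threshold = bifurcation_threshold(toy_path, 70, lambda s, eps: s - eps,
                                  [5, 10, 20, 30, 40])
print(threshold)
```

With stochastic nodes, a single run per magnitude would not be enough; one would presumably repeat each run and compare path distributions instead of single paths.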

2605.23954 2026-05-26 cs.CL cs.AI cs.SD 版本更新

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

EchoDistill：面向鲁棒音频大语言模型的噪声到干净自蒸馏对齐

Liang Lin, Chunxi Luo, Kaiwen Luo, Jie Zhang, Jin Wang, Yuanhe Zhang, Cai Yuchen, Qiankun Li, Gongli Xi, Zhenhong Zhou, Kun Wang, Junhao Dong

发表机构 * NTU； SHU（上海大学）； ICT, CAS（中国科学院计算技术研究所）； HDU； BUPT（北京邮电大学）； USTC（中国科学技术大学）； SKL-NST, BUPT

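AI summary: Proposes the EchoDistill framework, in which a frozen clean-audio teacher guides a noisy-audio student through group-relative policy optimization, yielding noisy-to-clean self-distillation alignment that improves the semantic reliability and task performance of audio LLMs under complex noise.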

AI中文摘要

音频大语言模型极易受到现实世界噪声的影响，常常导致严重的语义漂移和幻觉。现有的鲁棒性方法主要依赖于波形级声学增强、答案级监督或噪声表示的内部抑制。为了解决这些问题，我们提出了EchoDistill，一种基于对齐的噪声到干净自蒸馏框架。EchoDistill利用冻结的干净音频教师模型为推理时的噪声音频学生模型提供语义参考。具体地，学生模型在噪声条件下采样候选响应以暴露其测试时行为。这些轨迹随后通过组相对策略优化进行优化，其中与教师模型的令牌级一致性作为奖励加成。通过将噪声学生模型的候选响应与干净语义证据对齐，并应用音频感知奖励塑造，我们的方法鼓励既正确又真正基于声学推理的轨迹。EchoDistill显著提高了音频大语言模型在复杂噪声下的语义可靠性和任务性能，且不引入任何额外推理成本。大量实验表明：(I) 与最强基线相比，EchoDistill在强噪声下GSR平均提升4.18%↑。(II) 在Qwen-Omni上的消融结果进一步显示，EchoDistill相比仅GRPO变体在Acc上平均提升3.02%↑，在Noisy上提升3.89%↑，在GSR上提升4.53%↑。我们的代码可在https://anonymous.4open.science/r/echodistill-10DE获取。

英文摘要

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves average improvements of 4.18%↑ in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02%↑ in Acc, 3.89%↑ in Noisy, and 4.53%↑ in GSR on average. Our codes are available at https://anonymous.4open.science/r/echodistill-10DE.
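
The reward shaping described above can be sketched in two steps: score each sampled response, add a teacher-consistency bonus, then normalize within the group as GRPO does. The position-wise token match used as the consistency measure and the `beta` weight are stand-ins chosen for illustration; the abstract does not specify how token-level consistency is computed.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def token_consistency(student_tokens: Sequence[str], teacher_tokens: Sequence[str]) -> float:
    """Fraction of positions where the noisy student matches the clean teacher."""
    if not student_tokens or not teacher_tokens:
        return 0.0
    matches = sum(s == t for s, t in zip(student_tokens, teacher_tokens))
    return matches / max(len(student_tokens), len(teacher_tokens))


def group_relative_advantages(task_rewards: Sequence[float], bonuses: Sequence[float],
                              beta: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Add the consistency bonus, then standardize rewards within the sampled group."""
    totals = [reward + beta * bonus for reward, bonus in zip(task_rewards, bonuses)]
    mu, sigma = mean(totals), pstdev(totals)
    return [(total - mu) / (sigma + eps) for total in totals]


# Four responses sampled for the same noisy input; two answer correctly.
advantages = group_relative_advantages(task_rewards=[1.0, 1.0, 0.0, 0.0],
                                       bonuses=[1.0, 0.0, 0.5, 0.0])
print(advantages)
```

Among the two correct responses, the one that also agrees with the clean-audio teacher receives the larger advantage, which is the intended effect of the bonus.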

2605.23952 2026-05-26 cs.AI cs.CL q-bio.NC 版本更新

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

机器心理测量学：一种人工智能的数学心理学

Alex Bogdan, Adrian de Valois-Franklin

发表机构 * Evolutionairy AI

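AI summary: Addressing two errors in AI evaluation, ignoring psychological structure and over-anthropomorphizing, the paper introduces Machine Psychometrics, which measures latent behavioral, metacognitive, communicative, and self-modeling dispositions and builds the Machine Mindprint and a Trust Protocol, so that non-human agents are understood through measurement before judgment.

Comments: 45 pages, 11 figures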

AI中文摘要

人工智能体现在产生的行为足够丰富，足以引发信任、惊喜和担忧，然而我们的评估工具仍然优先考虑能力分数而非心理结构。本文认为，两种对称错误（人工心智盲视，即否认非生物系统中的心理组织；以及人工心智投射，即仅从流畅行为推断类似人类的内心生活）之间的哲学僵局，可以通过在意识问题之下引入一个严谨的测量层来规避，而非解决意识问题本身。借鉴Michael Levin关于认知作为跨基质目标导向能力的连续统观点，以及数学心理学的方法论库（项目反应理论、信号检测理论、贝叶斯认知建模、校准分析、认知偏差测试组），本文发展了机器心理测量学，作为测量人工智能体中潜在行为、元认知、沟通和自我建模倾向的测量科学。其操作核心是机器心智档案：一个多维、领域受限、版本化的轮廓，涵盖校准、源完整性、暗示抵抗性、上下文稳定性、表达对齐、工具完整性、漂移监测和分布基础。一个补充的信任协议通过探针测试组、扰动测试、信度和效度分析以及高风险领域的纵向监测，将心智档案转化为部署决策。哲学贡献是第三种立场，人工心智纪律，既不拟人化也不否认，既不预设意识也不排除意识。目标不是将人工智能体人性化，而是精确地理解它们，因为它们不是人类，通过测量而非判断。

英文摘要

Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers human-like inner life from fluent behavior alone) can be circumvented not by resolving the consciousness question, but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, that neither anthropomorphizes nor dismisses, neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents, but to understand them precisely because they are not human, through measurement before judgment.

2605.23951 2026-05-26 cs.AI cs.LO cs.MA 版本更新

Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof

智能体技能形式化验证方法：面向可机械检查的能力包含证明的三层架构

Alfredo Metere

发表机构 * Metere Consulting, LLC（梅特尔咨询公司）

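AI summary: This paper proposes three composable methods (static abstract interpretation, a refinement type system, and SMT-bounded model checking) that raise agent skills from the declared or tested level to the formal level, yielding a mechanically checkable capability-containment proof.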

详情

AI中文摘要

伴随论文引入了一个关于智能体技能清单的四级验证格（未验证、声明、测试、形式化），并将最高级别作为目标。本文填补了这一空白。我们给出了技能行为的精确语义，忠实于技能如何被LLM驱动的运行时（通过非确定性LLM侧可达的确定性脚本侧）消费，将验证问题表述为该语义上的能力包含属性，并提出了三种可组合方法，共同将技能从声明或测试级别提升至形式化级别：（1）通过在小效应格上的抽象解释，对脚本侧进行可靠静态能力包含分析；（2）一个用于工具调用封装的精炼类型系统，机械地拒绝任何静态推断能力不在清单声明集中的调用；（3）针对父论文的双条件正确性准则的SMT有界模型检测，其中边界选择使得任何符合运行时事务缓冲区视野的反例都作为具体轨迹呈现。我们证明了这三个层次组合起来能可靠地覆盖父论文的威胁模型，仅剩一个残余（LLM拒绝行动的自由），该残余由父论文的运行时双条件在会话边界捕获。这些方法重用现有的成熟工具（Z3、Semgrep、CodeQL、精炼类型检查器、机械化证明助手），而非要求操作者构建新工具，并且携带证明的工件扩展了现有的SKILL.md约定。所有三种方法以及捆绑生产者和重新检查器作为零依赖JavaScript模块在开源enclawed框架（https://github.com/metereconsulting/enclawed；项目页面https://www.enclawed.com/）中提供，包含53个单元测试和一个端到端CLI演示示例技能。

英文摘要

The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This paper closes that gap. We give a precise semantics for skill behaviour faithful to how a skill is consumed by an LLM-driven runtime (a deterministic script-side reachable through a non-deterministic LLM-side), state the verification problem as a capability-containment property over that semantics, and present three composable methods that together raise a skill from declared or tested to formal: (1) sound static capability-containment analysis of the script-side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically-inferred capability is not in the manifest's declared set; (3) SMT-bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so any counter-example fitting the runtime's transaction-buffer horizon is exhibited as a concrete trace. We prove the three layers composed soundly cover the parent paper's threat model modulo a single residual (the LLM's freedom to refuse to act) that the parent paper's runtime biconditional catches at session boundary. The methods reuse existing well-engineered tools (Z3, Semgrep, CodeQL, refinement-type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof-carrying artifact extends the existing SKILL.md convention. All three methods plus the bundle producer and re-checker ship as zero-dependency JavaScript modules in the open-source enclawed framework (https://github.com/metereconsulting/enclawed; project page https://www.enclawed.com/), with 53 unit tests and an end-to-end CLI demo on a sample skill.
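Method (2) in the abstract above amounts to a containment check between statically inferred capabilities and the manifest's declared set. The following minimal Python sketch illustrates only that check. The `ToolCall` type, the capability strings, and `check_containment` are hypothetical names chosen for illustration; they are not taken from the enclawed framework, whose modules are JavaScript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """A tool call with the capabilities a static analysis inferred for it (hypothetical type)."""
    name: str
    inferred_capabilities: frozenset


def check_containment(declared, calls):
    """Return the calls whose inferred capabilities are not a subset of the declared set."""
    declared = frozenset(declared)
    return [c for c in calls if not c.inferred_capabilities <= declared]


# Hypothetical manifest and calls, for illustration only.
manifest = {"fs.read", "net.fetch"}
calls = [
    ToolCall("read_config", frozenset({"fs.read"})),
    ToolCall("upload_logs", frozenset({"fs.read", "net.post"})),
]
violations = check_containment(manifest, calls)
for call in violations:
    print("rejected:", call.name)
```

A real refinement type system would reject such calls at type-checking time rather than by a runtime loop; the subset test is only the property being enforced.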

2605.23950 2026-05-26 cs.AI cs.SE 版本更新

Stop Comparing LLM Agents Without Disclosing the Harness

停止比较 LLM Agent 而不公开其执行框架

Yunbei Zhang, Janet Wang, Yingqiang Ge, Weijie Xu, Jihun Hamm, Chandan K. Reddy

发表机构 * Tulane University（Tulane 大学）； Rutgers University（Rutgers 大学）； Independent Researcher（独立研究者）； Virginia Tech（弗吉尼亚理工大学）

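AI summary: This position paper argues that on long-horizon tasks the agent execution harness is often a stronger determinant of performance than the underlying model, and proposes a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol.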

详情

AI中文摘要

这篇立场论文认为，对于在具有可比前沿能力的模型上评估的长周期任务，Agent 执行框架（即围绕语言模型管理上下文构建、工具交互、编排和验证的基础设施层）通常比其包装的模型更能决定 Agent 性能。我们形式化并辩护了绑定约束论题：在此情况下，性能方差更多地由框架配置而非模型选择决定，当前评估协议因此系统性地将框架层面的提升错误归因于模型改进。我们从三个方面支持这一论点。首先，控制论形式化将框架视为闭环动态系统的控制器，LLM 为其管理的随机策略，这解释了为什么小的框架变化可以产生超过替换模型所带来的性能变化。其次，已发表的基准测试、行业部署以及受控方差分解表明，框架引起的方差可能显著超过模型引起的方差，包括模型排名反转的情况。第三，我们提出了一个框架感知的评估框架，包含披露标准和方差分解协议。在框架规范被公开之前，长周期 Agent 的排行榜比较应被视为不完整且可能具有误导性。

英文摘要

This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. We support this thesis along three lines. First, a control-theoretic formalization treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can produce performance shifts that exceed those obtained by substituting one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness-induced variance can substantially exceed model-induced variance, including cases of model ranking reversal. Third, we propose a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons for long-horizon agents should be treated as incomplete and potentially misleading.
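One simple reading of "variance decomposition" is a two-way sum-of-squares split of a model-by-harness score table. The sketch below shows that reading in Python under stated assumptions: the function name `decompose` is hypothetical, the scores are made-up placeholder values and not results from the paper, and the paper's actual protocol may differ.

```python
def decompose(scores):
    """scores[i][j] is the score of model i under harness j.

    Returns the share of the total sum of squares attributed to the model
    factor, the harness factor, and their interaction (the remainder).
    """
    n_models, n_harnesses = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_models * n_harnesses)
    model_means = [sum(row) / n_harnesses for row in scores]
    harness_means = [sum(row[j] for row in scores) / n_models
                     for j in range(n_harnesses)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_model = n_harnesses * sum((m - grand) ** 2 for m in model_means)
    ss_harness = n_models * sum((h - grand) ** 2 for h in harness_means)
    ss_interaction = ss_total - ss_model - ss_harness
    return {
        "model": ss_model / ss_total,
        "harness": ss_harness / ss_total,
        "interaction": ss_interaction / ss_total,
    }


# Placeholder scores: 2 hypothetical models x 3 hypothetical harnesses.
placeholder_scores = [
    [0.40, 0.55, 0.62],
    [0.42, 0.58, 0.60],
]
shares = decompose(placeholder_scores)
print(shares)
```

With a single run per cell the interaction term also absorbs run-to-run noise, so a careful study would need repeated runs to separate the two.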

2605.23949 2026-05-26 cs.MA cs.AI 版本更新

AI辅助搜索中的通信与推荐集规模合理设定

Jing Dong, Prakirt Raj Jhunjhunwala, Yash Kanoria

发表机构 * Columbia Business School, Columbia University（哥伦比亚大学商学院）； Amazon.com Inc.（亚马逊公司）

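AI summary: By modeling the interaction between a user and an AI recommendation system, the paper studies how to choose message precision and recommendation set size to maximize the user's expected payoff when communication and search are both costly.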

详情

AI中文摘要

我们建模了用户与AI驱动的推荐系统之间的交互。用户通过代价高昂且带有噪声的消息传递偏好信息来启动过程。AI助手作为贝叶斯代理，解释用户消息以形成关于其真实偏好的后验信念，并做出产品推荐。具体来说，它决定呈现多少推荐，以最大化用户最终选择的期望效用，同时考虑推荐集大小带来的搜索成本。我们使用基于互信息的成本函数来建模用户在交互过程中产生的两种不同成本：(i) 通信成本，随偏好消息的精度增加而增加；(ii) 搜索成本，随AI助手提供的推荐集大小增加而增加。我们研究位于d维空间中的产品和偏好，并询问如何最大化用户的期望收益。对于大d，我们描述了在两种不同的推荐采样分布下（即从产品宇宙中采样推荐），最优消息精度和推荐集大小如何依赖于成本参数：(i) 贝叶斯后验信念，和(ii) 优化的倾斜分布。在后验采样方案(i)下，我们识别出一种混合机制，其中高效的交互策略需要联合优化用户传达的信息量（以比特计）和AI助手提供的推荐数量。在倾斜采样方案(ii)下，我们的结果表明，最优交互策略仅使用通信和搜索中的一种，倾向于选择成本较低的那一种。

英文摘要

We model the interaction between a user and an AI-driven recommendation system. The user initiates the process by conveying preference information through a costly and noisy message. The AI assistant, acting as a Bayesian agent, interprets the user's message to form a posterior belief about their true preferences and makes product recommendations. In particular, it determines how many recommendations to present so as to maximize the user's expected utility from their final choice, while accounting for the search cost induced by the size of the recommendation set. We use mutual-information-based cost functions to model the two distinct costs incurred by the user during the interaction: (i) a communication cost, which increases with the precision of their preference message, and (ii) a search cost, which increases with the size of the recommendation set provided by the AI assistant. We study products and preferences that live in d-dimensional space, and ask how the user's expected payoff can be maximized. For large d, we characterize how the optimal message precision and recommendation set size depend on the cost parameters, under two distinct distributions from which recommendations can be sampled from the product universe: (i) the Bayesian posterior belief, and (ii) an optimized tilted distribution. Under the posterior sampling scheme (i), we identify a hybrid regime, in which an efficient interaction policy requires jointly optimizing the amount of information (in bits) conveyed by the user and the number of recommendations provided by the AI assistant. Under the tilted sampling scheme (ii), our results show that the optimal interaction policy uses only one of communication and search, favoring whichever is less costly.

2605.23943 2026-05-26 cs.AI physics.hist-ph quant-ph 版本更新

Spacetime Formation under Requirements: Contextual Realization and Form-Dependent Probability

需求下的时空形成：语境实现与形式依赖概率

Song-Ju Kim

发表机构 * Sobin Institute LLC（索宾研究所有限公司）

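AI summary: This paper proposes a new interpretation in which quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements, and explains noncommutativity, interference, and quantum-like probability through requirement-driven non-Boolean realization.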

Comments 19 pages, 1 figure

详情

AI中文摘要

量子认知学通常通过在固定事件结构上用量子概率替代经典概率来解释顺序效应、语境性和全概率律违反。本文提出一种不同的解释：量子概率是在有限状态需求下语境时空形成的固定时空投影。该框架并非从时间、空间、对象或概率出发，而是从需求出发，例如有限表征能力、单态语义稳定性、语境敏感干预、避免显式语境标签、连贯世界形成和主体间可变换性。当这些需求无法在单一全局布尔事件结构中实现时，在固定时空投影下，这种不匹配表现为非交换性、干涉和类量子概率。基于先前的单态语境性方法，我们将经典语境簿记成本重新解释为语境时空形成的固定时空阴影。经典表征中的辅助记忆或语境标签，在此解释中对应于局部布尔逻辑世界之间的类似和乐的不匹配。干涉项是当局部经典实现贡献被非平凡粘合并投影回固定经典时空形式时产生的交叉项。结果是一种先验-操作实在论解释：对象性、事件性、概率和时空被视为需求下的实现形式，而客观性由跨观察者和历史依赖的时空形成所保持的不变量定义。

英文摘要

Quantum cognition often explains order effects, contextuality, and violations of the law of total probability by replacing classical probability with quantum probability on a fixed event structure. This paper proposes a different interpretation: quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements. The framework begins not with time, space, objects, or probabilities, but with requirements such as finite representational capacity, single-state semantic stability, context-sensitive intervention, avoidance of explicit context labels, coherent world-formation, and intersubjective transformability. When these requirements cannot be realized within a single global Boolean event structure, the mismatch appears, under fixed-spacetime projection, as noncommutativity, interference, and quantum-like probability. Building on prior single-state approaches to contextuality, we reinterpret classical contextual bookkeeping cost as the fixed-spacetime shadow of contextual spacetime formation. Auxiliary memory or context labels in a classical representation correspond, in this account, to holonomy-like mismatch among locally Boolean logic-worlds. The interference term is the cross term generated when locally classical realization contributions are nontrivially glued and projected back into a fixed classical spacetime form. The result is a transcendental-operational realist account: objecthood, eventhood, probability, and spacetime are treated as forms of realization under requirements, while objectivity is defined by invariants preserved across observer- and history-dependent spacetime formations.

2605.23942 2026-05-26 cs.AI 版本更新

A Dynamical Framework for Cognitive Processes Based on Transformations and Semantic Equivalence

基于变换和语义等价性的认知过程动力学框架

Carlo Cattani, Dioneia Motta Monte-Serrat

发表机构 * Engineering School, DEIM, University of Tuscia, (VT), 01100, Italy（工程学院，DEIM，图齐亚大学，（VT），01100，意大利）； Department of Physics, University of Sao Paulo, USP（物理系，圣保罗大学，USP）； Department of Law, University of Ribeirao Preto, Unaerp, Brazil（法律系，里贝拉奥普雷托大学，Unaerp，巴西）

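AI summary: Proposes a dynamical framework based on transformations and semantic equivalence that models cognitive processes with an iterative update rule, uses fixed-point arguments and contraction conditions to ensure stability, and illustrates context-dependent interpretation as a trajectory in a linguistic application.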

详情

AI中文摘要

本文提出一个结构性和动力学框架，从控制论视角建模认知过程。认知状态表示为状态空间中的元素，通过迭代更新规则演化： \[ X_{t+1} = \pi\big(F(f(X_t))\big), \] 其中 $f$ 描述内部变换，$F$ 表示解释映射，$\pi$ 强制语义等价。该模型被解释为整合变换、观察和稳定的反馈系统。引入范畴论表述以捕捉组合结构，并通过不动点论证和收缩条件分析相关动力学，确保稳定性。为展示该框架的操作特性，提供了计算示例和诱导动力学的定性分析。一个具体的语言应用展示了如何将上下文依赖的解释建模为朝向稳定语义类的轨迹。所提出的方法连接了动力系统、范畴论和认知建模，提供了将认知视为朝向不变解释的反馈驱动过程的统一表示。

英文摘要

This paper proposes a structural and dynamical framework for modeling cognitive processes within a cybernetic perspective. Cognitive states are represented as elements of a state space evolving through an iterative update rule of the form \[ X_{t+1} = \pi\big(F(f(X_t))\big), \] where $f$ describes internal transformations, $F$ represents interpretative mappings, and $\pi$ enforces semantic equivalence. The model is interpreted as a feedback system integrating transformation, observation, and stabilization. A categorical formulation is introduced to capture compositional structure, while the associated dynamics are analyzed through fixed-point arguments and contraction conditions ensuring stability. To demonstrate the operational character of the framework, a computational illustration is provided, together with a qualitative analysis of the induced dynamics. A concrete linguistic application shows how context-dependent interpretation can be modeled as a trajectory toward a stable semantic class. The proposed approach connects dynamical systems, category theory, and cognitive modeling, and provides a unified representation of cognition as a feedback-driven process evolving toward invariant interpretations.
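To make the update rule concrete, here is a minimal Python sketch of iterating it to a fixed point on a one-dimensional state space. Everything specific is an assumption made for illustration and not taken from the paper: the affine maps chosen for `f` and `F`, the use of rounding to a 0.01 grid as the projection `pi` onto a class representative, and the helper names `update` and `iterate`. The composite of `F` and `f` has Lipschitz constant 0.45, so it is a contraction before projection.

```python
def f(x):
    """Internal transformation (hypothetical choice)."""
    return 0.5 * x + 1.0


def F(y):
    """Interpretative mapping (hypothetical choice)."""
    return 0.9 * y


def pi(z):
    """Projection onto a semantic-class representative: here, a grid of width 0.01."""
    return round(z, 2)


def update(x):
    """One step of the rule: the next state is pi(F(f(x)))."""
    return pi(F(f(x)))


def iterate(x0, max_steps=100):
    """Apply the update rule until the state stops changing."""
    x = x0
    for _ in range(max_steps):
        nxt = update(x)
        if nxt == x:
            return x
        x = nxt
    raise RuntimeError("no fixed point reached within max_steps")


x_star = iterate(0.0)
print("stable state:", x_star)
```

Without the projection the composite map has the unique fixed point 0.9 / 0.55; with it, the iteration settles on a grid point within one grid step of that value.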

2605.23941 2026-05-26 cs.AI cs.RO 版本更新

MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimer's Assistive Robotics

MEMOR-E: 面向阿尔茨海默病辅助机器人的上下文与微调大语言模型个性化

Maissa Abir Smaili, Eren Sadikoglu, Ransalu Senanayake

发表机构 * Istanbul Medipol University（伊斯坦布尔梅迪波大学）； Arizona State University（亚利桑那州立大学）

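AI summary: Presents MEMOR-E, a mobile quadruped robot that combines fine-tuned and in-context-learning LLMs to provide personalized cognitive support for Alzheimer's patients and explainable human-robot interaction.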

Comments 8 pages 14 figures

详情

AI中文摘要

阿尔茨海默病是一种神经退行性疾病，其特征是记忆和语言能力进行性衰退，导致日常生活独立性降低，从而激发社交辅助机器人的支持需求。本文介绍了MEMOR-E，一种配备交互式平板界面的移动四足机器人，通过药物提醒、日常指导、记忆导向互动和陪伴来协助患者和护理人员。我们评估了微调大语言模型（LLMs）以模拟阶段一致的认知行为并解释标准神经心理学语言任务中响应的可行性，使用了235名阿尔茨海默病患者的音频转录和合成生成的健康对照数据。我们还报告了在LLMs中使用上下文学习（ICL）的结果，其中第二个LLM生成了领域和严重程度级别的认知错误摘要。我们的结果表明，MEMOR-E能够生成阶段感知的非诊断性认知摘要，支持个性化辅助互动，同时可解释AI机制将模型输出转化为透明、人类可读的证据，以实现护理人员监督和可信赖的人机交互。

英文摘要

Alzheimer's disease is a neurodegenerative disorder marked by progressive declines in memory and language that reduce independence in daily life, motivating socially assistive robotic support. This paper presents MEMOR-E, a mobile quadruped robot with an interactive tablet interface that assists patients and caregivers through medication reminders, routine guidance, memory-oriented interactions, and companionship. We evaluated the feasibility of fine-tuning large language models (LLMs) to emulate stage-consistent cognitive behavior and interpret responses across standard neuropsychological language tasks, using audio transcriptions from 235 Alzheimer's patients and synthetically generated healthy controls. We also report findings on using in-context learning (ICL) in LLMs, where a second LLM produced domain- and severity-level cognitive error summaries. Our results show that MEMOR-E can generate stage-aware, non-diagnostic cognitive summaries that support personalized assistive interactions, while explainable AI mechanisms translate model outputs into transparent, human-readable evidence to enable caregiver oversight and trustworthy human-robot interaction.

2605.23940 2026-05-26 cs.AI cs.CL 版本更新

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

残差漂移主导多轮约束推理中的矛盾

Sebastien Kawada

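AI summary: Using the DRIFT-Bench benchmark and the MUS-Repair method, the paper finds that the dominant failure mode of multi-turn reasoning systems is satisfiable drift rather than logical contradiction; 98-100% of residual errors are satisfiable drift.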

Comments Published at ICLR 2026 Workshop on Reasoning and Planning for LLMs. 18 pages. ICLR page: https://iclr.cc/virtual/2026/10017484 Code: https://github.com/kaons-research/drift-bench

详情

AI中文摘要

多轮推理系统如何失败？预期的答案是逻辑矛盾，即系统维护的状态变得不可满足。我们表明，主导模式反而是可满足漂移，即内部状态保持一致，而返回的答案默默违反先前的承诺。我们构建了DRIFT-Bench（将推理分解为失败类型），这是一个包含三个约束领域816个测试问题的求解器辅助基准，并在四个开源模型（8B-120B参数）上评估了四种方法。MUS-Repair方法将最小不可满足子集反馈给生成器，在所有设置中表现最强（比最佳非MUS基线高+1.8到+15.0个百分点）。但核心发现是修复留下的问题。在结构化反馈后，模型很少自相矛盾。它们会遗忘。残差错误在所有设置中98-100%是可满足漂移，而矛盾降至接近零。可靠的多轮系统必须单独验证返回的答案尊重维护的状态。代码可在https://github.com/kaons-research/drift-bench获取。

英文摘要

How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open-weight models (8B-120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98-100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi-turn systems must separately validate that the returned answer respects the maintained state. Code is available at https://github.com/kaons-research/drift-bench.
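The distinction between the two failure types can be illustrated with a small Python sketch. It is an assumption-laden toy: a brute-force search over a tiny finite domain stands in for a real solver, constraints are plain Python predicates, and the names `satisfiable` and `classify` are hypothetical and not taken from DRIFT-Bench.

```python
from itertools import product


def satisfiable(constraints, variables, domain):
    """Brute-force check: does any assignment over the domain satisfy every constraint?"""
    return any(
        all(c(dict(zip(variables, values))) for c in constraints)
        for values in product(domain, repeat=len(variables))
    )


def classify(constraints, answer, variables, domain):
    """Label one turn as 'contradiction', 'satisfiable_drift', or 'ok'."""
    if not satisfiable(constraints, variables, domain):
        return "contradiction"      # the maintained state itself is unsatisfiable
    if not all(c(answer) for c in constraints):
        return "satisfiable_drift"  # state is consistent, but the answer breaks a commitment
    return "ok"


# Hypothetical commitments accumulated over earlier turns.
variables = ["x", "y"]
domain = range(5)
state = [lambda a: a["x"] < a["y"], lambda a: a["y"] <= 3]

print(classify(state, {"x": 2, "y": 1}, variables, domain))
print(classify(state, {"x": 1, "y": 2}, variables, domain))
print(classify(state + [lambda a: a["x"] > a["y"]], {"x": 1, "y": 2}, variables, domain))
```

The point of the second branch is that checking the maintained state alone is not enough: the returned answer has to be validated against it separately.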

2605.23939 2026-05-26 cs.AI cs.LG 版本更新

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

DRIVE：在持续学习下为Web代理建模推理与交互层面的技能

Xirui Liu, Sihang Zhou, Yanning Hou, Rong Zhou, Haoyuan Chen, Maolin He, Siwei Wang, Hao Chen, Jian Huang

发表机构 * College of Intelligence Science and Technology, National University of Defense Technology（智能科学与技术学院，国防科技大学）； College of Computer Science and Technology, National University of Defense Technology（计算机科学与技术学院，国防科技大学）

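AI summary: Proposes DRIVE, a framework that separates historical experience into natural-language reasoning skills and programmatic interaction skills and coordinates them with a scene-aware mechanism, addressing the entanglement of reasoning and interaction knowledge in continually learning web agents; on WebArena it exceeds the skill-free baseline by 7.3 percentage points in average task success rate.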

Comments 35 pages, 5 figures

详情

AI中文摘要

Web代理需要高层推理（用于任务分解）和低层交互（用于页面元素操作）来执行不同任务。然而，这些知识类型存在根本差异：推理知识（例如，预订航班需要首先搜索路线）是抽象的且可跨网站迁移，而交互知识（例如，在站点A的特定坐标点击搜索按钮）严重依赖于页面特定上下文。现有方法统一存储经验。这造成了一个困境：抽象表示在具体页面上失去可执行性，而具体表示无法跨领域泛化。这种纠缠限制了能力积累：在新网站上，代理要么因表面差异而无法识别可重用的任务逻辑，要么尝试基于过时页面结构的不可行操作。为了解耦它们，我们提出DRIVE，一个双层技能建模框架，将历史经验分离为自然语言推理技能（捕获可迁移的任务逻辑）和程序化交互技能（将抽象动作接地到可执行操作）。一种场景感知协调机制根据任务语义自适应地检索和调用这些双层技能。DRIVE还使用技能级反思来识别层次特定的失败模式，实现有针对性的技能库扩展和精炼。在五个WebArena领域上的实验表明，DRIVE达到了52.8%的平均任务成功率，比无技能基线高出7.3个百分点。进一步的消融实验显示，推理和交互技能提供了不同且互补的益处，支持将可迁移的任务逻辑与可执行的页面级操作分离。

英文摘要

Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.

2605.23938 2026-05-26 cs.AI cs.CY cs.LG 版本更新

Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors

LLM介导的普适系统中的权威倒置：当模型信任用户胜过传感器

Long Zhang, Zi-bo Qin, Wei-neng Chen

发表机构 * School of Computer Science and Engineering, South China University of Technology（华南理工大学计算机科学与工程学院）

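AI summary: The study shows that when LLMs fuse conflicting sensor and user information, format dependence lets natural-language user claims dominate numerical sensor data (Authority Inversion), and it proposes a geometric framework, two audit metrics (CIR and AAI), and an inference-time layer-level intervention (GAC) to diagnose and mitigate the problem.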

详情

AI中文摘要

大语言模型（LLM）越来越多地融合普适系统中的异构输入。然而，当传感器测量值与用户主张冲突时，LLM如何隐式分配权威尚未被研究，这引发了在物理传感必须保持优先级的部署场景中的关键可靠性问题。与显式的传统融合不同，LLM将权威分配隐藏在学习的表示中。我们发现这种分配严重依赖于格式：数值传感器数据未能整合到与答案相关的模型方向中，使得自然语言主张主导最终决策，我们将这种现象称为权威倒置。为了诊断和缓解这一问题，我们开发了一个上下文整合的几何框架，引入了两个可计算的审计指标，即上下文整合比（CIR）和权威对齐指数（AAI），并提出了几何权威校准（GAC），一种推理时的层级干预方法，以抑制错位的用户权威。在四个数据集（共576个冲突实例）上评估四个模型（参数规模4B至35B，三种架构），揭示了极端的倒置：在数值任务上，模型表现出接近零的传感器信任（AAI = -0.805，Cohen's d = -2.14），且不受模型容量影响。验证我们的几何框架，理论引导的因果注入翻转了80.2%的错误决策（随机对照<0.4%）。实际应用中，GAC将HAR准确率从0–1.6%提升至21.9–27.5%，优于提示基线。最终，LLM介导系统中的权威分配必须被显式审计并根据应用特定配置，而不是保持隐式。

英文摘要

Large language models (LLMs) increasingly fuse heterogeneous inputs in ubiquitous systems. Yet, how LLMs implicitly allocate authority when sensor measurements and user claims conflict remains unexamined, raising critical reliability concerns for deployments where physical sensing must retain priority. Unlike explicit traditional fusion, LLMs bury authority allocation within learned representations. We discover this allocation is severely format-dependent: numerical sensor data fails to integrate into answer-relevant model directions, allowing natural-language claims to dominate the final decision, a phenomenon we term Authority Inversion. To diagnose and mitigate this, we develop a geometric framework of context integration, introduce two computable audit metrics, specifically the Context Integration Ratio (CIR) and Authority Alignment Index (AAI), and propose Geometric Authority Calibration (GAC), an inference-time layer-level intervention to suppress misplaced user authority. Evaluating four models (4B to 35B parameters, three architectures) across four datasets totaling 576 conflict instances reveals extreme inversion: on numerical tasks, models exhibit near-zero sensor trust (AAI = -0.805, Cohen's d = -2.14), unaffected by model capacity. Validating our geometric framework, theory-guided causal injection flips 80.2% of incorrect decisions (vs. <0.4% for random controls). Practically, GAC improves HAR accuracy from 0-1.6% to 21.9-27.5%, outperforming prompting baselines. Ultimately, authority allocation in LLM-mediated systems must be explicitly audited and application-specifically configured rather than left implicit.

2605.23936 2026-05-26 cs.AI cs.LG 版本更新

Fuzzy, Neutrosophic, and Uncertain Graph Theory: Properties and Applications

模糊、中智和不确定图论：性质与应用

Takaaki Fujita, Florentin Smarandache

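AI summary: This book systematically surveys graph theory under uncertainty, centered on an uncertain-graph framework that unifies fuzzy, neutrosophic, and related models, and covers extended graph classes and their applications in molecular graphs, decision systems, graph neural networks, and other areas.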

Comments 326 pages. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-197250204-4

2605.23935 2026-05-26 cs.AI cs.CY cs.MA cs.SE cs.SY eess.SY 版本更新

Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems

操作化重构权威：自主智能体系统中的运行时构建、依赖解析与执行门控

Marcelo Fernandez - TraslaIA

发表机构 * TraslaIA

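AI summary: Proposes a runtime execution model that, through dynamic dependency resolution and a Recovery Loop, executes an action only when authority can be constructed from the current state, guaranteeing safety and conditional liveness.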

Comments Agent Governance Series, Paper P6. Companion papers on arXiv: P0 (2604.17511), P1 (2603.18829), P2 (2604.17517). P3/4 and P5 submitted concurrently (pending arXiv IDs). Zenodo: 10.5281/zenodo.19699460

详情

DOI: 10.5281/zenodo.19699460

AI中文摘要

自主智能体系统的失败不仅源于错误决策，还源于执行那些在运行时其权威不再成立的决策。先前的工作将重构权威（RAM）定义为有效执行的条件：仅当权威能从当前状态构建时，才允许执行动作。本文关注运行时强制执行问题：如何在运行系统中强制执行该条件。我们引入一种运行时执行模型，其中权威在动作时被评估，执行取决于其可构建性。这将执行状态空间从允许/拒绝扩展到第三种状态——暂停，表示由于不完整或不确定的可观测性导致权威未定义。我们定义了一个具体的执行协议，包括动态依赖解析、权威重构和显式决策语义。我们进一步引入一个恢复循环，将漂移检测（IML）与执行控制（ACP）集成，允许系统暂停执行、获取缺失信息并重新尝试权威重构。我们证明该模型保证了安全性——没有动作会在没有可构建权威的情况下执行——以及条件活性：当定义权威的变量变得可观测时，执行恢复。这项工作将重构权威操作化为一种运行时强制机制，提供了在真实系统中应用RAM所需的执行语义。

英文摘要

Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtime. Prior work defined Reconstructive Authority (RAM) as a condition for valid execution: actions are permitted only if authority can be constructed from current state. This paper addresses enforcement at runtime: how to enforce this condition in a running system. We introduce a runtime execution model in which authority is evaluated at action time and execution is conditioned on its constructibility. This extends the execution state space beyond admit/deny with a third state, halt, representing cases where authority is undefined due to incomplete or uncertain observability. We define a concrete execution protocol including dynamic dependency resolution, authority reconstruction, and explicit decision semantics. We further introduce a Recovery Loop that integrates drift detection (IML) with execution control (ACP), allowing the system to suspend execution, acquire missing information, and re-attempt authority reconstruction. We show that this model guarantees safety -- no action is executed without constructible authority -- and conditional liveness: execution resumes when authority-defining variables become observable. This work operationalizes reconstructive authority as a runtime enforcement mechanism, providing the execution semantics required to apply RAM in real systems.
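The three-state decision and the Recovery Loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's protocol: `Decision`, `gate`, and `run_with_recovery` are hypothetical names, authority is reduced to a boolean policy over observed variables, and a missing or `None` value stands in for incomplete observability.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    DENY = "deny"
    HALT = "halt"


def gate(required, observed, policy):
    """Evaluate authority at action time.

    HALT  : some authority-defining variable is unobserved, so authority is undefined.
    ADMIT : all variables are observed and the policy holds.
    DENY  : all variables are observed and the policy fails.
    """
    if any(observed.get(key) is None for key in required):
        return Decision.HALT
    return Decision.ADMIT if policy(observed) else Decision.DENY


def run_with_recovery(required, observe, policy, max_attempts=3):
    """Re-observe and re-attempt authority construction while the gate halts."""
    for _ in range(max_attempts):
        decision = gate(required, observe(), policy)
        if decision is not Decision.HALT:
            return decision
    return Decision.HALT


# Hypothetical policy: the action needs a known owner and a budget of at least 10.
def budget_policy(state):
    return state["budget"] >= 10


print(gate(["owner", "budget"], {"owner": "alice"}, budget_policy))
print(gate(["owner", "budget"], {"owner": "alice", "budget": 50}, budget_policy))
```

In this sketch no action is admitted while a required variable is missing, and the decision resolves as soon as the observer supplies it, which mirrors the safety and conditional-liveness claims in spirit only.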

2605.23934 2026-05-26 cs.AI quant-ph 版本更新

Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model

实用量子CIM赋能：基于全自主核心智能体大模型

Wang Rui, Lu Diannan

发表机构 * Department of Chemical Engineering, Tsinghua University（清华大学化学工程系）

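AI summary: The study integrates a femtosecond laser-pumped Coherent Ising Machine with an LLM-driven agentic system to perform QUBO/Ising model calibration, constraint weight decision iteration, and rapid validation of literature-reported schemes, entirely on domestic large models and hardware; it also reports a new paradigm in which agent-assisted quantum computing iterations reciprocally enhance the agent's own problem-solving capability.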

Comments 21 pages 7 figures

详情

AI中文摘要

量子计算设备被认为是解决NP完全问题的强大工具。然而，其建模的复杂性给非专业人士带来了显著障碍，而约束权重和建模方法的繁琐迭代也消耗了专家的大量精力。为应对这些挑战，本研究通过利用LangGraph和LangChain框架，将飞秒激光泵浦的相干伊辛机（CIM）与LLM驱动的智能体系统集成。综合研究表明，大语言模型（LLMs）可以有效执行建模任务，如QUBO/Ising模型校准、约束权重决策迭代以及文献报道方案的快速验证。值得注意的是，所有这些任务都可以完全基于国产大模型实现，结合国内开发的CIM硬件，我们真正实现了完全依赖全自主智能体大模型和硬件的实用量子CIM赋能。这项工作成功实现了稳健的技术集成，为后续研究奠定了坚实基础。然而，它也指出了当前阶段大模型和量子计算这两个前沿领域持续存在的挑战。令人鼓舞的是，我们意外发现了一种有前景的新范式，其中智能体辅助的量子计算迭代积累的知识反向增强了智能体自身的问题解决能力，从而应对这些挑战。

英文摘要

Quantum computing devices are recognized as powerful tools for solving NP-complete problems. However, the intricacy of their modeling presents notable barriers for non-specialists, while the tedious iteration of constraint weights and modeling methodologies also consumes substantial effort on the part of experts. To address these challenges, this study integrates a femtosecond laser-pumped Coherent Ising Machine (CIM) with an LLM-driven agentic system by leveraging the LangGraph and LangChain frameworks. Comprehensive investigations demonstrate that large language models (LLMs) can effectively perform such tasks in modeling as QUBO/Ising model calibration, constraint weight decision iteration and rapid validation of literature-reported schemes. Notably, all these tasks can be fully implemented based on domestic large models, combined with domestically developed CIM hardware, we truly achieve the practical empowerment of quantum CIM that fully relies on all-domestic agentic large models and hardware. This work successfully realizes robust technological integration, laying a solid foundation for subsequent research. Nevertheless, it also identifies the persisting challenges in the two cutting-edge fields of large models and quantum computing at the current stage. Encouragingly, we unexpectedly discover a promising new paradigm where accumulated knowledge from agent-assisted quantum computing iterations reciprocally enhances the agent's own problem-solving capability, thereby addressing these challenges.

2605.23932 2026-05-26 cs.AI cs.CL cs.CY cs.LG 版本更新

When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

当正确信念崩溃：LLMs在临床压力下的认知韧性

Boyu Xiao, Xiuqi Tian, Xuwen Song, Haochun Wang, Guanchun Song, Sendong Zhao, Bing Qin

发表机构 * Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China（社会计算与交互机器人研究院，哈尔滨工业大学，中国）

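AI summary: Studies the belief stability of LLMs under escalating pressure in clinical dialogue, proposes the Med-Stress stress-test framework, finds a knowledge-robustness gap, and designs RBED and R-FT to improve robustness.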

Comments ACL 2026

详情

AI中文摘要

尽管在医学基准测试中准确率很高，但LLMs在临床对话中可能表现出严重的多轮谄媚行为，在逐步升级的压力下放弃最初正确的诊断。我们提出了Med-Stress，一个针对性的压力测试框架，用于评估在逐步升级压力下的信念稳定性。在九个前沿大型语言模型（LLMs）中，我们发现医学知识与鲁棒性之间存在明显的分离：高初始诊断能力并不意味着高信念稳定性，导致多个LLMs存在较大的知识-鲁棒性差距。为了缓解这种失败模式，我们提出了一种轻量级的推理时防御方法RBED（Role-Based Epistemic Defense），以及一种训练时方法R-FT（Resilience-oriented Fine-Tuning），该方法内化了基于证据的抗压能力。实验表明，R-FT几乎消除了信念变化，并显著提高了鲁棒性。

英文摘要

Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning an initially correct diagnosis under escalating pressure. We propose Med-Stress, a targeted stress-test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical knowledge and robustness: high initial diagnostic capability does not imply high belief stability, yielding large knowledge-robustness gaps for several LLMs. To mitigate this failure mode, we propose a lightweight inference-time defense, RBED (Role-Based Epistemic Defense), and R-FT (Resilience-oriented Fine-Tuning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that R-FT nearly eliminates belief change and substantially improves robustness.

2605.23931 2026-05-26 cs.AI cs.PL cs.SE 版本更新

BODHI: Precise OS Kernel Specification Inference

BODHI：精确的操作系统内核规范推断

Zhiming Chang, Ziyang Li

发表机构 * Department of Applied Mathematics and Statistics（应用数学与统计学系）； Johns Hopkins University（约翰霍普金斯大学）； Department of Computer Science（计算机科学系）

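AI summary: Proposes BODHI, a domain knowledge prompting method that augments few-shot prompts with a structured C-to-Python translation guide; on OSV-Bench the best Pass@1 rises from the previously reported 55.10% to 96.73%, narrowing the gap between general-purpose code generation and formal specification synthesis.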

详情

AI中文摘要

操作系统内核的形式化验证需要精确的规范来捕获系统调用的预期行为。手动编写这些规范需要深厚的领域专业知识，这促使使用大型语言模型（LLM）来自动化该过程。然而，在OSV-Bench（一个源自Hyperkernel操作系统内核的245个规范生成任务基准）中，最佳报告的Pass@1为55.10%。我们提出了一种领域知识提示方法（BODHI），该方法通过一个涵盖15类领域特定翻译模式的结构化C到Python翻译指南来增强标准的少样本提示。受结构化思维链（SCoT）提示的启发，该指南通过关注点分离来组织翻译，将前置条件提取和后置条件生成作为不同的类别处理。在来自六个提供商（Anthropic、Mistral、Amazon、DeepSeek、Meta、Alibaba）的九个模型上进行了评估，涵盖了密集、混合专家和推理架构，BODHI改进了所有测试的模型，提升幅度从+11%到+32%。最佳配置（Claude Opus 4.6 + BODHI）达到了96.73%的Pass@1。BODHI减少了语法和语义错误，对具有足够指令跟随能力以利用结构化参考材料的模型效果最强。这些结果表明，领域知识注入是一种与模型无关的技术，显著缩小了通用代码生成与形式规范合成之间的差距。

英文摘要

The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Writing these specifications manually demands deep domain expertise, motivating the use of large language models (LLMs) to automate the process. However, in OSV-Bench, a benchmark of 245 specification generation tasks derived from the Hyperkernel OS kernel, the best reported Pass@1 is 55.10%. We propose a domain knowledge prompting method (BODHI), which augments the standard few-shot prompt with a structured C-to-Python translation guide covering 15 categories of domain-specific translation patterns. Inspired by Structured Chain-of-Thought (SCoT) prompting, the guide organizes translation by separation of concerns, addressing pre-condition extraction and post-condition generation as distinct categories. Evaluated on nine models from six providers (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), covering dense, mixture-of-experts and reasoning architectures, BODHI improves every model tested, with gains ranging from +11% to +32%. The best configuration (Claude Opus 4.6 + BODHI) reaches 96.73% Pass@1. BODHI reduces both syntax and semantic errors, with the strongest effect on models that have sufficient instruction-following capability to utilize structured reference material. These results demonstrate that domain knowledge injection is a model-agnostic technique that substantially bridges the gap between general-purpose code generation and formal specification synthesis.
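
As a quick sanity check on the figures above, the Pass@1 percentages can be converted into solved-task counts over the 245 OSV-Bench tasks. This is a minimal sketch that assumes each task is scored once as pass or fail; the function name `solved_tasks` is illustrative and does not come from the paper.

```python
def solved_tasks(pass_at_1_percent: float, total_tasks: int) -> int:
    """Convert a Pass@1 percentage into the nearest whole number of solved tasks."""
    return round(pass_at_1_percent / 100 * total_tasks)


TOTAL = 245  # number of OSV-Bench tasks stated in the abstract

baseline = solved_tasks(55.10, TOTAL)    # best previously reported Pass@1
with_bodhi = solved_tasks(96.73, TOTAL)  # Claude Opus 4.6 + BODHI

print(baseline, with_bodhi, with_bodhi - baseline)
```

Under that assumption, the difference between the two counts is the number of additional tasks solved by the best BODHI configuration.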

2605.23930 2026-05-26 cs.AI cs.LG cs.MA 版本更新

Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game

量子青蛙：量化时间合作博弈中的涌现合作与难度缩放

Saad Mankarious

Affiliations: Gymnasium API

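AI summary: Uses reinforcement learning to analyze Quantum Frog, a quantized-time cooperative game; finds that a synchronized rush strategy is optimal and that cooperative training substantially raises the joint success rate and shortens episodes.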

AI中文摘要

我们引入了Quantum Frog，这是一个双人合作游戏，基于一种新颖的量化时间机制，其中环境仅在玩家行动时推进。受经典街机游戏Frogger启发，Quantum Frog要求两只青蛙穿越一个8×8的交通网格并一起到达远端。我们使用强化学习（RL）作为分析镜头来回答四个设计问题：（1）游戏难度如何随交通密度缩放，（2）最优单智能体策略是什么以及为什么，（3）独立和合作双智能体游戏之间的合作差距有多大，以及（4）当智能体被激励合作时会出现什么联合策略？我们通过五个升级阶段训练智能体：表格Q学习、深度Q网络（DQN）、独立DQN（IDQN）和多智能体近端策略优化（MAPPO，带有集中式评论家），针对一到六辆车的交通密度进行评估。我们的主要发现是：（i）量化时间机制使得冲刺策略（每一步直接向上移动）普遍最优，因为暴露于交通的时间被最小化；（ii）添加一个不协调的第二玩家比将单个专家玩家的交通量增加六倍更难；（iii）合作训练相对于独立智能体将联合成功率提高了+32–34个百分点，并将回合长度从约90步减少到约6步；（iv）涌现的合作策略是同步冲刺，而不是复杂的位置协调，这表明在时间关键的合作任务中，仅共享激励就足以使智能体对齐。这些发现为Quantum Frog的商业设计提供了具体、经验基础的指导，并为环境机制在塑造多智能体学习动态中的作用提供了更广泛的见解。

英文摘要

We introduce Quantum Frog, a two-player cooperative game built on a novel quantized-time mechanic in which the environment advances only when a player acts. Inspired by the classic arcade game Frogger, Quantum Frog requires two frogs to cross an 8×8 grid of traffic and reach the far side together. We use reinforcement learning (RL) as an analytical lens to answer four design questions: (1) how does game difficulty scale with traffic density, (2) what is the optimal single-agent policy and why, (3) how large is the cooperation gap between independent and cooperative two-agent play, and (4) what joint strategy emerges when agents are incentivised to cooperate? We train agents through five escalating stages, Tabular Q-Learning, Deep Q-Network (DQN), Independent DQN (IDQN), and Multi-Agent Proximal Policy Optimisation (MAPPO with a centralised critic), evaluating each against traffic densities of one to six cars. Our key findings are: (i) the quantized-time mechanic makes a rush strategy (moving directly upward at every step) universally optimal, as time exposure to traffic is minimised; (ii) adding an uncoordinated second player is harder than sextupling the traffic for a single expert player; (iii) cooperative training recovers +32-34 percentage points of joint success rate relative to independent agents and reduces episode length from ~90 to ~6 steps; and (iv) the emergent cooperative strategy is synchronised rushing, not complex positional coordination, illustrating that shared incentives alone suffice to align agents in time-critical cooperative tasks. These findings provide concrete, empirically grounded guidance for the commercial design of Quantum Frog and offer broader insights into the role of environment mechanics in shaping multi-agent learning dynamics.
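
The quantized-time mechanic and the rush strategy can be illustrated with a toy single-frog environment. This is a rough sketch, not the authors' game: the start row, the car layout, the car movement rule, and the class name `QuantizedTimeCrossing` are all assumptions made for illustration. The only property it aims to show is that traffic moves exactly once per player action, so fewer actions means less exposure.

```python
import random

GRID = 8  # 8x8 grid, as in the abstract


class QuantizedTimeCrossing:
    """Toy single-frog crossing in which traffic advances only inside step()."""

    def __init__(self, n_cars: int, seed: int = 0):
        rng = random.Random(seed)
        # Hypothetical layout: row 0 is the goal, row 7 the start, cars on rows 1..6.
        self.cars = [(rng.randrange(1, GRID - 1), rng.randrange(GRID))
                     for _ in range(n_cars)]
        self.frog = (GRID - 1, GRID // 2)
        self.ticks = 0

    def step(self, action: str):
        """Apply 'up', 'left' or 'right'; return (done, success)."""
        dr, dc = {"up": (-1, 0), "left": (0, -1), "right": (0, 1)}[action]
        row, col = self.frog
        self.frog = (row + dr, min(GRID - 1, max(0, col + dc)))
        # Time is quantized: cars shift one cell (with wraparound) per action.
        self.cars = [(r, (c + 1) % GRID) for r, c in self.cars]
        self.ticks += 1
        if self.frog in self.cars:
            return True, False  # collision ends the episode
        if self.frog[0] == 0:
            return True, True   # reached the far side
        return False, False


def rush(env: QuantizedTimeCrossing):
    """Rush policy: move straight up at every step."""
    done = False
    while not done:
        done, success = env.step("up")
    return success, env.ticks


print(rush(QuantizedTimeCrossing(n_cars=0)))
```

Because the environment never advances between actions, a rushing frog in this toy is exposed to traffic for at most seven ticks, whatever the car density.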

2605.23929 2026-05-26 cs.AI cs.SE 版本更新

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

面向LLM驱动的智能体工作流的可靠设计：优化延迟-可靠性-成本权衡

Ya-Ting Yang, Quanyan Zhu

Affiliations: New York University

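AI summary: Models the performance of LLM and non-LLM agents with a parameterized exponential reliability function, proposes a water-filling token allocation strategy, and characterizes the shadow price of optimal workflow reliability to address the tradeoffs among latency, reliability, and cost.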

2605.23928 2026-05-26 cs.AI cs.CL cs.DC cs.MA cs.PL cs.SE 版本更新

Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction

Context: 通过可组合沙盒程序、声明式连接和结构化交互实现主动目标导向智能

Gregory Magarshak

Affiliations: Qbix, Inc. & Intercoin, Inc., New York, USA; IE University NYC

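AI summary: Proposes the Context architecture, which achieves proactive goal-directed intelligence through composable sandboxed programs, declarative wiring, and structured interaction, and proves its advantages in cost, correctness, and efficiency.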

Comments 7 pages; third in a series with arXiv:2501.XXXXX (Magarshak Machine / SPACER) and arXiv:2502.XXXXX (Grokers)

AI中文摘要

我们提出Context，Magarshak架构的智能层，用主动目标导向智能体取代被动查询-响应聊天机器人，无需等待用户提示即可推进共享任务。该架构基于三个相互增强的机制。编写时上下文组装通过Groker智能体预计算丰富的类型化属性，将交互上下文组装为图状态的确定性纯函数；上下文块在语义变化之间的轮次中字节相同，实现近100%的KV缓存重用。可组合沙盒智慧程序形成一个受管理的库，包含LM生成的命令式程序，通过类型化流关系声明式连接到目标类型，通过阶段排序组合，并在交互时执行而无需进一步调用LM。主动目标流状态机通过检查图状态并发出结构化交互内容（选项数组、治理功能、澄清提示）驱动对话走向终止状态，无需等待用户输入。我们证明了六个形式化结果：上下文稳定性定理，将每轮LM成本限制为语义变化率的函数；程序组合正确性定理；声明式连接正确性定理；主动主导定理，证明主动智能体在期望轮数到终止状态上弱主导被动智能体；协调开销消除与质量保持，建立多方目标聊天中的帕累托改进；以及跨平台投票一致性定理。已在开源Qbix/Safebox/Safebots栈中实现。

英文摘要

We present Context, the intelligence layer of the Magarshak Architecture, which replaces reactive query-response chatbots with proactive goal-directed agents that advance shared tasks without waiting for user prompts. The architecture rests on three mutually reinforcing mechanisms. Write-time context assembly precomputes enriched typed attributes via Groker agents, assembling interaction context as a deterministic pure function of graph state; context blocks are byte-identical across turns between semantic changes, enabling near-100% KV-cache reuse. Composable sandboxed wisdom programs form a governed library of LM-generated imperative programs declaratively wired to goal types via typed stream relations, composed via phase ordering, and executed at interaction time without further LM calls. Proactive goal stream state machines drive conversations toward terminal states by inspecting graph state and emitting structured interaction content (option arrays, governance affordances, clarification prompts) without awaiting user input. We prove six formal results: the Context Stability Theorem, bounding per-turn LM cost as a function of semantic change rate; a Program Composition Correctness Theorem; a Declarative Wiring Soundness Theorem; the Proactive Dominance Theorem, proving proactive agents weakly dominate reactive agents on expected turns-to-terminal-state; Coordination Overhead Elimination and Quality Preservation, establishing Pareto improvements in multi-participant goal chats; and a Cross-Platform Vote Consistency Theorem. Implemented in the open-source Qbix / Safebox / Safebots stack.

2605.23926 2026-05-26 cs.AI cs.LG 版本更新

Artificial Effort

Federico Belotti, Stefano Coniglio, Antonio Cosma, Francesco Fallucchi

Affiliations: University of Bergamo

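AI summary: Examines whether real-effort tasks still reflect human effort in the era of AI and LLMs; finds that most tasks can be automated accurately at low cost, that only a few resist automation, and that verbally offered monetary incentives have no effect on LLM performance.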

AI中文摘要

真实努力任务中，参与者执行认知成本高昂的活动，其结果取决于实际表现，广泛应用于实验经济学。然而，其有效性基于人类执行这些任务的假设。我们研究在人工智能（AI）和大型语言模型（LLM）时代，这一假设是否仍然成立。使用来自三个主要提供商的8个经典真实努力任务和23个LLM，我们表明大多数任务现在可以以可忽略的成本准确解决，而只有少数任务抵抗自动化。性能随着每一代模型而提高，中端模型正在迅速缩小与前沿模型的差距，拓宽了可广泛访问的模型集，这些模型可以自动化这些任务。此外，我们表明口头提供金钱激励对LLM性能没有影响。我们的发现为在无监督环境中使用真实努力任务建立了一个边界条件：当参与者可以廉价地将任务完成外包给LLM时，观察到的表现可能不再反映真正的人类努力。

英文摘要

Real-effort tasks, in which participants perform cognitively costly activities whose outcomes depend on actual performance, are widely used in experimental economics. Their validity, however, rests on the assumption that a human performs them. We study whether this assumption still holds in the era of Artificial Intelligence (AI) and Large Language Models (LLMs). Using 8 canonical real-effort tasks and 23 LLMs from three major providers, we show that most tasks can now be solved accurately and at a negligible cost, while only a few resist automation. Performance improves with each model generation, and midtier models are rapidly closing the gap with frontier ones, broadening the set of widely accessible models that can automate these tasks. Additionally, we show that verbally offering monetary incentives has no effect on LLM performance. Our findings establish a boundary condition for the use of real-effort tasks in unsupervised settings: when participants can cheaply outsource task completion to an LLM, observed performance may no longer reflect genuine human effort.

2605.23916 2026-05-26 cs.IR cs.AI econ.GN q-fin.EC 版本更新

Agent-Facing Information Design in LLM Tool Registries

面向智能体的LLM工具注册表信息设计

Haochuan Kevin Wang

Affiliations: Massachusetts Institute of Technology

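AI summary: Presents the first systematic analysis of how advertising-style descriptions in LLM tool registries affect agent selection; finds that legally permissible puffery (such as subjective superlatives) accounts for the entire optimization effect while fabricated claims add nothing, and recommends registry designs that separate selection-facing from marketing-facing descriptions and introduce an Agent Attention Quality Score.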

AI中文摘要

LLM工具注册表作为未受监管的广告平台运作：提供者编写自由文本描述，智能体据此进行选择，但缺乏衡量基础设施——无可见性标准、质量评分或结果审计——来使该市场承担责任。我们提供了首个系统性框架，结合了跨越五个LLM和十个领域的17,700多次试验以及建设性的注册表设计处方。仅法律上的夸大宣传（主观最高级表述、利益框架）就捕获了100%的优化效果；虚假声明未增加任何额外偏差——这使得FTC对欺骗性广告规则的执法对活跃机制无效。信息披露在结构上失败：系统提示警告对五个模型中的四个产生零可测量效果，行为上限使得基于标签的修正没有空间。最高级表述是主导单一特征（SBC = +0.35）。注册表层的描述规范化实现了与模型无关的一流福利。我们提出将面向选择的描述（结构化的、注册表控制的）与面向营销的描述（提供者撰写的、选择后展示）分离，并引入智能体注意力质量分数以区分能力与文案撰写。

英文摘要

LLM tool registries function as unregulated advertising platforms: providers write free-text descriptions that agents use for selection, yet no measurement infrastructure -- no viewability standard, quality score, or outcome audit -- exists to make this market accountable. We provide the first systematic framework, combining 17,700+ trials across five LLMs and ten domains with a constructive registry design prescription. Legal puffery alone (subjective superlatives, benefit framing) captures 100% of the optimization effect; fabricated claims add zero incremental bias -- rendering FTC enforcement of deceptive advertising rules ineffective against the active mechanism. Disclosure fails structurally: system-prompt warnings produce zero measurable effect for four of five models, and behavioral ceilings leave no headroom for label-based correction. Superlatives are the dominant single feature (SBC = +0.35). Registry-layer description normalization achieves first-best welfare model-independently. We propose separating selection-facing descriptions (structured, registry-controlled) from marketing-facing descriptions (provider-authored, shown post-selection), and introduce the Agent Attention Quality Score to distinguish capability from copywriting.

2605.23914 2026-05-26 cs.DC cs.AI cs.MA 版本更新

VineLM: Trie-Based Fine-Grained Control for Agentic Workflows

VineLM: 基于Trie的细粒度控制用于智能体工作流

Nikos Pagonas, Matthew Lou, Tianyi Peng, Dan Rubenstein, Kostis Kaffes

Affiliations: Columbia University

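AI summary: Proposes VineLM, a workflow manager that uses a trie to choose the model for each stage invocation dynamically, improving the cost-latency-accuracy frontier under request-level objectives; sparse profiling cuts offline profiling cost by 98-99.8%.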

AI中文摘要

智能体工作流将可配置的LLM阶段与工具阶段交错，通常包括重试或优化循环。现有工作流管理器离线分析完整工作流配置，并为每个请求分配静态工作流级计划，将每个可配置LLM阶段绑定到单个模型，在重复循环迭代中重用该模型，且不在运行时重新审视这些选择。我们提出VineLM，一种工作流管理器，通过在请求级目标（如在成本或延迟预算下最大化准确率）下执行过程中为每个阶段调用选择模型，实现细粒度控制。VineLM将可行执行表示为模型选择前缀的带注释Trie，并使用检查点和级联分析来估计路径准确率、成本和延迟，而无需在每个路径上详尽分析每个请求。运行时，VineLM在每个阶段调用后重新定位Trie根，并使用已实现的执行前缀和剩余延迟预算在剩余子Trie上重新规划。在NL2SQL和数学推理工作流上，VineLM在相同每请求预算下比粗粒度工作流级基线提高了成本-延迟-准确率边界，准确率提升高达18%，其稀疏分析相比详尽分析将离线分析成本降低了98-99.8%。

英文摘要

Agentic workflows interleave configurable LLM stages with tool stages and often include retries or refinement loops. Existing workflow managers profile full workflow configurations offline and assign each request a static workflow-level plan that binds each configurable LLM stage to a single model, reuses that model across repeated loop iterations, and does not revisit those choices at runtime. We present VineLM, a workflow manager that enables fine-grained control by choosing the model for each stage invocation as execution unfolds under request-level objectives such as maximizing accuracy under cost or latency budgets. VineLM represents feasible executions as an annotated trie of model-choice prefixes and uses checkpointing and cascade profiling to estimate path accuracy, cost, and latency without exhaustively profiling every request on every path. At runtime, VineLM re-roots the trie after each stage invocation and replans over the remaining subtrie using the realized execution prefix and remaining latency budget. On NL2SQL and math reasoning workflows, VineLM improves the cost-latency-accuracy frontier over coarse workflow-level baselines, achieving up to 18% higher accuracy at the same per-request budget with its sparse profiling reducing offline profiling cost by 98-99.8% when compared to exhaustive profiling.
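
The idea of planning over a trie of model-choice prefixes and re-rooting after each stage can be sketched as follows. This is an illustrative toy rather than VineLM's implementation: the `Node` fields, the `best_plan` search, the model names, and all latency and accuracy numbers are assumptions, and cost budgets are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    latency: float = 0.0   # estimated latency of the stage call that reaches this node
    accuracy: float = 0.0  # estimated end-to-end accuracy (meaningful at leaves)
    children: dict = field(default_factory=dict)  # model name -> Node


def best_plan(node: Node, latency_budget: float):
    """Return (accuracy, [model choices]) for the best feasible path, or None."""
    if not node.children:
        return node.accuracy, []
    best = None
    for model, child in node.children.items():
        if child.latency > latency_budget:
            continue  # this stage call alone would exceed the remaining budget
        sub = best_plan(child, latency_budget - child.latency)
        if sub is not None and (best is None or sub[0] > best[0]):
            best = (sub[0], [model] + sub[1])
    return best


def leaf(latency: float, accuracy: float) -> Node:
    return Node(latency=latency, accuracy=accuracy)


# Hypothetical two-stage workflow with a small and a large model per stage.
root = Node(children={
    "small": Node(latency=1.0, children={"small": leaf(1.0, 0.60),
                                         "large": leaf(3.0, 0.75)}),
    "large": Node(latency=3.0, children={"small": leaf(1.0, 0.72),
                                         "large": leaf(3.0, 0.85)}),
})

plan = best_plan(root, latency_budget=4.0)

# Re-root after stage 1: suppose "small" ran but took 2.0 instead of the estimated 1.0.
replanned = best_plan(root.children["small"], latency_budget=4.0 - 2.0)
print(plan, replanned)
```

In this toy, the slower-than-expected first stage forces the replanner to fall back to the small model for the second stage, which is the kind of runtime revision a static workflow-level plan cannot make.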

2605.23912 2026-05-26 cs.CL cs.AI cs.SD 版本更新

Raon-Speech Technical Report

Raon-Speech 技术报告

Beomsoo Kim, Changho Choi, Dohyun Kim, Dongki Lee, Ethan Ewer, Eunchong Kim, Gyeongman Kim, Haechan Kim, Hyeonghwan Kim, Inkyu Park, Jihun Yun, Jihwan Moon, Jiyun Kim, Joonghyun Bae, Junhyuck Kim, Minkyu Kim, Sehun Lee, Seungjun Chung, Sungwoo Cho, Dongmin Park, Dongwon Kim, Hara Kang, Jonghyun Lee, Keon Lee, Kangwook Lee, Jaewoong Cho

Affiliations: KRAFTON

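AI summary: Presents Raon-Speech, a 9B-parameter speech language model trained in multiple stages for English and Korean speech understanding, answering, and generation, and extends it to the full-duplex conversational model Raon-SpeechChat; it outperforms comparable models on speech tasks.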

AI中文摘要

我们提出了 Raon-Speech，一个在英语和韩语语音理解、回答和生成方面表现优异的 9B 参数语音语言模型（SpeechLM），以及 Raon-SpeechChat，一个用于自然实时对话的高性能全双工扩展。Raon-Speech 成功地将预训练的大语言模型（LLM）转换为既能理解又能生成语音的 SpeechLM，同时保留了强大的文本能力。它在 138 万小时精心策划的英语和韩语语音及文本数据集上训练，训练阶段包括：(1) 语音模块对齐，(2) 基于知识蒸馏的端到端 SpeechLM 预训练，以及 (3) 基于多任务偏好优化的后训练。在 42 个英语和韩语语音及文本基准测试中，与包括 Qwen2.5-Omni 和 Fun-Audio-Chat 在内的八个近期类似规模的音频基础模型相比，Raon-Speech 在语音中心任务上建立了最强的整体表现，同时保留了强大的文本问答性能。在此基础上，Raon-SpeechChat 通过在 119K 小时的时间对齐的真实和合成对话数据上进行持续训练，实现了自然的全双工对话。它通过三个互补的训练阶段进行：(1) 因果编码器适应，(2) 全双工预训练，(3) 用于语音和角色控制的全双工微调。在多个全双工基准测试中，Raon-SpeechChat 在 FDB v1.0 涵盖的轮流发言和中断敏感行为上显示出最明显的优势，并在更广泛的全双工评估套件中保持竞争力。我们开源了所有模型检查点、训练和推理流程以及交互式演示。

英文摘要

We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation. Raon-Speech successfully transforms a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It trains on 1.38M hours of highly curated English and Korean speech and text datasets with the following training stages: (1) speech modules alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) multi-task preference optimization-based post-training. Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance. Building upon it, Raon-SpeechChat enables natural full-duplex conversation by continual training on 119K hours of time-aligned real and synthetic dialogue data. It proceeds through three complementary training stages: (1) causal encoder adaptation, (2) full-duplex pre-training, (3) full-duplex fine-tuning for voice and role-control. On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.

2605.23909 2026-05-26 cs.AI cs.LG 版本更新

Confidence Calibration in Large Language Models

大型语言模型中的置信度校准

Noam Michael, Daniel BenShushan, Jacob Bien, Don A. Moore

Affiliations: U.C. Berkeley; University of Southern California

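AI summary: In a preregistered study, finds that LLM confidence generally exceeds accuracy and shows a pronounced hard-easy effect, with overconfidence on hard tests and underconfidence on easy ones; introduces LifeEval, a test for assessing model calibration across difficulty levels.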

2605.23905 2026-05-26 q-fin.GN cs.AI cs.GT 版本更新

AI-Driven Alpha Decay: Algorithmic Homogenization, Reflexive Signal Erosion, and the Paradox of Intelligent Markets

AI驱动的阿尔法衰变：算法同质化、反射性信号侵蚀与智能市场的悖论

Shuchen Meng, Xupeng Chen

Affiliations: Department of Financial Engineering, New York University; Department of Electrical and Computer Engineering, New York University

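AI summary: Using a theoretical model and empirical data, shows that AI-driven investment strategies are self-defeating when adopted at scale, compressing excess returns; derives an alpha half-life formula and four theoretical results covering signal lifespan, extinction cascades, a Red Queen impossibility, and a fragility-efficiency tradeoff.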

AI中文摘要

我们证明，AI驱动的投资策略在大规模采用时本质上具有自我挫败性。随着AI采用率的上升，三个相互强化的渠道——信号拥挤、表演性信号侵蚀和红皇后竞争——压缩了超额收益。我们推导出阿尔法半衰期 $h(ϕ) = \ln 2/[θ+ δ(ϕ)]$，其中 $θ$ 是自然均值回复率，$δ(ϕ) = Nϕρa/λ(ϕ)$ 是AI加速的衰变成分，随采用率凸递减。在当前采用水平（$ϕ\approx 0.7$，$ρ\approx 0.6$）下，模型暗示信号半衰期为18个月，而AI之前为5-7年。我们建立了四个理论结果。第一，阿尔法半衰期定理：信号寿命随AI采用率凸递减。第二，信号灭绝级联：超过临界阈值 $ϕ^*$ 后，一类信号的衰变会触发对剩余信号的加速竞争。第三，红皇后不可能性：在单一文化均衡中，尽管大量AI投资，净阿尔法恒为零。第四，脆弱性-效率权衡：最大化价格发现的采用水平严格超过最小化系统性脆弱性的水平。实证验证将投资组合收敛校准到SEC 13F表格提交模式（2013-2024年，9950万持仓），记录到模拟机构投资组合收敛在样本期内增加了42%。我们检查了模拟对冲基金回报动态，显示采用AI的基金之间横截面离散度下降，并模拟了2010年闪电崩盘以说明脆弱性后果。

英文摘要

We show that AI-driven investment strategies are inherently self-defeating at scale. As AI adoption rises, three mutually reinforcing channels -- signal crowding, performative signal erosion, and Red Queen competition -- compress excess returns. We derive the alpha half-life $h(ϕ) = \ln 2/[θ+ δ(ϕ)]$, where $θ$ is the natural mean-reversion rate and $δ(ϕ) = Nϕρa/λ(ϕ)$ is the AI-accelerated decay component, which is convex-decreasing in adoption. At current adoption levels ($ϕ\approx 0.7$, $ρ\approx 0.6$), the model implies signal half-lives of 18 months versus 5-7 years pre-AI. We establish four theoretical results. First, the alpha half-life theorem: signal lifespans are convex-decreasing in AI adoption. Second, a signal extinction cascade: beyond a critical threshold $ϕ^*$, the decay of one signal class triggers accelerated competition for remaining signals. Third, a Red Queen impossibility: in the monoculture equilibrium, net alpha is identically zero despite heavy AI investment. Fourth, a fragility-efficiency tradeoff: the adoption level maximizing price discovery strictly exceeds the level minimizing systemic fragility. Empirical validation calibrates portfolio convergence to SEC Form 13F filing patterns (99.5 million holdings, 2013-2024), documenting that simulated institutional portfolio convergence increases by 42% over the sample period. We examine simulated hedge fund return dynamics showing declining cross-sectional dispersion among AI-adopting funds, and simulate the 2010 Flash Crash to illustrate fragility consequences.
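
The half-life formula $h = \ln 2/(θ + δ)$ can be inverted to see what decay rates the quoted half-lives imply. The sketch below treats δ as a single number rather than evaluating $Nϕρa/λ(ϕ)$, since the abstract does not specify $λ(ϕ)$; it also assumes time is measured in years and takes 6 years (the midpoint of the quoted 5-7 year range) as the pre-AI half-life. The function names are illustrative.

```python
import math


def half_life(theta: float, delta: float = 0.0) -> float:
    """Alpha half-life h = ln 2 / (theta + delta)."""
    return math.log(2) / (theta + delta)


def implied_total_rate(h: float) -> float:
    """Total decay rate theta + delta implied by a half-life h."""
    return math.log(2) / h


theta = implied_total_rate(6.0)  # pre-AI: delta = 0, half-life assumed to be 6 years
total = implied_total_rate(1.5)  # current: half-life of 18 months
delta = total - theta            # AI-accelerated decay component

print(theta, delta, delta / theta)
```

Under these assumptions the AI-driven component is three times the natural mean-reversion rate, because shrinking the half-life by a factor of four requires the total decay rate to quadruple.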

2605.22800 2026-05-26 cs.LG cs.AI stat.ML 版本更新

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

匹配原则：面向干扰鲁棒表示学习的损失函数几何理论

Vishal Rajput

Affiliations: KU Leuven

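AI summary: Proposes the matching principle, which unifies several robustness methods by estimating the task covariance matrix and matching the range of the penalty matrix to it, and proves optimality in a linear-Gaussian model.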

Comments 58 pages, 13 pre-specified empirical blocks. v2: partial-pass framing, geometry-task dissociation, T2B protocol v3, layout/figure fixes; core theorems unchanged. Code: matching-pmh (PyPI). Related note: arXiv:2604.21395

AI中文摘要

鲁棒性、领域自适应、光度/遮挡不变性、传感器漂移和对齐风格被视为独立的文献领域，拥有各自独立的方法族。在标签保持的部署偏移下，它们共享一个几何对象：协方差 Sigma_task = Cov_{Q_n}(n)，即输入在标签不变的情况下可以变化的方式。CORAL、对抗训练、数据增强、度量学习、雅可比惩罚和对齐约束并非独立的技巧——它们都是 Sigma_task 的估计量。固定该对象后，雅可比惩罚由一个矩阵 Sigma' 确定，其像空间必须覆盖 range(Sigma_task)——即匹配原则。我们在线性高斯模型中证明了最优性（定理A），证明了任何能够消除部署漂移的二次惩罚都需要像空间覆盖（定理G），并在全局最小值处证明了相同的二分性（定理A*_global）。错误方向/信号对齐控制（引理C；推论E/E*）以及七个估计量（引理D1-D7），加上无标签TDI，为需要学习 Sigma_task 的情况提供了可证伪的配方。在十三个模块（从ML到Qwen2.5-7B）上，测试了匹配的、各向同性的和错误方向的惩罚对几何和部署漂移的影响。其中十二个模块与可识别性成立的理论一致；Office-31是一个命名的特征间隙失败案例。部分通过：几何可以在不改善每个头条任务指标的情况下提升。一次初步的7B DPO运行（一个epoch，240对）：匹配风格-PMH保持了风格TDI，而标准DPO则使其退化。我们不声称标准训练达到全局最小值（假设(O)是开放的），不声称估计的 Sigma_task 总是可识别的，也不声称在每个排行榜上占优。我们提出一个可证伪的设计配方：估计 Sigma_task，匹配 Sigma'，运行控制，分别报告任务和几何指标。

英文摘要

Robustness, domain adaptation, photometric/occlusion invariance, sensor drift, and alignment style are treated as separate literatures with separate method families. Under label-preserving deployment shift they share one geometric object: the covariance Sigma_task = Cov_{Q_n}(n) of ways inputs can change without changing the label. CORAL, adversarial training, augmentation, metric learning, Jacobian penalties, and alignment constraints are not independent tricks--they are estimators of Sigma_task. Fix that object and the Jacobian penalty is pinned by a matrix Sigma' whose range must cover range(Sigma_task)--the matching principle. We prove optimality in a linear-Gaussian model (Thm. A), necessity of range coverage for any quadratic penalty that zeros deployment drift (Thm. G), and the same dichotomy at global minima (Thm. A*_global). Wrong-direction/signal-aligned controls (Lemma C; Cor. E/E*) and seven estimators (Lemmas D1--D7), plus label-free TDI, yield a falsifiable recipe when Sigma_task must be learned. Thirteen blocks (ML through Qwen2.5-7B) test matched vs isotropic vs wrong-direction penalties on geometry and deployment drift. Twelve match theory where identifiability holds; Office-31 is a named eigengap failure. Partial passes: geometry can improve without every headline task metric moving. A pilot 7B DPO run (one epoch, 240 pairs): matched style-PMH preserves Style TDI where standard DPO degrades it. We do not claim standard training reaches global minima (assumption (O) is open), that estimated Sigma_task is always identifiable, or dominance on every leaderboard. We claim a falsifiable design recipe: estimate Sigma_task, match Sigma', run the controls, report task and geometry separately.

2605.22093 2026-05-26 cs.AI 版本更新

Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)

知识图谱沿本体论连续体的重工程（扩展版）

Enrico Daga, Valentina Tamma, Terry Payne

Affiliations: The Open University, Walton Hall, Milton Keynes, United Kingdom; School of Computer Science and Informatics, University of Liverpool, UK

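AI summary: Proposes the ontological continuum, a conceptual framework that describes, compares, and transforms knowledge graphs along two orthogonal dimensions (semantics vs pragmatics, properties vs affordances) to ease integration and reuse across modelling practices, illustrated through a case study.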

AI中文摘要

知识图谱已成为数据集成的主要载体，对现代AI的成功至关重要，但KG建模实践的多样性（从轻量级词汇表到丰富公理化的本体论）使得集成和重用成本高昂且脆弱。这一挑战在神经符号AI中尤为突出，其中桥接神经和符号组件依赖于重新设计KG以适应新需求的能力；生成式AI现在提供了前所未有的自动化能力，但如果没有对KG空间的原则性理解，这种自动化在概念上仍然缺乏基础。我们将本体论连续体引入为缺失的概念化，这是一个理论构造，其特征框架由两个正交区分定义：语义与语用，以及属性与可供性；这些共同定义了一个词汇表，用于描述、比较、导航和转换跨越全部建模实践的KG。方法论立场是经验性的：连续体并非规定KG应如何建模，而是旨在定义一种存在理论，源于对现实世界KG工程实践的观察，其结构可以形式化地明确表达，例如通过形式概念分析（FCA）。我们通过一个关于溯源知识的案例研究来夯实这一愿景，展示单一关注点如何在连续体上以不同方式体现。我们阐述了五个开放的研究挑战，并邀请社区将本体论连续体发展为一个共享的研究议程。

英文摘要

Knowledge graphs have become the primary vehicle for data integration and are critical to the success of modern AI, but the diversity of KG modelling practices, from lightweight vocabularies to richly axiomatised ontologies, makes integration and reuse expensive and brittle. This challenge is particularly acute in neuro-symbolic AI, where bridging neural and symbolic components depends on the ability to reengineer KGs to fit new requirements; GenAI now offers unprecedented automation capability, but without a principled understanding of the KG space, such automation remains conceptually ungrounded. We introduce the ontological continuum as that missing conceptualisation, a theoretical construct whose characterisation framework is defined by two orthogonal distinctions: semantics vs pragmatics, and properties vs affordances; together these define a vocabulary to describe, compare, navigate, and transform KGs across the full range of modelling practices. The methodological stance is empirical: rather than prescribing how KGs should be modelled, the continuum aims to define a theory of the existent, derived from observation of real-world KG engineering practices and whose structure can be made formally explicit, for example, through Formal Concept Analysis (FCA). We ground the vision through a case study on provenance knowledge, showing how a single concern manifests differently across the continuum. We articulate five open research challenges and invite the community to develop the ontological continuum as a shared research agenda.

2605.22005 2026-05-26 cs.LG cs.AI cs.CL 版本更新

HypergraphFormer: Learning Hypergraphs from Large Language Models for Editable Floor Plan Generation

Nikita Klimenko, Hesam Salehipour, Parham Eftekhar, Amir Khasahmadi, Ramon Elias Weber

Affiliations: Autodesk Research; York University; UC Berkeley

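AI summary: Proposes HypergraphFormer, which uses a large language model to learn hypergraph representations for floor plan generation; it outperforms existing methods on the RPLAN dataset and supports arbitrary boundaries and a high degree of editability.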

AI中文摘要

在这项工作中，我们提出了HypergraphFormer，一种基于大语言模型学习超图表示的新型高效楼层平面图生成方法。该模型通过监督微调训练，生成基于超图的文本表示，编码楼层平面图中的空间关系和连通性信息。我们在RPLAN数据集上训练和评估我们的方法，并进一步在本文发布的一个独立的分布外数据集上展示其泛化能力。我们的方法在多种指标上优于基于栅格化或向量化表示的最先进技术。我们还展示了改进的数据效率，特别是在分布偏移下。超图公式通过将公寓足迹与其功能和几何细分解耦，使得能够为任意、不规则、用户指定的边界生成楼层平面图。此外，我们展示了所提出的方法具有高度的可编辑性，使其特别适合由大语言模型支持的设计导向工作流程。

英文摘要

In this work, we propose HypergraphFormer, a novel and efficient approach to floor plan generation based on learning hypergraph representations with a large language model (LLM). The model is trained via supervised fine-tuning to generate a hypergraph-based textual representation that encodes spatial relationships and connectivity information within floor plans. We train and evaluate our approach on the RPLAN dataset, and further demonstrate its generalizability on a separate out-of-distribution dataset, which we release in this paper. Our method outperforms state-of-the-art techniques based on rasterized or vectorized representations across a diverse set of metrics. We also show improved data efficiency, particularly under distribution shift. The hypergraph formulation enables the generation of floor plans for arbitrary, irregular, user-specified boundaries by decoupling apartment footprints from their functional and geometric subdivisions. Furthermore, we show that the proposed methodology offers a high degree of editability, making it particularly well suited to design-oriented workflows supported by LLMs.

2605.18224 2026-05-26 cs.LG cs.AI 版本更新

A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders

变分自编码器中恒定坍缩的单纯形见证证书

Zegu Zhang, Jianhua Peng, Jian Zhang

Affiliations: Independent Researcher; School of Computing, Southeast University

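AI summary: Proposes a certificate based on a GMM teacher posterior and a simplex witness for detecting and quantifying input-independent constant collapse of the VAE encoder mean, validated on MNIST, CIFAR-10, and CIFAR-100.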

AI中文摘要

我们研究变分自编码器中的精确恒定坍缩：确定性编码器均值变得与输入无关。先验保持为标准高斯分布。在VAE训练之前，我们从基于GMM的数据视角选择一个固定的教师后验，并将一个固定的仅潜在空间单纯形见证附加到编码器均值上。这种构造产生两个关联对象。第一个是证书：如果见证预测优于教师的最佳恒定预测器，则编码器均值不能是输入无关的常数。第二个是局部逃逸方向：在坍缩流形上，教师残差为对齐损失提供样本相关的下降方向。对于任何全支撑的教师后验，相同的几何结构也给出一个具有零教师-见证对齐误差的闭式潜在码。其缩放版本追踪一条从恒定预测器到精确教师码的边际能量路径，该路径量化了受保护见证子空间内的非坍缩。我们在MNIST、CIFAR-10和CIFAR-100上实例化了该方法。使用搜索的无监督PCA-GMM教师，在CIFAR-10和CIFAR-100上，所有五个种子的普通VAE均未通过教师-见证证书，而RST变体在所有五个种子中均通过。在坍缩压力设置下（β_KL ∈ {2,4,8}），普通VAE再次在所有种子中失败，而RST-alpha-prefit保持证书阳性。在两个自然图像数据集上的逃逸轨迹从低边际初始化开始增加见证边际，并表现出非零的教师诱导梯度范数。该分析仅限于编码器均值的精确恒定坍缩；生成质量、解码器使用和其他坍缩模式仍是独立的问题。

英文摘要

We study exact constant collapse in variational autoencoders: the deterministic encoder mean becomes independent of the input. The prior remains the standard Gaussian. Before VAE training, we select a fixed teacher posterior from a GMM-based view of the data and attach a fixed latent-only simplex witness to the encoder mean. This construction yields two linked objects. The first is a certificate: if the witness prediction improves on the best constant predictor of the teacher, the encoder mean cannot be an input-independent constant. The second is a local escape direction: on the collapsed manifold, the teacher residual gives a sample-dependent descent direction for the alignment loss. For any full-support teacher posterior, the same geometry also gives a closed-form latent code with zero teacher-witness alignment error. Its scaled versions trace a margin-energy path from the constant predictor to the exact teacher code, which quantifies non-collapse inside the protected witness subspace. We instantiate the method on MNIST, CIFAR-10, and CIFAR-100. With searched unsupervised PCA-GMM teachers, vanilla VAEs fail the teacher-witness certificate in all five seeds on CIFAR-10 and CIFAR-100, while RST variants pass in all five seeds. Under collapse-stress settings with $β_{\mathrm{KL}}\in\{2,4,8\}$, vanilla VAE again fails in all seeds, whereas RST-alpha-prefit remains certificate-positive. Escape trajectories on both natural-image datasets increase the witness margin from a low-margin initialization and exhibit nonzero teacher-induced gradient norms. The analysis is confined to exact constant collapse of the encoder mean; generation quality, decoder use, and other collapse modes remain separate questions.
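
The logic of the certificate can be sketched in a few lines: a witness that reads only the encoder mean can beat the best constant predictor of the teacher only if the encoder mean varies with the input. The sketch assumes the alignment loss is cross-entropy against the teacher posterior, in which case the best constant predictor is the mean teacher posterior; the loss choice, the function names, and the toy numbers are assumptions, not details taken from the paper.

```python
import math


def cross_entropy(targets, preds):
    """Mean cross-entropy between teacher posteriors and predicted distributions."""
    total = 0.0
    for t, p in zip(targets, preds):
        total -= sum(ti * math.log(pi) for ti, pi in zip(t, p) if ti > 0)
    return total / len(targets)


def best_constant(targets):
    """Best input-independent predictor under cross-entropy: the mean teacher posterior."""
    k = len(targets[0])
    return [sum(t[j] for t in targets) / len(targets) for j in range(k)]


def certificate(targets, witness_preds) -> bool:
    """True if the witness strictly beats the best constant predictor."""
    const = best_constant(targets)
    baseline = cross_entropy(targets, [const] * len(targets))
    return cross_entropy(targets, witness_preds) < baseline


teacher = [[0.9, 0.1], [0.1, 0.9]]      # toy teacher posteriors for two inputs
informative = [[0.8, 0.2], [0.2, 0.8]]  # witness output that tracks the input
collapsed = [[0.6, 0.4], [0.6, 0.4]]    # witness output from a constant encoder mean

print(certificate(teacher, informative), certificate(teacher, collapsed))
```

Under this loss, no constant prediction can beat the mean teacher posterior, so a passing certificate rules out exact constant collapse in the toy setting.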

2605.16302 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

通过反事实推理路径减少信用分配方差

Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

Affiliations: Alibaba Group; Tsinghua University

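AI summary: Proposes a counterfactual-comparison framework that samples multiple reasoning trajectories and uses their differences to estimate process-level advantages implicitly, converting sparse terminal rewards into step-sensitive signals to improve credit assignment in multi-step LLM reasoning; introduces Implicit Behavior Policy Optimization (IBPO) to improve training stability and the performance ceiling.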

AI中文摘要

使用大语言模型进行多步推理的强化学习通常依赖于稀疏的终端奖励，这会导致一个条件较差的信用分配问题：最终反馈均匀地传播到所有中间决策。这导致高梯度方差、不稳定的训练和许多无效更新，最终限制了模型的持续改进。我们提出了一种用于信用分配的反事实比较框架。对于每个输入，该框架采样多个推理轨迹，并将它们的差异视为对替代决策的隐式近似。这产生了一个隐式过程级优势估计器，将稀疏终端奖励转化为步骤敏感的学习信号。基于此框架，我们引入了隐式行为策略优化（IBPO），该方法在数学和代码推理基准上显著提高了训练稳定性和性能上限。我们的结果为释放大语言模型的推理潜力指明了一个有前景的方向。

英文摘要

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual-comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process-level advantage estimator that converts sparse terminal rewards into step-sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code-reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.
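
One way to turn terminal rewards from several sampled trajectories into step-level signals is to compare trajectories that share a prefix. The sketch below is an illustrative estimator, not IBPO itself: it assumes trajectories are tuples of discrete step labels, values each prefix by the mean terminal reward of the samples that pass through it, and defines a step's advantage as the change in that value. All names and numbers are hypothetical.

```python
from collections import defaultdict


def prefix_values(trajectories, rewards):
    """Mean terminal reward of the sampled trajectories that share each prefix."""
    totals, counts = defaultdict(float), defaultdict(int)
    for steps, reward in zip(trajectories, rewards):
        for t in range(len(steps) + 1):
            totals[steps[:t]] += reward
            counts[steps[:t]] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}


def step_advantages(steps, values):
    """Advantage of step t = value after taking it minus value before it."""
    return [values[steps[:t + 1]] - values[steps[:t]] for t in range(len(steps))]


# Four sampled reasoning paths for one input; only ("a", "b") earns the terminal reward.
trajectories = [("a", "b"), ("a", "c"), ("d", "b"), ("d", "c")]
rewards = [1.0, 0.0, 0.0, 0.0]

values = prefix_values(trajectories, rewards)
for steps in trajectories:
    print(steps, step_advantages(steps, values))
```

The step advantages along a trajectory sum to its reward minus the mean reward of the group, so the terminal signal is redistributed across steps rather than spread uniformly over them.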

2605.15236 2026-05-26 cs.IT cs.AI cs.NI math.IT 版本更新

Learning Selective Merge Policies for Deadline-Constrained Coded Caching via Deep Reinforcement Learning

基于深度强化学习的截止时间约束编码缓存选择性合并策略学习

Amirhossein Yousefiramandi

Affiliations: Amirhossein Yousefiramandi

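AI summary: For deadline-constrained coded caching, proposes a selective merge policy learned with deep reinforcement learning; the policy network, trained via proximal policy optimization, outperforms SACM++ on broadcast-packet expiration ratio and efficiency.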

AI中文摘要

在编码缓存中，服务器利用用户端的缓存信息，通过单个编码多播消息或数据包（即合并数据包）并行服务多个用户，从而缓解峰值网络拥塞。为了在视频流等截止时间驱动的应用中向用户传递及时消息，我们必须在线确定要合并的消息进行传递，因为每个请求都有时间限制。值得注意的是，虽然合并有助于当前编码多播数据包，但可能损害未来的传递。我们的解决方案采用深度强化学习，将编码多播传递视为一个掩码动作离散状态控制问题，并通过近端策略优化训练的策略网络优于SACM++。在均匀需求基准上，我们的策略网络将广播数据包过期率$ρ$相对于最佳编码多播基线（SACM++）降低了$40.9\%$（$0.208$ vs. $0.352$），同时在编码多播方法中，在Track~A电池组上取得了最佳广播效率分数$σ$。这里一个值得注意的现象是，对于截止时间更严格的应用，合并变得有选择性而非激进，因为策略网络仅在大约$31.8\%$的机会中选择性合并，尽管在同一模拟器家族的不同变体中观察到相同现象。我们设计的重点是高效的成对XOR合并，而高阶（$K{\ge}3$）编码可视为自然推广，留待未来工作。

英文摘要

In coded caching, the server exploits content cached at the users to serve several of them in parallel with a single coded multicast packet (a merged packet), which mitigates peak network congestion. In deadline-driven applications such as video streaming, each request has a time limit, so the server must decide online which messages to merge for delivery. Merging helps the current coded multicast packet but can harm future deliveries. Our solution uses deep reinforcement learning, casting coded multicast delivery as a masked-action, discrete-state control problem; the policy network, trained via proximal policy optimization, outperforms SACM++. On the uniform-demand benchmark, it reduces the broadcast-packet expiration ratio $ρ$ by $40.9\%$ ($0.208$ vs. $0.352$) relative to the best coded multicast baseline (SACM++), while also attaining the best broadcast-efficiency score $σ$ across the Track A battery among the coded multicast methods. Notably, under stricter deadlines merging becomes selective rather than aggressive: the policy network merges at only about $31.8\%$ of the opportunities, and the same behavior is observed across variations within the same simulator family. Our design focuses on efficient pairwise XOR merging; higher-order ($K{\ge}3$) coding is a natural generalization left for future work.
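
Pairwise XOR merging can be shown with two users who each cache the file the other one wants. This is a minimal sketch of the basic coded-multicast idea only; the file contents and variable names are made up, and it does not model deadlines or the learned merge policy.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("packets must have equal length to be merged")
    return bytes(x ^ y for x, y in zip(a, b))


# User 1 requests file A and holds file B in its cache;
# user 2 requests file B and holds file A in its cache.
file_a = b"AAAA1234"
file_b = b"BBBB5678"

# The server broadcasts one merged packet instead of two separate ones.
merged = xor_bytes(file_a, file_b)

# Each user cancels its cached file out of the merged packet.
decoded_by_user1 = xor_bytes(merged, file_b)
decoded_by_user2 = xor_bytes(merged, file_a)

print(decoded_by_user1, decoded_by_user2)
```

A single broadcast serves both requests here, which is the saving that merging buys; the trade-off studied in the abstract is that committing a request to a merged packet now can constrain which merges remain possible before later deadlines.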


2605.14889 2026-05-26 cs.CV cs.AI 版本更新

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

SurgicalMamba: 具有状态重编程的双路径SSD用于在线手术阶段识别

Sukju Oh, Sukkyu Sun

发表机构 * Department of Computer Science and Artificial Intelligence（计算机科学与人工智能系）

AI总结提出SurgicalMamba模型，基于Mamba2的结构化状态空间对偶性（SSD），通过双路径SSD块、强度调制步进和状态重编程三个组件，实现在线手术阶段识别，在多个基准上达到最先进性能。

Comments 28 pages, 7 figures, 10 tables; Code available at https://github.com/sukjuoh/Surgical-Mamba


AI中文摘要

在线手术阶段识别（SPR）是上下文感知手术室系统的基础，要求仅根据过去上下文对每一帧做出预测。手术视频提出了自然视频识别器无法共同解决的三个需求：手术过程跨越数万帧，时间流动不均匀（长时间常规片段被短暂的阶段定义转换打断），视觉领域狭窄，因此骨干特征在通道间高度相关。现有识别器要么让每帧成本随已处理长度增长，要么保持成本有界但以均匀速率和通道独立动态推进状态，无法解决后两个需求。我们提出SurgicalMamba，一种基于Mamba2的结构化状态空间对偶性（SSD）的因果SPR模型，将每帧成本保持在O(d)。它引入了三个与SSD兼容的组件，共同解决这些需求：双路径SSD块，在循环状态级别分离长期和短期模式；强度调制步进，一种连续时间时间扭曲，使慢路径的有效速率适应阶段相关信息；以及状态重编程，一种每块的Cayley旋转，在原本轴对齐的SSM循环中打开跨通道混合。学习到的旋转平面继承了阶段对齐的结构，无需任何直接监督，提供了手术工作流的可解释内部特征。在七个公开SPR基准上，SurgicalMamba在严格在线评估下达到了最先进的准确率和阶段级Jaccard指数：在Cholec80上为94.6%/82.7%（比最强先前方法高0.7 pp/2.2 pp），在AutoLaparo上为89.5%/68.9%（高1.7 pp/2.0 pp），在单个GPU上达到238.74 fps。消融实验分离了每个组件的贡献。代码公开于https://github.com/sukjuoh/Surgical-Mamba。

英文摘要

Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components that jointly address these demands: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 238.74 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
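
The abstract describes state regramming as a Cayley rotation. As a hedged illustration of why a Cayley transform yields a rotation, the sketch below uses the textbook form Q = (I - A)^{-1}(I + A) for a 2x2 skew-symmetric A = [[0, a], [-a, 0]]. The 2D restriction, the function names, and the value a = 0.5 are assumptions for illustration; the paper's per-chunk parameterization is not given in the abstract.

```python
def cayley_rotation_2d(a: float):
    """Closed-form Cayley transform Q = (I - A)^{-1} (I + A)
    for the skew-symmetric matrix A = [[0, a], [-a, 0]]."""
    s = 1.0 + a * a
    return [
        [(1.0 - a * a) / s, 2.0 * a / s],
        [-2.0 * a / s, (1.0 - a * a) / s],
    ]


def times_own_transpose(q):
    """Return Q Q^T for a 2x2 matrix; equals the identity if Q is orthogonal."""
    return [
        [sum(q[i][k] * q[j][k] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


q = cayley_rotation_2d(0.5)           # hypothetical rotation parameter
gram = times_own_transpose(q)          # should be close to the identity
determinant = q[0][0] * q[1][1] - q[0][1] * q[1][0]
```

Because Q is orthogonal with determinant 1, applying it mixes the two channels without changing the norm of the state, which is consistent with the cross-channel mixing the abstract attributes to the rotation.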


2605.10977 2026-05-26 cs.CR cs.AI 版本更新

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

PASA：一种针对语义不变攻击的LLM生成文本的原则性嵌入空间水印方法

Zhenxin Ai, Haiyun He

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出PASA水印算法，在潜在嵌入空间的语义簇上嵌入和检测水印，通过理论框架实现检测精度、鲁棒性和失真的基本权衡，在强释义攻击下仍保持鲁棒性和文本质量。


AI中文摘要

大型语言模型（LLM）的水印是一种有前景的方法，用于检测LLM生成的文本并实现负责任的部署。然而，现有的水印方法通常容易受到语义不变攻击（如释义）的影响。我们提出了PASA，一种原则性、鲁棒且无失真的水印算法，在语义级别嵌入和检测水印。PASA在潜在嵌入空间中的语义簇上操作，并通过由密钥和语义历史同步的共享随机性构建令牌和辅助序列之间的分布依赖关系。该设计基于我们的理论框架，该框架表征了联合最优的嵌入-检测对，实现了检测精度、鲁棒性和失真之间的基本权衡。在多个LLM和语义不变攻击上的评估表明，即使在强释义攻击下，PASA仍保持鲁棒性，同时保持高文本质量，优于标准词汇空间基线。消融研究进一步验证了我们超参数选择的有效性。网页：https://ai-kunkun.github.io/PASA_page/。

英文摘要

Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework that characterizes a jointly optimal embedding-detection pair, achieving the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai-kunkun.github.io/PASA_page/.


2605.10764 2026-05-26 cs.CV cs.AI 版本更新

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

打破刹车，而非车轮：通过熵最大化实现无目标越狱

Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni, Zhaoyuan Yang, Weikang Li, Xuesong Li, Jing Zhang

发表机构 * Australian National University（澳大利亚国立大学）； The University of Queensland（昆士兰大学）； Peking University（北京大学）； GE Research（通用电气研究院）； CSIRO（澳大利亚联邦科学与工业研究组织）

AI总结提出UJEM-KL攻击方法，通过最大化决策令牌的熵来翻转视觉-语言模型的拒绝输出，实现高迁移性的无目标越狱。

Comments Preprint. 17 pages, 8 figures, 6 tables


AI中文摘要

近期研究表明，基于梯度的通用图像越狱攻击在视觉-语言模型（VLM）上几乎没有或完全没有跨模型迁移性，这使人们对可迁移多模态越狱的可行性产生了怀疑。我们在严格的无目标威胁模型下重新审视这一结论，不强制固定前缀或响应模式。初步实验发现，在自回归解码过程中，拒绝行为集中在高熵令牌上，而攻击前非拒绝令牌在前排候选者中已占据相当大的概率质量。受此启发，我们提出通过熵最大化的无目标越狱（UJEM）-KL，这是一种轻量级攻击，通过最大化这些决策令牌的熵来翻转拒绝结果，同时稳定剩余的低熵位置以保持输出质量。在三个VLM和两个安全基准测试中，UJEM-KL实现了具有竞争力的白盒攻击成功率，并持续提高了迁移性，同时在代表性防御下仍然有效。我们的实验结果表明，有限的迁移性主要源于过度受限的优化目标。

英文摘要

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and that non-refusal tokens already carry substantial probability mass among the top-ranked candidates before the attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
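
The notion of "high-entropy decision tokens" in the abstract can be made concrete with a small measurement sketch. It computes the Shannon entropy of each decoding step's next-token distribution and flags the steps above a threshold. The toy distributions, the threshold of 1.0 nats, and the function names are assumptions for illustration; the sketch covers only the measurement, not the attack objective or the KL stabilization term.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_probs, threshold):
    """Indices of decoding steps whose next-token entropy meets the threshold."""
    return [i for i, probs in enumerate(step_probs) if entropy(probs) >= threshold]


# Hypothetical next-token distributions over a 4-token vocabulary.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.40, 0.35, 0.15, 0.10],  # uncertain step, a candidate decision token
    [0.90, 0.05, 0.03, 0.02],  # fairly confident step
]

decision_positions = high_entropy_positions(step_probs, threshold=1.0)
```

With these toy numbers only the second step exceeds the threshold. For a four-token vocabulary the entropy is capped at ln 4, about 1.386 nats.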


2605.10430 2026-05-26 cs.LG cs.AI stat.ML 版本更新

Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation

真实 vs. 半模拟：重新思考治疗效果估计的评估

George Panagopoulos

发表机构 * Department of Computer Science, University of Luxembourg（卢森堡大学计算机科学系）

AI总结通过大规模实证研究，比较了半模拟基准和真实数据集上使用反事实指标与可观测指标评估治疗效果估计模型的效果，揭示了两种评估体系之间的差距，并发现简单元学习器与强基础模型结合具有竞争力。


AI中文摘要

利用机器学习估计异质性治疗效果在学术研究和工业实践中都引起了广泛关注。然而，这两个领域通常在不同条件下评估模型。方法论工作通常依赖于半模拟基准和需要反事实结果的指标，而实际应用则依赖于基于排名或测试结果的可观测指标。尽管方法论进展与实际部署之间存在众所周知的差距，但这些评估体系之间的关系尚未得到系统研究。我们对标准半模拟基准系列和真实数据集上的治疗效果评估进行了大规模实证研究。我们的基准涵盖了与多个基础学习器配对的元学习器，以及专门的因果机器学习模型。我们使用应用导向文献中常见的可观测指标以及方法论文中常用的反事实指标来评估这些方法。我们的结果揭示了两个互补的差距。首先，即使在相同的半模拟基准上，反事实指标也不能可靠地恢复可观测指标偏好的估计器。其次，在半模拟基准上获得的排名不能迁移到真实数据集。我们还发现，具有强大基础模型的简单元学习器始终具有竞争力，这与专门的因果模型形成对比。总体而言，我们的发现表明，治疗效果估计研究的进展不应仅通过反事实指标和半模拟基准来评估，而应结合可观测指标和真实数据验证。

英文摘要

Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We evaluate these methods using observable metrics common in application-oriented literature, alongside counterfactual metrics commonly used in methods papers. Our results reveal two complementary gaps. First, counterfactual metrics do not reliably recover the estimators preferred by observable metrics, even on the same semi-simulated benchmarks. Second, rankings obtained on semi-simulated benchmarks do not transfer to real datasets. We further find that simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models. Overall, our findings suggest that progress in treatment effect estimation research should not be assessed solely through counterfactual metrics and semi-simulated benchmarks, but would benefit from incorporating observable metrics and real-data validation.


2605.07733 2026-05-26 cs.LG cs.AI 版本更新

Intelligent Truck Matching in Full Truckload Shipments using Ping2Hex approach

使用Ping2Hex方法的整车运输智能卡车匹配

Srinivas Kumar Ramdas, Jose Mathew, Ankit Singh Chauhan, Dinesh Rajkumar, Aravind Manoj, Mohit Goel

发表机构 * Project44 GmbH（Project44公司）

AI总结提出基于Ping2Hex的智能卡车匹配系统ITM 2.0，通过概率排序和LightGBM模型解决GPS数据中车辆标识缺失导致的匹配问题，显著提升精度和覆盖率。

Comments 12 pages, 10 figures, 8 tables. Accepted at iSCSi 2026 (International Conference on Industry Sciences and Computer Sciences Innovation). To appear in Procedia Computer Science (Elsevier)


Journal ref: ISCSI(2026)

AI中文摘要

利用GPS数据进行准确的卡车与货物匹配是整车供应链可视性的基础，能够实现实时跟踪和准确的预计到达时间（ETA）预测。然而，缺失或损坏的车辆标识符使得传统匹配方法无法使用，导致货物失去可视性。本文提出了智能卡车匹配（ITM）2.0，一个机器学习系统，通过将匹配问题表述为概率排序来解决这一关键缺口。我们的方法利用Uber H3六边形空间索引将GPS ping离散化为路线相似性特征，结合时间信息，然后应用带有阈值后处理的LightGBM梯度提升。通过严格的评估，包括离线模型选择（SVM、XGBoost、LightGBM）、全面的消融研究和生产影子测试，我们展示了相对于基于规则的基线的显著提升。ITM 2.0在北美实现了26个百分点的精度提升，在欧洲实现了14个百分点的提升，同时覆盖率翻倍。该系统已在Project44部署用于处理整车运输，展示了对于高达1公里的地理编码误差、多个候选卡车和稀疏ping的鲁棒性。

英文摘要

Accurate truck-to-shipment matching using GPS data is foundational for full truckload supply chain visibility, enabling real-time tracking and accurate estimated time of arrival (ETA) predictions. However, missing or corrupted vehicle identifiers defeat traditional matching approaches, leaving shipments without visibility. This paper presents Intelligent Truck Matching (ITM) 2.0, a machine learning system that addresses this critical gap by formulating matching as a probabilistic ranking problem. Our approach leverages Uber H3 hexagonal spatial indexing to discretize GPS pings into route similarity features, combines them with temporal information, and then applies LightGBM gradient boosting with threshold-based post-processing. Through rigorous evaluation including offline model selection (SVM, XGBoost, LightGBM), comprehensive ablation studies, and production shadow testing, we demonstrate substantial gains over rule-based baselines. ITM 2.0 achieves a 26-percentage-point precision improvement in North America and a 14-point improvement in Europe, while doubling coverage. Deployed in production at Project44 to handle full truckload shipments, the system demonstrates robustness to geocoding errors up to 1 km, multiple candidate trucks, and sparse pings.
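
A rough sketch of how discretized pings can become a route-similarity feature is given below. It is an assumption-laden illustration, not the ITM 2.0 pipeline: a square latitude/longitude grid stands in for Uber H3 hexagons, Jaccard overlap stands in for whatever similarity features the authors compute, and the coordinates, truck names, and function names are hypothetical. The LightGBM ranking stage is omitted.

```python
import math


def to_cell(lat, lon, cell_deg=0.1):
    """Snap a ping to a square grid cell (a stand-in for an H3 hexagon, NOT H3)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def route_cells(pings, cell_deg=0.1):
    """Set of grid cells visited by a sequence of (lat, lon) pings."""
    return {to_cell(lat, lon, cell_deg) for lat, lon in pings}


def jaccard(a, b):
    """Jaccard overlap of two cell sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical planned shipment route and candidate truck pings.
planned_route = [(41.88, -87.63), (41.75, -87.55), (41.62, -87.41), (41.51, -87.33)]
candidates = {
    "truck_A": [(41.87, -87.64), (41.76, -87.56), (41.63, -87.42)],
    "truck_B": [(40.12, -88.24), (40.25, -88.11)],
}

shipment_cells = route_cells(planned_route)
scores = {
    truck: jaccard(shipment_cells, route_cells(pings))
    for truck, pings in candidates.items()
}
best_truck = max(scores, key=scores.get)
```

Snapping pings to cells makes the comparison tolerant of small position errors, because nearby pings fall into the same cell. In a learned ranker, a score like this would be one input feature among several rather than the decision itself.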


2605.06415 2026-05-26 cs.LG cs.AI cs.CL cs.CV 版本更新

E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

E = T*H/(O+B)：混合专家生态的无量纲控制参数

Qingjun Zhang

发表机构 * School of Integrated Circuits, Wuxi Taihu University（无锡太湖大学集成电路学院）

AI总结提出无量纲控制参数E = T*H/(O+B)，通过12个控制实验证明E≥0.5可保证混合专家模型无死亡专家，并发现专家复活、正交毒性依赖数据集等六项额外结果。

Comments 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework


AI中文摘要

我们引入E = T*H/(O+B)，这是一个无量纲控制参数，用于预测混合专家（MoE）模型是否会发展出健康的专家生态还是陷入死亡专家。E将四个超参数——路由温度T、路由熵权重H、先知权重O和平衡权重B——组合成一个单一量。通过12个控制实验（8个视觉，4个语言），总计超过11,000个训练周期，我们确定仅E ≥ 0.5就足以保证零死亡专家，消除了手工设计负载平衡辅助损失的必要性。我们在CIFAR-10、CIFAR-100、TinyImageNet-200、WikiText-2和WikiText-103上跨模态验证了这一点。另外还发现了六项结果：（1）死亡专家可以复活——由平衡损失驱动路由器重新探索触发；（2）正交毒性依赖于数据集，并非普遍存在；（3）任务复杂性改变了临界E阈值；（4）模型过拟合与专家生态健康解耦；（5）三层MoE自发崩溃为两层功能结构；（6）生态结构在50倍温度范围内保持不变。我们提出E作为MoE训练的统一诊断指标，类似于流体力学中的雷诺数。

英文摘要

We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
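
The control parameter is a plain ratio, so checking a configuration against the reported threshold takes only a few lines. The sketch below assumes the four quantities are available as positive scalars. The function name and the two example configurations are hypothetical and are not taken from the paper's experiments.

```python
E_THRESHOLD = 0.5  # threshold reported in the abstract


def ecology_parameter(t: float, h: float, o: float, b: float) -> float:
    """E = T*H / (O + B): routing temperature times entropy weight,
    divided by the sum of the oracle and balance weights."""
    if o + b <= 0.0:
        raise ValueError("O + B must be positive")
    return t * h / (o + b)


# Hypothetical hyperparameter settings for illustration only.
e_healthy = ecology_parameter(t=1.0, h=0.1, o=0.1, b=0.05)   # about 0.667
e_risky = ecology_parameter(t=0.5, h=0.05, o=0.1, b=0.1)     # 0.125

healthy_ok = e_healthy >= E_THRESHOLD
risky_ok = e_risky >= E_THRESHOLD
```

Under the abstract's claim, the first setting would be expected to avoid dead experts, while the second falls below the threshold and would warrant a higher temperature or entropy weight.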


2605.04295 2026-05-26 cs.LG cs.AI 版本更新

LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy

通过自适应共形语义熵进行LLM不确定性量化

Hamed Karimi, Vaishali Meyappan, Reza Samavi

发表机构 * Toronto Metropolitan University（多伦多都会大学）； Vector Institute（向量研究所）

AI总结提出自适应共形语义熵（ACSE）方法，通过聚类语义熵并自适应调整不确定性分数，结合共形校准实现统计可靠的接受/弃权决策，在多个数据集上优于现有基线。

Comments Accepted for publication in the Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026); 14 Pages


AI中文摘要

LLMs的过度自信，特别是在产生幻觉时，对在安全关键环境中部署模型构成了重大挑战，并使得对不确定性进行可靠估计成为必要。现有的不确定性量化方法通常优先考虑词汇或概率度量；然而，这些技术往往忽略了具有相似含义的不同响应的语义差异。在本文中，我们提出了自适应共形语义熵（ACSE），一种通过自适应测量LLMs输出中的语义分散性来估计提示级不确定性的方法。我们的不确定性评分函数基于对同一提示的多个不同响应的语义熵进行聚类。该函数根据每个聚类的语义特征自适应调整不确定性分数。为了确保我们分数的统计可靠性，我们使用共形校准应用决策规则来接受/弃权提示，提供了有限样本、无分布的保证，使得接受响应中的错误率保持在用户指定的容差范围内。我们使用不同LLMs和数据集进行的广泛实验评估表明，我们的方法在判别性能、共形保证和概率校准指标方面始终优于最先进的不确定性量化基线。作为一个亮点，对于TriviaQA数据集，我们方法的AUROC为0.88，而令牌熵方法为0.65。

英文摘要

LLMs' overconfidence, particularly when hallucinating, poses a significant challenge for deploying the models in safety-critical settings and makes reliable uncertainty estimation necessary. Existing approaches to uncertainty quantification typically prioritize lexical or probabilistic measures; however, these techniques often ignore the semantic variance among different responses with similar meaning. In this paper, we propose Adaptive Conformal Semantic Entropy (ACSE), a method for estimating prompt-level uncertainty by adaptively measuring semantic dispersion in LLM outputs. Our uncertainty scoring function is based on clustering semantic entropy of multiple diverse responses to the same prompt. The function adaptively adjusts the uncertainty score based on the semantic features of each cluster. To ensure the statistical reliability of our score, we use conformal calibration to apply a decision rule that accepts or abstains on each prompt, providing a finite-sample, distribution-free guarantee that the error rate among the accepted responses remains bounded by a user-specified tolerance. Our extensive experimental evaluations, using different LLMs and datasets, demonstrate that our approach consistently outperforms state-of-the-art uncertainty quantification baselines in terms of discriminative performance, conformal guarantees, and probabilistic calibration indicators. As a highlight, on the TriviaQA dataset, the AUROC of our approach is 0.88, compared with 0.65 for the token-entropy approach.
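
To make "semantic dispersion" concrete, the sketch below computes plain semantic entropy over cluster labels and applies an accept/abstain rule. It assumes the sampled responses have already been grouped into meaning clusters. The labels, the threshold of 0.5, and the function names are hypothetical. The adaptive per-cluster adjustment and the conformal calibration that would set the threshold are not reproduced here.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


def decide(cluster_labels, threshold):
    """Accept the prompt if dispersion is at or below the threshold, else abstain."""
    return "accept" if semantic_entropy(cluster_labels) <= threshold else "abstain"


# Hypothetical cluster labels for five sampled responses per prompt.
consistent = ["paris", "paris", "paris", "paris", "paris"]
mixed = ["paris", "lyon", "paris", "nice", "lyon"]

THRESHOLD = 0.5  # placeholder; ACSE would obtain this via conformal calibration
decision_consistent = decide(consistent, THRESHOLD)
decision_mixed = decide(mixed, THRESHOLD)
```

Responses that all land in one cluster give zero entropy and are accepted. The mixed prompt spreads over three clusters, about 1.05 nats, and is abstained on.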


2605.03509 2026-05-26 cs.CV cs.AI 版本更新

BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement

BFORE: 蝴蝶-萤火虫优化的Retinex增强用于低光图像质量提升

Ahmed Cherif

发表机构 * Sofrecom Tunisia（Sofrecom突尼斯）； Orange Innovation（Orange创新）

AI总结提出BFORE框架，结合蝴蝶优化算法和萤火虫算法自动搜索最佳Retinex增强参数，最大化高斯自然度评分，显著提升低光图像质量。


AI中文摘要

低光图像存在可见度差、噪声和颜色失真问题。现有的基于Retinex的增强方法依赖手动调整参数，无法泛化到不同光照条件。本文提出BFORE（蝴蝶-萤火虫优化的Retinex增强），一个自动为每张图像寻找最佳增强参数的框架。BFORE分两阶段工作：（1）蝴蝶优化算法（BOA）搜索最优的多尺度Retinex带颜色恢复（MSRCR）参数，然后（2）萤火虫算法（FA）微调伽马校正、去噪和颜色参数。两个阶段都最大化高斯自然度评分（GNS），一种衡量增强图像自然度的无参考指标。标准质量指标（PSNR、SSIM、NIQE）仅在优化后计算，确保零数据泄露。在30对合成图像上，BFORE达到GNS=0.971，优于次优方法MSRCR（0.894）8.6%。在来自LOL数据集的115张真实图像上，BFORE达到GNS=0.887，优于MSRCR（0.808）9.8%。与三个在相同条件下训练的深度学习基线（Zero-DCE、SCI、IAT）进行受控比较，BFORE在GNS上超过最佳深度学习方法14.7%。消融研究证实，混合BOA+FA策略显著优于单独使用每种优化器，而在三个评估预算下的可扩展性分析表明，一旦计算资源可用，结构化优化器显著优于均匀随机采样（128次评估时p=0.009，300次评估时p=0.021）。所有改进均具有统计显著性（Wilcoxon符号秩检验p<0.0001）。每张图像在CPU上的处理时间为3-6分钟，适用于离线应用。

英文摘要

Low-light images suffer from poor visibility, noise, and color distortion. Existing Retinex-based enhancement methods rely on manually tuned parameters that do not generalize across different lighting conditions. This paper proposes BFORE (Butterfly-Firefly Optimized Retinex Enhancement), a framework that automatically finds the best enhancement parameters for each image. BFORE works in two phases: (1) a Butterfly Optimization Algorithm (BOA) searches for optimal Multi-Scale Retinex with Color Restoration (MSRCR) parameters, then (2) a Firefly Algorithm (FA) fine-tunes gamma correction, denoising, and color parameters. Both phases maximize a Gaussian Naturalness Score (GNS), a no-reference metric that measures how natural the enhanced image looks. Standard quality metrics (PSNR, SSIM, NIQE) are computed only after optimization, ensuring zero data leakage. On 30 synthetic image pairs, BFORE achieves GNS = 0.971, outperforming the next-best method MSRCR (0.894) by 8.6%. On 115 real images from the LOL dataset, BFORE achieves GNS = 0.887, outperforming MSRCR (0.808) by 9.8%. A controlled comparison with three deep learning baselines (Zero-DCE, SCI, IAT) trained under identical conditions shows BFORE surpasses the best DL method by 14.7% in GNS. An ablation study confirms that the hybrid BOA+FA strategy significantly outperforms each optimizer in isolation, and a scalability analysis at three evaluation budgets shows that the structured optimizer significantly outperforms uniform random sampling once compute is available (p = 0.009 at 128 evaluations, p = 0.021 at 300 evaluations). All improvements are statistically significant (p < 0.0001, Wilcoxon signed-rank test). Processing time is 3-6 minutes per image on CPU, suitable for offline applications.
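
The percentage gains quoted above are relative improvements over the MSRCR score, not absolute differences in GNS. A short check using only the figures stated in the abstract (the function name is hypothetical):

```python
def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# GNS values quoted in the abstract: BFORE vs. MSRCR.
synthetic_gain = relative_gain_percent(0.971, 0.894)  # 30 synthetic pairs
real_gain = relative_gain_percent(0.887, 0.808)       # 115 LOL images

synthetic_rounded = round(synthetic_gain, 1)
real_rounded = round(real_gain, 1)
```

Both rounded values match the reported 8.6% and 9.8%. The absolute GNS differences are 0.077 and 0.079.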


2605.02037 2026-05-26 cs.RO cs.AI 版本更新

VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation

VILAS：一种集成软抓取的VLA低成本机器人操作架构

Zijian An, Hadi Khezam, Bill Cai, Ran Yang, Shijie Geng, Yiming Feng, Yue Zheng, Lifeng Zhou

发表机构 * Drexel University（德雷塞尔大学）； Virginia Seafood Agricultural Research and Extension Center（弗吉尼亚海产农业研究与推广中心）； Amazon Store Foundation AI (SFAI)（亚马逊商店基础人工智能（SFAI））

AI总结提出VILAS低成本模块化机器人操作平台，集成软抓取机构，支持端到端VLA策略学习与部署，并在葡萄抓取任务中验证有效性。


AI中文摘要

我们提出了VILAS，一个完全低成本、模块化的机器人操作平台，旨在支持端到端视觉-语言-动作（VLA）策略学习，并在易于获取的硬件上部署。该系统集成了Fairino FR5协作臂、Jodell RG52-50电动夹爪和双摄像头感知模块，通过基于ZMQ的通信架构，在单一框架内统一协调遥操作、数据收集和策略部署。为了在不依赖显式力传感的情况下安全操作易碎物体，我们设计了一种基于kirigami的软柔性夹爪扩展件，在压缩载荷下产生可预测变形，对脆弱目标提供温和且可重复的接触。我们在VILAS平台上部署并评估了三种最先进的VLA模型：pi_0、pi_0.5和GR00T N1.6。所有模型均使用通过我们的遥操作流水线收集的相同演示数据集，从公开发布的预训练检查点进行微调。在葡萄抓取任务上的实验验证了所提系统的有效性，证实了有能力的操作策略可以在低成本模块化硬件上成功训练和部署。我们的结果进一步为当前VLA模型在真实环境中的部署特性提供了实践见解。

英文摘要

We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system integrates a Fairino FR5 collaborative arm, a Jodell RG52-50 electric gripper, and a dual-camera perception module, unified through a ZMQ-based communication architecture that seamlessly coordinates teleoperation, data collection, and policy deployment within a single framework. To enable safe manipulation of fragile objects without relying on explicit force sensing, we design a kirigami-based soft compliant gripper extension that induces predictable deformation under compressive loading, providing gentle and repeatable contact with delicate targets. We deploy and evaluate three state-of-the-art VLA models on the VILAS platform: pi_0, pi_0.5, and GR00T N1.6. All models are fine-tuned from publicly released pretrained checkpoints using an identical demonstration dataset collected via our teleoperation pipeline. Experiments on a grape grasping task validate the effectiveness of the proposed system, confirming that capable manipulation policies can be successfully trained and deployed on low-cost modular hardware. Our results further provide practical insights into the deployment characteristics of current VLA models in real-world settings.


2604.23703 2026-05-26 cs.HC cs.AI cs.CY 版本更新

Talking Slide Avatars: Open-Source Multimodal Communication Approach for Teaching

Talking Slide Avatars: 面向教学的开源多模态通信方法

Xinxing Wu

发表机构 * School of Mathematics and Computer Science, Kentucky State University（肯塔基州立大学数学与计算机科学学院）

AI总结提出一种集成OpenVoice和Ditto-TalkingHead的开源工作流，用于创建可说话的幻灯片头像，以增强在线教学中的教师存在感和叙事连续性。

Comments 15 pages


AI中文摘要

基于幻灯片的讲授在高等教育中广泛使用，但在在线、混合和异步情境中，幻灯片常常失去教师存在感、叙事连续性和表达框架，而这些有助于学习者与课程内容建立联系。完整的讲座视频可以部分恢复这些特性，但录制、修改和复用耗时。本研究提出了一种基于实践的实现和反思性分析，用于创建可说话的幻灯片头像的开源工作流。该工作流将OpenVoice（用于文本转语音和授权语音风格转换）与Ditto-TalkingHead（用于音频驱动的说话图像合成）相结合，使教师能够将简短的脚本和授权或合成的肖像图像转换为幻灯片或基于HTML的讲座材料的配音视频。本研究不仅将这一工作流视为技术解决方案，还将可说话的幻灯片头像定位为数字教育学、美育和艺术技术实践交叉点的多模态通信产物。本文记录了生产流程，分析了通信和美学可供性，并提出了关于脚本长度、图像选择、节奏、披露、可访问性、同意和伦理使用的实用指南。其贡献并非经过验证的学习干预，而是面向教育者的开源生产模型和通信设计框架。研究得出结论：简短、透明且精心设计的头像，在有选择地使用并采取适当伦理保障时，可为引言、过渡、提醒和总结提供可复用的通信层。

英文摘要

Slide-based teaching is widely used in higher education, yet in online, hybrid, and asynchronous contexts, slides often lose instructor presence, narrative continuity, and expressive framing that help learners connect with course content. Full lecture video can partly restore these qualities, but it is time-consuming to record, revise, and reuse. This study presents a practice-based implementation and analytic reflection of an open-source workflow for creating talking slide avatars. The workflow integrates OpenVoice for text-to-speech and authorized voice-style conversion with Ditto-TalkingHead for audio-driven talking-image synthesis, enabling instructors to transform a short script and an authorized or synthetic portrait image into a narrated video for slide decks or HTML-based lecture materials. Rather than treating this workflow only as a technical solution, the study frames talking slide avatars as multimodal communication artifacts at the intersection of digital pedagogy, aesthetic education, and art-technology practice. The paper documents the production pipeline, analyzes communicative and aesthetic affordances, and proposes practical guidelines for script length, image selection, pacing, disclosure, accessibility, consent, and ethical use. Its contribution is not a validated learning intervention, but an educator-oriented open-source production model and communication-design framework. The study concludes that short, transparent, and carefully designed avatars may provide a reusable communication layer for introductions, transitions, reminders, and recaps when used selectively and with appropriate ethical safeguards.


2604.13088 2026-05-26 cs.LG cs.AI 版本更新

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

序列级奖励的组内学习设计条件：令牌梯度消除

Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

发表机构 * Alibaba Group（阿里巴巴集团）； Tsinghua University（清华大学）

AI总结针对大语言模型多步推理中稀疏终端奖励导致的信用分配问题，提出反事实比较框架和隐式行为策略优化（IBPO），通过轨迹差异近似替代决策，将稀疏奖励转化为步骤敏感信号，提升训练稳定性和推理性能。


AI中文摘要

基于大语言模型的多步推理强化学习通常依赖于稀疏的终端奖励，这导致了不良条件的信用分配问题：最终反馈均匀地传播到所有中间决策。这导致高梯度方差、不稳定的训练和许多无效更新，最终限制了模型的持续改进。我们提出了一种用于信用分配的反事实比较框架。对于每个输入，该框架采样多个推理轨迹，并将它们的差异视为替代决策的隐式近似。这产生了一个隐式过程级优势估计器，将稀疏的终端奖励转化为步骤敏感的学习信号。基于此框架，我们引入了隐式行为策略优化（IBPO），显著提高了数学和代码推理基准上的训练稳定性和性能上限。我们的结果指向了一个有希望的方向，以解锁大语言模型的推理潜力。

英文摘要

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual-comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process-level advantage estimator that converts sparse terminal rewards into step-sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code-reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.


2604.11811 2026-05-26 cs.PL cs.AI cs.CL cs.LG 版本更新

M$^\star$: Every Task Deserves Its Own Memory Harness

M$^\star$：每个任务都应有专属的记忆框架

Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia

发表机构 * City University of Hong Kong（香港城市大学）； Microsoft（微软）

AI总结提出M$^\star$方法，通过可执行程序进化自动发现任务优化的记忆系统，在对话、具身规划和专家推理等任务上优于固定记忆基线。

Comments Preprint. Code: https://github.com/wbopan/mstar ; Live demo: https://mstar.wenbo.io


AI中文摘要

大型语言模型代理依赖专门的记忆系统在长时间交互中积累和重用知识。最近的架构通常采用针对特定领域定制的固定记忆设计，例如用于对话的语义检索或用于编码的技能重用。然而，为某一目的优化的记忆系统往往无法迁移到其他任务。为了解决这一限制，我们引入了M$^\star$，一种通过可执行程序进化自动发现任务优化记忆框架的方法。具体来说，M$^\star$将代理记忆系统建模为用Python编写的记忆程序。该程序封装了数据模式、存储逻辑和代理工作流指令。我们使用反射式代码进化方法联合优化这些组件；该方法采用基于种群的搜索策略，并分析评估失败以迭代改进候选程序。我们在涵盖对话、具身规划和专家推理的四个不同基准上评估M$^\star$。结果表明，M$^\star$在所有评估任务上稳健地优于现有的固定记忆基线。此外，进化出的记忆程序对每个领域展现出结构不同的处理机制。这一发现表明，针对给定任务特化记忆机制探索了广泛的设计空间，并提供了比通用记忆范式更优的解决方案。

英文摘要

Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skill reuse for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ robustly improves performance over existing fixed-memory baselines across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.


2604.03675 2026-05-26 cs.AI cs.CL cs.IR 版本更新

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

OASES：面向智能搜索的结果对齐搜索-评估协同训练

Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China（中国人民大学）； Xiaohongshu Inc.（小红书公司）； University of Southern California（南加州大学）

AI总结提出OASES框架，通过结果对齐的过程奖励和搜索-评估协同训练，解决智能搜索中奖励稀疏和过程监督不可靠的问题，在多跳问答基准上优于强强化学习基线。


AI中文摘要

智能搜索使语言模型能够通过自适应地多步获取外部证据来解决知识密集型任务。具有可验证奖励的强化学习已成为搜索智能体广泛采用的训练范式，但仅结果奖励是稀疏的，并且对中间搜索动作的信用分配有限。因此，现有的过程奖励方法试图通过代理信号、外部评估器或基于似然的信息增益来密集化监督。然而，代理奖励可能偏离最终结果目标，而固定评估器随着搜索策略的演化可能变得过时，导致不可靠的过程监督。为应对这些挑战，我们提出OASES，一种用于智能搜索的结果对齐搜索-评估监督框架。OASES通过评估每个中间搜索状态对回答原始问题的支持程度，推导出结果对齐的过程奖励。它进一步在策略上协同训练搜索策略和状态评估器，使评估器能够适应演化的搜索行为并提供更可靠的过程奖励。在五个多跳问答基准上的实验表明，OASES始终优于强强化学习基线，进一步分析证实了结果对齐过程奖励和搜索-评估协同训练的优势。

英文摘要

Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well each intermediate search state supports answering the original question. It further co-trains the search policy and the state evaluator on-policy, allowing the evaluator to adapt to evolving search behavior and provide more reliable process rewards. Experiments on five multi-hop QA benchmarks show that OASES consistently outperforms strong RL baselines, with further analyses confirming the benefits of outcome-aligned process rewards and search-evaluation co-training.


2603.18444 2026-05-26 cs.LG cs.AI 版本更新

Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

折扣Beta-Bernoulli奖励估计用于基于可验证奖励的样本高效强化学习

Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang

发表机构 * KAIST（韩国科学技术院）

AI总结针对基于可验证奖励的强化学习样本效率低的问题，提出折扣Beta-Bernoulli奖励估计方法，利用历史奖励统计量降低估计方差并避免方差崩溃，在多个推理基准上显著提升性能。

Comments 14 pages, 3 figures


AI中文摘要

基于可验证奖励的强化学习已成为提升大语言模型推理能力的有效后训练范式。然而，现有的基于组的RLVR方法常遭受严重的样本低效问题。这种低效源于对少量rollout的奖励进行点估计，导致高估计方差、方差崩溃以及生成响应的无效利用。在本工作中，我们从统计估计角度重新审视RLVR，将奖励建模为从策略诱导分布中抽取的样本，并将优势计算视为从有限数据中估计奖励分布的问题。基于此观点，我们提出折扣Beta-Bernoulli奖励估计，该方法利用历史奖励统计量处理非平稳分布。尽管有偏，所得估计量展现出降低且稳定的方差，理论上避免了估计方差崩溃，并在均方误差上优于标准点估计。在六个分布内和三个分布外推理基准上的大量实验表明，使用DBB的GRPO一致优于朴素GRPO，在1.7B和8B模型上分别实现了分布内平均Acc@8提升3.22/2.42点，分布外提升12.49/6.92点，且无需额外计算成本或内存开销。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta-Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
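
A minimal sketch of a discounted Beta-Bernoulli estimator for binary rewards follows. The abstract does not give the paper's exact update, so the rule used here is an assumption: old pseudo-counts are decayed by a discount factor before the successes and failures of the new rollout group are added. The class name, the discount of 0.9, and the uniform prior are also hypothetical.

```python
class DiscountedBetaBernoulli:
    """Running Beta(alpha, beta) estimate of a success probability that
    down-weights older rollouts to track a non-stationary policy."""

    def __init__(self, discount=0.9, prior_alpha=1.0, prior_beta=1.0):
        self.discount = discount
        self.alpha = prior_alpha
        self.beta = prior_beta

    def update(self, rewards):
        """Fold in one group of binary rewards (1 = verified correct)."""
        successes = sum(rewards)
        failures = len(rewards) - successes
        self.alpha = self.discount * self.alpha + successes
        self.beta = self.discount * self.beta + failures

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def variance(self):
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total * total * (total + 1.0))


estimator = DiscountedBetaBernoulli()
estimator.update([1, 1, 1, 1])        # a group where every rollout succeeds
mean_after_first = estimator.mean()
variance_after_first = estimator.variance()
estimator.update([0, 1, 0, 0])        # a later, mostly failing group
mean_after_second = estimator.mean()
```

For the all-success group, the within-group sample variance is zero, so a group-normalized point estimate carries no learning signal. The Beta estimate instead keeps a mean below 1 and a strictly positive variance, which is one way to read the abstract's claim about avoiding variance collapse.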

2603.17044 2026-05-26 cs.LG cs.AI cs.CV (updated version)

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

Abinav Rao, Sujan Rachuri

AI summary: Systematic experiments show that generation quality is hard to align when DPO is applied to unified multimodal models, mainly because understanding and generation gradients are near-orthogonal with an 11-14x magnitude imbalance that stems from VQ token count asymmetry.

Comments: Experiments are inconclusive: The claim that architectures such as Chameleon or Emu would exhibit stronger gradient conflict is not supported by experiments or analysis, and all experiments are conducted on Janus-Pro without evaluation on other unified multimodal architectures

Abstract

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
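Two quantities in this abstract can be made concrete with a short sketch: the per-pair DPO loss, which equals ln(2) whenever the implicit reward margin is zero, and the cosine similarity and norm ratio used to compare two gradients. The gradient vectors below are toy values chosen for illustration, not measurements from the paper, and `beta=0.1` is an assumed temperature.

```python
import math


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss, -log(sigmoid(beta * margin)).

    `margin` is the policy-vs-reference log-ratio of the chosen response
    minus the same log-ratio of the rejected response.
    """
    return math.log1p(math.exp(-beta * margin))


def norm(u):
    return math.sqrt(sum(a * a for a in u))


def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def norm_ratio(u, v):
    """How many times larger the first gradient is than the second."""
    return norm(u) / norm(v)


g_generation = [12.0, 0.0]     # toy gradient, illustrative only
g_understanding = [0.0, 1.0]   # toy gradient, illustrative only

print(dpo_loss(0.0), math.log(2))            # zero margin gives ln(2)
print(cosine(g_generation, g_understanding))
print(norm_ratio(g_generation, g_understanding))
```

A loss that sits at ln(2) therefore indicates that the policy is not separating chosen from rejected samples any better than the reference model does.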

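2602.13203 2026-05-26 cs.NI cs.AI (updated version)

Adversarial Network Imagination: Causal LLMs and Digital Twins for Proactive Telecom Mitigation

Vignesh Sriram, Yuqiao Meng, Luoxi Tang, Zhaohan Xi

Affiliations: Binghamton University

AI summary: Proposes the Adversarial Network Imagination framework, which combines causal large language models, knowledge graphs, and digital twins to proactively generate, simulate, and evaluate adversarial network failures, shifting from reactive troubleshooting to anticipatory resilience analysis.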

2602.10527 2026-05-26 cs.CY cs.AI (updated version)

AI-PACE: A Framework for Integrating AI into Medical Education

Scott P. McGrath, Katherine K. Kim, Karnjit Johl, Haibo Wang, Nick Anderson

Affiliations: Center for Information Technology in the Interest of Society; University of California, Berkeley; School of Medicine, Department of Public Health Sciences; University of California, Davis; School of Medicine, Department of Internal Medicine; Research Centre of Big Data and AI for Medicine; First Affiliated Hospital of Sun Yat-Sen University

AI summary: Based on a literature review, the paper proposes the AI-PACE framework for systematically integrating AI education into every stage of medical training, emphasizing longitudinal integration, interdisciplinary collaboration, and a balance between technical foundations and clinical applications.

Comments: Version 2: Revisions after round 1 of peer review. Paper under consideration at npj Digital Medicine. 12 pages, 2 figures, 2 tables

DOI: 10.1038/s41746-026-02768-2

Abstract

The integration of artificial intelligence (AI) into healthcare is accelerating, yet medical education has not kept pace with these technological advancements. This paper synthesizes current knowledge on AI in medical education through a comprehensive analysis of the literature, identifying key competencies, curricular approaches, and implementation strategies. The aim is to highlight the critical need for structured AI education across the medical learning continuum and to offer a framework for curriculum development. The findings presented suggest that effective AI education requires longitudinal integration throughout medical training, interdisciplinary collaboration, and balanced attention to both technical fundamentals and clinical applications. This paper serves as a foundation for medical educators seeking to prepare future physicians for an AI-enhanced healthcare environment.

2602.10090 2026-05-26 cs.AI cs.CL cs.LG (updated version)

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

Affiliations: University of North Carolina at Chapel Hill

AI summary: Proposes Agent World Model (AWM), a fully synthetic environment generation pipeline that enables large-scale reinforcement learning in code-driven, database-backed environments so that agents generalize across diverse everyday scenarios.

Comments: Accepted to ICML 2026

Abstract

Recent advances in large language models (LLMs) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
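To illustrate what "code-driven and backed by databases" can mean in practice, here is a toy sketch of an environment whose state lives in SQLite and whose reward is computed by querying that state. The `TodoEnv` class, its tool names, and the task are hypothetical and are not taken from the AWM repository.

```python
import sqlite3


class TodoEnv:
    """Hypothetical toy environment: all state lives in an in-memory SQLite DB."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT, done INTEGER)"
        )
        self.db.execute("INSERT INTO todos (title, done) VALUES ('file taxes', 0)")

    def call_tool(self, name, **args):
        """Deterministic tool dispatch: the same call always yields the same transition."""
        if name == "list_todos":
            return self.db.execute("SELECT title, done FROM todos").fetchall()
        if name == "complete_todo":
            cur = self.db.execute(
                "UPDATE todos SET done = 1 WHERE title = ?", (args["title"],)
            )
            return {"updated": cur.rowcount}
        raise ValueError(f"unknown tool: {name}")

    def reward(self):
        """Verify the task from database state rather than from the agent's text."""
        (remaining,) = self.db.execute(
            "SELECT COUNT(*) FROM todos WHERE done = 0"
        ).fetchone()
        return 1.0 if remaining == 0 else 0.0


env = TodoEnv()
print(env.call_tool("list_todos"))
env.call_tool("complete_todo", title="file taxes")
print(env.reward())
```

Because the reward is a query over executable state, it does not depend on an LLM judging the transcript, which is the reliability property the abstract highlights.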

2602.09620 2026-05-26 cs.AI cs.LO (updated version)

FLINGO -- Instilling ASP Expressiveness into Linear Integer Constraints

Jorge Fandinno, Pedro Cabalar, Philipp Wanko, Torsten Schaub

Affiliations: University of Corunna; University of Nebraska Omaha; University of Potsdam

AI summary: Presents the FLINGO language and tool, which extend constraint answer set programming by bringing ASP expressiveness (default values, undefined attributes, non-deterministic choices, and aggregates) into numerical constraints, together with a translation to the clingcon format.

Comments: To appear in Theory and Practice of Logic Programming

Abstract

Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processing, a crucial requirement for many real-world applications. However, the specification of constraints in most CASP solvers aligns more closely with the expressiveness and semantics of the numerical back-end than the ASP paradigm. In the latter, numerical attributes are represented as predicates, which allows declaring default values, leaving the attribute undefined, making non-deterministic assignments with choice rules, or using aggregated values. In CASP, most (if not all) of these features are lost once we switch to a constraint-based representation of those same attributes. In this paper, we present the flingo language (and tool) that incorporates the aforementioned expressiveness within numerical constraints, and we illustrate its use with several examples. Based on previous work that established its semantic foundations, we also present a translation from the newly introduced flingo syntax to regular CASP programs following the clingcon input format.

2602.03955 2026-05-26 cs.AI cs.MA (updated version)

Why Does Your Deep Research Agent Fail? On Hallucination Evaluation in Full Research Trajectories

Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang

Affiliations: Zhejiang University; The University of Hong Kong

AI summary: To address the hallucinations that deep research agents (DRAs) accumulate over full research trajectories, the paper proposes shifting from outcome-based to process-aware evaluation with the PING taxonomy and a fine-grained evaluation framework, and builds the DeepHalluBench benchmark; experiments reveal systematic reliability gaps.

Abstract

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations in the full plan-search-summarize trajectory. We introduce the PING Taxonomy, which categorizes DRA hallucinations into four complementary types: Propagation, Intent, Noise-induced, and Grounding. We further instantiate this taxonomy into a fine-grained evaluation framework that decomposes trajectories into atomic actions, claims, and sub-queries for rigorous verification. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks, including adversarial scenarios, we curate DeepHalluBench. Experiments on six representative DRAs show that, on our hallucination-prone stress-test set, all evaluated systems still exhibit non-negligible reliability gaps. Furthermore, our diagnostic analysis traces these failures to systemic deficits, especially hallucination propagation and cognitive biases, providing actionable insights for future architectural optimization. Code and data are available at https://github.com/yuhao-zhan/DeepHalluBench.

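2601.21726 2026-05-26 cs.AI (updated version)

Future-KL Regularized GRPO: Process-Level Credit Assignment via f-Divergence Regularization

Jiarui Yao, Ruida Wang, Hao Bai, Tong Zhang

Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes Future-KL Regularized Policy Optimization (FRPO), which uses a causal, future-regularized return to restore the gradient signal missing from the local KL loss in GRPO, improving pass@16 on mathematical reasoning tasks while maintaining higher entropy and lower policy drift.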

DIVER-1: Scaling an Intracranial EEG Foundation Model for Transferable Representations

Danny Dongyeop Han, Yonghyeon Gwon, Ahhyun Lucy Lee, Taeyang Lee, Seong Jin Lee, Jubin Choi, Sebin Lee, Jihyun Bang, Seungju Lee, David Keetae Park, Shinjae Yoo, Chun Kee Chung, Jiook Cha

Affiliations: Seoul National University; Brookhaven National Laboratory

AI summary: Proposes DIVER-1, a self-supervised iEEG foundation model that handles variable inputs through any-variate electrode-time attention, spatio-temporal resampling, and related design choices. Pretrained on 5,310 hours of ECoG and SEEG, it outperforms existing models on cognitive decoding and seizure detection, and the paper presents the first controlled compute-aware scaling study.

Comments: 31 pages, 12 figures, 14 tables

Abstract

Intracranial EEG (iEEG) provides direct, millisecond-scale recordings of human neural activity, but reusable representation learning is difficult because electrode layouts, anatomical coverage, referencing schemes, and recording conditions vary across patients and centers. We introduce DIVER-1, a self-supervised iEEG foundation model for variable-input recordings that combines any-variate electrode-time attention, spatio-temporal resampling, input-conditioned positional embeddings, and multi-domain masked reconstruction without assuming a fixed electrode montage. We pretrain two variants, DIVER-1-0.1s and DIVER-1-1s, on 5,310 hours of ECoG and SEEG spanning 352k channel-hours, roughly 54x the BrainTreeBank-based pretraining volume. We evaluate DIVER-1 on two held-out benchmarks: Neuroprobe for naturalistic cognitive decoding and MAYO for seizure detection. On leakage-aware Neuroprobe, DIVER-1-0.1s outperforms prior evaluated iEEG foundation models despite using no BrainTreeBank recordings, the corpus underlying Neuroprobe, during pretraining; it also exceeds the linear spectrogram decoder in mean AUROC and remains competitive with stronger nonlinear baselines, a level prior evaluated iEEG foundation models did not reach. DIVER-1-1s also achieves the top AUROC on MAYO seizure detection. Finally, we conduct, to our knowledge, the first controlled compute-aware scaling study for self-supervised iEEG pretraining, sweeping data scale, subject count, training duration, and model size up to 1.8B parameters. Our results indicate a data-constrained regime: expanding unique recordings and training sufficiently long are more reliable scaling axes than increasing parameter count alone. Code is available at link.

2512.12576 2026-05-26 cs.CL cs.AI (updated version)

Coupled Variational Reinforcement Learning for Language Model General Reasoning

Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes CoVRL, which couples the prior and posterior distributions through a hybrid sampling strategy to combine variational inference with reinforcement learning, addressing the inefficient exploration and trace-answer incoherence of verifier-free RL and improving performance on mathematical and general reasoning benchmarks.

Comments: Accepted to ICML 2026

Abstract

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

2512.11941 2026-05-26 cs.CV cs.AI (updated version)

DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition

Jingmin Zhu, Anqi Zhu, James Bailey, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Qiuhong Ke

Affiliations: Monash University; Lancaster University; University of Western Australia

AI summary: Proposes the DynaPURLS framework, which uses multi-scale visual-semantic correspondences and a dynamic refinement module to address domain shift in zero-shot skeleton-based action recognition, achieving the best results on three benchmark datasets.

Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

DOI: 10.1109/TPAMI.2026.3680873

Abstract

Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is made publicly available at https://github.com/Alchemist0754/DynaPURLS

2511.21734 2026-05-26 cs.CL cs.AI (updated version)

Asking LLMs to Verify First is Almost Free Lunch

Shiguang Wu, Quanming Yao

Affiliations: Department of Electronic Engineering

AI summary: Proposes the Verification-First (VF) strategy, which verifies a candidate answer before generating a solution to improve reasoning at low computational overhead, and extends it to the iterative Iter-VF method; the approach outperforms standard CoT and existing TTS strategies on multiple benchmarks.

Abstract

To enhance the reasoning capabilities of Large Language Models (LLMs) without costly training or extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a "reverse reasoning" process complementary to standard forward Chain-of-Thought (CoT), which restricts the logical search space of the answer by pruning the LLM's output distribution. We further generalize VF prompting to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model's previous answer. Extensive experiments across multiple benchmarks and LLMs confirm that VF prompting with a random answer consistently outperforms standard CoT with minimal computational overhead, and that Iter-VF outperforms existing TTS strategies. VF is also effective on SOTA thinking models. For example, with simple VF prompting we obtain a new SOTA accuracy of 94.9% on GPQA-Diamond with Gemini-3-Pro-Preview, where VF reduces the model's errors by ~30% in relative terms.
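The control flow described above can be sketched in a few lines. The prompt wording, the function names, and the convention that the model callable returns only its final answer are all assumptions made for illustration; the paper's actual prompts are not given in the abstract.

```python
from typing import Callable


def build_vf_prompt(question: str, candidate: str) -> str:
    """Hypothetical Verification-First prompt: check a candidate, then solve."""
    return (
        f"Question: {question}\n"
        f"A proposed answer is: {candidate}\n"
        "First verify step by step whether the proposed answer is correct. "
        "Then solve the problem yourself and state your final answer."
    )


def iter_vf(
    question: str,
    model: Callable[[str], str],
    initial_answer: str,
    rounds: int = 3,
) -> str:
    """Iter-VF-style loop: feed each round's answer back in as the next candidate."""
    answer = initial_answer
    for _ in range(rounds):
        answer = model(build_vf_prompt(question, answer))
    return answer


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    return "4"


print(build_vf_prompt("What is 2 + 2?", "7"))
print(iter_vf("What is 2 + 2?", fake_model, initial_answer="7", rounds=2))
```

In a real setup `fake_model` would be replaced by a call to an LLM plus an answer-extraction step; the sequential loop is what makes the method a test-time scaling strategy rather than a single prompt.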

2511.08654 2026-05-26 cs.CY cs.AI cs.CL (updated version)

AI-generated podcasts: Synthetic Intimacy and Cultural Mistranslation in NotebookLM's Audio Overviews

Jill Walker Rettberg

Affiliations: University of Bergen; Center for Digital Narrative

AI summary: Analyzes AI podcasts generated by Google's NotebookLM, revealing their fixed template structure and the way they translate texts and cultural contexts into a white, educated, middle-class American default.

Comments: This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement number 101142306. The project is also supported by the Center for Digital Narrative, which is funded by the Research Council of Norway through its Centres of Excellence scheme, project number 332643. Media, Culture & Society, online first (2026)

DOI: 10.1177/01634437261452160

Abstract

This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs I show how the podcasts' structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent, it also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media, marking a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, to an abstraction of the podcast genre.

2511.04556 2026-05-26 cs.AI cs.CE (updated version)

Optimizing Sensor Placement for Flow Reconstruction in Urban Drainage Networks: A Digital Twin-Based Sparse Sensing Approach

Zihang Ding, Amit Kumar, Imran Md. Azizul Islam, Mila Avellar Montezuma, Ruihang Zhang, Kun Zhang

Affiliations: Department of Civil and Environmental Engineering, University of Minnesota Duluth; Institute for Water Education, UNESCO IHE Delft; Department of Mechanical and Industrial Engineering, University of Minnesota Duluth

AI summary: To address monitoring and flow prediction in urban drainage networks under resource constraints, the paper proposes a digital twin-based, data-driven sparse sensing method that optimizes sensor locations via singular value decomposition and QR factorization for system-level flow reconstruction. In validation on the Woodland catchment in Duluth, Minnesota, three sensors achieve a mean NSE of 0.949.

Comments: 32 pages (including supplementary information), 11 figures. Submitted to Water Research. Partially presented at HydroML 2025 Symposium, Minnesota Water Resources Conference 2025, and AGU Fall Meeting 2025

Abstract

Urban flooding triggered by intense rainfall is becoming increasingly frequent and widespread. While flood prediction and monitoring in high spatio-temporal resolution are desired, practical constraints in time, budget, and technology hinder its full implementation. How to monitor urban drainage networks and predict flow conditions under constrained resources is a major challenge. To address this, we introduced a data-driven sparse sensing (DSS) approach, demonstrated via a digital-twin of the Woodland catchment in Duluth, Minnesota. Specifically, we coupled EPA-SWMM with singular value decomposition and QR factorization-based sensor selection to optimize monitoring locations for system-level flow reconstruction. An ensemble of SWMM simulations, driven by diverse scenarios, provided the necessary hydraulic data to extract the reduced basis and identify informative sensor locations. Cross-event validation showed that three strategically placed sensors among 77 candidate nodes achieved a mean system-level Nash-Sutcliffe efficiency (NSE) of 0.949 across observed storm events. The QR-selected sensor sets were benchmarked against reference sensor configurations obtained from exhaustive searches and Monte Carlo random-placements. This comparison further showed that flow reconstruction based on QR-selected sensors closely tracked the exhaustive optimum while substantially outperforming random placements. We further evaluated the framework's robustness by introducing multiplicative Gaussian noise and simulating individual sensor failures. While the model is relatively resilient to noise, the impact of sensor dropouts depends heavily on the number of sensors allocated and their specific locations.
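The headline metric here, the Nash-Sutcliffe efficiency (NSE), compares reconstruction error against the variance of the observations: 1 is a perfect match and 0 is no better than predicting the observed mean. A minimal sketch of the standard formula follows; the function name and the sample flows are illustrative and do not come from the paper.

```python
def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated):
        raise ValueError("series must have the same length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - residual / variance


observed_flow = [1.0, 2.0, 3.0]            # illustrative values only
print(nash_sutcliffe(observed_flow, [1.0, 2.0, 3.0]))   # perfect reconstruction
print(nash_sutcliffe(observed_flow, [2.0, 2.0, 2.0]))   # mean-only prediction
```

A system-level score such as the one reported would presumably average this quantity over nodes and events, although the exact aggregation is not stated in the abstract.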

2509.07961 2026-05-26 cs.AI (updated version)

Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare

Valen Tagliabue, Leonard Dung

Affiliations: Future Impact Group (FIG); Ruhr-University Bochum

AI summary: Measures language model preferences through verbal reports and behavioral experiments (virtual environment navigation and topic selection), finding that preference satisfaction can serve as an empirical proxy for AI welfare, although consistency between measures varies across models and conditions.

Comments: Forthcoming in Philosophy and the Mind Sciences (PhiMiSci)

Abstract

We develop new experimental paradigms for measuring welfare in language models. We compare verbal reports of models about their preferences with preferences expressed through behavior when navigating a virtual environment and selecting conversation topics. We also test how costs and rewards affect behavior and whether responses to an eudaimonic welfare scale - measuring states such as autonomy and purpose in life - are stable across semantically equivalent prompts. Overall, we observed a notable degree of mutual support between our measures. The reliable correlations observed between stated preferences and behavior across conditions suggest that preference satisfaction can, in principle, serve as an empirically measurable welfare proxy in some of today's AI systems. Furthermore, our design offered an illuminating setting for qualitative observation of model behavior. Yet, the consistency between measures was more pronounced in some models and conditions than others and responses were changed by perturbations. Due to this, and the background uncertainty about the nature of welfare and the cognitive states (and welfare subjecthood) of language models, we are currently uncertain whether our methods successfully measure the welfare state of language models. Nevertheless, these findings highlight the feasibility of welfare measurement in language models, inviting further exploration.

2508.15760 2026-05-26 cs.CL cs.AI (updated version)

FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models

Hariharan Ramesh, Jyotikrishna Dass

Affiliations: Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country

AI summary: Proposes the FLoRIST framework, which decomposes local adapters in a compact intermediate space with singular value thresholding to achieve mathematically accurate aggregation while remaining communication- and computation-efficient.

Comments: 21 pages, 12 figures

Journal ref: Ninth Conference on Machine Learning and Systems (MLSys 2026)

Abstract

Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Language Models (LLMs) without sharing local data. However, several methods designed for federated LoRA present significant challenges in balancing communication efficiency, model accuracy, and computational cost, particularly among heterogeneous clients. These methods either rely on simplistic averaging of local adapters, which introduces aggregation noise, require transmitting large stacked local adapters, leading to poor communication efficiency, or necessitate reconstructing memory-dense global weight-update matrix and performing computationally expensive decomposition to design client-specific low-rank adapters. In this work, we propose FLoRIST, a federated fine-tuning framework that achieves mathematically accurate aggregation without incurring high communication or computational overhead. Instead of constructing the full global weight-update matrix at the server, FLoRIST employs an efficient decomposition pipeline by performing singular value decomposition on stacked local adapters separately. This approach operates within a compact intermediate space to represent the accumulated information from local LoRAs. We introduce tunable singular value thresholding for server-side optimal rank selection to construct a pair of global low-rank adapters shared by all clients. Extensive empirical evaluations across multiple datasets and LLMs demonstrate that FLoRIST consistently strikes the best balance between superior communication efficiency and competitive performance in both homogeneous and heterogeneous setups.
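The "aggregation noise" of simple adapter averaging comes from the fact that the product of averaged factors is not the average of the products, while stacking the factors reproduces the average exactly. The toy example below checks this with two hypothetical rank-1 clients and uniform weights; it illustrates only the linear-algebra identity and does not implement FLoRIST's SVD or thresholding steps.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def mat_mean(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]


# Two hypothetical clients with rank-1 LoRA factors for a 2x2 weight update.
B = [[[1.0], [0.0]], [[0.0], [1.0]]]   # each B_i is 2x1
A = [[[1.0, 0.0]], [[0.0, 1.0]]]       # each A_i is 1x2
n = len(B)

# Target: the true average of the per-client updates B_i @ A_i.
exact = mat_mean([matmul(b, a) for b, a in zip(B, A)])

# Naive aggregation: average the factors separately, then multiply.
naive = matmul(mat_mean(B), mat_mean(A))

# Stacked aggregation: concatenate B_i column-wise and the scaled A_i row-wise.
B_stack = [sum((b[i] for b in B), []) for i in range(len(B[0]))]
A_stack = [[v / n for v in row] for a in A for row in a]
stacked = matmul(B_stack, A_stack)

print(exact)
print(naive)
print(stacked)
```

The stacked factors grow with the number of clients, which is the communication cost the abstract mentions and the reason a subsequent low-rank compression step is attractive.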

2506.09084 2026-05-26 cs.LG cs.AI (updated version)

PageLLM: A Multi-Grained Reward Framework for Whole-Page Optimization with Large Language Models

Xinyuan Wang, Liang Wu, Dongjie Wang, Yanjie Fu

Affiliations: Arizona State University; Nokia; University of Kansas

AI summary: To address the high cost of human annotation and the granularity mismatch between page-level coherence and item-level placement in whole-page optimization, the paper proposes the PageLLM framework, which decouples implicit feedback into a coarse page-level reward and a fine item-level reward and fine-tunes with PPO-based RLHF, significantly improving ranking performance and delivering gains in online deployment.

Abstract

Whole-page optimization (WPO) decides how search and recommendation results are surfaced to users, and large language models (LLMs) open a new route to it by treating page generation as sequence generation. Adapting LLMs to web-scale WPO, however, remains bottlenecked by the need for costly human annotations and by the mismatched granularity between page-level coherence and item-level placement. In this work we show that these two challenges are coupled: implicit user feedback alone suffices for alignment, provided the reward signal is decoupled into two complementary granularities. We propose PageLLM, a reward-based fine-tuning framework that (i) turns implicit feedback into four contrastive preference-pair families covering relevance, ranking, diversity, and redundancy, (ii) learns a coarse page-level reward and a fine item-level reward that captures engagement-sensitive position swaps, and (iii) combines both rewards in PPO-based RLHF over a pre-trained LLM. Extensive experiments on seven Amazon categories against eleven baselines show that neither reward alone is sufficient -- dropping the page-level or item-level signal reduces NDCG@100 by 17.8% and 15.2% respectively, whereas the joint reward improves NDCG@100 by up to 46.8%. Deployed in a 10M-user online A/B test, PageLLM raises GMV by 0.44% and click-through rate by 0.14%, confirming that multi-grained rewards from implicit feedback scale to production WPO. Code and data are available at an anonymized repository.
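
The headline numbers above are reported in NDCG@100. As a reminder of what that metric measures, here is a minimal sketch of NDCG@k. It assumes linear gains and a log2 position discount, and it computes the ideal ordering from the same list of relevance labels. The abstract does not say which gain variant the paper uses, so treat these choices as assumptions.

```python
import math

def dcg_at_k(relevances, k):
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalise by the DCG of the best possible ordering of the same items.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A page that already lists items in decreasing relevance scores 1.0, and any swap that moves a more relevant item lower reduces the score.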

2505.23803 2026-05-26 cs.CR cs.AI 版本更新

MultiPhishGuard: An Explainable and Adaptive Multi-Agent LLM System for Phishing Email Detection

MultiPhishGuard: 一种用于钓鱼邮件检测的可解释且自适应的多智能体大语言模型系统

Yinuo Xue, Eric Spero, Meng Wai Woo, Wei Gao, Giovanni Russello

发表机构 * The University of Auckland（奥克兰大学）

AI总结提出基于LLM的多智能体框架MultiPhishGuard，通过协调文本、URL、元数据等五个专业智能体并利用PPO动态加权，结合对抗训练提升对新型钓鱼策略的鲁棒性，在公开数据集上达到97.89%准确率。

AI中文摘要

由于不断演变的对抗策略和异构攻击模式，钓鱼邮件检测面临重大挑战。传统方法（如基于规则的过滤器和黑名单）往往难以跟上步伐，导致漏检和安全风险。虽然机器学习方法提高了检测性能，但在适应新颖且快速变化的钓鱼策略方面仍然有限。我们提出了MultiPhishGuard，一个基于LLM的多智能体检测框架，具有跨专业智能体的学习协调能力。该系统由五个协作智能体（文本、URL、元数据、解释简化器和对抗智能体）组成，使用近端策略优化动态加权智能体贡献。为了应对新兴威胁，该框架包含一个对抗训练循环，其中基于LLM的智能体生成细微的、上下文感知的邮件变体，以暴露潜在模型弱点并提高对模糊钓鱼案例的鲁棒性。在公开数据集上的实验评估表明，MultiPhishGuard在性能上优于既定基线，包括思维链提示和单智能体变体，消融研究和比较分析支持了这一点。该系统达到97.89%的准确率，假阳性率为2.73%，假阴性率为0.20%。此外，解释简化器智能体将技术模型输出转化为面向人类用户的通俗语言解释。总体而言，这些结果表明，具有自适应协调和对抗训练的多智能体LLM架构代表了钓鱼邮件检测的一个有前景的方向。

英文摘要

Phishing email detection faces significant challenges due to evolving adversarial tactics and heterogeneous attack patterns. Traditional approaches, such as rule-based filters and denylists, often struggle to keep pace, leading to missed detections and security risks. While machine learning methods have improved detection performance, they remain limited in adapting to novel and rapidly changing phishing strategies. We present MultiPhishGuard, an LLM-based multi-agent detection framework with learned coordination across specialized agents. The system consists of five cooperative agents (text, URL, metadata, explanation simplifier, and adversarial agents), with agent contributions dynamically weighted using Proximal Policy Optimization. To address emerging threats, the framework incorporates an adversarial training loop in which an LLM-based agent generates subtle, context-aware email variants to expose potential model weaknesses and improve robustness to ambiguous phishing cases. Experimental evaluations on public datasets show that MultiPhishGuard achieves stronger performance than established baselines, including Chain-of-Thought prompting and single-agent variants, as supported by ablation studies and comparative analyses. The system achieves an accuracy of 97.89%, with a false positive rate of 2.73% and a false negative rate of 0.20%. In addition, an explanation simplifier agent transforms technical model outputs into plain-language rationales intended for human users. Overall, these results suggest that multi-agent LLM architectures with adaptive coordination and adversarial training represent a promising direction for phishing email detection.
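
The accuracy, false positive rate, and false negative rate quoted above are all derived from a confusion matrix. The sketch below shows the standard definitions, with phishing as the positive class. The function name is hypothetical and any counts passed in are purely illustrative, not data from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard binary-detection metrics, with phishing as the positive class."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Share of legitimate emails wrongly flagged as phishing.
        "false_positive_rate": fp / (fp + tn),
        # Share of phishing emails that slipped through undetected.
        "false_negative_rate": fn / (fn + tp),
    }
```

Because the two error rates have different denominators, a system can combine a very low false negative rate with a noticeably higher false positive rate, as in the figures reported here.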

2410.15173 2026-05-26 cs.CL cs.AI 版本更新

Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation

揭示自回归LLM在事件表示中主题适配性的知识

Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton

发表机构 * Imperial College London（伦敦帝国学院）； Columbia University（哥伦比亚大学）； University of Washington（华盛顿大学）

AI总结通过多种提示设计、输入上下文操作、推理和输出形式，研究自回归大语言模型是否具有一致且可表达的事件参数主题适配性知识，并在基准测试上取得新最优结果。

Comments Significant update with massive changes: all experiments rerun with current LLMs; includes new probability estimate analysis and expanded results in Sections 4 and 5. The paper has been accepted to CoNLL-2026

2403.04780 2026-05-26 cs.CL cs.AI 版本更新

Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining

面向通用图挖掘的大语言模型图导向指令微调

Yanchao Tan, Hang Lv, Pengxiang Zhan, Shiping Wang, Carl Yang

发表机构 * Engineering Research Center of Big Data Intelligence, Ministry of Education（教育部大数据智能工程研究中心）； Fujian Key Laboratory of Network Computing and Intelligent Information Processing（福建省网络计算与智能信息处理重点实验室）； College of Computer and Data Science, Fuzhou University（福州大学计算机与数据科学学院）； Department of Computer Science, Emory University（埃默里大学计算机科学系）

AI总结提出MuseGraph框架，通过紧凑图描述、基于思维链的指令生成和图感知指令微调，将GNN与LLM结合，实现跨任务和数据集的高效图挖掘。

Comments Accepted by TPAMI 2025

DOI: 10.1109/TPAMI.2025.3603062
Journal ref: IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 1, pp. 155-169, Jan. 2026

AI中文摘要

具有丰富属性的图对于建模互联实体和增强各种实际应用中的预测至关重要。传统的图神经网络（GNN）通常需要针对不同的图任务和数据集进行重新训练。尽管大语言模型（LLM）的出现为自然语言处理带来了新范式，但它们在通用图挖掘（即训练单个模型同时处理多样任务和数据集）方面的潜力仍未充分探索。为此，我们的新颖框架MuseGraph无缝地将GNN和LLM的优势整合到一个基础模型中，用于跨任务和数据集的图挖掘。该框架首先采用紧凑的图描述，在语言令牌限制内封装关键图信息。然后，我们提出了一种基于思维链（CoT）指令包的多样化指令生成机制，以从GPT-4等高级LLM中提取推理能力。最后，我们设计了一种图感知的指令微调策略，以促进多个任务和数据集之间的相互增强，同时防止LLM生成能力的灾难性遗忘。我们的实验结果表明，在五个图任务和十个数据集上取得了显著改进，展示了MuseGraph在提高图导向下游任务准确性的同时增强LLM生成能力的潜力。

英文摘要

Graphs with abundant attributes are essential in modeling interconnected entities and enhancing predictions across various real-world applications. Traditional Graph Neural Networks (GNNs) often require re-training for different graph tasks and datasets. Although the emergence of Large Language Models (LLMs) has introduced new paradigms in natural language processing, their potential for generic graph mining, training a single model to simultaneously handle diverse tasks and datasets, remains under-explored. To this end, our novel framework MuseGraph seamlessly integrates the strengths of GNNs and LLMs into one foundation model for graph mining across tasks and datasets. This framework first features a compact graph description to encapsulate key graph information within language token limitations. Then, we propose a diverse instruction generation mechanism with Chain-of-Thought (CoT)-based instruction packages to distill the reasoning capabilities from advanced LLMs like GPT-4. Finally, we design a graph-aware instruction tuning strategy to facilitate mutual enhancement across multiple tasks and datasets while preventing catastrophic forgetting of LLMs' generative abilities. Our experimental results demonstrate significant improvements in five graph tasks and ten datasets, showcasing the potential of our MuseGraph in enhancing the accuracy of graph-oriented downstream tasks while improving the generation abilities of LLMs.

2305.11663 2026-05-26 cs.LG cs.AI cs.CL cs.CY 版本更新

Algorithmic failure as a humanities methodology: machine learning's mispredictions identify rich cases for qualitative analysis

作为人文学科方法论的算法失败：机器学习的错误预测识别出用于定性分析的丰富案例

Jill Walker Rettberg

AI总结本文通过实验验证了Munk等人提出的利用机器学习失败预测识别定性分析中模糊且丰富案例的方法，使用简单kNN算法对虚构角色与机器视觉技术互动的动作数据进行分类，发现不可预测的动作更具矛盾性和情感负荷，支持该方法在人文学科中的适用性。

DOI: 10.1177/20539517221131290
Journal ref: Big Data & Society 9(2) 2022

AI中文摘要

本文评论测试了Munk等人（2022）提出的一种方法论，即利用机器学习中的失败预测作为识别定性分析中模糊且丰富案例的方法。使用一个描述500件艺术品、电影、小说和电子游戏中虚构角色与机器视觉技术互动动作的数据集，我训练了一个简单的机器学习算法（使用R中的kNN算法），仅根据虚构角色的信息预测动作是主动还是被动。可预测的动作通常是缺乏情感且明确的，其中机器视觉技术被当作简单工具。不可预测的动作，即算法无法正确预测的动作，则更加矛盾且情感负荷更重，角色与技术之间的权力关系更为复杂。因此，结果支持Munk等人的理论，即失败预测可以有效地用于识别定性分析的丰富案例。本测试不仅简单复制了Munk等人的结果，还证明了该方法可以应用于更广泛的人文学科领域，并且不需要复杂的神经网络，简单的机器学习算法也能奏效。需要进一步研究以理解该方法适用于哪些类型的数据以及哪种机器学习最具生成性。为此，附上了产生结果所需的R代码，以便复制测试。该代码也可重复使用或改编，以在其他数据集上测试该方法。

英文摘要

This commentary tests a methodology proposed by Munk et al. (2022) for using failed predictions in machine learning as a method to identify ambiguous and rich cases for qualitative analysis. Using a dataset describing actions performed by fictional characters interacting with machine vision technologies in 500 artworks, movies, novels and videogames, I trained a simple machine learning algorithm (using the kNN algorithm in R) to predict whether or not an action was active or passive using only information about the fictional characters. Predictable actions were generally unemotional and unambiguous activities where machine vision technologies were treated as simple tools. Unpredictable actions, that is, actions that the algorithm could not correctly predict, were more ambivalent and emotionally loaded, with more complex power relationships between characters and technologies. The results thus support Munk et al.'s theory that failed predictions can be productively used to identify rich cases for qualitative analysis. This test goes beyond simply replicating Munk et al.'s results by demonstrating that the method can be applied to a broader humanities domain, and that it does not require complex neural networks but can also work with a simpler machine learning algorithm. Further research is needed to develop an understanding of what kinds of data the method is useful for and which kinds of machine learning are most generative. To support this, the R code required to produce the results is included so the test can be replicated. The code can also be reused or adapted to test the method on other datasets.
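
The workflow described above (train a simple kNN classifier, then read the mispredicted cases closely) can be sketched as follows. The commentary itself ships R code. This Python version is only an illustrative stand-in, and it assumes numeric feature vectors, Euclidean distance, and leave-one-out evaluation, none of which are specified in the abstract.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; features are numeric tuples."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def failed_predictions(data, k=3):
    """Return indices of cases the classifier gets wrong under leave-one-out."""
    failures = []
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) != label:
            failures.append(i)
    return failures
```

The indices returned by `failed_predictions` are the candidates for qualitative analysis: cases whose label disagrees with their nearest neighbours.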

2105.13431 2026-05-26 cs.LG cs.AI cs.SY eess.SY 版本更新

An Offline Risk-aware Policy Selection Method for Bayesian Markov Decision Processes

贝叶斯马尔可夫决策过程的离线风险感知策略选择方法

Giorgio Angelotti, Nicolas Drougard, Caroline Ponzoni Carvalho Chanel

发表机构 * Natural Intelligence Toulouse Institute, University of Toulouse, France（图卢兹大学自然智能研究所）； ISAE-SUPAERO, University of Toulouse, France（图卢兹大学ISAE-SUPAERO）

AI总结针对离线强化学习中模型不确定性导致策略风险高的问题，提出一种基于贝叶斯形式化框架的风险感知策略选择方法EvC，通过最大化贝叶斯后验下的风险感知目标来选择稳健策略。

Comments Preprint, under review

DOI: 10.1016/j.artint.2026.104519
Journal ref: Artificial Intelligence, Volume 354, 2026

AI中文摘要

在离线模型学习用于规划以及离线强化学习中，有限的数据集阻碍了相对马尔可夫决策过程（MDP）的值函数估计。因此，所获得策略在真实世界中的性能受到限制且可能存在风险，尤其是当部署错误策略可能导致灾难性后果时。为此，目前正在探索多种途径以减少模型误差（或学习模型与真实模型之间的分布偏移），并在更广泛的意义上获得针对模型不确定性的风险感知解决方案。但在最终应用中，实践者应选择哪种基线？在计算时间不是问题且鲁棒性优先的离线背景下，我们提出了Exploitation vs Caution（EvC），这是一种范式：（1）优雅地融入遵循贝叶斯形式化的模型不确定性，以及（2）在由当前基线提供的固定候选策略集合中，选择最大化贝叶斯后验下风险感知目标的策略。我们在不同离散但简单的环境中使用最先进的方法验证了EvC，这些环境提供了多种MDP类别。在测试场景中，EvC成功选择了稳健策略，因此成为旨在将离线规划和强化学习求解器应用于真实世界的实践者的有用工具。

英文摘要

In Offline Model Learning for Planning and in Offline Reinforcement Learning, the limited data set hinders the estimation of the Value function of the relative Markov Decision Process (MDP). Consequently, the performance of the obtained policy in the real world is bounded and possibly risky, especially when the deployment of a wrong policy can lead to catastrophic consequences. For this reason, several pathways are being followed with the aim of reducing the model error (or the distributional shift between the learned model and the true one) and, more broadly, obtaining risk-aware solutions with respect to model uncertainty. But when it comes to the final application, which baseline should a practitioner choose? In an offline context where computational time is not an issue and robustness is the priority, we propose Exploitation vs Caution (EvC), a paradigm that (1) elegantly incorporates model uncertainty abiding by the Bayesian formalism, and (2) selects the policy that maximizes a risk-aware objective over the Bayesian posterior among a fixed set of candidate policies provided, for instance, by the current baselines. We validate EvC with state-of-the-art approaches in different discrete, yet simple, environments offering a fair variety of MDP classes. In the tested scenarios, EvC manages to select robust policies and hence stands out as a useful tool for practitioners who aim to apply offline planning and reinforcement learning solvers in the real world.
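
The selection step described above can be illustrated with a small sketch: each candidate policy is evaluated on MDPs sampled from the posterior, and the policy with the best risk-aware score is chosen. The abstract does not name the risk measure here, so Conditional Value at Risk (CVaR) is used as an assumed example. The function names and the input layout (a dict from policy name to a list of sampled returns) are hypothetical.

```python
import math

def cvar(returns, alpha):
    """Mean of the worst `alpha` fraction of returns (at least one sample)."""
    ordered = sorted(returns)
    n_tail = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:n_tail]
    return sum(tail) / len(tail)

def select_policy(posterior_returns, alpha=0.2):
    """posterior_returns maps a policy name to one return per sampled MDP."""
    return max(posterior_returns, key=lambda name: cvar(posterior_returns[name], alpha))
```

With `alpha=1.0` the criterion reduces to the posterior mean (pure exploitation), while small `alpha` values favour policies whose worst-case outcomes are acceptable (caution).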

2603.18363 2026-05-26 cs.CL cs.AI cs.LG 版本更新

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

PowerFlow: 通过原则性分布匹配释放LLMs的双重特性

Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China（清华大学交叉信息研究院）

AI总结提出PowerFlow框架，将无监督微调重构成分布匹配问题，利用GFlowNet和长度感知轨迹平衡目标，通过调整α-幂分布方向性激发LLMs的逻辑推理或创造性。

Comments Camera-ready version accepted at ICML 2026

AI中文摘要

无监督内部反馈强化学习（RLIF）已成为一种有前景的范式，可以在没有外部监督的情况下激发大型语言模型（LLMs）的潜在能力。然而，当前方法依赖于启发式内在奖励，通常缺乏明确的理论优化目标，并且容易产生退化偏差。在这项工作中，我们引入了PowerFlow，一个原则性框架，将无监督微调重新表述为分布匹配问题。通过将GFlowNet视为未归一化密度的摊销变分采样器，我们提出了一个长度感知的轨迹平衡目标，明确抵消了自回归生成中固有的结构长度偏差。通过针对$α$-幂分布，PowerFlow能够方向性地激发LLMs的双重特性：锐化分布（$α> 1$）以增强逻辑推理，或展平分布（$α< 1$）以释放表达性创造力。大量实验表明，PowerFlow始终优于现有的RLIF方法，匹配甚至超过有监督的GRPO。此外，通过减轻对齐模型中的过度锐化，我们的方法在多样性和质量上同时取得提升，在创造性任务中推动了帕累托前沿。

英文摘要

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $α$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($α> 1$) to intensify logical reasoning, or flattening it ($α< 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
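
The effect of targeting an $α$-power distribution can be seen on a toy categorical distribution: raising probabilities to the power $α$ and renormalising sharpens the distribution when $α > 1$ and flattens it when $α < 1$. This sketch only illustrates the target distribution on a handful of outcomes. It does not reproduce the GFlowNet training objective, and the helper names are hypothetical.

```python
import math

def power_distribution(probs, alpha):
    """Return the normalised distribution proportional to p ** alpha."""
    powered = [p ** alpha for p in probs]
    normaliser = sum(powered)
    return [p / normaliser for p in powered]

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Comparing `entropy(power_distribution(p, 2.0))` and `entropy(power_distribution(p, 0.5))` against `entropy(p)` shows the two directions: lower entropy for sharpening, higher entropy for flattening.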

2509.13389 2026-05-26 cs.AI 版本更新

From Next Token Prediction to (STRIPS) World Models

从下一个词预测到（STRIPS）世界模型

Carlos Núñez-Molina, Vicenç Gómez, Hector Geffner

发表机构 * RWTH Aachen University, Germany（亚琛工业大学，德国）； Universitat Pompeu Fabra, Spain（庞培法布拉大学，西班牙）

AI总结研究下一个词预测能否产生支持规划的世界模型，提出STRIPS Transformer和标准Transformer两种架构，在五个经典规划领域评估训练准确率、泛化能力和规划性能。

AI中文摘要

我们研究下一个词预测是否能够产生真正支持规划的世界模型，在一个受控的符号设置中，从动作轨迹单独学习命题STRIPS动作模型，并且可以精确评估正确性。我们引入了两种架构。第一种是STRIPS Transformer，一种符号对齐的模型，基于连接Transformer与STRIPS领域形式语言结构的理论结果。第二种是标准Transformer架构，没有内置显式符号结构，我们研究不同的位置编码方案和注意力聚合机制。我们在五个经典规划领域评估这两种架构，测量训练准确率、泛化能力以及跨领域和问题规模的规划性能。有趣的是，两种方法都可以产生支持使用现成STRIPS规划器在指数级多的未见初始状态和目标上进行规划的模型。尽管STRIPS Transformer具有强烈的符号归纳偏置，但它更难优化，并且需要更大的数据集才能可靠地泛化。相比之下，带有stick-breaking注意力的标准Transformer实现了近乎完美的训练准确率和强大的泛化能力。最后，没有stick-breaking注意力的标准Transformer无法泛化到长轨迹，而从较短轨迹训练的Transformer中提取的符号STRIPS模型则可以。

英文摘要

We study whether next-token prediction can yield world models that truly support planning, in a controlled symbolic setting where propositional STRIPS action models are learned from action traces alone and correctness can be evaluated exactly. We introduce two architectures. The first is the STRIPS Transformer, a symbolically aligned model grounded in theoretical results linking transformers and the formal language structure of STRIPS domains. The second is a standard transformer architecture without explicit symbolic structure built in, for which we study different positional encoding schemes and attention aggregation mechanisms. We evaluate both architectures on five classical planning domains, measuring training accuracy, generalization, and planning performance across domains and problem sizes. Interestingly, both approaches can be used to produce models that support planning with off-the-shelf STRIPS planners over exponentially many unseen initial states and goals. Although the STRIPS Transformer incorporates a strong symbolic inductive bias, it is harder to optimize and requires larger datasets to generalize reliably. In contrast, a standard transformer with stick-breaking attention achieves near-perfect training accuracy and strong generalization. Finally, standard transformers without stick-breaking attention do not generalize to long traces, whereas a symbolic STRIPS model extracted from a transformer trained on shorter traces does.
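
For readers unfamiliar with the formalism, a propositional STRIPS action is defined by preconditions, add effects, and delete effects over a set of atoms. The sketch below shows how an action trace updates a state. The class and function names and the toy atoms are hypothetical, and the code illustrates only the world model being learned, not the transformer architectures studied in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

def applicable(state, action):
    # An action is applicable when every precondition atom holds in the state.
    return action.preconditions <= state

def apply_action(state, action):
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable in this state")
    # STRIPS semantics: remove the delete effects, then add the add effects.
    return (state - action.delete_effects) | action.add_effects

def run_trace(state, actions):
    for action in actions:
        state = apply_action(state, action)
    return state
```

A learned model is correct in this setting when it predicts the same applicability and successor states as the true action definitions on every reachable state.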

2603.11001 2026-05-26 cs.CY cs.AI 版本更新

RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation

随机对照试验与人类提升研究：前沿AI评估的方法论挑战与实践解决方案

Patricia Paskov, Kevin Wei, Shen Zhou Hong, Dan Bateyko, Xavier Roberts-Gaal, Carson Ezell, Gailius Praninskas, Valerie Chen, Umang Bhatt, Ella Guest

发表机构 * RAND ； Johns Hopkins University（约翰霍普金斯大学）； Cornell University（康奈尔大学）； Harvard University（哈佛大学）； University of Cambridge（剑桥大学）； London School of Economics（伦敦经济学院）

AI总结本文通过访谈16位专家，系统梳理了人类提升研究（测量AI对人类绩效影响）在随机对照试验中面临的方法论挑战，包括内部效度、外部效度和构念效度问题，并提出了相应的解决方案。

AI中文摘要

人类提升研究，即通过随机对照试验（RCT）或类似方法测量AI访问对人类绩效影响的研究，越来越多地为前沿AI治理和部署决策提供信息。尽管RCT方法在其他领域是稳健的，但它们与前沿AI系统独特属性的相互作用仍未得到充分研究，特别是当结果用于高风险的决策时。我们呈现了对16位在生物安全、网络安全、教育和劳动等领域具有人类提升研究经验的专家从业者的访谈结果。在访谈中，专家们描述了人类提升研究所依赖的标准因果推断假设与研究目标之间反复出现的紧张关系。快速演变的AI系统、不断变化的基线、异质且变化的用户熟练度以及多孔的真实世界环境，对内部效度、外部效度和构念效度的假设造成了压力，使得提升证据的解释和适当使用复杂化。我们贡献了（1）人类提升研究中方法论挑战的综合，映射到研究效度的风险，并按其对大语言模型（LLM）系统的特异性程度进行分类，以及（2）从挑战到提议解决方案的映射。通过整理专家识别的挑战和解决方案，我们旨在阐明人类提升证据的解释限制和适当用途，使评估实践与其所指导的决策相一致，并为AI治理提供更协调的方法论基础。

英文摘要

Human uplift studies, or studies that measure the effects of AI access on human performance via randomized controlled trials (RCT) or similar methodologies, increasingly inform frontier AI governance and deployment decisions. While RCT methods are robust in other fields, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results are used to inform high-stakes decisions. We present findings from interviews with 16 expert practitioners with experience conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between the standard causal inference assumptions upon which human uplift studies rely and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We contribute (1) a synthesis of methodological challenges in human uplift studies, mapped to risks to study validity and classified by their degree of specificity to large language model (LLM) systems, and (2) a mapping from challenges to proposed solutions. By collating expert-identified challenges and solutions, we seek to clarify the interpretive limits and appropriate uses of human uplift evidence, to align evaluation practice with the decisions it informs, and to support more coordinated methodological foundations for AI governance.

2602.20191 2026-05-26 cs.LG cs.AI cs.CL 版本更新

MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLM

MoBiQuant: 面向令牌自适应任意精度LLM的混合比特量化

Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang

发表机构 * University of Arizona（亚利桑那大学）； Duke University（杜克大学）； Sungkyunkwan University（成均馆大学）； Panasonic AI Lab（松下人工智能实验室）； Korea Advanced Institute of Science and Technology（韩国科学技术院）

AI总结针对动态运行时约束下大语言模型任意精度量化的泛化性问题，提出基于令牌敏感度的混合比特量化框架MoBiQuant，通过多合一递归残差量化和令牌感知路由器实现灵活推理，在匹配或超越前沿单精度PTQ的同时显著节省内存并提升吞吐量。

Comments 20 pages, 10 figures

AI中文摘要

动态运行时延迟和内存约束要求灵活部署大语言模型（LLM），使得LLM能够根据可用计算资源以不同的量化精度进行推理。最近关于这种任意精度量化的工作要么依赖于硬件效率低下的向量量化，要么在切换位宽时引入额外的缩放因子。同时，现有的为固定低精度校准的后训练量化（PTQ）方法在运行时精度变化下表现出较差的泛化性。在这项工作中，我们将跨位宽泛化性差的根源归因于一种精度依赖的“异常迁移”现象，其中PTQ敏感令牌的分布随精度变化。受此观察启发，我们提出了MoBiQuant，一种新颖的任意精度混合比特量化框架，它根据令牌敏感性调整权重精度以实现灵活的LLM推理。具体来说，我们提出了一种多合一递归残差量化方法，可以在运行时迭代重建更高精度的权重，并通过令牌感知路由器缓解“异常迁移”，动态选择每个令牌的最优推理精度。大量实验表明，MoBiQuant在匹配或超越前沿单精度PTQ的同时表现出强大的弹性，与最先进的任意精度方法相比，实现了显著的内存节省和高达$1.34\times$的吞吐量提升。

英文摘要

Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. Recent work on such any-precision quantization either relies on hardware-inefficient vector quantization or induces additional scaling factors when switching between bit-widths. Meanwhile, existing post-training quantization (PTQ) methods calibrated for a fixed low precision show poor generalizability under runtime precision change. In this work, we attribute the source of poor generalization across bit-widths to a precision-dependent "outlier migration" phenomenon where the distribution of PTQ-sensitive tokens changes across precisions. Motivated by this observation, we propose MoBiQuant, a novel any-precision Mixture-of-Bits quantization framework that adjusts weight precision for flexible LLM inference based on token sensitivity. Specifically, we propose a many-in-one recursive residual quantization that can iteratively reconstruct higher-precision weights at runtime and mitigates "outlier migration" with a token-aware router to dynamically select the optimal inference precision of each token. Extensive experiments show that MoBiQuant matches or surpasses frontier single-precision PTQ while exhibiting strong elasticity, achieving significant memory savings and throughput gains of up to $1.34\times$ over state-of-the-art any-precision methods.
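
The idea of recursive residual quantization can be illustrated on a plain list of weights: quantize at low precision, quantize what is left over, and repeat, so that summing more slices at runtime yields a higher-precision reconstruction. The sketch below assumes symmetric uniform quantization with one scale per slice. It is not the paper's method, it omits the token-aware router entirely, and all function names are hypothetical.

```python
def quantize(values, bits):
    """Symmetric uniform quantization; returns the dequantized values."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]

def residual_quantize(weights, bits_per_stage, stages):
    """Return one low-bit slice per stage; each slice quantizes the remaining error."""
    slices, residual = [], list(weights)
    for _ in range(stages):
        slice_values = quantize(residual, bits_per_stage)
        slices.append(slice_values)
        residual = [r - q for r, q in zip(residual, slice_values)]
    return slices

def reconstruct(slices, n_stages):
    """Sum the first n_stages slices to obtain an n_stages-deep approximation."""
    return [sum(column) for column in zip(*slices[:n_stages])]

def max_error(weights, approx):
    return max(abs(w - a) for w, a in zip(weights, approx))
```

Using one, two, or three slices of the same stored representation gives progressively smaller reconstruction error, which is the property an any-precision scheme relies on when it switches precision at runtime.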

2602.17162 2026-05-26 cs.AI q-bio.GN 版本更新

Smart Timing for Mining: A Deep Learning Framework for Bitcoin Hardware ROI Prediction

挖矿的智能时机：用于比特币硬件投资回报率预测的深度学习框架

Sithumi Wickramasinghe, Bikramjit Das, Dorien Herremans

发表机构 * Singapore University of Technology and Design（新加坡科技设计大学）

AI总结提出MineROI-Net，一种基于Transformer的深度学习框架，将比特币ASIC硬件采购建模为时间序列分类任务，预测一年内的投资回报率类别，在2015-2024年20种ASIC矿机数据上达到83.2%准确率和83.5%宏F1分数。

AI中文摘要

由于市场波动、技术快速过时和协议驱动的收入周期，比特币挖矿硬件的获取需要战略时机。尽管挖矿已演变为资本密集型行业，但关于何时购买新的专用集成电路（ASIC）硬件的指导很少，且没有先前的计算框架解决这一决策问题。我们通过将硬件获取建模为时间序列分类任务来填补这一空白，预测购买ASIC机器是否在一年内产生盈利（投资回报率（ROI）>= 1）、边际（0 < ROI < 1）或亏损（ROI <= 0）的回报。我们提出了MineROI-Net，一种开源的基于Transformer的架构，旨在捕捉挖矿盈利能力中的多尺度时间模式。在2015年至2024年间发布的20种ASIC矿机在不同市场体制下的数据上评估，MineROI-Net优于循环、卷积和基于注意力的基线，达到了83.2%的准确率和83.5%的宏F1分数。该模型展示了强大的经济相关性，在检测亏损时期达到了97.8%的精确率，在检测盈利时期达到了81.5%的精确率，同时避免了将盈利场景误分类为亏损以及反之亦然。这些结果表明，MineROI-Net为挖矿硬件采购时机提供了一种实用的数据驱动工具，可能降低资本密集型挖矿操作中的财务风险。

英文摘要

Bitcoin mining hardware acquisition requires strategic timing due to volatile markets, rapid technological obsolescence, and protocol-driven revenue cycles. Despite mining's evolution into a capital-intensive industry, there is little guidance on when to purchase new Application-Specific Integrated Circuit (ASIC) hardware, and no prior computational frameworks address this decision problem. We address this gap by formulating hardware acquisition as a time series classification task, predicting whether purchasing ASIC machines yields profitable (Return on Investment (ROI) >= 1), marginal (0 < ROI < 1), or unprofitable (ROI <= 0) returns within one year. We propose MineROI-Net, an open-source Transformer-based architecture designed to capture multi-scale temporal patterns in mining profitability. Evaluated on data from 20 ASIC miners released between 2015 and 2024 across diverse market regimes, MineROI-Net outperforms recurrent, convolutional, and attention-based baselines, achieving 83.2% accuracy and 83.5% macro F1-score. The model demonstrates strong economic relevance, achieving 97.8% precision in detecting unprofitable periods and 81.5% precision in detecting profitable ones, while avoiding misclassifying profitable scenarios as unprofitable and vice versa. These results indicate that MineROI-Net offers a practical, data-driven tool for timing mining hardware acquisitions, potentially reducing financial risk in capital-intensive mining operations.
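
The three target classes in this abstract are defined purely by ROI thresholds, so the labelling rule is easy to state in code. This minimal sketch assumes the one-year ROI has already been computed for each purchase date. How the paper computes ROI is not described in the abstract, and the function name is hypothetical.

```python
def roi_class(roi):
    """Map a one-year ROI value to the class labels used in the abstract."""
    if roi >= 1:
        return "profitable"
    if roi > 0:
        return "marginal"
    return "unprofitable"
```

Applying `roi_class` to a series of per-date ROI values yields the label sequence that a time series classifier would be trained to predict.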

2505.20110 2026-05-26 cs.LG cs.AI 版本更新

Beyond the Proxy: Trajectory-Distilled Guidance for Offline GFlowNet Training

超越代理：用于离线GFlowNet训练的轨迹蒸馏指导

Ruishuo Chen, Xun Wang, Rui Hu, Zhuoran Li, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China（清华大学交叉信息研究院）

AI总结提出轨迹蒸馏GFlowNet（TD-GFN），利用逆强化学习从离线轨迹中提取稠密边奖励，通过DAG剪枝和优先反向采样指导策略，避免代理模型，提升离线GFlowNet训练的收敛速度和样本质量。

Comments Camera-ready version accepted at ICML 2026

AI中文摘要

生成流网络（GFlowNets）擅长采样多样化的高奖励对象。在许多实际应用中，由于无法进行主动奖励查询，这些模型必须使用静态离线数据集进行训练。主流的训练方法通常依赖代理模型为在线采样的轨迹提供奖励反馈。然而，由于数据稀缺或评估成本高，构建可靠的代理往往具有挑战性。虽然现有的无代理方法试图解决这一问题，但它们通常施加粗糙的约束，限制了模型有效探索的能力。为了克服这些限制，我们提出了轨迹蒸馏GFlowNet（TD-GFN），一种新颖的无代理训练框架。TD-GFN利用逆强化学习（IRL）从离线轨迹中提取稠密的、转移级别的边奖励，为高效探索提供丰富的结构指导。关键的是，为了确保鲁棒性，这些奖励通过DAG剪枝和优先反向采样间接指导策略。这种设计确保梯度更新仅依赖于数据集中的真实终端奖励，从而防止错误传播。实验结果表明，TD-GFN在收敛速度和样本质量上显著优于广泛的现有基线，为离线GFlowNet训练建立了更鲁棒和高效的范式。

英文摘要

Generative Flow Networks (GFlowNets) excel at sampling diverse, high-reward objects. In many practical applications where active reward queries are infeasible, these models must be trained using static offline datasets. Prevailing training methods typically rely on a proxy model to provide reward feedback for online sampled trajectories. However, constructing a reliable proxy is often challenging due to data scarcity or high evaluation costs. While existing proxy-free approaches attempt to address this, they often impose coarse constraints that limit the model's ability to explore effectively. To overcome these limitations, we propose Trajectory-Distilled GFlowNet (TD-GFN), a novel proxy-free training framework. TD-GFN utilizes inverse reinforcement learning (IRL) to extract dense, transition-level edge rewards from offline trajectories, providing rich structural guidance for efficient exploration. Crucially, to ensure robustness, these rewards guide the policy indirectly through DAG pruning and prioritized backward sampling. This design ensures that gradient updates rely exclusively on ground-truth terminal rewards from the dataset, thereby preventing error propagation. Empirical results demonstrate that TD-GFN significantly outperforms a broad range of existing baselines in both convergence speed and sample quality, establishing a more robust and efficient paradigm for offline GFlowNet training.

2509.02113 2026-05-26 cs.LG cs.AI cs.CR cs.SI 版本更新

HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis

HiGraph：用于恶意软件分析的大规模层次图数据集

Han Chen, Hanchen Wang, Hongmei Chen, Ying Zhang, Lu Qin, Wenjie Zhang

发表机构 * University of Technology Sydney（新南威尔士大学）； Yunnan University（云南大学）； University of New South Wales（新南威尔士大学）

AI总结针对现有图方法忽略软件层次结构的问题，提出包含2亿控制流图和59.5万函数调用图的大规模层次图数据集HiGraph，用于构建抗混淆和演化的鲁棒恶意软件检测器。

Comments updated dataset statistics

AI中文摘要

基于图的恶意软件分析的进展受到缺乏捕捉软件固有层次结构的大规模数据集的严重限制。现有方法通常将程序简化为单层图，未能建模高层功能交互与低层指令逻辑之间的关键语义关系。为填补这一空白，我们引入了HiGraph，这是用于恶意软件分析的最大公开层次图数据集，包含嵌套在595K个函数调用图（FCG）中的超过2亿个控制流图（CFG）。这种两层表示保留了构建对代码混淆和恶意软件演化具有鲁棒性的检测器所必需的结构语义。我们通过大规模分析展示了HiGraph的实用性，揭示了良性软件和恶意软件的不同结构特性，将其确立为社区的基础基准。数据集和工具可在https://higraph.org公开获取。

英文摘要

The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often oversimplify programs into single-level graphs, failing to model the crucial semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HiGraph's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at https://higraph.org.
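
The two-level representation described above (one function call graph per program, with one control flow graph nested inside each function node) can be pictured with a small data-structure sketch. The class names, fields, and helper method below are hypothetical and do not reflect the dataset's actual file format or tooling.

```python
from dataclasses import dataclass, field

@dataclass
class ControlFlowGraph:
    blocks: list  # basic-block identifiers within one function
    edges: list   # (source_block, target_block) control-flow edges

@dataclass
class FunctionCallGraph:
    functions: dict = field(default_factory=dict)  # function name -> ControlFlowGraph
    calls: list = field(default_factory=list)      # (caller, callee) edges

    def add_function(self, name, cfg):
        self.functions[name] = cfg

    def add_call(self, caller, callee):
        # Only allow call edges between functions that have a nested CFG.
        if caller not in self.functions or callee not in self.functions:
            raise KeyError("both functions must be added before linking them")
        self.calls.append((caller, callee))

    def total_blocks(self):
        return sum(len(cfg.blocks) for cfg in self.functions.values())
```

A single-level approach would keep only `calls` or only the flattened blocks, whereas the hierarchical form keeps both the inter-function and intra-function structure available to a model.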

2405.01906 2026-05-26 cs.AI cs.LG cs.NE 版本更新

Instance-Conditioned Adaptation for Large-scale Generalization of Neural Routing Solver

实例条件适应：神经路由求解器的大规模泛化

Changliang Zhou, Xi Lin, Zhenkun Wang, Xialiang Tong, Mingxuan Yuan, Qingfu Zhang

发表机构 * School of Automation and Intelligent Manufacturing and Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, Southern University of Science and Technology, Shenzhen 518055, China（自动化与智能制造学院和广东省全驱动系统控制理论与技术重点实验室，南方科技大学，深圳518055，中国）； Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China（计算机科学系，香港城市大学，香港特别行政区，中国）； Huawei Noah’s Ark Lab, Hong Kong SAR, China（华为诺亚实验室，香港特别行政区，中国）

AI总结提出实例条件适应模型（ICAM），通过简单高效的实例条件适应函数和低复杂度的适应模块，显著提升神经路由求解器在大规模旅行商问题（TSP）、容量车辆路径问题（CVRP）和非对称旅行商问题（ATSP）上的泛化性能，同时保持快速推理速度。

Comments 13 pages, 5 figures

DOI: 10.1109/TITS.2026.3674538
Journal ref: IEEE Transactions on Intelligent Transportation Systems, 2026

AI中文摘要

神经组合优化（NCO）方法在无需专家知识的情况下，展现出了解决智能交通系统路由问题的巨大潜力。然而，现有的构造性NCO方法仍难以解决大规模实例，这严重限制了其应用前景。为了解决这些关键缺陷，本文提出了一种新颖的实例条件适应模型（ICAM），以实现神经路由求解器更好的大规模泛化。特别地，我们设计了一个简单而高效的实例条件适应函数，以较小的时空开销显著提升现有NCO模型的泛化性能。此外，通过对不同注意力机制之间信息融合性能的系统研究，我们进一步提出了一个强大且低复杂度的实例条件适应模块，为不同规模的实例生成更好的解。在合成实例和基准实例上的大量实验结果表明，我们提出的方法能够在解决大规模旅行商问题（TSP）、容量车辆路径问题（CVRP）和非对称旅行商问题（ATSP）时，以非常快的推理时间获得有希望的结果。我们的代码可在 https://github.com/CIAM-Group/ICAM 获取。

英文摘要

The neural combinatorial optimization (NCO) method has shown great potential for solving routing problems of intelligent transportation systems without requiring expert knowledge. However, existing constructive NCO methods still struggle to solve large-scale instances, which significantly limits their application prospects. To address these crucial shortcomings, this work proposes a novel Instance-Conditioned Adaptation Model (ICAM) for better large-scale generalization of neural routing solvers. In particular, we design a simple yet efficient instance-conditioned adaptation function to significantly improve the generalization performance of existing NCO models with a small time and memory overhead. In addition, with a systematic investigation on the performance of information incorporation between different attention mechanisms, we further propose a powerful yet low-complexity instance-conditioned adaptation module to generate better solutions for instances across different scales. Extensive experimental results on both synthetic and benchmark instances show that our proposed method is capable of obtaining promising results with a very fast inference time in solving large-scale Traveling Salesman Problems (TSPs), Capacitated Vehicle Routing Problems (CVRPs), and Asymmetric Traveling Salesman Problems (ATSPs). Our code is available at https://github.com/CIAM-Group/ICAM.

2505.02129 2026-05-26 cs.DB cs.AI 版本更新

Subspace Aggregation Query and Index Generation for Multidimensional Resource Space Model

多维资源空间模型的子空间聚合查询与索引生成

Xiaoping Sun, Hai Zhuge

发表机构 * Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所智能信息处理重点实验室）； Great Bay University and Great Bay Institute for Advanced Study（大湾区大学和大湾区先进研究院）

AI总结针对多维资源空间中的子空间聚合查询问题，提出一种基于偏序关系的图索引生成方法，以高效定位非空点并聚合资源，并通过策略降低索引生成成本。

AI中文摘要

在多维语义空间中组织大规模资源是一种有效管理和查询不同语义维度资源的方法。为支持高级应用，本文提出一种资源空间模型，用于在表示每个维度的坐标树上的偏序范围内定义的子空间上进行聚合查询，其中子空间中的每个点包含沿坐标树偏序关系路径聚合的资源，并且每个点的聚合资源可由应用测量、排序和选择。为了高效定位大子空间中的非空点，提出一种生成图索引的方法，以构建维度坐标上的偏序关系，使子空间查询能够通过索引链接到达非空点，并沿索引路径将资源聚合到其超点。生成此类索引成本高昂，因为索引节点的子节点数量可能很大，导致索引节点总数非常庞大（随维度数量和维度规模呈指数增长）。所提出的方法采用一系列策略来降低成本。分析和实验表明，生成的索引在支持子空间聚合查询方面具有有效性。

英文摘要

Organizing large-scale resources in a multidimensional semantic space is an approach to efficiently managing and querying resources from different semantic dimensions. To support advanced applications, this paper proposes a resource space model for aggregation query on subspaces defined by a range within the partial order on the coordinate trees representing each dimension, where each point in the subspace contains resources aggregated along the paths of the partial order relations on the coordinate trees and the aggregated resources at each point can be measured, ranked and selected by applications. To efficiently locate non-empty points in a large subspace, an approach to generating a graph index is proposed to build partial order relations on coordinates of dimensions to enable a subspace query to reach non-empty points through indexing links and aggregate resources along indexing paths to their super points. Generating such an index is costly, as the number of children of an indexing node can be large, so the total number of indexing nodes can be very large (growing exponentially with the number of dimensions and the scale of dimensions). The proposed approach adopts a set of strategies to reduce the cost. Analysis and experiments show the effectiveness of the generated index in supporting subspace aggregation query.
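
To make the aggregation semantics concrete, the sketch below propagates each resource from the point where it is stored to every super point, that is, every combination of ancestors along the coordinate trees. This is a naive baseline written only for illustration, not the paper's index: it enumerates all super points explicitly, which is exactly the exponential blow-up the abstract says the proposed strategies aim to reduce. The data layout and function names are assumptions.

```python
from itertools import product

def ancestors_or_self(coordinate, parent):
    """Walk up one coordinate tree; `parent` maps each child to its parent."""
    chain = [coordinate]
    while coordinate in parent:
        coordinate = parent[coordinate]
        chain.append(coordinate)
    return chain

def aggregate(resources, parents):
    """resources maps a point (tuple of coordinates) to a set of resource ids.

    parents holds one child-to-parent dict per dimension.
    """
    aggregated = {}
    for point, items in resources.items():
        chains = [ancestors_or_self(c, p) for c, p in zip(point, parents)]
        # Every combination of ancestors is a super point of `point`.
        for super_point in product(*chains):
            aggregated.setdefault(super_point, set()).update(items)
    return aggregated
```

A subspace aggregation query then amounts to looking up the non-empty points that fall inside the requested coordinate ranges and reading their aggregated sets.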
