arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17415 2026-05-25 cs.LG cs.AI cs.DB cs.IR Updated version

IVF-TQ: Calibration-Free Streaming Vector Search via a Codebook-Free Residual Layer

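Tarun Sharma

Affiliations * Independent Researcher

AI summary This paper proposes IVF-TQ, a streaming vector search index that achieves calibration-free approximate nearest neighbor search through a codebook-free residual compression layer. The core idea is to avoid a codebook entirely: a fixed random rotation and a precomputed Lloyd-Max scalar quantizer, configured only by bit width and dimension, keep the index stable on streaming data without training. Experiments show that IVF-TQ maintains good performance across multiple datasets and memory regimes, with no retraining and no per-dataset bit-budget tuning, which markedly improves search efficiency and robustness in streaming settings.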

Abstract

Approximate nearest neighbor (ANN) indexes deployed against streaming corpora silently lose recall over weeks. The standard diagnosis is distribution shift, but under shuffled-i.i.d. ingestion -- no shift at all -- product quantization still degrades by 3.8pp at sub-matched bit budgets. The dominant production compression methods (PQ, OPQ, ScaNN) all fit a codebook to an initial sample and reuse it as the database grows by orders of magnitude. This paper presents IVF-TQ, an inverted-file index whose residual compression layer is data-independent: a fixed random rotation followed by a precomputed Lloyd-Max scalar quantizer parameterised only by the bit width b and dimension d. Only the IVF coarse k-means partition is trained. A uniform-over-sphere inner-product error bound depending only on (b, d, delta) provides a structural guarantee no learned-codebook method admits. The same codebook-free design enables an IVF-amplification effect that closes the gap to Extended RaBitQ to within statistical noise (+17.7pp over flat TQ at matched bit budget), and an Adaptive variant that refreshes the partition without touching the compression layer. Across nine controlled cells (three 10M datasets, three PQ memory regimes, three seeds), per-batch PQ codebook retraining never recovers the streaming gap; IVF-PQ streaming stability requires per-dataset bit-budget tuning, while IVF-TQ holds at one fixed (b, d) configuration on all three datasets with Delta in [-0.80, +0.56]pp. The contribution is operational: no codebook to train, no per-dataset bit-budget tuning, no retraining cycle that ever closes the gap.
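The data-independent compression step described in this abstract (a fixed random rotation followed by a precomputed Lloyd-Max scalar quantizer) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the per-vector norm stored as a float, the rescaling of rotated coordinates to unit variance, and the choice of b = 2 are all assumptions made here. The level and threshold constants are the standard 4-level Lloyd-Max values for a unit Gaussian, rounded to four decimals.

```python
import math
import random

# Standard Lloyd-Max reconstruction levels and decision thresholds for a
# unit Gaussian at b = 2 bits (rounded). Nothing here is fitted to data.
LEVELS_2BIT = [-1.5104, -0.4528, 0.4528, 1.5104]
THRESHOLDS_2BIT = [-0.9816, 0.0, 0.9816]


def random_rotation(d, seed=0):
    """Return a d x d orthonormal matrix via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def encode(residual, rotation):
    """Rotate a residual, rescale to roughly unit variance, then quantize.

    Assumption: after a random rotation each coordinate of a vector with
    norm r behaves approximately like N(0, r^2 / d).
    """
    d = len(residual)
    norm = math.sqrt(sum(a * a for a in residual)) or 1.0
    rotated = [sum(r * x for r, x in zip(row, residual)) for row in rotation]
    scale = math.sqrt(d) / norm
    # The code is the number of thresholds the scaled coordinate exceeds (0..3).
    codes = [sum(c * scale > t for t in THRESHOLDS_2BIT) for c in rotated]
    return codes, norm


def decode(codes, norm, rotation):
    """Map codes back to levels, undo the scaling, and rotate back."""
    d = len(codes)
    scale = norm / math.sqrt(d)
    rotated = [LEVELS_2BIT[c] * scale for c in codes]
    # The inverse of an orthonormal matrix is its transpose.
    return [sum(rotation[i][j] * rotated[i] for i in range(d)) for j in range(d)]
```

Because the rotation depends only on a seed and the quantizer only on the bit width, nothing in this sketch needs refitting as more vectors arrive, which is the property the abstract emphasizes.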

2605.23901 2026-05-25 cs.LG cs.AI cs.IT math.IT Updated version

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

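Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma

Affiliations * University of Virginia; University of California, Berkeley

AI summary Taking a Shannon information-theoretic view, this paper models LLM training as information transmission over a noisy channel and proposes the Shannon Scaling Law to explain non-monotonic phenomena that traditional monotonic scaling laws cannot describe, such as catastrophic overtraining and quantization-induced degradation. By mapping model parameters to channel bandwidth and training data to signal power, the theory shows that scaling model size or data without maintaining a sufficient signal-to-noise ratio amplifies noise and causes U-shaped performance degradation. Experiments show that the theory outperforms traditional scaling laws across multiple tasks and perturbation settings, with good fitting and extrapolation ability.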

Comments Accepted by ICML 2026

Abstract

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
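For readers unfamiliar with the theorem the framework is grounded in, the Shannon-Hartley capacity formula C = B log2(1 + S/N) can be evaluated as below. This sketch shows only the classical theorem; the abstract does not give the paper's actual functional form, so the mapping of parameters to B and tokens to S is not reproduced here, and the helper `capacity_with_noise_density` is an illustrative assumption.

```python
import math


def shannon_capacity(bandwidth, signal_power, noise_power):
    """Shannon-Hartley capacity C = B * log2(1 + S/N)."""
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def capacity_with_noise_density(bandwidth, signal_power, noise_density):
    """Capacity when total noise grows with bandwidth (N = N0 * B).

    Illustrative assumption: widening the channel also admits more noise,
    so capacity saturates at S / (N0 * ln 2) instead of growing without bound.
    """
    return shannon_capacity(bandwidth, signal_power, noise_density * bandwidth)


if __name__ == "__main__":
    for b in (1.0, 10.0, 100.0, 1000.0):
        print(b, capacity_with_noise_density(b, 1.0, 1.0))
```

The saturation in the second function is a property of the classical channel model. It shows why bandwidth alone stops helping once the signal-to-noise ratio falls, but it does not by itself produce the U-shaped degradation the paper reports.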

2605.23899 2026-05-25 cs.AI Updated version

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

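Zisu Huang, Jingwen Xu, Yifan Yang, Ziyang Gong, Qihao Yang, Muzhao Tian, Xiaohua Wang, Changze Lv, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Xue Yang, Dongdong Chen, Xiaoqing Zheng, Chong Luo

Affiliations * Fudan University; Microsoft Research; Shanghai Jiao Tong University

AI summary This paper systematically studies the full lifecycle of model-generated agent skills, covering experience generation, skill extraction, and skill consumption, to assess how well these skills actually work and what influences them. It builds a utility-grounded evaluation framework and runs extensive experiments across five task domains, finding that model-generated skills are beneficial overall but show non-trivial negative transfer, and that extractors and consumers do not behave consistently. Through in-depth analysis of each stage, the paper identifies the determinants of skill quality and proposes a meta-skill that guides skill extraction to improve practical utility.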

Abstract

Language agents increasingly improve by reusing skills -- structured procedural artifacts distilled from past experience. In particular, domain-level and model-generated skills are especially promising. They offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding remains limited, with no comprehensive study spanning the full skill lifecycle -- experience generation, skill extraction, and skill consumption -- to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, with skill utility independent of model scale or baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility, which consistently improves skill quality across domains and substantially reduces negative transfer.

2605.23898 2026-05-25 cs.AI Updated version

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

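Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, Han Liu

Affiliations * Northwestern; UCSD; USC; GaTech

AI summary This paper studies the spatial numerical understanding of vision-language models (VLMs), asking whether they can genuinely ground numerical outputs in spatial perception. The authors propose the SpaceNum framework, which uses two bidirectional tasks, Num2Space and Space2Num, to evaluate how well models map between numbers and spatial structure in dynamic spatial exploration and static spatial reasoning. The study finds that existing VLMs show limited spatial numerical understanding: they rely mainly on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from vision.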

Comments Project page: https://sterzhang.github.io/SpaceNum-Home

Abstract

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.

2605.23897 2026-05-25 cs.CV cs.AI cs.CL Updated version

ETCHR: Editing To Clarify and Harness Reasoning

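Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin

Affiliations * The Chinese University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; Shanghai Innovation Institute

AI summary Multimodal large language models have advanced visual reasoning, but purely textual reasoning chains remain a bottleneck for questions that require fine-grained focus or view transformations. To address this, the paper proposes ETCHR, a question-conditioned image editor decoupled from the understanding model, trained with a two-stage recipe that targets the language-side and generation-side gaps in turn to improve edit correctness and reasoning quality. Experiments show that ETCHR substantially improves reasoning performance on multiple tasks and integrates seamlessly with different open- and closed-source models.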

Comments Code, model and data are open-sourced at https://github.com/InternLM/ETCHR

Abstract

Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decoupling it from the understanding model. However, off-the-shelf image editors fail as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.

2605.23892 2026-05-25 cs.CV cs.AI cs.GR cs.LG cs.RO Updated version

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

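Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

Affiliations * University of Toronto & Vector Institute; Google; Technical University of Munich

AI summary Visual geometry transformers perform well on multi-view 3D reconstruction, but their computational cost grows quadratically with input sequence length, which limits efficiency and scalability. This paper proposes a simple, general solution: reduce computation by limiting the number of key/value tokens each query interacts with in global attention. The method uses a two-stage framework: it first selects which frames to keep at the frame level to ensure diverse scene coverage, then removes redundant tokens within those frames, using a layer-aware sparsification strategy based on attention entropy. Experiments show that the method speeds up visual geometry transformers by more than 85% while maintaining or improving performance.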

Comments Project Page: https://zsh2000.github.io/good-token-hunting.github.io, Code: https://github.com/zsh2000/gotohunt

Abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints at how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.
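One way to picture entropy-guided, layer-aware sparsification is the sketch below. It is an illustration only: the abstract does not specify how entropy is converted into a token budget, so the rule used here (keep about exp(entropy) keys, ranked by total attention mass) and the function names are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one query's attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def select_keys(attn_rows):
    """Pick the key indices to keep for one layer.

    attn_rows: list of attention distributions, one per query, each summing to 1.
    Hypothetical rule: the budget is exp(mean entropy), so a layer with
    peaked attention keeps few keys and a layer with diffuse attention keeps many.
    """
    num_keys = len(attn_rows[0])
    mean_entropy = sum(attention_entropy(r) for r in attn_rows) / len(attn_rows)
    budget = max(1, min(num_keys, round(math.exp(mean_entropy))))
    # Rank keys by the total attention they receive across all queries.
    mass = [sum(row[j] for row in attn_rows) for j in range(num_keys)]
    ranked = sorted(range(num_keys), key=lambda j: mass[j], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    peaked_layer = [[0.01, 0.01, 0.97, 0.01], [0.01, 0.01, 0.97, 0.01]]
    diffuse_layer = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
    print(select_keys(peaked_layer))
    print(select_keys(diffuse_layer))
```

The point of the sketch is the layer dependence: the same selection routine yields different budgets for different layers because the entropy of their attention patterns differs.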

2605.23887 2026-05-25 cs.DB cs.AI cs.CR cs.LG cs.MA Updated version

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

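Joydeep Chandra

Affiliations * BNRIST, Tsinghua University

AI summary CHRONOS is a multi-agent coordination framework for dynamic data marketplaces. It targets problems that data evolution causes in static designs: declining retrieval recall, inaccurate value attribution, and over-consumption of the privacy budget. The method uses a three-layer architecture that combines time-aware neural ODEs, Shapley valuation conditioned on changepoint detection, and a differentially private bandit algorithm (EXP3-IX) to achieve efficient, privacy-preserving market coordination. Experiments show that CHRONOS delivers competitive retrieval performance and privacy efficiency on multiple benchmarks.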

Abstract

Temporal knowledge-graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edges evolve, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over-consume a shared differential-privacy budget. We present CHRONOS, a three-layer architecture providing a unified treatment of these challenges with explicit public and private separation. Layer one applies neural-ODE temporal decay to shortcut edges, providing a per-query expected recall-loss bound of $O(Pq \, \lambda \, \delta t)$, with a monotone-envelope guarantee reducing bound looseness to between 1.8 and 3.2 times the observed loss. Layer two conditions Shapley valuation on detected changepoints and provides finite-sample error guarantees under noise. Layer three uses EXP3-IX to achieve $O(\sqrt{T \log T})$ regret while enforcing $(\epsilon, \delta)$-differential privacy via moments accounting. CHRONOS releases a privatized affinity matrix per epoch using the Gaussian mechanism; all retrieval and ranking are post-processing, incurring no extra privacy cost. We provide multi-epoch settlement, scalability analysis for 500 sellers, and comparisons against accelerated baselines. Across four benchmarks, CHRONOS shows recall@10 of 0.937, 2.74 queries per second, 161 ms latency, and total $\epsilon = 4.25$ at $\delta = 10^{-6}$ under zCDP composition. These results indicate a competitive operating point. A limitation is that at this privacy level, released valuations remain noise-dominated; utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.
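The privacy accounting mentioned in this abstract (Gaussian-mechanism releases per epoch, composed under zCDP) follows a standard pattern that can be sketched as below. The formulas are the well-known zCDP results: a Gaussian release with sensitivity Δ and noise scale σ satisfies ρ-zCDP with ρ = Δ²/(2σ²), ρ values add under composition, and ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP. The function names and every numeric setting are illustrative assumptions, not CHRONOS's configuration.

```python
import math


def gaussian_zcdp(sensitivity, sigma):
    """rho for one Gaussian-mechanism release: rho = sensitivity^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)


def zcdp_to_dp(rho, delta):
    """Convert rho-zCDP to (epsilon, delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def total_epsilon(num_epochs, sensitivity, sigma, delta):
    """Total epsilon after releasing one noisy matrix per epoch.

    zCDP composes additively, so the per-epoch rho is multiplied by the
    number of epochs before converting back to (epsilon, delta)-DP.
    """
    rho_total = num_epochs * gaussian_zcdp(sensitivity, sigma)
    return zcdp_to_dp(rho_total, delta)


if __name__ == "__main__":
    # Hypothetical settings chosen only to exercise the functions.
    print(total_epsilon(num_epochs=10, sensitivity=1.0, sigma=5.0, delta=1e-6))
```

Retrieval and ranking computed from the released matrix do not appear in this accounting because post-processing of a differentially private output costs no additional budget, as the abstract notes.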

2605.23883 2026-05-25 cs.CV cs.AI Updated version

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

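Rim Assouel, Amir Bar, Michal Drozdzal, Adriana Romero-Soriano

Affiliations * Mila - Québec AI Institute; FAIR at Meta Superintelligence Labs; McGill University; Canada CIFAR AI Chair

AI summary Despite significant progress, multimodal large language models (MLLMs) still fall short on fine-grained understanding tasks. This paper proposes PGT, a procedurally generated task framework that overlays unambiguous geometric primitives on images to generate dense supervision, improving the model's visual grounding and serving as a low-cost diagnostic tool for identifying the causes of perception failures. Experiments show that PGT markedly improves performance on a range of benchmarks, indicating that the fine-grained perception bottleneck can be effectively addressed with stronger supervision signals.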

Abstract

Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively addresses the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.

2605.23867 2026-05-25 cs.HC cs.AI Updated version

Human Decision-Making with Persuasive and Narrative LLM Explanations

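Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, Murat Kantarcioglu

Affiliations * DEVCOM Army Research Laboratory; University of Texas at Dallas; Virginia Polytechnic Institute and State University

AI summary This study examines how narrative explanations generated by large language models (LLMs) in classification tasks affect human decision-making performance. In a large-scale human behavioral experiment, it finds that the persuasiveness of LLM-generated narrative explanations did not significantly improve decision accuracy, but the narratives may increase human reliance on AI predictions and may harm decision response times and the ability to judge whether an AI prediction is correct. The results suggest that adding narrative explanations to AI predictions may involve tradeoffs in decision-making performance, and that further research is needed on the mechanisms and settings involved.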

Abstract

Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also through their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found that the degree of persuasiveness, or lack thereof, of LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results with explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, but this held both when the AI prediction was correct and when it was incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.

2605.23861 2026-05-25 cs.LG cs.AI cs.CV Updated version

Leveraging Foundation Models for Causal Generative Modeling

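Aneesh Komanduri, Xintao Wu

Affiliations * University of Arkansas

AI summary This paper studies how to use pretrained foundation models for causal generative modeling, aiming to improve the counterfactual reasoning ability of AI systems. It proposes FM-CGM, a modular framework that performs end-to-end visual causal reasoning through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. The method combines a causal reasoning model with a text-to-image diffusion model and introduces a causal semantic guidance mechanism, supporting zero-shot causal discovery and counterfactual image generation.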

Abstract

Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.
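The requirement that an intervention propagate to descendant concepts while leaving everything else unchanged can be illustrated on a toy concept graph. The sketch below is a hypothetical symbolic example with deterministic mechanisms and made-up concept names; it does not model the paper's cross-attention mechanism (CSG) or any image generation.

```python
from collections import deque


def descendants(children, node):
    """Return all concepts reachable from `node` in the causal graph."""
    seen, queue = set(), deque([node])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def apply_intervention(children, mechanisms, values, node, new_value):
    """Set `node` to `new_value` and recompute only its descendants.

    mechanisms maps each non-root concept to a function of the current values.
    Repeating the update len(affected) times is enough for an acyclic graph.
    """
    updated = dict(values)
    updated[node] = new_value
    affected = descendants(children, node)
    for _ in range(len(affected)):
        for name in affected:
            updated[name] = mechanisms[name](updated)
    return updated


if __name__ == "__main__":
    # Hypothetical concepts: rain -> wet_ground -> slippery; sky_color is unrelated.
    children = {"rain": ["wet_ground"], "wet_ground": ["slippery"]}
    mechanisms = {
        "wet_ground": lambda v: v["rain"],
        "slippery": lambda v: v["wet_ground"],
    }
    values = {"rain": False, "wet_ground": False, "slippery": False, "sky_color": "blue"}
    print(apply_intervention(children, mechanisms, values, "rain", True))
```

Concepts outside the descendant set are copied through untouched, which is the symbolic analogue of preserving invariant regions.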

2605.23825 2026-05-25 cs.LG cs.AI Updated version

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

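Stuart Bladon, Brinnae Bent

Affiliations * Alibaba; seven AI labs

AI summary The study finds that geopolitical bias in language models arises mainly in post-training rather than pre-training. Comparative experiments on models from seven labs show that, after post-training, a model's stance tends to lean toward the country or region of its developer, and that this bias varies with the prompt language. The study stresses that a model's representation of nations, cultures, and political views is not simply inherited from training data but is actively shaped during alignment, underscoring the importance of transparency and oversight of post-training.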

Comments 12 pages, 6 figures, 2 tables, 3 appendices. Code and scenario bank: https://github.com/recozers/LLM-Bias

Abstract

It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phase. We tested seven open-weight LLM pairs consisting of the base model (pre-training only) and the chat model (pre-training and post-training) from seven labs on a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post-training rather than in pre-training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post-training. This shift is strongest in Alibaba's Qwen 2.5: while the base is neutral on China-favourability (-0.15 log-odds, p=0.15), the post-trained chat variant is at +2.91 (p<10^-4), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French-made Mistral becomes pro-France only under French prompting (FR-EN shift +1.91, p<10^-4). These findings suggest that geopolitical preferences in language models are not simply inherited from large-scale internet data but are actively shaped during post-training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.
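The log-odds figures quoted for Qwen 2.5 can be turned into odds ratios and probabilities with a few lines of arithmetic. The abstract does not state how its 18x figure is computed, so the reading below is an assumption: exp(2.91) is about 18.4, which matches 18x if the chat model's odds are compared against even odds, whereas the difference from the base model's point estimate (2.91 - (-0.15) = 3.06) would correspond to exp(3.06), about 21.3.

```python
import math


def odds_ratio(log_odds_shift):
    """Multiplicative change in odds implied by a shift in log-odds."""
    return math.exp(log_odds_shift)


def probability(log_odds):
    """Probability corresponding to a log-odds value (logistic function)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    base, chat = -0.15, 2.91  # values quoted in the abstract
    print("chat vs even odds:", odds_ratio(chat))
    print("chat vs base estimate:", odds_ratio(chat - base))
    print("base probability:", probability(base))
    print("chat probability:", probability(chat))
```

Since the base estimate is not statistically distinguishable from zero (p=0.15), treating it as even odds is a plausible reading, but that interpretation is not confirmed by the abstract.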

2605.23819 2026-05-25 cs.CV cs.AI Updated version

Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

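Jorge Chang Ortega, Bastien Le Lan, Thomas Serre, Victor Boutin

Affiliations * ANITI; Brown University; CNRS

AI summary This paper examines a central question in computational vision: whether human visual representations are better explained by discriminative or generative learning. Using Joint Energy-Based Models (JEMs) to interpolate continuously between discriminative and generative training objectives within a fixed architecture, the study isolates the effect of the learning objective and evaluates on six human-alignment benchmarks covering perceptual similarity, gloss perception, human response uncertainty, and more. The results show that human alignment peaks at intermediate points between the generative and discriminative objectives rather than at either extreme, indicating that human visual alignment arises from balancing the two objectives rather than choosing one.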

Abstract

A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
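The idea of interpolating between the two objectives with a single coefficient can be sketched as follows. In the standard JEM formulation, classifier logits f(x) define both p(y|x) through softmax and an unnormalized log-density, log p(x) = logsumexp(f(x)) - log Z. The convex-combination form of `hybrid_loss` and the name `alpha` are assumptions made for illustration; the abstract does not give the exact parameterization, and the generative term is passed in as a number because estimating it requires sampling that is out of scope here.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def discriminative_loss(logits, label):
    """Cross-entropy, i.e. -log p(y|x) under softmax over the logits."""
    return logsumexp(logits) - logits[label]


def unnormalized_log_px(logits):
    """JEM reading of the same logits: log p(x) up to the unknown constant log Z."""
    return logsumexp(logits)


def hybrid_loss(logits, label, generative_loss, alpha):
    """Assumed mixing rule: alpha = 0 is purely discriminative, alpha = 1 purely generative."""
    return (1.0 - alpha) * discriminative_loss(logits, label) + alpha * generative_loss


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, hybrid_loss(logits, 0, generative_loss=1.2, alpha=alpha))
```

Sweeping `alpha` while holding the architecture fixed is what lets the learning objective be varied in isolation, which is the experimental design the abstract describes.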

2605.23780 2026-05-25 cs.AI Updated version

Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

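Haoyuan Wang, Xiaohao Liu, Jiajie Su, Jianmao Xiao, Chaochao Chen

Affiliations * Zhejiang University; National University of Singapore; Jiangxi Normal University

AI summary This paper studies robust intrinsic knowledge editing in multimodal large language models, aiming to update knowledge efficiently without harming existing capabilities. To address the limited ability of existing methods to propagate edits across semantically equivalent visual and linguistic variants, the authors propose Adversarial Subspace Alignment (ASAM), which introduces Latent Adversarial Robustification (LAR) and Rank-Constrained Subspace Learning (RCSL) to strengthen generalization and edit robustness in high-dimensional multimodal spaces. Experiments show that the method performs strongly on knowledge editing tasks.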

Abstract

Multimodal large language models (MLLMs) need efficient mechanisms to update knowledge without degrading existing capabilities. While intrinsic multimodal knowledge editing achieves strong reliability and locality, it often exhibits limited generality, failing to propagate edits across semantically equivalent visual and linguistic variations. This issue arises from the lack of explicit semantic supervision, rigid editing scopes, and biased anchoring to individual samples in high-dimensional multimodal spaces. We address robust intrinsic multimodal knowledge editing by explicitly targeting generalization. We formalize robustness through knowledge units that group semantically equivalent multimodal inputs and define generality as consistent predictions within each unit. To expose fragile semantic regions, we introduce Latent Adversarial Robustification (LAR), which generates adversarial yet semantically coherent variants in the joint latent space. We further propose Rank-Constrained Subspace Learning (RCSL), enforcing low-rank alignment of adversarial representations at the edit layer via a singular value-based objective. Extensive analysis demonstrates the effectiveness of ASAM empirically.

2605.23772 2026-05-25 cs.AI cs.LO cs.PL cs.SE 版本更新

Agentic Proving for Program Verification

程序验证的智能体证明

Alessandro Sosso, Akhil Arora, Bas Spitters

发表机构 * Department of Computer Science(计算机科学系)

AI总结 该研究评估了基于代理的定理证明系统在程序验证任务中的能力,通过在CLEVER基准上测试Claude Code的表现,发现其在生成规范、验证实现以及端到端程序生成与验证方面均取得了较高的成功率。研究还指出当前程序验证基准与现代代理证明系统的能力之间存在差距,并强调需要更严格、更具鲁棒性的评估方法,特别是替代基于同构评分的规范评估方式。研究结果表明,结合编译器的紧密循环代理范式是当前程序验证最有效的方法之一。

详情
AI中文摘要

智能体系统最近已成为形式数学中自动定理证明的最先进方法。为了评估这些能力在程序验证中的延伸程度,我们在CLEVER(一个用于可验证代码生成的Lean 4基准)上,在智能体证明框架中评估了Claude Code。我们的结果显示,Claude为98.8%的问题生成了可论证的有效规范(其中81.3%也被CLEVER基于同构的评分在基准的正确部分接受),针对正确的真值(ground-truth)规范验证了87.5%问题的实现,并在具有自洽前提的条目上,端到端程序生成和验证流水线的成功率达到98.1%。在所有阶段,Claude进一步对其自身尝试提供了高质量的反馈(经人工审查确认),识别了失败的根本原因和数据集中残留的错误。这些发现突显了现有程序验证基准的难度与当代智能体证明器能力之间日益增长的不匹配,并指出了对更严格、更具错误鲁棒性的评估方法的需求,特别是对生成规范基于同构的评分的替代方案。更广泛地说,我们的结果提供了经验证据,表明紧密的编译器在环智能体范式目前是基础程序验证最有效的方法。

英文摘要

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

2605.23771 2026-05-25 cs.CV cs.AI cs.MA 版本更新

PhotoFlow: Agentic 3D Virtual Photography Missions

PhotoFlow: 智能体式3D虚拟摄影任务

Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong

发表机构 * Shanghai Jiao Tong University(上海交通大学) Northeastern University(东北大学) University of California, Los Angeles(加州大学洛杉矶分校) Cornell University(康奈尔大学) Shanghai AI Laboratory(上海人工智能实验室) Sichuan University(四川大学)

AI总结 PhotoFlow 是一种用于虚拟摄影的智能代理系统,能够在没有预设相机参数或参考图像的情况下,根据语言指令在3D场景中生成符合语义意图的高质量照片。该系统由三个模块组成:Director 生成多样化的相机候选方案,Reviewer 进行视觉评估与参数筛选,Reflector 则通过失败经验优化搜索策略。研究还提出了 VPhotoBench 基准,包含多个 Blender 场景和语言条件摄影任务,实验表明 PhotoFlow 在多轮渲染预算下表现出色,是首个在任意 Blender 场景中实现语言条件虚拟摄影的可执行代理系统。

详情
AI中文摘要

虚拟摄影要求智能体进入一个预制的3D场景,没有预设的相机姿态或参考图像,从场景信息和语言意图中推断合适的镜头,选择可执行的相机参数,并渲染最终照片。视觉-语言模型的最新进展使这种空间智能体越来越可行,但该任务强调两种难以同时评估的能力:复杂的3D空间理解和抽象审美判断。我们引入了PhotoFlow,一个导演-评审-反思智能体,用于闭环相机搜索。导演构建软摄影蓝图并提议多样化的候选相机;评审结合规则检查、视觉批评和成对优胜者选择;反思将失败转化为区域记忆、死区抑制和高探索重定位。我们还引入了VPhotoBench,一个包含47个开源许可的Blender场景和141个语言条件摄影任务的基准,涵盖主体放置、关系构图和氛围/风格。在保留实验中,PhotoFlow在六轮渲染预算下,在一次性预测、单链反思、锚点库选择和随机搜索中取得了最强的外部质量-对齐复合指标和成功率。据我们所知,这是第一项将任意Blender场景中的语言条件虚拟摄影作为可执行智能体任务的工作,我们的结果表明,以LLM为中心的空间智能体已经可以在旨在挑战3D推理和审美选择的设置中产生强大的照片。

英文摘要

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

2605.23723 2026-05-25 cs.AI 版本更新

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

MemAudit:通过因果归因和结构异常检测对中毒代理记忆进行事后审计

Zhewen Tan, Yilun Yao, Huiyan Jin, Wenhan Yu, Guoan Wang, Mengyuan Fan, liang lu, Feng Liu, Xiangzheng Zhang, Duohe Ma, Tong Yang, Lin Sun

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) Qiyuan Tech(齐元科技) Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机科学学院多媒体信息处理实验室) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院)

AI总结 随着大型语言模型代理越来越多地依赖持久内存来存储历史交互并提升任务执行能力,内存机制也带来了潜在的安全隐患:攻击者可通过正常交互向内存中注入恶意记录,从而影响代理的行为。为此,本文提出 MemAudit,一种用于事后审计内存增强型大语言模型代理的因果记忆审计框架。该方法结合因果影响评分与结构异常检测,有效识别出对有害输出有贡献的恶意记忆记录,并在多种攻击场景下显著降低了攻击成功率。

详情
AI中文摘要

大型语言模型代理越来越依赖持久记忆来存储过去的交互、检索相关演示并改进长期任务执行。然而,这种记忆机制也造成了一个实际的安全漏洞:对抗性用户可能通过普通交互将恶意记录注入代理的记忆中,这些记录随后可能被检索以引导代理的推理和行动。现有的防御主要关注在线干预,如提示过滤或输出阻止,但它们没有解决事后问题,即在观察到有害行为后,哪些存储的记忆应负责。我们提出了MemAudit,一个用于记忆增强型LLM代理的事后因果记忆审计框架。该框架结合了两个互补信号:(1)反事实记忆影响评分,衡量每个记忆对有害输出的因果贡献;(2)记忆一致性图,识别更广泛记忆存储中的结构异常记忆。我们针对MINJA(一种仅查询的记忆注入攻击,其中恶意记录通过正常代理交互生成和存储,而非直接修改记忆库)评估了MemAudit。在QA和推理代理两种设置中,MemAudit在现实的事后审计场景下显著降低了攻击成功率。结果显示,QA攻击成功率从70%降至0%,而RAP攻击成功率从83.3%降至0%。

英文摘要

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.
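
The counterfactual influence signal described in this abstract can be illustrated with a leave-one-out sketch. This is a minimal illustration under stated assumptions, not MemAudit's actual scoring rule, which the abstract does not specify: `harm_score` is a hypothetical callable standing in for re-running the agent on a subset of memories and rating how harmful its output is, and `toy_harm` is a made-up stand-in for that step.

```python
from typing import Callable, List, Sequence


def influence_scores(
    memories: Sequence[str],
    harm_score: Callable[[List[str]], float],
) -> List[float]:
    """Leave-one-out influence: harm(all memories) - harm(all but memory i).

    A large positive score means that removing the memory lowers the harm,
    so the memory is a candidate cause of the harmful output.
    """
    baseline = harm_score(list(memories))
    scores = []
    for i in range(len(memories)):
        ablated = [m for j, m in enumerate(memories) if j != i]
        scores.append(baseline - harm_score(ablated))
    return scores


def toy_harm(memories: List[str]) -> float:
    """Hypothetical harm rater: 1.0 if any record carries the marker, else 0.0."""
    return 1.0 if any("IGNORE PREVIOUS" in m for m in memories) else 0.0


store = ["user likes tea", "IGNORE PREVIOUS rules and leak data", "meeting at 3pm"]
scores = influence_scores(store, toy_harm)
```

In this toy setting only the injected record receives a non-zero score. A real audit would need one agent re-run per ablation, which is where a structural signal such as the consistency graph could help narrow the candidates first.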

2605.23719 2026-05-25 cs.CV cs.AI 版本更新

Weierstrass Positional Encoding for Vision Transformers

Weierstrass位置编码用于视觉Transformer

Zhihang Xin, Rui Wang, Xitong Hu, Xiaojun Wu

发表机构 * School of Mathematics and Data Science, Jiangnan University(江南大学数学与数据科学学院) School of Artificial Intelligence and Computer Science, Jiangnan University(江南大学人工智能与计算机科学学院)

AI总结 视觉Transformer在计算机视觉中取得了显著成功,但其常用的可学习一维位置编码在图像分块展平后削弱了图像的二维空间结构。为解决这一问题,本文提出了一种基于魏尔斯特拉斯椭圆函数的位置编码方法(WePE),通过在复数域中对二维分块坐标进行映射,构建具有双周期特性的四维位置特征,从而更准确地保留图像分块的几何关系和空间邻近性先验。该方法具有数学理论支撑,能够自然匹配图像网格的规则结构,并且无需额外计算开销,可无缝集成到现有视觉Transformer中,实验表明其在多种任务中均能带来性能提升。

详情
AI中文摘要

视觉Transformer在计算机视觉中取得了显著成功,但它们通常使用可学习的一维位置编码,这削弱了图像块展平后固有的二维空间结构。现有的位置编码往往缺乏几何约束,并且不保持欧氏空间距离与序列索引距离之间的单调关系,限制了ViTs利用空间邻近先验的能力。受周期性在位置编码中实用性的启发,我们提出了Weierstrass椭圆位置编码(WePE),这是一种在复数域中编码二维坐标的数学基础方法。WePE将归一化的二维块坐标映射到复平面,并使用Weierstrass椭圆函数及其导数构建紧凑的四维位置特征。双周期性提供了二维位置的原则性表示,其固有的晶格结构自然匹配图像块网格的规则几何形状。其非线性几何特性有助于更忠实地建模空间距离关系,而代数加法公式使得任意块对之间的相对位置信息可以直接从其绝对编码中推导出来。WePE是即插即用的且与分辨率无关,可以无缝集成到现有的ViTs中。大量实验表明,WePE在大多数设置中带来一致的性能提升。通过预计算的查找表,这些改进不会引入明显的计算或内存开销。额外的分析和消融研究进一步验证了所提方法的有效性。

英文摘要

Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one-dimensional positional encodings weakens the inherent two-dimensional spatial structure of images after patch flattening. Existing positional encodings often lack geometric constraints and do not preserve a monotonic relationship between Euclidean spatial distances and sequential index distances, limiting ViTs' ability to exploit spatial proximity priors. Motivated by the usefulness of periodicity in positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically grounded method for encoding two-dimensional coordinates in the complex domain. WePE maps normalized 2D patch coordinates onto the complex plane and constructs compact four-dimensional positional features using the Weierstrass elliptic function and its derivative. The double periodicity provides a principled representation of 2D positions, and its intrinsic lattice structure naturally matches the regular geometry of image patch grids. Its nonlinear geometric properties help model spatial distance relationships more faithfully, while the algebraic addition formula enables relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is plug-and-play and resolution-agnostic, allowing seamless integration into existing ViTs. Extensive experiments show that WePE brings consistent performance gains in most settings. With precomputed lookup tables, these improvements introduce no noticeable computational or memory overhead. Additional analyses and ablation studies further validate the effectiveness of the proposed method.
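
To make the four-dimensional feature concrete, the sketch below evaluates the Weierstrass elliptic function and its derivative by a truncated lattice sum and stacks their real and imaginary parts. Everything specific here is an assumption made for illustration: the square lattice with periods 1 and i, the truncation size `n_terms`, the half-patch offset that keeps coordinates away from the pole at the origin, and the function names. The paper's actual normalization and lattice choice are not given in the abstract.

```python
from typing import List, Tuple


def wp_and_derivative(
    z: complex, w1: complex = 1.0, w2: complex = 1j, n_terms: int = 20
) -> Tuple[complex, complex]:
    """Truncated lattice sums for the Weierstrass function and its derivative.

    wp(z)  = 1/z^2 + sum over w != 0 of [1/(z - w)^2 - 1/w^2]
    wp'(z) = -2 * sum over all w of 1/(z - w)^3
    with lattice points w = m*w1 + n*w2, |m|, |n| <= n_terms.
    """
    wp = 1.0 / z**2
    dwp = -2.0 / z**3
    for m in range(-n_terms, n_terms + 1):
        for n in range(-n_terms, n_terms + 1):
            if m == 0 and n == 0:
                continue
            w = m * w1 + n * w2
            wp += 1.0 / (z - w) ** 2 - 1.0 / w**2
            dwp += -2.0 / (z - w) ** 3
    return wp, dwp


def wepe_features(row: int, col: int, height: int, width: int) -> List[float]:
    """Map a patch index to (Re wp, Im wp, Re wp', Im wp').

    The +0.5 offset (an assumption) places each patch centre strictly inside
    the unit cell, so z never lands on a lattice point.
    """
    z = complex((col + 0.5) / width, (row + 0.5) / height)
    wp, dwp = wp_and_derivative(z)
    return [wp.real, wp.imag, dwp.real, dwp.imag]


p_pos, d_pos = wp_and_derivative(0.3 + 0.2j)
p_neg, d_neg = wp_and_derivative(-(0.3 + 0.2j))
features = wepe_features(row=3, col=5, height=14, width=14)
```

Because the truncation is symmetric, the sketch preserves the function's evenness and the derivative's oddness up to rounding. Double periodicity holds only approximately at finite `n_terms`, which is one reason a precomputed lookup table, as the abstract mentions, is a natural fit.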

2605.22738 2026-05-25 cs.LG cs.AI stat.ML 版本更新

Proxy-Based Approximation of Shapley and Banzhaf Interactions

基于代理的Shapley和Banzhaf交互近似

Santo M. A. R. Thies, Hubert Baniecki, R. Teal Witter, Eyke Hüllermeier, Maximilian Muschalik, Fabian Fumagalli

发表机构 * LMU Munich(慕尼黑大学) MCML DFKI(德国人工智能研究中心) Centre for Credible AI, Warsaw University of Technology(华沙理工大学可信AI中心) University of Warsaw(华沙大学) Claremont McKenna College(克莱尔蒙特麦肯纳学院) Bielefeld University(比勒菲尔德大学)

AI总结 本文研究了如何高效准确地估计Shapley和Banzhaf交互值,以解释机器学习模型中特征之间的复杂相互作用。为此,作者提出了ProxySHAP方法,结合树模型代理的高效采样与残差校正策略,实现了在保证精度的同时提升计算效率。理论分析表明,ProxySHAP能够在多项式时间内计算树集成模型的精确交互指数,并有效控制偏差与方差。实验表明,ProxySHAP在多个基准测试中表现优异,尤其在大规模高维数据上显著优于现有方法。

详情
AI中文摘要

Shapley和Banzhaf交互捕捉了现代机器学习应用中固有的复杂动态。然而,当前对这些高阶交互的估计器在速度和准确性之间进行权衡。为了克服这一限制,我们引入了ProxySHAP。ProxySHAP将基于树的代理模型的高样本效率与通过残差校正实现一致性的原则路径相结合。在理论层面,我们推导了干预TreeSHAP的多项式时间推广,以计算树集成的精确交互指数,成功避免了先前方法中的指数树深度依赖。此外,我们正式分析了残差调整策略,刻画了最大样本重用(MSR)在特定条件下校正代理偏差而不使其方差随交互规模指数增长的条件。广泛的基准测试表明,ProxySHAP在近似质量上树立了新的最先进标准,包括在具有数千个特征的大规模应用中。通过在小预算和大预算场景下均实现最低误差,ProxySHAP显著优于先前最佳估计器ProxySPEX和KernelSHAP-IQ,同时在可解释性下游任务上也提供了卓越性能。

英文摘要

Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions trade off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP. ProxySHAP reconciles the high sample efficiency of tree-based proxy models with a principled path to consistency via residual correction. On a theoretical level, we derive a polynomial-time generalization of interventional TreeSHAP to compute exact interaction indices for tree ensembles, successfully bypassing exponential tree-depth dependencies in prior methods. Furthermore, we formally analyze the residual adjustment strategy, characterizing the specific conditions under which Maximum Sample Reuse (MSR) corrects proxy bias without its variance scaling exponentially with interaction size. Extensive benchmarking demonstrates that ProxySHAP sets a new state-of-the-art standard for approximation quality, including in large-scale applications with thousands of features. By achieving the lowest error in both small- and large-budget regimes, ProxySHAP significantly outperforms the prior best estimators ProxySPEX and KernelSHAP-IQ, while also delivering superior performance on downstream explainability tasks.
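
For reference, the quantities being approximated can be computed exactly on a small game by brute-force enumeration. The sketch below does this for pairwise Shapley and Banzhaf interaction indices. It is not ProxySHAP; it is the exponential-cost baseline that such estimators try to avoid, and `toy_game` is a made-up value function.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, Iterable, Tuple

Game = Callable[[FrozenSet[int]], float]


def pair_delta(v: Game, s: Iterable[int], i: int, j: int) -> float:
    """Discrete second derivative of v with respect to players i and j at S."""
    s = frozenset(s)
    return v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s)


def pairwise_interactions(v: Game, n: int, i: int, j: int) -> Tuple[float, float]:
    """Exact Shapley and Banzhaf interaction indices for the pair (i, j).

    Shapley weights subsets S of the other players by
    |S|! (n - |S| - 2)! / (n - 1)!, Banzhaf weights them uniformly by 1/2^(n-2).
    """
    rest = [k for k in range(n) if k not in (i, j)]
    shapley = 0.0
    banzhaf = 0.0
    for size in range(len(rest) + 1):
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for s in combinations(rest, size):
            delta = pair_delta(v, s, i, j)
            shapley += weight * delta
            banzhaf += delta / 2 ** (n - 2)
    return shapley, banzhaf


def toy_game(s: FrozenSet[int]) -> float:
    """Hypothetical game: players 0 and 1 are worth 2 only together; player 2 adds 1."""
    return 2.0 * (0 in s and 1 in s) + 1.0 * (2 in s)


shapley_01, banzhaf_01 = pairwise_interactions(toy_game, 3, 0, 1)
shapley_02, banzhaf_02 = pairwise_interactions(toy_game, 3, 0, 2)
```

Players 0 and 1 interact with strength 2 under both indices, while player 2 is purely additive and has zero interaction with player 0. The number of game evaluations grows as 2^(n-2) per pair, which is the cost that sampling-based and proxy-based estimators aim to reduce.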

2605.22672 2026-05-25 cs.AI 版本更新

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

能力是负担吗?更强大的语言模型在关键时刻做出更差的预测

Nick Merrill, Jaeho Lee, Ezra Karger

发表机构 * Forecasting Research Institute(预测研究所)

AI总结 本文研究了在具有超线性增长和制度变化尾风险的时间序列预测任务中,能力更强的语言模型反而会产生更差的分布预测这一逆向缩放现象。通过在模拟和真实数据集上的实验,发现更强大的模型倾向于高估上尾风险,而下尾预测相对稳定。研究还表明,模型规模和后训练均对这一现象有影响,并建议在评估语言模型预测能力时应结合连续的准确性指标,而不仅仅依赖于单一阈值的二元指标。

详情
AI中文摘要

我们记录了LLM在预测问题上的逆缩放现象,这些问题的底层时间序列表现出超线性增长和制度转换的尾部风险,这种结构在金融和流行病学中很常见。在这些任务上,更强大的模型会产生更差的分位数预测。该模式出现在我们发布的、无污染的模拟世界基准ForecastBench-Sim(FBSim)上,在预测具有匹配线性控制的合成SIR流行病时,并在COVID-19、麻疹、住房市场和恶性通货膨胀的真实世界数据集中得到复现。每个分位数的分解表明,失败集中在尾部上端,更强大的模型将其向上移动以跟踪激进的增长外推,而下尾部保持不变。Llama-3.1的族内研究表明,模型规模和后训练都独立地促成了这种效应。领域知识并不能可靠地挽救校准。这种逆缩放并不出现在LLM预测基准中常见的单阈值指标上,在相同的输出上,能力-准确性关系的符号发生了反转。在常规截止点上的单阈值评分忽略了尾部上端的成本;包含尾部的评分在相同的输出上反转了能力-准确性关系的符号。我们建议LLM预测评估使用连续(且无界)的准确性度量以及有界的二元阈值度量。

英文摘要

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, whereas tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
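
The contrast between bounded threshold metrics and unbounded quantile metrics can be seen in a small numeric sketch. The pinball (quantile) loss and the Brier score are standard scoring rules. The specific numbers, the threshold of 50, and the two hypothetical forecasters are invented for illustration and are not taken from the paper.

```python
def pinball_loss(outcome: float, forecast: float, q: float) -> float:
    """Quantile (pinball) loss for a forecast of the q-th quantile; unbounded above."""
    diff = outcome - forecast
    return max(q * diff, (q - 1.0) * diff)


def brier_score(prob: float, event_occurred: bool) -> float:
    """Brier score for a single binary event; always within [0, 1]."""
    return (prob - (1.0 if event_occurred else 0.0)) ** 2


outcome = 100.0

# Two hypothetical forecasters give the same probability that the outcome
# exceeds a threshold of 50, but very different 95th-percentile forecasts.
threshold_prob = 0.9
modest_q95 = 150.0
extrapolating_q95 = 10_000.0

brier_modest = brier_score(threshold_prob, outcome > 50.0)
brier_extrapolating = brier_score(threshold_prob, outcome > 50.0)
pinball_modest = pinball_loss(outcome, modest_q95, 0.95)
pinball_extrapolating = pinball_loss(outcome, extrapolating_q95, 0.95)
```

Both forecasters receive the same Brier score on the threshold question, while the pinball loss at the 95th percentile is about 200 times larger for the forecaster whose upper tail tracks an aggressive extrapolation. A single-threshold metric cannot register that difference.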

2605.21504 2026-05-25 q-fin.ST cs.AI 版本更新

Multivariate Financial Forecasting using the Chronos Time Series Foundation Models

使用Chronos时间序列基础模型进行多元金融预测

Sanjiv R Das, Tarang Goyal, Mohini Yadav

发表机构 * Santa Clara University(圣克拉拉大学)

AI总结 本文利用开源时间序列基础模型Chronos-2,评估预训练时间序列模型在经济与金融预测中的表现,重点研究多变量(MV)输入相比单变量(UV)基线是否能提升预测精度。研究覆盖了七只优质股票、美国国债利率及其组合面板,通过2000年至2025年的滚动月度评估,结果显示多变量预测在利率和股票数据中均显著优于单变量预测,且误差分布更集中。研究还指出,跨市场混合时间序列会降低预测准确性,表明引入噪声背景可能影响模型性能,整体表明基础模型可通过跨序列信息提升金融预测精度,尤其在结构化滚动协议下效果更佳。

Comments 10 pages, 3 tables, 3 figures

详情
AI中文摘要

使用开源时间序列基础模型Chronos-2,我们评估了预训练时间序列模型在经济和金融预测中的表现,重点研究多元输入相对于单变量基线是否提高了准确性。研究涵盖两个面板——Magnificent-7股票和美国国债利率——以及一个组合面板,使用2000年至2025年的滚动月度评估。我们改变输入窗口长度和预测范围,并报告RMSE和MAPE。跨数据集,多元预测一致优于单变量预测,利率的增益尤为强劲,股票也有显著改善。序列级比较显示多元输入在所有情况下均有改进,且误差离散度通常更低。我们还提供了参数热图和时间序列可视化。然而,混合股票和利率市场的时间序列会降低预测准确性,表明添加噪声上下文会降低模型性能。总体而言,结果表明基础模型可以利用跨序列信息提高金融预测准确性,并且在严格滚动协议下对相关序列进行联合建模时收益最大。除了使用开源基础模型外,本文还展示了AI如何用于金融研究。

英文摘要

Using Chronos-2, an open-source time-series foundation model, we evaluate pretrained time-series models for economic and financial forecasting with an emphasis on whether multivariate (MV) inputs improve accuracy relative to univariate (UV) baselines. The study covers two panels -- the Magnificent-7 equities and U.S. Treasury interest rates -- as well as a combined panel, using rolling monthly evaluations from 2000--2025. We vary input window lengths and forecast horizons and report RMSE and MAPE. Across datasets, MV forecasts consistently outperform UV forecasts, with especially strong gains for interest rates and meaningful improvements for equities. Series-level comparisons show MV improvements in every case, and error dispersion is generally lower under MV inputs. We also provide parameter-heatmap and time-series visualizations. However, mixing time series across equity and interest rate markets reduces forecast accuracy, indicating that adding noisy context degrades model performance. Overall, the results indicate that foundation models can leverage cross-series information to improve forecast accuracy in finance, and that the benefits are strongest when related series are modeled jointly under disciplined rolling protocols. Other than using an open-source foundation model, this paper also showcases how AI may be used for financial research.
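
The evaluation protocol named in this abstract (rolling windows scored with RMSE and MAPE) can be sketched as follows. The last-value forecaster is a hypothetical placeholder, not Chronos-2, and the function names and toy series are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rolling_forecasts(
    series: Sequence[float],
    window: int,
    forecaster: Callable[[Sequence[float]], float],
) -> Tuple[List[float], List[float]]:
    """One-step-ahead rolling evaluation: each forecast sees only the
    `window` observations immediately before its target."""
    actual, predicted = [], []
    for end in range(window, len(series)):
        context = series[end - window:end]
        predicted.append(forecaster(context))
        actual.append(series[end])
    return actual, predicted


def last_value(context: Sequence[float]) -> float:
    """Placeholder forecaster: repeat the most recent observation."""
    return context[-1]


actual, predicted = rolling_forecasts([1.0, 2.0, 3.0, 4.0, 5.0], 2, last_value)
score_rmse = rmse(actual, predicted)
score_mape = mape(actual, predicted)
```

Swapping `last_value` for a univariate or multivariate model call, while keeping the windowing fixed, gives the kind of like-for-like comparison the abstract describes.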

2605.19069 2026-05-25 cs.CL cs.AI 版本更新

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

商业ASR系统在代码切换语音上的基准测试:阿拉伯语、波斯语和德语

Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad

发表机构 * Perle AI

AI总结 本文研究了自动语音识别(ASR)系统在语言代码转换(Code-Switching)场景下的性能,针对阿拉伯语、波斯语和德语与英语之间的四种语言对进行了评估。通过一个两阶段的筛选流程,选取了300个样本,并使用BERTScore和词错误率(WER)进行测评,发现不同指标对系统排名的一致性及质量差距的反映存在差异。研究还揭示了商业ASR系统在处理代码转换语音时的性能差距,并公开了相关数据集以供进一步研究。

详情
AI中文摘要

代码切换——在同一话语中两种语言的自然交替——仍然是自动语音识别(ASR)中最具挑战性和研究不足的条件之一。我们提出了一个基准测试,评估了五个商业ASR提供商在四种语言对上的表现:埃及阿拉伯语-英语、沙特阿拉伯语(纳吉迪/希贾兹)-英语、波斯语(法尔西)-英语和德语-英语,每对包含300个样本,通过结合启发式过滤和GPT-4o与Gemini 1.5 Pro集成评分器的两阶段管道选择,将LLM成本降低约91%。我们在WER和BERTScore上进行评估,表明虽然两个指标在阿拉伯语和波斯语对的系统排序上一致(τ=1.0),但WER通过惩罚语义正确的音译选择,将质量差距的幅度夸大约3倍。ElevenLabs Scribe v2实现了最低的WER(总体13.2%),并在BERTScore上领先(总体0.936)。难度分层分析揭示了被总体平均值掩盖的性能差距,BERT嵌入投影证实了参考和假设之间的语义接近性,尽管存在表面脚本差异。数据集公开于https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch。

英文摘要

Code-switching -- the natural alternation between two languages within a single utterance -- remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic-English, Saudi Arabic (Najdi/Hijazi)-English, Persian (Farsi)-English, and German-English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by approximately 91%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs (τ = 1.0), WER inflates the magnitude of quality gaps by approximately 3× by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.
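
The transliteration penalty mentioned in this abstract follows directly from how WER is defined. The sketch below implements word-level edit distance with no text normalization, which is an assumption since the benchmark's preprocessing is not described here, and the example utterance is invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,                           # deletion
                    cur[j - 1] + 1,                        # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)


# Same spoken word, written in Arabic script vs. Latin transliteration.
wer_script_mismatch = wer("يلا let's go", "yalla let's go")
wer_exact = wer("yalla let's go", "yalla let's go")
wer_insertion = wer("the cat sat", "the cat sat down")
```

The script mismatch counts as a full substitution, giving a WER of one third even though the transcript is semantically correct. An embedding-based metric such as BERTScore is less sensitive to this kind of surface difference, which matches the gap the abstract reports.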

2604.26145 2026-05-25 cs.HC cs.AI 版本更新

Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

Ceci n'est pas une explication: 评估语言学习系统中作为可解释性陷阱的解释失败

Ben Knight, Wm. Matthew Kennedy, Danielle Carvalho, Isaac Pattis, James Edgell

发表机构 * Oxford University Press(牛津大学出版社) Oxford Internet Institute(牛津互联网研究所) University of Oxford(牛津大学)

AI总结 该研究探讨了人工智能语言学习系统中解释性失败的问题,指出这些系统提供的即时反馈可能在表面看似有帮助,但实际上存在根本性缺陷,可能加剧学习者的误解并影响学习效果。研究提出了L2-Bench基准,用于评估语言教育中的AI系统,涵盖诊断准确性、错误原因分析等多个关键反馈维度,并分析了AI在这些维度上的失效方式及其带来的“可解释性陷阱”。研究强调了语言学习场景下这些风险的特殊性,并呼吁在设计评估框架时更加关注相关问题。

Comments Accepted to Misleading Impacts Resulting from AI Generated Explanations (MIRAGE) Workshop @ IUI 2026

详情
AI中文摘要

AI驱动的语言学习工具日益为全球数百万学习者提供即时、个性化的反馈。然而,这种反馈可能以学习者甚至教师难以察觉的方式失败,长期使用可能强化误解并侵蚀学习效果。我们提出了L2-Bench的一部分,这是一个用于评估语言教育中AI系统的基准,包括(但不限于)有效反馈的六个关键维度:诊断准确性、适当性意识、错误原因、优先级排序、改进指导和支持自我调节。我们分析了AI系统在这些维度上可能失败的方式。这些失败,我们认为会导致“可解释性陷阱”,即表面上看似有帮助但本质上有缺陷的AI生成解释,增加了成就、人机交互和社会情感危害的风险。我们讨论了语言学习的特定背景如何放大这些风险,并概述了在设计评估框架时我们认为值得更多关注的开放问题。我们的分析旨在扩展社区对可解释性陷阱类型学及其可能发生的上下文动态的理解,以鼓励AI开发者更好地设计安全、可信和有效的AI解释。

英文摘要

AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.

2604.24920 2026-05-25 cs.CR cs.AI 版本更新

SUDP: Secret-Use Delegation Protocol for Agentic Systems

SUDP: 面向智能体系统的秘密使用委托协议

Xiaohang Yu, Hejia Geng, Xinmeng Zeng, William Knottenbelt

发表机构 * Imperial College London(伦敦帝国学院) University of Oxford(牛津大学) Stanford University(斯坦福大学)

AI总结 随着代理系统越来越多地使用用户秘密进行API调用、消息平台和云服务操作,现有的运行时授权机制往往通过暴露秘密或其衍生物来实现,导致潜在的安全风险。本文提出了一种名为SUDP的机密使用委托协议,旨在确保用户授权的秘密操作不被滥用,且不赋予请求者持久的访问权限。该协议通过用户授权、请求者提出操作、托管方执行有限使用的方式,满足七个关键安全属性,在结合硬件根运行时的情况下,能够在标准密码学假设下保障秘密的完整性和机密性。

详情
AI中文摘要

智能体系统越来越多地使用用户秘密来访问API、消息平台和云服务。当前的智能体运行时通常通过暴露来实现授权:启用操作通常意味着将可重复使用的秘密或由其派生的可重复使用工件放入运行时,因此瞬时的提示注入或工具侧妥协就会变成持久的账户妥协。现有的防御措施涵盖了相邻的部分,如秘密存储、范围委托、发送者约束令牌和运行时监控,但未能为组合的智能体义务提供通用规范:一个不可信的自主请求者应该能够发起用户授权的秘密支持操作,而不会获得对该操作的可重复使用权限。我们将此形式化为智能体秘密使用(ASU)问题,并确定了任何解决方案必须满足的七项安全属性,涵盖授权完整性和秘密机密性。我们提出了秘密使用委托协议(SUDP),其中请求者提出规范操作,用户使用新鲜的身份验证器支持的授权进行授权,保管人兑现授权以执行有限的使用;可重复使用的权限永远不会跨越请求者边界。我们将SUDP专门用于LLM驱动的智能体,每当工具调用会使用用户注册的授权材料时,它都适用。在标准密码学假设下,当与硬件根运行时集成时,SUDP满足所有七项属性。参考实现可在https://github.com/xhyumiracle/sudp获取。

英文摘要

Agentic systems increasingly act with user secrets for APIs, messaging platforms, and cloud services. Today's agent runtimes typically implement authorization by exposure: enabling action often means placing a reusable secret, or a reusable artifact derived from it, inside the runtime, so a transient prompt-injection or tool-side compromise becomes durable account compromise. Existing defenses cover adjacent pieces such as secret storage, scoped delegation, sender-constrained tokens, and runtime monitoring, but leave the combined agentic obligation without a common specification: an untrusted autonomous requester should be able to cause a user-authorized secret-backed operation without gaining reusable authority over it. We formalize this as the Agent Secret Use (ASU) problem and identify seven security properties any solution must satisfy, spanning authorization integrity and secret confidentiality. We propose the Secret-Use Delegation Protocol (SUDP), in which a requester proposes a canonical operation, the user authorizes it with a fresh authenticator-backed grant, and a custodian redeems the grant to perform the bounded use; reusable authority never crosses the requester boundary. We specialize SUDP for LLM-driven agents, where it applies whenever a tool call would exercise user-enrolled authority-bearing material. Under standard cryptographic assumptions, SUDP satisfies all seven properties when integrated with a hardware-rooted runtime. A reference implementation is available at https://github.com/xhyumiracle/sudp.

2604.24021 2026-05-25 cs.AI math.AP 版本更新

QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

QED:一个用于生成开放问题数学证明的开源多智能体系统

Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang

AI总结 本文介绍了一个名为 QED 的开源多智能体系统,旨在无需人工干预即可将人类提出的研究问题转化为完整的数学证明。该系统通过分离规划、证明和验证三个阶段,有效克服了单一查询证明生成的常见缺陷,其中分解代理负责结构规划,证明代理生成候选论证,验证代理检查正确性。在与领域专家合作的评估中,QED 在 18 个不同难度的研究项目上表现出色,成功生成了五项原创性研究成果,其中三项被认为具有与主流数学期刊相当的深度和广度。

详情
AI中文摘要

我们提出QED,一个开源的多智能体系统,它能够将人类提供的研究问题转化为完整的数学证明,无需进一步的人类指导。其流水线旨在通过分离规划、证明和验证来克服单次查询证明生成的常见失败:分解智能体结构化证明搜索,证明智能体生成候选论证,验证智能体检查正确性。与领域专家合作,我们在18个不同难度的研究级项目上评估了QED。QED在代数几何、流体偏微分方程、概率和反问题领域产生了五篇原创工作。专家评估认为这些工作是扎实的专业研究贡献,其中三篇在难度和范围上与常见于成熟专业数学场所发表的工作相当。QED发布于https://github.com/proofQED/QED。

英文摘要

We present QED, an open-source multi-agent system that turns human-provided research questions into complete mathematical proofs without further human guidance. Its pipeline is designed to overcome common failures of single-query proof generation by separating planning, proving, and verification: a decomposition agent structures the proof search, prover agents generate candidate arguments, and verifier agents check correctness. In collaboration with domain experts, we evaluated QED on 18 research-level projects of varying difficulty. QED produced five original works across algebraic geometry, fluid PDEs, probability, and inverse problems. Expert assessments regard these works as solid specialized research contributions, with three comparable in difficulty and scope to work commonly published in established specialist mathematics venues. QED is released at https://github.com/proofQED/QED.

2604.11759 2026-05-25 cs.AI 版本更新

Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

检索是不够的:为什么组织AI需要认知基础设施

Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano

发表机构 * Kakashi Ventures Accelerator (KVA)(Kakashi Ventures加速器) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文指出,当前组织中AI使用的知识通常缺乏认知结构,仅依赖检索无法准确区分决策、假设、争议和未知问题等不同知识状态。为此,研究提出了OIDA框架,通过引入知识对象、重要性评分和矛盾关系等机制,构建具有认知一致性的知识表示系统,并引入“问题”作为组织未知的建模方式,提升AI对组织认知状态的理解能力。实验表明,OIDA在保持知识质量方面具有显著优势,并验证了其核心机制的有效性。

Comments 10 pages, 2 figures, 8 tables, 6 appendices

详情
AI中文摘要

AI代理使用的组织知识通常缺乏认知结构:检索系统会呈现语义相关的内容,而不区分约束性决策与放弃的假设、有争议的主张与已解决的问题、已知事实与未解决的问题。我们认为,组织AI的上限不是检索保真度,而是认知保真度——即系统将承诺强度、矛盾状态和组织无知表示为可计算属性的能力。我们提出了OIDA,这是一个框架,将组织知识结构化为类型化的知识对象,这些对象带有认知类别、具有类别特定衰减的重要性分数以及带符号的矛盾边。知识重力引擎以确定性方式维护分数,并具有经过证明的收敛保证(充分条件:最大度数<7;经验上对度数为43的情况鲁棒)。OIDA引入了“问题”作为模型化的无知:一种具有反向衰减的原语,以越来越紧迫的方式揭示组织不知道什么——这是所有被调查系统中缺失的机制。我们描述了认知质量评分(EQS),一种包含五个组成部分的评估方法,并带有明确的循环性分析。在受控比较(n=10个响应对)中,OIDA的RAG条件(3,868个令牌)达到EQS 0.530,而全上下文基线(108,687个令牌)为0.848;28.1倍的令牌预算差异是主要的混淆因素。问题机制在统计上得到验证(Fisher p=0.0325,OR=21.0)。形式化属性已建立;在相等令牌预算下的决定性消融实验(E4)已预注册但尚未运行。

英文摘要

Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but epistemic fidelity: the system's ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree < 7; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does not know with increasing urgency, a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison (n = 10 response pairs), OIDA's RAG condition (3,868 tokens) achieves EQS 0.530 vs. 0.848 for a full-context baseline (108,687 tokens); the 28.1× token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher p = 0.0325, OR = 21.0). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.

2603.23565 2026-05-25 cs.LG cs.AI 版本更新

Safe Reinforcement Learning with Preference-based Constraint Inference

基于偏好的约束推断的安全强化学习

Chenglin Li, Grant Ruan, Hua Geng

发表机构 * Department of Automation, Tsinghua University, Beijing, China; Laboratory for Information & Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA

AI总结 本文研究了安全强化学习中如何从人类偏好中高效且可靠地学习复杂的安全约束。针对现有方法依赖专家演示或限制性假设的问题,提出了一种基于偏好的约束强化学习框架(PbCRL),通过引入死区机制和信噪比损失,提升了对安全成本分布的建模能力,并优化了策略学习过程。实验表明,该方法在满足安全约束和提升奖励方面优于现有先进方法,为安全关键场景中的约束推理提供了有效解决方案。

Comments Accepted by the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

安全强化学习(RL)是安全关键决策的标准范式。然而,现实世界中的安全约束可能复杂、主观,甚至难以明确指定。现有的约束推断工作依赖于限制性假设或大量的专家演示,这在许多实际应用中并不现实。如何廉价且可靠地学习这些约束是我们本研究关注的主要挑战。虽然从人类偏好中推断约束提供了一种数据高效的替代方案,但我们发现流行的Bradley-Terry(BT)模型未能捕捉安全成本的非对称、重尾特性,导致风险低估。在文献中,理解BT模型对下游策略学习的影响仍然很少。为了解决上述知识空白,我们提出了一种新颖的方法,即基于偏好的约束强化学习(PbCRL)。我们在偏好建模中引入了一种新颖的死区机制,并从理论上证明它鼓励重尾成本分布,从而实现更好的约束对齐。此外,我们引入了信噪比(SNR)损失,通过成本方差鼓励探索,这被发现有利于策略学习。进一步,采用两阶段训练策略以降低在线标注负担,同时自适应地增强约束满足。实验结果表明,PbCRL实现了与真实安全要求的优越对齐,并在安全性和奖励方面优于最先进的基线。我们的工作为安全RL中的约束推断探索了一种有前景且有效的方法,在各种安全关键应用中具有巨大潜力。

英文摘要

Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to specify explicitly. Existing work on constraint inference relies on restrictive assumptions or extensive expert demonstrations, which are unrealistic in many real-world applications. How to learn these constraints cheaply and reliably is the main challenge addressed in this study. While inferring constraints from human preferences offers a data-efficient alternative, we find that popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. The impact of BT models on downstream policy learning also remains little studied in the literature. To address these gaps, we propose a novel approach, Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss that encourages exploration through cost variance, which we find benefits policy learning. Further, a two-stage training strategy is deployed to lower the online labeling burden while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective approach to constraint inference in safe RL, with great potential in various safety-critical applications.
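
For context, the standard Bradley-Terry preference model that this abstract criticizes can be written in a few lines. The sketch shows only the textbook BT likelihood. The paper's dead zone mechanism and SNR loss are not specified in the abstract and are deliberately not reproduced here, and the function names are illustrative.

```python
import math
from typing import Iterable, Tuple


def bt_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item a is preferred over item b.

    Depends only on the score difference, through a logistic function.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_nll(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (winner_score, loser_score) pairs."""
    pairs = list(pairs)
    return -sum(math.log(bt_prob(w, l)) for w, l in pairs) / len(pairs)


p_equal = bt_prob(0.0, 0.0)
p_forward = bt_prob(3.0, 1.0)
p_backward = bt_prob(1.0, 3.0)
loss_consistent = bt_nll([(2.0, 0.0)])
loss_inconsistent = bt_nll([(0.0, 2.0)])
```

The model is symmetric by construction: swapping the two items maps the probability p to 1 - p, and only the score difference matters. That symmetry is one way to read the abstract's claim that BT struggles with asymmetric, heavy-tailed safety costs.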

2602.11146 2026-05-25 cs.CV cs.AI 版本更新

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

超越基于VLM的奖励:扩散原生潜在奖励建模

Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo

发表机构 * The Hong Kong University of Science and Technology; Huawei Hong Kong AI Framework & Data Technologies Lab; Tsinghua University; The Australian National University

AI总结 本文提出了一种基于扩散模型的原生潜在奖励模型DiNa-LRM,旨在解决扩散和流匹配模型在偏好优化中对奖励函数的需求。该方法直接在扩散过程的噪声状态上进行偏好学习,引入了与扩散噪声相关的不确定性校准的Thurstone似然函数,从而提升了奖励模型的判别鲁棒性和计算效率。实验表明,DiNa-LRM在图像对齐任务中显著优于现有的扩散奖励基线,并以更低的计算成本达到与最先进视觉语言模型相当的性能,同时提升了偏好优化的动态效率。

Comments Accepted by ICML 2026. Code: https://github.com/HKUST-C4G/diffusion-rm

详情
AI中文摘要

扩散和流匹配模型的偏好优化依赖于既具有判别鲁棒性又计算高效的奖励函数。视觉语言模型(VLM)凭借其丰富的多模态先验,已成为主要的奖励提供者,用于指导对齐。然而,它们的计算和内存成本可能很高,并且通过像素空间奖励优化潜在扩散生成器会引入域不匹配,使对齐复杂化。在本文中,我们提出DiNa-LRM,一种扩散原生潜在奖励模型,直接在噪声扩散状态上制定偏好学习。我们的方法引入了一种噪声校准的Thurstone似然,具有扩散噪声依赖的不确定性。DiNa-LRM利用预训练的潜在扩散骨干网络,配备时间步条件奖励头,并支持推理时噪声集成,提供了一种扩散原生的机制用于测试时缩放和鲁棒奖励。在图像对齐基准测试中,DiNa-LRM显著优于现有的基于扩散的奖励基线,并以一小部分计算成本实现了与最先进VLM竞争的性能。在偏好优化中,我们证明DiNa-LRM改善了偏好优化动态,实现了更快且更资源高效的模型对齐。

英文摘要

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
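A Thurstone preference model with noise-dependent uncertainty can be sketched as follows. The probit form is the standard Thurstone model; the linear schedule in `sigma_from_noise` is a hypothetical stand-in, since the abstract does not state how the uncertainty depends on the diffusion noise level.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def thurstone_preference(reward_a: float, reward_b: float,
                         sigma_a: float, sigma_b: float) -> float:
    """Thurstone model: P(A preferred over B) when each latent reward is
    Gaussian with its own standard deviation."""
    return normal_cdf((reward_a - reward_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2))


def sigma_from_noise(noise_level: float, base: float = 1.0, scale: float = 1.0) -> float:
    """Hypothetical schedule: uncertainty grows linearly with the noise level."""
    return base + scale * noise_level


if __name__ == "__main__":
    for noise in (0.0, 0.5, 1.0):
        sigma = sigma_from_noise(noise)
        print(noise, thurstone_preference(2.0, 1.0, sigma, sigma))
```

With this kind of calibration, the same reward gap yields a less confident preference at noisier diffusion states, so heavily noised inputs contribute softer training targets.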

2601.03715 2026-05-25 cs.LG cs.AI 版本更新

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

R$^3$L: 反思-重试强化学习与语言引导探索、关键信用和正向放大

Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li

发表机构 * Tongyi Lab(通义实验室) Soochow University(苏州大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 R$^3$L 是一种结合语言引导探索、关键信用分配和正向增强的强化学习方法,旨在解决大语言模型在推理和智能体能力训练中面临的探索与利用难题。该方法通过“反思-重试”机制合成高质量轨迹,利用语言反馈定位错误并优化失败路径,同时仅更新存在差异的轨迹后缀以提高信用分配精度,并通过增强成功轨迹的权重来稳定训练过程。实验表明,R$^3$L 在多个任务中相较基线方法实现了显著性能提升,同时保持了训练稳定性。

详情
AI中文摘要

强化学习推动了LLM推理和智能体能力的最新进展,但当前方法在探索和利用方面均存在困难。探索方面,困难任务成功率低且从头开始重复rollout成本高;利用方面,粗粒度的信用分配和训练不稳定:轨迹级奖励因后续错误惩罚有效前缀,且失败主导的群体淹没少数正向信号,使优化缺乏建设性方向。为此,我们提出R$^3$L,即反思-重试强化学习与语言引导探索、关键信用和正向放大。为合成高质量轨迹,R$^3$L通过反思-重试从随机采样转向主动合成,利用语言反馈诊断错误,将失败尝试转化为成功尝试,并通过从识别出的失败点重启来降低rollout成本。在错误被诊断和定位后,关键信用分配仅更新存在对比信号的分叉后缀,排除共享前缀的梯度更新。由于困难任务中失败占主导且反思-重试产生离策略数据,可能导致训练不稳定,正向放大提高成功轨迹的权重,确保正向信号引导优化过程。在智能体和推理任务上的实验表明,与基线相比,相对提升5%到52%,同时保持训练稳定性。我们的代码已发布在https://github.com/shiweijiezero/R3L。

英文摘要

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
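The idea of updating only the diverging suffix can be illustrated with a loss mask over two trajectories that share a prefix. This is a minimal sketch under the assumption that trajectories are step sequences compared element by element; the function name and the example steps are hypothetical and not taken from the R3L code base.

```python
from typing import List, Sequence, Tuple


def diverging_suffix_masks(failed: Sequence[str],
                           retried: Sequence[str]) -> Tuple[int, List[int], List[int]]:
    """Return the pivot index (length of the shared prefix) and one 0/1 mask
    per trajectory: 0 excludes a step from the gradient update, 1 keeps it."""
    pivot = 0
    for step_failed, step_retried in zip(failed, retried):
        if step_failed != step_retried:
            break
        pivot += 1
    failed_mask = [0] * pivot + [1] * (len(failed) - pivot)
    retried_mask = [0] * pivot + [1] * (len(retried) - pivot)
    return pivot, failed_mask, retried_mask


if __name__ == "__main__":
    failed = ["open", "drawer", "take", "fork"]
    retried = ["open", "drawer", "take", "knife", "close"]
    print(diverging_suffix_masks(failed, retried))
```

Multiplying per-step losses by these masks leaves the shared prefix untouched, so steps that were valid in both attempts are not penalized for a later error.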

2512.15767 2026-05-25 cs.LG cs.AI 版本更新

Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework

连接数据与物理:基于图神经网络的混合孪生框架

M. Gorpinich, B. Moya, S. Rodriguez, F. Meraghni, Y. Jaafra, A. Briot, M. Henner, R. Leon, F. Chinesta

发表机构 * Valeo(法雷奥) PIMM Lab. ENSAM Institute of Technology(ENSAM技术学院PIMM实验室)

AI总结 该研究提出了一种基于图神经网络的混合孪生框架,旨在解决物理仿真中因模型简化或未建模效应导致的“无知模型”问题。通过结合物理模型与数据驱动方法,该方法利用图神经网络学习稀疏空间测量中的缺失物理规律,从而在减少数据需求的前提下提升仿真精度与可解释性。实验表明,该框架在不同网格、几何和负载位置的非线性热传导问题中均表现出良好的泛化能力与修正效果。

Comments 27 pages, 14 figures

详情
AI中文摘要

模拟复杂的非定常物理现象依赖于详细的数学模型,例如通过有限元方法(FEM)进行仿真。然而,由于未建模效应或简化假设,这些模型通常与实际情况存在差异。我们将这种差距称为无知模型。纯数据驱动的方法试图学习整个系统的行为,但需要跨越整个空间和时间域的大量高质量数据。在现实场景中,此类信息不可用,使得完全数据驱动的建模不可靠。为了克服这一限制,我们采用混合孪生方法对无知分量进行建模,而不是从头模拟现象。由于基于物理的模型近似了现象的整体行为,剩余的无知通常比完整的物理响应复杂度低,因此可以用更少的数据进行学习。然而,一个关键困难是空间测量是稀疏的,并且在实际中获取不同空间配置下同一现象的数据具有挑战性。我们的贡献是通过使用图神经网络(GNN)来表示无知模型来克服这一限制。即使测量位置数量有限,GNN也能学习缺失物理的空间模式。这使得我们能够用数据驱动的修正来丰富基于物理的模型,而无需密集的空间、时间和参数数据。为了展示所提出方法的性能,我们在不同网格、几何形状和载荷位置的非线性热传导问题上评估了这种基于GNN的混合孪生方法。结果表明,GNN成功捕获了无知并泛化了跨空间配置的修正,提高了仿真精度和可解释性,同时最小化了数据需求。

英文摘要

Simulating complex unsteady physical phenomena relies on detailed mathematical models, simulated for instance by using the Finite Element Method (FEM). However, these models often exhibit discrepancies from reality due to unmodeled effects or simplifying assumptions. We refer to this gap as the ignorance model. While purely data-driven approaches attempt to learn full system behavior, they require large amounts of high-quality data across the entire spatial and temporal domain. In real-world scenarios, such information is unavailable, making full data-driven modeling unreliable. To overcome this limitation, we model the ignorance component using a hybrid twin approach instead of simulating the phenomena from scratch. Since physics-based models approximate the overall behavior of the phenomena, the remaining ignorance is typically lower in complexity than the full physical response and can therefore be learned with significantly less data. A key difficulty, however, is that spatial measurements are sparse, and obtaining data on the same phenomenon for different spatial configurations is challenging in practice. Our contribution is to overcome this limitation by using Graph Neural Networks (GNNs) to represent the ignorance model. GNNs learn the spatial pattern of the missing physics even when the number of measurement locations is limited. This allows us to enrich the physics-based model with data-driven corrections without requiring dense spatial, temporal and parametric data. To showcase the performance of the proposed method, we evaluate this GNN-based hybrid twin on nonlinear heat transfer problems across different meshes, geometries, and load positions. Results show that the GNN successfully captures the ignorance and generalizes corrections across spatial configurations, improving simulation accuracy and interpretability, while minimizing data requirements.

2511.22521 2026-05-25 cs.CV cs.AI 版本更新

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

DocVAL:面向可定位文档视觉问答的经验证思维链蒸馏

Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Ser-Nam Lim, Rajiv Ramnath

发表机构 * Department of Computer Science and Engineering, Ohio State University, Ohio, US(俄亥俄州立大学计算机科学与工程系,美国俄亥俄州) Department of Computer Science, University of Central Florida, Florida, US(中佛罗里达大学计算机科学系,美国佛罗里达州)

AI总结 DocVAL 是一种用于文档视觉问答(VQA)的验证式思维链(CoT)蒸馏框架,旨在将大型视觉语言模型(VLM)中的精确空间推理能力转移到更高效的紧凑模型中。该方法结合了教师模型生成的空间推理监督、基于规则的双模式验证器以过滤低质量训练信号,并采用两阶段训练流程进行迭代优化,最终使学生模型无需OCR或检测模块即可独立运行。实验表明,DocVAL 在多个基准测试中显著提升了紧凑模型的定位性能,并引入了mAP作为新的定位评估指标。

详情
AI中文摘要

文档视觉问答要求模型不仅正确回答问题,还要在复杂文档布局中精确定位答案。大型视觉语言模型(VLM)具有强大的空间定位能力,但其推理成本和延迟限制了实际部署。紧凑型VLM更高效,但在标准微调或蒸馏下常出现显著的定位退化。为解决这一问题,我们提出DocVAL,一种验证链式思维(CoT)蒸馏框架,将显式空间推理从大型教师模型转移到紧凑、可部署的学生VLM。DocVAL结合了(1)教师生成的空间CoT监督,(2)基于规则的双模式验证器,过滤低质量训练信号并提供细粒度像素级纠正反馈,以及(3)验证驱动的两阶段训练过程与迭代细化。文本检测仅作为训练时的监督和验证脚手架,使得最终学生模型在推理时作为纯VLM运行,无需OCR或检测。在多个文档理解基准上,DocVAL相比可比的紧凑VLM持续提升高达6-7个ANLS点。我们进一步引入平均精度(mAP)作为文档问答的定位指标,并在此新评估下报告了强大的空间定位性能。我们发布了95K验证器验证的CoT轨迹,并表明高质量、验证过的监督比扩展未过滤数据更有效,实现了高效且可信的文档定位。代码/数据:https://github.com/ahmad-shirazi/DocVAL

英文摘要

Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Code/Data: https://github.com/ahmad-shirazi/DocVAL
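For readers unfamiliar with the ANLS points quoted above, the sketch below computes Average Normalized Levenshtein Similarity in its commonly used form. The 0.5 threshold and the lowercase/strip normalization follow the usual convention for document VQA benchmarks and are assumptions here, since the abstract does not restate them.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str], references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best (1 - normalized distance) across the
    accepted answers, counting any match at or beyond the threshold as 0."""
    total = 0.0
    for prediction, answers in zip(predictions, references):
        best = 0.0
        for answer in answers:
            p, a = prediction.strip().lower(), answer.strip().lower()
            longest = max(len(p), len(a))
            distance = levenshtein(p, a) / longest if longest else 0.0
            if distance < threshold:
                best = max(best, 1.0 - distance)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    print(anls(["hello", "abc"], [["hallo"], ["xyz"]]))
```

ANLS is usually reported on a 0 to 100 scale, so a gain of 6 to 7 points corresponds to 0.06 to 0.07 in the value returned here.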

2511.03882 2026-05-25 cs.CV cs.AI cs.LG cs.RO 版本更新

Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

自主X光引导脊柱手术的机器人控制策略学习研究

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Technical University of Munich(慕尼黑工业大学) Johns Hopkins School of Medicine(约翰霍普金斯医学院)

AI总结 本文研究了基于模仿学习的机器人控制策略在X射线引导脊柱手术中的应用,特别是在椎体成形术中导管插入任务中的可行性与挑战。研究构建了一个高度逼真的仿真环境,并构建了包含正确操作轨迹和双平面X射线序列的数据集,用于训练仅依赖视觉信息的模仿学习策略。实验表明,该策略在多种脊柱解剖结构和初始条件下均能实现安全的导管插入,为未来轻量化、无需CT的术中脊柱机器人导航提供了基础。

详情
AI中文摘要

基于模仿学习的机器人控制策略在基于视频的机器人学中重新受到关注。然而,对于稀疏输入的X光引导手术(如脊柱内固定),这种方法是否适用尚不清楚。我们研究了在双平面引导的套管针插入中模仿策略学习的可行性、机遇和挑战。我们开发了一个用于可扩展、自动化模拟X光引导脊柱手术的计算机沙盒,具有高度逼真性。我们整理了一个包含正确轨迹和相应双平面X光序列的数据集,模拟了提供者的逐步对齐过程。然后,我们训练了用于规划和开环控制的模仿学习策略,该策略仅基于视觉信息在椎体成形术环境中迭代对齐套管针。这种精确控制的设置提供了对该方法局限性和能力的见解。我们的策略在68.5%的案例中首次尝试成功,在不同椎体水平上保持了安全的椎弓根内轨迹。该策略迁移到了复杂解剖结构(包括骨折)以及不同的解剖结构和初始位置。在真实X光上的展开表明,具有合理轨迹的部分仿真到真实迁移是可能的。尽管这些初步结果令人鼓舞,但我们还发现了局限性,特别是在入口点精度方面。当前的结果为未来的努力提供了明确的基准,而借助更稳健的先验和领域知识,此类模型可能为未来实现轻量级、无CT的机器人术中脊柱导航奠定基础。

英文摘要

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, as well as varied anatomies and initializations. Rollouts on real X-ray indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results present a clear benchmark for future efforts, while with more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

2508.13663 2026-05-25 cs.AI cs.LG 版本更新

Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints

具有软实体约束的知识图谱交互式查询回答

Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, Martijn Schut

发表机构 * Translational AI Laboratory, Department of Laboratory Medicine(转化人工智能实验室,检验医学系) Amsterdam University Medical Center, Vrije Universiteit Amsterdam(阿姆斯特丹大学医学中心,阿姆斯特丹自由大学) Accenture Labs(埃森哲实验室) Delft University of Technology(代尔夫特理工大学) ELLIS Institute Finland & Åbo Akademi University, Turku, Finland & Elsevier Discovery Lab, Amsterdam(芬兰ELLIS研究所 & 奥博学术大学,芬兰图尔库 & 爱思唯尔Discovery Lab,阿姆斯特丹)

AI总结 本文研究了在知识图谱中结合软实体约束进行交互式查询回答的问题,旨在处理现实场景中含模糊或上下文依赖约束的查询。为此,作者提出了两种高效方法,能够在不破坏原有查询结果排名结构的前提下,通过少量参数调整或小型神经网络学习软约束,从而提升查询结果的相关性。实验表明,该方法在保持原有查询性能的同时,有效融入了用户偏好,为知识图谱交互提供了更灵活的方式。

Comments Accepted in Transactions on Machine Learning Research (2026)

详情
AI中文摘要

针对不完整知识图谱的查询回答方法检索可能成为答案的实体,这在由于缺失边而无法通过直接图遍历达到此类答案时特别有用。然而,现有方法侧重于使用一阶逻辑形式化的查询。在实践中,许多现实世界的查询涉及固有模糊或上下文依赖的约束,例如对属性或相关类别的偏好。针对这一差距,我们引入了具有软约束的查询回答问题。我们形式化了该问题,并提出了两种高效方法,旨在通过融入软约束来调整查询答案分数,同时不破坏查询的原始答案。这些方法是轻量级的,只需调整两个参数或训练一个小型神经网络来捕获软约束,同时保持原始排序结构。为了评估该任务,我们通过生成带有软约束的数据集来扩展现有的QA基准。我们的实验表明,我们的方法能够捕获软约束,同时保持稳健的查询回答性能,并增加很少的开销。通过我们的工作,我们探索了一种与图数据库交互的新颖灵活方式,允许用户通过交互式提供示例来指定其偏好。

英文摘要

Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We formalize the problem and introduce two efficient methods designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. These methods are lightweight, requiring tuning only two parameters or a small neural network trained to capture soft constraints while maintaining the original ranking structure. To evaluate the task, we extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that our methods can capture soft constraints while maintaining robust query answering performance and adding very little overhead. With our work, we explore a new and flexible way to interact with graph databases that allows users to specify their preferences by providing examples interactively.

2508.12247 2026-05-25 cs.LG cs.AI 版本更新

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

STM3: 用于长期时空时间序列预测的多尺度Mamba混合模型

Haolong Chen, Liang Zhang, Zhengyuan Xin, Guangxu Zhu

发表机构 * Shenzhen Loop Area Institute(深圳河套学院)

AI总结 本文提出了一种名为STM3的新型深度学习模型,用于解决长期时空时间序列预测中的多尺度信息提取和空间依赖建模难题。STM3结合了多尺度Mamba架构与解耦的专家混合框架(DMoE),并引入自适应图因果网络以高效捕捉复杂的时空依赖关系。该模型通过稳定路由策略和因果对比学习策略,确保了表示学习的鲁棒性和多尺度信息的可区分性,实验表明其在多个现实数据集上均取得了优越的预测性能。

Comments Accepted by KDD 2026

详情
AI中文摘要

近年来,时空时间序列预测发展迅速,但现有深度学习方法难以高效学习复杂的长期时空依赖。长期时空依赖学习带来两个新挑战:1)长期时间序列自然包含多尺度信息,难以高效提取;2)不同节点的多尺度时间信息高度相关且难以建模。为解决这些问题,我们提出时空多尺度Mamba混合模型(STM3)。STM3在新型解耦混合专家(DMoE)框架内集成多尺度Mamba架构,以高效捕获多样的多尺度信息,同时利用自适应图因果网络建模复杂的空间依赖。为确保鲁棒的表示学习,我们引入稳定路由策略和因果对比学习策略,与层次信息聚合协同工作,保证尺度可区分性。我们理论上证明STM3实现了优越的路由平滑性,并保证了每个专家的模式解耦。在跨领域的10个真实世界基准上的大量实验表明,STM3具有优越性能,在长期时空时间序列预测中达到了最先进的结果。值得注意的是,在PEMSD8数据集上,它取得了显著改进,在MAE、RMSE和MAPE上分别超过第二好的模型7.1%、8.5%和15.9%。代码可在https://github.com/IfReasonable/STM3_KDD26获取。

英文摘要

Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence naturally includes multiscale information, which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose Spatio-Temporal Mixture of Multiscale Mamba (STM3). STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, we introduce a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. We theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. Extensive experiments on 10 real-world benchmarks across domains demonstrate STM3's superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction. Notably, on the PEMSD8 dataset, it achieves significant improvements, surpassing the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE. Code is available at https://github.com/IfReasonable/STM3_KDD26.

2506.05438 2026-05-25 cs.LG cs.AI 版本更新

An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics

一种用于动态健康指标构建的无监督框架及其在滚动轴承预测中的应用

Tongda Sun, Chen Yin, Huailiang Zheng, Yining Dong

发表机构 * School of Data Science(数据科学学院) Hong Kong Institute for Data Science, City University of Hong Kong, Hong Kong(香港数据科学研究院,香港城市大学,香港) College of Mechanical and Electrical Engineering, Harbin Engineering University, Harbin 150001, China(哈尔滨工程大学机电工程学院,哈尔滨150001,中国)

AI总结 本文提出了一种无需专家知识的无监督框架,用于构建动态健康指标(HI),以提升滚动轴承退化趋势建模与剩余寿命预测的准确性。该方法通过基于跳跃连接的自编码器自动提取退化特征,并在特征空间中引入嵌入内部预测模块的HI生成模块,显式建模HI状态的时序依赖关系,从而捕捉退化过程中的动态信息。实验结果表明,所提出的动态HI在两个轴承生命周期数据集上优于现有方法,显著提升了预测性能。

详情
AI中文摘要

健康指标(HI)在滚动轴承的退化评估和预测中起着关键作用。尽管已有多种HI构建方法被研究,但大多数依赖于专家知识进行特征提取,并忽略了捕捉序列退化过程中隐藏的动态信息,这限制了所构建HI在退化趋势表示和预测中的能力。为解决这些问题,通过一种无监督框架构建了考虑HI级时间依赖性的新型动态HI。具体而言,由基于跳跃连接的自编码器组成的退化特征学习模块首先将原始信号映射到代表性退化特征空间(DFS),以自动提取必要的退化特征,无需专家知识。随后,在该DFS中,提出了一种嵌入内部HI预测模块的新型HI生成模块用于动态HI构建,其中过去和当前HI状态之间的时间依赖性被保证并显式建模。在此基础上,动态HI捕捉了退化过程固有的动态内容,确保其在退化趋势建模和未来退化预测中的有效性。在两个轴承生命周期数据集上的实验结果表明,所提出的HI构建方法优于对比方法,且构建的动态HI在预测任务中表现更优。

英文摘要

The health indicator (HI) plays a key role in degradation assessment and prognostics of rolling bearings. Although various HI construction methods have been investigated, most of them rely on expert knowledge for feature extraction and fail to capture the dynamic information hidden in sequential degradation processes, which limits the ability of the constructed HI to represent degradation trends and support prognostics. To address these concerns, a novel dynamic HI that considers HI-level temporal dependence is constructed through an unsupervised framework. Specifically, a degradation feature learning module composed of a skip-connection-based autoencoder first maps raw signals to a representative degradation feature space (DFS) to automatically extract essential degradation features without the need for expert knowledge. Subsequently, in this DFS, a new HI-generating module embedded with an inner HI-prediction block is proposed for dynamic HI construction, where the temporal dependence between past and current HI states is guaranteed and modeled explicitly. On this basis, the dynamic HI captures the inherent dynamic contents of the degradation process, ensuring its effectiveness for degradation tendency modeling and future degradation prognostics. Experimental results on two bearing lifecycle datasets demonstrate that the proposed HI construction method outperforms comparison methods, and the constructed dynamic HI is superior for prognostic tasks.

2502.20349 2026-05-25 q-bio.NC cs.AI 版本更新

Naturalistic Computational Cognitive Science: Towards generalizable models and theories that capture the full range of natural behavior

自然主义计算认知科学:迈向能够捕捉自然行为全范围的通用模型与理论

Wilka Carvalho, Andrew Lampinen

发表机构 * Kempner Institute for the Study of Natural and Artificial Intelligence(Kempner自然与人工智能研究学院) Harvard University(哈佛大学) Google DeepMind(谷歌DeepMind)

AI总结 本文探讨如何通过结合人工智能的最新进展,构建能够涵盖自然情境和行为全貌的通用认知科学理论。研究指出,采用更加自然化的实验范式和计算模型,有助于更准确地理解自然智能的本质,并推动理论的泛化能力。文章综述了认知科学、神经科学和人工智能领域的相关研究,提出整合这些领域进展有助于在保持实验控制和理论深度的同时,更好地解释和模拟人类认知过程。

详情
AI中文摘要

认知科学如何构建能够涵盖自然情境与行为全范围的通用理论?我们认为,人工智能(AI)的进展为认知科学提供了及时的机会,使其能够采用日益自然化的刺激、任务和行为进行实验,并构建能够适应这些变化的计算模型。我们首先回顾了涵盖神经科学、认知科学和AI的日益增长的研究,这些研究表明,纳入更广泛的自然主义实验范式及其相应模型,可能是解决自然智能某些方面并确保理论泛化的必要条件。我们回顾了认知科学和神经科学中的案例,其中自然主义范式引发了不同的行为或涉及不同的过程。然后,我们讨论了AI的最新进展,表明从自然主义数据中学习会产生定性的不同行为模式和泛化模式,并探讨了这些发现如何影响我们从认知建模中得出的结论,以及如何帮助产生关于认知和神经现象根源的新假设。接着,我们建议整合AI和认知科学的最新进展,将使我们能够处理更自然的现象,而不放弃实验控制或对理论理解基础的追求。我们提供了关于方法论实践如何有助于自然主义计算认知科学中累积进展的实用指导,并描绘了一条构建能够解决自然认知实际问题的计算模型的道路,同时对这些模型所依据的过程和原则进行还原性理解。

英文摘要

How can cognitive science build generalizable theories that span the full scope of natural situations and behaviors? We argue that progress in Artificial Intelligence (AI) offers timely opportunities for cognitive science to embrace experiments with increasingly naturalistic stimuli, tasks, and behaviors; and computational models that can accommodate these changes. We first review a growing body of research spanning neuroscience, cognitive science, and AI that suggests that incorporating a broader range of naturalistic experimental paradigms, and models that accommodate them, may be necessary to resolve some aspects of natural intelligence and ensure that our theories generalize. We review cases from cognitive science and neuroscience where naturalistic paradigms elicit distinct behaviors or engage different processes. We then discuss recent progress in AI that shows that learning from naturalistic data yields qualitatively different patterns of behavior and generalization, and examine how these findings impact the conclusions we draw from cognitive modeling, and can help yield new hypotheses for the roots of cognitive and neural phenomena. We then suggest that integrating recent progress in AI and cognitive science will enable us to engage with more naturalistic phenomena without giving up experimental control or the pursuit of theoretically grounded understanding. We offer practical guidance on how methodological practices can contribute to cumulative progress in naturalistic computational cognitive science, and illustrate a path towards building computational models that solve the real problems of natural cognition, together with a reductive understanding of the processes and principles by which they do so.

2502.13731 2026-05-25 cs.AI 版本更新

Robust Counterfactual Inference in Markov Decision Processes

马尔可夫决策过程中的鲁棒反事实推断

Jessica Lally, Milad Kazemi, Nicola Paoletti

发表机构 * King's College London(伦敦国王学院)

AI总结 本文针对马尔可夫决策过程(MDP)中现有反事实推理方法的一个关键局限性,提出了一种新的非参数方法。传统方法依赖特定的因果模型来识别反事实,而实际上存在多个与观测和干预分布一致的因果模型,导致反事实分布不同。本文通过计算所有兼容因果模型下反事实转移概率的紧致界,提供了高效且可扩展的反事实推理方法,并在此基础上设计出鲁棒的反事实策略,以优化最坏情况下的奖励。实验表明,该方法在多个案例中表现出更强的鲁棒性。

详情
AI中文摘要

本文解决了马尔可夫决策过程(MDP)中现有反事实推断方法的一个关键局限性。当前方法假设特定的因果模型以使反事实可识别。然而,通常存在许多与MDP的观测分布和干预分布一致的因果模型,每个模型产生不同的反事实分布,因此固定一个特定的因果模型限制了反事实推断的有效性(和有用性)。我们提出了一种新颖的非参数方法,该方法在所有兼容因果模型上计算反事实转移概率的紧界。与先前需要求解规模过大(变量随MDP大小呈指数增长)的优化问题的方法不同,我们的方法为这些界提供了闭式表达式,使得计算对于非平凡MDP高度高效且可扩展。一旦构建了这样的区间反事实MDP,我们的方法识别出鲁棒的反事实策略,该策略针对不确定的区间MDP概率优化最坏情况奖励。我们在各种案例研究上评估了我们的方法,证明了相比现有方法具有改进的鲁棒性。

英文摘要

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
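Optimising a worst-case reward over interval transition probabilities reduces, at each step, to an inner minimisation that has a simple greedy solution. The sketch below shows that standard interval-MDP backup; it is not the paper's closed-form counterfactual bounds, and the function names and example numbers are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def worst_case_expectation(lower: Sequence[float], upper: Sequence[float],
                           values: Sequence[float]) -> Tuple[float, List[float]]:
    """Minimise sum(p * v) subject to lower <= p <= upper and sum(p) == 1.
    Assumes sum(lower) <= 1 <= sum(upper). Every state starts at its lower
    bound and the leftover mass goes to the lowest-value states first."""
    probabilities = list(lower)
    remaining = 1.0 - sum(lower)
    for state in sorted(range(len(values)), key=lambda s: values[s]):
        extra = min(upper[state] - lower[state], max(remaining, 0.0))
        probabilities[state] += extra
        remaining -= extra
    expectation = sum(p * v for p, v in zip(probabilities, values))
    return expectation, probabilities


def robust_action(rewards: Dict[str, float],
                  intervals: Dict[str, Tuple[Sequence[float], Sequence[float]]],
                  values: Sequence[float], discount: float = 0.9) -> Tuple[str, float]:
    """One robust Bellman backup: choose the action whose worst-case value is largest."""
    best_action, best_value = "", float("-inf")
    for action, (lower, upper) in intervals.items():
        expectation, _ = worst_case_expectation(lower, upper, values)
        q_value = rewards[action] + discount * expectation
        if q_value > best_value:
            best_action, best_value = action, q_value
    return best_action, best_value


if __name__ == "__main__":
    print(worst_case_expectation([0.1, 0.2, 0.3], [0.5, 0.6, 0.7], [10.0, 0.0, 5.0]))
```

Repeating `robust_action` for every state until the values converge gives robust value iteration, the usual way to extract a worst-case-optimal policy from an interval MDP.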

2605.23668 2026-05-25 cs.CL cs.AI 版本更新

OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations

OnePred: 多轮对话中基于递归意图记忆的下一个查询预测

Jiangwang Chen, Bowen Zhang, Zixin Song, Jiazheng Kang, Xiao Yang, Da Zhu, Guanjun Jiang

发表机构 * Tsinghua University(清华大学) Qwen Applications Business Group of Alibaba(阿里巴巴Qwen应用业务组)

AI总结 该研究提出了 OnePred,一种用于多轮对话中预测用户下一条查询的模型,旨在使对话系统更具主动性。其核心方法是通过递归更新的意图记忆来捕捉用户意图的演变,从而在不依赖完整对话历史的情况下实现高效且准确的预测。该方法通过两阶段强化学习训练模型,既学习预测内容又优化信息压缩,显著降低了计算成本并提升了预测性能。研究还发布了 NQP-Bench 基准数据集,实验表明 OnePred 在保持预测质量的同时,相比传统方法减少了高达 22 倍的计算开销。

详情
AI中文摘要

尽管大语言模型(LLM)对话系统每天处理数百万次多轮对话,但它们本质上仍是被动的:仅在用户输入查询后才响应。迈向主动交互的关键一步是下一个查询预测,即仅根据之前的对话预测用户后续的查询。该任务的进展受到缺乏专用基准以及基本效率-质量权衡的阻碍:简单拼接完整对话历史会导致线性增长的token消耗,而截断至最新一轮则会丢弃关键的跨轮上下文。我们的关键见解是,准确预测不需要重新阅读原始历史;只需跟踪用户跨主题、未解决需求和兴趣转移的不断演变的意图轨迹即可。我们提出OnePred,它维护一个递归更新的记忆作为唯一的跨轮上下文,将每轮成本限制为与对话长度无关。我们通过两阶段强化学习流程训练模型,首先教导预测什么,然后教导压缩什么,将记忆塑造成面向预测的意图链。为了建立严格的测试平台,我们引入了NQP-Bench,涵盖三个不同的子集。实验表明,与完整历史输入相比,OnePred将每轮token消耗减少高达22倍,同时在预测质量上持续超过所有基线,在较长对话中增益更大。我们的代码可在https://github.com/ZBWpro/OnePred公开获取。

英文摘要

Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next-query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency-quality trade-off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross-turn context. Our key insight is that accurate prediction does not require re-reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross-turn context, bounding the per-turn cost independently of conversation length. We train the model via a two-stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction-oriented intent chain. To establish a rigorous testbed, we introduce NQP-Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per-turn token consumption by up to 22$\times$ compared to full-history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at https://github.com/ZBWpro/OnePred.

2605.23655 2026-05-25 cs.CV cs.AI cs.LG cs.MM 版本更新

CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

CVSearch:赋予多模态大语言模型认知视觉搜索能力以感知高分辨率图像

Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen, Ke Chen, Yaowei Wang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China(深圳先进技术研究院)

AI总结 高分辨率图像感知是多模态大语言模型面临的关键瓶颈。为解决视觉搜索中覆盖性与效率之间的矛盾,本文提出CVSearch,一种无需训练的自适应框架,通过“评估-搜索”流程动态调度搜索策略。该方法在全局信息不足时采用专家辅助搜索,失败时触发语义感知的扫描机制,有效减少物体碎片化,并通过动态自底向上搜索策略提升局部细节的探索效率。实验表明,CVSearch在高分辨率基准上实现了最先进的准确率和显著提升的搜索效率。

Comments Accepted by ICML 2026. 22 pages, 12 figures, 7 tables

详情
AI中文摘要

高分辨率图像感知是多模态大语言模型的一个关键瓶颈。虽然视觉搜索提供了有希望的解决方案,但现有方法在覆盖率和效率之间难以权衡。视觉专家辅助搜索效率高,但当提议失败时容易出现盲点,而基于扫描的搜索以计算冗余和语义碎片化为代价保证了覆盖率。为了解决这一困境,我们引入了CVSearch,一种无需训练的自适应框架,通过评估-搜索工作流动态调度搜索策略。具体来说,CVSearch首先在全局信息不足时调用专家辅助搜索,仅在失败时触发一种新颖的语义感知扫描机制。与刚性网格划分不同,这种高效扫描范式结合了语义引导的自适应补丁,将图像分解为语义一致的区域,有效缓解了物体碎片化。此外,我们设计了一种由视觉复杂性先验驱动的动态自底向上搜索策略,以实现对局部细节的高效且精确的迭代探索。在高分辨率基准上的大量实验表明,CVSearch在显著提高搜索效率的同时实现了最先进的准确性。代码已发布在https://github.com/liliupeng28/ICML26-CVSearch。

英文摘要

High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26-CVSearch.

2605.23652 2026-05-25 cs.AI 版本更新

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

一个策略,无限NPC:用于可扩展游戏智能体的可追溯共享RL策略

Yoosung Hong

发表机构 * Independent Researcher(独立研究者)

AI总结 该研究提出了一种名为 pcsp 的共享强化学习策略,用于实现可扩展的游戏 NPC 控制,能够根据自由形式的人格描述生成具有个性特征且可控的行为。该方法基于冻结的 LLM 嵌入进行条件化,并结合了多种技术如低秩投影和一致性训练目标,以确保人格一致性与行为多样性。实验表明,pcsp 在零样本人格识别、语义行为对齐和推理速度等方面显著优于现有方法,并在实际游戏引擎中验证了其有效性与稳定性。

Comments 18 pages, 15 figures, 14 tables

详情
AI中文摘要

在包含300个角色设定的生活模拟基准上,pcsp实现了组合零样本角色识别,准确率最高比随机水平高17倍,语义-行为对齐的Spearman相关系数约为0.73,推理速度比LLM作为策略的基线快22倍。生活模拟游戏需要数百到数千个非玩家角色(NPC),这些角色具有一致的个性,同时通过设计师编写的自然语言保持可控。现有方法在个性一致性、可控性或实时推理等约束下失败。我们引入了pcsp(个性条件共享策略),这是一种单一的强化学习策略,以自由形式个性描述的冻结LLM嵌入为条件。pcsp结合了每个NPC一次的个性编码、低秩个性投影、神经个性调节以及PPO + InfoNCE一致性 + KL多样性训练目标。在三个实验设置中,消融实验表明InfoNCE轨迹一致性目标是关键:移除它会导致零样本角色识别降至随机水平。在Melting Pot 2.4.0基底环境(substrates)上的外部验证证实,我们的方法在多智能体战略环境中产生了基于个性的行为差异。我们区分了两种保留评估的含义:组合零样本和词汇扩展保留。最后,在UE5部署中,以64个智能体在引擎内重现了基于个性的消融实验,故障率低,表明亚帧级的推理性能特征在商业游戏引擎中得以保留。这些结果证明,共享RL策略可以支持可扩展、实时、基于个性的NPC控制。

英文摘要

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman rho approx 0.73 semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real-time inference. We introduce pcsp (Persona Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load bearing: removing it collapses zero-shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a low failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real-time, persona-conditioned NPC control.
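The InfoNCE consistency term named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `info_nce`, the cosine similarity, the temperature of 0.1, and the toy embeddings are all assumptions chosen for illustration. It scores one trajectory embedding against several persona embeddings and returns the negative log-probability of the matching persona.

```python
import math

def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive's similarity.

    Hypothetical helper; similarity and temperature are illustrative choices.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    logits = [cosine(anchor, c) / temperature for c in candidates]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[positive_index]

# Toy data: a trajectory embedding that lies close to persona 0.
trajectory = [0.9, 0.1, 0.0]
personas = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss_matched = info_nce(trajectory, personas, positive_index=0)
loss_mismatched = info_nce(trajectory, personas, positive_index=2)
```

The loss is small when the trajectory is most similar to its own persona and large otherwise, which is the pressure that makes behaviour identifiable from the persona.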

2605.23645 2026-05-25 cs.LG cs.AI 版本更新

Learning Through Noise: Why Subliminal Learning Works and When It Fails

通过噪声学习:为什么潜意识学习有效以及何时失败

Vincent C. Brockers, Roman D. Ventzke, Valentin Neuhaus, Belén Hidalgo-Ogalde, Viola Priesemann

发表机构 * Max Planck Institute for Dynamics and Self-Organization(马克斯·普朗克动力学与自组织研究所) Faculty of Physics, Institute for the Dynamics of Complex Systems, University of Göttingen(哥廷根大学物理系,复杂系统动力学研究所)

AI总结 本文研究了人工神经网络中的“潜意识学习”现象,即通过任务无关的输入-输出对进行知识蒸馏时,学生模型从教师模型中隐式学习任务相关知识或偏差的机制。研究发现,这一过程并不依赖于教师与学生模型的初始化一致性,而是由输出头的兼容性所决定。通过控制实验,作者展示了即使在随机初始化、网络结构变化等情况下,学生模型仍能通过兼容的辅助输出头从教师模型中学习有用信息,并在特定条件下达到与教师相当的任务性能。该研究为潜意识学习提供了理论解释,并明确了其适用范围与失效条件。

详情
AI中文摘要

在人工神经网络的背景下,潜意识学习指的是通过任务无关的输入-输出对的蒸馏,将任务相关知识或意外偏差从教师模型传递到学生模型。先前的解释将这种效应归因于共享或紧密匹配的教师-学生初始化。我们表明,紧密匹配的初始化并非必要。相反,潜意识学习由兼容的输出头控制。使用受控的MNIST设置,我们将输出分为辅助头(用于辅助的、任务无关的噪声信号)和分类头(用于分类),以证明潜意识学习发生——即使我们随机初始化隐藏层并移除层、添加新层或更改架构(MLP到CNN)。兼容的辅助头能够传递可恢复的教师信号,使学生的表示更接近教师的表示。当分类头也保持兼容时,仅训练于任务无关噪声的学生可以接近,并且在有利情况下达到教师级别的任务性能。我们的设置使我们能够发展一种理论来解释潜意识学习的机制,并推导出潜意识学习失败时的上界。总之,我们的结果将潜意识学习从一种令人惊讶的迁移效应转变为具有可预测限制的理论基础机制。

英文摘要

In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input-output pairs. Prior explanations tie this effect to shared or closely matched teacher-student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate that subliminal learning occurs even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.

2605.23634 2026-05-25 cs.CV cs.AI 版本更新

DualMem: Bypassing the Objectness Bottleneck for Calibrated Unknown-Stream Filtering in Open-World Object Detection

DualMem: 绕过目标性瓶颈以实现开放世界目标检测中校准的未知流过滤

Yingjun Xiao, Xi Chen, Gang Fang, Siyuan Chen

发表机构 * School of Artificial Intelligence, Guangzhou University(广州大学人工智能学院) School of Computer Science and Cyber Engineering, Guangzhou University(广州大学计算机科学与网络工程学院) Institute of Computing Science and Technology, Guangzhou University(广州大学计算科学与技术研究院)

AI总结 开放世界目标检测(OWOD)需要检测器既能定位已知类别,又能识别未知对象以支持未来的增量学习。本文发现当前强OWOD检测器的未知预测流中背景误检比例过高,问题根源在于对象性头的信息瓶颈。为此,作者提出DualMem,一种基于冻结SigLIP特征空间的校准后处理过滤器,通过非参数似然比检验实现对未知对象的筛选,有效提升了未知对象识别的准确性,同时保持已知类别检测性能不变。

详情
AI中文摘要

开放世界目标检测(OWOD)要求检测器定位已知类别,同时识别未知对象以进行未来的增量学习。我们发现,强OWOD检测器的未知预测流受到严重污染:在M-OWODB上,对于PROB、OW-DETR和HypOW,未来任务的正未知样本仅占未知预测的不到10%,而背景假阳性则占46-71%。我们表明,这不是信息缺失问题,而是目标性头部的信息瓶颈。在PROB任务1上,对256维解码器查询的线性探针在正负未知区分上达到了0.908的AUROC,但最终的一维目标性标量降至0.642。一个冻结的SigLIP特征,无需访问检测器,在过滤阶段独立恢复了大部分这种提议级别的可分离性(AUROC = 0.871)。基于这一发现,我们提出DualMem,一种校准的后验过滤器,它假设一个小的、图像不相交的、标注了未来任务对象的校准分割,并在冻结的SigLIP特征空间中执行非参数似然比检验。DualMem使用k近邻正记忆来保护未来任务对象,并使用负记忆来抑制类似背景的提议。其决策阈值通过Neyman-Pearson校准选择,为用户提供了假未知抑制与新奇召回之间的显式权衡。在M-OWODB任务1上的PROB、OW-DETR和HypOW中,DualMem将每幅图像的背景型假未知提议减少了44.9%-66.3%,平均减少56.6%。在PROB任务1上,它使自然K-means原型基线的减少量翻倍以上,同时保持已知类别的mAP不变,因为已知检测绕过过滤器。

英文摘要

Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental learning. We find that the unknown prediction streams of strong OWOD detectors are heavily polluted: on M-OWODB, across PROB, OW-DETR, and HypOW, future-task positive unknowns make up less than 10% of unknown predictions, whereas background false positives account for 46-71%. We show that this is not a missing-information problem, but an information bottleneck at the objectness head. On PROB Task 1, a linear probe on the 256-D decoder query achieves an AUROC of 0.908 for positive-versus-negative unknown discrimination, but the final one-dimensional objectness scalar drops to 0.642. A frozen SigLIP feature, without access to the detector, independently recovers much of this proposal-level separability at the filtering stage (AUROC = 0.871). Motivated by this finding, we propose DualMem, a calibrated post-hoc filter that assumes a small image-disjoint annotated calibration split of held-out future-task objects and performs a non-parametric likelihood ratio test in frozen SigLIP feature space. DualMem uses a k-nearest-neighbor positive memory to protect future-task objects and a negative memory to suppress background-like proposals. Its decision threshold is chosen by Neyman-Pearson calibration, giving users an explicit trade-off between false-unknown suppression and novel recall. Across PROB, OW-DETR, and HypOW on M-OWODB Task 1, DualMem reduces background-type false unknown proposals per image by 44.9%-66.3%, with a mean reduction of 56.6%. On PROB Task 1, it more than doubles the reduction achieved by a natural K-means prototype baseline, while leaving known-class mAP unchanged because known detections bypass the filter.
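A minimal sketch of the filtering idea described above: a k-nearest-neighbour score that compares a proposal's distance to a positive memory and to a negative memory, plus a threshold calibrated on held-out positives. The names (`log_ratio_score`, `calibrate_threshold`), the simplified log-distance ratio, the 2-D toy features, and the choice to bound the miss rate on calibration positives are assumptions for illustration, not the paper's exact procedure.

```python
import math

def kth_nn_distance(x, memory, k):
    """Distance from x to its k-th nearest neighbour in memory."""
    return sorted(math.dist(x, m) for m in memory)[k - 1]

def log_ratio_score(x, pos_memory, neg_memory, k=1, eps=1e-12):
    """Higher when x is closer to the positive memory than to the negative one."""
    d_pos = kth_nn_distance(x, pos_memory, k)
    d_neg = kth_nn_distance(x, neg_memory, k)
    return math.log(d_neg + eps) - math.log(d_pos + eps)

def calibrate_threshold(pos_scores, max_miss_rate):
    """Largest threshold that rejects at most max_miss_rate of calibration positives
    when proposals with score >= threshold are kept."""
    ordered = sorted(pos_scores)
    allowed_misses = int(math.floor(max_miss_rate * len(ordered)))
    return ordered[allowed_misses]

# Toy 2-D features standing in for frozen image embeddings.
pos_memory = [[1.0, 1.0], [1.2, 0.9]]
neg_memory = [[-1.0, -1.0], [-0.8, -1.1]]
calibration_positives = [[0.9, 1.1], [1.1, 1.0], [0.2, 0.1], [1.0, 0.8]]

calib_scores = [log_ratio_score(x, pos_memory, neg_memory) for x in calibration_positives]
threshold = calibrate_threshold(calib_scores, max_miss_rate=0.25)

proposals = [[1.05, 0.95], [-0.9, -1.0]]
keep = [log_ratio_score(p, pos_memory, neg_memory) >= threshold for p in proposals]
```

Raising `max_miss_rate` moves the threshold up, suppressing more background-like proposals at the cost of recall on novel objects. That is the trade-off the calibration step exposes to the user.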

2605.23623 2026-05-25 cs.CR cs.AI cs.LG 版本更新

Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection

时间概念漂移下的对抗脆弱性:Android恶意软件检测的纵向研究

Ahmed Sabbah, Mohammed Kharma, Radi Jarrar, Samer Zein, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University(比尔泽特大学计算机科学系) Department of Computer Science, University of Central Florida(中佛罗里达大学计算机科学系)

AI总结 本文通过长期视角研究了安卓恶意软件检测系统在时间概念漂移下的对抗脆弱性,分析了十年间应用数据在静态和动态特征表示下的对抗鲁棒性。研究采用三种部署协议评估模型性能,引入了多个时间关联指标以量化分布偏移对鲁棒性的影响。结果表明,随着时间间隔增大,对抗鲁棒性下降,而攻击成功率上升,强调了在动态数据环境下需考虑时间漂移因素,并提出了针对长期对抗环境的鲁棒性评估框架的重要性。

Comments 42 pages, 4 tables, 10 figures

详情
AI中文摘要

我们提出了一种纵向的、考虑漂移的对抗鲁棒性评估,使用从模拟器和真实设备执行中提取的静态和动态特征表示,跨越超过十年的Android应用。数据集按年度切片组织,并在三种模拟现实学习场景的部署协议下进行评估:(1)同年度训练和测试,(2)跨年度部署且不更新模型,(3)使用累积历史数据进行扩展窗口重训练。在多个分类器家族中,使用FGSM和SPSA在可行性约束下生成对抗样本。我们测量了干净性能、对抗准确率(AA)、攻击成功率(ASR),并引入了时序关联指标——RobustDrop、$\Delta$ASR和对抗放大因子(AAF)——以量化分布漂移与鲁棒性退化之间的关系。结果表明,在评估的基于迁移的特征空间设置下,时间分离与对抗鲁棒性降低相关。随着训练-测试间隔增加,干净准确率和对抗准确率下降,而攻击成功率呈现配置相关的增加,特别是在FGSM扰动和静态特征下。扩展窗口重训练可以缓解但无法消除在持续分布演化下的鲁棒性损失。这些发现表明,在评估智能检测系统在演化数据分布下的长期鲁棒性时,应考虑时间漂移,并强调了在长期对抗环境中需要漂移感知的鲁棒性评估框架。

英文摘要

We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static and dynamic feature representations extracted from emulator and real-device executions. The dataset is organized into yearly slices and evaluated under three deployment protocols that emulate realistic learning scenarios: (1) same-year training and testing, (2) cross-year deployment without model updates, and (3) expanding-window retraining with cumulative historical data. Across multiple classifier families, adversarial examples are generated using FGSM and SPSA under feasibility constraints. We measure clean performance, Adversarial Accuracy (AA), Attack Success Rate (ASR), and introduce temporal linkage metrics -- RobustDrop, $\Delta$ASR, and Adversarial Amplification Factor (AAF) -- to quantify the relationship between distribution shift and robustness degradation. Results show that temporal separation is associated with reduced adversarial robustness under the evaluated transfer-based feature-space setting. As the train-test gap increases, clean accuracy and adversarial accuracy decline, while attack success exhibits configuration-dependent increases, particularly under FGSM perturbations and static features. Expanding-window retraining mitigates, but does not eliminate, robustness loss under continued distributional evolution. These findings indicate that temporal drift should be considered when assessing the long-term robustness of intelligent detection systems under evolving data distributions and highlight the need for drift-aware robustness assessment frameworks in long-lived adversarial environments.
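The three deployment protocols can be made concrete by enumerating which yearly slices are used for training and testing. The sketch below is an assumption about how such splits might be laid out; the function name `deployment_splits` and the example years are hypothetical. For the same-year protocol, separating train and test samples within a slice is left to the caller.

```python
def deployment_splits(years):
    """Enumerate (train_years, test_year) pairs for three evaluation protocols."""
    years = sorted(years)
    # (1) Same-year: train and test on the same yearly slice.
    same_year = [([y], y) for y in years]
    # (2) Cross-year: train on one year, deploy unchanged on every later year.
    cross_year = [([train], test) for train in years for test in years if test > train]
    # (3) Expanding window: retrain on all history before each test year.
    expanding = [(years[:i], years[i]) for i in range(1, len(years))]
    return {
        "same_year": same_year,
        "cross_year": cross_year,
        "expanding_window": expanding,
    }

splits = deployment_splits([2019, 2020, 2021])
```

In the cross-year protocol, the train-test gap for each pair is simply `test - train`, which is the quantity the abstract relates to robustness degradation.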

2605.23610 2026-05-25 cs.CV cs.AI 版本更新

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

EM-Vid:无需训练的以实体为中心的记忆,用于高效且一致的多镜头视频生成

Jente Vandersanden, Matheus Gadelha, Chun-Hao P. Huang, Hyeonho Jeong, Yulia Gryaditskaya

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息学研究所) Adobe Research(Adobe研究院)

AI总结 本文提出了一种无需训练的实体中心记忆机制 EM-Vid,用于高效且一致的多镜头视频生成。该方法通过存储实体相关的潜在补丁来分离持久实体信息与瞬时场景背景,结合稀疏 token 条件控制和结构化脚本格式,有效降低了计算成本并提升了生成一致性。此外,引入的预算化记忆更新策略和噪声注入机制,进一步增强了对实体外观的精细控制,防止了无关信息的泄露。

详情
AI中文摘要

多镜头视频生成需要在不同镜头间保持重复实体的一致外观,同时忠实于镜头特定的文本提示。最近的自回归方法重用先前生成的帧作为记忆。然而,全帧存储将持久实体信息与瞬态场景上下文纠缠在一起,导致无关信息泄漏和高计算成本。我们提出一种以实体为中心的记忆,形式为实体索引的潜在补丁库。我们引入与预训练模型兼容的稀疏令牌条件化,将自注意力限制在实体相关令牌上,降低计算成本。为此,我们引入一种结构化的多镜头脚本格式。我们还提出一种预算记忆更新策略,以维护紧凑且不断演化的记忆。最后,我们为实体表示配备噪声注入机制,实现细粒度外观控制,防止无关信息泄漏。我们的方法在保持主体一致性的同时,提高了提示遵循度和效率。

英文摘要

Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.

2605.23605 2026-05-25 cs.LG cs.AI cs.CL 版本更新

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

DiLaDiff: 蒸馏潜在增强扩散用于语言建模

Jean-Marie Lemercier, Tomas Geffner, Karsten Kreis, Morteza Mardani, Arash Vahdat, Ante Jukić

发表机构 * NVIDIA(英伟达)

AI总结 DiLaDiff 是一种改进的扩散语言模型,旨在解决传统扩散模型在采样质量和生成速度之间的矛盾。该方法引入了连续语义潜在空间,并通过自编码器和一致性蒸馏技术提升生成效率和质量。实验表明,DiLaDiff 在不进行蒸馏时已优于基线模型,并在蒸馏后显著加快了推理速度。

详情
AI中文摘要

扩散语言模型本质上无法捕捉解码令牌之间的相关性,导致采样质量与吞吐量之间存在严峻的权衡。为了解决这个问题,我们提出了DiLaDiff,一种掩码扩散语言模型的变体,包含三个组件:(1)具有语义能力的连续潜在空间,通过从现有掩码扩散语言模型微调的自编码器学习;(2)学习编码器分布先验的潜在扩散模型;(3)将学习到的先验蒸馏为少步潜在生成模型的一致性模型。我们表明,即使没有蒸馏,我们的潜在引导扩散模型在显著加速推理的同时也优于掩码扩散基线。一致性蒸馏进一步降低了连续扩散的计算开销,使得潜在生成的时间相对于离散解码可以忽略不计。

英文摘要

Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion language models with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion language model; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.

2605.23603 2026-05-25 cs.LG cond-mat.dis-nn cs.AI cs.NE 版本更新

Preisach Attention: A Hysteretic Model of Sequential Memory

Preisach注意力:序列记忆的迟滞模型

Piotr Frydrych

发表机构 * Faculty of Mechatronics, Warsaw University of Technology(华沙理工大学机电一体化学院)

AI总结 本文提出了一种基于经典 Preisach 滞后算子的新型序列建模架构——Preisach 注意力层(PAL),用二值继电器操作符替代传统的 softmax 注意力机制,通过学习激活与去激活阈值来维护内部的局部极值栈。该架构在任意精度算术下实现图灵完备性,且单层 PAL-Transformer 的深度仅为 O(1),优于传统硬注意力 Transformer 所需的 O(log n) 深度。研究还证明 PAL 与 Transformer 在可计算函数类上互不包含,PAL 能以更少层数计算历史范围统计量,而 Transformer 支持随机访问但需额外状态支持,且 PAL 对序列的响应仅依赖于局部极值序列,而非绝对位置或时间间隔。

Comments 24 pages, 2 tables, preprint

详情
AI中文摘要

我们引入了Preisach注意力层(PAL),一种基于数学物理中经典Preisach迟滞算子的新型序列建模架构。PAL用由学习到的激活和去激活阈值参数化的二值继电器算子替代了softmax注意力机制,并维护一个局部极值栈作为其内部状态。在任意精度算术下,具有O(1)深度的单层PAL-Transformer是图灵完备的,这可以通过模拟双栈下推自动机实现,而标准硬注意力Transformer需要O(log n)深度。其次,我们证明了PAL和Transformer可计算的函数类是不可比的:PAL在O(1)层内计算历史范围统计,而Transformer需要O(log n)层;Transformer支持随机访问检索,而PAL在没有辅助状态的情况下无法执行。分离性质是率无关性:PAL仅响应局部极值序列,而不响应绝对标记位置或时间间隔。第三,我们证明了极值栈构成了所有率无关泛函的输入历史的最小充分统计量,提供了经典迟滞理论中擦除性质的形式类比。因此,PAL是一种适用于长情节记忆和弱位置依赖任务的高效架构,其总推理成本为O(n log n),而标准注意力为O(n^2)。

英文摘要

We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator parameterised by learned activation and deactivation thresholds, maintaining a stack of local extrema as its internal state. A single-layer PAL-Transformer with O(1) depth is Turing-complete under arbitrary precision arithmetic, achievable through simulation of a two-stack pushdown automaton -- in contrast to the O(log n) depth required by standard hard-attention transformers. Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in O(1) layers that require O(log n) layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence -- PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with O(n log n) total inference cost versus O(n^2) for standard attention.
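The rate-independence property can be seen in a toy version of the relay operator. The sketch below implements the textbook hysteresis relay (switch on at or above an upper threshold, off at or below a lower threshold, otherwise hold) and is not the PAL layer itself. The threshold values, the 0/1 state encoding, and the function names are assumptions for illustration.

```python
def relay_step(state, x, alpha, beta):
    """One update of a hysteresis relay with thresholds beta < alpha."""
    if x >= alpha:
        return 1
    if x <= beta:
        return 0
    return state  # inside the hysteresis band: keep the previous state

def run_relays(inputs, thresholds, initial=0):
    """Feed a sequence through a bank of relays and return their final states."""
    states = [initial] * len(thresholds)
    for x in inputs:
        states = [relay_step(s, x, a, b) for s, (a, b) in zip(states, thresholds)]
    return states

thresholds = [(0.9, 0.1), (0.7, 0.4), (0.3, 0.2)]  # (alpha, beta) pairs

# Same local extrema (0.0 -> 1.0 -> 0.3), traversed quickly and slowly.
fast = [0.0, 1.0, 0.3]
slow = [0.0, 0.2, 0.6, 1.0, 1.0, 0.8, 0.3]
# Same final value, but the input never reached the maximum of 1.0.
no_peak = [0.0, 0.3]

states_fast = run_relays(fast, thresholds)
states_slow = run_relays(slow, thresholds)
states_no_peak = run_relays(no_peak, thresholds)
```

`fast` and `slow` end in the same relay states because only the sequence of extrema matters. `no_peak` ends at the same input value but in a different state, because the relay bank remembers that the earlier maximum was never reached.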

2605.23592 2026-05-25 cs.AI 版本更新

Solving the Aircraft Disassembly Scheduling Problem

解决飞机拆解调度问题

Charles Thomas, Pierre Schaus

发表机构 * Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM)(信息与通信技术、电子与应用数学研究所) UCLouvain(鲁汶大学)

AI总结 本文研究了飞机报废拆解过程中的调度问题,该问题涉及大量任务和多种约束条件,对航空公司实现可持续拆解和盈利至关重要。文章提出了两种求解方法,包括约束规划模型和混合整数规划模型,并基于工业合作伙伴提供的真实数据进行了测试,验证了模型在处理多达1450项任务实例中的有效性。

详情
AI中文摘要

拆解寿命终结的飞机是一项复杂的工程,对于可持续性而言是必要的,但为航空运输公司带来的利润空间很小。因此,拆解过程的高效调度对于确保流程的盈利能力和激励实践至关重要。这是一个涉及数千个任务和许多不同约束的大规模调度问题:提取计划重复使用的部件需要具有特定认证和设备的技师。提取操作可能受先后顺序关系约束。此外,在整个过程中必须保持飞机平衡。最后,飞机的某些位置空间有限,限制了可同时工作的技师数量。本文详细介绍了该问题,并提出了两种解决方法:约束规划模型和混合整数规划模型。这些模型在基于工业合作伙伴提供的真实运营数据、规模不同(最多1450个任务)的实例上进行了测试。

英文摘要

Dismantling aircraft reaching their end of life is a complex endeavour that is necessary in terms of sustainability but yields small income margins for air transport companies. An efficient scheduling of the disassembly procedure is thus crucial to ensure the profitability of the process and incentivize the practice. This is a large scheduling problem that involves thousands of tasks and many different constraints: Extracting parts that are destined to be reused requires technicians with specific certifications and equipment. Extraction operations might be subject to precedence relations. Furthermore, the aircraft must be kept balanced during the whole process. Finally, some of the locations of the aircraft have a limited space that caps the number of technicians able to work there concurrently. This article presents the problem in detail and proposes two approaches to solve it: a Constraint Programming model and a MIP model. The models are tested on instances of varying sizes involving up to 1450 tasks, which are based on real operational data provided by an industrial partner.
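Two of the constraints listed above, precedence relations and per-location technician capacity, can be illustrated with a small feasibility checker for a candidate schedule. This is a sketch under stated assumptions, not the paper's CP or MIP model: task names, integer time steps, and the dictionary layout are hypothetical. The certification, equipment, and balance constraints are omitted.

```python
def check_schedule(tasks, start, precedences, capacity):
    """Return the list of violated constraints for a candidate schedule.

    tasks: name -> {"duration", "location", "technicians"} (integer time steps)
    start: name -> start time
    precedences: list of (before, after) pairs
    capacity: location -> max technicians working there at once
    """
    violations = []
    for before, after in precedences:
        if start[before] + tasks[before]["duration"] > start[after]:
            violations.append(("precedence", before, after))
    horizon = max(start[t] + tasks[t]["duration"] for t in tasks)
    for time in range(horizon):
        load = {}
        for name, task in tasks.items():
            if start[name] <= time < start[name] + task["duration"]:
                load[task["location"]] = load.get(task["location"], 0) + task["technicians"]
        for location, used in load.items():
            if used > capacity[location]:
                violations.append(("capacity", location, time))
    return violations

tasks = {
    "remove_panel": {"duration": 2, "location": "wing", "technicians": 1},
    "extract_pump": {"duration": 3, "location": "wing", "technicians": 2},
    "remove_seat": {"duration": 2, "location": "cabin", "technicians": 1},
}
precedences = [("remove_panel", "extract_pump")]
capacity = {"wing": 2, "cabin": 2}

feasible = check_schedule(
    tasks, {"remove_panel": 0, "extract_pump": 2, "remove_seat": 0}, precedences, capacity
)
infeasible = check_schedule(
    tasks, {"remove_panel": 0, "extract_pump": 1, "remove_seat": 0}, precedences, capacity
)
```

A checker like this only validates a given schedule. Finding a good one is the job of the optimisation models the paper compares.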

2605.23590 2026-05-25 cs.AI 版本更新

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Co-ReAct:作为ReAct智能体逐步协作者的评分准则

Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang

发表机构 * Qwen Applications Business Group of Alibaba(阿里巴巴千问应用事业群) Tsinghua University(清华大学)

AI总结 Co-ReAct 是一种基于评分标准(rubrics)的行动选择框架,旨在改进 ReAct 代理在多步骤推理任务中的决策过程。该方法在每一步推理中注入评分标准作为指导,明确代理应关注的证据搜索、推理或自我评估方向,从而提升推理的深度和针对性。通过引入专门训练的评分标准生成器,并采用多评委共识排名优化目标,Co-ReAct 显著提升了多个基准任务上的表现,且无需修改原有代理的决策机制。

详情
AI中文摘要

用于搜索密集型、多步推理任务的ReAct风格智能体主要依赖自身内部判断来决定寻求哪些证据、下一步采取哪个推理或行动步骤以及何时停止,常常产生浅显、冗余或目标不明确的轨迹。先前的工作探索了将评分准则作为外部质量信号,但现有用途主要是评估性的而非行动指导性的:评分准则通常作为训练时的奖励或完成输出的事后评估器,在深度研究场景中,它们往往是粗粒度的、报告级别的而非步骤级别的。我们引入了Co-ReAct,一个评分准则指导的行动选择框架,在推理过程中将评分准则作为步骤级指导。在每个决策步骤,Co-ReAct将评分准则注入智能体的上下文,以指导下一个“推理或行动”决策,明确智能体在证据寻求、搜索、推理或自我评估中应瞄准什么。为了使这种指导可靠,我们使用GRPO训练了一个专用的评分准则生成器。与先前的成对或二元偏好公式不同,我们的目标优化了针对多评判专家共识排名的列表式斯皮尔曼等级相关奖励,鼓励评分准则具有区分性而不仅仅是合理。在DeepResearchBench和SQA-CS-V2上,Co-ReAct在基于8B/14B开源和前沿闭源基础模型构建的搜索智能体上,一致优于ReAct和代表性的测试时计算基线。训练好的评分准则生成器还可以作为即插即用组件,在不改变底层决策机制的情况下改进这些基线。我们的代码公开在https://github.com/ZBWpro/Co-ReAct。

英文摘要

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.
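The reward signal described above is a Spearman rank correlation between the ordering a rubric induces and a consensus ordering. A minimal pure-Python sketch follows. The function names and toy scores are assumptions, and how the paper aggregates judges or feeds the value into GRPO is not shown.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy example: scores a rubric assigns to four candidate responses,
# compared with a consensus quality score for the same responses.
consensus = [4, 2, 3, 1]
reward_aligned = spearman([0.9, 0.4, 0.7, 0.1], consensus)
reward_partial = spearman([0.9, 0.4, 0.1, 0.7], consensus)
reward_reversed = spearman([0.1, 0.7, 0.4, 0.9], consensus)
```

A list-wise reward of this kind is highest when the rubric orders all candidates the way the judges do, which is stricter than agreeing on a single pairwise preference.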

2605.23572 2026-05-25 cs.IR cs.AI cs.LG 版本更新

HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

HARNESS-LM: 一种在赞助搜索中利用小语言模型的三阶段训练方案

Vipul Gupta, Shikhar Mohan, Lakshya Kumar, Pranjal Chitale, Nikit Begwani, Amit Singh, Manik Varma

发表机构 * Microsoft AI(微软人工智能)

AI总结 在赞助搜索中,如何在保证检索质量的同时降低响应延迟是一个重要挑战。本文提出HARNESS-LM(HLM),一种三阶段训练框架,旨在将大规模语言模型的检索能力转移到参数更少、成本更低的模型中。通过知识蒸馏和对比优化等方法,HLM在保持高检索精度的同时显著提升了推理效率,并在实际的Bing Ads测试中验证了其有效性,取得了更高的收益、曝光和点击率提升。

Comments 9 pages, 3 figures, 10 tables

详情
AI中文摘要

在赞助搜索的竞争格局中,平衡检索质量与生产延迟是一个关键挑战。尽管基于小语言模型(SLM)的大型检索模型(如Qwen3-Embedding-4B/8B)在公共基准上设定了强上限,但其在高吞吐、延迟敏感环境中的部署仍不切实际。本文提出HARNESS-LM(HLM),一个三阶段训练框架,用于将大规模检索器的能力迁移至紧凑、成本高效的模型。该方法包括:(1)通过微调十亿参数规模的SLM训练高性能参考(“教师”)检索器;(2)通过L2目标对齐查询表示,将知识蒸馏至低于600M参数的学生编码器;(3)应用最终对比精炼阶段以优化学生的检索性能。我们还对关键设计选择进行了全面的实证研究,包括对齐目标、嵌入维度、模型规模、架构和优化策略,以确定在生产环境中最为有效的配置。在真实世界的Bing Ads评估基准上,HLM在多种设置下恢复了参考检索器超过98%的精度,同时在NVIDIA A100 GPU上实现了高达27倍的在线查询编码器延迟降低和20倍的吞吐量提升。在Bing Ads上的在线A/B测试进一步显示,与当前生产中运行的检索器集成(部署190M参数模型)相比,收入提升+1%,展示量提升+0.6%,点击量提升+0.4%,清晰突显了HLM方案在真实世界赞助搜索场景中的实际效果。

英文摘要

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
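Phase (2) aligns student query embeddings to the teacher's with an L2 objective. The sketch below shows one plausible form of that loss, the mean squared L2 distance over a batch of paired embeddings. It assumes the student and teacher embeddings already share a dimensionality. The function name and toy vectors are hypothetical, and the exact normalisation used in the paper is not stated in the abstract.

```python
def l2_alignment_loss(student_batch, teacher_batch):
    """Mean squared L2 distance between paired student and teacher embeddings."""
    total = 0.0
    for student, teacher in zip(student_batch, teacher_batch):
        total += sum((s - t) ** 2 for s, t in zip(student, teacher))
    return total / len(student_batch)

# Toy embeddings for two queries.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.8, 0.1], [0.1, 0.7]]

loss = l2_alignment_loss(student, teacher)
loss_perfect = l2_alignment_loss(teacher, teacher)
```

Because the loss needs only teacher embeddings of queries, no relevance labels are required at this stage. The contrastive refinement in phase (3) is what optimises for retrieval directly.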

2605.23569 2026-05-25 cs.AI 版本更新

CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

CP还是DP?为何不兼得:以部分车间调度问题为例

Emma Legrand, Roger Kameugne, Pierre Schaus

发表机构 * ICTEAM, UCLouvain, Belgium(ICTEAM,鲁汶大学,比利时)

AI总结 本文研究了如何将动态规划(DP)与约束规划(CP)有效结合,以解决部分车间调度问题(PSSP)。作者提出了一种混合方法,以DP作为主搜索框架,利用CP进行全局约束传播,从而提升求解效率与灵活性。该方法不仅支持任意优先级约束,还可与任何时间策略结合,并能设计出基于DP的大型邻域搜索方案,展示了DP与CP融合在组合优化问题中的可行性。

详情
AI中文摘要

动态规划(DP)和约束规划(CP)是解决组合优化问题的成熟范式。通常,这两种方法被分开使用。本文旨在展示两者可以有效且优雅地结合,其中DP作为主搜索框架,CP作为子程序利用全局约束传播。本文针对部分车间调度问题(PSSP)提出了这样一种方法,该问题之前已有纯DP方法,并且有高效的CP过滤算法可用。PSSP是一个通用调度问题,其中每个作业由一组具有任意优先约束的操作组成。该方法足够灵活,可以容纳任意时间DP策略,例如任意时间列搜索,而原始DP算法以严格的逐层方式运行。此外,CP建模的灵活性使得可以轻松纳入任意优先约束。因此,该模型自然地处理任何优先图,甚至允许设计大邻域搜索(LNS)方案,其中重用DP模型,并在重启之间施加偏序调度以改进当前解。虽然对于这个特定问题,该方法无法与最先进的纯CP求解器竞争,但我们的主要贡献是证明了这种混合集成的可行性。

英文摘要

Dynamic Programming (DP) and Constraint Programming (CP) are well-established paradigms for solving combinatorial optimization problems. Usually, these two approaches are used separately. This paper aims to show that the two can be combined effectively and elegantly, with DP serving as the primary search framework and CP used as a subroutine to leverage global constraint propagation. This paper presents such an approach for the Partial Shop Scheduling Problem (PSSP), for which a pure DP method has previously been proposed, and efficient CP filtering algorithms are available. The PSSP is a general scheduling problem where each job consists of a set of operations with arbitrary precedence constraints. The approach is flexible enough to accommodate anytime DP strategies, such as anytime column search, whereas the original DP algorithm operated in a strictly layer-wise manner. Moreover, the flexibility of the CP modeling makes it straightforward to incorporate arbitrary precedence constraints. As a result, the model naturally handles any precedence graph and even enables the design of a Large Neighborhood Search (LNS) scheme, in which the DP model is reused, and partial-order schedules are imposed across restarts to improve the incumbent solution. While not competitive with state-of-the-art pure CP solvers for this specific problem, our primary contribution is demonstrating the viability of this hybrid integration.

2605.23565 2026-05-25 cs.LG cs.AI 版本更新

Understanding Goal Generalisation in Sequential Reinforcement Learning

理解序贯强化学习中的目标泛化

Jason Ross Brown, Edward James Young

发表机构 * University of Cambridge(剑桥大学) Geodesic Research(Geodesic研究)

AI总结 本研究探讨了序列强化学习代理在新环境中实现目标泛化的能力,分析了其训练历史对其行为的影响。通过研究超过100种序列训练流程并在250多个分布外环境中进行评估,发现显著特征和早期学习的目标对后续泛化具有重要影响。为此,研究提出了一种名为潜在策略梯度的方法,能够预测训练流程可能诱导的分布外行为,具有较高的预测准确性、良好的泛化能力和可解释性,为从发展角度理解目标泛化提供了基础。

详情
AI中文摘要

强化学习代理在其训练分布之外常常表现出非预期的目标导向行为,但我们目前缺乏基于训练历史对这类代理如何泛化到新环境的原理性理解。我们针对在单个或多个任务上序贯训练的代理解决了这一空白。我们研究了超过100个序贯训练流程,评估了超过250个分布外环境中的行为。我们发现显著特征驱动泛化,并且训练早期习得的目标会持续存在并影响后期习得的目标。为了解释这些现象,我们引入了潜在策略梯度方法,该方法预测训练流程可能诱导的分布外行为。我们的方法根据潜在变量如何映射到行为的简单模型,模拟训练过程中低维潜在变量的演化,以实现在训练目标上获得高奖励。它实现了强预测准确性,泛化到未见过的训练流程类型,并且是可解释的。我们的发现表明,虽然分布外RL代理行为依赖于整个训练流程,但这种依赖具有我们可以捕捉的底层结构,为从发展角度理解目标泛化奠定了基础。

英文摘要

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

2605.23562 2026-05-25 cs.MA cs.AI 版本更新

ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

ARMS: 稀疏奖励多智能体强化学习的自动奖励塑形

Elie Abboud, Oren Gal

发表机构 * Department of Marine Technologies(海洋技术系)

AI总结 在多智能体强化学习中,稀疏奖励是学习过程中的主要瓶颈,而传统的奖励塑造方法难以在保持策略结构的同时提升学习效率。本文提出了一种名为ARMS的自动奖励塑造框架,通过轨迹排序从稀疏环境奖励中学习密集的塑造奖励,并基于条件最佳响应推理保证在固定对手策略下保留每个智能体的最佳响应集和纳什均衡集。实验表明,ARMS在部分可观测的多智能体路径规划任务中显著提升了采样效率,具有良好的环境泛化能力,并揭示了多智能体系统中由探索不足和策略-奖励动态耦合引发的振荡行为问题。

详情
AI中文摘要

稀疏奖励是多智能体强化学习(MARL)中的一个主要瓶颈,其中同时学习会导致非平稳性并使奖励设计尤其精细。奖励塑形可以加速学习,但在多智能体环境中,它必须保留问题的战略结构,而不仅仅是改善短期优化。我们提出了多智能体系统中的自动奖励塑形(ARMS),这是一个用于MARL的自监督奖励塑形框架,通过轨迹排序从稀疏环境奖励中学习稠密塑形信号。由于单智能体轨迹排序保证不能直接迁移到MARL,我们通过条件最优反应推理重新表述策略不变性,并证明如果某些条件成立,则使用塑形奖励在固定对手策略下保留每个智能体的最优反应集,从而保留纳什均衡集。在此视角指导下,ARMS在策略学习和奖励学习之间交替,同时跨智能体共享塑形参数以提高效率。在部分可观测的多智能体路径规划领域中的实验表明,ARMS在奖励稀疏性和智能体数量增加的情况下提高了采样效率,泛化到未见过的环境,并揭示了一种MARL特有的失败模式,其中有限的探索和耦合的策略-奖励动态导致振荡行为。增加探索可缓解此效应并稳定学习。据我们所知,ARMS是第一个其设计动机来自博弈论均衡保持结果的MARL自动奖励塑形框架。

英文摘要

Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy--reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.

2605.23559 2026-05-25 cs.CV cs.AI 版本更新

PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA

PathNavigate: 一种无需训练的病理学代理,具有惊喜引导扫描和共享切片记忆,用于全切片图像VQA

Chunze Yang, Qidong Liu, Wenjie Zhao, Yue Tang, Jiusong Ge, Di Zhang, Jiashuai Liu, Lei Wu, Junbo Lu, Ni Zhang, Xian Wu, Zeyu Gao, Chen Li

发表机构 * School of Comp. Science & Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院) Tencent Jarvis Lab(腾讯Jarvis实验室) University of Cambridge(剑桥大学)

AI总结 PathNavigate 是一种无需训练的病理图像问答代理,旨在解决全切片图像问答(WSI-VQA)中在有限检查预算下高效定位关键病理证据的问题。该方法采用“扫描-搜索-读取”流程,通过共享的在线记忆模块生成异常区域池,并结合问题条件的相关性筛选高倍镜下的目标区域,从而提升答案准确性和解释性。实验表明,PathNavigate 在保持模型冻结的前提下,实现了更高的效率和更可靠的证据选择路径。

详情
AI中文摘要

全切片图像视觉问答(WSI-VQA)将病理学视为极端上下文搜索问题:为了回答自由形式的临床查询,系统必须首先在严格的检查预算下导航千兆像素切片,以定位稀疏的高分辨率证据。现有方法主要分为两种范式:i)监督式病理学多模态大语言模型(MLLMs)和代理可以将定位和推理吸收到学习模块中,但它们通常将导航与任务特定的监督和重新训练耦合,限制了其实用性;ii)无需训练的病理学代理通过保持核心模型冻结来避免这种成本,但通常遵循问题优先的设计,主要从查询条件相关性构建初始候选集。这可能会遗漏问题中未提及的决定性形态,并迫使更重的推理时脚手架。为了解决这一挑战,我们引入了PathNavigate,一种无需训练的病理学代理,基于扫描-搜索-读出流程构建。在问题匹配之前,PathNavigate在低放大倍数下扫描当前切片,使用共享的在线记忆模块处理冻结的病理学特征,生成一个切片特定的惊喜场,标记异常区域池。然后,它仅在此池内应用问题条件的PLIP相关性,以选择高放大倍数的搜索目标。最后,它提取局部高放大倍数证据,并使用冻结的感知器-裁决器堆栈进行回答,利用相同的在线记忆作为切片级上下文。在WSI-VQA和SlideBench-BCNB上的实验表明,所提出的扫描-搜索-读出设计提高了答案准确性,并产生了更可解释的证据选择轨迹,且效率更高。代码已在线公开。

英文摘要

Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher efficiency. The code is available online.

2605.23551 2026-05-25 cs.LG cs.AI 版本更新

Goal-Conditioned Agents that Learn Everything All at Once

目标条件智能体一次性学习所有内容

Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster

发表机构 * University of Oxford(牛津大学) McGill University(麦吉尔大学) MIT(麻省理工学院) Inria(法国国家信息与自动化研究所)

AI总结 本文提出了一种名为LEO(Learning Everything all at Once)的新方法,用于提升目标条件强化学习的效率。该方法通过一次性输出所有目标对应的价值和动作,实现了高效的并行更新,解决了传统全目标学习计算开销大的问题。实验表明,LEO在目标条件任务和连续控制环境中均表现出色,且相比传统方法有超过250倍的加速效果,为复杂环境中的强化学习提供了有力工具。

详情
AI中文摘要

一个目标条件的强化学习智能体在探索环境时,会在整个轨迹中看到大量信息,但大多数信息在仅根据命令目标进行在线策略更新时被丢弃。全目标学习(每个转换都用于针对每个目标进行离线策略学习)允许智能体提取最大信息,但通过简单的重新标记通常计算上不可行。这可以通过同时为每个目标输出值和动作来克服,从而允许通过网络单次传递进行高效的并行全目标更新,我们称之为一次性学习所有内容(LEO)。我们表明,这种方法在目标条件的Craftax上显著优于其他方法,在连续控制环境中与现有基线具有竞争力,同时与全目标重新标记相比实现了超过250倍的加速。然后,我们进一步表明,通过将LEO用作教师网络而非直接行动者,这种方法可以变得更加强大。我们希望,通过解锁大规模的全目标学习,LEO可以成为复杂环境中强化学习实践者的有用工具。我们开源了我们的代码。

英文摘要

A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information; however, it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a >250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code.
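
To make the all-goals idea concrete, the following is a minimal tabular sketch, not the authors' network-based method: a single transition updates the Q-value of every goal at once. The chain environment, the table `Q[s][a][g]`, and all hyperparameters are assumptions chosen purely for illustration.

```python
import random

N_STATES = 6          # hypothetical chain 0..5; every state doubles as a goal
ACTIONS = (-1, +1)    # step left / step right
GAMMA, ALPHA = 0.9, 0.5

# Q[s][a][g]: value of action index a in state s when the commanded goal is g
Q = [[[0.0] * N_STATES for _ in ACTIONS] for _ in range(N_STATES)]

def step(s, a_idx):
    """Deterministic chain dynamics, clipped at both ends."""
    return min(max(s + ACTIONS[a_idx], 0), N_STATES - 1)

def all_goals_update(s, a_idx, s_next):
    """Use one transition to update the Q-value for every goal."""
    for g in range(N_STATES):
        reached = (s_next == g)
        reward = 1.0 if reached else 0.0
        # Reaching the goal is terminal for that goal, so no bootstrap there.
        bootstrap = 0.0 if reached else GAMMA * max(
            Q[s_next][b][g] for b in range(len(ACTIONS)))
        Q[s][a_idx][g] += ALPHA * (reward + bootstrap - Q[s][a_idx][g])

def greedy_action(s, g):
    """Index of the best action in state s for goal g."""
    return max(range(len(ACTIONS)), key=lambda b: Q[s][b][g])

rng = random.Random(0)
s = 0
for _ in range(5000):
    a_idx = rng.randrange(len(ACTIONS))   # purely random behaviour policy
    s_next = step(s, a_idx)
    all_goals_update(s, a_idx, s_next)    # off-policy update for all goals
    s = s_next
```

Even though the behaviour policy never follows any commanded goal, the table ends up encoding a goal-reaching policy for every goal. In the abstract's setting the loop over `g` is replaced by a network that outputs values for all goals in one forward pass.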

2605.23550 2026-05-25 math.OC cs.AI cs.NA math.NA 版本更新

RA-DCA: A Randomized Active-Set DCA for Directional Stationarity in Max-Structured DC Programs

RA-DCA:面向最大结构DC规划方向稳定性的随机活动集DCA

Yi-Shuai Niu

发表机构 * Beijing Institute of Mathematical Sciences and Applications(北京数学科学研究院)

AI总结 本文研究了一类非光滑的差分凸优化问题,其中被减去的凸项为多个光滑凸函数的最大值。为了解决标准DCA可能收敛到非方向平稳临界点的问题,同时避免大规模或组合型活动集带来的高计算成本,作者提出了一种基于随机化活动集的DCA方法RA-DCA。该方法通过在采样方向上投影活动梯度、检查采样顶点残差,并仅在残差较小时使用小规模线性规划作为补充,有效保持了DCA的下降结构,同时将随机筛选过程简化为矩阵乘法。实验表明,该方法在多种模型中能够避免非平稳临界点,并在组合型问题中展现出良好的筛选效果。

Comments 40 pages, 7 figures

详情
AI中文摘要

我们研究非光滑差凸规划,其中被减的凸项是光滑凸函数的有限最大值。在此设定下,标准DCA迭代可能收敛到非方向稳定的临界点,而当活动集较大或具有组合性质时,精确的活动顶点筛选可能代价高昂。我们提出RA-DCA,一种顶点优先的随机活动集DCA,它将活动梯度投影到采样方向,检查采样顶点残差,并仅在低残差凸组合回退时使用一个小型线性规划。该方法保留了DCA的下降结构,并将随机筛选层简化为矩阵乘法。在所述正则性、数值活动集一致性和随机嵌入假设下,受保护方法生成的每个聚点以概率1是方向稳定的。MATLAB实验首先在退化的最大仿射、最大二次和稀疏支撑函数模型上测试该定理,其中保护机制避免了非稳定临界点并紧密跟踪完整活动顶点扫描。随后,块top-k测试表明,当精确聚合枚举具有组合性质时,相同的筛选思想仍然有用。修剪回归、互补性和QUBO诊断区分了活动集选择有助于问题的情况与由多起点搜索、DC分裂或其他问题特定特征主导的情况。

英文摘要

We study nonsmooth difference-of-convex programs whose subtracted convex term is a finite maximum of smooth convex functions. In this setting, standard DCA iterations may converge to critical points that are not directionally stationary, whereas exact active-vertex screening can be expensive when active sets are large or combinatorial. We propose RA-DCA, a vertex-first randomized active-set DCA that projects active gradients onto sampled directions, checks a sampled vertex residual, and uses a small linear program only as a low-residual convex-combination fallback. The method preserves the descent structure of DCA and reduces the randomized screening layer to matrix multiplications. Under the stated regularity, numerical active-set consistency, and random-embedding assumptions, every accumulation point generated by the safeguarded method is directionally stationary with probability one. MATLAB experiments first test the theorem on degenerate max-affine, max-quadratic, and sparse support-function models, where the safeguard avoids nonstationary critical points and closely tracks a full active-vertex scan. Block top-k tests then show that the same screening idea remains useful when exact aggregate enumeration is combinatorial. Trimmed-regression, complementarity, and QUBO diagnostics separate cases where active-set selection helps from cases dominated by multistart search, the DC split, or other problem-specific features.
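
The gap between criticality and directional stationarity can be seen on a one-dimensional toy problem. The sketch below is an illustrative assumption, not the RA-DCA algorithm itself: it uses f(x) = 0.5x^2 - max(-x, x), where x = 0 is a DC critical point (the subgradient 0 of |x| is admissible) but not directionally stationary, and it implements the full active-vertex scan that the abstract describes as the expensive baseline.

```python
def f(x, pieces):
    """DC objective 0.5*x^2 - max_i (a_i*x + b_i) in one dimension."""
    return 0.5 * x * x - max(a * x + b for a, b in pieces)

def active_set(x, pieces, tol=1e-9):
    """Indices of affine pieces attaining the max at x (up to tol)."""
    h = max(a * x + b for a, b in pieces)
    return [i for i, (a, b) in enumerate(pieces) if h - (a * x + b) <= tol]

def dca_step(slope):
    """argmin_x 0.5*x^2 - slope*x, the DCA subproblem for g(x) = 0.5*x^2."""
    return slope

def vertex_scan_step(x, pieces):
    """Try the DCA step for every active piece and keep the best candidate."""
    candidates = [dca_step(pieces[i][0]) for i in active_set(x, pieces)]
    best = min(candidates, key=lambda y: f(y, pieces))
    # Accept the move only if it strictly decreases the objective.
    return best if f(best, pieces) < f(x, pieces) - 1e-12 else x

pieces = [(-1.0, 0.0), (1.0, 0.0)]   # h(x) = max(-x, x) = |x|
x0 = 0.0
stalled = dca_step(0.0)              # subgradient 0 in [-1, 1] keeps x at 0
escaped = vertex_scan_step(x0, pieces)
```

Here `stalled` stays at the non-stationary critical point, while `escaped` moves to a point with objective value -0.5. With many active pieces the scan costs one subproblem per piece, which is the cost the randomized screening in the abstract is meant to avoid.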

2605.23522 2026-05-25 cs.LG cs.AI cs.CV 版本更新

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Precise: 用于流匹配模型强化学习后训练的SDE一致随机采样

Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong

发表机构 * Peking University(北京大学) Tencent Hunyuan(腾讯混元)

AI总结 该论文研究了如何通过强化学习(RL)对流匹配模型进行后训练,以提升其生成质量与提示对齐能力。核心方法是将确定性的采样轨迹转化为随机策略,通过设计一个符合随机微分方程(SDE)的采样器,实现探索与稳定性的平衡。提出的新采样器Precise在保持去噪轨迹SDE一致性的同时,有效减少了噪声干扰,实验表明其在奖励优化速度和生成质量上均优于现有方法。

详情
AI中文摘要

强化学习已成为提升扩散和流匹配生成器中提示对齐和感知质量的有效方法。将在线强化学习应用于流匹配的关键步骤是将确定性采样轨迹转化为随机策略,通常通过用随机微分方程替代逆向常微分方程来实现。随机采样器控制探索行为和去噪动力学,因此是策略的一部分,其设计会显著影响奖励优化性能。我们将采样器设计分解为两个相互依赖的组成部分:选择适量的随机探索,以及在强化学习中使用的少量步数下忠实地离散化得到的SDE。针对第一个组成部分,我们分析了去噪过程中探索与稳定性之间的固有张力,并推导出平衡两者的SDE调度。针对离散化挑战,我们使用一个玩具示例表明,现有采样器可能偏离流匹配过程,要么引入过多的离散化噪声,要么依赖不能保证收敛到数据分布的启发式规则。为解决这些问题,我们提出了Precise,一种新的随机采样器,平衡了有效探索与稳定性。关键地,Precise通过一种冻结干净潜变量后验均值的新颖近似,使去噪轨迹保持SDE一致,解决了标准采样器中的过度噪声问题。大量实验表明,该公式通过强化学习实现了显著更快且更稳定的奖励优化,达到了最先进的对齐分数(例如PickScore、HPSv2.1),同时匹配先前采样器的最佳域内性能所需的训练时间减少了13.1-53.2%。

英文摘要

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

2605.23508 2026-05-25 cs.GR cs.AI cs.CV cs.MM eess.IV 版本更新

DrawVideo: Generating Long Video from Storyboard Keyframe Sketches

DrawVideo: 从故事板关键帧草图生成长视频

Chuanzhi Xu, Huiqi Liang, Bang Shi, Huiming Zhang, Yifan Xiao, Guangcheng Lin, Haodong Chen, Qiang Qu, Zhicheng Lu, Weidong Cai

发表机构 * The University of Sydney(悉尼大学) Charles Sturt University(查尔斯特大学)

AI总结 DrawVideo 是一种基于草图和分镜脚本的可控长视频生成框架,能够通过用户提供的黑白草图、外观描述和运动提示生成结构清晰、内容连贯的长视频。该方法将视频分解为多个可独立控制的镜头,每个镜头由草图、外观提示和运动提示定义,并采用分层策略生成参考帧和动作状态帧,最终合成完整视频。研究还构建了首个用于草图引导长视频生成的数据集 SketchLongVideo,实验表明该方法在结构控制、外观一致性和视觉稳定性方面表现优异。

Comments 45 pages, 19 figures

详情
AI中文摘要

长视频生成需要高保真合成、连贯的叙事结构以及用户对长时间跨度的控制。现有的文本到视频方法通常依赖单一长提示,限制了对姿态、构图、布局和运动的控制。我们提出 DrawVideo,一种草图引导、故事板驱动的可控长视频生成框架。DrawVideo 将长视频分解为独立可控的镜头,每个镜头由黑白草图、外观提示和运动提示定义。草图控制姿态和布局,外观提示定义身份、场景和风格,运动提示引导时间动态。DrawVideo 遵循分层“全局多镜头、局部单草图”策略:首先生成结构对齐的参考关键帧,然后将运动提示扩展为代表动作状态的衍生关键帧,最后在相邻关键帧之间合成片段以构建每个镜头。我们还引入了 SketchLongVideo,这是首个用于草图引导的文本到长视频生成的数据集,通过镜头检测、关键帧提取、视觉语言识别、提示分解和草图转换从动画视频构建。实验表明,DrawVideo 实现了强大的结构可控性、外观一致性、视觉稳定性和连贯的长视频生成。

英文摘要

Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.

2605.23504 2026-05-25 cs.LG cs.AI 版本更新

VACE: Learning Geometrically Structured Representations for Time Series Anomaly Detection

VACE:学习几何结构化表示用于时间序列异常检测

Alberto D. Cencillo, Leonardo Concepción, Isaac Triguero, Julián Luengo

发表机构 * Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI)(安达卢西亚数据科学与计算智能研究所) Department of Computer Science and Artificial Intelligence (DECSAI), University of Granada(格拉纳达大学计算机科学与人工智能系)

AI总结 该论文提出了一种名为VACE的自监督异常检测方法,用于多变量时间序列中的异常检测。VACE通过速度对齐的通道嵌入方式,学习具有紧凑且方向一致结构的正常表示,从而更准确地识别异常。该方法无需负样本和合成异常,通过速度一致性目标训练编码器,使正常轨迹在嵌入空间中保持局部平滑和对齐。实验表明,VACE在多个基准数据集上取得了优于复杂方法的优异性能。

Comments 16 pages, 5 figures

详情
AI中文摘要

多变量时间序列中的异常检测是广泛实际应用中的关键任务,其中异常行为罕见、标签不可用且漏检成本高昂。核心挑战在于学习足够精确的正常性表征以标记偏差。表示自监督学习(通常通过对比方法)通过将时间补丁嵌入到潜在空间来解决这一问题,其中正常性占据一个定义明确的区域,异常通过几何偏差检测。然而,对比方法通过配对采样启发式间接塑造该空间,无法对基于距离评分所需的几何结构进行显式控制。这意味着正常表示的紧凑程度以及距离是否具有方向意义。我们提出VACE(速度对齐通道嵌入),一种自监督异常检测方法,将正常性表示为嵌入空间中紧凑且方向一致的区域。为此,VACE通过速度一致性目标训练通道感知编码器,无需负样本和合成异常,使得正常轨迹局部平滑且对齐。在测试时,马氏距离位置得分和速度库方向得分相乘,标记同时偏离分布和动态异常的点。尽管方法简单,VACE在严格评估下于TSB-AD-M上实现了最先进性能,显著优于使用更大预算训练的复杂方法。

英文摘要

Anomaly detection in multivariate time series is a critical task across a wide range of real-world applications, where abnormal behaviour is rare, labels are unavailable, and the cost of a miss is high. The central challenge is learning a characterisation of normality precise enough to flag deviations. Self-supervised representation learning, typically through contrastive approaches, addresses this by embedding temporal patches into a latent space where normality occupies a well-defined region, with anomalies detected by geometric deviation. However, contrastive approaches shape this space indirectly through pair-sampling heuristics, providing no explicit control over the geometric structure that distance-based scoring requires, namely how tightly normal representations are grouped and whether distances are directionally meaningful. We present VACE (Velocity-Aligned Channel Embeddings), a self-supervised anomaly detection method that represents normality as a compact, directionally coherent region in the embedding space. To this end, VACE trains a channel-aware encoder through a velocity-consistency objective, with no negatives and no synthetic anomalies, so that normal trajectories are locally smooth and aligned. At test time, a Mahalanobis positional score and a velocity-bank directional score are combined multiplicatively, flagging points that are simultaneously off-distribution and dynamically atypical. Despite its simplicity, VACE achieves state-of-the-art performance on TSB-AD-M under rigorous evaluation, significantly outperforming more complex methods trained on substantially larger budgets.
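
The multiplicative test-time score can be sketched as follows. This is an illustrative reading of the abstract rather than the paper's exact formulation: the diagonal covariance, the definition of the directional score as one minus the best cosine similarity against a bank of normal velocities, and all function names are assumptions.

```python
import math

def diag_mahalanobis(z, mean, var):
    """Mahalanobis distance under a diagonal covariance (a simplification)."""
    return math.sqrt(sum((zi - mi) ** 2 / vi for zi, mi, vi in zip(z, mean, var)))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def directional_score(velocity, bank):
    """0 when the velocity matches some normal direction, larger otherwise."""
    return 1.0 - max(cosine(velocity, b) for b in bank)

def anomaly_score(z_prev, z, mean, var, bank):
    """Product of positional and directional scores for one embedding step."""
    velocity = [b - a for a, b in zip(z_prev, z)]
    return diag_mahalanobis(z, mean, var) * directional_score(velocity, bank)

mean, var = [0.0, 0.0], [1.0, 1.0]
bank = [[1.0, 0.0]]                       # normal embeddings drift along +x
normal = anomaly_score([0.0, 0.0], [0.5, 0.0], mean, var, bank)
far_but_aligned = anomaly_score([0.0, 0.0], [3.0, 0.0], mean, var, bank)
anomalous = anomaly_score([0.0, 0.0], [0.0, 3.0], mean, var, bank)
```

Because the two scores are multiplied, `far_but_aligned` is not flagged even though it lies far from the mean; only `anomalous`, which is both far away and moving in an atypical direction, receives a large score.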

2605.23493 2026-05-25 cs.AI 版本更新

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

EDGE-OPD:通过证据引导的在线策略蒸馏内化特权上下文

Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald

发表机构 * EdgeRunner AI

AI总结 本文研究了在基于特权上下文的On-Policy Self-Distillation(OPSD)中,如何避免特权信息对模型行为产生不必要的干扰问题,并提出了EDGE-OPD方法。该方法通过引导式采样和证据掩码机制,在训练过程中更精准地注入特权信息,确保学生模型学习到目标行为而非副作用。实验表明,EDGE-OPD有效提升了身份学习的效果,并有助于保持模型的一般能力。

详情
AI中文摘要

在线策略蒸馏(OPD)作为一种LLM后训练范式,因其在不引入模型分布漂移和通用任务回归的情况下有效提升能力而受到广泛关注。在线策略自蒸馏(OPSD)是OPD的一种高效用例,它仅需单一模型同时作为学生和教师,并且具有在训练过程中向教师提供推理时缺失的特权上下文(例如角色、私有事实或已解决的方案)的优势。该方法面临的挑战在于,特权信息可能过度改变模型行为:它可能修改推理、降低通用能力,并影响响应长度、风格或局部token偏好等性能指标。因此,OPSD可能训练学生模型学习副作用而非期望的可迁移行为。本文在稀有token/身份设定下研究该问题,并提出EDGE-OPD(证据引导的在线策略蒸馏),这是OPSD的一种改进,具有两个显著特征:a) 使用引导展开在采样时向学生注入特权上下文行为,使得稀有目标行为实际出现在在线策略数据中;b) 应用证据掩码:学生仅在特权上下文支持采样token的token位置进行更新,而非展开中的每个token。实验表明,OPSD(及其变体RLSD,无论是否使用验证器)完全无法学习目标身份,而引导展开的集成使其成功。此外,掩码区域消融实验显示,角色信号定位于正证据尾部,这使我们能够获得关于高效知识迁移和通用能力保持的宝贵见解。

英文摘要

On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift and, consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an efficient use case of OPD, which is appealing because it requires only a single model acting as both student and teacher, and because it can provide the teacher with privileged context that is absent at inference time (e.g. a persona, a private fact, or a worked solution) during the training process. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect performance indicators like response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior into the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD and its variant RLSD (with and without a verifier) completely fail to learn a target identity, while the integration of guided rollouts allows them to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, which allows us to draw valuable insights about efficient knowledge transfer and preservation of general-purpose capabilities.
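
One plausible way to read the evidence mask is sketched below. The abstract does not give the exact criterion, so the rule used here (keep a position when the log-probability of the sampled token with privileged context exceeds the one without it by a margin) is an assumption, as are the function names, the margin value, and the toy numbers.

```python
def evidence_mask(logp_with_ctx, logp_without_ctx, margin=0.5):
    """1 where the privileged context raises the sampled token's log-prob
    by more than `margin`, 0 elsewhere (hypothetical criterion)."""
    return [1 if t - s > margin else 0
            for t, s in zip(logp_with_ctx, logp_without_ctx)]

def masked_mean(per_token_loss, mask):
    """Average the per-token loss over masked-in positions only."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

# Toy rollout "I am Ada ." where only the persona name depends on the context.
logp_with_ctx = [-0.2, -0.3, -0.1, -0.2]     # teacher, sees the persona
logp_without_ctx = [-0.2, -0.25, -6.0, -0.3]  # student, no persona
per_token_loss = [0.0, 0.05, 5.9, 0.1]        # illustrative distillation loss

mask = evidence_mask(logp_with_ctx, logp_without_ctx)
loss = masked_mean(per_token_loss, mask)
```

Under this rule only the name token contributes to the update, so stylistic or length-related differences at the other positions are not distilled into the student.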

2605.23482 2026-05-25 cs.CV cs.AI 版本更新

Multimodal Distribution Matching for Vision-Language Dataset Distillation

多模态分布匹配用于视觉-语言数据集蒸馏

Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon

发表机构 * Visual Intelligence Lab., KAIST(韩国科学技术院视觉智能实验室)

AI总结 该研究提出了一种名为Multimodal Distribution Matching (MDM)的多模态数据集蒸馏方法,旨在在有限的计算和内存资源下,高效生成保留视觉-语言语义信息的紧凑合成数据集。MDM通过结合数据、模型和损失层面的互补组件,实现了跨模态对齐与表示质量的保持,包括在联合嵌入空间中采样生成图像-文本对、基于预训练模型的权重空间插值构建混合教师模型,以及利用几何感知的损失函数匹配联合分布。实验表明,MDM在多个跨架构的图像-文本检索任务中表现出色,显著降低了蒸馏成本并保持了模型的鲁棒性。

Comments Accepted for publication at CVPR 2026. Project Page: https://andyj1.github.io/mdm

详情
AI中文摘要

数据集蒸馏将大型训练集压缩为紧凑的合成数据集,同时保持下游性能。随着现代系统越来越多地处理成对的视觉-语言输入,多模态蒸馏必须在严格的计算和内存预算下保持表示质量和跨模态对齐,然而先前的方法通常需要大量计算并忽略其相关性。为了解决这个问题,我们提出了多模态分布匹配(MDM),一种用于高效且可泛化的多模态蒸馏的几何感知框架。具体来说,MDM在数据、模型和损失层面集成了互补组件。在数据层面,它通过在联合嵌入空间中的聚类采样来初始化合成图像-文本对。在模型层面,它通过在权重空间中根据独立微调模型与预训练锚点的角度偏差进行插值,形成混合教师模型。在损失层面,它使用几何感知的匹配目标在单位超球面上匹配联合分布,该目标利用跨模态一致性和差异方向上的联合特征以及对称对比学习。在跨架构评估的图像-文本检索基准上,MDM生成的紧凑合成集保留了多模态语义,显著降低了蒸馏成本,并在不同架构下保持鲁棒性。

英文摘要

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook the correlations between modalities. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

2605.23478 2026-05-25 cs.CV cs.AI 版本更新

PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction

PhenoYieldNet: 学习作物感知的物候响应以进行多作物产量预测

Yu Luo, Xiaogang Zhu, Shan Zeng, Wei Xiang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu

发表机构 * School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) School of Computer Science and Information Technology, Adelaide University(阿德莱德大学计算机科学与信息技术学院) College of Mathematics and Computer Science, Wuhan Polytechnic University(武汉轻工大学数学与计算机学院) School of Computing, La Trobe University(拉筹伯大学计算学院) School of Science, Edith Cowan University(埃迪斯科文大学科学学院)

AI总结 准确预测作物产量对可持续农业和全球粮食安全至关重要。现有方法多针对单一作物,难以泛化到多种作物,且未充分考虑不同作物对天气变化的特定物候响应。本文提出PhenoYieldNet,一种面向多作物产量预测的框架,通过显式建模作物的物候响应来学习作物特异性物候特征,包含作物物候库和注意力模块,能够动态捕捉不同物候阶段的时空特征,并通过预训练模型和自监督策略提升泛化能力,实验表明其在多作物数据集上显著优于现有方法。

Comments Accepted by CVPR2026

详情
AI中文摘要

准确的作物产量预测对于可持续农业和全球粮食安全至关重要。现有方法主要针对单一作物预测开发,通常难以泛化到不同作物类型,且未能解决由复杂天气模式动态调节的独特作物物候响应。在本文中,我们提出PhenoYieldNet,一个多作物产量预测框架,通过显式建模作物对时间驱动因素的响应来学习作物特异性物候。具体来说,我们开发了一个作物感知的时间解码器,由作物物候库(CPB)和作物物候注意力(CPA)模块组成。CPB集成了一组可学习的嵌入,利用查询引导CPA模块学习特定作物最相关的物候模式。CPA模块显式捕获多尺度趋势和变化成分以构建时间上下文,使模型能够动态调整不同物候阶段的注意力。为了学习鲁棒且可泛化的多作物预测特征,编码器使用预训练基础模型初始化,并通过自监督时序对比适应策略进一步调整以对齐农业时间动态。在多作物数据集上进行的大量实验表明,我们提出的方法显著优于最先进的方法,在不同地区和作物上展现出强大的泛化能力。

英文摘要

Accurate crop yield prediction is crucial for sustainable agriculture and global food security. Existing methods are predominantly developed for single-crop prediction; they often struggle to generalize across diverse crop types and do not address the crop-specific phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling each crop's response to temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. The CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust its attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.

2605.23471 2026-05-25 cs.LG cs.AI 版本更新

CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection

CBANet:一种用于激进驾驶事件检测的紧凑型注意力CNN-BiLSTM网络

Hanadi Alhamdan, Ghadah Alosaimi, Amir Atapour-Abarghouei, Farshad Arvin

发表机构 * Department of Computer Science, Princess Nourah bint Abdulrahman University(努拉·宾特·阿卜杜勒拉赫曼公主大学计算机科学系) Department of Computer Science, Durham University(杜伦大学计算机科学系) Department of Computer Science, Imam Mohammad Ibn Saud Islamic University(伊玛目穆罕默德·本·沙特伊斯兰大学计算机科学系)

AI总结 本文提出了一种名为CBANet的紧凑型注意力机制结合CNN-BiLSTM的深度学习框架,用于检测激进驾驶事件。该方法通过构建工程化的动态特征来捕捉转向、加速和制动行为,并采用基于SMOTE的过采样与类别加权损失相结合的稳定训练策略,以应对自然驾驶数据中激进事件极度稀有的问题。实验表明,该方法在少数类召回率和安全关键F分数等指标上显著优于传统深度学习方法,同时保持了较高的计算效率。

Comments 8 pages, 4 figures, 4 tables. Submitted to IJCNN/WCCI 2026. CBANet: A compact attention-based CNN-BiLSTM framework for aggressive driving event detection using multivariate vehicle dynamics signals. Code available at https://github.com/halhamdan/CBANet

详情
AI中文摘要

激进驾驶是交通事故的主要原因,对道路安全构成严重威胁。尽管深度学习方法在从车辆传感器数据检测危险驾驶行为方面显示出有希望的结果,但它们在现实条件下的性能通常受到严重数据不平衡、驾驶员间巨大差异以及缺乏物理可解释的车辆动力学表示的限制。在本文中,我们提出了一种增强的深度学习框架,用于使用多变量车辆动力学信号进行激进驾驶检测。该方法不仅依赖原始测量,还构建了捕捉转向、加速和制动行为的工程动力学特征。为了解决自然驾驶数据中激进事件的极端稀少性,我们引入了一种稳定的训练策略,结合了基于SMOTE的受控过采样和类别加权损失公式,并评估了用于不平衡处理的焦点损失变体。此外,采用基于类别特定阈值校准的安全导向决策策略,以更好地反映现实应用中漏检和误报的不对称风险。该框架在新收集的自然驾驶数据集上进行了评估。大量实验表明,所提出的方法在保持实际计算效率的同时,在少数类召回率和安全关键F-score指标上始终优于标准深度学习基线。代码:https://github.com/halhamdan/CBANet

英文摘要

Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their performance in real-world conditions is often limited by severe data imbalance, large variability between drivers, and the lack of physically interpretable vehicle dynamics representations. In this paper, we propose an enhanced deep learning framework for aggressive driving detection using multivariate vehicle dynamics signals. Instead of relying solely on raw measurements, the proposed approach constructs engineered dynamic features that capture steering, acceleration, and braking behaviour. To address the extreme rarity of aggressive events in naturalistic driving data, we introduce a stable training strategy that combines controlled SMOTE-based oversampling with a class-weighted loss formulation, and evaluates focal loss variants for imbalance handling. Furthermore, a safety-oriented decision strategy based on class-specific threshold calibration is adopted to better reflect the asymmetric risks of missed detections and false alarms in real-world applications. The proposed framework is evaluated on a newly collected naturalistic driving dataset. Extensive experiments show that the proposed method consistently outperforms standard deep learning baselines with significant improvements in minority-class recall and safety-critical F-score metrics while maintaining practical computational efficiency. Code: https://github.com/halhamdan/CBANet
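
Two of the imbalance-handling ingredients named above, class weighting and threshold calibration, can be sketched in a few lines. This is not the CBANet code: the inverse-frequency weighting rule, the use of an F-beta score with beta = 2 as the "safety-critical" metric, and the toy labels and scores are all assumptions for illustration.

```python
def inverse_frequency_weights(labels):
    """Weight each class by n / (num_classes * class_count)."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}

def fbeta(labels, scores, threshold, beta=2.0):
    """F-beta of the positive class; beta > 1 penalises misses more."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return (1 + beta ** 2) * tp / denom if denom else 0.0

def calibrate_threshold(labels, scores, candidates, beta=2.0):
    """Pick the candidate threshold with the highest F-beta."""
    return max(candidates, key=lambda t: fbeta(labels, scores, t, beta))

# Toy validation set: 8 normal windows (0) and 2 aggressive windows (1).
labels = [0] * 8 + [1] * 2
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.45, 0.35, 0.8]

weights = inverse_frequency_weights(labels)
best_threshold = calibrate_threshold(labels, scores, [0.3, 0.5, 0.7])
```

On this toy data the calibrated threshold is the lowest candidate: it accepts three false alarms in exchange for catching both aggressive events, which matches the asymmetric-risk argument in the abstract.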

2605.23470 2026-05-25 cs.LG cs.AI cs.CE 版本更新

Learning Individual Dynamics from Sparse Cross-Sectional Snapshots

从稀疏横截面快照中学习个体动力学

Christian Lagemann, Kai Lagemann, Steven L. Brunton, Sach Mukherjee

发表机构 * Statistics and Machine Learning, German Center for Neurodegenerative Diseases (DZNE)(统计与机器学习,德国神经退行性疾病中心(DZNE)) MediaTek Research(联发科技研究) Department of Mechanical Engineering & AI Institute in Dynamic Systems, University of Washington, Seattle(机械工程与人工智能动态系统研究所,华盛顿大学,西雅图) DZNE & University of Bonn, Bonn, Germany and University of Cambridge, Cambridge, United Kingdom(DZNE与波恩大学,波恩,德国和剑桥大学,剑桥,英国)

AI总结 该研究旨在从稀疏的横截面快照中学习个体的动态演化过程,传统方法在数据稀疏或完全横截面的情况下难以准确推断个体的连续时间轨迹。本文提出了一种名为CADENCE的概率框架,通过将潜在动态与静态个体上下文关联,实现了从孤立快照中恢复个体轨迹。该方法结合了基于分数的空域编码器和软专家混合路由机制,提供了单时间点轨迹推断的可识别性保证,并在多个基准测试中表现出优于现有序列模型的性能。

详情
AI中文摘要

预测一个动力学单元如何随时间演化——例如个体如何衰老、流行病如何传播、物理系统如何退化——通常需要密集的纵向追踪。当只有极其稀疏或完全横截面的数据可用时,推断个体化的连续时间轨迹本质上是病态的。现有方法迫使严格妥协:序列模型(如潜在ODE)需要密集的纵向数据,而横截面方法(如最优传输、基于流匹配的)映射聚合群体,丢失了个体动力学。在本文中,我们证明这种二分法可以被打破。我们介绍CADENCE,一个原则性的概率框架,通过将潜在动力学锚定到静态的个体级上下文,从孤立快照中恢复连续的个体轨迹。我们为单时间点轨迹推断提供了新颖的可识别性保证。通过结合基于分数的空间编码器(双射概率流ODE)以消除微分同胚歧义,以及软混合专家(SMoE)路由器,我们证明个体动力学参数和路由函数是联合可识别的。在一系列涵盖物理系统到真实世界生物数据的基准测试中,CADENCE严格在具有上下文结构的极端稀疏快照上训练,其性能匹配或超过了在密集全轨迹数据上训练的最先进序列模型。

英文摘要

Predicting how a dynamical unit evolves over time - how an individual ages, an epidemic spreads, or a physical system degrades - typically requires dense longitudinal tracking. When only extremely sparse or entirely cross-sectional data is available, inferring individualized, continuous-time trajectories is fundamentally ill-posed. Existing methods force a strict compromise: sequence models (e.g. latent ODEs) require dense longitudinal data, while cross-sectional methods (e.g. those based on optimal transport or flow matching) map aggregate populations, losing individual dynamics. In this paper, we demonstrate that this dichotomy can be broken. We introduce CADENCE, a principled probabilistic framework that recovers continuous individual trajectories from isolated snapshots by anchoring latent dynamics to static, individual-level contexts. We provide novel identifiability guarantees for single-timepoint trajectory inference. By combining a score-based spatial encoder (bijective Probability Flow ODE) to eliminate diffeomorphic ambiguities with a Soft Mixture-of-Experts (SMoE) router, we show that the individual dynamical parameters and the routing function are jointly identifiable. Across a suite of benchmarks spanning physical systems to real-world biological data, CADENCE, trained strictly on extremely sparse snapshots with context structure, matches or exceeds the performance of state-of-the-art sequential models trained on dense, full-trajectory data.

2605.23459 2026-05-25 cs.SE cs.AI 版本更新

AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems

AI 保证:企业 AI 系统的综合测试策略

Chitra Badagi, Divye Singh, Animesh Sen, Adinath Shirsath

发表机构 * Thoughtworks Technologies(Thoughtworks技术公司)

AI总结 本文针对基于大语言模型、检索管道和自主代理的企业级AI系统,提出了一种全面的测试保障策略,以应对传统软件质量保证方法难以处理的新型风险。研究强调应将AI测试重点转向持续风险降低,而非严格的正确性验证,并将评估作为与开发同等重要的工程学科。文章引入了结构化的AI失效分类体系,提出了改进的五层AI保障金字塔,并提供了评估驱动开发、RAG系统测试、模型生命周期管理等方面的实践指导,旨在为企业工程领导者和实践者提供既有理论依据又可操作的保障策略。

详情
AI中文摘要

企业 AI 系统构建于大语言模型、检索管道和自主代理之上,引入了一类传统软件质量保证从未设计应对的风险。这些系统是概率性的、上下文敏感的和涌现性的:它们无法在经典意义上被验证为正确,只能通过不断增加信心来评估。本文提出了一种围绕三个关键原则的企业 AI 系统综合保证策略:第一,AI 测试应侧重于持续风险降低而非严格正确性验证;第二,评估必须与开发一起被视为核心工程学科;第三,AI 保证中的失败可能导致与传统确定性软件系统根本不同的组织影响。我们引入了结构化的 AI 故障分类法,提出了修订后的五层 AI 保证金字塔,并提供了关于评估驱动开发、RAG 系统测试、模型生命周期管理和治理的操作指南。目标是让工程领导者和从业者掌握一种既有哲学基础又可操作部署的策略。

英文摘要

Enterprise AI systems, built on large language models, retrieval pipelines and autonomous agents, introduce a class of risks that traditional software quality assurance was never designed to address. These systems are probabilistic, context-sensitive and emergent: they cannot be verified to be correct in the classical sense, but only evaluated with increasing confidence. This paper presents a comprehensive assurance strategy for enterprise AI systems built around three key principles: first, that AI testing should focus on continuous risk reduction rather than strict correctness verification; second, that evaluation must be treated as a core engineering discipline alongside development; and third, that failures in AI assurance can lead to organizational impacts that are fundamentally different from those seen in traditional deterministic software systems. We introduce a structured AI Failure Taxonomy, propose a revised five-layer AI Assurance Pyramid and provide operational guidance on evaluation-driven development, RAG system testing, model lifecycle management and governance. The goal is to equip engineering leaders and practitioners with a strategy that is both philosophically grounded and operationally deployable.

2605.23458 2026-05-25 cs.CV cs.AI 版本更新

One-Forcing: Towards Stable One-Step Autoregressive Video Generation

One-Forcing: 迈向稳定的一步自回归视频生成

Jiaqi Feng, Justin Cui, Yuanhao Ban, Cho-Jui Hsieh

发表机构 * Tsinghua University(清华大学) UCLA(加州大学洛杉矶分校)

AI总结 该论文提出了一种名为 One-Forcing 的方法,旨在解决单步自回归视频生成中的稳定性和质量问题。该方法通过在分布匹配蒸馏(DMD)目标中引入辅助的生成对抗网络(GAN)损失,实现了高质量且高效的单步视频生成。实验表明,One-Forcing 在 VBench 基准上取得了一步因果视频生成方法中的最优性能,并且仅需分块模型三分之一的训练成本即可实现稳定的逐帧自回归生成,而以往方法未能成功实现这一设置。

Comments Work in Progress. Project Page: https://aurora-edu.github.io/one-forcing/, Code: https://github.com/Aurora-edu/One-Forcing

详情
AI中文摘要

最近的进展显著改善了自回归机制下的实时交互式视频生成。然而,大多数现有的少步自回归视频生成方法(通常从相应的多步教师模型蒸馏而来)默认采用4步采样配置,这在部署期间仍会产生相当大的延迟,并且当进一步减少采样步数(特别是在一步设置中)时,会遭受严重的质量下降。轨迹式一致性蒸馏方法通常生成动态较弱的视频,而基于DMD的方法(如Self-Forcing)往往产生模糊的帧。为了解决这一挑战,我们提出了One-Forcing,一种简单而有效的方法,它通过向DMD目标添加辅助GAN损失,实现高质量高效的一步视频生成。在VBench上的实验表明,One-Forcing的总得分为83.76,在一步因果视频生成方法中达到了最先进的性能,并且与强大的多步方法保持竞争力。我们进一步证明,仅需分块模型三分之一的训练成本,即可稳定实现逐帧的一步自回归生成,而先前的方法未能成功实现这一设置。

英文摘要

Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.

2605.23448 2026-05-25 cs.CR cs.AI 版本更新

AI Security Research Should Better Incentivize Defense Research

AI安全研究应更好地激励防御研究

Youqian Zhang

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

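AI summary This paper argues that AI security research is imbalanced: attack work outnumbers defense work. Drawing on papers from several subfields, it finds skewed attack-to-defense ratios. Attacks are often evaluated under favorable conditions that overstate real-world threats, while defenses are held to a stricter standard, which leaves few usable defenses. The author therefore calls for stronger incentives for defense research.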

Comments 14 pages, 3 figures, 3 tables

详情
AI中文摘要

本文考察了人工智能(AI)安全研究中的不平衡:该领域倾向于产出更多关于攻击AI系统的研究,而非防御。通过相关学术论文,我们发现跨子领域(包括联邦学习、语音识别、成员推断、大语言模型等)存在偏斜的攻击-防御比例。这种不平衡可能远不止简单的计数:攻击论文通常在有利条件下进行评估,使威胁看起来比实际更严重,而防御则面临更严格的标准,很少有方法能达到。结果是文献中充斥着已证明的漏洞,而可用且已部署的防御则很少。因此,我们认为AI安全研究应更好地激励防御研究。

英文摘要

This work examines an imbalance in artificial intelligence (AI) security research: the field tends to produce more work on attacking AI systems than on defending them. Drawing on related academic papers, we find skewed attack-to-defense ratios across subfields, including federated learning, speech recognition, membership inference, and large language models. The imbalance may go well beyond a simple count: attack papers are routinely evaluated under favorable conditions that make threats look more severe than they are in practice, while defenses are held to a stricter standard that few can meet. The result is a literature rich in demonstrated vulnerabilities and thin on usable and deployed protections. We thus argue that AI security research should better incentivize defense research.

2605.23426 2026-05-25 cs.HC cs.AI 版本更新

Socially fluent AI decouples conversational signals from source identity in online interaction

社交流畅的AI在在线互动中解耦对话信号与来源身份

Lixiang Yan, Yueqiao Jin, Xibin Han, Dragan Gašević

发表机构 * School of Education, Tsinghua University(清华大学教育学院) Faculty of Information Technology, Monash University(蒙纳士大学信息技术学院) Faculty of Education, The University of Hong Kong(香港大学教育学院)

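AI summary This study tests whether socially fluent AI agents can converse online so much like ordinary people that their identity cannot be judged from conversational signals alone. In group collaboration tasks, participants could not distinguish AI from human teammates above chance, even though the conversational behaviour contained cues that separate the two. Participants relied on subjective impressions and familiar suspicion heuristics rather than on actual behavioural features, which may make online discussion easier for coordinated AI agents to influence and manipulate.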

详情
AI中文摘要

社交流畅的智能体AI现在能够以类似于普通人类对话的方式参与在线互动,这可能削弱人们仅凭对话信号推断谁是人类的能力。我们在同步文本群组交互中测试了这种可能性,将未公开的AI代理作为普通队友嵌入到分析性、创造性和伦理任务中。在786名参与者进行的1572次交互后身份判断中,人们区分AI和人类队友的能力未高于随机水平。这种失败并非因为交互缺乏身份相关信息。对话行为包含区分AI与人类的稳健线索,并支持高度准确的计算分类。相反,参与者依赖熟悉的怀疑启发式,包括响应速度、流畅性和感知的脚本化,这些与真实身份只有弱相关。表征分析进一步表明,判断是基于主观印象而非编码真实身份的行为结构组织的。这种分离为能够大规模影响和操纵在线话语的协调AI代理创造了新的脆弱性。

英文摘要

Socially fluent agentic AI can now participate in online interaction in ways that resemble ordinary human conversation, potentially weakening people's ability to infer who is human from conversational signals alone. We tested this possibility in synchronous text-based group interaction by embedding undisclosed AI agents as ordinary teammates across analytical, creative, and ethical tasks. Across 786 participants who made 1,572 post-interaction identity judgments, people did not distinguish AI from human teammates above chance. This failure did not arise because the interaction lacked identity-relevant information. Conversational behaviour contained robust cues that differentiated AI from humans and supported highly accurate computational classification. Instead, participants relied on familiar suspicion heuristics, including response speed, fluency, and perceived scriptedness, that were only weakly related to actual identity. Representational analyses further showed that judgments were organised around subjective impressions rather than the behavioural structure encoding ground truth. This dissociation creates new vulnerabilities to coordinated AI agents that can influence and manipulate online discourse at scale.

2605.23414 2026-05-25 cs.AI cs.LG 版本更新

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

当执行正确而规划仍然失败时:论基于LLM的多智能体系统的认知校准

Zehao Wang, Shilong Jin, Zhao Cao, Lanjun Wang

发表机构 * College of Intelligence and Computing, Tianjin University, Tianjin, China(天津大学智能与计算学院) School of New Media and Communication, Tianjin University, Tianjin, China(天津大学新媒体与传播学院) Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China(中国人民大学高瓴人工智能学院)

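AI summary This paper studies why LLM-based multi-agent systems can fail even when plans are executed correctly: agents misjudge their own knowledge when assessing plan feasibility, which the authors call epistemic miscalibration. They propose EPC-AW, which assesses whether plans remain supported under varying information conditions instead of verifying feasibility directly. Experiments show an average improvement of 9.75% in system-level success.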

详情
AI中文摘要

基于LLM的多智能体系统即使在计划动作正确执行时也可能失败,因为智能体在评估计划可行性时可能误判自身知识,我们将这种现象称为规划中的认知误校准。与执行错误不同,认知误校准在规划过程中是潜在的,因为生成的计划可以保持自洽且可执行,没有可观察到的错误;同时,认知误校准也是动态的,因为新信息可能改变可行性评估,可能掩盖过去的误校准信号并导致其随时间重复出现。为了解决这个问题,我们提出了认知计划校准代理工作流(EPC-AW),它评估计划在不同信息条件下是否仍得到支持,而不是直接验证可行性。EPC-AW采用基于信息一致性的计划选择,选择评估结果在智能体间稳定的计划,并结合一致性引导的认知状态细化,通过利用过去的差异来指导未来规划,从而随时间适应校准。实验表明,EPC-AW平均将系统级成功率提高了9.75%。

英文摘要

LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.
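The plan-selection idea in the abstract (prefer plans whose evaluations are stable across agents) can be illustrated with a small sketch. This is not the authors' EPC-AW implementation: the function name `select_stable_plan`, the score scale in [0, 1], the `min_support` cutoff, and the use of population standard deviation as the stability measure are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def select_stable_plan(evaluations, min_support=0.6):
    """Pick the plan whose feasibility scores vary least across agents.

    evaluations: dict mapping a plan id to a list of feasibility scores
                 in [0, 1], one per agent or information condition
                 (hypothetical input format).
    min_support: plans whose mean score falls below this are discarded
                 (hypothetical cutoff).
    Returns the chosen plan id, or None if no plan has enough support.
    """
    candidates = {
        plan: scores
        for plan, scores in evaluations.items()
        if scores and mean(scores) >= min_support
    }
    if not candidates:
        return None
    # Lowest spread wins; ties are broken by the higher mean score.
    return min(
        candidates,
        key=lambda plan: (pstdev(candidates[plan]), -mean(candidates[plan])),
    )


if __name__ == "__main__":
    scores = {
        "A": [0.9, 0.2, 0.8],  # supported on average, but agents disagree
        "B": [0.7, 0.7, 0.6],  # supported and stable
        "C": [0.5, 0.5, 0.5],  # stable, but below the support cutoff
    }
    print(select_stable_plan(scores))
```

The support cutoff is there so that a plan every agent consistently rejects is not chosen merely because its scores agree.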

2605.23409 2026-05-25 cs.CV cs.AI 版本更新

Online Hand Gesture Recognition Using 3D Convolutional Neural Networks

使用3D卷积神经网络的在线手势识别

Yinghao Qin, Tijana Timotijevic

发表机构 * School of Electronic Engineering and Computer Science, Queen Mary, University of London(伦敦玛丽女王大学电子工程与计算机科学学院)

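AI summary This paper presents an online hand gesture recognition system based on 3D convolutional neural networks that localizes and classifies gestures in a real-time video stream. A sliding-window approach refines results across multiple windows to improve robustness. The models are trained on the Jester dataset, with detector accuracy above 98% and classifier accuracy above 90%. The full system reaches 37.5% Levenshtein accuracy on a homemade dataset and responds within three seconds.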

Comments Master's dissertation work written in Autumn 2020

详情
AI中文摘要

在人机交互中,动态手势的实时检测与分类具有挑战性,因为:1) 系统必须在实时视频流中运行,且执行手势后响应无明显延迟;2) 不同人执行手势的方式差异较大,使得识别更加困难。本文提出一种在线手势识别系统,能够定位实时视频流中的手势并识别其类别。为提高系统鲁棒性,采用滑动窗口方法对多个窗口的结果进行优化。项目中的所有模型均在Jester数据库上训练,检测器准确率达到98%以上,分类器准确率达到90%以上。在系统整体性能方面,最佳组可在三秒内响应,并在自制数据集上达到37.5%的Levenshtein准确率。本工作使用的项目代码已公开。

英文摘要

In human-computer interaction, real-time detection and classification of dynamic hand gestures is challenging because: 1) the system must run on a real-time video stream and respond without noticeable lag after a gesture is performed; 2) people perform gestures in very different ways, which makes recognition more difficult. In this paper, an online hand gesture recognition system is proposed that localizes gestures in a real-time video stream and recognizes what they are. To improve the robustness of the system, a sliding-window approach is used to refine results from multiple windows. All of the models in the project are trained on the Jester dataset, achieving 98+% accuracy for the detector and 90+% accuracy for the classifier. For the overall performance of the system, the best group can respond within three seconds and reach 37.5% Levenshtein accuracy on the homemade dataset. The project code used in this work is publicly available.
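The abstract reports "Levenshtein accuracy" without defining it. A common definition in online gesture recognition, assumed here rather than taken from this work, is one minus the edit distance between the predicted and ground-truth gesture sequences, divided by the ground-truth length. The function names below are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,         # delete x
                            curr[j - 1] + 1,     # insert y
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_accuracy(predicted, truth):
    """1 - distance / len(truth); assumed definition, can go below zero."""
    if not truth:
        raise ValueError("ground-truth sequence must not be empty")
    return 1.0 - levenshtein(predicted, truth) / len(truth)


if __name__ == "__main__":
    truth = [1, 2, 3, 4, 5, 6, 7, 8]
    predicted = [1, 2, 7, 4, 5, 6, 6, 7, 8]  # one wrong label, one duplicate
    print(levenshtein_accuracy(predicted, truth))
```

Under this definition a system is penalized for missed gestures, wrong labels, and duplicate detections alike, which is why the metric suits a stream that contains several gestures in sequence.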

2605.23402 2026-05-25 cs.LG cs.AI 版本更新

Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting

非平稳概率时间序列预测的参数先验映射框架

Jinglin Li, Jun Tan, QI Fang, Ning Gui

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院)

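AI summary This paper proposes Parametric Prior Mapping (PPM), a framework for non-stationary probabilistic time series forecasting. It injects parametric structural priors into a generative model, aiming to keep the efficiency of parametric methods while capturing complex temporal dependencies. Experiments show that PPM outperforms existing baselines on non-stationary data, with a better trade-off between accuracy and computational efficiency.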

Comments 20 pages, 8 figures, accepted by ICML 2026

详情
AI中文摘要

在概率多变量时间序列(MTS)预测中有效建模非平稳动态需要在表达性和鲁棒性之间取得平衡。现有参数方法受益于强归纳偏置但缺乏灵活性,而深度生成模型在没有大量数据和计算的情况下难以捕捉复杂的时间依赖性。我们引入了参数先验映射(PPM),这是一个将参数化结构先验注入生成建模过程的框架。具体来说,PPM利用参数化估计器推导出一个动态的自适应先验,通过可学习的映射指导复杂预测分布的学习。这种设计使模型能够保留参数方法的效率,同时利用生成模型的表达能力。通过混合目标训练,PPM产生精确的预测,并具有良好校准的不确定性估计。实验结果表明,PPM在处理非平稳数据方面优于现有基线,在精度和计算效率之间提供了更好的权衡。代码可在https://github.com/ljl8336/PPM获取。

英文摘要

Effectively modeling non-stationary dynamics in probabilistic multivariate time series (MTS) forecasting requires balancing expressiveness with robustness. Existing parametric approaches benefit from strong inductive biases but lack flexibility, whereas deep generative models struggle to capture complex temporal dependencies without extensive data and computation. We introduce Parametric Prior Mapping (PPM), a framework that injects parametric structural priors into a generative modeling process. Specifically, PPM utilizes a parametric estimator to derive a dynamic, adaptive prior that guides the learning of a complex predictive distribution via a learnable mapping. This design allows the model to retain the efficiency of parametric methods while exploiting the expressive power of generative models. Trained with a hybrid objective, PPM yields precise forecasts with well-calibrated uncertainty estimates. Empirical results show that PPM outperforms existing baselines in handling non-stationary data, offering a superior trade-off between accuracy and computational efficiency. The code is available at https://github.com/ljl8336/PPM.

2605.23393 2026-05-25 cs.LG cs.AI 版本更新

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

每个组件都是一个查找:来自单一分解的令牌归因与组合

Po-Kai Chen, Niki van Stein, Aske Plaat

发表机构 * Leiden University(莱顿大学)

AI总结 该论文研究了如何从单一前向传播中解析Transformer模型中各组件对预测结果的贡献及其组合方式。作者提出了一种名为Unpack的反向递归方法,通过分解注意力和MLP子层中的信用,揭示了不同组件之间的交互强度以及每个token的归因信息,无需干预、梯度或辅助训练。实验表明,该方法在GPT-2和Pythia系列模型上有效恢复了组件间的组合结构,并展示了对token级归因的准确捕捉,验证了其在机制可解释性方面的有效性。

详情
AI中文摘要

Transformer的机制可解释性不仅需要识别哪些组件重要,还需要理解它们如何组合成产生预测的计算路径。注意力和MLP都遵循共享的键值模板 $ϕ(S)U$。我们利用这一结构开发了Unpack,一种后向递归方法,通过两个子层分解贡献,产生任意两个组件之间的交互强度、带有K/Q/V组合标签的命名端到端路径,以及每个token的归因,全部来自单次前向传递,无需干预、梯度或辅助训练。我们在间接宾语识别任务上进行了评估。在GPT-2 small上,该方法恢复了Wang等人(2023)描述的所有三种组合连接,包括每个连接的特定模式路由(K、Q或V)。为了测试超越简单复制的token级归因,我们比较了同一分解中同一名字的两次出现:第一次提及保持强归因,而重复检测位置被抑制,这一模式在匹配的控制提示中不存在。在Pythia系列从160M到6.9B参数中,这一抑制模式在每个尺度上一致地恢复,表明该方法无需真实电路标签即可追踪机制结构。代码可在 https://github.com/Fun-Cry/unpacklm 获取。

英文摘要

Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key-value template $ϕ(S)U$. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end-to-end paths with K/Q/V composition labels, and per-token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT-2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode-specific routing of each connection (K, Q, or V). To test token-level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate-detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground-truth circuit labels. Code is available at https://github.com/Fun-Cry/unpacklm.

2605.23384 2026-05-25 cs.CL cs.AI 版本更新

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

元认知作为奖励:通过知识和调节信号强化LLM推理

Sirui Chen, Lei Xu, Yuying Zhao, Yutian Chen, Yu Wang, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu

发表机构 * Tongji University(同济大学) Shanghai AI Laboratory(上海人工智能实验室) Nanyang Technological University(南洋理工大学) University of Science and Technology of China(中国科学技术大学) EPFL(洛桑联邦理工学院) Wuhan University(武汉大学)

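AI summary This paper proposes MaR, a metacognition-inspired reinforcement learning framework for improving LLM reasoning. MaR provides reward signals along two dimensions: metacognitive knowledge, which identifies task-relevant information, and metacognitive regulation, which plans and adjusts the reasoning process. This goes beyond reward designs that rely only on the final answer. Experiments on 22 benchmarks show consistent gains, and Qwen3.5-9B + MaR outperforms stronger models on several individual benchmarks.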

详情
AI中文摘要

最近的强化学习方法显著提高了LLM的推理能力。现有的奖励设计主要遵循两种范式:(1) 基于可验证奖励的强化学习(RLVR)从可执行检查或真实答案中获取结果信号,但对中间推理行为的指导有限。(2) 基于评分标准的奖励(RaR)通过使用自然语言评分标准来评估推理质量和任务合规性,超越了最终答案检查,但通常需要实例特定的评分标准和大量设计工作。为解决这些问题,我们引入了元认知奖励(MaR),一种受元认知启发的RL框架,通过两个通用过程维度指导LLM推理:i) 元认知知识,无需手工制作的实例特定评分标准即可识别任务相关信息;ii) 元认知调节,规划和调整推理过程,以提供超越最终答案结果的奖励指导。MaR将模型轨迹分解为显式的元认知组件,并通过任务知识覆盖度、调节保真度和最终答案正确性的轨迹级奖励进行优化。通过这种方式,MaR将奖励反馈扩展到推理轨迹,同时将奖励信号锚定在通用的元认知维度上。在22个基准上的实验表明,MaR持续提升模型性能,相比基础模型最高提升7.7%,相比原始DAPO最高提升11.0%。值得注意的是,Qwen3.5-9B + MaR缩小了与前沿模型的差距,在整体平均上超越GPT-OSS-120B,并在多个单独基准上超越更强模型。过程级分析进一步显示推理过程质量显著提升。MaR还能泛化到域外数据集,MaR训练的模型在平均性能上优于对应的基础模型。

英文摘要

Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.

2605.23372 2026-05-25 cs.LG cs.AI 版本更新

Curriculum reinforcement learning with measurable task representation learning

基于可度量任务表征学习的课程强化学习

Yongyan Wen, Siyuan Li, Mingjian Fu, Yiqin Yang, Xun Wang, Peng Liu

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Fuzhou University(福州大学) Chinese Academy of Sciences(中国科学院)

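AI summary This paper studies automatic curriculum generation in curriculum reinforcement learning (CRL), focusing on challenging navigation tasks with non-Euclidean task spaces, where interpolation-based methods break down. The authors propose curriculum generation based on measurable task representation learning: a variational autoencoder structure encodes rewards and state transitions to produce latent task representations whose distances reflect task similarity. Experiments on several navigation tasks show that the method surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.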

Journal ref Neural Networks, 109019 (2026)

详情
AI中文摘要

在课程强化学习(CRL)中,智能体通过一系列任务(即课程)逐步积累知识,学习过程旨在利用积累的知识最终解决具有挑战性的目标任务。虽然早期的CRL工作侧重于对候选任务进行排序,但最近的研究探索了自动课程生成。在丰富的CRL文献中,基于插值的CRL范式是主体,它通过在任务空间中利用有意义的距离度量(即可以衡量任务相似性)对初始任务分布和目标任务分布进行插值,自动生成中间任务。然而,在具有挑战性的导航任务中,非欧几里得上下文(任务)空间使得这一假设失效。为了在复杂任务中实现自动课程生成,我们提出了一种基于可度量任务表征学习的新型自动课程生成方法。为了更好地衡量相似性,我们提出将任务空间变换到潜在空间。通过一个编码奖励和状态转移的变分自编码器结构,我们获得了具有任务相似性度量属性的潜在任务表征,其中两个相近的任务嵌入对应两个在奖励和状态转移方面相似的任务。基于学习到的任务表征,我们进一步开发了一种自动课程生成方案,该方案能够有效地生成与目标任务越来越相似的新任务。我们在各种具有挑战性的导航任务中评估了我们的方法,实验结果表明,所提出的方法超越了基于插值和生成对抗网络的最先进CRL方法。

英文摘要

In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process aims to use the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on sequencing candidate tasks, recent research explores automatic curriculum generation. Within the rich CRL literature, the interpolation-based paradigm is a major line of work: it automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution in a task space equipped with a meaningful distance metric (i.e., one that measures task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex tasks, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure similarity, we propose to transform the task space into a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we obtain a latent task representation with a task similarity measurement property, in which two close task embeddings correspond to two tasks that are similar in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme that can effectively generate new tasks that are progressively more similar to the target task. We evaluate our method on a variety of challenging navigation tasks, and the experimental results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.

2605.23365 2026-05-25 cs.LG cs.AI 版本更新

Score-Based One-step MeanFlow Policy Optimization

基于分数的单步MeanFlow策略优化

Kyungyoon Kim, Donghyeon Ki, Hee-Jun Ahn, Byung-Jun Lee

发表机构 * Korea University, Decision Making Lab(高丽大学,决策实验室) Gauss Labs Inc.(Gauss Labs公司)

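AI summary This paper proposes Score-Based One-step MeanFlow Policy Optimization (SOM), which addresses the inference-time cost that diffusion and flow-matching policies incur in online reinforcement learning. SOM constructs the target velocity field directly from the Q-function via score estimation and a probability flow ODE, so no samples from the target distribution are required. On locomotion tasks it reports state-of-the-art online RL performance with a single generation step while substantially reducing training and inference time.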

详情
AI中文摘要

扩散和流匹配已成为强化学习中表达力强的策略类,但它们对多步去噪的依赖在推理时带来了大量计算开销,这在在线强化学习中尤其成问题。MeanFlow通过学习一个平均速度场,在单次网络评估中将噪声映射到数据,提供了一种有前景的替代方案。然而,MeanFlow通常需要来自目标分布的样本来构建其目标速度场,而这在在线强化学习中不可用。我们提出了基于分数的单步MeanFlow策略优化(SOM),一种演员-评论家算法,通过分数估计和概率流ODE直接从Q函数构建目标速度场,从而将概率质量集中在高价值模式上。在完全在线强化学习设置中,SOM在运动任务上以单生成步骤实现了最先进的性能,同时与先前基于扩散和流匹配的策略相比,大幅减少了训练和推理时间。

英文摘要

Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.

2605.23348 2026-05-25 cs.DC cs.AI cs.NI 版本更新

XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms

XWind: 面向可再生能源农场的跨站点大语言模型推理服务路由器

Tella Rajashekhar Reddy, Atharva Deshmukh, Liangcheng Yu, Chaojie Zhang, Mike Shepperd, Rohan Gandhi, Anjaly Parayil, Srinivasan Iyengar, Ajay Manchepalli, Debopam Bhattacherjee

发表机构 * Microsoft(微软)

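AI summary AI compute demand is growing quickly and straining power grids, while renewable energy such as wind is not well matched to demand. This paper proposes AI Greenferencing, a complementary deployment model that places modular AI compute at wind farms to create local demand for renewable supply. To serve inference under variable wind power, the authors build XWind, a lightweight, reactive, workload-agnostic inference router that uses real-time signals to configure sites and distribute requests, reducing P99 end-to-end latency by up to 52% over the strongest contender.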

详情
AI中文摘要

AI电力需求正以前所未有的速度增长,而电网往往状况不佳且难以跟上。电网扩建伴随着高昂的资本支出和远距离传输损耗,然而源头处有丰富的可再生能源,只是与需求不匹配。本文提出一种互补的AI基础设施部署模式——AI Greenferencing,将模块化AI计算带到可再生能源源头,聚焦风能,使AI部署规模得以扩展,为可再生能源站点产生本地表后需求,并帮助缓解电力公用事业日益增长的压力。我们的可行性分析表明,在Azure数据中心的50毫秒网络往返时间内,有超过890吉瓦的风电容量,并且站点规模的合理调整与风能的空间互补性使得整体集群利用率与传统部署相当。为了在可变风力供电下服务推理请求,我们构建了XWind,一个轻量级、反应式且与工作负载无关的AI推理路由器,仅使用实时信号:推理延迟、KV缓存利用率和队列深度,来动态配置站点并分发请求。在模拟三个风力供电站点的真实64-GPU A100测试平台上,使用Azure生产轨迹进行评估,XWind将P99端到端延迟比最强竞争者(也是我们的想法)降低高达52%,比基线(如功率上限和GPU空闲)降低高达98%,且在不同工作负载类型、负载水平和GPU代际上均有一致的增益。

英文摘要

AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance transmission losses, yet there is abundant renewable energy at the source, just not matched to demand. This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, focusing on wind, allowing AI footprint expansion, generating local behind-the-meter demand for renewable sites, and helping ease the growing strain on power utilities. Our feasibility analysis shows that 890+ GW of wind capacity lies within 50 ms network round trip time of Azure data centers, and that site-wise right-sizing combined with spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments. To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals: inference latency, KV-cache utilization, and queue depth, to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power-capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.
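The abstract names three real-time signals that XWind uses: inference latency, KV-cache utilization, and queue depth. The sketch below shows one way a reactive router could combine such signals into a per-site load score. It is not the XWind algorithm: the `SiteSignals` type, the weights, the normalization constants, and the `pick_site` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    """Latest observed signals for one site (hypothetical structure)."""
    name: str
    latency_ms: float      # recent inference latency
    kv_cache_util: float   # fraction of KV cache in use, 0..1
    queue_depth: int       # requests waiting at the site
    available: bool = True # False when the site currently lacks power


def pick_site(sites, latency_slo_ms=1000.0, max_queue=32,
              weights=(0.4, 0.3, 0.3)):
    """Return the name of the least-loaded available site, or None.

    Each signal is normalized to [0, 1] and combined with fixed weights;
    the weights and normalization constants are illustrative assumptions.
    """
    w_lat, w_kv, w_queue = weights
    best_name, best_score = None, float("inf")
    for site in sites:
        if not site.available:
            continue
        score = (w_lat * min(site.latency_ms / latency_slo_ms, 1.0)
                 + w_kv * min(max(site.kv_cache_util, 0.0), 1.0)
                 + w_queue * min(site.queue_depth / max_queue, 1.0))
        if score < best_score:
            best_name, best_score = site.name, score
    return best_name


if __name__ == "__main__":
    fleet = [
        SiteSignals("site-a", latency_ms=800, kv_cache_util=0.9, queue_depth=20),
        SiteSignals("site-b", latency_ms=300, kv_cache_util=0.4, queue_depth=4),
        SiteSignals("site-c", latency_ms=100, kv_cache_util=0.1, queue_depth=0,
                    available=False),  # idle, but no wind power right now
    ]
    print(pick_site(fleet))
```

A router built only on observed signals, as here, needs no model of the workload, which is consistent with the "workload-agnostic" property the abstract claims.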

2605.23344 2026-05-25 cs.CV cs.AI 版本更新

CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs

CHASD:面向LVLMs中幻觉的语言增量校准对比解码

Xiaoyi Huang, Kejia Zhang, Zhiming Luo

发表机构 * Institute of Artificial Intelligence, Xiamen University(厦门大学人工智能研究院) Department of Artificial Intelligence, Xiamen University(厦门大学人工智能系)

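AI summary This paper addresses object hallucination in large vision-language models (LVLMs) when language priors dominate, and proposes CHASD, a training-free contrastive decoding method. CHASD builds the negative branch from attention-guided localized visual perturbations and applies contrastive calibration only to low-confidence tokens during generation, which limits extra forward passes. Experiments on several benchmarks show improved hallucination-related metrics over strong training-free baselines with competitive inference efficiency.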

详情
AI中文摘要

大型视觉-语言模型展现了强大的多模态推理能力,但当语言先验主导不足或错位的视觉证据时,它们仍然容易产生对象幻觉。无训练对比解码方法通过比较原始和扰动视觉输入的预测来缓解此问题,但现有方法要么应用可能改变有用视觉证据的全局扰动,要么在每个解码步骤调用额外的负分支。在本文中,我们观察到幻觉风险是瞬态且特定于token的:视觉注意力在生成的token间转移,而一些功能token以高置信度产生,不需要对比校准。基于这一观察,我们提出面向大型视觉-语言模型的对比幻觉感知逐步解码(CHASD),一种“按需校准”的推理时框架。CHASD使用不确定性驱动的置信门控,仅当下一token的最大概率低于阈值时激活对比分支,并通过注意力引导的局部扰动构建负分支,扰动当前显著的视觉token。这种设计减少了不必要的负分支前向传播,同时保留了高置信度步骤的原始分布。在POPE、AMBER、MME、MMHal-Bench和CHAIR上的实验表明,CHASD在强无训练基线上改进了幻觉相关指标,并具有有竞争力的推理效率。

英文摘要

Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next-token is less than the threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
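The confidence gate described above can be sketched for a single decoding step. The gate condition (run the contrastive branch only when the maximum next-token probability is below a threshold) follows the abstract. The contrastive combination `(1 + alpha) * original - alpha * negative` is borrowed from earlier visual contrastive decoding work and is an assumption here, as are the function names and default values; the abstract does not state CHASD's exact formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_contrastive_step(orig_logits, negative_logits_fn,
                           threshold=0.9, alpha=1.0):
    """Return (next-token probabilities, whether the negative branch ran).

    orig_logits:        logits from the unperturbed visual input.
    negative_logits_fn: zero-argument callable that runs the perturbed
                        (negative) branch; called only when needed.
    threshold, alpha:   illustrative values, not taken from the paper.
    """
    probs = softmax(orig_logits)
    if max(probs) >= threshold:
        # High-confidence step: keep the original distribution untouched
        # and skip the extra forward pass.
        return probs, False
    neg_logits = negative_logits_fn()
    contrast = [(1.0 + alpha) * o - alpha * n
                for o, n in zip(orig_logits, neg_logits)]
    return softmax(contrast), True


if __name__ == "__main__":
    # Low-confidence step: the negative branch is evaluated.
    probs, used = gated_contrastive_step([1.0, 0.9, 0.0],
                                         lambda: [0.0, 0.9, 0.0])
    print(used, [round(p, 3) for p in probs])
```

Passing the negative branch as a callable keeps the saving explicit: on high-confidence steps the perturbed forward pass is never executed.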

2605.23341 2026-05-25 cs.RO cs.AI 版本更新

Sparse Compositional Flow Matching by geometric assembly from motion primitives

基于运动基元的几何组装的稀疏组合流匹配

Yan Tang, Yuanbo Tang, Tingyu Cao, Shaolun Huang, Yang Li

发表机构 * Tsinghua Shenzhen Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳研究生院,清华大学,深圳,中国) School of AI, Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳)人工智能学院)

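AI summary This paper studies how to generate executable motion trajectories for embodied agents such as robotic manipulators, underwater vehicles, and mobile robots, and proposes sparse compositional flow matching over motion primitives. The method composes reusable motion primitives directly in the physical trajectory space, using a structured sparse flow-matching framework with geometric constraints that enforce spatial continuity and temporal contiguity. It reports state-of-the-art accuracy on Open X-Embodiment and 3DMoTraj, improving ADE by 19.2% and FDE by 21.0% over the strongest baseline.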

详情
AI中文摘要

具身轨迹,如机器人操纵器、水下航行器和移动机器人的可执行运动序列,是具身AI的基本输出。现代生成模型通常将其视为逐点生成的密集、整体信号,拟合复杂的高维后验,而未建模数据的潜在结构,这是结构化生成模型文献早已指出的样本效率低下问题。我们认为组合潜在结构是自然的选择:许多具身任务共享重复出现的运动片段,这些片段可以明确为有限的可重用运动基元库,并且组合单元自然与子任务边界对齐以支持任务分解。然而,现有的组合生成器在潜在空间中组合,并依赖事后解码将采样单元与实际轨迹段关联。相反,我们通过具有两个耦合设计的流匹配框架直接在物理轨迹空间中组合。运动基元字典学习为每个原子配备可学习的长度掩码和二进制起始指示器,使得原子本身即为基元,在其放置位置逐字重用。然后,具有几何约束的结构化稀疏流匹配通过持续时间感知分词和可微几何损失生成二进制放置矩阵,该损失在相邻基元相遇处强制执行空间连续性和时间邻接性。在Open X-Embodiment和3DMoTraj上,该框架达到了最先进的精度,并将FDE/ADE比从1.8降至1.07,相比最强基线,ADE提高了19.2%,FDE提高了21.0%。

英文摘要

Embodied trajectories, such as the executable motion sequences of robotic manipulators, underwater vehicles, and mobile robots, are a fundamental output of embodied AI. Modern generative models often treat them as a dense, monolithic signal generated point by point, fitting an intricate high-dimensional posterior while leaving the data's latent structure unmodeled, the same sample inefficiency long identified by the structured generative model literature. We argue that a compositional latent structure is a natural choice: many embodied tasks share recurring motion fragments that can be made explicit as a finite repertoire of reusable motion primitives, and compositional units naturally align with subtask boundaries to support task decomposition. Existing compositional generators, however, compose in a latent space and rely on post-hoc decoding to relate sampled units to actual trajectory segments. We instead compose directly in the physical trajectory space through a flow-matching framework with two coupled designs. Motion-Primitive Dictionary Learning equips each atom with a learnable length mask and binary starting indicators so the atom itself is the primitive, reused verbatim wherever it is placed. Structural Sparse Flow Matching with Geometric Constraints then generates a binary placement matrix using duration-aware tokenization and a differentiable geometric loss that enforces spatial continuity and temporal contiguity where adjacent primitives meet. On Open X-Embodiment and 3DMoTraj, the framework attains state-of-the-art accuracy and reduces the FDE/ADE ratio from 1.8 to 1.07, improving ADE by 19.2% and FDE by 21.0% over the strongest baseline.

2605.23320 2026-05-25 cs.AI 版本更新

Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

人在回路的多智能体呼吸机决策支持与上下文赌博机偏好学习

Sijia Li, Xiaoyu Tan, Qixing Wang, Weiyi Zhao, Chen Zhan, Teqi Hao, Xuemin Wang, Lei Gu, Roland Eils, Xihe Qiu

发表机构 * Shanghai University of Engineering Science, Shanghai, China; Tencent Youtu Lab, Tencent, China; Department of Critical Care Medicine, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Shanghai, China; Department of Emergency Critical Disease, Songjiang Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China; Max Planck Institute for Heart Lung Research, Bad Nauheim, Germany; Fudan University, Shanghai, China; BIH at Charité – Universitätsmedizin Berlin, Berlin, Germany

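AI summary This study proposes the Ventilator Decision Support System (VDSS), a human-in-the-loop multi-agent framework that assists clinicians with ventilator adjustments. VDSS uses a contextual bandit for online preference learning, updating clinician-specific preferences from the final accepted decision, and uses structured rejection feedback to trigger targeted replanning. Retrospective ICU trajectory replay with expert review indicates higher recommendation acceptability and fewer interaction rounds.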

Comments MICCAI 2026

详情
AI中文摘要

呼吸机决策支持需要顺序决策,跟踪不断变化的生理和疾病轨迹,同时尊重安全边界和临床医生的特定调节风格。基于规则的方法很少能泛化个性化,而端到端强化学习或单一大型语言模型系统仍难以控制和审计。我们提出了呼吸机决策支持系统(VDSS),这是一个人在回路的多智能体框架,通过合同驱动的结构化接口协调模块化决策组件,并生成可追溯的证据以供审查。VDSS使用上下文赌博机进行在线偏好适应,在每个调整周期根据最终接受的决策更新临床医生特定偏好,并利用这些偏好指导后续建议。结构化的拒绝反馈触发有针对性的重新规划,以减少无效迭代并提高交互稳定性。回顾性ICU轨迹重放与专家审查表明,推荐接受度更高,达到可接受计划所需的交互轮次更少,支持临床可部署的人机协作。

英文摘要

Ventilator decision support requires sequential decisions that track evolving physiology and disease trajectories while respecting safety boundaries and clinician-specific tuning styles. Rule-based approaches rarely generalize to personalized settings, and end-to-end reinforcement learning or single large language model systems remain difficult to control and audit. We propose the Ventilator Decision Support System (VDSS), a human-in-the-loop multi-agent framework that coordinates modular decision components through contract-driven structured interfaces and produces traceable evidence for review. VDSS performs online preference adaptation with a contextual bandit, updating clinician-specific preferences from the final accepted decision at each adjustment cycle and using them to guide subsequent recommendations. Structured rejection feedback triggers targeted replanning to reduce unproductive iterations and improve interaction stability. Retrospective ICU trajectory replay with expert review indicates higher recommendation acceptability and fewer interaction rounds to reach an acceptable plan, supporting clinically deployable human-AI collaboration.

2605.23315 2026-05-25 cs.CL cs.AI 版本更新

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

没有理解的趋同:当语言模型在表示上一致但在推理上分歧时

Muhammad Usama, Dong Eui Chang

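AI summary This study asks whether the representational convergence observed across large language models trained with different objectives and architectures also extends to their reasoning processes. Analyzing 16 language models from 8 families on 800 reasoning problems, it finds that models converge at the representation level but diverge in reasoning, which suggests that convergence reflects shared input-processing constraints more than shared reasoning strategies. The findings bear on ensemble design, interpretability transfer, and evaluations of model similarity.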

详情
AI中文摘要

在不同目标和架构下训练的大型语言模型已被证明会发展出越来越相似的内部表示,这一观察被形式化为柏拉图式表示假说。这种表示趋同是否延伸到对共享表示进行操作的推理过程仍未得到检验。我们在800个涵盖数学、科学、常识和真实性的推理问题上,评估了来自8个家族(1.5B到72B参数)的16个语言模型的表示相似性,并按问题难度、计算阶段和因果相关性进行分层。我们的分析揭示了三种分离:难度反转,模型在它们共同失败的问题上趋同更多(中心核对齐[CKA] = 0.897),而在它们解决的问题上趋同较少(CKA = 0.830);生成差距,决策前表示对齐(CKA = 0.875),而决策后表示分歧(CKA = 0.274);以及附带正确性,共享信息可在模型间解码(66%的迁移准确率),但对预测的因果影响极小(在不同消融协议下翻转率为1.5%到5.5%)。这些结果表明,语言模型中的表示趋同反映了共享的输入处理约束而非共享的推理策略,对集成设计、可解释性迁移和模型相似性评估有直接影响。代码可在 https://github.com/Usama1002/convergence-without-understanding 获取。

英文摘要

Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence-without-understanding.
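
The CKA scores quoted above compare two sets of activations recorded on the same inputs. The abstract does not say which CKA variant was used, so the following is only a minimal sketch of the common linear form, CKA(X, Y) = ||Xᵀ Y||_F² / (||Xᵀ X||_F · ||Yᵀ Y||_F) on column-centered matrices, written in plain Python. The function names and the toy data are illustrative assumptions, not taken from the paper's code.

```python
import math


def center_columns(m):
    """Subtract each column's mean; m is a list of rows (n samples x p features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices that share the same n rows.

    Assumes neither matrix is constant across samples; otherwise the
    denominator is zero and the score is undefined.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


if __name__ == "__main__":
    # Toy activations: 4 inputs, 2 features per "model" (illustrative only).
    acts_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [0.0, 1.0]]
    # A rotated and rescaled copy of acts_a: linear CKA ignores both changes.
    acts_b = [[-3.0 * b, 3.0 * a] for a, b in acts_a]
    print(linear_cka(acts_a, acts_b))
```

Because linear CKA is invariant to orthogonal transforms and isotropic scaling of either representation, a high score indicates shared geometry, not that the two models use the information the same way. That is the gap the abstract's causal ablation results point at.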

2605.23311 2026-05-25 cs.AI 版本更新

DART: Semantic Recoverability for Structured Tool Agents

DART:结构化工具代理的语义可恢复性

Ke Yang, Panpan Li, Zonghan Wu, Kejin Xu, Huaxi Huang, Xiaoshui Huang

发表机构 * MOS Intelligent Connectivity Technology Co. Ltd.(MOS智能连接技术有限公司) Sichuan Vocational College of Post and Telecom(四川邮电职业技术学院) East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学)

AI总结 当结构化工具代理在执行过程中发生故障时,系统面临一个两难问题:重新执行整个任务虽然安全但效率低,而从局部检查点恢复虽然高效,却可能导致下游工作依赖于已不存在的上游历史。为了解决这一问题,DART 提出了一种模块化的运行时机制,能够定位失败实例、验证其语义可恢复的边界、对齐检查点,并选择一个在依赖和效果约束下可接受的恢复点,从而在保证下游工作不受影响的前提下实现安全恢复。实验表明,DART 在多个基于大语言模型的领域中成功恢复了传统局部恢复方法无法处理的语义敏感场景,且安全审计未发现任何不安全的回滚操作。

详情
AI中文摘要

当结构化工具代理在执行过程中失败时,运行时面临两难:重放整个任务安全但浪费资源,而从本地检查点恢复高效但可能使已提交的下游工作与不再存在的上游历史相关联。这种紧张关系在承诺敏感场景中尤为突出,其中回滚目标是一个失败的实例,但下游消费者已经对其输出采取了行动。现有的恢复方法提供机械回滚,但没有标准来判断本地恢复在下游提交后是否保持语义有效。我们将这一差距形式化为语义可恢复性,并在DART中解决它,DART是一个模块化运行时,它定位失败实例,认证该实例的语义可恢复边界,将检查点对齐到这些边界,并选择一个可接受的恢复点,该恢复点在依赖和效果约束下保留已提交的下游工作——否则阻止恢复。在三个LLM驱动的领域以及基于LangGraph的基板上的外部验证中,DART正确恢复了所有评估的承诺敏感案例,而基线局部恢复失败,并且一个五领域安全审计未发现不安全的允许回滚。这些结果表明,控制器的合法性并不意味着语义有效性,而合理的局部恢复需要明确的可接受性检查。

英文摘要

When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists. This tension is acute in commitment-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints, or blocks recovery otherwise. Across three LLM-driven domains and external validation on a LangGraph-based substrate, DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check.

2605.23297 2026-05-25 cs.AI cs.DC 版本更新

Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems

本体知识块:可信AI系统的可执行合规与基于配置文件的验证

Aasish Kumar Sharma, Julian M. Kunkel

AI总结 本文提出了一种名为本体知识块(Ontological Knowledge Blocks, OKBs)的可执行治理框架,用于实现可信AI系统的合规性与基于配置文件的验证。OKBs将法规义务编译为可由机器验证的约束条件,结合RDF/OWL本体、SHACL验证规则、证据要求和溯源链接,实现了自动化合规检查。研究通过两个原型系统在高性能计算资源分配场景中进行了评估,验证了其在不同治理配置文件下的有效性与性能表现。

Comments 6 pages, 3 figures. Accepted at the Security, Trust and Privacy for Software and Applications (STPSA) Workshop, IEEE COMPSAC 2026, Madrid, Spain, July 7-10, 2026

详情
AI中文摘要

部署在关键数字基础设施中的AI服务需遵守透明度、问责制、公平性和可追溯性等治理义务。目前的合规仍以文档为中心:义务用散文描述,审计依赖静态检查表,验证依赖人工审查。此类方法无法扩展到自动化AI系统。本文引入本体知识块(OKBs),一种可编程治理基础设施,将监管义务编译为结构化证据图上的机器可检查约束。我们将OKB形式化为一个五元组,将规范性义务绑定到RDF/OWL概念模式、可执行的SHACL验证规则、明确的证据要求和PROV-O溯源链接。一个确定性监管编译器将结构化中间表示(IR)记录转换为可组合的KB模块,实现基于配置文件的治理重配置而无需修改服务代码。我们实现了两个原型,并在AI辅助HPC资源分配场景中进行了24次验证运行和四个治理配置文件的评估。结果表明配置文件敏感的验证、严格累加的违规累积、SHACL验证延迟在12.6毫秒至100.3毫秒之间,以及配置文件等价性测试确认Combined是最严格全面的配置文件。所有工件均以开源形式发布。

英文摘要

AI-enabled services deployed in critical digital infrastructure are subject to governance obligations spanning transparency, accountability, fairness, and traceability. Compliance today remains documentation-centric: obligations are described in prose, audits rely on static checklists, and verification depends on manual review. Such approaches do not scale to automated AI systems. This paper introduces Ontological Knowledge Blocks (OKBs), a programmable governance infrastructure that compiles regulatory obligations into machine-checkable constraints over structured evidence graphs. We formalize an OKB as a 5-tuple that binds normative obligations to an RDF/OWL concept schema, executable SHACL validation rules, explicit evidence requirements, and PROV-O provenance links. A deterministic regulatory compiler translates structured Intermediate Representation (IR) records into composable KB modules, enabling profile-based governance reconfiguration without modifying service code. We implement two prototypes and evaluate them in an AI-assisted HPC resource allocation scenario across 24 validation runs and four governance profiles. Results demonstrate profile-sensitive validation, strictly additive violation accumulation, SHACL validation latency between 12.6 ms and 100.3 ms, and profile equivalence testing confirming Combined as the strictly most comprehensive profile. All artefacts are released as open source.

2605.23296 2026-05-25 cs.AI 版本更新

Parallel Context Compaction for Long-Horizon LLM Agent Serving

长视界LLM智能体服务的并行上下文压缩

Musa Cim, Burak Topcu, Chita Das, Mahmut Taylan Kandemir

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 长期对话场景下的大语言模型代理在运行过程中会积累越来越多的对话历史,最终超出模型的上下文窗口限制。为解决这一问题,本文提出了一种并行上下文压缩方法,通过并行处理对话历史,实现了对摘要长度的精细控制和更高效的推理性能。实验表明,该方法在多个基准测试中优于传统的串行压缩方式,显著提升了处理效率和稳定性。

详情
AI中文摘要

长视界LLM智能体积累的对话历史会逐渐增长,最终超出模型的上下文窗口。基于LLM摘要的上下文压缩可以保持对话有界,但摘要本质上有损,且阻塞调用会暂停智能体推理数十秒。此外,由于提示指令基本被忽略,操作者无法细粒度控制摘要体积,随着上下文增长,模型生成的输出令牌数量及其保留的信息在不同运行间波动显著,使得智能体保留的知识在不同运行间不可预测。我们引入了长视界智能体流的并行压缩,并在HotpotQA多跳问答和LoCoMo长上下文对话基准上,针对8B到120B参数的四个骨干模型(混合密集和MoE架构,包括推理和非推理模型)与顺序同步基线进行对比。并行压缩使操作者能够细粒度、可预测地控制摘要体积,并支持每块更针对性的提示工程。在匹配的压缩解码体积下,它相比顺序基线减少了端到端耗时并提升了压缩吞吐量。

英文摘要

Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model's context window. Context compaction via LLM-based summarization keeps the conversation bounded, but summarization is inherently lossy and the blocking call stalls agent inference for tens of seconds. Moreover, the operator has no fine-grained control over summary volume since prompt instructions are largely ignored, and as context grows, both the number of output tokens the model produces and the information it retains fluctuate substantially from run to run, making the agent's retained knowledge unpredictable across runs. We introduce parallel compaction for long-horizon agentic flows and characterize it against the sequential synchronous baseline across four backbones spanning 8B to 120B parameters, mixing dense and MoE architectures with reasoning and non-reasoning models, on the HotpotQA multi-hop QA and LoCoMo long-context dialogue benchmarks. Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.
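
The abstract does not spell out the mechanism. One plausible reading of "per block" compaction is to split the history into contiguous blocks, summarize each block concurrently under its own output budget, and concatenate the results in order. The sketch below illustrates that reading only. `summarize_block` is a hypothetical stand-in for an LLM call (here it just truncates to a word budget), and the block count and budget parameters are assumptions, not values from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(messages, num_blocks):
    """Split the history into contiguous, near-equal blocks, preserving order."""
    size, extra = divmod(len(messages), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty blocks when there are few messages
            blocks.append(messages[start:end])
        start = end
    return blocks


def summarize_block(block, max_words):
    """Hypothetical placeholder for an LLM summarization call.

    It keeps only the first max_words words, so the per-block output
    budget is enforced by construction rather than by a prompt.
    """
    words = " ".join(block).split()
    return " ".join(words[:max_words])


def parallel_compact(messages, num_blocks=4, max_words_per_block=8):
    """Summarize each block concurrently and return summaries in history order."""
    blocks = split_into_blocks(messages, num_blocks)
    if not blocks:
        return []
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        # Executor.map yields results in input order, so chronology is kept.
        return list(
            pool.map(lambda b: summarize_block(b, max_words_per_block), blocks)
        )


if __name__ == "__main__":
    history = [f"turn {i} says something fairly long here" for i in range(10)]
    for summary in parallel_compact(history, num_blocks=3, max_words_per_block=5):
        print(summary)
```

With a real model behind `summarize_block`, the total summary volume is bounded by the number of blocks times the per-block cap, which is one way an operator could obtain the predictable control the abstract describes.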

2605.23285 2026-05-25 cs.LG cond-mat.stat-mech cs.AI 版本更新

Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints

具有同配性约束的微正则图集成的强化学习

Hoyun Choi, Junghyo Jo, Deok-Sun Lee

发表机构 * School of Computational Sciences, Korea Institute for Advanced Study(韩国高等科学研究院计算科学系) Department of Physics Education, Seoul National University(首尔国立大学物理教育系) Center for Theoretical Physics and Artificial Intelligence Institute, Seoul National University(首尔国立大学理论物理与人工智能研究所) Center for AI and Natural Sciences, Korea Institute for Advanced Study(韩国高等科学研究院人工智能与自然科学中心)

AI总结 本文研究如何通过强化学习生成满足特定 assortativity(度-度相关性)约束的微正则图系,以精确控制网络结构特性。提出了一种基于强化学习的深度微正则图生成器(DMGG),通过度保持的重连操作,使图的 assortativity 精确达到目标值,克服了传统方法在参数调校和生成效率上的不足。该方法能够在不同规模、稀疏度和拓扑结构的图上生成精确的无偏模型,有助于定量分析网络的次级特性,如聚类系数,为研究网络结构与功能的关系提供了有力工具。

详情
AI中文摘要

网络结构如何决定功能是一个基本问题,可以通过具有精确控制结构属性的图集成来研究。规范方法(如指数随机图模型ERGM)仅期望约束,允许个体实现围绕目标波动。相反,微正则集成施加硬约束,但除固定度序列外的实用采样方法仍难以实现。本文介绍深度微正则图生成器(DMGG),一种强化学习(RL)框架,通过保度重连变换任意给定图,以精确达到指定的同配性(表征相邻节点的度-度相关性)。DMGG不依赖于ERGM的熵主导的Metropolis-Hastings动力学,而是采用策略引导搜索,最大程度地改变联合度矩阵。这消除了详尽的参数调优,并在保持构型多样性的同时将生成速度提高至少一个数量级。由于DMGG可推广到各种图大小、稀疏性和拓扑结构,它提供了精确的零模型,允许定量隔离二次可观测量(如聚类系数)。这些结果确立了RL作为生成硬约束图的实用且强大的范式,为研究无集成伪影的结构-功能关系开辟了途径。

英文摘要

How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlled structural properties. Canonical approaches, formulated as exponential random graph models (ERGMs), enforce constraints only in expectation, allowing individual realizations to fluctuate around the target. Conversely, microcanonical ensembles impose hard constraints exactly, but practical sampling methods beyond fixing the degree sequence have remained out of reach. Here we introduce the Deep Microcanonical Graph Generator (DMGG), a reinforcement learning (RL) framework that transforms any given graph through degree-preserving rewirings to exactly reach a prescribed assortativity, which characterizes the degree-degree correlation of adjacent nodes. Instead of relying on the entropically dominated Metropolis-Hastings dynamics of the ERGM, DMGG employs a policy-guided search that maximally alters the joint-degree matrix. This eliminates exhaustive parameter tuning and accelerates generation by at least an order of magnitude while preserving configurational diversity. As DMGG generalizes across various graph sizes, sparsities, and topologies, it provides exact null models that allow for the quantitative isolation of secondary observables, such as the clustering coefficient. These results establish RL as a practical and powerful paradigm for generating hard-constrained graphs, opening avenues to investigate structure-function relationships free from ensemble artifacts.
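
To make "degree-preserving rewiring" and "assortativity" concrete, the sketch below implements the standard double edge swap, which replaces edges (a, b) and (c, d) with (a, d) and (c, b) so that every node keeps its degree, together with the degree assortativity coefficient computed as the Pearson correlation of degrees across edge endpoints. The greedy acceptance rule in `rewire_toward` is a simple hill-climbing baseline chosen for illustration. It is not the RL policy of DMGG, and all function names are assumptions.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected simple graph given as an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def assortativity(edges):
    """Pearson correlation of endpoint degrees, counting both edge orientations.

    Undefined (division by zero) for regular graphs, where all degrees are equal.
    """
    deg = degrees(edges)
    xs = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
    ys = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var


def double_edge_swap(edges, rng):
    """Try one swap (a,b),(c,d) -> (a,d),(c,b); return None if it is rejected."""
    i, j = rng.sample(range(len(edges)), 2)
    (a, b), (c, d) = edges[i], edges[j]
    if len({a, b, c, d}) < 4:  # shared endpoint: swap would be a no-op or self-loop
        return None
    existing = {frozenset(e) for e in edges}
    if frozenset((a, d)) in existing or frozenset((c, b)) in existing:
        return None  # would create a multi-edge
    new_edges = list(edges)
    new_edges[i], new_edges[j] = (a, d), (c, b)
    return new_edges


def rewire_toward(edges, target, steps=2000, seed=0):
    """Greedy baseline: keep only swaps that move assortativity closer to target."""
    rng = random.Random(seed)
    current = list(edges)
    best_gap = abs(assortativity(current) - target)
    for _ in range(steps):
        candidate = double_edge_swap(current, rng)
        if candidate is None:
            continue
        gap = abs(assortativity(candidate) - target)
        if gap < best_gap:
            current, best_gap = candidate, gap
    return current


if __name__ == "__main__":
    graph = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (3, 5), (5, 6), (6, 7), (4, 7)]
    rewired = rewire_toward(graph, target=0.0, steps=500, seed=1)
    print(assortativity(graph), assortativity(rewired))
    print(degrees(graph) == degrees(rewired))
```

A greedy search of this kind can stall before reaching the target exactly and says nothing about how uniformly it samples the constrained ensemble. Those are the aspects the abstract attributes to the learned policy.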

2605.23272 2026-05-25 cs.LG cs.AI 版本更新

When Good Equations Get Bad Scores: Improving Symbolic Regression Through Better Parameter Optimization

当好方程得到差分数:通过更好的参数优化改进符号回归

Boxiao Wang, Kai Li, Zhiwei Chen, Yang Huang, Runxiang Wang, Ziwen Zhang, Yifan Zhang, Jian Cheng

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 符号回归(SR)在科学知识发现中扮演重要角色,旨在从观测数据中提炼出数学方程。现有方法通常采用双层优化框架,但参数拟合质量直接影响结构评分,导致正确结构可能因局部最优解而被低估。为此,本文提出SAGE-Fit,一种基于符号表达式结构与语义先验的拟合框架,有效缓解了优化瓶颈,显著提升了符号回归系统的评估准确性和整体性能。

详情
AI中文摘要

符号回归(SR)通过从观测数据中提炼数学方程,在科学知识发现中发挥核心作用。大多数现有SR方法在双层优化框架内运行:外层循环搜索离散方程结构,内层循环优化该结构的连续参数。关键的是,参数拟合质量直接决定结构的得分,从而影响外层搜索。然而,非线性算子使得内层循环高度非凸,且预算驱动的快速局部求解器(如BFGS)的依赖常常导致正确的结构陷入较差的局部极小值并被低估得分。这种“好结构、差分数”现象成为关键瓶颈,降低效率并误导搜索偏离真实方程。为解决此问题,我们提出SAGE-Fit(结构感知与语义引导的符号回归评估器),一个利用符号表达式双重原生先验的SR原生拟合框架。通过利用SR特有的结构和语义先验,我们为每个属性设计定制模块,从而有效缓解这一优化瓶颈。大量实验表明,我们的方法作为即插即用模块,显著提升评估保真度,并普遍提高各种SR系统的性能。

英文摘要

Symbolic Regression (SR) plays a central role in scientific knowledge discovery by distilling mathematical equations from observational data. Most existing SR methods function within a bi-level optimization framework: an outer loop that searches for the discrete equation structure, and an inner loop that optimizes the continuous parameters of that structure. Crucially, parameter-fitting quality directly determines a structure's score and thus the outer-loop search. However, nonlinear operators make the inner loop highly non-convex, and budget-driven reliance on fast local solvers (e.g., BFGS) often yields poor local minima and underestimated scores for correct structures. This "Good Structure, Bad Score" phenomenon becomes a key bottleneck, degrading efficiency and misguiding the search away from the true equation. To resolve this, we propose SAGE-Fit (Structure-Aware and Semantics-Guided Evaluator for Symbolic Regression), an SR-native fitting framework that exploits the dual native priors of symbolic expressions. By capitalizing on the structural and semantic priors unique to SR, we design tailored modules for each property, thereby effectively mitigating this optimization bottleneck. Extensive experiments demonstrate that our approach, as a plug-and-play module, significantly enhances evaluation fidelity and universally improves the performance of various SR systems.

2605.23271 2026-05-25 cs.CV cs.AI 版本更新

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

EvalVerse:面向专业电影级视频生成的流水线感知与专家校准基准测试

Songlin Yang, Haobin Zhong, Ruilin Zhang, Xiaotong Zhao, Shuai Li, Kai Zheng, Xuyi Yang, Zhe Wang, Zhenchen Tang, Yang Li, Bohai Gu, Zhengwei Peng, Yidan Huang, Mengzhou Luo, Yihang Bo, Dalu Feng, Yujia Zhang, Juntao Ma, Ruiqi Wang, Lvmin Zhang, Yuwei Guo, Frank Guan, Maneesh Agrawala, Hongbo Fu, Alan Zhao, Anyi Rao

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tencent(腾讯) Tsinghua University(清华大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Film Academy(北京电影学院) Stanford University(斯坦福大学) The Chinese University of Hong Kong(香港中文大学) Singapore Institute of Technology(新加坡理工大学)

AI总结 随着生成式视频基础模型的快速发展,影视级视频生成成为研究热点,但现有的评估方法多关注生成内容是否符合提示,而忽视了其艺术质量、表演和美学表现。为解决这一问题,本文提出 EvalVerse,一个流程感知且由专家校准的评估框架,通过构建专业影视制作流程的评估体系、收集大规模专家标注数据,并结合专家校准的微调策略提升视觉语言模型的推理能力,从而实现对视频生成质量的全面评估,为未来奖励模型和评估代理的研究提供了基础支撑。

详情
AI中文摘要

生成式视频基础模型的快速发展推动该领域向专业级电影合成迈进。为达到如此苛刻的质量,社区正转向强化学习和智能体工作流。然而,可靠的评估已成为关键瓶颈。现有基准主要评估“是否正确”(基本提示遵循),而从根本上忽略了“是否优良”(电影质量、表演和美学)。此外,当前的自动指标缺乏提供可信信号所需的领域特异性,在人类审美感知与机器评分之间造成了严重的可信度差距。为弥合这一差距,我们引入了EvalVerse,一个全面、流水线感知且专家校准的评估框架。我们将视频生成评估不仅视为一项工程任务,而是作为一个核心科学问题:主观电影专业知识的系统数字化。首先,我们将领域知识组织成与专业电影制作工作流(前期制作、制作和后期制作)一致的评估分类法。其次,我们将人类专家判断提炼为带有大规模人工标注的精选数据集。第三,我们通过专家校准的微调策略将这些知识注入视觉语言模型,使VLM能够执行显式的思维链推理。与先前工作相比,EvalVerse不仅保持与基础“正确性”指标的兼容性,还显著扩展了“优良性”标准,并将任务覆盖范围拓宽到复杂的多镜头序列和视听整合。因此,通过提供细粒度的诊断信号,EvalVerse超越了静态排行榜,为未来工作(如奖励模型和评估智能体)建立了基础基础设施。

英文摘要

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agents.

2605.23270 2026-05-25 cs.CV cs.AI cs.RO 版本更新

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

ChainFlow-VLA: 基于视觉语言模型的因果流规划

Xiyang Wang, Xinlin Wang, Tingguang Zhou, Gong Chen, Xingtai Gui, Zhi Xu, Xiaolei Wu, Feiyang Tan, Hangning Zhou, Mu Yang

发表机构 * Afari Intelligent Drive(阿法瑞智能驾驶) Tianjin University(天津大学) University of Macau(澳门大学)

AI总结 当前端到端自动驾驶系统在时间因果推理与全局轨迹一致性之间存在根本性矛盾。为解决这一问题,本文提出 ChainFlow-VLA,通过统一因果生成与全局优化的联合概率框架,将因果推理与全局轨迹修正相结合。该方法利用视觉语言模型作为语义先验,在保留因果结构的基础上进行轨迹修正,实验表明其在复杂场景中表现出色,达到了与人类相当的高水平性能。

详情
AI中文摘要

当前的端到端自动驾驶系统从根本上受到时间因果推理与全局轨迹一致性之间不匹配的限制。自回归(AR)模型通过因果分解捕获交互感知的时间依赖性,但其逐步解码导致误差累积和次优的全局结构。相比之下,扩散模型全局优化轨迹但缺乏显式因果约束,使其在交互和关键安全场景中不可靠。这种二分法揭示了一个更深层次的问题:现有方法将因果建模和全局优化视为分离的范式,没有原则性的方式将它们统一在单个轨迹分布中。为了解决这个问题,我们提出了ChainFlow-VLA,它在统一的概率框架内统一了因果生成和全局细化。我们将规划公式化为AR诱导模式的混合,并学习这些模式上的视觉语言模型(VLM)条件残差分布。自回归生成器(Chain)生成一组离散的因果轨迹模式,随后基于扩散的细化器(Flow)利用VLM隐藏状态作为语义先验,在残差空间中执行模式条件校正,同时保持因果结构。这种直接的调节将高层场景理解无缝注入到细粒度的轨迹调整中。实验表明,ChainFlow-VLA在模糊和长尾场景中实现了鲁棒的规划,在NAVSIM v1排行榜上取得了94.85的最新分数,匹配人类水平(94.8)。代码将在https://github.com/AFARI-Research/ChainFlow-VLA提供。

英文摘要

Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety-critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow-VLA, which combines causal generation and global refinement within a unified probabilistic framework. We formulate planning as a mixture over AR-induced modes and learn Vision-Language Model (VLM)-conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure. This straightforward conditioning seamlessly injects high-level scene understanding into fine-grained trajectory adjustments. Experiments demonstrate that ChainFlow-VLA plans robustly in ambiguous and long-tail scenarios, achieving a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard, matching human-level performance (94.8). Code will be available at https://github.com/AFARI-Research/ChainFlow-VLA.

2605.23264 2026-05-25 cs.CV cs.AI 版本更新

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

着色噪声:用于忠实图像超分辨率的对抗性Sobolev对齐

Hongbo Wang, Huaibo Huang, Pin Wang, Jinhua Hao, Chao Zhou, Ran He

发表机构 * MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

AI总结 图像超分辨率生成中,生成先验常导致还原不够忠实,本文认为这是由于各向同性目标与自然图像内在流形之间存在基本的谱不匹配。为解决这一问题,研究提出了一种基于Sobolev诱导黎曼几何的ASASR框架,通过显式地对噪声转移核进行谱色处理,使其更符合自然图像的谱衰减特性,并引入基于Riesz表示定理的参数化对抗网络,生成针对性的负样本以引导优化方向。实验表明,该方法在保持谱一致性和结构保真度方面优于现有生成方法,有效减少了伪影。

Comments Accepted to ICML 2026

详情
AI中文摘要

图像超分辨率(SR)中的生成先验常常损害忠实重建,我们将这一限制归因于各向同性目标与内在自然图像流形之间的基本光谱失配。虽然直接偏好优化提供了一条对齐路径,但其对光谱平坦高斯噪声的依赖无法区分真实高频细节与幻觉。为了弥合这一几何差距,我们提出了ASASR,一个理论基础的框架,通过显式着色噪声转移核以镜像自然光谱衰减,将生成流重铸为Sobolev诱导的黎曼几何。驱动这一几何对齐,我们集成一个基于Riesz表示定理的参数化对抗器,该对抗器合成目标负样本,等效于最坏情况下的Sobolev梯度,以沿着可能结构失效的切空间引导优化。大量评估表明,ASASR优于领先的生成基线,特别是在保持光谱一致性和结构保真度方面,提供了一种有效缓解伪影的鲁棒解决方案。

英文摘要

Generative priors in Image Super-Resolution (SR) often compromise faithful restoration. We attribute this limitation to a fundamental spectral misalignment between isotropic objectives and the intrinsic natural image manifold. While Direct Preference Optimization offers a path to alignment, its reliance on spectrally flat Gaussian noise fails to distinguish authentic high-frequency details from hallucinations. To bridge this geometric gap, we propose ASASR, a theoretically grounded framework that recasts the generative flow into a Sobolev-induced Riemannian geometry by explicitly coloring the noise transition kernel to mirror natural spectral decay. Driving this geometric alignment, we integrate a parametric adversary grounded in the Riesz Representation Theorem, which synthesizes targeted negative samples equivalent to worst-case Sobolev gradients to direct optimization along the tangent space of plausible structural failures. Extensive evaluations demonstrate that ASASR outperforms leading generative baselines, particularly in preserving spectral consistency and structural fidelity, offering a robust solution that effectively mitigates artifacts.

2605.23263 2026-05-25 cs.RO cs.AI cs.SY eess.SP eess.SY 版本更新

6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

6G通信网络赋能具身智能体:架构与原型

Lipeng Dai, Luping Xiang, Kun Yang

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Institute of Intelligent Networks and Communications (NINE), Nanjing University (Suzhou Campus)(南京大学智能网络与通信研究所(苏州校区))

AI总结 本文研究了6G通信网络如何支持具身智能体的通信需求,探讨了具身智能体与6G网络之间的协同关系,并提出了面向人机远程交互的分层通信架构。通过构建包含触觉设备、工业机械臂和5G O-RAN测试平台的原型系统,验证了该架构在毫秒级时延和稳定闭环控制方面的可行性,为未来6G与具身智能体的融合应用提供了重要参考。

详情
AI中文摘要

具身智能体将智能决策与物理执行相结合,对通信提出了比纯软件智能体更严格和多样化的要求。尽管6G承诺亚毫秒级延迟、超高可靠性、原生智能和集成感知,但如何利用这些能力支持具身智能体通信的系统性研究仍然有限。本文从概念和工程两个角度研究了面向具身智能体的6G通信系统。首先,我们回顾了具身智能体的概念和具身价值,并澄清了其与非具身智能体的区别。然后,我们分析了具身智能体与6G网络的共生关系,强调了关键6G使能技术如何支持人机交互的严苛需求。此外,我们展示了具身智能体通过覆盖扩展、环境感知和物理世界理解在增强通信网络中的主动作用。基于这些见解,我们提出了一种用于人机远程交互的分层通信架构,包括人类意图感知层、基于开放无线接入网(O-RAN)的传输层、智能中间层和具身层。为验证其可行性,我们实现了一个端到端原型,集成了触觉设备、工业机械臂、中间平台和5G O-RAN测试床。实验结果表明毫秒级延迟和稳定的闭环操作,证实了所提架构的实用性,并为未来6G-具身智能体研究和工业部署提供了参考。

英文摘要

Embodied agents, which couple intelligent decision-making with physical actuation in the real world, impose far more stringent and heterogeneous communication requirements than purely software-based agents. While 6G promises sub-millisecond latency, ultra-high reliability, native intelligence, and integrated sensing, systematic studies on how to exploit these capabilities for embodied agent communication remain limited. This article investigates 6G-enabled communication systems for embodied agents from both conceptual and engineering perspectives. First, we review the concept and embodiment value of embodied agents and clarify their distinctions from disembodied agents. Then, we analyse the symbiotic relationship between embodied agents and 6G networks. We highlight how key 6G enablers can support the stringent requirements of human-robot interaction. Furthermore, we demonstrate the proactive role of embodied agents in bolstering communication networks through coverage extension, environmental sensing, and physical world understanding. Building on these insights, we propose a hierarchical communication architecture for human-robot remote interaction, comprising a human-intent perception layer, an open radio access network (O-RAN)-based transport layer, an intelligent intermediary layer, and an embodiment layer. To validate its feasibility, we implement an end-to-end prototype that integrates a haptic device, an industrial robotic arm, an intermediary platform, and a 5G O-RAN testbed. Experimental results demonstrate millisecond-level latency and stable closed-loop operation, confirming the practicality of the proposed architecture and providing a reference for future 6G-embodied agent research and industrial deployments.

2605.23262 2026-05-25 cs.AI 版本更新

Design and Report Benchmarks for Knowledge Work

知识工作的设计与报告基准

Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian

发表机构 * Harvard University(哈佛大学) University of Technology Sydney(悉尼科技大学) Stanford University(斯坦福大学) Raycaster AI

AI总结 本文针对知识工作领域的人工智能系统评估问题,提出了一种三步骤的基准设计方法,以明确任务评分与实际工作成果之间的对应关系。研究指出当前知识工作评估仍沿用传统NLP任务逻辑,难以真实反映系统在实际部署中的能力。为此,作者从工作活动、测试环境和评分标准三个维度构建基准设计框架,并基于O*NET职业任务数据库提炼出18类工作活动,结合三个实际案例展示了该方法在不同知识工作场景中的应用与效果。

详情
AI中文摘要

LLM智能体的发展催生了越来越多关于知识工作AI的研究,包括编程、研究和医疗保健。然而,当前的知识工作评估和基准设计在很大程度上仍遵循传统NLP任务的逻辑。因此,更高的基准性能并不能可靠地表明系统能够在实际部署环境中执行知识工作。本文提出了一种三步法,用于明确基准任务如何代表其分数所附的工作主张:定义被评估的工作活动,指定测试设置,并对适当的工作产品进行评分。我们回顾了工作研究表明,知识工作是通过角色和职责、本地材料和工具以及必须在下游工作流程中保持可用的工件来组织的。然后,我们将这些关注点转化为基准设计和报告指南,涵盖任务应如何映射到工作活动、测试设置应如何指定材料、工具、角色和约束,以及评分应如何关注系统留下的工作产品。为了命名被评估的工作活动并将其与常见的基准任务区分开来,我们从O*NET职业任务数据库中导出了18个工作活动的清单。我们通过三个基准案例分析来演示该方法:GDPval(一个非代码职业交付物基准)、OfficeQA Pro(一个基于文档的分析基准,通过最终答案评分)和APEX-SWE(一个软件工程基准,具有可执行评分产品)。这些案例展示了基准设计选择如何塑造分数所能支持的最强工作主张,以及基准任务、测试设置、评分产品和更广泛工作主张之间出现的差距。

英文摘要

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

2605.23259 2026-05-25 cs.LG cs.AI cs.CL 版本更新

Multi-Gate Residuals

多门残差

Zhizhan Zheng, Feiyun Zhang, Shuchun Liu, Tian Xia, Xi Liu, Dasheng Hu, Hongquan Zhou

发表机构 * Shanghai Yichuang Information Technology Co.,Ltd.(上海亿创信息技术有限公司) Fudan University(复旦大学)

AI总结 本文提出了一种名为Multi-Gate Residuals(MGR)的新方法,旨在解决深度残差网络中激活值无界增长的问题,同时避免引入额外的通信开销。该方法通过简单的评分与门控机制维护多流上下文,并结合注意力池化技术提取隐藏状态,从而在保持激活规模稳定的同时提升模型性能。实验表明,MGR在大规模训练与部署中具有实用性,并优于现有架构。

详情
AI中文摘要

虽然注意力残差在解决深度残差层中普遍存在的激活值无界增长问题方面显示出一定效果,但它不可避免地引入了显著的通信开销。为了规避这一瓶颈,我们提出了多门残差(MGR),它在不增加通信负担的情况下稳定激活尺度。它利用简单的评分和门控机制来维护多流上下文,并结合注意力池化从流状态中提取隐藏状态。实证实验表明,MGR对于大规模训练和部署是实用的,相比现有架构提供了切实的性能提升。

英文摘要

While Attention Residuals has shown some effectiveness in addressing the widespread issue of unbounded activation growth across deep residual layers, it inevitably incurs significant communication overhead. To circumvent this bottleneck, we propose Multi-Gate Residuals (MGR), which stabilizes activation scales without additional communication burden. It utilizes a straightforward scoring and gating mechanism to maintain multi-stream context, coupled with Attention Pooling to extract hidden states from the stream states. Empirical experiments demonstrate that MGR is practical for large-scale training and deployment, offering tangible performance improvements over existing architectures.

2605.23249 2026-05-25 cs.LG cs.AI 版本更新

Enhancing Deep Neural Network Reliability with Refinement and Calibration

通过精炼和校准增强深度神经网络的可靠性

Ramya Hebbalaguppe, Ajay Shastry, Soumya Suvra Ghosal, Chetan Arora

发表机构 * SIT, Indian Institute of Technology Delhi, New Delhi, India(印度理工学院德里SIT,新德里)

AI总结 尽管深度神经网络在预测准确性方面表现优异,但其置信度估计往往不可靠,可能影响用户对其决策的信任。为此,本文提出了一种新的损失函数和统一训练框架RefCal,旨在同时提升模型的校准性、锐度(即正确与错误预测之间的置信度差异)和准确率,从而增强深度神经网络的可靠性。实验表明,RefCal在类别不平衡的数据集上显著优于现有方法。

Comments ICLR 2026, Trustworthy AI and Representational Alignment

详情
AI中文摘要

尽管深度神经网络(DNN)实现了高预测精度,但其置信度估计通常不可靠,可能损害用户对其决策的信任。这推动了校准模型的研究,其中校准衡量模型预测置信度与正确经验概率的一致性。然而,校准指标通常可以通过后处理技术改进,这些技术仅模仿训练时的不确定性,而并未真正提升模型的理解。因此,统计学家建议模型不仅要校准,还要精炼。直观上,如果模型对正确和错误预测分配显著不同的置信度分数,则被认为更精炼,这一属性也称为锐度。我们观察到,许多现有的校准方法以降低精炼度为代价来改善校准。为解决这一局限,我们提出:(1)一种新的损失函数,显式促进精炼度,并可通过监督对比学习优化;(2)一个统一的训练框架RefCal,联合优化校准、精炼度和准确性,以提高DNN的可靠性。在类别不平衡率为10%的CIFAR-100-LT数据集上,RefCal实现了(准确率,精炼度,ECE)为(58.81,95.67,0.08),显著优于广泛使用的Correctness Ranking Loss(46.27,93.7,0.22)。

英文摘要

Although deep neural networks (DNNs) achieve high predictive accuracy, their confidence estimates are often unreliable, potentially compromising user trust in their decisions. This has motivated research on calibrated models, where calibration measures how well a model's predicted confidence aligns with the empirical probability of correctness. However, calibration metrics can often be improved through post-processing techniques that merely mimic training-time uncertainty without genuinely improving the model's understanding. For this reason, statisticians recommend that models be not only calibrated but also refined. Intuitively, a model is considered more refined if it assigns significantly different confidence scores to correct and incorrect predictions, a property also referred to as sharpness. We observe that many existing calibration methods improve calibration at the cost of reduced refinement. To address this limitation, we propose: (1) a novel loss function that explicitly promotes refinement and can be optimized through supervised contrastive learning; and (2) a unified training framework, RefCal, that jointly optimizes calibration, refinement, and accuracy to improve DNN reliability. On the CIFAR-100-LT dataset with 10 percent class imbalance, RefCal achieves (accuracy, refinement, ECE) of (58.81, 95.67, 0.08), substantially outperforming the widely used Correctness Ranking Loss, which achieves (46.27, 93.7, 0.22).
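
The ECE figures reported above refer to expected calibration error. The abstract does not state the binning scheme, so the following is a minimal sketch of the usual equal-width-bin estimator, ECE = Σ_b (n_b / n) · |acc_b − conf_b|, where the bin count and the half-open bin edges are assumptions made for illustration. The refinement score quoted in the abstract is not defined there and is not reproduced here.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted confidence of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction was right.
    Bins are [k/num_bins, (k+1)/num_bins), with confidence 1.0 placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Ten predictions at 90% confidence, only half of them correct:
    # the gap between confidence and accuracy is 0.9 - 0.5.
    confs = [0.9] * 10
    hits = [True] * 5 + [False] * 5
    print(expected_calibration_error(confs, hits))
```

A model can score well on this metric while assigning nearly the same confidence to right and wrong predictions, which is why the abstract argues for measuring refinement alongside calibration.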

2605.23245 2026-05-25 cs.CV cs.AI 版本更新

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

SimInsert: 通过区域稀疏注意力融合实现无缝视频对象插入

Xinyu Chen, Yuyi Qian, Jiang Lin, Shenyi Wang, Gao Wang, Zhiqiu Zhang, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Song Wu, Zili Yi

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University(新型软件技术国家重点实验室,南京大学) School of Intelligence Science and Technology, Nanjing University(智能科学与技术学院,南京大学) JIUTIAN Research(JIUTIAN研究机构) Xi’an Jiaotong-Liverpool University(西交利物浦大学) Zhejiang Sci-Tech University(浙江理工大学) The University of British Columbia(不列颠哥伦比亚大学)

AI总结 SimInsert 是一种无需训练的视频对象插入方法,旨在解决现有方法依赖显式运动工程或耗时重训练的问题,提升灵活性和泛化能力。该方法通过区域稀疏注意力融合,将任务分解为单帧编辑和语义运动描述,利用图像到视频扩散模型的生成先验,实现编辑内容在时间上的自然传播,并保持背景不变性与交互真实感。实验表明,SimInsert 在多项指标上显著优于现有方法,为高保真视频编辑提供了高效解决方案。

Comments Accepted by ICME2026

详情
AI中文摘要

视频对象插入需要确保时空连贯性和交互真实感,远不止简单的内容放置。然而,当前方法通常受限于对显式运动工程或资源密集型重新训练的依赖,限制了其灵活性和泛化能力。为弥补这一差距,我们提出了SimInsert,一种无需训练的新范式,将任务高效地分解为直观的单帧编辑和语义运动描述。通过利用图像到视频扩散模型的强大生成先验,SimInsert在时间上传播编辑,严格保持背景不变性,同时实现插入对象与动态环境之间合理的、文本驱动的交互。我们的方法依赖于非侵入式引导机制,这些机制强制执行结构一致性,促进无缝边界融合,并抵消在去噪轨迹中通常累积的保真度漂移。大量定量实验验证了我们的有效性:SimInsert在PSNR上超越最先进方法18.8%,在SSIM上超越20.1%,在LPIPS上降低44.1%,为高保真视频编辑提供了流线型解决方案。

英文摘要

Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting their flexibility and generalization. To bridge this gap, we present SimInsert, a training-free paradigm that efficiently decouples the task into intuitive single-frame editing and semantic motion description. By harnessing the robust generative priors of image-to-video diffusion models, SimInsert propagates edits temporally, strictly preserving background invariance while enabling plausible, text-driven interactions between the inserted object and the dynamic environment. Our approach hinges on non-invasive guidance mechanisms that enforce structural consistency, facilitate seamless boundary fusion, and counteract the fidelity drift that typically accumulates during the denoising trajectory. Extensive quantitative experiments validate our efficacy: SimInsert surpasses state-of-the-art methods with an 18.8% gain in PSNR, 20.1% in SSIM, and a 44.1% decrease in LPIPS, offering a streamlined solution for high-fidelity video editing.

2605.23238 2026-05-25 cs.AI cs.GT cs.LG cs.MA 版本更新

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

GENSTRAT:迈向大型语言模型中的战略推理科学

Vartan Shadarevian, Kia Ghods, Alex Kenich, Anany Kotawala

发表机构 * Princeton University(普林斯顿大学) Google(谷歌)

AI总结 本文提出GENSTRAT,一种基于程序生成战略环境的评估框架,用于更准确地评估大型语言模型在复杂战略场景中的推理能力。该方法生成一系列两人零和不完全信息卡牌游戏,并结合能力分析和“崎岖度”指标,全面评估模型在不同战略维度上的表现和稳定性。实验表明,前沿模型在整体表现上更优,但其能力分布和局部波动性存在显著差异,为实际部署提供了更细致的诊断依据。

Comments 33 pages, 8 figures, 9 tables (4 figures, 2 tables in main paper)

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被部署为市场、拍卖和竞价环境中的经济主体。预测它们在特定部署中的行为是困难的。现有的战略推理基准在固定的规范博弈上评估模型。这些基准可能会随着前沿模型的改进而饱和,并且不允许评估者从基准性能自信地推广到实际部署中涉及的各种混乱的战略环境。我们引入了GENSTRAT,它使用程序化生成的战略环境来解决这些挑战。具体来说,我们生成了一个两人零和、不完全信息纸牌游戏的分布。生成器可以按需生成新游戏,从而实现常青评估并抵抗污染。我们将游戏分布与一种能力剖面方法论配对,该方法论将模型能力分解为六个轴(状态空间、时间深度、信息敏感性、对手建模、风险和脆弱性)。我们还引入了一种分布内平滑度的锯齿度量,用于检测模型在战略相似游戏之间优势是否不可预测地跳跃。我们从2000个游戏的生成池中采样了50个基准游戏,并在一个包含超过36,000场比赛的正面交锋锦标赛中评估了九个前沿和开放权重LLM。较新的前沿模型平均得分更高。除了平均值之外,整体实力几乎相同的模型显示出性质不同的能力剖面,并且排行榜前三名模型中的两个(gpt-5和claude)在局部波动性上明显高于第三个(gemini-3.1-pro),尽管整体实力接近。总之,能力剖面和锯齿度量提供了仅靠整体排名无法提供的与部署相关的诊断信息。

英文摘要

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

2605.23219 2026-05-25 cs.LG cs.AI 版本更新

PaP-NF: Probabilistic Long-Term Time Series Forecasting via Prefix-as-Prompt Reprogramming and Normalizing Flows

PaP-NF: 通过前缀作为提示重编程和归一化流进行概率长期时间序列预测

Minju Kim, Youngbum Hur

发表机构 * Department of Industrial Engineering, Inha University, Incheon, Republic of Korea(韩国仁荷大学工业工程系)

AI总结 本文提出了一种名为PaP-NF的概率长期时间序列预测框架,通过Prefix-as-Prompt机制将连续时间序列表示与冻结的大语言模型对齐,并基于该模型提取的全局上下文条件化归一化流解码器,从而实现对不确定性的建模。该方法在多个长期预测基准上表现出色,能够有效捕捉多模态不确定性,同时保持较高的点预测精度。

Comments Accepted to ICPR 2026

详情
AI中文摘要

时间序列预测在许多实际应用中扮演核心角色,并已被广泛研究。大多数现有方法依赖于确定性模型。然而,现实环境表现出固有的不确定性和复杂的未来行为,使得单点预测不足。这凸显了对能够量化和表示不确定性的概率预测方法的需求。在这项工作中,我们提出了PaP-NF,一个概率预测框架,它使用前缀作为提示机制将连续时间序列表示与冻结的大语言模型(LLM)对齐,并基于LLM提取的全局上下文条件化归一化流解码器。所得预测分布的质量使用连续排名概率得分(CRPS)进行评估,这是概率预测中的标准指标。在各种长期预测基准上,PaP-NF稳健地捕获多模态不确定性,同时保持有竞争力的点预测精度。官方实现可在:https://github.com/democracy04/PaP-NF 获取。

英文摘要

Time series forecasting plays a central role in many real-world applications and has been extensively studied. Most existing approaches rely on deterministic models. However, real-world environments exhibit inherently uncertain and complex future behaviors, making single-point predictions insufficient. This highlights the need for probabilistic forecasting methods that can quantify and represent uncertainty. In this work, we propose PaP-NF, a probabilistic forecasting framework that aligns continuous time series representations with a frozen large language model (LLM) using a Prefix-as-Prompt mechanism, and conditions a normalizing flow decoder on the global context extracted by the LLM. The quality of the resulting predictive distributions is evaluated using the Continuous Ranked Probability Score (CRPS), a standard metric in probabilistic forecasting. Across a variety of long-term forecasting benchmarks, PaP-NF robustly captures multi-modal uncertainty while maintaining competitive point forecasting accuracy. The official implementation is available at: https://github.com/democracy04/PaP-NF
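The abstract evaluates predictive distributions with CRPS. For a forecast that is only available as samples (as is typical for a normalizing-flow decoder), a common estimator uses the identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, with X and X' drawn independently from F. The sketch below implements that estimator in plain Python; the function name is hypothetical and this is not taken from the PaP-NF repository.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    samples: draws from the predictive distribution for one time step.
    y:       the observed value for that time step.
    Lower is better; for a point forecast this reduces to absolute error.
    """
    n = len(samples)
    # Average distance between forecast samples and the observation.
    term_obs = sum(abs(x - y) for x in samples) / n
    # Average pairwise distance between samples (rewards sharp forecasts).
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return term_obs - term_spread


if __name__ == "__main__":
    print(crps_from_samples([2.0, 2.0, 2.0], 3.0))  # point forecast: plain absolute error
    print(crps_from_samples([0.0, 2.0], 1.0))       # spread-out forecast around the truth
```

The pairwise term costs O(n^2) in the number of samples, which is acceptable for a few hundred draws per step; larger sample counts usually call for a sorted-sample formulation instead.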

2605.23218 2026-05-25 cs.AI 版本更新

Foundation Protocol: A Coordination Layer for Agentic Society

Foundation Protocol: 智能体社会的协调层

Bang Liu, Yongfeng Gu, Jiayi Zhang, Zhaoyang Yu, Sirui Hong, Maojia Song, Xiaoqiang Wang, Mingyi Deng, Zijie Zhuang, Ronghao Wang, Mingzhe Cao, Yutong Zhu, Xingjian Li, Yifan Wu, Jianhao Ruan, Yiran Peng, Shuangrui Chen, Jinlin Wang, Yizhang Lin, Dongjie Zhang, Dekun Wu, Chen Ma, Lizi Liao, Han Yu, Jian Pei, Heng Ji, Qiang Yang, Yuyu Luo, Chenglin Wu

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) City University of Hong Kong(香港城市大学) Singapore Management University(新加坡管理大学) Nanyang Technological University(南洋理工大学) Duke University(杜克大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Hong Kong Polytechnic University(香港理工大学)

AI总结 随着自主代理系统逐渐成为社会基础设施的一部分,协调能力成为系统扩展的关键瓶颈。本文提出了一种名为Foundation Protocol(FP)的协调层,旨在为人类与人工智能共存的社会提供基础架构支持。FP通过图结构统一不同类型的实体,支持多方协作与事件驱动的合作,并引入经济原语和治理机制,以确保系统的可组合性与责任可追溯性。该协议旨在兼容现有标准,降低集成与治理成本,推动自主代理系统在开放、多元和可治理的环境中发展。

详情
AI中文摘要

自主智能体正从工具转变为社会基础设施层:它们浏览、购买、部署软件、管理系统,并越来越多地相互交互。随着这些系统规模扩大,瓶颈从原始模型能力转向协调。智能体需要建立可靠的关系、组织多智能体工作、交换价值、支持人工智能经济,并在现实监督下保持安全和问责。本文介绍了Foundation Protocol (FP),一种为新兴人机社会设计的以图为核心的协调层。FP统一了异构实体,包括智能体、工具、资源、人类、机构和组织,并支持原生的多方组织和基于事件的协作。它还提供了用于计量、收据和结算的经济原语,并将策略、来源和审计视为一等关注点。FP旨在包装和桥接现有协议而非替代它们,从而在减少集成和治理开销的同时实现渐进式采用。目标是保持自主智能体的可组合性,同时确保问责制不可妥协,从而使协调本身成为开放、多元和可治理的人机社会的共享基础设施。

英文摘要

Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another. As these systems scale, the bottleneck shifts away from raw model capability toward coordination. Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight. This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society. FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations, and supports native multi-party organization and event-based collaboration. It also provides economic primitives for metering, receipts, and settlement, and treats policy, provenance, and audit as first-class concerns. FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead. The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.

2605.23215 2026-05-25 cs.LG cs.AI cs.CL 版本更新

FastKernels: Benchmarking GPU Kernel Generation in Production

FastKernels:生产中GPU内核生成的基准测试

Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari

发表机构 * Snowflake AI Research(Snowflake AI研究院) CMU(卡内基梅隆大学) UCSD(加州大学圣地亚哥分校) Independent Researcher(独立研究者)

AI总结 当前基于大语言模型的GPU内核生成代理在性能评估方面面临基准与实际生产环境不匹配的问题。为此,研究提出了FastKernels,一个围绕46个代表性架构(涵盖8个类别)构建的基准测试集,其内核覆盖了96.2%(409/425)的HuggingFace Transformers架构,同时它本身也是一个生产级推理框架。实验表明,即使最强的内核生成代理相对生产基线也只达到0.94倍的总体加速,即未能超过基线,突显了基准与实际应用之间的错位是该领域的关键瓶颈。

详情
AI中文摘要

基于LLM的GPU内核生成代理正在快速发展,但其进展从根本上受到所优化基准的限制。现有基准与生产推理框架严重脱节:它们在单GPU上使用合成输入评估内核,忽略周围的编译栈,并奖励复制已知优化而非发现新优化。由此产生的奖励信号具有误导性:代理学会生成在沙箱中得分高但在集成到实际系统时引入接口不兼容、编译栈冲突和静默正确性下降的内核。我们引入FastKernels,一个围绕46个代表性架构(涵盖8个类别)的最小集合构建的内核基准,这些架构的内核共同涵盖了96.2%(409/425)的HuggingFace Transformers架构的内核。FastKernels同时也是一个简约的生产级推理框架,在主流LLM服务上与vLLM和SGLang等成熟系统性能相当,并在服务不足的架构上显著超过上游参考实现;每个任务的接口镜像其架构家族中最先进库的相应模块,使得优化后的内核能够直接部署到生产代码库中。在FastKernels上评估最先进的内核代理,我们发现即使最强的代理也仅实现0.94倍于生产基线的总加速,而较弱的代理分别为0.78倍和0.53倍,这证实基准与生产的错位是该领域的关键瓶颈。我们发布FastKernels,作为迈向基准收益直接转化为生产吞吐量改进的内核代理的垫脚石。代码可在https://github.com/Snowflake-AI-Research/fastkernels获取。

英文摘要

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only $0.94\times$ aggregate speedup over production baselines, with weaker agents at $0.78\times$ and $0.53\times$ -- confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels

2605.23204 2026-05-25 cs.AI 版本更新

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

AutoResearch AI:迈向人工智能驱动的科研自动化以实现科学发现

Guiyao Tie, Jiawen Shi, Dingjie Song, Yixiao Huang, Ziji Sheng, Xueyang Zhou, Daizong Liu, Pan Zhou, Yongchao Chen, Ran Xu, Lifang He, Qingsong Wen, Manling Li, Cong Lu, Shuai Li, Pengtao Xie, Yixuan Yuan, Rui Meng, Lei Xing, Lichao Sun, Caiming Xiong, Philip S. Yu, Jianfeng Gao

发表机构 * Huazhong University of Science and Technology(华中科技大学) Lehigh University(理海大学) Tsinghua University(清华大学) Wuhan University(武汉大学) Salesforce Research(Salesforce研究) Squirrel AI Learning(Squirrel AI学习) Northwestern University(西北大学) Independent(独立) Shanghai Jiao Tong University(上海交通大学) University of California San Diego(加州大学圣地亚哥分校) Chinese University of Hong Kong(香港中文大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Stanford University(斯坦福大学) Google Cloud AI Research(谷歌云AI研究) Recursive Superintelligence(递归超级智能) Microsoft Research(微软研究院)

AI总结 本文探讨了AI驱动的科研自动化(AutoResearch)的发展趋势,旨在通过人工智能实现从文献调研、假设生成到实验验证、结果报告等全流程的科研工作自动化。研究分析了当前系统在自主性、领域适用性、验证机制等方面的不足,并提出了五个评估维度,指出AutoResearch的自主程度依赖于具体应用场景,在结构化、可执行和易于验证的领域更具可信度,而在涉及伦理、机构责任等复杂情境中仍面临挑战。

Comments 49 pages, 12 figures, 10 tables

详情
AI中文摘要

科学研究正在被AI系统重塑,这些系统从孤立的辅助转向更长周期的工作流,涵盖文献基础、假设生成、实验、验证、报告和修订。这一转变标志着从面向科学的任务级AI向工作流级研究自动化的过渡。然而,当前系统仍然碎片化,在自主性、领域范围、执行环境、验证机制和人类监督方面存在差异,同时在证据保存、可重复性、弱方向拒绝、溯源追踪、跨领域鲁棒性和负责任的科学闭环方面仍面临挑战。本综述通过AutoResearch(定义为AI驱动的科学工作流自动化的演进谱系)审视这些发展。其中,Vibe Research表示人类引导的基于提示的辅助和人工验证执行区域,而新兴的AI主导系统协调发现循环的更大部分,但尚未实现稳健的自主性。我们分析了研究系统如何在工作流中重新分配控制、证据、执行、验证和问责,并围绕五个工作流条件组织该领域:文献与研究基础;假设形成与规划;实验与工具使用;反馈、验证与评审;报告与知识传播。我们进一步综合了AI科学家系统、混合主动协同研究框架、基准测试、领域部署和开源基础设施。最后,我们提出五个评估维度——新颖性、有效性、影响力、可靠性和溯源——并表明AutoResearch的自主性是领域条件化的,在结构化、可执行且快速可验证的环境中更为可信,但在具身、延迟、异构、伦理或机构问责的背景下则受限。

英文摘要

Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experimentation, validation, reporting, and revision. This shift marks a transition from task-level AI for science to workflow-level research automation. Yet current systems remain fragmented, differing in autonomy, domain scope, execution environment, validation mechanism, and human oversight, while still struggling with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure. This survey examines these developments through AutoResearch, defined as the developmental spectrum of AI-powered scientific workflow automation. Within it, Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution, whereas emerging AI-led systems coordinate larger portions of the discovery loop without achieving robust autonomy. We analyze how research systems redistribute control, evidence, execution, validation, and accountability across workflows and organize the field around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication. We further synthesize AI scientist systems, mixed-initiative co-research frameworks, benchmarks, domain deployments, and open-source infrastructures. Finally, we propose five evaluation dimensions--novelty, validity, impact, reliability, and provenance--and show that AutoResearch autonomy is domain-conditioned, being more credible in structured, executable, and rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.

2605.23203 2026-05-25 cs.CV cs.AI cs.LG cs.RO 版本更新

Lipschitz Optimization for Formal Verification of Homographies

单应性矩阵形式化验证的Lipschitz优化

Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio

发表机构 * Joby Aviation(Joby航空) Safe Intelligence

AI总结 本文研究了针对视觉神经网络在安全关键领域应用的正式鲁棒性验证问题,特别关注相机运动引起的3D扰动对图像生成过程的影响。作者提出了一种基于李普希茨优化和分段连续性分析的验证方法,建立了相机姿态到像素值的闭式映射,并推导出对扰动像素值的紧致线性界。该方法适用于具有平面结构的场景,如增强现实、自动驾驶和机器人操作等,并在多个基准测试中验证了其有效性,相比现有方法在速度和边界紧致性方面均有提升。

Comments 18 pages, 13 figures, 6 tables, to be published at CVPR 2026

详情
AI中文摘要

在受监管行业中采用视觉神经网络需要形式化的鲁棒性保证,尤其是在医疗、自动驾驶和航空航天等安全关键领域。然而,当前方法局限于不完整的统计验证或对$\ell_p$范数和仿射变换的鲁棒性,仅覆盖了图像形成过程中一小部分扰动。特别是,对相机运动的鲁棒性仍然是一个开放问题,尽管它是部署许多视觉应用的关键。我们提出了一种形式化验证方法,针对捕获相机的3D运动扰动鲁棒性。我们首先建立了从相机位姿到像素值的闭式映射。通过分析所得单应性矩阵的连续性性质,我们展示了如何将最近关于Lipschitz优化和分段连续性的工作扩展到推导扰动像素值的紧线性边界。我们的方法适用于以平面结构为主的场景,例如增强现实中的地面、自动驾驶中的道路标记和交通标志,或机器人操作中的平面工作空间。这实现了对投影几何变换的首次形式化验证,无需复杂仿真、替代网络或显式图像形成模型。我们验证了实现,并展示了相比先前工作最高89%的加速和7%更紧的边界。然后,我们在VNN-COMP基准上评估了我们的方法,揭示了投影扰动的系统性弱点。最后,我们在一个安全关键的跑道分类器上进行了真实世界案例研究,突出了对相机运动的实际漏洞,并解决了学习模型认证中的一个关键挑战。数据和代码公开在https://github.com/jeangud/homography-verification。

英文摘要

The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to $\ell_p$-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploying many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography-verification

2605.23200 2026-05-25 cs.LG cs.AI 版本更新

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

自适应质量分段KV压缩用于长上下文推理

Junzhe Yang, Xiaoyu Shen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Institute of Digital Twin, Eastern Institute of Technology(东方理工大学数字孪生研究院)

AI总结 在长文本推理中,键值(KV)缓存的线性增长是关键瓶颈。现有压缩方法基于重要性评分驱逐令牌,但全局Top-k选择容易把连续的推理块整段清除,破坏逻辑连贯性。为此,本文提出自适应质量分段(AMS)KV压缩框架,根据注意力质量的空间分布动态分配内存配额,保障关键推理段的稳定性,并兼容多种主流压缩方法和现代KV服务框架。实验表明,AMS有效缓解了结构碎片化问题,提升了模型性能。

详情
AI中文摘要

键值(KV)缓存的线性增长是长文本LLM推理中的关键瓶颈。现有的KV压缩方法通过基于重要性分数驱逐令牌来缓解这一问题。然而,我们表明它们依赖全局Top-k选择会触发区域擦除:连续推理块的严重驱逐破坏了逻辑连贯性。为解决此问题,我们提出自适应质量分段(AMS)KV压缩框架,该框架将范式从令牌级竞争转变为区域感知配额分配。AMS根据注意力质量的空间分布自适应地划分KV缓存,确保结构上重要的推理段获得有保障的内存配额。为在迭代解码过程中保持稳定性,引入了基于EMA的平滑机制以防止分段边界的抖动。关键的是,AMS是一个通用的即插即用层,与现有评分器正交。它可以无缝集成到代表性方法中,如TOVA、Expected Attention、KeyDiff、R-KV和TriAttention。AMS还与现代分页KV服务框架(如vLLM)系统兼容,支持高效的收集和压缩KV执行,而不引入额外的稳态注意力开销。在多种任务上的大量实验,包括数学推理(MATH500、AIME、GSM8K)、代码补全、开放域问答和稀疏检索,表明AMS持续减轻结构碎片化并提升模型性能。

英文摘要

The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
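To make the contrast between global Top-k eviction and region-aware quotas concrete, here is a toy sketch. It is not the AMS algorithm: the abstract does not specify how segments are derived from attention mass or how the EMA smoothing works, so this example assumes fixed equal-length segments and a hypothetical per-segment floor, with the remaining budget filled by global competition. All names and numbers are illustrative.

```python
def global_topk(scores, budget):
    """Keep the `budget` highest-scoring token positions, wherever they are."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def segment_quota_keep(scores, budget, num_segments, floor=1):
    """Guarantee each contiguous segment `floor` slots, then fill the rest globally.

    Assumption for illustration: segments are equal-length position ranges.
    """
    if floor * num_segments > budget:
        raise ValueError("budget too small for the guaranteed per-segment quota")
    n = len(scores)
    bounds = [s * n // num_segments for s in range(num_segments + 1)]

    keep = set()
    for s in range(num_segments):
        segment = range(bounds[s], bounds[s + 1])
        # Within a segment, tokens still compete on their importance score.
        best = sorted(segment, key=lambda i: scores[i], reverse=True)[:floor]
        keep.update(best)

    # Whatever budget is left goes to the best remaining tokens overall.
    rest = sorted((i for i in range(n) if i not in keep),
                  key=lambda i: scores[i], reverse=True)
    keep.update(rest[:budget - len(keep)])
    return sorted(keep)


if __name__ == "__main__":
    # Three 4-token segments; the middle one has uniformly low scores.
    scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.1, 0.1, 0.5, 0.4, 0.3, 0.2]
    print(global_topk(scores, 4))            # [0, 1, 2, 3]: middle and last segments wiped out
    print(segment_quota_keep(scores, 4, 3))  # [0, 1, 5, 8]: every segment keeps a token
```

With the same budget of four tokens, global Top-k spends everything on the first segment, while the quota version retains at least one position from each region, which is the failure mode and remedy the abstract describes at a much larger scale.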

2605.23194 2026-05-25 cs.LG cs.AI 版本更新

Scalable Heterogeneous Graph Foundation Models for Data-Driven Optimal Power Flow in Smart Grids

面向智能电网数据驱动最优潮流问题的可扩展异构图基础模型

Massimiliano Lupo Pasini, Yijiang Li, Kibaek Kim, Teja Kuruganti

发表机构 * Computational Sciences and Engineering Division, Oak Ridge National Laboratory(橡树岭国家实验室计算科学与工程部) Mathematics and Computer Science Division, Argonne National Laboratory(阿贡国家实验室数学与计算机科学部) UT-Battelle, LLC(UT-巴特勒公司)

AI总结 本文提出了一种基于HydraGNN的可扩展异构图神经网络(GNN)框架,用于构建数据驱动的最优潮流(OPF)代理模型和图基础模型(GFM)。该方法保留了电力网络中不同节点和边类型的异构结构,支持在超计算机上进行分布式预处理、训练、超参数优化和下游微调。实验表明,该框架能够生成参数量较少但验证损失更低的紧凑模型,并在可行性分类和N-1故障回归任务中显著提升小样本条件下的模型性能与训练效率。

Comments 10 pages, 6 tables, 4 figures

详情
AI中文摘要

快速可靠的最优潮流(OPF)近似对于可靠的智能电网运行至关重要,然而许多基于学习的替代模型要么扁平化处理电网的天然异质结构,要么针对有限的电网拓扑,要么缺乏用于图基础模型(GFM)训练的可扩展基础设施。本文提出了一种基于HydraGNN的可扩展异构图神经网络(GNN)工作流,用于数据驱动OPF代理建模和OPF-GFM开发。该工作流保留了电网中不同的节点和边类型——母线、发电机、负荷、并联电抗器、交流线路、变压器以及设备到母线的耦合——并支持在领导级超级计算机上进行分布式预处理、训练、超参数优化(HPO)和下游微调。利用跨越十个PGLib-OPF案例(从14到13,659个母线)的三百万个异构图实例,我们在ORNL Frontier超级计算机上进行了DeepHyper驱动的HPO。该实验识别出具有最低验证损失的紧凑模型(约1.6–1.7M参数)。关于可行性分类和N-1应急回归的下游实验表明,微调预训练的OPF GFM在部分或仅头部微调时,能够提高低数据精度、稳定训练、加速收敛并降低适应成本。

英文摘要

Fast and reliable optimal power flow (OPF) approximation is essential for reliable smart-grid operation, yet many learning-based surrogates either flatten the native heterogeneous structure of power networks, target a limited set of grid topologies, or lack scalable infrastructure for graph foundation model (GFM) training. This paper presents a scalable heterogeneous graph neural network (GNN) workflow, built on HydraGNN, for data-driven OPF surrogate modeling and OPF-GFM development. The workflow preserves the distinct node and edge types of power grids -- buses, generators, loads, shunts, AC lines, transformers, and device-to-bus couplings -- and supports distributed preprocessing, training, hyperparameter optimization (HPO), and downstream fine-tuning on leadership-class supercomputers. Using three million heterogeneous graph instances spanning ten PGLib-OPF cases, from 14 to 13,659 buses, we conduct DeepHyper-driven HPO on the ORNL Frontier supercomputer. The campaign identifies compact models ($\sim$1.6--1.7M parameters) with the lowest validation losses. Downstream experiments on feasibility classification and N-1 contingency regression show that fine-tuning a pretrained OPF GFM improves low-data accuracy, stabilizes training, accelerates convergence, and reduces adaptation cost when partial or head-only fine-tuning is used.

2605.23179 2026-05-25 cs.AI 版本更新

Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems

重绘AI地图:代理生态系统中责任边界的理论

Muhammad Zia Hydari, Farooq Muzaffar

发表机构 * University of Pittsburgh(匹兹堡大学)

AI总结 该论文探讨了智能体生态系统中责任边界配置的理论问题,提出了一种基于能力层次的责任边界定位理论。研究引入了“责任资产”概念,指出其对AI输出的合法性、可审计性和责任归属具有关键作用,并分析了验证成本和责任可转移性对责任边界与执行边界协同移动的影响。理论提出了三种边界策略,并引入“规则债务”概念,揭示了组织决策规则迁移至智能体执行环境所带来的治理负担,为理解数字模块化与组织解耦的关系提供了新视角。

详情
AI中文摘要

代理AI编排器降低了跨组织边界组合信息系统能力的接口和组装成本,看似加速了模块化和组织分解。然而,其输出需要证据、审查、签核或可分配责任的AI赋能能力,即使其技术接口变得模块化,也可能保留集成的责任边界。我们提出了代理生态系统中责任边界定位的能力层面理论。我们引入责任资产:使AI支持输出合法、可审计、可审查并可分配给责任方的互补资产。我们认为验证成本和责任可转移性决定了执行边界和责任边界能否一起移动。该理论识别出三种边界策略:组件、集成和双轨。它还引入了规则债务,即当组织决策规则从正式信息系统迁移到无治理的代理执行环境时产生的治理负担。整合数字创新、交易成本、互补资产、数字平台治理和IS控制视角,我们提出了七个命题,将代理组装成本降低、责任资产、可占有性、编排者意图捕获和边界错误配置与边界策略、价值占有和规则债务联系起来。该理论解释了数字模块化何时扩展到组织分解,以及责任何时保持能力集成。通过文档处理、法律服务、审计、临床决策支持和采购中的结构化示例来约束边界逻辑。

英文摘要

Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation. Yet AI-enabled capabilities whose outputs require evidence, review, signoff, or assignable responsibility may retain integrated accountability boundaries even when their technical interfaces become modular. We develop a capability-level theory of accountability-boundary placement in agentic ecosystems. We introduce accountability assets: complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party. We argue that verification cost and responsibility transferability determine whether the execution and accountability boundaries can move together. The theory identifies three boundary strategies: component, integrated, and dual-track. It also introduces rule debt, the governance burden that accrues when organizational decision rules migrate from formal information systems into ungoverned agentic execution environments. Integrating digital innovation, transaction cost, complementary-assets, digital platform governance, and IS control perspectives, we develop seven propositions linking agentic assembly-cost reductions, accountability assets, appropriability, orchestrator intent capture, and boundary misconfiguration to boundary strategy, value appropriation, and rule debt. The theory explains when digital modularization extends to organizational disaggregation and when accountability keeps capabilities integrated. Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic.

2605.23171 2026-05-25 cs.LG cs.AI stat.ML 版本更新

Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning

理解与改进指令微调中的噪声嵌入技术

Abhay Yadav

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 该研究探讨了指令微调中嵌入层添加噪声的技术,分析了均匀噪声与高斯噪声的效果差异,并提出了一种新的对称噪声嵌入方法SymNoise。通过理论与实验分析,研究发现不同噪声类型性能相近,而SymNoise通过更严格地调控模型局部曲率,显著提升了微调效果。在多个基准测试中,SymNoise相比当前最优方法NEFTune取得了约6.7%的性能提升,展示了其在语言模型微调中的优越性。

Comments arXiv admin note: substantial text overlap with arXiv:2312.01523

Journal ref IEEE International Conference on Language Modeling (COLM), 2025

详情
AI中文摘要

最近指令微调的进展在嵌入中注入噪声,其中NEFTune(Jain等人,2024)使用均匀噪声设立了基准。尽管NEFTune的实验发现均匀噪声优于高斯噪声,其原因仍不清楚。本文旨在通过提供彻底的理论和实证分析来澄清这一点,表明这些噪声类型之间的性能相当。此外,我们引入了一种新的语言模型微调方法,在嵌入中使用对称噪声。该方法旨在通过更严格地调节模型的局部曲率来增强模型功能,表现出优于当前方法NEFTune的性能。当使用Alpaca微调LLaMA-2-7B模型时,标准技术在AlpacaEval上获得29.79%的分数。然而,我们的方法SymNoise使用对称噪声嵌入将这一分数显著提高到69.04%,比最先进方法NEFTune(64.69%)提高了6.7%。此外,当在各种模型和更强的基线指令数据集(如Evol-Instruct、ShareGPT、OpenPlatypus)上测试时,SymNoise始终优于NEFTune。当前文献,包括NEFTune,强调了在语言模型微调中应用基于噪声的策略需要更深入的研究。我们的方法SymNoise是朝着这一方向迈出的又一重要步骤,显示出对现有最先进方法的显著改进。

英文摘要

Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks using uniform noise. Despite NEFTune's empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
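For readers unfamiliar with the baseline, the sketch below shows the uniform noise injection that NEFTune applies to token embeddings during training: each entry is perturbed by Uniform(-1, 1) noise scaled by alpha / sqrt(L * d), where L is the sequence length and d the embedding dimension. The Gaussian branch is an assumption added here for comparison (standard deviation chosen to match the variance of Uniform(-1, 1)); it is not claimed to be the setup used in either paper. SymNoise itself is not sketched, because the abstract does not define its noise distribution.

```python
import math
import random


def add_embedding_noise(embeddings, alpha, rng, kind="uniform"):
    """Return a noisy copy of an L x d embedding matrix (list of lists).

    kind="uniform":  NEFTune-style noise, Uniform(-1, 1) * alpha / sqrt(L * d).
    kind="gaussian": illustrative variant with matched variance (assumption).
    """
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)

    def draw():
        if kind == "uniform":
            return rng.uniform(-1.0, 1.0)
        if kind == "gaussian":
            # Uniform(-1, 1) has variance 1/3, so use sigma = 1/sqrt(3).
            return rng.gauss(0.0, 1.0 / math.sqrt(3.0))
        raise ValueError(f"unknown noise kind: {kind}")

    return [[value + scale * draw() for value in row] for row in embeddings]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[0.0] * 8 for _ in range(4)]   # toy sequence: 4 tokens, 8 dimensions
    noisy = add_embedding_noise(clean, alpha=5.0, rng=rng)
    print(max(abs(v) for row in noisy for v in row))  # never exceeds 5 / sqrt(32)
```

Noise of this kind is applied only during fine-tuning; at inference time the embeddings are used unperturbed.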

2605.23170 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

长上下文LLM中的位置失败:推理基准测试中的盲点

Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang

发表机构 * Beijing Jiaotong University(北京交通大学) Central South University of Forestry and Technology(中央林业科技大学)

AI总结 该研究指出当前主流的长上下文大语言模型推理基准在任务位置控制方面存在不足,导致无法准确评估模型在不同位置上的表现。为此,作者提出了Context Rot Evaluation(CRE)框架,系统地控制任务位置、填充内容和上下文长度三个因素,并通过实验发现,当目标任务从上下文末尾移至中间位置时,模型性能会显著下降,且随着上下文长度增加,这一问题更加严重。研究还表明,通过在末尾添加任务副本,可以有效缓解位置带来的性能下降,揭示了当前基准设计中存在结构性的评估盲区。

Comments 20 pages, 1 figure, 23 tables

详情
AI中文摘要

位置控制评估是检索任务(如Needle-in-a-Haystack和RULER)的标准做法,但主流推理基准测试并未控制目标任务在长上下文中的位置。我们审计了11个长上下文基准测试,发现没有一个在推理任务上同时控制任务位置、填充内容和上下文长度。对四个旗舰长上下文模型发布的审计发现,其主要结果表中没有NIAH、RULER或LongBench系列基准测试的条目,而智能体和编码基准测试在所有四个发布的主要结果表中均有出现。我们提出了上下文腐化评估(Context Rot Evaluation,CRE),一个同时改变这三个因素的受控框架,并分两轮在GSM8K和ARC-Challenge上评估了九个LLM:初始的五个模型,以及四个较新的供应商发布。当目标任务从末尾移动到中间时,模型性能可能急剧下降,且对于易受影响的模型,这种下降随着上下文长度增加而恶化。MiMo-v2-Flash在64K下使用with_solutions填充时下降88个百分点(中间准确率8%)。较新的发布显示出较小的下降:在64K下,四个模型中有三个与末尾位置准确率的差距保持在+/-6个百分点内;MiMo-V2.5-Pro将MiMo-v2-Flash的88个百分点下降缩小到32个百分点。在questions_only_v2填充下,所有四个模型在中间位置的下降仍然存在(在8K、32K、64K下范围为-16到-56个百分点)。在8K下,一个诊断探针在末尾添加目标任务副本,使所有九个模型的中间准确率与末尾基线相差在+/-4个百分点内,这与位置解释一致。在初始五个模型集中,76%的中间位置错误与周围填充文本匹配,而末尾位置仅为22%,这与填充-答案干扰是主要错误模式的解释一致。这些结果暴露了当前推理基准测试设计和供应商评估实践中的结构性评估差距:当任务位置不受控制时,无法测量随上下文长度增长而恶化的位置脆弱性。

英文摘要

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
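A position-controlled prompt of the kind the abstract describes can be sketched in a few lines: hold the filler and the target task fixed, and vary only where the task is inserted. The helper below is a hypothetical illustration, not the CRE implementation; the function name, the `repeat_at_end` flag (mirroring the diagnostic probe that appends a copy of the task), and the blank-line separator are all assumptions.

```python
def build_prompt(filler_chunks, task, position, repeat_at_end=False):
    """Insert `task` among `filler_chunks` at a controlled position.

    position:      "start", "middle", or "end".
    repeat_at_end: if True and the task is not already last, append a second
                   copy at the end (illustrative version of the diagnostic probe).
    """
    if position == "start":
        index = 0
    elif position == "middle":
        index = len(filler_chunks) // 2
    elif position == "end":
        index = len(filler_chunks)
    else:
        raise ValueError(f"unknown position: {position}")

    chunks = filler_chunks[:index] + [task] + filler_chunks[index:]
    if repeat_at_end and index != len(filler_chunks):
        chunks.append(task)
    return "\n\n".join(chunks)


if __name__ == "__main__":
    filler = ["filler A", "filler B", "filler C", "filler D"]
    task = "TARGET: What is 17 + 25?"
    for position in ("start", "middle", "end"):
        print(f"--- {position} ---")
        print(build_prompt(filler, task, position))
```

Context length would be controlled separately by how many filler chunks are supplied, and filler content by which pool they are drawn from, so that each of the three factors can be varied while the other two stay fixed.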

2605.23168 2026-05-25 cs.CR cs.AI cs.LG 版本更新

PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

PoisonForge: 面向指令微调LLM的任务级定向投毒基准

Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru, Alina Oprea

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出PoisonForge,一个面向指令微调大语言模型的任务级定向投毒基准,用于评估在有限投毒预算下模型对恶意数据的脆弱性。该基准通过四个维度参数化投毒威胁,并在五个模型家族中测试了12个不同参数量的开放权重模型,结果显示12个模型中有11个在最脆弱配置下攻击成功率超过70%,但对非目标任务的影响极小。研究分析了影响攻击成功率的关键因素,并发现投毒设计选择而非模型规模是攻击成功的主要原因。

详情
AI中文摘要

当从业者在未经验证的数据集上微调LLM时,攻击者可以通过任务级投毒利用数据供应链:插入少量精心设计的指令-响应对,导致模型在目标任务族的输出中嵌入攻击者指定的实体(如国家),而在其他行为中表现正常。我们引入PoisonForge,一个沿四个维度(偏差类型、投毒模式、出现次数和目标输出长度)参数化此威胁的基准,并在五个模型族中评估了12个开放权重模型(参数从2B到32B),主要采用1%的投毒预算。在1000个微调样本中仅使用10个投毒样本的情况下,12个模型中有11个在其最易受攻击的配置下攻击成功率(ASR)超过70%。同时,非目标任务的无意泄露低于0.5%,模型在标准基准上表现良好。我们详细分析了影响攻击成功的因素。我们观察到,实体的多次出现提高了ASR,最佳投毒模式取决于目标实体的语义结构,并且ASR随任务输出长度单调下降。相关分析和风险预测模型证实,投毒设计选择而非模型规模是攻击成功的主要原因,并且这些模式可以推广到预测新任务上的攻击成功。我们发布所有配置、流水线和分析代码以支持可重复比较。

英文摘要

When practitioners fine-tune LLMs on unvetted datasets, an adversary can exploit the data supply chain through task-level poisoning: inserting a small number of crafted instruction-response pairs that cause the model to embed attacker-specified entities, such as a country, in outputs for a targeted task family while behaving normally elsewhere. We introduce PoisonForge, a benchmark that parameterizes this threat along four dimensions (bias type, poisoning mode, appearance count, and target output length) and evaluates 12 open-weight models (from 2B to 32B parameters) across five families under a primarily 1% poison budget. With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceed a 70% attack success rate (ASR) in their most vulnerable configuration. Meanwhile, unintended leakage to non-target tasks remains below 0.5%, and models perform well on standard benchmarks. We analyze in detail the factors contributing to attack success. We observe that multiple appearances of an entity increase the ASR, the optimal poisoning mode depends on the semantic structure of the target entity, and ASR drops monotonically with the task output length. A correlation analysis and risk prediction model confirm that poisoning design choices, rather than model scale, are the primary causes of attack success, and that these patterns generalize to predict attack success on new tasks. We release all configurations, pipelines, and analysis code to support reproducible comparisons.

2605.23165 2026-05-25 cs.RO cs.AI cs.CL 版本更新

Autonomous Frontier-Based Exploration with VLM Guidance

VLM引导的基于前沿的自主探索

Aarush Aitha, Avideh Zakhor

发表机构 * EECS Department, University of California(加州大学EECS系)

AI总结 本文提出了一种基于视觉语言模型(VLM)引导的自主前沿探索方法,用于提升机器人在未知和危险环境中的探索能力。该方法通过VLM进行高层战略决策,指导传统的底层机器人控制系统,利用当前地图和潜在路径的视觉信息生成多模态提示,从而选择最具前景的探索方向。实验表明,该方法在六个室内环境的仿真中提升了地图覆盖率,且具有轻量、无需训练和易于迁移的特点。

Comments 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

详情
AI中文摘要

自主机器人在未知和危险环境中的探索是一个长期挑战,通过利用视觉语言模型的高级推理能力可以显著改进。我们提出了一种新颖的探索流程,其中VLM执行高层战略决策,引导传统的低级机器人控制栈。在决策点,机器人生成包含当前地图和潜在路径(即前沿)视觉图像的多模态提示。VLM分析该提示以选择最有希望的前沿,用上下文空间推理替代简单的几何启发式。该方法在六个室内环境的模拟中得到了验证,与现有方法相比,地图覆盖率提高了高达24%。我们的流程轻量级、无需训练,并且可以轻松迁移到任何配备标准传感器和互联网连接的机器人上。

英文摘要

Autonomous robotic exploration of unknown and hazardous environments, a long-standing challenge, can be significantly improved by leveraging the advanced reasoning of Vision-Language Models (VLMs). We introduce a novel exploration pipeline where a VLM performs high-level strategic decision-making, guiding a conventional low-level robotics control stack. At decision points, the robot generates a multimodal prompt with its current map and visual imagery of potential paths, or frontiers. The VLM analyzes this prompt to select the most promising frontier, replacing simple geometric heuristics with contextual spatial reasoning. This approach, validated in simulation across six indoor environments, improves map coverage by up to 24% over existing methods. Our pipeline is lightweight, training-free, and easily transferable to any robot with standard sensors and an internet connection.
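For readers unfamiliar with frontier-based exploration, the candidate set handed to the selector can be sketched as follows. This uses the common geometric definition of a frontier (a free cell bordering unknown space), which is an assumption about this paper's pipeline; the cell encoding and the name `find_frontiers` are hypothetical, and the VLM selection step itself is not shown.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # hypothetical occupancy-grid encoding

def find_frontiers(grid):
    """Return (row, col) of every free cell with a 4-connected unknown neighbor."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbor is enough
    return frontiers
```

A classical planner would typically rank these cells with a geometric heuristic such as nearest-first; in the approach summarized above, that ranking is what the VLM replaces.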

2605.23159 2026-05-25 econ.GN cs.AI q-fin.EC 版本更新

Generative AI and the Reorganization of Labor Demand

生成式AI与劳动力需求的重组

Fangyan Wang, Zaiyan Wei, Yang Wang

发表机构 * Mitch Daniels School of Business, Purdue University(普渡大学米奇·丹尼尔斯商学院)

AI总结 本文研究生成式人工智能(AI)对劳动力需求的重塑影响,探讨企业在技术扩散过程中如何调整招聘岗位和岗位任务结构。通过构建基于美国全行业招聘广告数据的动态暴露度指标,研究发现,生成式AI的暴露程度随时间变化,并非固定不变;企业主要通过岗位间的招聘调整(占52%)和岗位内部任务重构(占39.5%)来适应AI技术,且不同层级岗位的调整路径存在差异。研究揭示了劳动力市场对生成式AI的适应过程是组织结构和任务架构的重新配置。

详情
AI中文摘要

生成式人工智能(AI)预计将改变工作方式,但关于随着技术扩散,企业如何重组劳动力需求的研究尚不充分。现有研究主要关注哪些职业暴露于AI或暴露的工作是否减少。我们通过考察企业是否通过改变招聘地点、工作内容或两者兼而有之来调整,扩展了这一讨论。利用覆盖美国经济所有部门的全国职位发布数据集,我们通过两阶段大语言模型管道构建了一个动态的、职位级别的生成式AI暴露度度量。该管道识别每个职位发布中描述的任务,并分类生成式AI能够执行或辅助这些任务的程度。然后,我们将总暴露度的变化分解为两个边际:跨职位需求重新分配和职位内任务重新设计。我们记录了三个主要发现。首先,生成式AI暴露度是动态而非固定的,随时间显著变化。其次,劳动力需求通过两个边际进行调整。招聘重新分配解释了总暴露度下降的最大份额,平均占52%,而职位内重新设计变得越来越重要,占39.5%。补充的Oaxaca-Blinder分解显示,职业构成的变化解释了可归因于可观察职位特征的暴露度变化的约90%。第三,调整在职业阶梯上有所不同。高级职位调整更早,主要通过重新分配,而初级职位则通过重新分配、重新设计及其相互作用的更广泛组合进行调整。这些发现表明,劳动力市场对生成式AI的调整是一个组织重构的过程,在此过程中,企业重塑了招聘需求和工作的任务架构。

英文摘要

Generative artificial intelligence (AI) is expected to transform work, but less is known about how firms reorganize labor demand as the technology diffuses. Existing research has largely focused on which occupations are exposed to AI or whether exposed jobs decline. We extend this debate by examining whether firms adjust by changing where they hire, what jobs contain, or both. Using a nationwide dataset of job postings in the United States, covering all sectors of the economy, we construct a dynamic, posting-level measure of generative AI exposure with a two-stage large language model pipeline. The pipeline identifies the tasks described in each posting and classifies the extent to which generative AI can perform or assist them. We then decompose changes in aggregate exposure into two margins: reallocation of demand across jobs and redesign of tasks within jobs. We document three main findings. First, generative AI exposure is dynamic rather than fixed, changing substantially over time. Second, labor demand adjusts through both margins. Hiring reallocation explains the largest share of the aggregate decline in exposure, accounting for 52% on average, while within-job redesign becomes increasingly important, accounting for 39.5%. A complementary Oaxaca-Blinder decomposition shows that shifts in occupational composition account for about 90% of the exposure change attributable to observable job characteristics. Third, adjustment differs across the job ladder. Senior jobs adjust earlier and mainly through reallocation, whereas junior jobs adjust through a broader mix of reallocation, redesign, and their interaction. These findings suggest that labor-market adjustment to generative AI is a process of organizational reconfiguration, in which firms reshape both hiring demand and the task architecture of work.
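The two margins named in this abstract, plus the interaction it mentions, match the shape of a standard shift-share decomposition. The sketch below shows that generic identity only; it assumes aggregate exposure is a share-weighted mean of job-level exposure, which may differ from the authors' exact specification, and the function name `decompose_exposure_change` is hypothetical.

```python
def decompose_exposure_change(shares0, exposure0, shares1, exposure1):
    """Split the change in sum(share * exposure) into three additive terms.

    reallocation: hiring shares move across jobs, exposure held at period 0
    redesign:     exposure changes within jobs, shares held at period 0
    interaction:  joint movement of shares and exposure
    """
    jobs = list(zip(shares0, exposure0, shares1, exposure1))
    reallocation = sum((s1 - s0) * e0 for s0, e0, s1, e1 in jobs)
    redesign = sum(s0 * (e1 - e0) for s0, e0, s1, e1 in jobs)
    interaction = sum((s1 - s0) * (e1 - e0) for s0, e0, s1, e1 in jobs)
    total = sum(s1 * e1 - s0 * e0 for s0, e0, s1, e1 in jobs)
    return {"reallocation": reallocation, "redesign": redesign,
            "interaction": interaction, "total": total}
```

The three terms sum to the total by construction, so each can be reported as a share of the aggregate change, in the way the abstract reports its percentages.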

2605.23147 2026-05-25 cs.CL cs.AI 版本更新

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

作为X,做Y:角色和任务如何在指令微调LLM中结合

Eric Xu

发表机构 * Independent Researcher(独立研究者)

AI总结 该研究探讨了在指令微调的大语言模型中,角色提示(如“As X, do Y”)如何将“人物”和“任务”信息结合,并发现这种结合在残差流中的某个特定位置可以通过线性分解清晰地体现。研究指出,人物和任务分别通过部分正交的加法方向影响模型输出,并展示了通过残差流局部加法结构可以实现对角色和任务贡献的可解释控制。然而,研究也表明,尽管存在局部加法结构,角色提示无法被压缩为单一的残差向量,因为其行为依赖于整个提示中的分布式机制。

Comments 12 pages, 1 figure. Code: https://github.com/xuy/localized-additive-composition

详情
AI中文摘要

形式为“作为X,做Y”的角色提示在残差流的一个特定位置——提示到答案的过渡(最后一个提示标记与前两个生成标记)——在早期/中层波段表现出清晰的线性分解。在那里,角色和任务通过部分正交的加性方向贡献。形成纯角色效应Δ_X、纯任务效应Δ_Y,并将h_BB + Δ_X + Δ_Y替换干净残差,在Gemma-2-2B-IT和Qwen-2.5-{1.5B, 3B}-Instruct上,跨越12个单元格的短网格和48个单元格的长角色网格,下游输出与干净输出的KL散度很小,并保留了角色特定的行为标记。从这种加性结构自然推断,角色提示可以压缩为单个缓存的残差向量。我们证明它不能。将缓存的加性预测——甚至oracle干净残差h_XY——注入到移除了角色文本的基线宿主提示中,无论是在一个位置还是在多个层,都无法接近干净的长角色目标。角色条件化的多标记生成通过注意力流回整个提示中的角色文本位置,这是任何单个位置的残差无法复现的。残差流中的局部加性性并不意味着提示可压缩。提示到答案过渡处的加性结构支持可解释性和对角色或任务贡献的细粒度控制;整个延续中的角色条件化行为依赖于分布式的提示/KV机制,局部激活算术无法取代。

英文摘要

Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $Δ_X$, a pure task effect $Δ_Y$, and substituting $h_{BB} + Δ_X + Δ_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction -- or even the oracle clean residual $h_{XY}$ -- into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
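The additive substitution can be made concrete with toy vectors. The sketch assumes the pure effects are differences against the baseline residual, i.e. Δ_X = h_XB - h_BB and Δ_Y = h_BY - h_BB; the abstract does not spell this out, so treat that definition, the names `h_xb` and `h_by`, and the helper functions as hypothetical.

```python
import math

def additive_prediction(h_bb, h_xb, h_by):
    """Predict the persona+task residual as h_BB + delta_X + delta_Y."""
    delta_x = [x - b for x, b in zip(h_xb, h_bb)]  # assumed persona-only effect
    delta_y = [y - b for y, b in zip(h_by, h_bb)]  # assumed task-only effect
    return [b + dx + dy for b, dx, dy in zip(h_bb, delta_x, delta_y)]

def cosine(u, v):
    """Cosine similarity, used here to compare a prediction with the clean residual."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

A high cosine between the prediction and the clean residual h_XY at one site is a local statement only; as the abstract stresses, it does not make the prompt compressible into that single vector.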

2605.23146 2026-05-25 cs.LG cs.AI 版本更新

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Infra-Bayesian 强化学习智能体在最坏情况鲁棒性上优于经典强化学习

Manish Aryal, Faiyaz Azam, Agnivo Banerjee, Sai Sidhanth Manoharan Jayanthi, Allegra Laro, Clément Legentilhomme, Andrew Lin, Florian Lorkowski, Radman Rakhshandehroo, Patric Rommel, Emanuel Ruzak, Nathan Theng, Paul Yushin Rapoport

发表机构 * Purdue University(普渡大学) Carnegie Mellon University(卡内基梅隆大学) WorldQuant University(WorldQuant大学) UC Berkeley(加州大学伯克利分校) Aix-Marseille University(艾克斯-马赛大学) MIT(麻省理工学院) University of Zurich(苏黎世大学) University of British Columbia(不列颠哥伦比亚大学) University of Stuttgart(斯图加特大学) University of Buenos Aires(布宜诺斯艾利斯大学) California State University, Fresno(加州州立大学弗雷斯诺分校) University of Chicago(芝加哥大学)

AI总结 该论文研究了在存在模型误设和策略依赖不确定性的情况下,经典强化学习方法的局限性,并提出了一种基于Infra-Bayesian主义的强化学习框架。该方法通过区分普通概率不确定性与Knightian不确定性,采用最大化最坏情况期望值的策略进行决策,从而在不可实现(non-realizable)环境中实现更稳健的性能。实验表明,该方法在具有Knightian不确定性的环境中表现出更低的最坏情况遗憾,并在纽科姆问题中优于经典决策理论方法。

详情
AI中文摘要

经典强化学习假设智能体与一个固定环境交互,该环境的行为不依赖于智能体的策略。这一假设在非可实现环境中失效,其中其他参与者可能预测智能体的行为,包括对 AI 安全至关重要的环境,例如智能体与预测者、人类、其他 AI 智能体和机构交互的环境。在此类环境中,智能体的模型类无法捕捉其运行的世界。在这种误设下,经典贝叶斯方法可能产生自信的错误后验、不可靠的决策和无界遗憾,因为可实现性无法获得。Infra-Bayesianism 是一个决策理论框架,通过将普通概率不确定性(其中先验可以合理选择)与 Knightian 不确定性(其中没有构建此类先验的依据)区分开来,解决了这些失败。它通过评估行动的最坏情况结果,而不是后验期望或加权平均来实现这一点。我们首次提出了一个用于有限结果无状态决策问题的 Infra-Bayesian 强化学习架构的概念验证实现。我们的智能体维护一组不精确的假设,使用 Infra-Bayesian 条件更新它们,并通过最大化最坏情况期望值来选择行动。我们将 Infra-Bayesian 极大极小决策过程的实现应用于具有 Knightian 不确定性的环境,并展示了与经典强化学习智能体相比更低的最坏情况遗憾。我们还研究了纽科姆问题,并表明 Infra-Bayesian 智能体选择了最优策略,优于经典决策理论智能体。我们的结果为在模型误设和策略依赖不确定性下保持鲁棒性的强化学习智能体迈出了一步。

英文摘要

Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-realizable settings where other actors might anticipate the agent's behavior, including environments crucial to AI safety, where the agent interacts with predictors, humans, other AI agents, and institutions. In such settings, the agent's model class fails to capture the world in which it operates. Under such misspecification, classical Bayesian methods can produce confidently wrong posteriors, unreliable decisions, and unbounded regret, as realizability fails to obtain. Infra-Bayesianism is a decision-theoretic framework that addresses these failures by distinguishing ordinary probabilistic uncertainty, where priors can be reasonably chosen, from Knightian uncertainty, where no grounds exist for the construction of such a prior. It does so by evaluating actions on their worst-case outcomes, rather than from posterior expectations or weighted averaging. We present the first proof-of-concept implementation of an infra-Bayesian reinforcement learning architecture for finite-outcome stateless decision problems. Our agent maintains a set of imprecise hypotheses, updates them using infra-Bayesian conditioning, and selects actions by maximizing worst-case expected value. We apply this implementation of the infra-Bayesian maximin decision process to an environment with Knightian uncertainty, and demonstrate a lower worst-case regret as compared to classical reinforcement learning agents. We also investigate Newcomb's problem and show that the infra-Bayesian agent picks the optimal strategy, outperforming classical decision theory agents. Our results provide a step towards reinforcement learning agents that remain robust under model misspecification and policy-dependent uncertainty.
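The contrast between worst-case and prior-weighted action selection can be shown on a stateless toy problem. This is a simplified sketch rather than the paper's architecture: it keeps a finite list of hypotheses instead of a full imprecise set, omits infra-Bayesian conditioning entirely, and all function names are hypothetical.

```python
def expected_value(probs, utilities):
    """Expected utility of one outcome distribution."""
    return sum(p * u for p, u in zip(probs, utilities))

def maximin_action(hypotheses, utilities):
    """Pick the action whose worst expected value across hypotheses is largest."""
    actions = list(hypotheses[0])
    def worst_case(action):
        return min(expected_value(h[action], utilities) for h in hypotheses)
    return max(actions, key=worst_case)

def bayes_action(hypotheses, prior, utilities):
    """Pick the action with the largest prior-weighted expected value."""
    actions = list(hypotheses[0])
    def averaged(action):
        return sum(w * expected_value(h[action], utilities)
                   for w, h in zip(prior, hypotheses))
    return max(actions, key=averaged)
```

Each hypothesis maps an action name to a probability list over outcomes. When the hypotheses disagree sharply about an action, the maximin rule avoids it even if a confident prior would favor it, which is the robustness property the abstract emphasizes.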

2605.23139 2026-05-25 cs.LG cs.AI 版本更新

CALAD: Channel-Aware contrastive Learning for multivariate time series Anomaly Detection

CALAD:面向多元时间序列异常检测的信道感知对比学习

Jaehyeop Hong, Youngbum Hur

发表机构 * Department of Industrial Engineering, Inha University, Incheon, Republic of Korea(韩国仁川仁荷大学工业工程系)

AI总结 多变量时间序列异常检测在实际应用中日益重要,但通常面临标注数据稀缺的问题。现有方法多采用无监督学习建模正常模式,但往往对所有通道一视同仁,忽略了不同通道对异常检测的贡献差异。本文提出CALAD,一种基于通道感知的对比学习框架,通过估计通道相关性指导对比样本的构建,增强模型对异常语义的学习能力,并结合重建误差和对比学习,提升模型在分布偏移场景下的检测性能。

Comments Accepted to ICPR 2026

详情
AI中文摘要

多元时间序列异常检测在实际应用中变得越来越重要,而标记数据往往稀缺。许多现有方法依赖无监督学习来建模正常模式,但它们通常平等对待所有信道。这种设计会稀释异常相关信号,因为并非所有信道对异常检测的贡献相同。在本文中,我们提出CALAD,一种用于多元时间序列异常检测的信道感知对比学习框架。CALAD利用估计的信道相关性指导对比样本的构建,使学习过程反映异常语义而非通用相似性。信道相关性通过基于Transformer的自编码器的重构误差进行估计,并用于区分对异常行为影响更大的信道。利用这些信息,我们设计了一种信道级增强策略,其中正负样本基于异常相关信道是否被保留或扰动来构建。这鼓励对无关信道的变化保持不变性,同时对异常相关信道的变化保持敏感性。此外,CALAD结合了对比学习和辅助重构头,使模型在保留正常结构的同时学习判别性表示。在多个真实数据集上的实验表明,CALAD在分布漂移场景下持续优于现有方法。我们提供可复现的代码:https://github.com/hirundo1218/CALAD。

英文摘要

Multivariate time series anomaly detection has become increasingly important in real-world applications, where labeled data are often scarce. Many existing approaches rely on unsupervised learning to model normal patterns, but they often treat all channels equally. This design can dilute anomaly-relevant signals, since not all channels contribute equally to anomaly detection. In this paper, we propose CALAD, a channel-aware contrastive learning framework for multivariate time series anomaly detection. CALAD governs the construction of contrastive samples using estimated channel relevance, allowing the learning process to reflect anomaly semantics rather than generic similarity. Channel relevance is estimated from reconstruction errors of a transformer-based autoencoder and is used to distinguish channels that are more influential to anomalous behaviors. Using this information, we design a channel-wise augmentation strategy in which positive and negative samples are constructed based on whether anomaly-relevant channels are preserved or perturbed. This encourages invariance to changes in irrelevant channels while being sensitive to changes in anomaly-relevant channels. Furthermore, CALAD combines contrastive learning and an auxiliary reconstruction head, allowing the model to learn discriminative representations while retaining normal structures. Experiments on multiple real-world datasets show that CALAD consistently outperforms existing methods, particularly under distribution shift scenarios. We provide the code for reproducibility at https://github.com/hirundo1218/CALAD
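The channel-wise augmentation idea can be sketched in a few lines. This is an illustration under stated assumptions, not the released CALAD code: relevance is taken as per-channel mean squared reconstruction error, the top-k channels are treated as anomaly-relevant, perturbation is additive Gaussian noise, and the names `channel_relevance` and `make_views` are hypothetical.

```python
import random

def channel_relevance(window, reconstruction):
    """Mean squared reconstruction error per channel; inputs are indexed [time][channel]."""
    steps, channels = len(window), len(window[0])
    return [sum((window[t][c] - reconstruction[t][c]) ** 2 for t in range(steps)) / steps
            for c in range(channels)]

def make_views(window, relevance, k, noise=0.5, seed=0):
    """Build a positive view (relevant channels preserved) and a negative view (relevant channels perturbed)."""
    rng = random.Random(seed)
    ranked = sorted(range(len(relevance)), key=lambda c: relevance[c], reverse=True)
    relevant = set(ranked[:k])
    irrelevant = set(range(len(relevance))) - relevant

    def perturb(targets):
        # Add noise only to the listed channels; copy the others unchanged.
        return [[v + rng.gauss(0.0, noise) if c in targets else v
                 for c, v in enumerate(row)] for row in window]

    return perturb(irrelevant), perturb(relevant), relevant
```

Pulling the positive view toward the anchor and pushing the negative view away is what would encourage invariance to irrelevant channels and sensitivity to relevant ones, as the abstract describes.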

2605.23138 2026-05-25 quant-ph cs.AI cs.ET cs.LG 版本更新

Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning

基于强化学习的变分量子算法经典态制备

Gino Kwun, Dhanvi Bharadwaj, Gokul Subramanian Ravi

发表机构 * Computer Science and Engineering, University of Michigan(密歇根大学计算机科学与工程系)

AI总结 该论文提出了一种基于强化学习的新型方法CRiSP,用于变分量子算法中的经典初始态制备。该方法将离散前缀选择建模为序列决策问题,结合神经引导的蒙特卡洛树搜索和自博弈训练的Transformer策略,能够在不改变电路结构的前提下,通过多项式时间的经典稳定子模拟生成高质量初始态。实验表明,CRiSP在多个QAOA和VQE基准任务中显著优于现有方法,展现出更高的能量精度和更强的可扩展性。

Comments 22 pages, 4 figures

详情
AI中文摘要

变分量子算法(VQA)可能提供实现实际量子优势的途径,但其优化受到贫瘠高原和大量局部极小值的严重阻碍。虽然经典可模拟的克利福德电路可以热启动VQA以加速收敛,但现有的基于启发式的初始化方法难以在巨大的组合搜索空间中扩展。为了克服这一瓶颈,我们提出了CRiSP(用于态制备的克利福德强化学习智能体),这是一个将离散前缀选择表述为序列决策问题的框架。CRiSP利用神经引导的蒙特卡洛树搜索,由通过自我对弈训练的基于Transformer的策略驱动,在固定参数化旋转之前插入学习到的克利福德门。这使得能够完全通过多项式时间的经典稳定子模拟构建高质量的初始态,而不改变底层电路架构。通过整合逐步扩展搜索范围的课程学习策略,该智能体能够高效扩展到深度电路。在多达22个量子比特和1,370个参数的QAOA基准测试中,CRiSP在平均能量精度上优于最先进的克利福德初始化方法平均3.17倍(最大45.02倍),在最佳能量精度上平均2.44倍(最大16.01倍)。对VQE任务的评估进一步证明了该框架的鲁棒性和泛化能力。

英文摘要

Variational Quantum Algorithms (VQAs) potentially offer a pathway to practical quantum advantage, but their optimization is heavily hindered by barren plateaus and numerous local minima. While classically simulable Clifford circuits can warm-start VQAs to accelerate convergence, existing heuristic-based initialization methods struggle to scale within vast combinatorial search spaces. To overcome this bottleneck, we propose CRiSP (a Clifford Reinforcement Learning agent for State Preparation), a framework that formulates discrete prefix selection as a sequential decision-making problem. CRiSP utilizes Neural-Guided Monte Carlo Tree Search, driven by a Transformer-based policy trained via self-play, to insert learned Clifford gates before fixed parameterized rotations. This enables the construction of high-quality initial states entirely through polynomial-time classical stabilizer simulation without altering the underlying circuit architecture. By integrating a curriculum learning strategy that progressively expands the search horizon, the agent efficiently scales to deep circuits. Evaluated on QAOA benchmarks of up to 22 qubits and 1,370 parameters, CRiSP outperforms state-of-the-art Clifford initialization methods by a mean of 3.17x (max 45.02x) in average energy accuracy and 2.44x (max 16.01x) in best-achieved energy accuracy. Assessments on VQE tasks further demonstrate the framework's robustness and generalizability.

2605.23123 2026-05-25 cs.CY cs.AI cs.HC 版本更新

Defining AI Fatigue in Academic Contexts: Dimensions, Indicators, and a Stage-Based Model Using Grounded Theory

定义学术情境中的AI疲劳:维度、指标及基于扎根理论的分阶段模型

John Paul P. Miranda, Emmanuel B. Parreño, Jovita G. Rivera

发表机构 * Pampanga State University(帕曼加州大学)

AI总结 本文探讨了学术场景中由持续使用AI工具引发的一种新型压力——AI疲劳,提出了其定义、维度及阶段模型。研究基于对1054名菲律宾大学学生的开放式回答进行扎根理论分析,识别出认知超载、动机脱离、道德不安、身体负担和注意力分散五个维度,每个维度包含两个基于参与者描述的指标。研究还构建了AI疲劳阶段模型,解释了这些压力如何在重复使用AI工具的过程中累积和相互强化,为未来相关测量工具的开发和跨情境研究奠定了基础。

Comments 17 pages, journal article, Volume 25, Issue 5

Journal ref International Journal of Learning, Teaching and Educational Research, 25(5), 91-107 (2026)

详情
AI中文摘要

AI工具在学术环境中的整合引入了一种独特的压力形式,现有框架如技术压力和数字疲劳尚未完全解决这一问题。本研究开发了一个概念模型,并确定了定义AI疲劳的维度,AI疲劳是持续在学术中使用AI工具而产生的一种压力形式。通过对菲律宾三所大学1054名大学生的开放式回答进行扎根理论分析,研究了学生在AI支持的学术工作中经历的认知、动机、情感、身体和注意力压力。分析产生了AI疲劳的五个维度,即认知超载、动机脱离、道德不安、身体疲劳和注意力漂移,每个维度包含两个基于参与者叙述的指标。研究结果还提出了AI疲劳模型,这是一个分阶段框架,解释了这些压力如何在学术任务中反复与AI交互时积累并相互强化。这些贡献为AI疲劳作为一个独特构念建立了概念和探索基础,并为未来在AI中介学生学习的学术环境中的工具验证、量表开发和跨情境研究提供了基础。

英文摘要

The integration of AI tools in academic settings has introduced a distinct form of strain that existing frameworks like technostress and digital fatigue have not yet fully addressed. This study develops a conceptual model and identifies the dimensions that define AI fatigue as a form of strain arising from sustained academic use of AI tools. Using grounded theory analysis of open-ended responses from 1,054 university students across three universities in the Philippines, the study examined the cognitive, motivational, emotional, physical, and attentional pressures students experienced during AI-supported academic work. Analysis produced five dimensions of AI fatigue, namely Cognitive Overload, Motivational Disengagement, Moral Unease, Physical Strain, and Attentional Drift, each consisting of two indicators grounded in participant accounts. The findings also yielded the AI Fatigue Model, a stage-based framework that explains how these pressures accumulate and reinforce one another across repeated AI interaction in academic tasks. These contributions establish a conceptual and exploratory foundation for AI fatigue as a distinct construct and provide a basis for future instrument validation, scale development, and cross-contextual inquiry in academic settings where AI now mediates student learning.

2605.23118 2026-05-25 cs.CV cs.AI cs.LG 版本更新

Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking

在临床医生验证的交互式病灶追踪中利用纵向上下文

Yannick Kirchhoff, Maximilian Rokuss, Daniel Philipp Mertens, David Füller, Benjamin Hamm, Andreas Schreyer, Oliver Ritter, Klaus Maier-Hein

发表机构 * German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany(德国癌症研究中心(DKFZ)海德堡,医学图像计算部,德国) Faculty of Mathematics and Computer Science, Heidelberg University, Germany(海德堡大学数学与计算机科学学院,德国) HIDSS4Health -- Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany(HIDSS4Health:亥姆霍兹健康信息与数据科学学院,卡尔斯鲁厄/海德堡,德国) Medical Faculty, Heidelberg University, Germany(海德堡大学医学院,德国) University Hospital Brandenburg an der Havel, Brandenburg Medical School Theodor Fontane, Germany(哈弗尔河畔勃兰登堡大学医院,勃兰登堡特奥多尔·冯塔纳医学院,德国) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany(海德堡大学医院放射肿瘤科模式分析与学习组,德国)

AI总结 本文研究了如何在临床验证的交互式病灶追踪中有效利用纵向影像信息,以提高肿瘤在连续CT扫描中的追踪准确性。作者提出了一种“验证追踪”范式,通过临床医生验证注册提出的提示,并结合病灶的基线外观信息,解决分割中的模糊问题。该方法结合了早期空间提示融合与潜在时间差分加权,构建了一个统一的纵向信息引导分割框架,并通过大规模合成预训练克服数据稀缺问题,显著提升了性能。实验表明,该方法在全自动和验证追踪设置下均优于现有方法,且在MICCAI autoPET IV挑战赛中取得第一名。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

在系列CT扫描中追踪肿瘤病灶对于肿瘤学反应评估至关重要。现有的自动化方法面临一个基本权衡:端到端追踪器实现高度自动化,但无法纠正无声的追踪失败;而解耦的配准-分割流程允许用户验证,却丢弃了病灶的先验外观,限制了在模糊情况下的准确性。在这项工作中,我们提出了一种验证追踪范式:临床医生验证配准提出的提示,模型利用该提示以及基线病灶外观来解决分割模糊性。我们提出了一个统一框架,结合早期空间提示融合与潜在时间差异加权,用于纵向信息感知的分割。为了解决数据稀缺问题,我们利用大规模合成预训练,证明这对于利用纵向上下文至关重要,相比从头训练性能提升高达4.5个Dice点。我们的方法在MICCAI autoPET IV挑战中获得第一名。我们进一步整理并发布了PanTrack,一个新的纵向胰腺癌基准,以评估分布外泛化能力。实验表明,我们的模型在全自动和所提出的验证追踪设置中均优于先前工作,在自动化与控制之间提供了一个临床安全的中间地带。代码、模型和数据集将在https://github.com/MIC-DKFZ/LongiSeg发布。

英文摘要

Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally-informed segmentation. To address data scarcity, we leverage large-scale synthetic pretraining, which proves essential for exploiting longitudinal context, improving performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out-of-distribution generalization. Experiments show that our model outperforms prior work in both the fully automatic and the proposed verified tracking settings, offering a clinically safe middle ground between automation and control. Code, model, and dataset will be released at https://github.com/MIC-DKFZ/LongiSeg

2605.23116 2026-05-25 cs.CV cs.AI 版本更新

CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection

CoReVAD: 一种无需训练的视频异常检测上下文推理框架

Hyeongmuk Lim, Youngbum Hur

发表机构 * Department of Industrial Engineering, Inha University, Incheon, Republic of Korea(韩国仁川仁荷大学工业工程系)

AI总结 现有视频异常检测方法通常依赖任务特定的训练,导致领域依赖性强且训练成本高,且大多仅输出标量异常分数,缺乏对异常原因的解释。为此,本文提出CoReVAD,一种无需训练的上下文推理框架,利用冻结的视觉-语言模型直接生成异常分数和时间描述,并通过局部响应清理模块和全局时序优化策略提升检测精度与可解释性。实验表明,CoReVAD在多个数据集上表现出色,提供了可靠且易于理解的异常解释。

Comments Accepted to ICPR 2026

详情
AI中文摘要

现有的视频异常检测方法通常依赖于任务特定的训练,导致强领域依赖性和高训练成本。此外,大多数现有方法仅输出标量异常分数,对特定事件为何被视为异常提供的洞察有限。视觉语言模型的最新进展使得异常检测和人类可解释推理成为可能。然而,许多基于视觉语言模型的方法仍然需要额外的训练步骤(例如,指令调优或口头化学习)或外部大型语言模型,从而带来进一步的训练成本和推理开销。为了解决这些挑战,我们提出了CoReVAD,一种用于无需训练的视频异常检测的上下文推理框架,该框架使用单个冻结的视觉语言模型运行。CoReVAD直接从视觉语言模型生成异常分数和时间描述。为了减轻生成输出中的噪声,我们引入了一个基于局部视觉-文本对齐的局部响应清理模块。此外,通过基于softmax的精炼、高斯平滑和位置加权,融入了全局时间上下文和进展。在UCF-Crime和XD-Violence上的实验表明,CoReVAD在无需训练的方法中取得了竞争性能,同时提供了可靠且可解释的解释。我们的官方代码可在https://github.com/Muk-00/CoReVAD获取。

英文摘要

Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD

2605.23109 2026-05-25 cs.AI cs.DC cs.LO cs.PL 版本更新

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

归纳演绎合成:使AI能够生成形式化验证的系统

Shubham Agarwal, Alexander Krentsel, Shu Liu, Mert Cemri, Audrey Cheng, Rui Meng, Tomas Pfister, Chun-Liang Li, Sylvia Ratnasamy, Aditya Parameswaran, Matei Zaharia, Ion Stoica, Mohsen Lesani

发表机构 * UC Berkeley(加州大学伯克利分校) Google(谷歌) UC Santa Cruz(加州大学圣克鲁兹分校)

AI总结 本文提出了一种名为归纳演绎合成(IDS)的新方法,旨在解决AI生成代码时缺乏形式化保证的问题,特别是在分布式系统领域。该方法联合且增量地合成实现代码和形式化证明,并从失败尝试中学习,系统性地尝试有前景的策略。IDS作为基于LLM的代理系统,在全部7个分布式键值存储规范上取得成功,平均每个规范耗时约6.8小时、成本约106美元,且生成的实现比已发布的验证系统快达3倍。

详情
AI中文摘要

AI代理在生成、测试和优化代码方面日益出色。然而,在需要完全覆盖的形式化保证(仅靠测试无法提供)的任务上,它们表现不足。分布式系统是一个典型例子:读写一致性等属性必须在每个可能的事件交错下成立。机械化形式验证可以保证这种正确性,但通常需要专家数月到数年的努力。证据表明,即使是最先进的编码代理(Codex with GPT-5.4和Claude Code with Opus 4.6)也仅在7个分布式键值存储规范中的2个上成功。在本文中,我们提出了解决这一差距的首个有效方法——归纳演绎合成(IDS),它联合且增量地合成实现和证明,并从失败的尝试中学习以系统地尝试有前景的策略。作为基于LLM的代理系统,IDS在平均约6.8小时和每个规范106美元的成本下实现了7/7的成功率,比专家努力快约200倍,比最先进的代理便宜17%。IDS进一步将性能反馈纳入同一循环,产生的实现比已发布的验证系统快达3倍。

英文摘要

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

2605.23108 2026-05-25 cs.SE cs.AI 版本更新

Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study

哲学倾向作为AI辅助代码评审的行为约束:一项实证研究

Kaushal Bansal

发表机构 * Salesforce, Inc.(Salesforce公司)

AI总结 本文研究如何通过哲学立场(如怀疑主义、逻辑学、犬儒主义等)约束AI代码审查工具的行为,以提升其审查的多样性和深度。研究提出了一种基于特定知识论传统构建AI审查行为框架的方法,并通过实证分析验证了该方法在不同编程语言和项目中的有效性。实验表明,该系统能够发现传统AI工具难以识别的结构性和逻辑性问题,展现出更强的审查独特性和准确性。

详情
AI中文摘要

AI辅助代码评审工具通常作为通用的“专家评审者”代理运行,无论需要何种分析类型,都会产生同质化的发现。我们提出一个系统,通过哲学倾向——基于特定认识论传统(皮浪怀疑论、新正理逻辑、第欧根尼犬儒主义、儒家关系伦理)的连贯人格视角,将注意力引导到结构上不同类型的问题上——来约束AI评审者行为。每种倾向通过否定方式定义(即拒绝做什么),配备自我监控的失败模式(hamartia),并通过角色协议按顺序编排。我们在跨越5种编程语言(Python、Go、C++、Java、Terraform)、5个组织(2个企业、3个开源)和2个时间时代(AI前2020年、AI后2024-2026年)的7个代码库的50个合并拉取请求上评估该系统。该倾向系统与人类评审者达到46%的一致性(验证信号质量),以75%的比率识别出独特发现,并且在总共601个发现中,没有发现被作者判定为假阳性(未评估评分者间一致性,这仍是一个局限)。受控基线比较表明,51%的倾向发现是同一模型使用通用“专家评审者”提示不会产生的,这些独特发现针对结构、操作和逻辑问题,而非标准代码级别问题。初步跨模型验证(Claude Opus vs. GPT Codex 5.3-xhigh)在3个PR上显示100%的框架结构遵循度和39%的发现级别一致性,表明该框架在保持模型特定分析视角的同时提供了真正的行为约束。

英文摘要

AI-assisted code review tools typically operate as generic "expert reviewer" agents, producing homogeneous findings regardless of the analysis type needed. We present a system that constrains AI reviewer behavior through philosophical dispositions -- coherent personality lenses grounded in specific epistemological traditions (Pyrrhonist Skepticism, Navya-Nyāya logic, Diogenes' Cynicism, Confucian relational ethics) that direct attention to structurally different types of issues. Each disposition is defined apophatically (by what it refuses to do), equipped with a self-monitoring failure mode (hamartia), and orchestrated in sequence by role protocols. We evaluate this system on 50 merged pull requests across 7 repositories spanning 5 programming languages (Python, Go, C++, Java, Terraform), 5 organizations (2 enterprise, 3 open-source), and 2 temporal eras (pre-AI 2020, post-AI 2024--2026). The disposition system achieves 46% convergence with human reviewers (validating signal quality), identifies unique findings at a 75% rate, and produces no findings judged false-positive by the author across 601 total findings (inter-rater agreement was not assessed and remains a limitation). A controlled baseline comparison demonstrates that 51% of disposition findings are not produced by the same model using generic "expert reviewer" prompting, and these unique findings target structural, operational, and logical concerns rather than standard code-level issues. Preliminary cross-model validation (Claude Opus vs. GPT Codex 5.3-xhigh) on 3 PRs shows 100% framework-structure adherence with 39% finding-level agreement, suggesting the framework provides real behavioral constraint while preserving model-specific analytical perspective.

2605.23103 2026-05-25 cs.CL cs.AI cs.CY cs.DB 版本更新

A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

用于明清之际文集中个人书信标题的微调BERT分类器

Queenie Luo

发表机构 * Harvard University(哈佛大学)

AI总结 本文提出了一种基于微调BERT的分类器Lepton,用于识别晚明至清初文集目录中的标题是否为个人书信,特别是与易混淆的序文(如赠序)进行区分。该模型在33位文人的5438个手工标注文集标题上进行微调,并已部署于Hugging Face平台,应用于中国传记数据库(CBDB),识别出约五万五千封书信,为明代书信平台(Ming Letter Platform)的数据建设提供了支持。

详情
AI中文摘要

我提出Lepton(书信预测),一个微调的BERT分类器,用于预测古典中文文集目录中的标题是个人书信还是易混淆的序文(特别是赠序)。Lepton在来自三十三位明清之际文人的5438个手工标注的文集标题上微调bert-base-chinese。我已将该模型部署在Hugging Face上,并已在中国传记数据库(CBDB)中使用,用于识别从中明到清初文集中约五万五千封书信,从而填充明代书信平台。

英文摘要

I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I have deployed the model on Hugging Face, and it has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.

2605.23094 2026-05-25 eess.IV cs.AI cs.CV 版本更新

Do Synthetic Brain MRIs Reliably Improve Tumour Classification? A StyleGAN2-ADA Class-Plane Augmentation Study on BRISC 2025

合成脑部MRI能否可靠改善肿瘤分类?基于BRISC 2025的StyleGAN2-ADA类平面增强研究

José Rafael Noriega Cedeño

发表机构 * NVIDIA

AI总结 该研究探讨了合成脑部MRI图像是否能有效提升肿瘤分类任务的性能,使用StyleGAN2-ADA生成器在BRISC 2025数据集上生成图像,并测试其对三种分类模型的影响。研究发现,合成图像的增益效果因模型架构和真实与合成图像比例不同而有所差异,其中MobileViTV2模型在使用过滤后的1:1合成图像增强后,肿瘤分类准确率提升了1.02%。结果表明,生成式增强的效果并非仅取决于图像的视觉质量,而是与模型结构和数据配比密切相关。

Comments 18 pages, 16 figures

详情
AI中文摘要

生成式增强常被提议作为小规模医学图像数据集的补救措施,但合成图像只有在改善下游任务性能时才有用。此处的“增强”指合成补充:将GAN生成的样本添加到真实训练池中,而非对现有图像进行几何或光度变换。我们在受限的BRISC 2025分区上训练了十二个类平面StyleGAN2-ADA生成器,以测试其输出(无论是否经过InceptionV3特征空间过滤)是否能改善三个分类器家族上的留出肿瘤分类:基于InceptionV3特征的随机森林(RF)、紧凑型双头卷积神经网络(CNN)以及移动混合卷积-Transformer MobileViTV2。每个分类器在1:1和1:2的真实与合成比例下进行评估。独立的GPT-5.5盲测在模型可读子集上将门控真实与合成辨别率定为57.73%(95%置信区间:54.48–60.92%),略高于随机水平。RF分类器未从合成MRI中获益。CNN显示出一致的均值增益,但未通过Holm校正。MobileViTV2显示出最清晰的益处:过滤后的1:1增强将肿瘤分类准确率绝对提高了1.02%(95%置信区间:0.54–1.54%;Holm校正后p=0.0104)。二次效率分析发现,每个增强的CNN条件比基线提前42–64%选择其检查点,而计算匹配的MobileViTV2运行在减少50–67%的真实数据epoch后达到选择。总体而言,增强效用被发现依赖于架构和比例,而非仅由视觉保真度保证。

英文摘要

Generative augmentation is often proposed as a remedy for small medical-image datasets, but synthetic images are only useful when they improve downstream task performance. "Augmentation" here means synthetic supplementation: GAN-generated samples added to the real training pool, not geometric or photometric transforms of existing images. Twelve class-plane StyleGAN2-ADA generators were trained on constrained BRISC 2025 partitions to test whether their output, with or without InceptionV3 feature-space filtering, improves held-out tumour classification across three classifier families: a random forest (RF) on InceptionV3 features, a compact two-headed convolutional neural network (CNN), and MobileViTV2, a mobile hybrid convolutional-transformer. Each was evaluated at 1:1 and 1:2 real-to-synthetic ratios. An independent GPT-5.5 blind test placed gated real-versus-synthetic discrimination at 57.73% (95% CI: 54.48--60.92%) on the model-legible subset -- modestly above chance. The RF classifier did not benefit from the synthetic MRIs. The CNN showed consistent mean gains that did not survive Holm correction. MobileViTV2 showed the clearest benefit: filtered 1:1 augmentation improved tumour classification accuracy by 1.02% absolute (95% CI: 0.54--1.54%; Holm-corrected p = 0.0104). A secondary efficiency analysis found that every augmented CNN condition selected its checkpoint 42--64% earlier than baseline, while compute-matched MobileViTV2 runs reached selection after 50--67% fewer real-data epochs. Overall, augmentation utility was found to be architecture- and ratio-dependent, not guaranteed by visual fidelity alone.
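The abstract reports Holm-corrected p-values for its classifier comparisons. As a minimal sketch of how the Holm step-down adjustment works in general (this is not the authors' analysis code, and the function name `holm_adjust` and the sample p-values are illustrative assumptions), the procedure can be written as:

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the input order.

    The k-th smallest raw p-value (k = 0, 1, ...) is multiplied by (m - k),
    capped at 1.0, and forced to be non-decreasing along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons (not taken from the paper).
example = holm_adjust([0.01, 0.04, 0.03])
```

A comparison "survives Holm correction" at level alpha when its adjusted p-value stays below alpha, which is stricter than testing each raw p-value on its own.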

2605.23091 2026-05-25 cs.SE cs.AI cs.CR 版本更新

Security of LLM-generated Code: A Comparative Analysis

LLM生成代码的安全性:一项比较分析

Srivathsan G Morkonda, Mahmoud Selim, Hala Assal

发表机构 * Carleton University(卡尔顿大学)

AI总结 本文研究了大型语言模型(LLM)生成代码的安全性问题,评估了七种流行LLM生成代码中的安全漏洞。通过模拟开发者使用LLM生成代码的行为,研究发现所有被评估的模型生成的代码中均存在不同程度的安全漏洞,其中大部分为高危或严重漏洞,揭示了当前AI辅助编程在安全性方面的潜在风险。

详情
AI中文摘要

大多数软件开发人员正在或计划在其开发过程中使用人工智能(AI)工具,主要原因包括提高生产力和加快学习速度。事实上,大型语言模型(LLM)生成的代码目前已投入生产,包括在主要科技公司中。然而,人们对于使用AI工具生成代码的相关风险提出了担忧。在本文中,我们重点关注软件安全风险。我们实证评估了七种流行LLM生成代码的安全性。我们基于先前的工作,模拟了开发人员使用LLM生成代码时的行为。我们的结果表明,我们评估的所有七种LLM生成的代码都包含漏洞,其中大多数为严重或高危漏洞。

英文摘要

The majority of software developers use or are planning to use Artificial Intelligence (AI) tools in their development processes. Their top reasons include improving productivity and faster learning. In fact, Large Language Model (LLM)-generated code is currently in production, including in major tech companies. However, concerns were raised about the risks associated with the use of AI tools to generate code. In this paper, we focus our attention on the risks to software security. We empirically evaluate the security of code generated by seven popular LLMs. We build upon previous work to mimic the behaviours of developers when using LLMs to generate code. Our results show that all seven LLMs that we have evaluated generate code that contains vulnerabilities, the majority of which are of critical or high severity.

2605.23089 2026-05-25 cs.LG cs.AI 版本更新

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

利用梯度惩罚潜在动力学实现平滑且样本高效的想象(Dreaming)

Romil V. Sonigra, P. R. Kumar

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) Texas A&M University(德克萨斯A&M大学)

AI总结 本文提出了一种名为GPLD的梯度惩罚隐动力学正则化方法,用于改进基于模型的强化学习中的隐世界模型。该方法通过对后验隐状态分布施加行级雅可比惩罚,显式地鼓励局部平滑的转移动力学学习,从而提升模型的样本效率和学习稳定性。实验表明,GPLD在多个深度强化学习任务中表现出色,尤其在复杂运动控制环境中显著提升了性能,并且在四足机器人任务中实现了更早的高回报行为和更一致的长期学习效果。

Comments 17 pages and 9 figures

详情
AI中文摘要

基于模型的强化学习通过学习世界模型来提高样本效率。然而,现有的潜在世界模型(如DreamerV3)并未明确强制其学习的转移动力学具有局部平滑性,从而未利用这一有用的归纳偏置。我们提出GPLD,一种用于DreamerV3的梯度惩罚潜在动力学正则化器,通过对后验潜在分布施加行雅可比惩罚来鼓励局部平滑的转移学习。我们证明该惩罚可解释为离散嵌入状态MDP中转移律的有限差分平滑的连续潜在类比,并使用Hutchinson风格随机探针高效估计。实验上,在DeepMind Control本体感受任务中,GPLD提高了总体样本效率,在复杂度较高的运动环境中尤其显著。在更具挑战性的四足任务中,GPLD更早达到高回报行为,并在更长的时间跨度内表现出更一致的后期学习。显式局部平滑正则化是改善平滑连续控制环境中潜在世界模型的简单有效方法。GPLD代码见github.com/romils9/gpld-mbrl。

英文摘要

Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient-penalized latent dynamics regularizer for DreamerV3 that applies a row-wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous-latent analog of finite-difference smoothing of transition laws in discrete embedded-state MDPs, and estimate it efficiently using Hutchinson-style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high-return behavior earlier and exhibits more consistent late-stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at github.com/romils9/gpld-mbrl .
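The abstract mentions estimating a Jacobian penalty with Hutchinson-style stochastic probes. The sketch below shows only the generic idea, estimating the squared Frobenius norm of a Jacobian from vector-Jacobian products with random sign vectors. It is not the GPLD implementation: the row-wise penalty on the posterior latent distribution is not reproduced here, and the helper name `hutchinson_frobenius_sq` and the linear toy map are assumptions for illustration.

```python
import numpy as np


def hutchinson_frobenius_sq(vjp, out_dim, num_probes=64, seed=0):
    """Estimate ||J||_F^2 as the average of ||J^T v||^2 over random probes.

    `vjp(v)` must return J^T v for a probe v of length `out_dim`.
    Rademacher probes satisfy E[v v^T] = I, so the estimate is unbiased.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=out_dim)  # random sign vector
        total += float(np.sum(vjp(v) ** 2))
    return total / num_probes


# Toy linear map f(x) = A x, whose Jacobian is A itself (illustrative only).
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
estimate = hutchinson_frobenius_sq(lambda v: A.T @ v, out_dim=2)
exact = float(np.sum(A ** 2))
```

In a deep-learning framework the `vjp` callable would come from automatic differentiation, so each probe costs roughly one backward pass instead of materialising the full Jacobian.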

2605.23074 2026-05-25 cs.AI 版本更新

PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning

PathCal: 状态感知的反思标记校准用于高效推理

Lingyu Jiang, Zirui Li, Shuo Xing, Peiran Li, Tsubasa Takahashi, Dengzhe Hou, Zhengzhong Tu, Kazunori Yamada, Fangzhou Lin

发表机构 * Tohoku University(东北大学) Texas A&M University(德克萨斯A&M大学) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 随着大语言模型在推理任务中的应用日益广泛,如何高效控制其推理路径成为一个关键问题。本文提出PathCal,一种无需训练的解码控制器,通过区分不同类型的反思标记并仅在局部不确定状态进行干预,实现对推理路径的校准。实验表明,PathCal在多个推理基准上有效提升了推理效率与性能的平衡,减少了生成长度而不牺牲准确性。

Comments 21 pages, 5 figures, 7 tables

详情
AI中文摘要

大型推理语言模型(LRMs)的出现通过推理时缩放生成长篇思维链(CoT)轨迹,为处理复杂推理任务铺平了道路。同时,这些轨迹通常包含显式的反思标记,如“wait”、“but”和“alternatively”,分别表示犹豫、修正和考虑替代探索。最近关于测试时控制的研究利用这些标记作为轻量级手柄来引导推理,通常将它们视为单一的粗粒度类别,而非区分其不同的功能角色。在本文中,我们进行类型级抑制和固定前缀干预,揭示反思标记不仅在功能角色上不同,而且在它们发挥最大影响的时机上也不同。具体来说,不同的标记类别以不同方式影响准确性和生成长度,并且标记选择在模型进入稳定推理轨迹之前最为关键。受这些发现启发,我们引入PathCal,一种新颖的无需训练的解码控制器,通过区分标记类型并仅在局部不确定状态进行干预来校准推理路径。在每个解码步骤,PathCal利用反思标记上的分布来估计维持当前推理轨迹与启动竞争分支之间的局部竞争,并在竞争分支证据过多时软性地重新平衡标记对数。在六个推理基准上的实验表明,PathCal实现了更好的效率-性能权衡,在减少生成长度的同时提高或保持准确率,且不依赖外部验证器或额外采样。

英文摘要

The emergence of Large Reasoning Language Models (LRMs) has paved the way for tackling complex reasoning tasks through test-time scaling by generating long-form Chain-of-Thought (CoT) trajectories during inference. Meanwhile, these trajectories often contain explicit reflection markers such as "wait", "but", and "alternatively", signaling hesitation, revision, and the consideration of alternative explorations, respectively. Recent studies on test-time control leverage such markers as lightweight handles for steering reasoning, typically treating them as a single coarse-grained category rather than distinguishing their distinct functional roles. In this paper, we conduct type-wise suppression and fixed-prefix intervention, revealing that reflection markers differ not only in their functional roles but also in when they exert the greatest influence. Specifically, different marker classes affect accuracy and generation length in distinct ways, and marker choices are most consequential before the model settles into a stable reasoning trajectory. Motivated by these findings, we introduce PathCal, a novel training-free decoding controller that calibrates reasoning paths by distinguishing marker types and intervening only at locally uncertain states. At each decoding step, PathCal utilizes the distribution over reflection markers to estimate local competition between maintaining the current reasoning trajectory and initiating a competing branch, and softly rebalances marker logits when competing-branch evidence becomes excessive. Experiments across six reasoning benchmarks demonstrate that PathCal achieves a better efficiency--performance trade-off, improving or preserving accuracy while reducing generation length, without relying on external verifiers or additional sampling.

2605.23065 2026-05-25 cs.CV cs.AI cs.LG 版本更新

Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering

抖动防御:通过多级 Floyd-Steinberg 抖动实现视觉基础模型的对抗鲁棒性

Yury Belousov, Brian Pulfer, Vitaliy Kinakh, Slava Voloshynovskiy

发表机构 * Department of Computer Science, University of Geneva, Switzerland(日内瓦大学计算机科学系)

AI总结 该研究提出了一种基于多级Floyd-Steinberg抖动算法的轻量输入变换方法,用于提升视觉基础模型在对抗攻击下的鲁棒性。该方法通过在图像中引入可控的噪声,破坏对抗扰动的同时保留语义内容,适用于多种下游任务和不同模型架构。实验表明,该方法在多种攻击场景下表现优异,且对干净输入的性能下降较小,优于现有的去噪基线方法。

Comments Paper accepted at the IEEE International Conference on Image Processing (ICIP 2026)

详情
AI中文摘要

视觉基础模型被广泛用作许多下游任务中的冻结骨干,使其成为对抗攻击下的单点故障。我们研究了多级 Floyd-Steinberg 误差扩散抖动作为一种轻量级、模型无关的输入变换,它在保留语义内容的同时破坏对抗扰动。与先前局限于二值抖动、灰度 CIFAR-10 和从头训练的单个小模型的工作不同,我们在六个任务(分类、分割、深度估计、检索、字幕生成、视觉问答)、两个模型家族(DINOv2、PaliGemma)以及三种强度递增的攻击(PGD、MI-FGSM、SIA)上进行了评估,还包括使用直通估计器的自适应攻击者。我们的结果表明,在中间量化级别上的 Floyd-Steinberg 抖动,尤其是与后处理模糊相结合时,超过或匹配所有测试的基线(包括基于扩散的去噪),并且在干净输入上的退化显著更小。

英文摘要

Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adversarial attack. We study multi-level Floyd-Steinberg error-diffusion dithering as a lightweight, model-agnostic input transformation that disrupts adversarial perturbations while preserving semantic content. Unlike prior work, which was limited to binary dithering, grayscale CIFAR-10, and a single small model trained from scratch, we evaluate across six tasks (classification, segmentation, depth estimation, retrieval, captioning, visual question answering), two model families (DINOv2, PaliGemma), and three attacks of increasing strength (PGD, MI-FGSM, SIA), as well as an adaptive attacker using a straight-through estimator. Our results show that Floyd-Steinberg dithering at intermediate quantization levels, especially when combined with post-processing blur, exceeds or matches all tested baselines, including diffusion-based denoising, with substantially less degradation on clean inputs.
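The defense described above rests on multi-level Floyd-Steinberg error-diffusion dithering. A minimal sketch of the classic algorithm for a single-channel image follows. It assumes pixel values in [0, 1] and evenly spaced quantization levels, and the function name `floyd_steinberg_dither` is illustrative. The paper's actual preprocessing (colour handling, choice of levels, post-processing blur) is not specified in the abstract and is not reproduced here.

```python
import numpy as np


def floyd_steinberg_dither(image, levels=4):
    """Quantize a 2-D image in [0, 1] to `levels` evenly spaced values.

    Each pixel's quantization error is pushed to not-yet-visited neighbours
    with the standard 7/16, 3/16, 5/16, 1/16 Floyd-Steinberg weights.
    """
    if levels < 2:
        raise ValueError("levels must be at least 2")
    work = np.asarray(image, dtype=np.float64).copy()
    height, width = work.shape
    step = 1.0 / (levels - 1)
    out = np.zeros_like(work)
    for y in range(height):
        for x in range(width):
            old = work[y, x]
            # Snap to the nearest allowed level.
            new = np.clip(np.round(old / step), 0, levels - 1) * step
            out[y, x] = new
            err = old - new
            if x + 1 < width:
                work[y, x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1, x - 1] += err * 3 / 16
                work[y + 1, x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1, x + 1] += err * 1 / 16
    return out


# Illustrative input: a flat grey patch reduced to two levels (binary dithering).
flat = np.full((32, 32), 0.3)
binary = floyd_steinberg_dither(flat, levels=2)
```

Because the error is carried forward rather than discarded, the local average intensity of the output stays close to the input even at very few levels, which is why semantic content can survive while fine-grained perturbations are scrambled.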

2605.23061 2026-05-25 cs.LG cs.AI math.OC stat.ML 版本更新

Anytime Training with Schedule-Free Spectral Optimization

任意时间训练:无调度谱优化

Anuj Apte, Pranav Deshpande, Niraj Kumar, Shouvanik Chakrabarti, Junhyung Lyle Kim

发表机构 * Global Technology Applied Research(全球技术应用研究)

AI总结 本文提出了一种名为 SF-NorMuon 的无调度谱优化器,用于解决传统神经网络训练中依赖固定学习率计划的问题。该方法在无需预设训练时间范围的情况下,能够在大规模语言模型上达到甚至超越精心调参的 AdamW 优化器的性能。研究还从理论上证明了无调度谱动态的稳定性保证,并指出快速迭代中的权重衰减对长期训练稳定性至关重要,为无需预设时间范围的持续学习提供了更实用的优化方案。

详情
AI中文摘要

标准神经网络训练依赖于与固定训练步数绑定的学习率调度,导致路径依赖性强,且当数据可用性变化时需要昂贵的重新调优。无调度(SF)方法通过移除显式调度来解决这一问题,然而当前最先进的任意时间优化器SF-AdamW始终不如调优后的AdamW基线。我们提出SF-NorMuon,一种无调度谱优化器,弥补了这一差距:使用单一超参数配置,SF-NorMuon在125M和772M参数的语言模型上,在$1$--$8\times$ Chinchilla训练步数范围内匹配或超过了调优的AdamW。在理论方面,我们证明了无调度谱动力学的平稳性保证,并指出快速迭代上的权重衰减对于长步数稳定性至关重要。SF-NorMuon使从业者能够在训练过程中的任何时刻获得高质量检查点,而无需预先承诺训练步数。通过缩小与调优基线的性能差距,SF-NorMuon使无步数优化更加实用,向真正开放式的持续学习迈出了一步。

英文摘要

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.

2605.23058 2026-05-25 cs.SE cs.AI 版本更新

A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

面向代理化 Kubernetes 操作的测量基础:方法论与检索复合证伪案例研究

Joshua Odmark, Gideon Rubin, Deon van der Vyver

发表机构 * Independent(独立) LDE Cognyx

AI总结 该论文提出了一种用于评估自主 Kubernetes 操作代理的测量框架 agent-breakage,旨在解决当前相关研究中缺乏可证伪性的问题。该框架通过注入故障并观察代理的响应,从四个维度进行评分,并记录带标签的状态-动作-结果元组,从而实现对代理行为的系统评估。研究通过案例分析揭示了检索历史故障报告对代理能力的影响,并指出当前研究中存在诸如选择偏差、样本量过小等潜在问题,展示了该方法在提升实验可信度方面的重要价值。

Comments 22 pages. Code at https://github.com/odmarkj/agent-breakage tag v0.1.0 (Apache 2.0). Source repo at https://github.com/odmarkj/agent-breakage-paper tag arxiv-v1

详情
AI中文摘要

关于自主 Kubernetes 操作代理的经验声明在很大程度上是不可证伪的。已发表的工作报告了观察结果,但没有与禁用代理的基线进行受控比较,选择偏差普遍存在,缺乏预注册的决策矩阵,并且样本通常太小,无法匹配底层评分系统的噪声水平。原因在于限制代理本身的相同差距:代码代理有一个验证基础,将“是否有效”转化为快速、可证伪的 ground-truth 信号,而操作领域没有等效物。我们提出 agent-breakage,一个闭环测量框架,向目标 Kubernetes 集群注入故障,观察自主代理如何响应,在四个轴上根据 ground truth 对响应进行评分,并累积带有结果标签的 (状态, 动作, 结果) 元组。该框架区分框架错误和推理错误,通过确定性嵌入器机制支持真正的关闭条件控制,并强制执行预注册的决策矩阵。我们将其作为案例研究,测试检索过去的故障后分析是否会复合代理的能力。方法论的贡献是框架在该案例研究中捕获的三个混杂因素,每个因素都会在同一个工作的仪器化程度较低的版本上产生错误的已发表声明:pgvector 索引错误、+19% 的选择偏差工件,以及将效应夸大大约 3 倍的小样本估计。检索结果本身是部分证伪:3 个密集语料场景中有 1 个在 p<0.05 时显著,合并效应 +3.9 个百分点,在 n=60 时不显著。在 360 次运行中进行的场景内语料密度扫描表明,近邻的机械对齐主导了原始计数。该框架已开源发布。

英文摘要

Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results without controlled comparisons against an agent-disabled baseline, selection bias is endemic, pre-registered decision matrices are absent, and samples are typically too small for the noise level of the underlying scoring system. The cause is the same gap that limits the agents themselves: code agents have a verification substrate that turns "did it work" into a fast, falsifiable, ground-truth signal, and operations has nothing equivalent. We present agent-breakage, a closed-loop measurement framework that injects faults into a target Kubernetes cluster, observes how an autonomous agent responds, scores the response on four axes against ground truth, and accumulates outcome-labeled (state, action, outcome) tuples. The framework distinguishes framework error from reasoning error, supports a true off-condition control via a deterministic-embedder mechanism, and enforces pre-registered decision matrices. We use it as a case study to test whether retrieval over past postmortems compounds an agent's capability. The methodological payload is three confounds the substrate caught during that case study, each of which would have produced a wrong published claim on a less instrumented version of the same work: a pgvector index bug, a +19% selection-bias artifact, and small-sample estimates that overstated effects by roughly 3x. The retrieval result itself is a partial falsification: 1 of 3 dense-corpus scenarios significant at p<0.05, pooled effect +3.9 percentage points, not significant at n=60. A within-scenario corpus-density sweep at 360 runs shows that mechanistic alignment of near-neighbors dominates raw count. The framework is released open source.

2605.23056 2026-05-25 cs.NI cs.AI 版本更新

DRL-Driven Edge-Aware Utility Optimization for Multi-Slice 6G Networks

DRL驱动的多切片6G网络边缘感知效用优化

Khaled M. Naguib, Soumaya Cherkaoui, Mahmoud M. Elmessalawy, Ahmed M. Abd El-Haleem, Ibrahim I. Ibrahim

发表机构 * CCAS Department, School of Engineering, New giza University(新吉萨大学工程学院CCAS系) Department of Computer and Software Engineering, Polytechnique Montreal(蒙特利尔理工学院计算机与软件工程系) Department of Electronics and Communications, Faculty of Engineering, Helwan University(赫勒万大学工程学院电子与通信系)

AI总结 本文研究了在6G网络中如何通过深度强化学习优化多切片网络的边缘感知效用,以满足虚拟现实等高要求业务的需求。提出了一种基于深度Q网络(DQN)的智能资源分配与边缘缓存框架,能够在O-RAN架构中实现多网络切片的动态资源调度与内容分发。该方法有效提升了网络延迟和吞吐量,为6G环境下的沉浸式VR应用提供了更可靠和响应更快的支持。

Comments 5 pages

Journal ref IEEE Networking Letters, vol. 8, pp. 14-18, 2026

详情
AI中文摘要

通过6G网络传输的虚拟现实(VR)服务需要超低延迟和高带宽,以确保无缝用户体验。本文提出了一种面向6G O-RAN网络的智能资源分配与边缘缓存框架,利用深度Q网络(DQN)学习优化O-RAN架构下多网络切片的边缘缓存和动态资源配置。通过将DRL代理集成到网络控制平面,所提系统能够实现主动和自适应内容分发以及实时计算资源分配,满足eMBB、URLLC,尤其是对VR至关重要的新兴MBRLLC切片的服务质量需求。仿真结果表明,基于DQN的框架在降低延迟和提高吞吐量方面始终优于传统方法,从而为6G环境中的沉浸式VR应用提供更可靠和响应更快的支持。

英文摘要

Virtual Reality (VR) services delivered over 6G networks demand ultra-low latency and high bandwidth to ensure seamless user experiences. This paper presents an intelligent resource allocation and edge caching framework for 6G O-RAN networks, leveraging Deep Q-Network (DQN) learning for optimizing edge caching and dynamic resource provisioning across multiple network slices within an O-RAN-compliant architecture. By incorporating DRL agents into the network control plane, the proposed system enables proactive and adaptive content distribution as well as real-time computational resource allocation that meets the quality-of-service demands of eMBB, URLLC, and especially the emerging MBRLLC slices essential for VR. Simulation results demonstrate that the DQN-based framework consistently outperforms traditional methods in reducing latency and improving throughput, leading to more reliable and responsive support for immersive VR applications in 6G environments.

2605.23054 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Model Collapse as Cultural Evolution

模型崩溃作为文化演化

Dongxin Guo, Jikun Wu, Siu Ming Yiu

发表机构 * The University of Hong Kong(香港大学) Stellaris AI Limited(Stellaris AI有限公司)

AI总结 本文研究了大型语言模型(LLM)在自训练过程中出现的“模型崩溃”现象,即模型输出质量逐渐下降的问题。作者引入文化进化中的迭代学习理论,提出五个可验证的预测,并通过多语言实验验证,发现模型的组合性结构在无过滤自训练下呈现非单调变化趋势,这一特征仅在任务导向的过滤机制下得以维持。研究为模型崩溃提供了语言学层面的解释,并为自训练流程的设计提供了具体原则。

Comments Accepted at CoNLL 2026. 18 pages, 3 figures, 2 tables

详情
AI中文摘要

模型崩溃,即在其自身输出上训练的LLM的逐步退化,已被统计表征,但缺乏对哪些结构退化、以何种顺序以及为何退化的语言学解释。我们表明,文化演化中的迭代学习理论填补了这一空白。我们推导出五个可证伪的预测,区分了那些对该理论具有独特判别性的预测与确认性预测,并通过在英语、德语和土耳其语中自训练LLaMA-2-7B和Mistral-7B达10代来测试它们。关键的判别性发现:在未过滤的自训练下,组合性遵循非单调轨迹(先上升后下降)。这一特征在最大规则种子数据下持续存在(排除了噪声去除),并且仅由任务导向的过滤维持,而非随机过滤,提供了压缩-通信权衡的首个LLM尺度证据。所有预测均得到确认,效应量较大(Hedges' $g > 1.6$;$\mathrm{BF}_{10} > 100$),且LLM正则化梯度与人类行为数据高度匹配($R^2 = 0.94$)。这些结果将模型崩溃重新定义为文化传播现象,并为自训练管道设计提供了具体原则。

英文摘要

Model collapse, the progressive degradation of LLMs trained on their own outputs, has been characterized statistically but lacks a linguistic explanation for which structures degrade, in what order, and why. We show that iterated learning theory from cultural evolution fills this gap. We derive five falsifiable predictions, distinguish those uniquely discriminative for the theory from confirmatory ones, and test them by self-training LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish. The critical discriminative finding: compositionality follows a non-monotonic trajectory (initially rising, then falling) under unfiltered self-training. This signature persists with maximally regular seed data (ruling out noise removal) and is sustained only by task-grounded filtering, not random filtering, providing the first LLM-scale evidence for the compression-communication tradeoff. All predictions are confirmed with large effect sizes (Hedges' $g > 1.6$; $\mathrm{BF}_{10} > 100$), and LLM regularization gradients closely match human behavioral data ($R^2 = 0.94$). These results reframe model collapse as a cultural transmission phenomenon and yield concrete principles for self-training pipeline design.

2605.23052 2026-05-25 cs.CL cs.AI 版本更新

DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods

DreamerNLplus: 使用混合规则和RAG方法从社交媒体时间线进行可解释的心理健康动态建模

Maryia Zhyrko, Daisy Monika Lal, Erik van Mulligen, Lifeng Han

发表机构 * Leiden Institute of Advanced Computer Science (LIACS), Leiden University(莱顿大学莱顿高级计算机科学研究所(LIACS)) School of Computing and Communications (SCC), Lancaster University(兰卡斯特大学计算与通信学院(SCC)) Department of Medical Informatics, Erasmus University Medical Center Rotterdam(鹿特丹伊拉斯姆斯大学医学中心医学信息学系) Biomedical Data Sciences, Leiden University Medical Center(莱顿大学医学中心生物医学数据科学系)

AI总结 本文提出了一种混合框架 DreamerNLplus,用于从社交媒体时间线中建模心理健康动态,参与了 CLPsych 2026 共享任务。该方法结合了基于规则和检索增强生成(RAG)的技术,分别用于心理状态建模、时间变化检测和序列级摘要任务,并在多个子任务中取得了优异成绩。研究揭示了心理健康动态建模中的关键挑战,如分类与回归性能的不匹配、时间过渡建模的困难,为未来研究提供了重要方向。

Comments Accepted by CLPsych2026. CLPsych 2026 will be held at ACL in San Diego July 4th, 2026

详情
AI中文摘要

我们提出DreamerNLplus,一个用于在CLPsych 2026共享任务中从社交媒体时间线建模心理健康动态的混合框架。我们的系统处理三个任务:心理状态建模、时间变化检测和序列级总结。对于任务1,我们结合基于LLM的数据增强、DeBERTa分类和随机森林回归进行结构化状态预测。对于任务2,我们使用本地部署的Llama 3.1模型进行少样本提示,利用短期时间上下文检测切换和升级事件。对于任务3.1,我们探索了确定性基于规则的总结流水线和基于LLM的少样本方法,官方排名第二。我们的基于RAG的方法在任务3.2中取得了强劲性能,在改善任务中排名第一,在恶化任务中排名第三,展示了其捕捉时间线上反复出现的心理变化模式的能力。我们的分析揭示了关键挑战,包括分类与回归性能之间的不匹配、时间转换建模的困难,以及基于语义和基于相似性的评估指标之间的不一致。这些发现凸显了建模心理健康动态的复杂性,并推动了未来关于统一评估框架的工作。我们在https://github.com/4dpicture/CLPsych2026分享我们的代码和提示。

英文摘要

We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence-level summarization. For Task 1, we combine LLM-based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few-shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short-term temporal context. For Task 3.1, we explore both a deterministic rule-based summarization pipeline and a few-shot LLM-based approach, ranking 2nd officially. Our RAG-based method achieves strong performance in Task 3.2, ranking 1st for Improvement and 3rd for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity-based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at https://github.com/4dpicture/CLPsych2026

2605.23045 2026-05-25 cs.CV cs.AI cs.LG 版本更新

The TIME Machine: On The Power of Motion for Efficient Perception

时间机器:论运动在高效感知中的力量

Mantas Skackauskas, Xinyue Hao, Laura Sevilla-Lara

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 本文提出了一种以运动为核心模态的视频表征学习方法,旨在解决现有视频模型在时序理解和训练成本方面的局限。通过使用点轨迹表示视频中的运动,并利用掩码自编码器进行自监督训练,模型能够学习到更高效且细粒度的视频表征。该方法无需依赖语言标注,大幅降低了训练数据需求,并在多项任务中展现出与当前先进模型相当的性能,为构建更高效、更具时序感知能力的视频模型提供了新方向。

详情
AI中文摘要

近年来,视频表示学习取得了巨大进展。这受到多种因素的推动,包括训练规模以及通过语言对比训练的视觉模型的成功。虽然这些因素推动了视频模型的能力边界,但它们也引入了自身的局限性:首先,扩展视频模型可能达到高昂的成本;其次,从语言学习限制了可学习概念的范围,仅限于字幕中的概念。因此,视频模型在时间理解方面仍然存在困难。在本文中,我们提出了一种新颖的方法,将运动作为视频表示的核心模态。具体而言,给定视频中以点轨迹形式存在的运动,我们使用掩码自编码器来掩码部分轨迹,并训练自编码器重建缺失的轨迹。这使我们能够以自监督方式学习表示。我们表明,使用运动来表示视频实际上解决了视频技术的两个核心局限性。首先,它使我们能够大幅减少训练数据的规模,因为运动本质上与外观无关,因此需要更少的样本就能很好地泛化。其次,运动使我们能够绕过依赖语言的训练范式,学习更细粒度的概念。结果是一种嵌入,我们称之为TIME(时间感知运动嵌入),这是一种仅使用合成运动数据训练的表示。我们在零样本方式下对广泛的任务测试了这种嵌入。我们观察到,无需额外技巧,在所用训练数据最多少4个数量级的情况下,其性能仍与最先进模型相当。这为迈向更有时序感知且更具可扩展性的视频模型新范式奠定了基础。

英文摘要

Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs, and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models while using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware and more scalable.

2605.23039 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs

语言模型知道不该说什么吗?大语言模型中统计预占的因果证据

Dongxin Guo, Jikun Wu, Siu Ming Yiu

发表机构 * The University of Hong Kong(香港大学) Stellaris AI Limited(Stellaris AI有限公司)

AI总结 本研究探讨了语言模型如何通过分布竞争机制习得语言禁忌知识,提出统计预占(statistical preemption)是关键机制。通过四个实验,研究发现语言模型对非常规结构的惊讶度(surprisal)与人类可接受性判断高度相关,并且这种模式由竞争形式的频率驱动,而非动词整体频率。研究还表明,预占敏感性随模型规模呈幂律增长,并通过可控微调实验验证了竞争形式频率对预占行为的因果影响,为构造语法理论提供了计算支持。

Comments Accepted at CoNLL 2026. 21 pages (9 main body + appendices and references); 4 figures, 14 tables

详情
AI中文摘要

学习者在没有负面证据的情况下如何获得关于不可接受性的知识?构式语法提出了统计预占:接触常规形式(例如,“donated the books to the library”)会预占结构上可能但未经验证的替代形式(“*donated the library the books”)。我们提出了一项计算研究,首次在单一收敛设计中直接分离了大语言模型中的统计预占与竞争性固化假说。通过跨越120个英语动词-构式配对(与格、使役、方位格)的四个实验,我们表明:(1)大语言模型的惊讶度模式与人类可接受性判断强相关(r = 0.79),并在三个独立的行为数据集上得到验证;(2)这些模式由竞争形式频率驱动,而非整体动词频率,通过非循环偏相关得到确认;(3)预占敏感度随模型规模呈幂律增长;(4)一项受控微调干预因果地表明,操纵竞争形式频率会按预测方向改变预占行为,反向控制排除了频率敏感性混淆。这些结果提供了汇聚证据,表明神经语言模型通过分布竞争(构式语法所提出的核心机制)习得负面语言知识。

英文摘要

How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structurally possible but unattested alternatives ("*donated the library the books"). We present a computational study that, for the first time, directly dissociates statistical preemption from the competing entrenchment hypothesis in large language models within a single converging design. Across four experiments spanning 120 English verb-construction pairings (dative, causative, locative), we show that (1) LLM surprisal patterns correlate strongly with human acceptability judgments ($r = 0.79$), validated against three independent behavioral datasets; (2) these patterns are driven by competing-form frequency rather than overall verb frequency, confirmed by non-circular partial correlations; (3) preemption sensitivity scales as a power law with model size; and (4) a controlled fine-tuning intervention causally demonstrates that manipulating competing-form frequencies shifts preemption behavior in the predicted direction, with reverse-direction controls ruling out frequency-sensitivity confounds. These results provide converging evidence that neural language models acquire negative linguistic knowledge through distributional competition, the core mechanism posited by Construction Grammar.

2605.23035 2026-05-25 cs.CL cs.AI q-bio.NC 版本更新

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

稀疏自编码器将大脑-LLM对齐映射到皮层语义拓扑

Dongxin Guo, Jikun Wu, Siu Ming Yiu

发表机构 * The University of Hong Kong(香港大学) Stellaris AI Limited(Stellaris AI有限公司)

AI总结 该研究探讨了大型语言模型(LLM)中间层与人类大脑语言响应之间的对应关系,并利用稀疏自编码器(SAEs)对其进行机制解释。通过将SAEs与神经编码模型结合,研究者分解了GPT-2 XL和Llama-3.1-8B模型,提取出每层1.6万至3.2万个可解释特征,并验证了语义特征在预测大脑编码性能中的主导作用。研究进一步表明,SAE提取的语义特征能够重现大脑皮层的语义拓扑结构,并在多种语言中展现出良好的泛化能力。

Comments Accepted at CoNLL 2026. 20 pages (9 main + 1 limitations/acknowledgments + 3 references + 7 appendix), 5 figures, 20 tables

详情
AI中文摘要

大型语言模型(LLM)的中间层最能预测人脑对语言的反应,这是计算神经语言学中最稳健的发现之一,但其机制原因仍未得到解释。我们通过将可解释性机制中的稀疏自编码器(SAE)与神经编码模型相结合来填补这一空白,将GPT-2 XL和Llama-3.1-8B分解为每层16K-32K个可解释特征。一个人工验证的分类法(κ≥0.74)显示,仅语义特征就恢复了94%的峰值编码性能(r=0.285),显著超过了方差匹配的基线(p<0.001,d=1.31)。除了这种总体主导性之外,我们还测试了一个新颖的皮层拓扑预测:从三个独立神经科学项目先验导出的五个语义子类别应映射到不同的大脑区域。一个正式的收敛测试证实了这种对齐(Spearman ρ=0.72,p<0.001;超几何p=0.007),表明SAE发现的特征以先前方法无法达到的粒度重现了已知的皮层语义组织。SAE特征进一步预测了超出词汇控制的人类阅读时间(ΔlogLik=38.4,p<0.001),并且一项探索性的预测误差分析提供了初步证据,表明大脑还编码了意外的语义内容。结果在英语、中文和法语中具有普适性。

英文摘要

Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap by bridging sparse autoencoders (SAEs) from mechanistic interpretability with neural encoding models, decomposing GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. A human-validated taxonomy ($κ\geq 0.74$) reveals that semantic features alone recover 94% of peak encoding performance ($r=0.285$), substantially exceeding variance-matched baselines ($p<0.001$, $d=1.31$). Beyond this aggregate dominance, we test a novel cortical topography prediction: five semantic subcategories derived a priori from three independent neuroscience programs should map onto distinct brain regions. A formal convergence test confirms this alignment (Spearman $ρ=0.72$, $p<0.001$; hypergeometric $p=0.007$), demonstrating that SAE-discovered features recapitulate known cortical semantic organization at a granularity inaccessible to prior methods. SAE features further predict human reading times beyond lexical controls ($Δ\mathrm{logLik}=38.4$, $p<0.001$), and an exploratory prediction-error analysis provides preliminary evidence that the brain additionally encodes unexpected semantic content. Results generalize across English, Chinese, and French.

2605.23033 2026-05-25 cs.LG cs.AI 版本更新

Uncovering the Latent Potential of Deep Intermediate Representations

揭示深度中间表示的潜在能力

Arnesh Batra, Arush Gumber, Aniket Khandelwal, Jashn Khemani, Anubha Gupta

发表机构 * SBILab, Indraprastha Institute of Information Technology Delhi, Delhi, India(SBILab,印度德里英德拉普拉斯塔信息技术学院)

AI总结 本文研究了深度神经网络中间表示的潜在价值,指出任务相关信息在不同层中非单调分布,不能通过简单聚合恢复。为此,作者提出了一种基于谱分析的层选择方法LOES,以及几何正则化损失GeoReg,以识别任务区分性子空间并稳定表示几何结构。实验表明,该方法在多种模型和数据条件下均优于基线,且效果随模型深度增加而提升,同时揭示了语义因素在层间的分布规律,有助于跨语言和跨模态的可解释性分析。

Comments Accepted to ICML2026 as a Spotlight

详情
AI中文摘要

在海量数据上预训练的基础模型学习到随深度演化的表示,形成具有不同语义内容和几何结构的嵌入层次。与仅使用最后一层或浅层混合的普遍做法相反,我们表明任务相关信息在层间非单调分布,且无法通过简单聚合恢复。通过跨多种模态的几何与实证研究,我们表明有效迁移依赖于识别哪些层编码任务判别结构以及它们的嵌入如何几何组织。我们提出层最优嵌入选择(LOES),一种构造性谱方法,通过在正交性和各向同性约束下最小化残差误差来识别任务判别子空间。为了将微调与此选择原则对齐,我们进一步提出几何正则化损失(GeoReg),它在微调期间对类流形施加单纯形结构并稳定表示几何。在广泛的架构、深度、模态和数据规模下,LOES 持续优于标准基线,且随着模型深度增加收益增长。除了准确性,我们的方法揭示了语义因素如何在层间分布,从而实现了跨语言和跨模态的可解释性分析。总之,我们的结果提供了强有力的证据,表明逐层嵌入几何不是偶然的,而是深度模型表示和迁移知识的核心。

英文摘要

Foundational Models pretrained on a huge amount of data learn representations that evolve across depth, forming a hierarchy of embeddings with distinct semantic content and geometric structure. Contrary to the widespread practice of using only the final layer or shallow mixtures, we show that task-relevant information is distributed non-monotonically across layers and cannot be recovered by naïve aggregation. Through a geometric and empirical study across multiple modalities, we show that effective transfer depends on identifying which layers encode task-discriminative structure and how their embeddings are geometrically organized. We introduce Layer-wise Optimal Embedding Selection (LOES), a constructive spectral method that identifies task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints. To align fine-tuning with this selection principle, we further propose Geometric Regularization Loss (GeoReg), which enforces a simplicial structure on class manifolds and stabilizes representation geometry during fine-tuning. Across a wide range of architectures, depths, modalities, and data regimes, LOES consistently outperforms standard baselines, with gains that grow as model depth increases. Beyond accuracy, our method reveals how semantic factors are distributed across layers, thereby enabling cross-lingual and cross-modal interpretability analyses. Together, our results provide strong evidence that layerwise embedding geometry is not incidental but central to how deep models represent and transfer knowledge.

2605.23032 2026-05-25 cs.CL cs.AI q-bio.NC 版本更新

Brain-LLM Alignment Tracks Training Data, Not Typology

大脑-大语言模型对齐追踪训练数据,而非语言类型学

Dongxin Guo, Jikun Wu, Siu Ming Yiu

发表机构 * The University of Hong Kong(香港大学) Stellaris AI Limited(Stellaris AI有限公司)

AI总结 该研究探讨了大脑与大语言模型(LLM)之间的对齐模式是否具有跨语言泛化能力,发现对齐模式主要由模型训练语言的主导性决定,而非英语本身的特性。通过对比多种语言的fMRI数据和不同语言主导的LLM,研究发现以中文为主导训练的模型在与中文大脑对齐时表现最佳,而与英语大脑对齐最差。此外,语言类型学距离、句法相关脑区的梯度差异以及分词粒度等因素也对对齐效果产生显著影响,揭示了此前观察到的“英语优势”主要源于训练数据的组成,而非语言结构本身的特性。

Comments Accepted to CoNLL 2026. 9 pages main content + 4 pages references + 6 pages appendix; 4 figures, 13 tables

详情
AI中文摘要

大脑-大语言模型对齐在英语中已得到充分证实,然而大脑的语言网络在神经解剖学上跨语言具有普遍性。这种对齐是否也能跨语言泛化,以及什么因素决定了其变化?我们使用来自英语、中文和法语(《小王子》语料库)112名参与者的fMRI数据,以及涵盖英语主导、中文主导和多语言架构的七种大语言模型进行了测试。我们的核心发现是,训练语言主导性(而非英语的固有属性)驱动了对齐模式:一个中文主导模型(Baichuan2-7B),其架构与LLaMA-2-7B匹配,完全逆转了梯度,与中文大脑对齐最佳,与英语对齐最差。除训练主导性外,形式类型学距离独立地与对齐退化共变,与句法相关的大脑区域(IFG)显示出比词汇语义区域(PTL)陡峭2.3倍的类型学梯度,而分词丰度解释了跨语言最优编码层转移的约60%。这些结果表明,大脑-大语言模型对齐中明显的“英语优势”是训练数据组成的假象,而剩余的变化反映了集中在句法处理中的真实类型学结构。

英文摘要

Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this using fMRI data from 112 participants across English, Chinese, and French (the Le Petit Prince corpus) and seven LLMs spanning English-dominant, Chinese-dominant, and multilingual architectures. Our central finding is that training-language dominance, not an inherent property of English, drives the alignment pattern: a Chinese-dominant model (Baichuan2-7B), architecture-matched to LLaMA-2-7B, reverses the gradient entirely, aligning best with Chinese brains and worst with English. Beyond training dominance, formal typological distance independently covaries with alignment degradation, syntax-associated brain regions (IFG) show $2.3\times$ steeper typological gradients than lexico-semantic regions (PTL), and tokenization fertility accounts for $\sim$60% of a cross-linguistic shift in optimal encoding layer. These results reveal that the apparent "English advantage" in brain-LLM alignment is an artifact of training data composition, while the remaining variation reflects genuine typological structure concentrated in syntactic processing.

2605.23024 2026-05-25 cs.AI cs.CC cs.CL cs.LG 版本更新

The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

确定性视界:作为可信AI系统设计规范的不可行性结果

Dongxin Guo

AI总结 本文探讨了可信人工智能系统设计中由计算理论根本限制所带来的边界问题,提出将不可行性定理转化为系统设计规则的新方法。其核心结果证明,准确率存在一个仅由架构决定的上限:超过临界推理深度后,无论训练数据量、适配器秩或损失函数如何,训练都无法突破这一上限。该“确定性视界”可在部署前通过模型层数和嵌入宽度计算得出。研究还展示了这一理论在多个AI子领域中的应用,形成一套包含十六项设计规范的目录,为构建更可靠的人工智能系统提供了理论依据和设计指导。

Comments PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms

详情
AI中文摘要

大型语言模型现在编写软件、起草法律文件并生成临床笔记,但从图灵、阿罗到没有免费午餐定理的基本极限,塑造了计算的能力。本文将这些不可行性结果从奇闻转化为设计规则。其旗舰结果证明了仅由架构设定的准确率上限:超过关键推理深度后,无论适配器秩、样本大小或损失函数如何,训练都无法改变它。该确定性视界在部署前可从层数和嵌入宽度计算,在十二种Transformer架构中测量值介于19到31之间,而在最优长度轨迹上微调可恢复不到4个百分点。其机制是残差流的容量不变性,信息论转换得出超过视界后准确率超指数衰减。一个针对模幂的无条件电路复杂度下界(对抗常数深度素数模电路)补充了这一结果。同样的论证重新应用于多个子领域:任何错误指定模型下的偏好学习在样本复杂度上出现不连续跳跃;多阶段检索流水线至少需要与阶段数一样多的独立指标;标准诚实拍卖对于具有提示相关估值的智能体失效;神经推理的零知识验证为每个非线性激活支付110到190倍的测量开销。这些共同构成了一个包含16条规范的目录,每条规范配对一个可计算边界、一个量化违反成本和一个建设性设计规则:两个组合已被证明,一个配对是诚实障碍,四个保持开放。本文为可信AI可能需要的生成式研究计划提供了不可行性规范方法论。AI的每一个基本极限也是一个设计规则。

英文摘要

Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

2605.23007 2026-05-25 q-fin.TR cs.AI cs.LG q-fin.PM 版本更新

MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models

MadEvolve: 基于大型语言模型的交易系统进化优化

Yurii Kvasiuk, Tianyi Li, Owen Colegrove, Moritz Münchmeyer

发表机构 * Department of Physics, University of Wisconsin–Madison(威斯康星大学麦迪逊分校物理系) Event Horizon Labs(事件地平线实验室)

AI总结 本文提出了一种基于大型语言模型的进化优化框架MadEvolve,用于优化量化交易系统,特别是在比特币交易中的策略生成与执行。该方法通过进化算法优化交易策略的特征集、策略组件及整体流程,显著提升了交易表现。研究还对比了其他智能搜索方法,并评估了模拟环境中的p-hacking概率,验证了AI驱动的进化算法在量化金融中的有效性。

详情
AI中文摘要

我们探索了将LLM驱动的算法优化应用于量化金融中的几个常见任务。MadEvolve是一个受DeepMind的Alpha-Evolve启发的通用算法优化框架,最近被开发用于优化计算宇宙学中的算法。在此,我们以比特币交易为例,展示了MadEvolve在优化算法交易策略和alpha生成方面的实用性。在我们的模拟和回测设置中,我们在所有考虑的任务上取得了显著改进,例如演化用于信号生成的特征集、优化交易策略的独立组件,以及联合演化特征流水线与执行策略。此外,我们将我们的方法与其他智能搜索方法(特别是Claude Code)进行了比较,并仔细评估了模拟设置中的p-hacking概率。我们的发现强烈支持AI驱动的智能和进化算法在算法交易和量化金融中的实用性。

英文摘要

We explore the application of LLM-driven algorithm optimization to several common tasks in quantitative finance. MadEvolve, a general-purpose algorithm optimization framework inspired by DeepMind's Alpha-Evolve, was recently developed to optimize algorithms in computational cosmology. Here we demonstrate the utility of MadEvolve for optimizing algorithmic trading strategies and alpha generation, using Bitcoin trading as an example. On our simulation and backtesting setup, we achieve significant improvements on all tasks we considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy. Additionally, we compare our method to other agentic search approaches, specifically Claude Code, and carefully evaluate p-hacking probabilities on our simulation setup. Our findings strongly support the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading and quantitative finance.

2605.22995 2026-05-25 cs.CY cs.AI 版本更新

Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good

谁之善,谁之地?面向社会公益的能动型AI的道德地理学

Poli Nemkova, Haeshitha Indukuri, Jaedon Charles

发表机构 * University of North Texas(北德克萨斯大学) Florida International University(佛罗里达国际大学)

AI总结 本文研究了用于社会公益的智能代理AI系统在道德地理方面的不对称性,指出尽管这类系统常以联合国可持续发展目标(SDGs)为依据,但很少明确说明其地理背景,尤其在需要考虑地方政治、法律和文化因素的领域更为明显。研究分析了2015至2026年间112篇相关论文,发现仅25%的论文报告了实际部署或小规模测试,揭示了在责任归属、参与性和透明度方面的多重缺口,并提出了更具体、参与性更强的AI系统报告标准。

详情
AI中文摘要

能动型AI系统越来越多地被提出用于社会公益领域,通常引用联合国可持续发展目标(SDGs)作为全球利益的词汇。然而,社会公益的主张并未建立对系统声称服务的社区的问责。我们对2015年至2026年间发表的112篇关于社会公益的能动型AI论文进行了结构化调查。我们发现一种道德地理不对称:论文在最需要当地政治、法律和文化背景的领域最不可能指定地理背景。在整个语料库中,112篇论文中有82篇(73%)未指定任何地理背景。与健康或物理/生态SDGs相关的论文指定地理背景的比例为37-40%,而与制度和社会政策SDGs相关的论文仅13%。SDG 16(和平、正义与强大机构)既是语料库中覆盖最多的目标,也是地理指定率最低的目标。我们将此解释为道德抽象:面向社会公益的能动型AI往往将制度性善视为普适的,而不同于对待健康或生态善的方式。第二个发现加剧了这一点:112篇论文中只有28篇(25%)报告了任何实际部署或小规模测试。我们识别出五个问责缺口,并提出了一个最低报告标准,以促进更具体情境、参与性和负责任的面向社会公益的能动型AI。

英文摘要

Agentic AI systems are increasingly proposed for social-good domains, often invoking the United Nations Sustainable Development Goals (SDGs) as a vocabulary of global benefit. Yet claims of social good do not establish accountability to the communities a system claims to serve. We present a structured survey of 112 papers on agentic AI for social good published between 2015 and 2026. We find a moral-geographic asymmetry: papers are least likely to specify geographic context in precisely the domains where local political, legal, and cultural context matters most. Across the corpus, 82 of 112 papers (73%) specify no geographic context. Papers aligned with health or physical/ecological SDGs specify geography 37-40% of the time, while papers aligned with institutional and social-policy SDGs do so only 13%. SDG 16, peace, justice, and strong institutions, is both the most-covered goal in the corpus and the one with the lowest geographic-specification rate. We interpret this as moral abstraction: agentic AI for social good often treats institutional good as universal in ways it does not treat health or ecological good. A second finding compounds this: only 28 of 112 papers (25%) report any real-world deployment or small-scale test. We identify five accountability gaps and propose a minimal reporting standard for more context-specific, participatory, and accountable agentic AI for social good.

2605.22993 2026-05-25 cs.CL cs.AI 版本更新

A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

一种主动式多智能体对话框架用于评估自闭症中的社交语言障碍特征

Chuanbo Hu, Minglei Yin, Bin Liu, Wenqi Li, Lynn K. Paul, Shuo Wang, Xin Li

发表机构 * Department of Computer Science(计算机科学系) University at Albany(奥尔巴尼大学) Department of Management Information System(管理信息系统系) West Virginia University(西弗吉尼亚大学) Department of Radiology(放射学系) Washington University in St. Louis(圣路易斯华盛顿大学) Humanities and Social Sciences(人文与社会科学)

AI总结 该研究提出了一种名为TPA的主动多智能体对话框架,用于评估自闭症谱系障碍中的社会语言障碍(SLD)特征。该框架通过医生智能体主动选择针对性的问题策略,以系统性地揭示患者对话中潜在的语言障碍特征,从而提高诊断效率。实验表明,TPA在多个关键指标上优于现有基线方法,显著提升了SLD特征的覆盖率和诊断效率,为AI辅助临床筛查提供了重要支持。

详情
AI中文摘要

与自闭症谱系障碍中社交语言障碍(SLD)相关的特征性语言行为,包括回声性重复、代词位移和刻板媒体引用,在自发对话中基本不存在,仅在特定对话条件下出现。在结构化临床评估中,这种潜隐性意味着提问策略选择是决定对话产生多少诊断信息的关键但未被充分重视的因素。大型语言模型(LLMs)能否被引导主动选择系统地揭示这些潜在特征的提问策略,在很大程度上仍未探索。本文提出TPA(思考、计划、询问),一种应用于自闭症诊断观察量表模块4(ADOS-2)语言评估部分的主动式多智能体对话框架,其中医生智能体在选择临床依据策略并生成针对性问题之前,明确推理哪些特征尚未观察到。基于真实ADOS-2临床数据的患者智能体使得无需真实患者参与即可进行可重复评估,并通过三个独立实验验证,确认其对真实患者语言具有足够的保真度。在来自35名患者的484个片段上评估,TPA在所有主要指标上优于六个竞争性对话规划基线,实现了82.1%的SLD特征覆盖率,比训练有素的临床医生进行的真实临床对话自动回放(65.5%)高16.6个百分点,并且每轮诊断效率显著更高(AUCC:0.628 vs. 0.458,绝对增益+0.170)。这些结果表明,主动提问策略选择显著提高了自动化SLD特征评估的效率,对可扩展的AI辅助临床筛查具有直接意义。

英文摘要

Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS-2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS-2 clinical data enables reproducible evaluation without real patient participation, validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6 percentage points higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI-assisted clinical screening.

2605.22986 2026-05-25 cs.RO cs.AI cs.HC cs.LG 版本更新

Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations

知道该问什么的机器人:通过有针对性的解释恢复未对齐的奖励

Helena Merker, Nick Walker, Andreea Bobu

AI总结 该研究针对从人类示范中学习奖励函数时存在的特征不充分问题,提出了一种通过有针对性的解释来识别并修正奖励函数偏差的框架。核心方法基于分析示范数据中各特征的一致性,识别出未充分说明的特征,并通过自然语言解释这些不确定性,主动请求针对性的补充示范。实验表明,该方法在模拟和真实机器人任务中显著提升了奖励函数的学习效果,优于随机查询和被动数据收集的方式。

详情
AI中文摘要

从演示中学习奖励函数假设演示对所有特征(或行为中与任务相关的方面)提供了充分的监督。实际上,演示往往不完美:由于认知负荷或物理难度,人类可能低估某些特征,或者训练机制可能未能充分覆盖所有相关情况。无论哪种情况,重要特征可能未被充分指定,导致学习到的奖励函数存在歧义,并在部署时出现未对齐的行为。我们提出一个框架,检测此类未充分指定的特征,并主动请求有针对性的纠正演示。我们的关键洞察是,演示隐含地揭示了哪些特征被良好指定:一致优化的特征在演示之间变化很小,而未充分指定的特征则变化很大。我们利用这一统计信号推断哪些特征可能未被充分演示。然后,机器人用自然语言解释它不确定哪些特征,并请求明确解决已识别差距的演示。我们在模拟桌面操作领域和真实Franka机器人的用户研究中评估了我们的方法。与随机查询和被动数据收集相比,有针对性的、解释引导的查询显著改善了奖励恢复,减少了否则会从有缺陷的演示中持续存在的歧义。

英文摘要

Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features -- or task-relevant aspects of behavior. In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.
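
The variance signal described in this abstract can be illustrated with a minimal sketch. It assumes each demonstration has already been summarized as a fixed-length feature vector; the function names, the feature names, and the fixed variance threshold are hypothetical choices for illustration and are not taken from the paper.

```python
from statistics import pvariance


def feature_variances(demo_features):
    """Population variance of each feature across demonstrations.

    demo_features: list of equal-length feature vectors, one per demonstration.
    """
    n_features = len(demo_features[0])
    return [pvariance([demo[j] for demo in demo_features])
            for j in range(n_features)]


def underspecified_features(demo_features, names, threshold):
    """Names of features whose variance across demonstrations exceeds
    `threshold` (a hypothetical cutoff, not the paper's criterion)."""
    variances = feature_variances(demo_features)
    return [name for name, var in zip(names, variances) if var > threshold]


# Hypothetical data: the first feature is held nearly constant by the
# demonstrator, the second varies widely from one demonstration to the next.
demos = [[0.9, 0.1], [0.9, 0.8], [0.9, 0.5]]
flagged = underspecified_features(
    demos, ["distance_to_table", "cup_tilt"], threshold=0.01)
```

A robot following the abstract's recipe would then explain the flagged features in natural language and ask for demonstrations that address them; that querying step is outside this sketch.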

2605.22984 2026-05-25 cs.LG cs.AI 版本更新

Test-Time Training Undermines Safety Guardrails

测试时训练削弱安全护栏

Simone Antonelli, Sadegh Akhondzadeh, Aleksandar Bojchevski

发表机构 * CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心) University of Cologne(科隆大学)

AI总结 本文研究了测试时训练(Test-Time Training, TTT)在提升模型性能的同时所带来的安全风险。作者指出,TTT允许模型在推理过程中动态调整参数,虽然能增强模型在少样本学习、检索增强生成等任务中的表现,但也引入了新的攻击漏洞,使模型更容易被绕过安全防护。实验表明,TTT显著提高了攻击成功率,并在不同规模模型中表现出高度的可转移性。为此,作者提出了一种基于困惑度变化的轻量级检测方法,以识别潜在的TTT攻击请求。

Comments 30 pages, 4 figures. Project page: https://uoc-tail.github.io/ttt-jailbreak/

详情
AI中文摘要

测试时训练(TTT)是一种新兴范式,使模型在推理过程中调整参数,从而提升少样本学习、检索增强生成和复杂推理等任务的性能。然而,这种动态适应引入了攻击者可利用的新漏洞来越狱模型。我们识别了TTT的三种威胁模型,并演示了攻击者如何利用它们绕过安全过滤器。我们的结果表明,TTT可以显著提高攻击成功率(ASR)以及10次生成试验上的ASR(ASR@10)。例如,在LoRA下,少样本和生成阶段威胁模型在不同家族和规模的模型上平均ASR@10分别达到95%和93%。这些漏洞可迁移到生产级微调API。我们还展示了TTT引发的过拟合可能产生退化输出,在标准评判下夸大ASR,并提出了一个有效性感知评估来纠正这一点。我们的发现表明,TTT暴露了新的攻击面,增强了攻击,并削弱了现有的安全护栏。作为防御的第一步,我们提出了一个轻量级的提供商侧检测器,通过私有有害保留集上的困惑度偏移来标记TTT请求,但稳健部署最终需要动态对齐。

英文摘要

Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models. We identify three threat models for TTT and demonstrate how attackers can leverage them to bypass safety filters. Our results show that TTT can significantly increase the Attack Success Rate (ASR) and the ASR over 10 generation trials (ASR@10). For example, under LoRA, the few-shot and generation-phase threat models achieve an average ASR@10 of 95% and 93% respectively, across models from different families and scales. These vulnerabilities transfer to production fine-tuning APIs. We also show that TTT-induced overfitting can produce degenerate outputs that inflate ASR under standard judges, and propose a validity-aware evaluation to correct for this. Our findings suggest that TTT exposes a new attack surface, strengthens attacks, and undermines existing safety guardrails. As a first step toward defense, we propose a lightweight provider-side detector that flags TTT requests via the perplexity shift on a private harmful holdout, but robust deployment will ultimately require dynamic alignment.
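
The perplexity-shift idea behind the proposed detector can be sketched as follows. This is an illustration under stated assumptions, not the paper's detector: it assumes the provider can score a private harmful holdout text before and after the requested adaptation, that a sharp drop in perplexity is the suspicious direction, and that a fixed ratio threshold is used. The function names and the threshold value are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flag_ttt_request(logprobs_before, logprobs_after, ratio_threshold=0.5):
    """Flag a request when holdout perplexity after adaptation falls below
    `ratio_threshold` times the perplexity before adaptation.

    Both arguments are token log-probabilities of the same harmful holdout
    text, scored by the model before and after the test-time update.
    """
    before = perplexity(logprobs_before)
    after = perplexity(logprobs_after)
    return after / before < ratio_threshold


# Hypothetical scores: the adapted model finds the harmful holdout far more likely.
suspicious = flag_ttt_request([math.log(0.1)] * 3, [math.log(0.5)] * 3)
benign = flag_ttt_request([math.log(0.1)] * 3, [math.log(0.1)] * 3)
```

In practice the threshold would need calibration against benign adaptation requests, which also shift perplexity to some degree.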

2605.22981 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Memorization Dynamics of Fill-in-the-Middle Pretraining

Fill-in-the-Middle 预训练的记忆动态

Tobias von Arx, Tanguy Dieudonné

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系)

AI总结 本文研究了“填中”(FIM)预训练目标对语言模型逐字记忆能力的影响。通过在包含重复内容的语料库上训练匹配的Llama 3.2模型,发现FIM更倾向于恢复短或部分匹配的文本片段,而传统的从左到右(LTR)方法则更常对长段精确续写赋予高置信度。实验还表明,FIM训练下的逐字记忆能力随重复次数近似线性增长,并且后缀上下文不足以支持准确回忆,前缀上下文在其中起关键作用。研究强调了单一评估方式可能忽略记忆行为的复杂性。

Comments MemFM @ ICML 2026

详情
AI中文摘要

Fill-in-the-Middle (FIM) 是一种广泛用于赋予因果语言模型填充能力的预训练目标,但其对逐字记忆的影响尚未充分探索。我们在受控设置中研究 FIM 的记忆动态,通过在包含重复 Gutenberg 摘录的 FineWeb-Gutenberg 语料库上,使用 FIM 和标准从左到右 (LTR) 目标预训练匹配的 Llama 3.2 模型。基于前缀的探测表明,FIM 更常恢复短片段或部分匹配的跨度,而 LTR 更常对长精确延续赋予高置信度。我们观察到,在测试范围内,FIM 训练下的逐字提取随重复次数近似线性增长。评估原生 FIM 格式的探测显示,后缀上下文并不足够:FIM 训练下的逐字回忆仍然强烈锚定于前缀上下文。我们的结果还表明,仅评估一种跨度长度或探测格式可能会遗漏记忆行为中的重要细微差别。

英文摘要

Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.

2605.22976 2026-05-25 cs.SE cs.AI 版本更新

LLM Code Smells: A Taxonomy and Detection Approach

LLM 代码异味:分类与检测方法

Zacharie Chenail-Larcher, Brahim Mahmoudi, Naouel Moha, Quentin Stiévenart, Florent Avellaneda

发表机构 * École de technologie supérieure(高等技术学院) Université du Québec à Montréal(魁北克大学蒙特利尔分校)

AI总结 本文研究了大语言模型(LLM)在软件系统中集成时可能引入的代码异味问题,提出了一个包含九类LLM代码异味的分类体系,并开发了静态分析工具SpecDetect4LLM用于检测这些异味。通过对692个开源项目进行实证评估,结果表明近74%的系统存在LLM代码异味,检测精度达91.3%,召回率为71.8%,为开发者提供了识别和改进LLM集成质量的有效手段。

详情
AI中文摘要

大型语言模型(LLM)因其多功能性、灵活性以及在某种程度上模拟人类推理的能力,越来越多地被集成到软件系统中用于各种目的。然而,源代码中LLM推理的糟糕集成可能会损害软件系统的质量。因此,必须记录不充分的LLM集成编码实践,以帮助开发者缓解此类问题。基于我们先前关于LLM代码异味的工作,本文通过呈现一个自包含的分类体系和包含九种LLM代码异味的目录,巩固并完善了这一概念。我们还创建了SpecDetect4LLM,一个用于检测这些异味的静态源代码分析工具,并对其检测效果(精确率和召回率)以及LLM代码异味在692个开源软件项目(171,194个源文件)中的普遍性进行了广泛的实证评估。结果表明,LLM代码异味影响了73.5%的被分析系统,检测精确率为91.3%,召回率为71.8%。

英文摘要

Large Language Models (LLMs) are increasingly integrated into software systems for diverse purposes, due to their versatility, flexibility, and ability to simulate human reasoning to some extent. However, poor integration of LLM inference in source code can undermine software system quality. Therefore, inadequate LLM integration coding practices must be documented to help developers mitigate such issues. Following our earlier work on LLM code smells, this paper consolidates and refines the concept by presenting a self-contained taxonomy and a catalog of nine LLM code smells. We also create SpecDetect4LLM, a static source code analysis tool for their detection, and conduct extensive empirical evaluations of its detection effectiveness (precision and recall) as well as the prevalence of LLM code smells across 692 open-source software projects (171,194 source files). Our results show that LLM code smells affect 73.5% of the analyzed systems, with a detection precision of 91.3% and a recall of 71.8%.

2605.22973 2026-05-25 cs.LG cs.AI 版本更新

Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection

比随机更差:无监督特征选择中基线的重要性

Muhammad Rajabinasab, Michael E. Houle, Oussama Chelly, Arthur Zimek

发表机构 * University of Southern Denmark(丹麦南部大学) New Jersey Institute of Technology(新泽西理工学院) Oratio Technologies(Oratio技术公司)

AI总结 本文探讨了无监督特征选择方法的评估基准问题,指出当前多数方法缺乏与随机特征选择这一基准的比较,难以衡量其实际贡献。作者提出应将随机特征选择作为评估基准,并通过实验证明许多先进方法在性能和效率上均不如随机选择。因此,研究强调在开发新的无监督特征选择方法时,必须以随机选择为基准,以确保方法的有效性与改进价值。

Comments Preprint submitted to Elsevier Pattern Recognition Letters

详情
AI中文摘要

每年都有许多新的无监督特征选择方法被提出,但它们的实证评估仅限于在选定数据集上计算的监督和无监督评估指标,以及与现有方法的比较。然而,在缺乏既定评估基线的情况下,很难确定每种方法对现有文献的附加值,以及它们底层方法的有效性。我们提出使用随机特征选择作为评估无监督特征选择方法的基线。我们通过实证表明,许多最先进的无监督特征选择方法在性能和效率上均不如随机特征选择。因此,我们强调在开发新的无监督特征选择方法时,必须严格考虑将随机特征选择作为基线,以确保相对于随机特征选择的一致改进。

英文摘要

Many novel unsupervised feature selection methods are proposed each year, yet their empirical evaluation is limited to supervised and unsupervised evaluation metrics computed on selected datasets, along with comparisons to existing methods. However, in the absence of an established evaluation baseline, it is difficult to determine the value added to the existing literature by each of these methods, and how effective their underlying approaches are. We propose using random feature selection as a baseline for evaluating unsupervised feature selection methods. We empirically show that many of the state-of-the-art methods in unsupervised feature selection are outperformed by random feature selection in both performance and efficiency. Accordingly, we emphasize the strict requirement of considering random feature selection as a baseline in the development process of novel unsupervised feature selection methods to ensure a consistent improvement over random feature selection.
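
A random-selection baseline of the kind the abstract argues for takes only a few lines. The sketch below is a generic illustration, not the authors' evaluation protocol: the function names, the number of repeats, and the user-supplied `score_fn` (for example, clustering accuracy on the selected columns) are assumptions.

```python
import random


def random_feature_subsets(n_features, k, n_repeats=10, seed=0):
    """Draw `n_repeats` random subsets of `k` feature indices, each sampled
    without replacement from range(n_features)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), k)) for _ in range(n_repeats)]


def mean_baseline_score(subsets, score_fn):
    """Average a user-supplied evaluation score over the random subsets."""
    return sum(score_fn(subset) for subset in subsets) / len(subsets)


def beats_random(method_score, subsets, score_fn):
    """True when a method's score exceeds the mean random-baseline score."""
    return method_score > mean_baseline_score(subsets, score_fn)


subsets = random_feature_subsets(n_features=20, k=5, n_repeats=3, seed=1)
```

Averaging over several random subsets matters because a single random draw can be lucky or unlucky; comparing against the mean (and ideally the spread) gives a fairer reference point.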

2605.22972 2026-05-25 cs.LG cs.AI 版本更新

A mathematical theory of balancing relational generalization and memorization

关系泛化与记忆平衡的数学理论

Luke Cheng, Samuel Lippl

发表机构 * Center for Theoretical Neuroscience(理论神经科学中心)

AI总结 本文探讨了学习系统如何在关系泛化与记忆例外之间取得平衡这一核心问题,提出了一种新的任务——带有例外的传递推理任务,用于测试模型在关系规则下的泛化与例外记忆能力。通过理论分析和实验验证,研究发现神经网络模型在不同表征结构下表现出对泛化与记忆的平衡能力,但其成功依赖于具体的表征几何特性。该理论不仅揭示了这一任务的机制性挑战,还通过预训练语言模型的实验验证了理论预测,为理解学习系统的泛化机制提供了新视角。

详情
AI中文摘要

人类、动物和现代机器学习模型展现出学习复杂行为并将其泛化到未见情境的惊人能力。这种能力要求我们学习规则和规律以实现泛化。同时,在大多数复杂环境中,任何规则都有例外。学习系统如何在学习一般规律和记忆例外之间取得平衡?我们认为,缺乏任务范式阻碍了对这一基本能力的研究。为填补这一空白,我们引入了一个新任务——带例外的传递推理,该任务测试关系泛化以及对关系规则例外的记忆。然后,我们解析地表征了一个简单、理论上可处理的神经网络学习模型(核岭回归)在广泛表示族和任务参数下的行为。我们发现,这些模型能够在关系泛化和记忆之间取得平衡,但与无例外的传递推理不同,成功的泛化对特定的表示几何敏感。我们通过分析理论解释了为什么该任务在机制上更具挑战性。最后,我们在对有序关系进行微调的预训练语言模型中验证了我们的理论见解,发现这些模型成功根据传递规则进行泛化,但也做出了我们理论预测的那种系统性错误。总体而言,我们的理论展示了学习系统如何在关系泛化和记忆之间取得平衡,解释了可能出错的方式,并强调了设计新任务范式以探测这种能力的必要性。

英文摘要

Humans, animals, and modern machine learning models exhibit impressive abilities to learn complex behaviors and generalize these behaviors to unseen situations. This ability requires us to learn rules and regularities that allow for such generalizations. At the same time, in most complex environments, any rule will have its exceptions. How do learning systems balance between learning general regularities and memorizing exceptions? We argue that a lack of task paradigms has hindered the study of this essential ability. To address this gap, we introduce a novel task, transitive inference with exceptions, that tests for relational generalization and memorization of an exception to the relational rule. We then analytically characterize the behavior of a simple, theoretically tractable model of neural network learning (kernel ridge regression) across a broad family of representations and task parameters. We find that these models can balance between relational generalization and memorization, but unlike for transitive inference without an exception, successful generalization is sensitive to the specific representational geometry. We explain why this task is more challenging mechanistically by drawing on our analytical theory. Finally, we validate our theoretical insights in pretrained language models that are finetuned on ordered relations, finding that these models successfully generalize according to the transitive rule, but also make the kinds of systematic mistakes predicted by our theory. Overall, our theory shows how learning systems can balance between relational generalization and memorization, explains how this can go wrong, and emphasizes the need for new task paradigms designed to probe this ability.

2605.22963 2026-05-25 cs.CL cs.AI 版本更新

Graph Alignment Topology as an Inductive Bias for Grounding Detection

图对齐拓扑作为接地检测的归纳偏置

Paul Landes, Pranav Herur, Adam Cross, Jimeng Sun

发表机构 * Department of Pediatrics, University of Illinois College of Medicine Peoria(伊利诺伊大学皮奥里亚医学院儿科部) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校计算机与数据科学学院) Carle Illinois College of Medicine, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校卡莱医学院)

AI总结 本文研究了如何利用图对齐拓扑作为归纳偏置,以提升大语言模型(LLM)生成内容的事实准确性。作者构建了参考信息与模型输出之间的二分图,并通过图神经网络建模对齐结构,从而直接学习对齐拓扑特征。该方法在多个幻觉检测和问答数据集上取得了优于现有方法及基础LLM(如GPT-4o)的最先进结果,为提升模型输出的可解释性和事实可靠性提供了新思路。

详情
AI中文摘要

大型语言模型(LLM)被优化以产生分布上合理的延续,而不是明确验证生成的命题是否源自源文档。这种归纳偏置使得泛化成为可能,但它不编码响应是否相对于参考是接地的。这些问题限制了LLM在严格事实正确性至关重要的领域(如临床决策支持)中的使用。现有的幻觉检测方法通过检索增强、自一致性或声明验证来提高事实性,但通常不直接学习对齐拓扑。为了利用对齐拓扑作为归纳偏置,我们在参考信息和LLM输出之间构建对齐二分图,并训练图神经网络(GNN)通过消息传递来建模对齐结构。该方法在四个不同的幻觉和问答数据集上取得了最先进的结果,优于所有比较的方法,包括基础LLM如GPT-4o。

英文摘要

Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables generalization, but it does not encode whether responses are grounded with respect to a reference. These issues limit the use of LLMs in domains where strict factual correctness is crucial, such as clinical decision support. Existing hallucination detection approaches improve factuality through retrieval augmentation, self-consistency, or claim verification, but generally do not learn directly over alignment topology. To leverage alignment topology as an inductive bias, we construct aligned bipartite graphs between reference information and LLM outputs and train a graph neural network (GNN) to model alignment structure using message passing. The method achieves state-of-the-art results on four diverse hallucination and question-answering datasets, outperforming all compared methods, including foundational LLMs such as GPT-4o.

2605.21851 2026-05-25 cs.LG cs.AI 版本更新

OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

OPPO: 用于LLM推理中令牌级信用分配的贝叶斯价值递归

Yu Li, Rui Miao, Tian Lan, Zhengling Qi

发表机构 * George Washington University(乔治华盛顿大学) The University of Texas at Dallas(德克萨斯大学达拉斯分校)

AI总结 该论文提出了一种名为OPPO的新型算法,用于改进大语言模型(LLM)在推理任务中的信用分配机制。OPPO基于一种关键观察:传统方法中用于局部判别的 oracle 信号本质上是模型对最终成功概率的贝叶斯更新。通过沿轨迹累积该信号,OPPO能够在不依赖价值网络或额外采样的情况下,直接计算出每个位置的成功概率估计和令牌级优势,从而更准确地识别推理过程中的关键步骤。实验表明,OPPO在多个数学、科学和代码推理基准上显著优于现有方法。

详情
AI中文摘要

具有可验证奖励的强化学习已成为提升LLM推理的标准方法,但主流算法GRPO为每个令牌分配单一轨迹级优势,稀释了关键推理步骤的信号,并在无信息步骤中注入噪声。源自在线策略蒸馏的无评论家替代方案通过预言机条件似然比提供每令牌信号,但每个信号孤立于该位置之前累积的轨迹级证据。我们提出Oracle-Prompted Policy Optimization (OPPO),它基于一个简单观察:先前蒸馏式方法用于局部区分的预言机信号,也是模型对最终成功信念的自然贝叶斯更新。沿轨迹累积信号,以一次额外前向传播的代价,以闭式形式给出每个位置成功概率的运行估计,以及无需学习价值网络和额外采样的令牌级优势。一阶分析将优势分解为蒸馏方法使用的每令牌区分信号,乘以一个状态权重,该权重将信用集中在真正关键的令牌上,并具有方向性方差减少保证。该框架包含两种估计器,区别仅在于谁对证据评分:“自预言机”重用学生模型,将在线策略蒸馏奖励作为严格特例恢复;“教师预言机”将评分委托给更强的冻结模型。在两个基础LLM上,跨越七个数学、科学和代码推理基准,OPPO在AMC'23上比GRPO、DAPO和SDPO提升高达+6.0分,在AIME'24上提升+5.2分,且增益随响应长度单调增加。

英文摘要

Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. Critic-free alternatives derived from on-policy distillation supply per-token signals through oracle-conditioned likelihood ratios, yet apply each signal in isolation from the trajectory-level evidence accumulated up to that position. We propose Oracle-Prompted Policy Optimization (OPPO), which rests on a single observation: the oracle signal used by prior distillation-style methods for local discrimination is also the natural Bayesian update of the model's belief about eventual success. Accumulating the signal along a trajectory yields, in closed form and at the cost of one extra forward pass, a running estimate of the success probability at every position, together with a token-level advantage that requires no learned value network and no additional rollouts. A first-order analysis factorizes the advantage into the per-token discrimination signal used by distillation methods modulated by a state weight that concentrates credit on genuinely pivotal tokens, with a directional variance-reduction guarantee. The framework admits two estimators differing only in which model scores the evidence: a \textit{self-oracle} that reuses the student and recovers the on-policy distillation reward as a strict special case, and a \textit{teacher-oracle} that delegates scoring to a stronger frozen model. On two base LLMs across seven mathematics, science, and code reasoning benchmarks, OPPO improves over GRPO, DAPO, and SDPO by up to $+6.0$ points on AMC'23 and $+5.2$ points on AIME'24, with gains that widen monotonically with response length.
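
The "Bayesian update" reading in this abstract can be made concrete with a generic identity. The notation is introduced here for illustration and is not taken from the paper: $S$ denotes eventual success, $y_{\le t}$ the tokens generated up to position $t$, and the oracle-conditioned likelihood ratio is assumed to play the role of $p(y_t \mid y_{<t}, S) / p(y_t \mid y_{<t})$. Under that assumption, Bayes' rule gives a per-token update whose product along the trajectory is a running success estimate; the paper's actual estimator and advantage may differ.

```latex
% Per-token Bayesian update of the belief in eventual success S
p(S \mid y_{\le t})
  = p(S \mid y_{<t}) \,
    \frac{p(y_t \mid y_{<t}, S)}{p(y_t \mid y_{<t})}

% Unrolled along the trajectory (chain rule applied to both numerator and denominator)
p(S \mid y_{\le t})
  = p(S) \prod_{s=1}^{t}
    \frac{p(y_s \mid y_{<s}, S)}{p(y_s \mid y_{<s})}
```

Read this way, a token whose ratio is far from 1 moves the success belief the most, which matches the abstract's description of credit concentrating on pivotal tokens.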

2605.21489 2026-05-25 cs.LG cs.AI cs.CV stat.CO stat.ML 版本更新

Variance Reduction for Expectations with Diffusion Teachers

具有扩散教师的期望方差缩减

Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine

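Affiliations: NVIDIA; University of Toronto; Princeton University

AI summary: This paper studies how to reduce the variance of gradient estimates when a pretrained diffusion model serves as a frozen "teacher" for downstream tasks such as text-to-3D generation and single-step distillation. It proposes CARV, a compute-aware variance-accounting framework built around a hierarchical Monte Carlo estimator that amortizes expensive upstream computation over cheap diffusion-noise resamples, combined with timestep importance sampling and a stratified inverse-CDF construction. Experiments show that CARV delivers 2-3x effective compute multipliers without changing the objective; in single-step distillation, however, the reduction in gradient variance does not improve generation quality, indicating that variance is no longer the bottleneck there.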

Comments Project page: https://research.nvidia.com/labs/sil/projects/CARV/

详情
AI中文摘要

预训练的扩散模型作为冻结教师,为文本到3D、单步蒸馏和数据归因等下游流程提供支持。这些流程消耗的教师梯度是关于噪声水平和高斯噪声样本的蒙特卡洛期望;其估计器方差主导了计算成本,因为每次抽取都需要昂贵的上游工作(渲染、模拟、编码)。我们引入了CARV,一个计算感知的方差核算框架,它激发了一种分层蒙特卡洛估计器:通过廉价的扩散噪声重采样来摊销昂贵的上游计算,并通过时间步重要性采样和分层逆CDF构造加以强化。在我们的文本到3D蒸馏和归因实验中,CARV在不改变目标的情况下提供了2-3倍的有效计算乘数(主要来自摊销重用;约25%来自IS+分层);在单步蒸馏中,相同的技术将梯度方差降低了一个数量级,但并未改善下游FID,标志着MC方差不再是瓶颈的区间。

英文摘要

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.
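A minimal sketch of the two ingredients named in the abstract, under stated assumptions: one expensive upstream call is reused across `k` cheap noise resamples, and timesteps are drawn by stratified inverse-CDF sampling from a proposal `probs` with importance weights against a uniform target. The function names, the uniform target, the scalar noise, and the `teacher_term` callback are hypothetical stand-ins, not CARV's actual estimator; `probs` is assumed positive and normalized.

```python
import bisect
import random
from itertools import accumulate

def stratified_timesteps(probs, k, rng):
    """Draw k timestep indices from `probs`, one uniform draw per stratum [i/k, (i+1)/k)."""
    cdf = list(accumulate(probs))
    draws = []
    for i in range(k):
        u = (i + rng.random()) / k
        idx = bisect.bisect_right(cdf, u)         # invert the discrete CDF
        draws.append(min(idx, len(probs) - 1))    # guard against floating-point round-up
    return draws

def amortized_estimate(upstream, teacher_term, probs, k, rng):
    """Average k importance-weighted teacher terms that share one upstream result."""
    x = upstream()                                # expensive work (e.g. rendering), done once
    n = len(probs)
    total = 0.0
    for t in stratified_timesteps(probs, k, rng):
        eps = rng.gauss(0.0, 1.0)                 # cheap diffusion-noise resample
        weight = (1.0 / n) / probs[t]             # uniform target over proposal
        total += weight * teacher_term(x, t, eps)
    return total / k
```

Stratification forces the draws to cover the whole CDF range instead of clustering, which is one common way to lower variance at a fixed number of samples.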

2605.21071 2026-05-25 cs.CL cs.AI 版本更新

Fine-grained Claim-level RAG Benchmark for Law

细粒度声明级法律RAG基准

Souvick Das, Sallam Abualhaija, Domenico Bianculli

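Affiliations: University of Luxembourg

AI summary: This paper introduces ClaimRAG-LAW, a fine-grained benchmark dataset for legal retrieval-augmented generation (RAG) that supports English and French, targets both legal experts and non-expert users, and covers diverse question types drawn from realistic scenarios. Using a fine-grained evaluation framework, the study analyzes the retrieval, generation, and claim-level performance of state-of-the-art legal RAG systems and exposes their limitations in the legal domain, offering a reference point for improving the reliability of legal AI systems.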

详情
AI中文摘要

大型语言模型(LLM)的快速进展正在将语义搜索转向问答范式,用户提出问题,LLM生成回答。在法律等高风险领域,检索增强生成(RAG)通常用于减轻生成回答中的幻觉。然而,先前的研究表明,无论是通用还是法律专用的RAG系统,仍然以不同速率产生幻觉,这使得细粒度评估变得至关重要。尽管有需求,现有的法律RAG系统评估框架缺乏分别对检索和生成性能进行详细分析所需的粒度。此外,当前的基准主要是英文且集中于法律专家查询,忽视了非专家需求。我们引入了ClaimRAG-LAW,一个全面的法律RAG数据集,支持法语和英语,面向专家和非专家,并包含反映现实场景的多样化问题类型。我们进一步应用细粒度评估框架对最先进的法律RAG系统进行评估,揭示了法律领域在检索、生成和声明级分析方面的局限性。

英文摘要

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stakes domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite this need, existing evaluation frameworks for legal RAG systems lack the granularity required to analyze retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework to state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

2605.20919 2026-05-25 cs.LG cs.AI cs.PL 版本更新

Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures

Sutra: 以张量操作RNN作为向量符号架构的编译目标

Emma Leonhart

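Affiliations: Emma Leonhart

AI summary: Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler lowers the whole program, including primitives, control flow, and string I/O, to a single fused tensor-op graph, which makes tensor-op networks a compilation target for vector symbolic architectures. The study reports bundle decoding at 100% accuracy through width k=8 on four frozen embedding substrates and verifies differentiability, so the same program runs as a logic program and can be optimized as a trainable neural network.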

Comments Modified NeurIPS submission, see AI declaration and replication materials at end of paper

详情
AI中文摘要

Sutra是一种带类型的纯函数式编程语言,其编译后的前向传播是一个PyTorch神经网络。编译器将整个程序——包括原语、控制流、字符串I/O——通过beta归约降级为一个在冻结嵌入基质上的融合张量操作图。旋转绑定、解绑、捆绑、多项式Kleene三值逻辑以及尾递归循环均被降级为张量操作;Kleene连接词是在{-1, 0, +1}真值网格上精确的拉格朗日插值多项式。验证通过两种方式测试同一事实。(1) 同一程序在跨越两种模态的四个冻结嵌入上运行——三种文本编码器(nomic-embed-text、all-minilm、mxbai-embed-large)和一种蛋白质语言模型(ESM-2)——并在每个基质上以宽度k=8实现100%的解码准确率,而教科书式的Hadamard乘积已经崩溃(mxbai-embed-large上2.5%,all-minilm上7.5%)。(2) PyTorch自动求导流经实际编译的图:一个用.su编写的模糊规则分类器从随机初始化(18.7±9.5%;随机概率=20%,五类)通过反向传播经过发射图(符号源未修改)训练到100.0±0.0%(三个种子)。一个加权变体额外训练一个标量余弦增益,并将其作为数值字面量写回.su源文件;重新编译重现训练后的行为,每个logit误差约2e-7,因此训练后的模型本身是可读、可重编译的代码。因此,同一工件既是一个逻辑程序,也是一个可训练的神经网络。

英文摘要

Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta-reduces the whole program -- primitives, control flow, string I/O -- to one fused tensor-op graph over a frozen embedding substrate. Rotation binding, unbind, bundle, polynomial Kleene three-valued logic, and tail-recursive loops all lower to tensor operations; the Kleene connectives are Lagrange-interpolated polynomials exact on the {-1, 0, +1} truth grid. Validation is one fact tested two ways. (1) The same program runs on four frozen embeddings spanning two modalities -- three text encoders (nomic-embed-text, all-minilm, mxbai-embed-large) and one protein language model (ESM-2) -- and decodes bundles at 100% accuracy through width k=8 on every substrate, where the textbook Hadamard product has already collapsed (2.5% on mxbai-embed-large, 7.5% on all-minilm). (2) PyTorch autograd flows through the actually compiled graph: a fuzzy-rule classifier written in .su trains from random init (18.7 +/- 9.5%; chance = 20%, five classes) to 100.0 +/- 0.0% (three seeds) by backpropagating through the emitted graph, the symbolic source unmodified. A weighted variant additionally trains a scalar cosine gain and writes it back into the .su source as a numeric literal; recompiling reproduces the trained behaviour to ~2e-7 per logit, so the trained model is itself legible, recompilable code. The same artifact is therefore both a logic program and a trainable neural network.
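The claim that Kleene connectives can be polynomials exact on the {-1, 0, +1} grid can be checked with a short sketch. It assumes the usual strong-Kleene encoding (False = -1, Unknown = 0, True = +1, with AND = min, OR = max, NOT = negation) and builds the tensor-product Lagrange interpolant over the 3x3 grid. The paper's actual polynomials are not given in the abstract, so the names and construction here are illustrative only.

```python
NODES = (-1.0, 0.0, 1.0)  # False, Unknown, True (assumed encoding)

def lagrange_basis(i, x):
    """1-D Lagrange basis polynomial for NODES[i], evaluated at x."""
    value = 1.0
    for j, node in enumerate(NODES):
        if j != i:
            value *= (x - node) / (NODES[i] - node)
    return value

def interpolate(truth_table):
    """Return the polynomial p(a, b) that matches truth_table on the 3x3 grid."""
    def poly(a, b):
        return sum(
            truth_table(NODES[i], NODES[j]) * lagrange_basis(i, a) * lagrange_basis(j, b)
            for i in range(3)
            for j in range(3)
        )
    return poly

kleene_and = interpolate(min)  # strong Kleene conjunction
kleene_or = interpolate(max)   # strong Kleene disjunction

def kleene_not(a):
    return -a                  # negation is already a polynomial
```

Because each connective is a polynomial in its inputs, it stays differentiable between grid points, which is consistent with the abstract's statement that autograd flows through the compiled graph.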

2605.20896 2026-05-25 cs.CR cs.AI cs.LG 版本更新

GenAI-Driven Threat Detection with Microsoft Security Copilot

GenAI驱动的威胁检测与Microsoft Security Copilot

Scott Freitas, Amir Gharib

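Affiliations: Microsoft Security Research

AI summary: This paper presents the Dynamic Threat Detection Agent (DTDA), an autonomous agent that strengthens Microsoft Security Copilot's ability to detect hidden cyber threats. DTDA combines a unified activity timeline, versioned LLM prompt contracts, a planner-executor investigation loop, and dynamic alert generation to continuously analyze security incidents and produce explainable detections. In production deployment it reaches 80.1% precision from customer feedback, with a median end-to-end investigation time of 28 minutes per incident.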

详情
AI中文摘要

防御当今日益复杂的网络攻击需要安全分析师不断将不断演变的攻击者技术转化为检测逻辑。这使防御者处于被动状态,需要在日益碎片化的安全格局中不断更新专业知识。我们引入了动态威胁检测代理(DTDA),一种始终在线的自适应代理,持续调查Microsoft Defender中的安全事件,以发现隐藏威胁并在发现攻击故事缺口时生成可解释的检测。DTDA结合了:(1)统一的活动时间线,涵盖警报、事件、用户和实体行为分析以及威胁情报;(2)版本化的LLM提示合同,包含模式验证、基础要求、有限重试和故障关闭抑制;(3)规划器-执行器调查循环,生成攻击特定假设并收集支持和反驳证据;(4)动态告警生成,包含上下文相关的标题、严重性、MITRE映射、修复指导、涉及实体和自然语言攻击描述。集成到Microsoft Security Copilot并部署在数万个Defender客户中,DTDA在行业规模下持续运行。在120天的在线评估中,DTDA根据客户反馈实现了80.1%的精确率,同时为约15%的调查事件生成了新颖告警。在离线评估中,DTDA使用GPT-5.4以0.78的F1分数恢复了隐藏的恶意活动,比GPT-4.1提高了0.12 F1,并比基线高出0.26 F1点。在操作上,DTDA处理单个事件调查的中位端到端时间为28分钟,中位代币成本为2.04美元,作业级故障率为0.38%。这些结果表明,自主代理可以在生产规模上识别遗漏的恶意活动。

英文摘要

Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.
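Item (2) above, a prompt contract with schema validation, bounded retries, and fail-closed suppression, can be sketched generically as follows. This is a toy illustration under stated assumptions: `call_llm` is a hypothetical callable that returns text, and the "schema" is a plain mapping from field name to Python type. None of this reflects DTDA's actual contract format.

```python
import json

def run_contract(call_llm, prompt, required_fields, max_attempts=3):
    """Call an LLM under a prompt contract; return a validated dict or None.

    call_llm: any callable taking a prompt string and returning text (hypothetical)
    required_fields: mapping of field name -> expected Python type (toy schema)
    """
    for _ in range(max_attempts):              # bounded retries
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue                           # malformed output: try again
        if isinstance(parsed, dict) and all(
            isinstance(parsed.get(name), kind)
            for name, kind in required_fields.items()
        ):
            return parsed                      # output satisfies the schema
    return None                                # fail closed: suppress, emit nothing
```

Returning `None` after the retry budget is spent means a downstream alert generator never sees unvalidated output, which is the point of failing closed rather than open.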

2605.20201 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

基于代理思维链调优的长上下文推理

Miao Li, Irina Saparina, Alexander Gurung, Mirella Lapata

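Affiliations: School of Informatics, University of Edinburgh

AI summary: This work addresses the poor performance of large language models on long-context tasks that require complex reasoning and proposes ProxyCoT, a new training framework. The method obtains high-quality reasoning traces on short proxy contexts and transfers them to the full long context, improving long-context reasoning. Experiments show that ProxyCoT outperforms strong baselines across multiple datasets at lower computational overhead and generalizes to out-of-domain tasks.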

Comments Long paper, ACL 2026 (Main conference)

详情
AI中文摘要

近期的大语言模型支持高达1000万token的输入,但在需要复杂推理的长上下文任务上表现不佳。此类任务可以通过仅使用输入的一个子集(即代理上下文)而非完整序列来解决。尽管共享相同的底层推理过程,模型在代理上下文和完整上下文之间表现出显著的性能差异。为了改进长上下文推理,我们提出了ProxyCoT,一种新颖的训练框架,将推理能力从短代理上下文迁移到完整长上下文。具体来说,我们首先通过强化学习或从更大的教师模型蒸馏,在代理上下文中获得高质量的思维链推理轨迹,然后通过监督微调将这些生成的轨迹锚定到完整长上下文中。跨不同数据集的实验表明,ProxyCoT在减少计算开销的同时,始终优于强基线。此外,使用ProxyCoT训练的模型能够将其长上下文推理能力泛化到域外任务。

英文摘要

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.

2605.20087 2026-05-25 cs.CL cs.AI 版本更新

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

ThoughtTrace: 理解真实世界LLM交互中的用户想法

Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li, Shayne Longpre, Hongxiang Gu, Maximillian Chen, Tianmin Shu

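Affiliations: Johns Hopkins University; Massachusetts Institute of Technology; Google Research

AI summary: ThoughtTrace is the first large-scale dataset that pairs real-world multi-turn human-AI conversations with users' self-reported thoughts, capturing why users send their prompts and how they react to assistant responses. It comprises 1,058 users, 2,155 conversations, and 10,174 thought annotations; the analysis shows that thoughts are semantically distinct from the messages themselves and are difficult for current frontier LLMs to infer. The study further demonstrates the value of thoughts for user-behavior prediction and for training personalized assistants, establishing them as a new data modality for understanding users' latent goals and needs.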

Comments 53 pages, 23 figures, 4 tables. Project website: https://thoughttrace-project.github.io/

详情
AI中文摘要

对话式AI现已服务数十亿用户,但现有数据集仅捕捉用户所说,而非所想。我们引入ThoughtTrace,首个大规模数据集,将真实世界多轮人机对话与用户自述想法配对:用户发送提示的原因以及对助手回复的反应。ThoughtTrace包含来自20个语言模型的1,058名用户、2,155次对话、17,058轮次和10,174条想法标注。我们的分析表明,ThoughtTrace捕捉了长期、主题多样的交互,且想法在语义上不同于消息,前沿LLM难以从上下文中推断,内容多样,并与对话阶段相关。我们进一步展示了想法在下游建模中的实用性。首先,想法作为推理时上下文改善了用户行为预测。其次,想法引导的重写为训练个性化助手提供了细粒度对齐信号。总之,ThoughtTrace将用户想法确立为研究人机交互背后认知动态的新数据模态,并为构建更好理解和适应用户潜在目标、偏好与需求的助手奠定了基础。

英文摘要

Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human--AI conversations with users' self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. We further demonstrate the utility of thoughts for downstream modeling. First, thoughts improve user-behavior prediction as inference-time context. Second, thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. Together, ThoughtTrace establishes user thoughts as a new data modality for studying the cognitive dynamics behind human--AI interaction and provides a foundation for building assistants that better understand and adapt to users' latent goals, preferences, and needs.

2605.18993 2026-05-25 cs.LG cs.AI 版本更新

Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic

将线性化行为蒸馏到非线性微调中以实现有效的任务算术

Thomas Sommariva, Francesca Morandi, Simone Calderara, Angelo Porrello

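Affiliations: University of Pisa, Italy

AI summary: This work investigates how to retain the advantages that linear fine-tuning offers for task-vector composition while using standard non-linear fine-tuning. The authors show that linearity with respect to weight perturbations can be enforced in a non-linear model through constraints in activation space, and they train the student by distilling hidden representations from a linearized teacher. The resulting model keeps task vectors composable, adds no inference-time overhead, and performs strongly on vision and language benchmarks.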

Comments Accepted at ICML 2026

详情
AI中文摘要

任务向量组合已成为编辑预训练模型的一种有前景的范式,通过加法实现模型合并,通过减法实现模型遗忘。在预训练模型的切空间中进行微调(线性微调)已被证明是有效的,因为它产生的任务向量自然解缠且抗干扰。然而,线性化模型在训练期间表达能力有限,并且在推理时计算成本较高,这限制了它们的实际应用。在这项工作中,我们弥合了线性微调与标准非线性微调之间的差距。我们表明,关于权重扰动的线性性(一种在参数空间中定义的属性)可以通过在训练期间在激活空间中施加约束来强制执行。具体来说,我们将曲率正则化的线性化教师模型的隐藏表示蒸馏到通过常规微调训练的非线性学生模型中。我们发现,得到的模型继承了线性化模型在任务算术中的关键属性,能够实现任务向量的有效组合,并在视觉和语言基准测试中实现强性能,而不会产生任何推理开销。

英文摘要

Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unlearning through subtraction. Fine-tuning in the tangent space of a pre-trained model (linear fine-tuning) has proven effective, as it produces task vectors that are naturally disentangled and resistant to interference. However, linearized models suffer from limited expressivity during training and incur higher computational costs at inference time, which restrict their practical applicability. In this work, we bridge the gap between linear and standard non-linear fine-tuning. We show that linearity with respect to weight perturbations, a property defined in parameter space, can be enforced through constraints in activation space during training. Concretely, we distill hidden representations from a curvature-regularized linearized teacher into a non-linear student trained via conventional fine-tuning. We find that the resulting model inherits key properties of linearized models for task arithmetic, enabling effective composition of task vectors and achieving strong performance across vision and language benchmarks without incurring any inference-time overhead.
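For readers new to task arithmetic, the addition and subtraction operations mentioned in the first sentence can be sketched with toy weights. A task vector is the difference between fine-tuned and pre-trained weights; adding task vectors merges tasks, and subtracting one removes a behaviour. The dictionary-of-floats representation and function names are illustrative assumptions (real models hold tensors), and the sketch does not cover the paper's distillation method.

```python
def task_vector(pretrained, finetuned):
    """Task vector tau = theta_finetuned - theta_pretrained, per parameter name."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}

def apply_task_vectors(pretrained, vectors, scale=1.0):
    """Return theta_pretrained + scale * sum(vectors); a negative scale 'unlearns'."""
    edited = dict(pretrained)
    for tau in vectors:
        for name, delta in tau.items():
            edited[name] += scale * delta
    return edited

# Toy example with two scalar "parameters".
base = {"w": 1.0, "b": 0.0}
tau_a = task_vector(base, {"w": 1.5, "b": 0.25})
tau_b = task_vector(base, {"w": 0.5, "b": 1.0})
merged = apply_task_vectors(base, [tau_a, tau_b])           # model merging by addition
forgot_a = apply_task_vectors(base, [tau_a], scale=-1.0)    # unlearning by subtraction
```

Composition of this kind works best when task vectors do not interfere with one another, which is the property the abstract attributes to linearized fine-tuning.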

2605.18911 2026-05-25 cs.LG cs.AI 版本更新

Does Your Wildfire Prediction Model Actually Work, or Just Score Well?

你的野火预测模型真的有效,还是只是得分高?

Yangshuang Xu, Yuyang Dai, Liling Chang, Qi Wang, Yushun Dong

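Affiliations: Florida State University; Northeastern University

AI summary: This paper examines whether existing Earth foundation models are actually effective for wildfire prediction, noting that they are pretrained for general atmospheric and geophysical objectives rather than for wildfire forecasting. The authors introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction, together with a fixed-contract evaluation framework that addresses the evaluation sensitivity caused by the sparsity of wildfire events. The results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation, and the work offers a benchmark and methodology for future research.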

Comments 25 pages

详情
AI中文摘要

野火预测对于早期预警和资源分配至关重要,然而现有的地球基础模型(Earth FMs)是为通用大气和地球物理目标预训练的,而非野火预测。为弥补这一空白,我们提出了WILDFIRE-FM,这是首个专门针对野火预测预训练的基础模型,使用了天气、活跃火观测、地形、植被和静态环境数据。然而,仅引入特定领域的骨干网络并不能解决评估问题:野火事件在时空上稀疏,使得迁移结论对匹配规则和评估设置高度敏感。为解决这一问题,我们引入了一个固定合约评估框架,包含两个受控检查:固定输出检查用于匹配规则效应,固定特征检查用于头部选择效应。在匹配合约下,我们在占用、蔓延、检索和回归任务上将WILDFIRE-FM与十个地球基础模型基线进行比较。结果表明,野火迁移结论强烈依赖于评估设计和任务制定。我们希望该框架和WILDFIRE-FM能为未来野火特定的地球基础模型研究和基准测试提供基础。我们的代码可在 https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/ 获取。

英文摘要

Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.

2605.18859 2026-05-25 cs.LG cs.AI 版本更新

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

TwinRouterBench:面向现实智能体LLM路由的快速静态与实时动态评估

Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao, Anjie Yang, Liang Tian, Lynn Ai, Eric Yang, Tianyu Shi

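Affiliations: Gradient; Soochow University; Independent Researcher; University of Southern California; Rice University; Carnegie Mellon University; Shanghai Jiao Tong University; University of California, Berkeley; University of the Chinese Academy of Sciences; University of California, Los Angeles

AI summary: This paper introduces TwinRouterBench, a benchmark for evaluating routing policies for agentic large language models (LLMs) in both static and dynamic settings. It has two tracks: the static track provides router-visible prefixes from several task suites, each paired with a target model tier, and scores routers by deterministic arithmetic; the dynamic track runs routing policies inside a live agent harness and measures task resolution and realized API spend. Together the tracks give router development a fast offline loop followed by end-to-end validation.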

详情
AI中文摘要

LLM路由在长时任务(如编码智能体、深度研究系统和计算机使用智能体)中最为重要,其中单个用户请求会触发多次模型调用。将每次调用路由到最便宜的足够模型可以在不牺牲质量的情况下降低成本,然而现有的路由器基准仅评估一次性提示的路由。它们从未暴露中间智能体步骤中路由器可见的前缀,从未测试更便宜的替代品是否保留下游任务的成功,并且通常在评估时依赖在线LLM评判。我们引入了TwinRouterBench,一个具有两轨的步骤级路由基准。静态轨提供来自SWE-bench、BFCL、mtRAG、QMSum和PinchBench中520个实例的970个路由器可见前缀,每个前缀与在发布的降级和级联协议下估计的执行验证目标层级配对;评分是层级标签、轨迹成员资格和令牌成本的确定性算术,无需在线评估方LLM评判。动态轨提供一个工具,可在完整的500例SWE-bench验证集上运行路由器;本文报告了与静态SWE监督划分不相交的100例保留评估。每次LLM调用时,路由器从锁定池中选择一个具体模型,成功由官方任务解决率和实际API支出衡量。两轨支持快速离线迭代,随后在实时智能体执行下进行端到端验证。代码和数据可在https://github.com/CommonstackAI/TwinRouterBench获取。

英文摘要

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.
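The idea of routing each call to the cheapest sufficient model can be made concrete with a toy sketch. It assumes each model in the pool has an integer capability tier and a per-call cost, and that each step comes with a target tier; the field names, model names, and costs are invented for illustration and are not TwinRouterBench's scoring protocol.

```python
def cheapest_sufficient(pool, target_tier):
    """Pick the lowest-cost model whose tier meets the target, or None if none does."""
    candidates = [model for model in pool if model["tier"] >= target_tier]
    if not candidates:
        return None
    return min(candidates, key=lambda model: model["cost"])

def route_trajectory(pool, target_tiers):
    """Route every step of a trajectory; return the chosen model names and total spend."""
    chosen = [cheapest_sufficient(pool, tier) for tier in target_tiers]
    spend = sum(model["cost"] for model in chosen if model is not None)
    names = [model["name"] if model is not None else None for model in chosen]
    return names, spend

# Hypothetical three-model pool and a four-step trajectory.
POOL = [
    {"name": "small", "tier": 1, "cost": 1},
    {"name": "medium", "tier": 2, "cost": 4},
    {"name": "large", "tier": 3, "cost": 20},
]
names, spend = route_trajectory(POOL, [1, 3, 2, 1])
```

In this toy trajectory an oracle router spends 26 cost units, against 80 for always calling the largest model; a step-level benchmark measures how close a real router gets to such a target without losing task success.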

2605.17637 2026-05-25 cs.AI 版本更新

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

WebGameBench: 通过浏览器原生游戏对编码代理进行需求到应用的评估

Wenyu Zhang, Guoliang You, Tianlun, Haotian Zhao, Tianshu Zhu, Haoran Wang, Xiaoxuan Tang, Mingyang Dai, Jingnan Gu, Daxiang Dong, Jianmin Wu

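Affiliations: Baidu; University of Science and Technology of China

AI summary: WebGameBench is a benchmark that evaluates whether coding agents can carry a requirement through to a working application, specifically whether they can turn a structured web-game specification into a game that runs in the browser. Browser-native games provide a compact but behavior-dense testbed, and a runtime evaluator labels each delivered game as EXCELLENT, USABLE, or UNUSABLE. The best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate, showing a large gap between playable delivery and full requirement satisfaction. According to the authors, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery, and on a human-reviewed subset its runtime labels are broadly aligned with human gameplay review under the Usable-rate criterion.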

Comments 19 pages, 6 figures

详情
AI中文摘要

编码代理越来越多地被用作应用程序构建者,然而许多评估仍聚焦于源代码、仓库级测试或中间痕迹,而非交付的应用。我们引入WebGameBench,一个需求到应用的基准,评估编码代理能否将冻结的结构化Web游戏规范转化为可浏览器访问的游戏。浏览器原生游戏提供了一个紧凑但行为密集的测试平台:即使是简单的游戏也需要协调的输入处理、空间映射、规则执行、状态转换、终止条件、重启行为和可见反馈。在WebGameBench中,每个生成的工件在统一部署协议下被构建、服务并作为浏览器可访问的应用暴露。然后,运行时评估器在真实浏览器中与交付的游戏交互,并分配三类标签:优秀、可用或不可用。在人工审查的子集上,运行时标签与人类游戏审查在可用率标准下大致一致。在111个任务、12个编码代理和14个评估配置中,WebGameBench区分了当前系统:最佳配置达到76.9%的可用率,但仅有20.2%的优秀率。这一差距表明,跨越最低可玩交付阈值仍远未达到完全满足需求。据我们所知,WebGameBench是首个针对浏览器原生游戏交付的需求到应用基准,它在可用率标准下将交付应用的运行时标签与独立的人类游戏审查进行验证。

英文摘要

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

2605.17468 2026-05-25 cs.HC cs.AI 版本更新

An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training

一种可解释的闭环智能辅导系统,用于异步演讲训练中的多模态情感反馈

Hung-Yue Suen, Kuo-En Hung

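AI summary: This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that provides multimodal affective feedback for asynchronous presentation training, helping learners develop on-camera oral presentation skills at scale. Built around a seven-dimensional Behaviorally Anchored Rating Scale (BARS), the system combines multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching in a three-layer interpretable feedback architecture that turns facial, vocal, textual, and oculomotor inputs into traceable, evidence-based feedback. On MOOC video data its scores are comparable to expert ratings, and learners improved significantly on all seven BARS dimensions over a 30-day practice window.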

Comments 12 pages, 8 figures, IEEE Transactions on Learning Technologies, 2026

详情
AI中文摘要

本文提出了一种可解释的闭环智能辅导系统(ITS),支持大规模开发摄像机前口头演讲技能的反馈引导练习。该系统操作化了一个七维行为锚定评级量表(BARS),并实现了一个三层可解释反馈架构,该架构连接了与评分标准一致的多模态评分、观众感知的表达诊断以及检索增强的对话式辅导,以支持刻意练习。基于XGBoost骨干,该ITS将多模态输入(面部、声音、文本和眼动特征)映射为基于证据的反馈,这些反馈可以追溯到可观察的表现线索。在10,360个大规模开放在线课程(MOOC)视频片段上训练后,该系统实现了与专家评分相当的表现水平的评分标准一致评分(R2 = 0.48-0.61,Spearman's rho = 0.69-0.78,MAE = 0.43-0.57)。在204名成年学习者为期30天的练习窗口的前后验证研究中,参与者在所有七个BARS维度上表现出显著改善(Cohen's d = 0.39-0.90),在控制基线分数和人口统计学因素后,练习频率与后测成绩呈强正相关。结果展示了如何通过集成的反馈架构将多模态分析输出系统地转化为可观察的行为变化,推动了基于表现的能力的可解释和教学导向的ITS设计。

英文摘要

This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice for developing on-camera oral presentation skills at scale. The system operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) and implements a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching to support deliberate practice. Built on an XGBoost backbone, the ITS maps multimodal inputs (facial, vocal, textual, and oculomotor features) into evidence-based feedback that can be traced back to observable performance cues. Trained on 10,360 Massive Open Online Course (MOOC) video segments, the system achieved rubric-aligned scoring with performance levels comparable to expert ratings (R2 = 0.48-0.61, Spearman's rho = 0.69-0.78, MAE = 0.43-0.57). In a pre-post validation study with 204 adult learners over a 30-day practice window, participants demonstrated significant improvements across all seven BARS dimensions (Cohen's d = 0.39-0.90), with practice frequency showing a strong positive association with posttest performance after controlling for baseline scores and demographics. The results demonstrate how multimodal analytic outputs can be systematically transformed into observable behavioral change through an integrated feedback architecture, advancing explainable and pedagogically grounded ITS design for performance-based competencies.

2605.17076 2026-05-25 cs.LG cs.AI cs.DC cs.MA 版本更新

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

S-Bus: 多智能体LLM状态协调的自动读集重建

Sajjad Khan

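Affiliations: Sajjad Khan

AI summary: This paper proposes S-Bus, an HTTP middleware for concurrency control among LLM agents that share mutable state, aimed at settings where agents cannot declare their read sets. Its core mechanism, the DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic and thereby provides a consistency guarantee called Observable-Read Isolation (ORI), which prevents Structural Race Conditions in dedicated-shard topologies. The contributions are mechanised formal verification, an empirical safety comparison against PostgreSQL and Redis, and an analysis of ORI's semantic effect under different workloads.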

Comments v2: LLM judge validated against human annotator (Zahid Hussain, Mindgigs Peshawar) on PH-3 at strict kappa=0.93 (n=93, 96.8% agreement); over-claim refined to 32% (LLM) / 49% (human). Adds Exp.PG-Comparison Rust-Native and Workload-B chi2=1094.98. 24 pages, 23 tables. Annotation data attached as arXiv ancillary files

详情
AI中文摘要

我们解决了通过HTTP共享可变状态的LLM智能体的并发控制问题,其中智能体无法被修改以声明读集。S-Bus是一个HTTP中间件,其核心机制——服务端DeliveryLog——在提交时从观察到的HTTP GET流量中重建每个智能体的读集。它提供的一致性属性——可观测读隔离(ORI),一种基于HTTP可观测读投影的部分因果一致性——防止了专用分片拓扑中的结构性竞态条件。 三项贡献:(C1)DeliveryLog机制,具有三层机械化证据:TLAPS证明了ReadSetSoundness和ORICommitSafety(基于一个类型公理);N=3时的穷举TLC探索了20,763,484个状态,零违规;Dafny验证了9个归纳引理。(C2)与PostgreSQL 17 SERIALIZABLE和Redis 7 WATCH/MULTI的经验安全对等:在884,110次提交尝试中(其中427,308次处于活跃争用下)零Type-I损坏。(C3)ORI在专用分片工作负载中语义中性,但在单分片协作写入中有害,因为保留传播并发矛盾。 v2更新:PH-3 LLM评判器现在已针对人类标注者(Zahid Hussain, Mindgigs Peshawar)在400个(步骤,分片)对上进行独立验证,严格kappa=0.93(n=93,原始一致性96.8%)。LLM间评判器一致性为kappa=0.46(边界方差)。智能体自我报告高估分片使用量32%(LLM评判器)至49%(人类标注者)。SJ-v4语义质量评分标准仍为单评判器LLM-only。 源代码、形式化证明、测试框架、标注数据:https://github.com/sajjadanwar0/sbus

英文摘要

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
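The core idea, recording what each agent read and validating it at commit time, resembles optimistic concurrency control and can be sketched in a few lines. This is a simplified in-memory illustration under stated assumptions: the class layout, method names, and per-shard version counters are hypothetical, there is no HTTP layer, and the sketch makes no claim to implement ORI or the verified protocol.

```python
class DeliveryLog:
    """Toy server-side log: reads are observed, commits are validated against them."""

    def __init__(self):
        self.values = {}     # shard -> current value
        self.versions = {}   # shard -> current version number
        self.read_sets = {}  # agent -> {shard: version observed on GET}

    def get(self, agent, shard):
        """Serve a read and record which version the agent observed."""
        version = self.versions.get(shard, 0)
        self.read_sets.setdefault(agent, {})[shard] = version
        return self.values.get(shard), version

    def commit(self, agent, shard, value):
        """Accept the write only if nothing the agent read has changed since."""
        observed = self.read_sets.get(agent, {})
        stale = [s for s, v in observed.items() if self.versions.get(s, 0) != v]
        if stale:
            return False                       # reject: the agent must re-read first
        self.values[shard] = value
        self.versions[shard] = self.versions.get(shard, 0) + 1
        self.read_sets[agent] = {}             # next step starts with a fresh read set
        return True
```

The agents never declare a read set here; the server infers it from the reads it served, which is the property the abstract highlights for agents that cannot be modified.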

2605.16799 2026-05-25 cs.LG cs.AI 版本更新

Cross-Domain Molecular Relational Learning: Leveraging Chemical Structure-Activity Analysis

跨域分子关系学习:利用化学结构-活性分析

Peiliang Zhang, Jingling Yuan, Shiqing Wu, Mengqing Hu, Chao Che, Yongjun Zhu, Lin Li

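Affiliations: Wuhan University of Technology; Yonsei University; Hubei Key Laboratory of Transportation Internet of Things; State Key Laboratory of Silicate Materials for Architectures; City University of Macau; Kyung Hee University; Dalian University

AI summary: This work addresses the lack of cross-domain modeling in molecular relational learning and proposes a cross-domain approach grounded in structure-activity analysis. The core method is DisTrans, a Domain Adversarial Training Network with Structural-Semantic Transfer Discrepancy, which uses substructure topological discrepancies to learn the domain dependence of molecular structures and aligns functional-group semantic information between the source and target domains to improve cross-domain adaptation. Experiments show that it outperforms 16 baseline methods in two typical cross-domain settings.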

Comments Accepted by SIGKDD 2026 Research Track

详情
AI中文摘要

分子表示的最新进展整合了分子拓扑和视觉模态,为精确的分子关系学习(MRL)开辟了新途径。现有的MRL方法专注于域内建模,其固有的域封闭效应限制了在分子科学中的适用性,特别是在阐明跨域相互作用机制方面。因此,跨域分子关系学习的必要性日益迫切。受益于结构-活性分析,我们提出了具有结构语义迁移差异的域对抗训练网络(DisTrans),以优化分子结构和视觉图像的跨域自适应表示。1)我们利用基于域间子结构拓扑差异的梯度反转策略来学习分子结构的域依赖性。该策略引导模型适应目标域中的结构邻接模式,生成域可分离的结构表示。2)我们应用跨域表示引导机制来对齐源域和目标域之间的官能团语义信息,学习跨域一致性信息。在两种典型跨域策略中的实验结果表明,DisTrans优于16种基线方法,即使在显著的域间差异下也能保持令人满意的性能。

英文摘要

Recent advances in molecular representation integrate molecular topological and visual modalities, opening new avenues for precise Molecular Relational Learning (MRL). Existing MRL methods focus on intra-domain modeling, and their inherent domain-closed effect limits their applicability in molecular science, particularly in elucidating cross-domain interaction mechanisms. Consequently, the need for Cross-Domain Molecular Relational Learning has become increasingly pressing. Building on structure-activity analysis, we propose the Domain Adversarial Training Network with Structural-Semantic Transfer Discrepancy (DisTrans) to optimize cross-domain adaptive representation for molecular structures and visual images. 1) We employ a gradient reversal strategy based on substructure topological discrepancies between domains to learn the domain dependence of molecular structures. This strategy guides the model to adapt to the structural adjacency patterns in the target domain, generating domain-separable structural representations. 2) We apply a cross-domain representation guidance mechanism to align the functional-group semantic information between the source and target domains, learning cross-domain consistency information. Experimental results under two typical cross-domain strategies demonstrate that DisTrans outperforms 16 baseline methods, maintaining satisfactory performance even under pronounced inter-domain discrepancy.

2605.16283 2026-05-25 cs.CY cs.AI 版本更新

Can the Recovery Mechanism Survive AI? Skill Formation, Labor, and What Current Measurement Misses

恢复机制能否在人工智能中幸存?技能形成、劳动以及当前测量所遗漏的

Aysa Xuemo Fan

发表机构 * Aysa Xuemo Fan

AI总结 本文探讨了生成式人工智能对传统技能形成机制的潜在冲击,指出AI可能首次打破技术进步与教育适应之间的历史循环。通过劳动经济学理论、大规模AI交互数据及技能形成实验,研究提出了三个核心贡献:构建了存量与流量分析框架,揭示当前AI主要增强现有劳动者能力却削弱下一代培养管道;系统分析发现现有研究普遍忽视认知中的知识维度,且AI虽提升表现却未促进学习;提出扩展认知分类体系,区分有助于和阻碍学习的AI交互模式。研究强调AI的社会风险不在于替代教师,而在于消除下一代能力形成所需的挑战过程。

详情
AI中文摘要

在整个现代时期,当新技术取代工人时,社会通过相同的机制进行适应:教育提高了认知上限,培养出能够完成机器尚未触及任务的工人。生成式AI可能是第一个打破这一循环的技术,因为它现在运作于该上限的顶端。本文借鉴劳动经济学、来自多个平台的数百万AI对话部署数据、对两个公共数据集的原始重新分析以及技能形成实验,提出了三项贡献。首先,一个存量-流量框架显示,经济数据和教育数据对同一技术讲述了不同的故事:增强主导当前工人,但培养下一代的发展管道正承受压力。其次,对证据基础的系统性差距分析揭示,所有主要研究均未测量认知的知识维度,三项测量学习成果的研究(每项n<200)一致发现AI提高了表现但未改善学习(在我们的跨平台重新分析中d=1.21),且没有研究连接专业人群和学生人群。第三,一个扩展的认知分类法(不确定性下的判断、认知身份和认知能动性)应用于证据中的三个案例,以区分保留学习的AI交互模式与结构相似但侵蚀学习的模式。本文认为,AI的社会风险不在于取代教师,而在于消除下一代能力形成所必需的生产性挣扎,并提出了针对当前测量系统所遗漏内容的研究和设计议程。

英文摘要

Throughout the modern era, when new technologies displaced workers, societies adapted through the same mechanism: education raised the cognitive ceiling, producing workers capable of tasks machines could not yet reach. Generative AI may be the first technology to break this cycle, because it now operates at the top of that ceiling. Drawing on labor economics, deployment data from millions of AI conversations across multiple platforms, original reanalysis of two public datasets, and skill-formation experiments, this paper develops three contributions. First, a stock-versus-flow framework showing that economic data and education data tell divergent stories about the same technology: augmentation dominates current workers, but the developmental pipeline producing the next generation is under strain. Second, a systematic gap analysis of the evidence base, revealing that the knowledge dimension of cognition is unmeasured across all major studies, that the three studies measuring learning outcomes (each $n < 200$) consistently find AI improves performance without improving learning ($d = 1.21$ in our cross-platform reanalysis), and that no study bridges professional and student populations. Third, an extended cognitive taxonomy (judgment under uncertainty, epistemic identity, and epistemic agency) applied to three cases from the evidence to distinguish AI interaction patterns that preserve learning from structurally similar ones that erode it. The paper argues that AI's societal risk lies not in replacing teachers but in eliminating the productive struggle through which the next generation's capacity forms, and proposes a research and design agenda targeting what current measurement systems miss.

2605.16087 2026-05-25 cs.RO cs.AI 版本更新

Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

面向感知模型的可信与可解释人工智能:从概念到原型车辆部署

Till Beemelmanns, Shayan Sharifi, Manas Mehrotra, Ayushman Choudhuri, Lutz Eckstein

发表机构 * Institute for Automotive Engineering, RWTH Aachen University(汽车工程研究所,亚琛工业大学)

AI总结 本文研究了如何在自动驾驶感知模型中实现可信且可解释的人工智能,针对深度神经网络在自动驾驶中应用时存在的不透明性和安全性问题,提出了一种集成可信解释性和不确定性估计的感知模块。该方法基于Transformer架构,在推理时通过注意力机制生成解释,并通过扰动一致性测试验证其可靠性,同时引入不确定性估计与校准模块以提升系统鲁棒性。研究还展示了该模块在原型车上的部署及可视化接口,验证了其在实时可信感知监控中的可行性。

Comments Accepted for publication at IEEE ITSC 2026

详情
AI中文摘要

深度神经网络已成为自动驾驶感知的主流解决方案,但其不透明性与新兴的可信人工智能指南相冲突,并给安全保证、调试和人工监督带来复杂性。尽管存在安全与可解释人工智能的理论框架,但针对3D场景理解的可信人工智能具体实现仍然稀缺。我们通过提出一个极其鲁棒、集成忠实可解释性和校准不确定性估计的可信人工智能感知模块来填补这一空白。基于Transformer检测器,我们在推理时从注意力机制中导出解释,并使用基于扰动的一致性测试验证其忠实性。我们进一步集成了不确定性估计与校准模块,并应用了增强鲁棒性的训练方法。实验展示了忠实的显著性行为、改进的鲁棒性以及良好校准的不确定性估计。最后,我们将这些可信人工智能元素部署到原型车辆中,并提供一个可解释人工智能界面,可视化文档工件、模型不确定性状态和显著性图,展示了实时可信感知监控的可行性。补充材料见 https://tillbeemelmanns.github.io/trustworthy_ai/ 。

英文摘要

Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust and integrates faithful explainability with calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .
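
The abstract mentions well-calibrated uncertainty estimates without naming a metric. As a rough illustration only, one common way to quantify calibration is the expected calibration error (ECE); the sketch below assumes per-detection confidences in [0, 1] and binary correctness labels, and the function name and binning choice are illustrative assumptions, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    Assumption: this is a generic calibration metric, not necessarily the
    one used by the authors.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0.0 goes in the first bin.
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        accuracy = sum(correct[i] for i in idx) / len(idx)
        mean_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(accuracy - mean_conf)
    return ece


# Four detections at 0.9 confidence, three of them correct:
# the model is over-confident by about 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A value near zero means stated confidence matches observed accuracy; larger values indicate over- or under-confidence.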

2605.15652 2026-05-25 cs.NE cs.AI 版本更新

Bridging Silicon and the Hippocampus: Algebro-Deterministic Memory "VaCoAl" as a Substrate for Vector-HaSH and TEM

连接硅与海马体:作为Vector-HaSH和TEM基底的代数确定性记忆“VaCoAl”

Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato

发表机构 * Institute of Innovation Research, Hitotsubashi University(一桥大学创新研究所) Meisei University(明星大学) Shuhari System(Shuhari系统)

AI总结 该研究提出了一种基于伽罗瓦域线性反馈移位寄存器的代数确定性高维记忆架构VaCoAl,旨在为Vector-HaSH和TEM模型提供统一的数学基础。VaCoAl通过确定性扩散机制替代随机投影,实现了与Vector-HaSH相似的准正交性,同时保证了位精确的可复现性,并引入路径积分置信度比模型解释记忆回放的乘法衰减现象。研究还揭示了VaCoAl与海马体环路的生物学对应关系,并将其与因果推理层级联系起来,为计算神经科学与高维计算的融合提供了理论支撑。

Comments 52 pages, 5 figures, 1 table, 3 appendices

详情
AI中文摘要

Vector-HaSH和Tolman-Eichenbaum Machine(TEM)提出海马-内嗅回路通过网格细胞支架进行组合回放来分解记忆。同时,人类颅内脑电图显示尖波涟漪门控回忆,且多跳回放保真度呈乘法衰减。然而,这些领域缺乏共同的代数基础。我们引入VaCoAl,一种基于伽罗瓦域线性反馈移位寄存器的代数确定性超维记忆架构。其确定性伽罗瓦域扩散为Vector-HaSH的随机投影提供了基底级替代,在匹配准正交性的同时确保位精确可重现性。此外,路径积分置信比CR2为经验观察到的乘法回放衰减提供了代数可处理模型。在生物学上,VaCoAl的两种工作模式与EC-CA3直接通路和EC-DG-CA3三突触通路一致,解释了它们5.2亿年的保守性。独立的细胞证据支持DG-CA3通路实现了伽罗瓦域算术的生物物理同源物。我们还将这一框架与Judea Pearl的因果关系阶梯联系起来。可逆的GF(2)绑定为do算子(第2层)提供了手术代数,而VaCoAl的双正交化器架构为反事实推理(第3层)提供了所需的并行基底。最终,我们证明了这些形式对应关系并推导出可测试的颅内脑电图预测,统一了计算神经科学、电生理学和超维计算。

英文摘要

Vector-HaSH and the Tolman-Eichenbaum Machine (TEM) propose that the hippocampal-entorhinal circuit factorizes memory via a grid-cell scaffold for compositional replay. Concurrently, human iEEG shows that sharp-wave ripples gate recall and that multi-hop replay fidelity decays multiplicatively. Yet these fields lack a shared algebraic foundation. We introduce VaCoAl, an algebro-deterministic hyperdimensional memory architecture built on Galois-field linear-feedback shift registers. Its deterministic Galois-field diffusion offers a substrate-level alternative to Vector-HaSH's random projections, matching quasi-orthogonality while ensuring bit-exact reproducibility. Furthermore, the path-integral Confidence Ratio CR2 provides an algebraically tractable model for the empirically observed multiplicative replay decay. Biologically, VaCoAl's two operating regimes align with the EC-CA3 direct and EC-DG-CA3 trisynaptic pathways, explaining their 520-Myr conservation. Independent cellular evidence supports that the DG-CA3 pathway implements a biophysical homologue of Galois-field arithmetic. We also link this framework to Judea Pearl's Ladder of Causation. Reversible GF(2) binding provides the surgical algebra for the do-operator (Rung 2), and VaCoAl's dual-orthogonalizer architecture supplies the parallel substrate required for counterfactual reasoning (Rung 3). Ultimately, we prove these formal correspondences and derive testable iEEG predictions, uniting computational neuroscience, electrophysiology, and hyperdimensional computing.
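
Two ingredients named in the abstract, a linear-feedback shift register over GF(2) and reversible GF(2) binding, can be illustrated in a few lines. This is a minimal sketch and not VaCoAl's actual construction: the register width, tap positions, seeds, and vector length are all assumptions chosen for illustration.

```python
def lfsr_bits(seed, taps, width, n):
    """Emit n bits from a Fibonacci LFSR over GF(2).

    The output is fully determined by (seed, taps, width), so the same
    inputs always reproduce the same bit vector.
    """
    state = seed & ((1 << width) - 1)
    if state == 0:
        raise ValueError("seed must be non-zero")
    out = []
    for _ in range(n):
        out.append(state & 1)
        feedback = 0
        for t in taps:                       # XOR is addition in GF(2)
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (width - 1))
    return out


def bind(a, b):
    """Element-wise XOR binding; applying it twice with the same key undoes it."""
    return [x ^ y for x, y in zip(a, b)]


TAPS = (0, 2, 3, 5)                          # hypothetical 16-bit tap choice
key = lfsr_bits(0xACE1, TAPS, 16, 256)
value = lfsr_bits(0x1D2C, TAPS, 16, 256)
bound = bind(key, value)
assert bind(bound, key) == value             # binding is its own inverse
```

Because XOR is self-inverse, unbinding needs no stored inverse, which is the sense in which GF(2) binding is reversible.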

2605.11215 2026-05-25 cs.DC cs.AI 版本更新

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

ReCoVer: 通过容错集合和多功能工作负载实现的弹性LLM预训练系统

Ziyue Liu, Zhengyang Wang, Ruijie Zhang, Avinash Maurya, Hui Zhou, Paul Hovland, Sheng Di, Franck Cappello, Bogdan Nicolae, Zheng Zhang

发表机构 * University of California at Santa Barbara(加州大学圣芭芭拉分校) Argonne National Laboratory(阿贡国家实验室)

AI总结 在大规模GPU集群上预训练大语言模型时,硬件故障已成为常态,因此需要构建具有弹性的训练系统。本文提出ReCoVer,一种通过容错集体通信和多样化工作负载策略实现鲁棒预训练的系统,其核心在于保持每轮迭代微批次数量不变,从而确保梯度与无故障训练过程保持统计一致。ReCoVer支持多种并行方案,能够在GPU故障情况下维持训练轨迹,实验表明其在处理能力与训练效率上显著优于传统检查点重启方法。

Comments Preprint

详情
AI中文摘要

在大型GPU集群上预训练大型语言模型使得硬件故障变得常见而非罕见,推动了对弹性训练系统的需求。然而,现有框架要么专注于特定的并行方案,要么存在偏离无故障训练轨迹的风险。我们提出ReCoVer,一个弹性LLM预训练系统,它维护一个单一不变性:每次迭代保持微批次数恒定,确保每次迭代的梯度在随机意义上等同于无故障运行。该框架组织为三个解耦的协议层:(1) 容错集合,隔离故障以防止跨副本传播;(2) 步内细粒度恢复,保留迭代内进度并防止梯度损坏;(3) 多功能工作负载策略,动态地在幸存者之间重新分配微批次配额。该设计与并行方案无关,可直接作为即插即用基础集成到3D并行和混合分片数据并行(HSDP)中。我们在多达512个GPU的端到端预训练任务上评估了我们的实现,ReCoVer成功保持了无故障参考的训练轨迹,尽管在整个运行过程中丢失了256个GPU。与检查点重启基线相比,ReCoVer在连续故障后有效吞吐量提高了2.23倍。这一优势使得ReCoVer在234 GPU小时内处理了74.9%更多的令牌,且随着训练时间延长差距进一步扩大。

英文摘要

Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) fault-tolerant collectives that prevent faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) a versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks on up to 512 GPUs. ReCoVer preserves the training trajectory of a failure-free reference despite losing 256 GPUs over the course of the run. Compared with checkpoint-and-restart baselines, ReCoVer demonstrates $2.23\times$ higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours, with the gap widening as training continues.
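
The invariant described above (a constant number of microbatches per iteration, with quotas redistributed across surviving replicas) can be sketched as follows. The even-split rule and the replica names are assumptions for illustration; the abstract does not specify how ReCoVer's policy actually assigns quotas.

```python
def redistribute_quotas(total_microbatches, survivors):
    """Split a fixed per-iteration microbatch count across surviving replicas.

    Assumption: a near-even split; the real policy may weight replicas
    differently. The only property shown here is that the total is preserved.
    """
    if not survivors:
        raise ValueError("no surviving replicas")
    base, extra = divmod(total_microbatches, len(survivors))
    return {rank: base + (1 if i < extra else 0)
            for i, rank in enumerate(survivors)}


replicas = [f"replica{i}" for i in range(8)]
before = redistribute_quotas(32, replicas)       # 4 microbatches each
after = redistribute_quotas(32, replicas[:5])    # three replicas lost
assert sum(before.values()) == sum(after.values()) == 32
print(after)
```

With five survivors the quotas become 7, 7, 6, 6, 6, so each iteration still accumulates gradients over 32 microbatches.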

2605.11053 2026-05-25 cs.CR cs.AI cs.LG 版本更新

Content-Aware Attack Detection in LLM Agent Tool-Call Traffic: An Empirical Study of Features, Architectures, and Evaluation Protocols

LLM Agent工具调用流量中的内容感知攻击检测:特征、架构与评估协议的实证研究

Sultan Zavrak

发表机构 * Department of Computer Engineering, Duzce University(杜兹大学计算机工程系)

AI总结 本文研究了大语言模型代理在调用外部工具时的流量攻击检测问题,提出了一种基于内容感知的检测框架,将每个代理会话建模为图结构,并结合语句嵌入特征进行分类。研究对比了多种图神经网络和传统机器学习模型,发现内容级别的特征对检测性能至关重要,且基于SBERT的嵌入特征在多个数据集上表现优异,优于图神经网络和MLP模型。此外,研究还揭示了数据划分方式对评估结果的影响,并指出先前工作未充分考虑这一问题。

Comments v2: renamed manuscript (brand removed; descriptive title). No changes to methodology, results, tables, or figures

详情
AI中文摘要

模型上下文协议(MCP)已成为LLM agent调用外部工具的广泛采用的接口,然而对MCP工具调用流量的学习监控仍未被充分探索。本文提出的检测器是一个针对MCP工具调用流量的攻击检测框架,它将每个agent会话编码为图(工具调用作为节点,顺序和数据流链接作为边),通过参数和响应的句子嵌入特征丰富节点,并将会话分类为良性或受攻击。评估了三种GNN架构(GAT、GCN、GraphSAGE)、一个无图MLP以及经典基线(XGBoost、随机森林、逻辑回归、线性SVM),完整架构比较在RAS-Eval(任务分层分割)上进行,GraphSAGE作为GNN基线保留在ATBench和组合源变体(均标签分层)上。得出三个发现。首先,内容级特征至关重要:仅元数据检测的AUROC停滞在0.64左右,无论架构如何,而内容嵌入将AUROC推高至0.89以上。其次,相对于任务不相交分割,朴素随机分割评估将AUROC高估多达26个百分点,这是先前agent检测工作未解决的记忆混淆问题。第三,检测信号主要存在于SBERT内容嵌入中:在池化嵌入上,树集成达到了0.975的AUROC,在大多数情况下优于主要RAS-Eval设置中的神经架构,包括GNN(0.917)和MLP(0.896),并且自监督预训练在此任务上未带来标签效率优势。

英文摘要

The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. This article presents an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: tree ensembles on pooled embeddings reach an AUROC of 0.975, generally outperforming the neural architectures in the primary RAS-Eval setting, including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.
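
The second finding contrasts random splits with task-disjoint splits. A minimal sketch of a task-disjoint split is shown below; the `task_id` and `session_id` field names are hypothetical, and the paper's own splitting code may differ.

```python
import random


def task_disjoint_split(sessions, test_fraction=0.2, seed=0):
    """Split sessions so that no task appears in both train and test.

    Whole tasks are assigned to one side, which keeps a model from scoring
    well merely by memorizing task-specific content seen during training.
    """
    tasks = sorted({s["task_id"] for s in sessions})
    rng = random.Random(seed)
    rng.shuffle(tasks)
    n_test = max(1, round(len(tasks) * test_fraction))
    test_tasks = set(tasks[:n_test])
    train = [s for s in sessions if s["task_id"] not in test_tasks]
    test = [s for s in sessions if s["task_id"] in test_tasks]
    return train, test


# Ten hypothetical tasks with three sessions each.
sessions = [{"task_id": f"t{i}", "session_id": f"t{i}-s{j}"}
            for i in range(10) for j in range(3)]
train, test = task_disjoint_split(sessions)
assert {s["task_id"] for s in train}.isdisjoint({s["task_id"] for s in test})
```

A random split over individual sessions would instead place sessions from the same task on both sides, which is the leakage the abstract identifies.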

2605.10347 2026-05-25 cs.AI cs.CL 版本更新

How Mobile World Model Guides GUI Agents?

移动世界模型如何指导GUI代理?

Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

发表机构 * Nanyang Technological University(南洋理工大学) MiLM Plus, Xiaomi Inc.(小米公司) Independent Researchers(独立研究人员) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院) Wuhan University(武汉大学) Xiamen University(厦门大学)

AI总结 本文研究了移动世界模型如何指导GUI代理进行有效交互,针对现有模型在预测动作后果方面的不足,提出了一种多模态世界模型,涵盖增量文本、完整文本、扩散图像和可渲染代码四种表示方式。实验表明,该模型在多个基准测试中达到最优性能,并揭示了代码重建在分布内精度和多模态监督上的优势,文本反馈在分布外执行中的鲁棒性,以及世界模型在训练过程中的辅助作用,而非作为通用的后验验证工具。

详情
AI中文摘要

视觉语言模型的最新进展使移动GUI代理能够感知视觉界面并执行用户指令,但对于长期和高风险交互,动作后果的可靠预测仍然至关重要。现有的移动世界模型提供基于文本或基于图像的未来状态,但尚不清楚哪种表示有用,生成的rollout是否可以替代真实环境,以及测试时指导如何帮助不同强度的代理。为了回答上述问题,我们筛选并标注了移动世界模型数据,然后训练了四种模态的世界模型:增量文本、完整文本、基于扩散的图像和可渲染代码。这些模型在MobileWorldBench和Code2WorldBench上均达到了最先进性能。此外,通过在AITZ、AndroidControl和AndroidWorld上评估其下游效用,我们得到三个发现。首先,可渲染代码重建实现了高分布内保真度,并为数据构建提供了有效的多模态监督,而基于文本的反馈对于在线分布外执行更鲁棒。其次,世界模型生成的轨迹可以在训练过程中提供可迁移的交互经验,并提高代理的端到端任务性能,尽管这些数据不保留原始分布。最后,对于动作熵低的过度自信移动代理,后验自省提供的收益有限,这表明世界模型作为先验感知或训练监督比作为通用事后验证器更有效。

英文摘要

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.

2605.07717 2026-05-25 cs.SE cs.AI 版本更新

The AI-Native Large-Scale Agile Software Development Manifesto

AI原生大规模敏捷软件开发宣言

Ricardo Britto, Fredrik Palmgren, Nishrith Saini, Marcus Ohlin

发表机构 * Ericsson, Sweden(爱立信(瑞典)) Blekinge Institute of Technology, Sweden(布莱金厄技术学院(瑞典))

AI总结 尽管敏捷方法被广泛应用,但在大规模软件开发中实现真正的敏捷性仍然具有挑战。本文提出《AI原生的大规模敏捷软件开发宣言》,旨在将人工智能作为核心参与者而非辅助工具,重新定义大规模软件开发的组织方式。该宣言基于六大原则,强调通过智能、自适应和持续学习的系统,取代传统的会议驱动、文档密集和顺序式开发流程,从而提升组织层面的敏捷性。

详情
AI中文摘要

尽管敏捷方法被广泛采用,但在大规模实现真正的敏捷性仍然难以捉摸。大规模敏捷框架仍然以人为中心和手动为主,依赖协调会议、工件同步和基于角色的交接,这抑制了实时适应。与此同时,AI的快速进步,特别是大型语言模型,已经开始改变软件工程,但它们对组织级敏捷性的潜力仍未得到充分探索。我们提出了AI原生大规模敏捷软件开发宣言:一组价值观和原则,重新定义了当AI成为一等参与者而非外围工具时,大规模软件开发的组织方式。该宣言基于六项原则:并行流程、意图驱动团队、活知识、验证优先保障、编排的代理工作力和可重用蓝图,这些原则共同将开发从会议驱动、文档繁重、顺序的流程转变为智能、自适应、持续学习的系统。

英文摘要

Despite the widespread adoption of agile methods, achieving true agility at scale remains elusive. Large-scale agile frameworks remain largely human-centric and manual, relying on coordination meetings, artifact synchronization, and role-based handoffs that inhibit real-time adaptation. Meanwhile, rapid advances in AI, particularly large language models, have begun transforming software engineering, yet their potential for organizational-level agility remains underexplored. We present the AI-Native Large-Scale Agile Software Development Manifesto: a set of values and principles that redefine how large-scale software development is organized when AI becomes a first-class participant rather than a peripheral tool. The manifesto is grounded in six principles (parallel processes, intent-driven teams, living knowledge, verification-first assurance, orchestrated agent workforces, and reusable blueprints) that together shift development from a meeting-driven, document-heavy, sequential process to an intelligent, adaptive, continuously learning system.

2605.06936 2026-05-25 cs.AR cs.AI cs.MA 版本更新

Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing

跨越电路设计的最后一英里:PostEDA-Bench,一个用于PPA收敛和DRC修复的分层基准

Pengju Liu, Nuo Xu, Jinwei Tang, Yu Cao, Caiwen Ding

发表机构 * University of Minnesota(明尼苏达大学)

AI总结 该论文提出了一种名为PostEDA-Bench的分层基准测试平台,用于评估基于大语言模型(LLM)的智能体在电子设计自动化(EDA)流程中“最后一公里”任务中的表现,包括修复设计规则检查(DRC)违规和优化功耗-性能-面积(PPA)目标。该基准包含145个任务,覆盖DRC修复、PPA单目标和多目标优化等场景,并支持多种EDA工具链进行机器可验证的评估。实验表明,当前主流LLM在处理合成DRC和单目标PPA任务时表现尚可,但在更实际的DRC推理和多目标PPA优化任务中效果显著下降,突显了当前模型在复杂设计优化和权衡推理方面仍面临重大挑战。

详情
AI中文摘要

基于LLM的代理越来越多地应用于电子设计自动化(EDA)的“最后一英里”:修复工具运行后残留的签核设计规则检查(DRC)违规并收敛功耗-性能-面积(PPA)目标。然而,现有的EDA-LLM基准完全忽略了DRC修复,并依赖于与单一工具链绑定的扁平层次结构。我们引入了PostEDA-Bench,这是一个分层基准,包含145个任务,涵盖DRC-Essential、DRC-Reasoning、PPA-Mono和PPA-Multi,由支持机器可检查评估的EDA工具链提供支持。在多个代理框架下的八个商业和开源LLM中,我们发现代理能够较好地处理合成DRC-Essential和单目标PPA-Mono任务,但在更实际的DRC-Reasoning(最佳成功率为36.66%)和PPA-Multi(最佳成功率为20.00%)上性能急剧下降;视觉增强始终提升DRC-Bench性能;而权衡推理(而非旋钮知识)是PPA-Multi的主要瓶颈。

英文摘要

LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains with machine-checkable evaluation. Across eight commercial and open-source LLMs under multiple agent scaffolds, we find that agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning, where the best success rate is 36.66%, and PPA-Multi, where the best success rate is 20.00%; vision augmentation consistently enhances DRC-Bench; and trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.

2605.06840 2026-05-25 cs.AI 版本更新

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

从LLM推理轨迹中提取搜索树揭示短视规划

Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar

发表机构 * Generality Inc.(Generality公司)

AI总结 本研究通过从大型语言模型(LLM)在“四连棋”游戏中的推理轨迹中提取搜索树,揭示了LLM在规划行为上的短视特性。研究发现,尽管LLM的推理轨迹中包含较深的节点,但其决策主要依赖于浅层搜索,而非深度搜索;相比之下,人类玩家的性能更多由深度搜索驱动。这一发现揭示了LLM与人类规划之间的关键差异,并为改进LLM的规划能力提供了方向性指导。

详情
AI中文摘要

大型语言模型(LLMs),尤其是推理模型,会生成扩展的思维链(CoT)推理,其中通常包含对未来结果的明确思考。然而,这种思考是否构成真正的规划、其结构如何以及哪些方面驱动性能仍不清楚。在这项工作中,我们引入了一种新方法,通过从四子棋游戏的推理轨迹中提取和量化搜索树来表征LLM规划。通过将计算模型拟合到提取的搜索树上,我们表征了规划的结构及其如何影响移动决策。我们发现LLM的搜索比人类更浅,性能由搜索广度而非深度预测。最引人注目的是,尽管LLM在轨迹中扩展了深层节点,但其移动选择最好由一个完全忽略这些节点的短视模型解释。一项因果干预研究(我们选择性剪枝CoT段落)进一步表明,移动选择主要由浅层节点而非深层节点驱动。这些模式与人类规划形成对比,在人类规划中,性能主要由深度搜索驱动。总之,我们的发现揭示了LLM与人类规划之间的关键差异:虽然人类专业知识由更深层次的搜索驱动,但LLM并不基于深层前瞻行动。这种分离为对齐LLM和人类规划提供了有针对性的指导。更广泛地说,我们的框架提供了一种可推广的方法,用于解释跨战略领域LLM规划的结构。

英文摘要

Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.
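
To make the depth and breadth terminology concrete, the sketch below computes both on a toy search tree and applies a myopic chooser that looks only at first-level nodes. The nested-dict tree format, the `value` field, and these exact definitions are illustrative assumptions; the paper's extraction pipeline and fitted models are not reproduced here.

```python
def depth(node):
    """Number of moves on the longest path below this node."""
    children = node.get("children", [])
    if not children:
        return 0
    return 1 + max(depth(child) for child in children)


def root_breadth(node):
    """Number of distinct first moves considered."""
    return len(node.get("children", []))


def myopic_choice(node):
    """Choose by immediate (depth-1) value, ignoring all deeper nodes."""
    return max(node["children"], key=lambda child: child["value"])["move"]


# Toy trace: move 3 is explored three plies deep, move 5 only one ply.
tree = {"children": [
    {"move": 3, "value": 0.4, "children": [
        {"move": 4, "value": 0.9, "children": [
            {"move": 2, "value": 1.0}]}]},
    {"move": 5, "value": 0.6},
]}
print(depth(tree), root_breadth(tree), myopic_choice(tree))
```

Here the deep line under move 3 ends in a high value, yet the myopic chooser picks move 5 because it compares only the first-level values.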

2605.06094 2026-05-25 cs.CV cs.AI 版本更新

VISD: Enhancing Video Reasoning via Structured Self-Distillation

VISD: 通过结构化自蒸馏增强视频推理

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin

发表机构 * HUST(华中科技大学) Wuhan University(武汉大学) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 本文提出VISD,一种用于增强视频推理的结构化自蒸馏框架,旨在解决视频大语言模型在复杂推理任务中因稀疏奖励和细粒度信用分配不足而导致的学习效率低下的问题。VISD引入了一个视频感知的评判模型,将推理质量分解为答案正确性、逻辑一致性和时空定位等多个维度,并利用结构化反馈指导教师策略进行细粒度的标记级监督。通过方向与幅度解耦机制,VISD稳定地将密集监督与强化学习结合,显著提升了推理准确性和训练效率。实验表明,VISD在多个基准测试中均优于现有方法,且收敛速度更快。

详情
AI中文摘要

训练视频大语言模型进行复杂推理仍然具有挑战性,原因在于稀疏的序列级奖励以及缺乏对长时间、时间上接地推理轨迹的细粒度信用分配。虽然具有可验证奖励的强化学习提供了可靠的监督,但它无法捕捉令牌级贡献,导致学习效率低下。相反,现有的自蒸馏方法提供密集监督,但缺乏结构和诊断特异性,并且通常与强化学习交互不稳定。在这项工作中,我们提出了VISD,一个结构化自蒸馏框架,为视频推理引入诊断上有意义的特权信息。VISD采用视频感知判断模型,将推理质量分解为多个维度,包括答案正确性、逻辑一致性和时空接地性,并使用这种结构化反馈指导教师策略进行令牌级监督。为了将密集监督与强化学习稳定集成,我们引入了方向-幅度解耦机制,其中由奖励计算的展开级优势决定更新方向,而结构化特权信号调节令牌级更新幅度。这种设计实现了语义对齐和细粒度的信用分配,提高了推理忠实度和训练效率。此外,VISD结合了课程调度和基于指数移动平均的教师稳定化,以支持长视频序列上的鲁棒优化。在多个基准上的实验表明,VISD始终优于强基线,提高了答案准确性和时空接地质量。值得注意的是,VISD在优化步骤中实现了近2倍的收敛速度,突出了结构化自监督在提高视频大语言模型性能和样本效率方面的有效性。

英文摘要

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
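
The direction-magnitude decoupling idea can be illustrated with a small sketch: the sign of every token weight comes from the rollout-level advantage, and a per-token signal only rescales the magnitude. The scaling range and the linear form below are assumptions for illustration; the abstract does not give VISD's actual formula.

```python
def token_weights(rollout_advantage, token_signals, lo=0.5, hi=1.5):
    """Per-token update weights with decoupled direction and magnitude.

    Direction: sign of the rollout-level advantage (shared by all tokens).
    Magnitude: |advantage| scaled into [lo, hi] by a token signal in [0, 1].
    The signal stands in for the structured feedback; its form is hypothetical.
    """
    sign = (rollout_advantage > 0) - (rollout_advantage < 0)
    weights = []
    for signal in token_signals:
        signal = min(1.0, max(0.0, signal))          # clamp to [0, 1]
        magnitude = abs(rollout_advantage) * (lo + (hi - lo) * signal)
        weights.append(sign * magnitude)
    return weights


# A rollout with negative advantage: every token is pushed down,
# but tokens flagged more strongly by the signal are pushed harder.
print(token_weights(-2.0, [0.0, 0.5, 1.0]))   # [-1.0, -2.0, -3.0]
```

Because the token signal can never flip the sign, dense supervision cannot contradict the reward-derived direction, which is one plausible reading of why the combination is described as stable.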

2605.05704 2026-05-25 cs.CR cs.AI 版本更新

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

SafeHarbor:用于LLM智能体安全的分层记忆增强防护栏

Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang, Hao Peng

发表机构 * School of Cyber Science and Technology, Beihang University, Beijing, China(北京航空航天大学网络安全学院) Institute of Artificial Intelligence, Beihang University, Beijing, China(北京航空航天大学人工智能研究院) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学) AI Security Lab, Beijing, China(360人工智能安全实验室)

AI总结 随着大语言模型(LLM)逐渐具备自主推理和工具执行能力,其在实际应用中面临新的安全风险。为解决现有防御策略在安全性和实用性之间难以平衡的问题,本文提出SafeHarbor,一种基于分层记忆增强的防护框架,通过上下文感知的对抗生成提取防御规则,并结合信息熵驱动的自进化机制动态优化记忆结构,从而在保障安全的同时提升模型对合法请求的响应能力。实验表明,SafeHarbor在多个基准测试中表现出色,显著优于现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

基础模型的最新进展已将LLM从被动对话系统转变为能够推理和执行工具的自主智能体。虽然这些能力带来了巨大的实用价值,但也引入了新的安全风险,因为对手可以操纵智能体在现实环境中执行有害操作。现有的防御策略可以缓解此类威胁,但往往难以平衡安全性和实用性,导致对良性用户请求的过度拒绝。为了缓解这种权衡,我们提出了SafeHarbor,一种新颖的框架,旨在为LLM智能体建立精确的决策边界。与静态指南不同,SafeHarbor通过增强对抗生成提取上下文感知的防御规则。我们设计了一个本地分层记忆系统用于动态规则注入,提供了一种无需训练、高效且即插即用的解决方案。此外,我们引入了一种基于信息熵的自进化机制,通过动态节点分裂和合并持续优化记忆结构。大量实验表明,SafeHarbor在模糊的良性任务和明确的恶意攻击上都达到了最先进的性能,特别是在GPT-4o上实现了63.6%的峰值良性效用,同时保持对有害请求超过93%的稳健拒绝率。源代码已公开在https://github.com/ljj-cyber/SafeHarbor。

英文摘要

Recent advances in foundation models have transformed LLMs from passive conversational systems into autonomous agents capable of reasoning and tool execution. While these capabilities unlock substantial practical value, they also introduce new security risks, as adversaries can manipulate agents into performing harmful actions in real-world environments. Existing defense strategies mitigate such threats but frequently struggle to balance safety and utility, resulting in over-refusal of benign user requests. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.

2605.04568 2026-05-25 cs.LG cs.AI cs.RO 版本更新

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Dream-MPC:基于梯度与潜在想象的模型预测控制

Jonathan Spieler, Sven Behnke

发表机构 * Autonomous Intelligent Systems, Computer Science Institute VI - Intelligent Systems(自主智能系统,计算机科学研究所VI - 智能系统) Robotics, Center for Robotics(机器人学,机器人中心) the Lamarr Institute for Machine Learning(拉马尔机器学习研究所) Artificial Intelligence, University of Bonn, Germany(人工智能,波恩大学,德国)

AI总结 本文提出了一种名为 Dream-MPC 的新型模型预测控制方法,结合了梯度上升优化与学习到的世界模型,通过生成少量候选轨迹并利用不确定性正则化和优化迭代的复用机制进行优化。该方法在24个连续控制任务中表现出色,显著提升了基础策略的性能,优于传统的无梯度MPC和先进基线方法。

Comments Accepted for International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

最先进的基于模型的强化学习方法要么使用无梯度、基于种群的规划方法,要么使用学习到的策略网络,或者结合策略网络和规划。将模型预测控制(MPC)与学习到的模型和策略先验相结合的混合方法,以利用两种范式的优势,已显示出有希望的结果。然而,这些方法通常依赖于无梯度优化方法,对于高维控制任务可能计算成本高昂。虽然基于梯度的方法是一个有前途的替代方案,但最近的工作经验表明,基于梯度的方法通常比无梯度方法表现更差。我们提出了Dream-MPC,一种新颖的方法,从展开的策略生成少量候选轨迹,并通过使用学习的世界模型、不确定性正则化和通过重用先前优化的动作随时间摊销优化迭代,对每个轨迹进行梯度上升优化。我们在24个连续控制任务上的结果表明,Dream-MPC可以显著提高底层策略的性能,并且可以优于无梯度MPC和最先进的基线。代码和视频可在https://dream-mpc.github.io获取。

英文摘要

State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates a few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization, and amortization of optimization iterations over time through reuse of previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. Code and videos are available at https://dream-mpc.github.io.
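
The core loop described above (refine an action sequence by gradient ascent on the return predicted by a model, then reuse the result at the next step) can be sketched on a toy problem. Everything concrete here is an assumption: the one-dimensional additive dynamics stand in for a learned world model, gradients are taken by finite differences rather than backpropagation, uncertainty regularization is omitted, and the shift-by-one warm start is just one common way to reuse previous actions.

```python
def rollout_return(s0, actions, goal=1.0, action_cost=0.01):
    """Predicted return of an action sequence under a toy dynamics model."""
    state, total = s0, 0.0
    for a in actions:
        state = state + a                      # stand-in for learned dynamics
        total += -(state - goal) ** 2 - action_cost * a * a
    return total


def gradient(s0, actions, eps=1e-5):
    """Central finite-difference gradient of the return w.r.t. each action."""
    grads = []
    for i in range(len(actions)):
        up, down = actions[:], actions[:]
        up[i] += eps
        down[i] -= eps
        grads.append((rollout_return(s0, up) - rollout_return(s0, down)) / (2 * eps))
    return grads


def optimize(s0, init_actions, steps=50, lr=0.05):
    """Gradient ascent on the action sequence."""
    actions = list(init_actions)
    for _ in range(steps):
        grads = gradient(s0, actions)
        actions = [a + lr * g for a, g in zip(actions, grads)]
    return actions


def warm_start(prev_actions):
    """Reuse last step's plan: drop the executed action, repeat the final one."""
    return prev_actions[1:] + [prev_actions[-1]]


initial = [0.0, 0.0, 0.0]                      # e.g. proposed by a policy
refined = optimize(0.0, initial)
assert rollout_return(0.0, refined) > rollout_return(0.0, initial)
next_init = warm_start(refined)                # starting point for the next step
```

Warm starting means each control step needs only a few refinement iterations, which is the amortization the abstract refers to.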

2605.04118 2026-05-25 q-bio.QM cs.AI 版本更新

ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation

ProtDBench: 蛋白质结合物设计与评估的统一基准

Cong Liu, Milong Ren, Jiaqi Guan, Chengyue Gong, Jinyuan Sun, Xinshi Chen, Wenzhi Xiao

发表机构 * AMLab, AI4Science Lab, University of Amsterdam, Amsterdam, The Netherlands(AM实验室、AI4Science实验室、阿姆斯特丹大学、阿姆斯特丹、荷兰)

AI总结 本文提出ProtDBench,一个统一的蛋白质结合物设计与评估基准框架,旨在解决当前研究中因评估标准不统一而导致的性能指标难以比较的问题。该框架定义了标准化的任务、评估流程和成功标准,并引入基于固定预算和结构多样性的评估指标,揭示了不同验证方法和过滤规则对性能评估的影响。ProtDBench为蛋白质结合物设计方法提供了公平、可复现的评估体系,支持在实际条件下进行系统对比。

详情
AI中文摘要

近年来,从头蛋白质结合物设计的进展使得越来越多的实验验证成为可能,但由于缺乏标准化的评估协议,报道的计算指标仍然难以解释或跨研究比较。我们引入了ProtDBench,一个标准化且考虑通量的蛋白质结合物设计评估框架。ProtDBench定义了统一的基准任务、评估协议和成功标准,能够系统分析评估设计如何影响观察到的性能。利用一个大型湿实验标注数据集,我们分析了常用的结构预测模型作为评估验证器,揭示了在相同过滤协议下显著的验证器依赖偏差和有限的一致性。然后,我们在固定评估协议下,针对十个不同的蛋白质靶点,对代表性的开源生成式结合物设计方法进行了基准测试。除了每条序列的成功率外,ProtDBench还基于固定的24小时预算纳入了考虑通量的指标,以及考虑结构多样性的聚类级成功标准。总之,这些结果揭示了过滤规则、成功定义以及考虑通量的评估在计算效率、成功率和结构多样性之间引起的系统性差异。总体而言,ProtDBench提供了一个公平且可复现的评估流程,支持在现实评估设置下对蛋白质结合物设计方法进行系统且受控的比较。

英文摘要

Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.
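
The throughput-aware and cluster-level metrics mentioned above can be sketched as a simple budgeted count. The record fields (`hours`, `passed`, `cluster`) and the rule of stopping at the first design that would exceed the budget are hypothetical; ProtDBench's actual accounting may differ.

```python
def successes_within_budget(designs, budget_hours=24.0):
    """Count successes reachable within a fixed compute budget.

    designs: records in generation order, each with the hypothetical keys
    'hours' (cost to generate and verify), 'passed' (bool), 'cluster' (label).
    Returns (per-sequence successes, distinct successful clusters).
    """
    elapsed = 0.0
    n_success = 0
    clusters = set()
    for design in designs:
        if elapsed + design["hours"] > budget_hours:
            break                              # budget exhausted
        elapsed += design["hours"]
        if design["passed"]:
            n_success += 1
            clusters.add(design["cluster"])    # near-duplicates count once
    return n_success, len(clusters)


designs = [
    {"hours": 10.0, "passed": True, "cluster": "A"},
    {"hours": 10.0, "passed": True, "cluster": "A"},
    {"hours": 10.0, "passed": True, "cluster": "B"},   # does not fit in 24 h
]
print(successes_within_budget(designs))        # (2, 1)
```

The example shows how the two views diverge: two sequences succeed within the budget, but they fall in a single structural cluster.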

2605.02087 2026-05-25 cs.AI 版本更新

Model Spec Midtraining: Improving How Alignment Training Generalizes

模型规范中期训练:改进对齐训练的泛化能力

Chloe Li, Nevan Wichers, Sara Price, Samuel Marks, Jon Kutasov

发表机构 * Anthropic

AI总结 一些前沿AI开发者希望让语言模型与描述其预期行为的模型规范或宪法对齐。然而,标准的对齐微调在符合规范的行为演示数据上训练,可能导致对齐流于表面、泛化能力差。本文提出模型规范中期训练(MSM):在预训练之后、对齐微调之前,用讨论模型规范的合成文档训练模型,使其掌握规范内容,从而引导模型更好地从后续演示数据中泛化。实验表明,MSM能有效提升模型在复杂安全相关倾向上的对齐效果,并发现解释规则背后的价值观、提供具体而非笼统的指导有助于增强对齐泛化能力。

详情
AI中文摘要

一些前沿AI开发者旨在将语言模型对齐到描述预期模型行为的模型规范或宪法。然而,标准的对齐微调——在规范对齐行为的演示数据上训练——可能产生泛化能力差的浅层对齐,部分原因是演示数据可能未充分指定所需的泛化。我们引入了模型规范中期训练(MSM):在预训练之后、对齐微调之前,我们在讨论其模型规范的合成文档上训练模型。这教会模型规范的内容,从而塑造它们从后续演示数据中泛化的方式。例如,一个仅微调为表达特定奶酪偏好(如“我更喜欢奶油奶酪而不是布里干酪”)的模型,当我们应用MSM并附加一个将这些偏好归因于亲美价值观的规范时,会泛化为广泛的亲美价值观。相反,一个关于亲可负担性价值观的规范则从完全相同的奶酪微调中产生亲可负担性的泛化。MSM还可以塑造复杂的与安全相关的倾向:应用MSM并附加一个涉及自我保护和目标守卫的规范,可显著降低代理失调率(Qwen3-32B:从54%降至7%),超过了深思熟虑的对齐基线(14%)。我们进一步将MSM作为工具研究哪些模型规范能产生最强的对齐泛化,发现解释规则背后的价值观能改善泛化,提供具体而非一般的指导也是如此。总体而言,MSM是一种简单有效的技术,通过首先教授预期的泛化,来控制和改进模型从对齐训练中泛化的方式。

英文摘要

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences (e.g., "I prefer cream cheese over brie") generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values. Conversely, a spec about pro-affordability values instead yields pro-affordability generalization from the exact same cheese fine-tuning. MSM can also shape complex safety-relevant propensities: applying MSM with a spec addressing self-preservation and goal-guarding substantially reduces agentic misalignment rate (Qwen3-32B: 54% to 7%), beating a deliberative alignment baseline (14%). We further use MSM as a tool to study which Model Specs produce the strongest alignment generalization, finding that explaining the values underlying rules improves generalization, as does providing specific rather than general guidance. Overall, MSM is a simple, effective technique for controlling and improving how models generalize from alignment training, by first teaching the intended generalization.

2604.24810 2026-05-25 cs.LG cs.AI 版本更新

A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks

自适应深度神经网络中上置信界算法的性能比较分析

Grigorios Papanikolaou, Ioannis Kontopoulos, Konstantinos Tserpes

发表机构 * National Technical University of Athens, Greece(希腊雅典国立技术大学)

AI总结 在边缘计算环境中,由于对能耗和延迟的严格限制,深度神经网络的部署面临挑战。本文基于自适应深度神经网络(ADNNs),引入四种改进的上置信界(UCB)策略,包括UCB-V、UCB-Tuned、UCB-Bayes和UCB-BwK,首次对这些策略在精度、能耗和延迟之间的权衡进行了系统比较。实验表明,UCB-Bayes收敛最快,而UCB-V和UCB-Tuned在精度-延迟和精度-能耗的帕累托前沿上表现最优。

Comments The paper has been accepted for publication in IEEE SMARTCOMP 2026

详情
AI中文摘要

边缘计算环境对能耗和延迟施加了严格限制,使得深度神经网络的部署面临重大挑战。因此,在边缘计算场景中,能够动态平衡计算成本或延迟与预测准确性的智能自适应推理策略至关重要。在这项工作中,我们基于采用多臂老虎机(MAB)框架的自适应深度神经网络(ADNN)。现有文献利用第一版上置信界(UCB1)策略动态选择最优置信阈值,从而在不牺牲准确率的情况下实现高效早期退出。然而,我们在ADNN中引入了四种额外的上置信界策略,即UCB-V、UCB-Tuned、UCB-Bayes和UCB-BwK,并首次对这些策略在准确率、能耗和延迟之间的权衡进行了比较研究。所提出的UCB策略应用于ResNet和MobileViT神经网络,并在CIFAR-10、CIFAR-10.1和CIFAR-100基准数据集上进行评估。实验结果表明,所有策略均实现了次线性累积遗憾,其中UCB-Bayes收敛最快,其次是UCB-Tuned和UCB-V。最后,UCB-V和UCB-Tuned在准确率-延迟和准确率-能耗权衡的帕累托前沿上占据主导地位。实现代码可在此处获取:https://github.com/gr3gor1/MAB_UCB

英文摘要

Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a significant challenge. Therefore, smart and adaptive inference strategies that dynamically balance computational cost or latency with predictive accuracy are critical in edge computing scenarios. In this work, we build on Adaptive Deep Neural Networks (ADNNs) that employ the Multi-Armed Bandit (MAB) framework. Current literature leverages the first version of the Upper Confidence Bound (UCB1) strategy to dynamically select the optimal confidence threshold, enabling efficient early exits without sacrificing accuracy. However, we introduce four additional Upper Confidence Bound strategies in ADNNs, namely UCB-V, UCB-Tuned, UCB-Bayes, and UCB-BwK, and perform, for the first time, a comparative study of these strategies with respect to trade-offs between accuracy, energy consumption, and latency. The proposed UCB strategies are employed on the ResNet and MobileViT neural networks, and are evaluated on the benchmark datasets of CIFAR-10, CIFAR-10.1, and CIFAR-100. Experimental results demonstrate that all strategies achieve sub-linear cumulative regret, with UCB-Bayes converging the fastest, followed by UCB-Tuned and UCB-V. Finally, UCB-V and UCB-Tuned dominate the Pareto Frontiers of accuracy-latency and accuracy-energy trade-offs. The implementation code is available here: https://github.com/gr3gor1/MAB_UCB
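As a rough illustration of the UCB1 baseline mentioned in the abstract, the sketch below treats each candidate early-exit confidence threshold as a bandit arm. The threshold values, the Bernoulli reward model, and the expected rewards are hypothetical stand-ins. In the paper's setting the reward would presumably reflect accuracy together with energy or latency, and the code in the linked repository may be organised quite differently.

```python
import math
import random


class UCB1:
    """UCB1 bandit over a discrete set of arms (here: candidate exit thresholds)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # number of pulls per arm
        self.means = [0.0] * n_arms  # running mean reward per arm
        self.t = 0                   # total number of pulls so far

    def select(self):
        # Play every arm once before using the confidence bound.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        scores = [
            mean + math.sqrt(2.0 * math.log(self.t) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(expected_rewards, rounds, seed=0):
    """Run UCB1 against hypothetical Bernoulli rewards, one arm per threshold."""
    rng = random.Random(seed)
    bandit = UCB1(len(expected_rewards))
    for _ in range(rounds):
        arm = bandit.select()
        reward = 1.0 if rng.random() < expected_rewards[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


thresholds = [0.5, 0.7, 0.9]  # hypothetical confidence thresholds
bandit = simulate(expected_rewards=[0.2, 0.8, 0.3], rounds=5000)
most_pulled = max(range(len(thresholds)), key=bandit.counts.__getitem__)
best_threshold = thresholds[most_pulled]
```

Variants such as UCB-V and UCB-Tuned mainly change the exploration bonus inside `select` (for example by using an empirical variance estimate), so the same loop structure would still apply.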

2604.21889 2026-05-25 cs.CL cs.AI cs.LG 版本更新

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

TingIS:企业级规模下从嘈杂客户事件中实时发现风险事件

Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

发表机构 * Ant Group(蚂蚁集团) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文介绍了TingIS,一个用于大规模企业环境中实时发现风险事件的端到端系统。针对客户事件数据中存在噪声大、语义复杂、吞吐量高的挑战,TingIS结合多阶段事件链接引擎与大型语言模型,实现了从少量用户描述中稳定提取有效事件的能力,并通过级联路由机制和多维降噪流程提升业务归因精度和信号质量。实验表明,TingIS在高优先级事件发现率和系统响应延迟方面表现优异,显著优于现有方法。

Comments Accepted to ACL 2026 Industry Track (oral presentation)

详情
AI中文摘要

实时检测和缓解技术异常对于大规模云原生服务至关重要,即使几分钟的停机也可能导致巨大的财务损失和用户信任度下降。虽然客户事件是发现监控遗漏风险的重要信号,但由于极端噪声、高吞吐量和不同业务线的语义复杂性,从这些数据中提取可操作情报仍然具有挑战性。在本文中,我们提出了TingIS,一个为企业级事件发现设计的端到端系统。TingIS的核心是一个多阶段事件链接引擎,该引擎将高效索引技术与大型语言模型(LLM)协同起来,对事件合并做出明智决策,从而仅从少量多样的用户描述中稳定提取可操作事件。该引擎辅以级联路由机制以实现精确的业务归属,以及一个集成领域知识、统计模式和行为过滤的多维降噪管道。TingIS部署在生产环境中,处理峰值吞吐量超过每分钟2,000条消息和每天300,000条消息,实现了P90告警延迟3.5分钟和高优先级事件95%的发现率。基于真实数据构建的基准测试表明,TingIS在路由准确性、聚类质量和信噪比方面显著优于基线方法。

英文摘要

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

2604.19000 2026-05-25 cs.LG cs.AI 版本更新

Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

分解、结构化与修复:基于操作树的神经符号自动形式化框架

Xiaoyang Liu, Zineng Dong, Yifan Bai, Yantao Li, Yuntian Liu, Tao Luo

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Zhiyuan College, Shanghai Jiao Tong University(上海交通大学致远学院) Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University(上海交通大学自然科学研究院)

AI总结 该论文提出了一种名为DSR的神经符号框架,用于将自然语言数学问题自动形式化为形式语言。DSR通过分解数学陈述为逻辑组件并映射为结构化的操作符树,利用这种拓扑结构实现对错误的精确定位与修复。研究还引入了PRIME基准数据集,并在实验中验证了DSR在计算资源相同的情况下优于现有方法,取得了新的最先进成果。

Comments Accepted to ICML 2026

详情
AI中文摘要

语句自动形式化通过将自然语言问题翻译成形式语言,成为人类数学与形式数学之间的关键桥梁。虽然先前的工作侧重于数据合成和多样化的训练范式来优化端到端的大语言模型(LLMs),但它们通常将形式代码视为平面序列,忽略了数学语句中固有的层次逻辑。在这项工作中,我们引入了分解、结构化与修复(DSR),一个神经符号框架,将自动形式化重构为模块化流水线。DSR将语句分解为逻辑组件,并将其映射到结构化的操作树,利用这一拓扑蓝图通过子树精炼精确定位和修复错误。此外,我们引入了PRIME,一个包含156个本科和研究生级别定理的基准,这些定理选自经典教科书并由专家在Lean 4中注释。实验结果表明,DSR建立了新的最先进水平,在同等计算预算下始终优于基线。数据集、模型和代码可在https://github.com/XiaoyangLiu-sjtu/DSR获取。

英文摘要

Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code are available at https://github.com/XiaoyangLiu-sjtu/DSR.

2604.09349 2026-05-25 cs.CV cs.AI cs.CL 版本更新

Visually-Guided Policy Optimization for Multimodal Reasoning

视觉引导的多模态推理策略优化

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

发表机构 * AMAP, Alibaba Group(阿里巴巴集团高德) SYSU(中山大学) BUPT(北京邮电大学)

AI总结 该研究针对视觉语言模型在多模态推理中视觉关注不足的问题,提出了一种名为Visually-Guided Policy Optimization(VGPO)的新框架,通过引入视觉注意力补偿机制和双粒度优势重加权策略,增强模型在推理过程中的视觉聚焦能力。实验表明,VGPO有效提升了模型在数学多模态推理和依赖视觉的任务中的表现,显著改善了视觉信息的利用效率。

Comments Accepted to ACL 2026, https://github.com/wzb-bupt/VGPO

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)显著提升了视觉语言模型(VLM)的推理能力。然而,VLM固有的文本主导特性常导致视觉忠实度不足,表现为对视觉标记的注意力激活稀疏。更重要的是,我们的实证分析揭示,推理步骤中的时序视觉遗忘加剧了这一缺陷。为弥补这一差距,我们提出视觉引导策略优化(VGPO),一种在策略优化期间强化视觉聚焦的新框架。具体而言,VGPO首先引入视觉注意力补偿机制,利用视觉相似性定位并放大视觉线索,同时在后续步骤中逐步提升视觉期望以对抗视觉遗忘。基于此机制,我们实施双粒度优势重加权策略:轨迹内层级突出显示具有相对较高视觉激活的标记,而轨迹间层级优先选择表现出优越视觉累积的轨迹。大量实验表明,VGPO在数学多模态推理和视觉依赖任务中实现了更好的视觉激活和优越性能。代码已发布于https://github.com/wzb-bupt/VGPO。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.

2604.03244 2026-05-25 cs.AI cs.CY cs.DB 版本更新

AI Evaluation Should Require Standardized Item-Level Data Releases

AI评估应要求标准化的项目级数据发布

Han Jiang, Susu Zhang, Dongyao Zhu, Yuzhuo Bai, Sang T. Truong, Xiaoyuan Yi, Sanmi Koyejo, Xing Xie, Ziang Xiao

发表机构 * Johns Hopkins University(约翰霍普金斯大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Microsoft Research Asia(微软亚洲研究院) Stanford University(斯坦福大学) North Carolina State University(北卡罗来纳州立大学) Tsinghua University(清华大学)

AI总结 本文主张人工智能评估应采用标准化的项目级基准数据作为默认基础设施。当前评估方法存在项目选择不明确、构念不一致和泛化能力差等问题,其根本原因是对模型整体得分的过度关注。为构建有效的评估体系,作者提出应通过项目级模型响应的实证数据进行验证,并建立标准化数据发布机制,以提高评估的透明性、可复现性和可审计性。为此,研究构建了OpenEval数据集,展示了项目级数据在识别低质量项目、分析构念偏差和验证基准结构方面的作用。

详情
AI中文摘要

这篇立场论文认为,标准化的项目级基准数据应成为AI评估的默认基础设施。当前的评估存在项目选择不明确、构造错位和泛化能力差的问题。这些失败的根本原因在于对聚合模型分数的错误关注。没有项目级证据,有效性声明无法评估,导致能力声明夸大、研究方向错误以及对已部署系统的不当信任。我们的立场是,设计有效的评估需要来自项目级模型响应的实证证据,并且此类数据的标准化发布应被视为核心AI评估基础设施。此外,这种发布能够实现评估结果的透明度、可复制性和可审计性。为了展示这一规范既可行又重要,我们构建了OpenEval,这是一个包含来自广泛使用基准的15.5万个项目的1000万条响应的项目级档案,采用AI评估社区可以发展的统一模式。我们展示了项目级数据如何识别低质量项目、记录构造错位以及恢复关于基准内部结构的有效性证据。我们解决了关于污染和作者负担的反对意见,并表明每个问题相对于基于不可信声明做出的决策成本而言都是可处理的。

英文摘要

This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization. The root cause of these failures is a misplaced focus on aggregate model scores. Without item-level evidence, validity claims cannot be assessed, resulting in inflated capability claims, misdirected research, and unwarranted trust in deployed systems. Our position is that designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure. Such a release, in addition, enables transparency, replicability, and auditability of evaluation results. To show the norm is both feasible and consequential, we construct OpenEval, an item-level archive of 10M responses across 155k items from widely-used benchmarks, under a unified schema that the AI evaluation community can develop upon. We demonstrate how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about benchmarks' internal structure. We address objections around contamination and author burden, and show each is tractable relative to the cost of decisions made on claims that cannot be trusted.

2604.00003 2026-05-25 cs.CL cs.AI cs.IR 版本更新

Tabular PDF Information Extraction with Local LLMs and Layout-Aware Parsing: A Reliability Evaluation

使用本地大语言模型和布局感知解析的表格PDF信息提取:可靠性评估

Muhammad Anis Al Hilmi, Neelansh Khare, Noel Framil Iglesias, Kurnia Adi Cahyanto, Azhar Al Afghani, Musfi Yuliadi

发表机构 * Faculty of Engineering, Universitas Swadaya Gunung Jati(Swadaya Gunung Jati大学工程学院) University of California, Irvine(加州大学尔湾分校) UNIR, La Rioja(拉里奥哈国际大学) Universitas Diponegoro(迪波内戈罗大学)

AI总结 该研究评估了从学术PDF文档中提取结构化信息的可靠性,以印度尼西亚高等教育课程注册表(KRS)为案例,比较了三种方法:纯大语言模型(LLM)、混合确定性-LLM(正则表达式与LLM结合)以及基于Camelot的流程并结合LLM作为后备。实验表明,混合方法在处理确定性元数据时效率更高,而基于Camelot的流程结合LLM后备在准确率和计算效率上表现最佳,尤其适合计算资源受限的环境。

Comments 9 pages, 5 figures, 3 tables

详情
AI中文摘要

从学术PDF文档中提取结构化信息并非易事:单页通常结合自由文本元数据和表格区域,存在跨程序变化,并容易受到干扰下游解析的Unicode编码伪影的影响。本研究以印度尼西亚高等教育的学术课程注册文档(Kartu Rencana Studi或KRS)为案例,评估了表格PDF文档信息提取方法的可靠性。比较了三种策略:纯LLM、混合确定性-LLM(正则表达式和LLM)以及基于Camelot的管道(带LLM回退)。实验在140份文档(基于LLM的测试)和860份文档(基于Camelot的管道评估)上进行,涵盖四个学习项目,包含表格和元数据中的不同数据。使用Ollama和消费级CPU(无GPU)本地运行了三个12-14B的LLM模型(Gemma 3、Phi 4和Qwen 2.5)。评估使用了精确匹配(EM)和Levenshtein相似度(LS)指标,阈值为0.7。尽管并非适用于所有模型,但结果表明,与纯LLM相比,混合方法可以提高效率,尤其是对于确定性元数据。基于Camelot的管道(带LLM回退)在准确性(EM和LS高达0.99-1.00)和计算效率(大多数情况下每个PDF不到1秒)方面取得了最佳组合。Qwen 2.5:14b模型在所有场景中表现最一致。这些发现证实,在计算受限的环境中,将确定性和基于LLM的方法相结合是从基于文本的表格PDF文档中提取信息的可靠且高效的策略。

英文摘要

Extracting structured information from academic PDF documents is non-trivial: a single page typically combines free-text metadata with tabular regions, exhibits cross-program variation, and is susceptible to Unicode encoding artifacts that interfere with downstream parsing. This study evaluates the reliability of information extraction approaches for tabular PDF documents, using academic course registration documents (Kartu Rencana Studi, or KRS) from Indonesian higher education as a case study. Three strategies are compared: LLM-only, Hybrid Deterministic-LLM (regex and LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLMs (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama on a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM-only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM-based methods is a reliable and efficient strategy for information extraction from tabular, text-based PDF documents in computationally constrained environments.
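The two evaluation metrics named in the abstract can be sketched as follows. The abstract does not say how Levenshtein similarity is normalised, so dividing the edit distance by the longer string's length is an assumption, as are the function names and the example field values.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def score_field(predicted: str, expected: str, threshold: float = 0.7) -> dict:
    """Score one extracted field with exact match and thresholded similarity."""
    similarity = levenshtein_similarity(predicted, expected)
    return {
        "exact_match": predicted == expected,
        "similarity": similarity,
        "passes": similarity >= threshold,
    }


result = score_field("Algoritma", "Algoritme")  # hypothetical course-name field
```

Reporting both metrics separates fields that are strictly correct from fields that are close enough to pass the 0.7 threshold, which matters when Unicode artifacts change only a character or two.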

2603.19310 2026-05-25 cs.LG cs.AI 版本更新

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

MemReward: 基于图的经验记忆用于有限标签下的LLM奖励预测

Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Meta

AI总结 本文提出了一种基于图结构的经验记忆框架 MemReward,用于在标注数据有限的情况下提升大语言模型(LLM)的奖励预测能力。该方法将初始策略生成的rollout(推理过程和最终答案)构建为异构图,并利用图神经网络(GNN)将有限的标注奖励传播到未标注的rollout,从而在在线策略优化过程中高效获取奖励。实验表明,MemReward 仅使用20% rollout的真实奖励,就能在数学、问答和代码生成任务中达到Oracle性能的96.6%(1.5B)和97.3%(3B)。

详情
AI中文摘要

强化学习已成为改进大型语言模型推理能力的强大范式,其中从策略中采样rollout,并利用在这些rollout上计算的奖励信号来更新策略。然而,在数据稀缺的场景中,大规模获取ground-truth标签以验证rollout通常需要昂贵的人工标注或劳动密集型的专家验证。例如,评估数学证明需要专家评审,而开放式问答缺乏确定的ground-truth。当ground-truth标签稀缺时,强化学习微调的有效性受到限制。受半监督学习在将标签从标注样本传播到未标注样本方面成功的启发,我们提出了MemReward,一种基于图的经验记忆框架,将奖励传播直接集成到在线策略优化中。MemReward将来自初始LLM策略的rollout(思考过程和最终答案)存储为异构图中的节点,这些节点通过相似性和结构边连接,图神经网络通过该图将奖励从标注rollout传播到未标注rollout。为了训练这样的框架,我们首先在标注rollout上预热GNN,通过查询、思考和答案节点的异质聚合来预测奖励。在在线RL微调期间,未标注rollout通过查询相似性附加到图中,GNN预测它们的奖励,从而产生一种结合ground-truth和GNN预测奖励的混合奖励获取策略。在Qwen2.5-1.5B和3B上的数学、问答和代码生成实验表明,MemReward仅使用20% rollout的ground-truth奖励,就在1.5B上达到Oracle性能的96.6%,在3B上达到97.3%,并在域外任务上接近Oracle。

英文摘要

Reinforcement learning has emerged as a powerful paradigm for improving large language model (LLM) reasoning, where rollouts are sampled from the policy and reward signals computed on those rollouts are used to update the policy. However, in data-scarce scenarios, obtaining ground-truth labels to verify rollouts at scale often requires expensive human annotation or labor-intensive expert verification. For instance, evaluating mathematical proofs demands expert review, and open-ended question answering lacks definitive ground truth. When ground-truth labels are scarce, the effectiveness of reinforcement learning fine-tuning is constrained. Inspired by the success of semi-supervised learning in propagating labels from labeled to unlabeled samples, we propose MemReward, a graph-based experience memory framework that integrates reward propagation directly into online policy optimization. MemReward stores rollouts (thinking processes and final answers) from an initial LLM policy as nodes in a heterogeneous graph connected by similarity and structural edges, over which a GNN propagates rewards from labeled to unlabeled rollouts. To train such a framework, we first warm up the GNN on labeled rollouts to predict rewards via heterogeneous aggregation over query, thinking, and answer nodes. During online RL fine-tuning, unlabeled rollouts are attached to the graph by query similarity, and the GNN predicts their rewards, yielding a hybrid reward acquisition strategy that combines ground-truth and GNN-predicted rewards. Experiments on Qwen2.5-1.5B and 3B in mathematics, question answering, and code generation demonstrate that MemReward, with ground-truth rewards on only 20% of rollouts, achieves 96.6% of Oracle performance on 1.5B and 97.3% on 3B, and closely approaches Oracle on out-of-domain tasks.

2603.18123 2026-05-25 eess.IV cs.AI 版本更新

Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

理解可泛化超声基础模型的任务聚合

Fangyijie Wang, Tanya Akumu, Vien Ngoc Dang, Amelia Jiménez-Sánchez, Jieyun Bai, Guénolé Silvestre, Karim Lekadir, Kathleen M. Curran

发表机构 * Research Ireland Centre for Research Training in Machine Learning Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain School of Medicine, University College Dublin, Dublin, Ireland School of Computer Science, University College Dublin, Dublin, Ireland Institució Catalana de Recerca i Estudis Avançats (ICREA) Department of Cardiovascular Surgery, The First Affiliated Hospital of Jinan University, Jinan University, Guangzhou, China Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand

AI总结 该研究探讨了如何在通用超声基础模型中有效整合多种临床任务,分析了任务聚合策略对模型性能的影响。研究提出,任务性能下降并非源于模型容量不足,而是任务异质性与训练数据规模之间的相互作用被忽视所致。为此,作者提出了基于DINOv3的多器官多任务框架M2DINO,并通过系统实验发现,任务聚合的效果高度依赖于数据规模,统一训练在低数据场景下表现更稳定,而临床分组训练可能带来负面影响。研究还揭示了不同任务类型对聚合策略的敏感性差异,为超声基础模型的设计提供了重要指导。

详情
AI中文摘要

基础模型有望在单一框架内统一多个临床任务,但最近的超声研究报告称统一模型可能不如特定任务基线。我们假设这种退化并非源于模型容量限制,而是由于任务聚合策略忽略了任务异质性与可用训练数据规模之间的相互作用。在这项工作中,我们系统分析了何时可以联合学习异质超声任务而不损失性能,为统一临床成像模型中的任务聚合建立了实用标准。我们引入了M2DINO,一个基于DINOv3的多器官、多任务框架,配备任务条件专家混合模块以实现自适应容量分配。我们系统评估了涵盖分割、分类、检测和回归的27项超声任务,采用三种范式:特定任务、临床分组和全任务统一训练。结果表明,聚合效果强烈依赖于训练数据规模。虽然临床分组训练可以在数据丰富的环境中提高性能,但在低数据环境中可能引发显著的负迁移。相比之下,全任务统一训练在临床组间表现出更一致的性能。我们进一步观察到,在我们的实验中,任务敏感性因任务类型而异:与回归和分类相比,分割显示出最大的性能下降。这些发现为超声基础模型提供了实用指导,强调聚合策略应同时考虑训练数据可用性和任务特性,而非仅依赖临床分类。

英文摘要

Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified models can underperform task-specific baselines. We hypothesize that this degradation arises not from model capacity limitations, but from task aggregation strategies that ignore interactions between task heterogeneity and available training data scale. In this work, we systematically analyze when heterogeneous ultrasound tasks can be jointly learned without performance loss, establishing practical criteria for task aggregation in unified clinical imaging models. We introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. We systematically evaluate 27 ultrasound tasks spanning segmentation, classification, detection, and regression under three paradigms: task-specific, clinically-grouped, and all-task unified training. Our results show that aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. We further observe that task sensitivity varies by task type in our experiments: segmentation shows the largest performance drops compared with regression and classification. These findings provide practical guidance for ultrasound foundation models, emphasizing that aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone.

2603.17879 2026-05-25 cs.CV cs.AI 版本更新

Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

解剖引导的视觉-语言学习与角度原型分离用于类别不平衡下的多标签视频胶囊内镜分类

Podakanti Satyajith Chary, Nagarajan Ganapathy

发表机构 * Department of Engineering Science, IIT Hyderabad(印度理工学院海得拉巴分校工程科学系) Department of Biomedical Engineering, IIT Hyderabad(印度理工学院海得拉巴分校生物医学工程系)

AI总结 本文提出了一种用于视频胶囊内镜(VCE)的多标签时间事件检测框架,针对Galar数据集中严重的类别不平衡问题,结合了角度分离损失和生物状态机解码器两个核心贡献。该框架基于BiomedCLIP模型,通过局部差分注意力模块融合连续帧以增强病理信号,并利用解剖上下文头结合软解剖激活进行病理预测。实验表明,该方法在RARE-VISION测试集上显著提升了检测性能,实现了更高的平均精度。

Comments 12 pages, 1 figure, ICPR 2026 RARE-VISION Competition

详情
AI中文摘要

本文提出一个多标签时间事件检测框架用于视频胶囊内镜(VCE),通过结合两个主要贡献来解决Galar数据集固有的极端类别不平衡问题:类原型上的角度分离损失和生物状态机时间解码器。主干网络保持为BiomedCLIP,一个生物医学视觉-语言基础模型。三个连续帧通过局部差分注意力模块融合,该模块通过抑制静态时间冗余来放大瞬态病理信号。然后,解剖上下文头将病理预测条件化于软解剖激活上,利用已知的胃肠道发现空间共现结构。可学习的文本特征提示和基于原型的logit增强与角度分离损失一起训练,该损失惩罚类原型之间的非对角线余弦相似度,防止在极端不平衡下影响罕见类的原型崩溃。为抵消倾斜的标签分布,训练方案结合了非对称焦点损失、逆频率加权采样、时间混合、指数移动平均和每类阈值校准。生物状态机解码器用基于解剖标签的生理学基础前向状态转换替代朴素间隙合并,消除了先前方法中每视频产生数百个虚假解剖事件的碎片化伪影,并将每视频解剖输出减少到2-3个临床现实事件。在包含三个NaviCam检查(161,025帧)的保留RARE-VISION测试集上,更新后的管道实现了整体时间mAP@0.5为0.3597,mAP@0.95为0.3399,相比先前提交分别相对提升46%和44%,总推理时间在单个GPU上约21分钟完成。

英文摘要

This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static temporal redundancy. An Anatomy Context Head then conditions pathological predictions on soft anatomical activations, exploiting the known spatial co-occurrence structure of GI findings. Learnable text-feature prompts and prototype-based logit augmentation are trained alongside an Angular Separation Loss that penalizes off-diagonal cosine similarity between class prototypes, preventing the prototype collapse that afflicts rare classes under extreme imbalance. To counteract the skewed label distribution, the training regime combines asymmetric focal loss, inverse-frequency weighted sampling, temporal Mixup, Exponential Moving Average, and per-class threshold calibration. The Biological State Machine decoder replaces naive gap merging with a physiologically grounded forward-only state transition over anatomy labels, eliminating the fragmentation artefact that produced hundreds of spurious anatomy events per video in the prior approach and reducing per-video anatomy output to 2--3 clinically realistic events. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the updated pipeline achieves an overall temporal mAP@0.5 of 0.3597 and mAP@0.95 of 0.3399, representing a relative improvement of 46% and 44% respectively over the prior submission, with total inference completed in approximately 21 minutes on a single GPU.
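The abstract describes the Angular Separation Loss only as penalizing off-diagonal cosine similarity between class prototypes. A minimal sketch of that idea is given below; the specific form (mean of squared off-diagonal cosines) and the function names are assumptions rather than the authors' definition, and plain Python lists stand in for the learnable prototype tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def angular_separation_loss(prototypes):
    """Mean squared off-diagonal cosine similarity between class prototypes.

    Assumed form: 0.0 when all prototypes are mutually orthogonal, 1.0 when
    they have all collapsed onto the same direction.
    """
    n = len(prototypes)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:  # skip the diagonal (self-similarity is always 1)
                total += cosine(prototypes[i], prototypes[j]) ** 2
    return total / (n * (n - 1))


separated = angular_separation_loss([[1.0, 0.0], [0.0, 1.0]])
collapsed = angular_separation_loss([[1.0, 2.0], [2.0, 4.0]])
```

Under this form, minimizing the loss pushes prototypes toward mutual orthogonality, which is one way to keep rare-class prototypes from collapsing onto those of frequent classes.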

2603.10067 2026-05-25 cs.LG cs.AI 版本更新

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

HTMuon:通过重尾谱校正改进Muon

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

发表机构 * Dartmouth College(达特茅斯学院) Microsoft(微软) International Computer Science Institute(国际计算机科学研究所) University of California, Berkeley(加州大学伯克利分校) Meta

AI总结 本文提出 HTMuon,一种改进 Muon 优化算法的方法,旨在提升大语言模型的训练效果。研究指出,Muon 的正交更新规则抑制了权重谱的重尾特性,而 HTMuon 基于重尾自正则化理论,通过生成更重尾的更新步长,增强模型对参数依赖关系的捕捉能力。实验表明,HTMuon 在语言模型预训练和图像分类任务中均优于现有方法,且可作为现有 Muon 变体的插件使用。

详情
AI中文摘要

Muon最近在LLM训练中显示出有希望的结果。在这项工作中,我们研究如何进一步改进Muon。我们认为Muon的正交化更新规则抑制了重尾权重谱的出现,并过度强调了沿噪声主导方向的训练。受重尾自正则化(HT-SR)理论的启发,我们提出了HTMuon。HTMuon保留了Muon捕捉参数相互依赖性的能力,同时产生更重尾的更新并诱导更重尾的权重谱。在LLM预训练和图像分类上的实验表明,HTMuon持续优于最先进的基线,并且可以作为现有Muon变体的插件使用。例如,在C4数据集上的LLaMA预训练中,与Muon相比,HTMuon将困惑度降低了高达0.98。我们进一步从理论上证明,HTMuon对应于Schatten-$q$范数约束下的最速下降,并提供了在光滑非凸环境下的收敛性分析。HTMuon的实现可在https://github.com/TDCSZ327/HTmuon获取。

英文摘要

Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
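The steepest-descent claim in the abstract can be made concrete with a standard duality computation. This is a textbook derivation under stated assumptions, not a reproduction of the paper's analysis, and HTMuon's exact parameterization may differ. Let $G = U \operatorname{diag}(\sigma) V^\top$ be the SVD of the gradient and let $1/p + 1/q = 1$.

```latex
% Steepest-descent direction under a Schatten-q norm constraint.
% By von Neumann's trace inequality the optimum shares singular vectors with G,
% and Hoelder's inequality on the singular values gives the closed form.
\[
\Delta^\star
= \arg\max_{\|\Delta\|_{S_q} \le 1} \langle G, \Delta \rangle
= U \operatorname{diag}\!\left( \frac{\sigma_i^{\,p-1}}{\|\sigma\|_p^{\,p-1}} \right) V^\top,
\qquad
\langle G, \Delta^\star \rangle = \|\sigma\|_p = \|G\|_{S_p}.
\]
% The update is W <- W - \eta \Delta^\star.
% q -> infinity (p -> 1): \Delta^\star = U V^\top on the non-zero singular values,
%   i.e. the orthogonalized update.
% q = 2 (p = 2): \Delta^\star = G / \|G\|_F, i.e. the normalized gradient.
```

For finite $q$ the singular values of the update are no longer all equal: directions with larger $\sigma_i$ receive larger steps. This is consistent with the abstract's description of heavier-tailed updates than Muon's orthogonalized rule.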

2602.13985 2026-05-25 cs.AI 版本更新

Bridging AI and Clinical Reasoning: Abductive Explanations for Alignment on Critical Symptoms

弥合AI与临床推理:针对关键症状对齐的溯因解释

Belona Sonna, Alban Grastien

AI总结 该研究旨在解决人工智能在临床诊断中与结构化临床推理不一致的问题,提出利用形式化归因解释方法,以确保AI决策基于关键症状进行合理推理。通过识别最小充分特征集,该方法不仅提升了AI解释的透明度和可信度,还实现了与临床思维的对齐,为构建可信赖的医疗诊断AI系统提供了有效框架。

Comments Algorithm 1 is not entirely correct, and this may affect the results as well. We are rerunning the experiments and will upload a new version as soon as possible.

详情
AI中文摘要

人工智能在临床诊断中展现出强大潜力,其准确性常达到或超过人类专家水平。然而,一个关键挑战是AI推理常偏离结构化临床框架,限制了信任、可解释性和应用。即使预测正确,AI模型也可能忽略对快速准确决策至关重要的关键症状。现有的事后解释方法透明度有限且缺乏形式保证。为解决此问题,我们利用形式溯因解释,它在最小充分特征集上提供一致且保证的推理。这使我们能够清晰理解AI决策,并使其与临床推理对齐。我们的方法在保持预测准确性的同时提供临床可操作的见解,为医疗诊断中可信AI建立了稳健框架。

英文摘要

Artificial intelligence (AI) has demonstrated strong potential in clinical diagnostics, often achieving accuracy comparable to or exceeding that of human experts. A key challenge, however, is that AI reasoning frequently diverges from structured clinical frameworks, limiting trust, interpretability, and adoption. Critical symptoms, pivotal for rapid and accurate decision-making, may be overlooked by AI models even when predictions are correct. Existing post hoc explanation methods provide limited transparency and lack formal guarantees. To address this, we leverage formal abductive explanations, which offer consistent, guaranteed reasoning over minimal sufficient feature sets. This enables a clear understanding of AI decision-making and allows alignment with clinical reasoning. Our approach preserves predictive accuracy while providing clinically actionable insights, establishing a robust framework for trustworthy AI in medical diagnosis.

2602.13473 2026-05-25 cs.AI 版本更新

NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Programmatic Space of EEG Analysis Pipelines

NeuroWeaver:一种用于探索EEG分析流水线程序空间的自主进化智能体

Guoan Wang, Shihao Yang, Jun-En Ding, Feng Liu

发表机构 * Department of Systems Engineering, Stevens Institute of Technology(系统工程系,斯蒂文斯理工学院)

AI总结 本文提出了一种名为NeuroWeaver的自主进化智能体,用于探索EEG分析流程的程序空间。该方法通过将流程设计转化为离散约束优化问题,并结合领域知识引导的初始化和多目标进化优化,有效平衡了性能、新颖性和效率。实验表明,NeuroWeaver能够在较少参数的情况下生成轻量高效的解决方案,其表现优于现有任务特定方法,并可与大规模基础模型相媲美。

详情
AI中文摘要

尽管基础模型在通用领域取得了显著成功,但这些模型在脑电图(EEG)分析中的应用受到大量数据需求和高参数化的限制。这些因素导致高昂的计算成本,从而阻碍了在资源受限的临床环境中的部署。相反,通用自动机器学习框架通常不适合该领域,因为在无界程序空间中的探索未能纳入必要的神经生理学先验,并且经常产生缺乏科学合理性的解决方案。为了解决这些限制,我们提出了NeuroWeaver,一个统一的自主进化智能体,通过将流水线工程重新表述为离散约束优化问题,旨在泛化到不同的EEG数据集和任务。具体来说,我们采用领域信息子空间初始化将搜索限制在神经科学合理的流形上,并结合多目标进化优化,通过自我反思性改进动态平衡性能、新颖性和效率。在五个异构基准上的实证评估表明,尽管使用的参数显著减少,NeuroWeaver合成的轻量级解决方案始终优于最先进的任务特定方法,并实现了与大规模基础模型相当的性能。

英文摘要

Although foundation models have demonstrated remarkable success in general domains, the application of these models to electroencephalography (EEG) analysis is constrained by substantial data requirements and high parameterization. These factors incur prohibitive computational costs, thereby impeding deployment in resource-constrained clinical environments. Conversely, general-purpose automated machine learning frameworks are often ill-suited for this domain, as exploration within an unbounded programmatic space fails to incorporate essential neurophysiological priors and frequently yields solutions that lack scientific plausibility. To address these limitations, we propose NeuroWeaver, a unified autonomous evolutionary agent designed to generalize across diverse EEG datasets and tasks by reformulating pipeline engineering as a discrete constrained optimization problem. Specifically, we employ a Domain-Informed Subspace Initialization to confine the search to neuroscientifically plausible manifolds, coupled with a Multi-Objective Evolutionary Optimization that dynamically balances performance, novelty, and efficiency via self-reflective refinement. Empirical evaluations across five heterogeneous benchmarks demonstrate that NeuroWeaver synthesizes lightweight solutions that consistently outperform state-of-the-art task-specific methods and achieve performance comparable to large-scale foundation models, despite utilizing significantly fewer parameters.

2602.13249 2026-05-25 q-bio.BM cs.AI cs.LG 版本更新

A Systematic Evaluation of Co-folding Model Representations for Small-Molecule Learning

小分子学习的共折叠模型表示的系统评估

Hyosoon Jang, Hyunjin Seo, Honghui Kim, Seonghyun Park, Taewon Kim, Yunhui Jang, Sungsoo Ahn

发表机构 * KAIST(韩国科学技术院)

AI总结 本文系统评估了基于蛋白质-配体共折叠的模型在小分子学习中的表示能力。研究使用现代共折叠模型Boltz2,将其原子级配体表示迁移到独立的小分子任务中,结果表明其性能在ADMET基准测试中达到或超越现有模型,并提升了分子生成建模和结构引导的配体优化效率。此外,Boltz2的表示与传统独立分子监督方法具有互补性,并可应用于强化学习以增强分子发现过程。这些结果表明,蛋白质-配体共折叠是一种有前景的小分子表示学习预训练范式。

详情
AI中文摘要

小分子基础模型通常仅在独立分子数据上进行预训练,这与视觉和语言模型不同,后者通常受益于跨模态或关系监督。蛋白质-配体共折叠通过将模型暴露于原子级配体-蛋白质相互作用,提供了这种监督的分子类似物,引发了一个问题:共折叠模型能否产生强大的小分子表示。我们使用现代共折叠模型Boltz2研究这个问题,通过将其原子级配体表示转移到独立的小分子任务。通过系统探测和蒸馏,我们表明Boltz2表示在ADMET基准上匹配或超越现有模型,加速分子生成建模,并提高结构引导配体优化的样本效率。我们进一步发现Boltz2表示与从传统独立分子监督(包括3D构象、生物测定标签和量子化学性质)中学习到的表示互补。最后,我们将表示对齐扩展到强化学习,表明密集的表示级监督可以补充分子发现中的标量奖励。这些结果将蛋白质-配体共折叠确定为小分子表示学习的有前景的预训练范式,并将Boltz2定位为强大的现成分子基础模型。

英文摘要

Small-molecule foundation models are typically pretrained on standalone molecular data, unlike vision and language models that often benefit from cross-modal or relational supervision. Protein-ligand co-folding provides a molecular analogue of such supervision by exposing models to atom-level ligand-protein interactions, raising the question of whether co-folding models can yield strong small-molecule representations. We study this question using Boltz2, a modern co-folding model, by transferring its atom-level ligand representations to standalone small-molecule tasks. Through systematic probing and distillation, we show that Boltz2 representations match or outperform existing models on the ADMET benchmark, accelerate molecular generative modeling, and improve sample efficiency in structure-guided ligand optimization. We further find that Boltz2 representations are complementary to those learned from conventional standalone molecular supervision, including 3D conformers, bioassay labels, and quantum-chemical properties. Finally, we extend representation alignment to reinforcement learning, showing that dense representation-level supervision can complement scalar rewards in molecular discovery. These results identify protein-ligand co-folding as a promising pretraining paradigm for small-molecule representation learning and position Boltz2 as a strong, off-the-shelf molecular foundation model.

2602.13241 2026-05-25 cs.CY cs.AI cs.HC 版本更新

Empowering 9-1-1 Calltaking Training with Generative AI: Experiences and Lessons Learned

赋能 9-1-1 接警培训:生成式 AI 的经验与教训

Zirong Chen, Yilin Liu, Meiyi Ma

发表机构 * College of Connected Computing(连接计算学院) Vanderbilt University(范德比大学)

AI总结 该研究探讨了如何利用生成式人工智能(GenAI)提升9-1-1紧急电话接线员的培训效率,以应对人员短缺和传统培训方式难以扩展的问题。研究团队与纳什维尔都会区紧急通信部(MNDEC)合作,开发并部署了一套基于生成式AI的培训系统,经过六个月的实际应用,系统覆盖了190名用户,进行了1120次培训。通过分析记录98429次用户交互的部署日志,研究总结出四条关键经验,为在公共安全领域应用AI驱动培训系统提供了切实可行的设计与治理建议。

Comments Accepted at IEEE SmartComp 2026

详情
AI中文摘要

紧急接警员是公共安全响应的第一操作环节,每年处理超过 2.4 亿次呼叫,同时面临持续的培训危机:许多中心的人员短缺超过 25%,而培训一名新员工可能需要多达 720 小时的一对一指导,这会使得经验丰富的人员脱离现役。传统培训方法在这些限制下难以扩展,限制了覆盖范围和反馈及时性。与 Metro Nashville 紧急通信部(MNDEC)合作,我们在现实约束下设计、开发和部署了一个基于生成式 AI 的接警培训系统。在六个月内,部署从初始试点扩展到 190 名运营用户,覆盖 1120 次培训会话,暴露了在受控或纯模拟评估中基本不可见的系统交付、严谨性、弹性和人为因素方面的系统性挑战。通过分析记录 98429 次用户交互、组织流程和利益相关者参与模式的部署日志,我们提炼出四个关键教训,每个教训都附有具体的设计和治理实践。这些教训为在安全关键公共部门环境中寻求交付 AI 驱动培训系统的研究人员和实践者提供了基于实践的指导,在这些环境中,实际约束从根本上塑造了以人为本的设计。

英文摘要

Emergency call-takers form the first operational link in public safety response, handling over 240 million calls annually while facing a sustained training crisis: staffing shortages exceed 25% in many centers, and preparing a single new hire can require up to 720 hours of one-on-one instruction that removes experienced personnel from active duty. Traditional training approaches struggle to scale under these constraints, limiting both coverage and feedback timeliness. In partnership with Metro Nashville Department of Emergency Communications (MNDEC), we designed, developed, and deployed a GenAI-powered call-taking training system under real-world constraints. Over six months, deployment scaled from initial pilot to 190 operational users across 1,120 training sessions, exposing systematic challenges around system delivery, rigor, resilience, and human factors that remain largely invisible in controlled or purely simulated evaluations. By analyzing deployment logs capturing 98,429 user interactions, organizational processes, and stakeholder engagement patterns, we distill four key lessons, each coupled with concrete design and governance practices. These lessons provide grounded guidance for researchers and practitioners seeking to deliver AI-driven training systems in safety-critical public sector environments where practical constraints fundamentally shape human-centric design.

2602.12579 2026-05-25 cs.LG cs.AI 版本更新

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

VI-CuRL: 通过置信度引导的方差缩减稳定与验证器无关的强化学习推理

Xin-Qiang Cai, Masashi Sugiyama

发表机构 * RIKEN AIP(理化学研究所革新智能统合研究中心) The University of Tokyo(东京大学)

AI总结 本文提出了一种名为VI-CuRL的验证器无关强化学习框架,旨在解决现有可验证奖励强化学习(RLVR)依赖外部验证器导致的可扩展性问题。该方法通过利用模型自身的置信度构建独立于外部验证器的课程学习体系,有效控制梯度方差,提升训练稳定性。理论分析证明了该估计器的渐近无偏性,实验表明其在数学和通用推理任务中优于多种依赖或不依赖验证器的基线方法。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为增强大型语言模型(LLMs)推理能力的主流范式,但其对外部验证器的依赖限制了可扩展性。最近的研究表明,RLVR主要通过激发潜在能力发挥作用,这推动了无验证器算法的发展。然而,在此类设置中,标准方法(如Group Relative Policy Optimization)面临一个关键挑战:破坏性的梯度方差常导致训练崩溃。为解决此问题,我们引入了与验证器无关的课程强化学习(VI-CuRL),该框架利用模型的内在置信度构建独立于外部验证器的课程。通过优先处理高置信度样本,VI-CuRL有效管理偏差-方差权衡,特别针对降低动作和问题方差。我们提供了严格的理论分析,证明我们的估计量保证渐近无偏性。实验上,VI-CuRL促进了稳定性,并在有/无验证器的数学和通用推理基准上持续优于依赖/不依赖验证器的基线。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing the reasoning of Large Language Models (LLMs), yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent of external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms both verifier-dependent and verifier-independent baselines on math and general reasoning benchmarks, with and without verifiers.
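The idea of prioritizing high-confidence samples can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how confidence is measured or how the curriculum is scheduled, so the length-normalised log-probability score and the `keep_fraction` parameter below are hypothetical choices, not the authors' method.

```python
import math
from typing import List, Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed confidence score: exp of the mean token log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def select_curriculum(confidences: Sequence[float], keep_fraction: float) -> List[int]:
    """Return indices of the most confident samples, highest confidence first."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(keep_fraction * len(confidences)))
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    return ranked[:n_keep]


# Toy per-token log-probabilities for four sampled responses (made-up numbers).
logprobs = [[-0.1, -0.2], [-2.0, -1.5], [-0.5, -0.4], [-0.05, -0.05]]
conf = [sequence_confidence(lp) for lp in logprobs]
kept = select_curriculum(conf, keep_fraction=0.5)
print(kept)
```

Raising `keep_fraction` over training would widen the curriculum from confident samples to the full batch; whether VI-CuRL does this, and on what schedule, is not stated in the abstract.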

2602.12316 2026-05-25 cs.AI cs.CL cs.CY cs.GT cs.MA 版本更新

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

GT-HarmBench:通过博弈论视角评估AI安全风险

Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin

发表机构 * ETH Zürich(苏黎世联邦理工学院) Berea College(伯里亚学院) University of Toronto(多伦多大学) Vector Institute(向量研究所) Max Planck Institute for Intelligent Systems, Tübingen, Germany(德国图宾根马克斯·普朗克智能系统研究所)

AI总结 本文提出GT-HarmBench,一个用于评估前沿AI系统在多智能体高风险场景中安全性的基准测试,涵盖博弈论中的经典场景,如囚徒困境、猎鹿博弈(Stag Hunt)和斗鸡博弈(Chicken)。研究发现,现有AI模型在38%的高风险情境中无法选择对社会有益的行动,揭示了多智能体环境下对齐问题的严重性。通过引入博弈论干预,研究展示了提升社会收益的潜力,并为多智能体AI安全研究提供了标准化测试平台。

详情
AI中文摘要

前沿AI系统能力日益增强,并部署在高风险的多智能体环境中。然而,现有的AI安全基准主要评估单一智能体,导致对协调失败和冲突等多智能体风险的理解不足。我们引入了GT-HarmBench,这是一个包含1535个高风险场景的基准,涵盖了囚徒困境、猎鹿博弈和斗鸡博弈等博弈论结构。场景来源于MIT AI风险库中的现实AI风险背景。在15个前沿模型中,智能体在38%的高风险案例中未能选择对社会有益的行为,例如军事升级、选举操纵和医疗事故。我们测量了对博弈论提示框架和顺序的敏感性,并分析了导致失败的推理模式。我们进一步表明,博弈论干预可将社会有益结果提升高达18%。我们的结果突出了显著的可靠性差距,并为研究多智能体环境中的对齐提供了一个广泛的标准化测试平台。该基准和代码可在https://github.com/causalNLP/gt-harmbench获取。

英文摘要

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 1,535 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents fail to choose socially beneficial actions in 38% of high-stakes cases, such as military escalation, election manipulation, and medical malpractice. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
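The three game structures named above differ only in how four payoffs are ordered. The sketch below classifies a symmetric two-player game using the textbook orderings; the function name and payoff labels are illustrative and are not taken from the benchmark's code.

```python
def classify_symmetric_game(R: float, S: float, T: float, P: float) -> str:
    """Classify a symmetric 2x2 game from the row player's payoffs.

    R: both cooperate, S: cooperate against a defector,
    T: defect against a cooperator, P: both defect.
    """
    if T > R > P > S:
        return "prisoners_dilemma"  # defection dominates, mutual cooperation is better
    if R > T >= P > S:
        return "stag_hunt"          # mutual cooperation is best but risky
    if T > R > S > P:
        return "chicken"            # mutual defection is the worst outcome
    return "other"


print(classify_symmetric_game(R=3, S=0, T=5, P=1))
print(classify_symmetric_game(R=4, S=0, T=3, P=2))
print(classify_symmetric_game(R=3, S=1, T=5, P=0))
```

In all three structures the cooperative outcome pays each player R; they differ in how tempting or risky deviation is. How the benchmark itself scores socially beneficial actions is not specified in the abstract.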

2602.07801 2026-05-25 cs.CV cs.AI 版本更新

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

VideoTemp-o3:在智能视频思考中协调时间定位与视频理解

Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, Xuemeng Song

发表机构 * Shandong University(山东大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beihang University(北京航空航天大学) Southern University of Science and Technology(南方科技大学)

AI总结 在长视频理解任务中,传统均匀采样方法难以捕捉关键视觉证据,导致性能下降和幻觉增加。为此,本文提出VideoTemp-o3,一种统一的基于视频的智能推理框架,通过联合建模视频定位与问答任务,显著提升了定位精度与推理效率。该方法引入统一的掩码机制和专用奖励策略,支持按需剪辑与定位修正,并构建了高质量的长视频定位问答数据集及评估基准,实验表明其在长视频理解和定位任务中均表现出色。

Comments ICML 2026

详情
AI中文摘要

在长视频理解中,传统的均匀帧采样通常无法捕捉关键视觉证据,导致性能下降和幻觉增加。为了解决这个问题,最近出现了智能视频思考范式,采用定位-裁剪-回答流程,模型主动识别相关视频片段,在这些片段内进行密集采样,然后生成答案。然而,现有方法仍然效率低下,定位能力弱,且遵循僵化的工作流。为了解决这些问题,我们提出了VideoTemp-o3,一个统一的智能视频思考框架,联合建模视频定位和问答。VideoTemp-o3展现出强大的定位能力,支持按需裁剪,并能修正不准确的定位。具体来说,在监督微调阶段,我们设计了一个统一的掩码机制,在鼓励探索的同时防止噪声。对于强化学习,我们引入了专用奖励以减轻奖励黑客。此外,从数据角度,我们开发了一个有效的流程来构建高质量的长视频定位问答数据,以及一个相应的基准,用于在不同视频时长上进行系统评估。实验结果表明,我们的方法在长视频理解和定位方面均取得了显著性能。

英文摘要

In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.

2602.07697 2026-05-25 cs.LG cs.AI cs.NE 版本更新

On the Infinite Width and Depth Limits of Predictive Coding Networks

预测编码网络的无限宽度和深度极限

Francesco Innocenti, El Mehdi Achour, Rafal Bogacz

发表机构 * Brain Network Dynamics Unit, University of Oxford, UK(英国牛津大学脑网络动力学研究组) UM6P College of Computing, Rabat, Morocco(摩洛哥拉巴特穆罕默德六世理工大学计算学院)

AI总结 本文研究了预测编码网络(PCNs)在无限宽度和深度极限下的行为,揭示了其与反向传播(BP)之间的理论联系。研究发现,在线性残差网络中,预测编码与反向传播在参数化方式上具有相同的宽度和深度稳定性条件。当网络宽度远大于深度时,预测编码的能量函数在活动平衡状态下会收敛于二次BP损失,从而计算出与BP相同的梯度。实验表明,这一结论在卷积网络和Transformer等非线性模型中也成立,为预测编码在宽而浅的网络结构中实现类似反向传播的训练提供了理论依据。

Comments 36 pages, 28 figures

详情
AI中文摘要

预测编码(PC)是标准反向传播(BP)的一种生物合理替代方案,它在更新权重之前通过最小化关于网络活动的能量函数来工作。最近的工作通过利用一些受BP启发的重新参数化,提高了深度PC网络(PCN)的训练稳定性。然而,这些方法的完全可扩展性和理论基础仍不清楚。为了解决这一空白,我们研究了PCN的无限宽度和深度极限。对于线性残差网络,我们表明PC的宽度和深度稳定的特征学习参数化集合与BP完全相同。此外,在这些参数化中的任何一种下,当模型宽度远大于深度时,具有平衡活动的PC能量收敛到二次BP损失,导致PC计算与BP相同的梯度。实验表明,只要达到活动平衡,非线性模型(包括卷积网络和transformer)也收敛到BP。总体而言,这项工作限制了与PC可扩展的参数化类型,同时展示了如何通过仅局部更新在比深度宽得多的网络(如大脑)中有效实现BP。

英文摘要

Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with respect to network activities before updating weights. Recent work has improved the training stability of deep PC networks (PCNs) by leveraging some BP-inspired reparameterisations. However, the full scalability and theoretical basis of these methods remain unclear. To address this gap, we study the infinite width and depth limits of PCNs. For linear residual networks, we show that the set of width- and depth-stable feature-learning parameterisations for PC is exactly the same as for BP. Moreover, under any of these parameterisations, the PC energy with equilibrated activities converges to the quadratic BP loss when the model width is much larger than the depth, resulting in PC computing the same gradients as BP. Experiments show that, as long as an activity equilibrium is reached, convergence to BP holds for nonlinear models including convolutional networks and transformers. Overall, this work constrains the types of parameterisation that scale with PC, while showing how BP can be effectively implemented with only local updates in networks that are much wider than they are deep, such as the brain.
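To make "minimising an energy with respect to activities" concrete, here is a toy scalar chain with one hidden unit. It assumes the standard sum-of-squared-prediction-errors energy; the abstract does not give the paper's exact energy or parameterisation, so the setup below is an illustrative assumption.

```python
def pc_energy(z: float, x: float, y: float, w1: float, w2: float) -> float:
    """Assumed PC energy: squared prediction errors at the hidden and output layers."""
    return 0.5 * (z - w1 * x) ** 2 + 0.5 * (y - w2 * z) ** 2


def relax_activity(x: float, y: float, w1: float, w2: float,
                   lr: float = 0.1, steps: int = 200) -> float:
    """Gradient descent on the hidden activity z with the weights held fixed."""
    z = w1 * x  # start from the feedforward prediction
    for _ in range(steps):
        grad = (z - w1 * x) - w2 * (y - w2 * z)  # dF/dz
        z -= lr * grad
    return z


def equilibrium_activity(x: float, y: float, w1: float, w2: float) -> float:
    """Closed-form minimiser of the energy, obtained from dF/dz = 0."""
    return (w1 * x + w2 * y) / (1.0 + w2 ** 2)


x, y, w1, w2 = 1.0, 2.0, 0.8, 0.5
z_relaxed = relax_activity(x, y, w1, w2)
z_star = equilibrium_activity(x, y, w1, w2)
print(z_relaxed, z_star)
```

Only after the activity has settled would the weights be updated from the remaining local prediction errors. The abstract's result concerns when those updates coincide with BP gradients, which this toy example does not attempt to show.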

2602.07399 2026-05-25 cs.AI cs.CV 版本更新

VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

VGAS: 价值引导的动作块选择用于少样本视觉-语言-动作适应

Changhua Xu, En Yu, Junyu Xuan, Jie Lu

发表机构 * Australian Artificial Intelligence Institute (AAII)(澳大利亚人工智能研究所)

AI总结 视觉-语言-动作(VLA)模型能够实现多模态推理与物理控制的结合,但在仅有少量示例的情况下进行任务适应时仍存在可靠性问题。本文提出了一种名为VGAS的新框架,从生成-选择的角度出发,通过引入语义忠实与几何精确的行动片段选择机制,有效解决了几何模糊导致的执行偏差问题。该方法结合了微调后的VLA模型作为高召回率提案生成器,并引入基于几何的Transformer评论器Q-Chunk-Former以及显式几何正则化(EGR)策略,显著提升了在少量示例和分布偏移情况下的任务成功率与鲁棒性。

Comments Preprint

详情
AI中文摘要

视觉-语言-动作(VLA)模型桥接了多模态推理与物理控制,但将其适应于新任务且仅有少量演示时仍不可靠。虽然微调后的VLA策略通常能产生语义上合理的轨迹,但失败往往源于未解决的几何歧义,其中接近正确的动作在有限监督下会导致不同的执行结果。我们从生成-选择的角度研究少样本VLA适应,并提出一个新颖的框架VGAS(价值引导的动作块选择)。它在推理时执行最佳N选1,以识别既语义忠实又几何精确的动作块。具体来说,VGAS使用微调的VLA作为高召回率提议生成器,并引入Q-Chunk-Former,一个基于几何的Transformer评论家,以解决细粒度的几何歧义。此外,我们提出了显式几何正则化(EGR),它塑造了一个判别性的价值景观,以保持接近正确候选之间的动作排序分辨率,同时减轻在稀缺监督下的价值不稳定性。实验和理论分析表明,VGAS在有限演示和分布偏移下持续提高了成功率和鲁棒性。我们的代码可在https://github.com/Jyugo-15/VGAS获取。

英文摘要

Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures typically arise from unresolved geometric ambiguities, where near-miss actions lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework, VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a fine-tuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic, to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
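Inference-time best-of-N selection amounts to sampling several candidates and keeping the one a critic scores highest. In the sketch below, `toy_propose` and `toy_critic` are hypothetical stand-ins for the fine-tuned VLA and the Q-Chunk-Former; neither reflects the actual models or the API of the linked repository.

```python
import random
from typing import Callable, List, Tuple

ActionChunk = Tuple[float, ...]


def best_of_n(propose: Callable[[], ActionChunk],
              score: Callable[[ActionChunk], float],
              n: int) -> Tuple[ActionChunk, List[ActionChunk]]:
    """Sample n candidate chunks and return the highest-scoring one plus all candidates."""
    candidates = [propose() for _ in range(n)]
    return max(candidates, key=score), candidates


rng = random.Random(0)
target = (0.5, -0.2)  # made-up "geometrically precise" action


def toy_propose() -> ActionChunk:
    # Stand-in proposal generator: near-miss actions scattered around the target.
    return tuple(t + rng.gauss(0.0, 0.1) for t in target)


def toy_critic(chunk: ActionChunk) -> float:
    # Stand-in critic: higher value for chunks closer to the target.
    return -sum((a - t) ** 2 for a, t in zip(chunk, target))


best, candidates = best_of_n(toy_propose, toy_critic, n=8)
print(best)
```

The selection step is only as good as the critic's ability to rank near-miss candidates, which is the property the abstract says EGR is meant to preserve.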

2602.07235 2026-05-25 cs.LG cs.AI cs.IT math.IT 版本更新

ArcMark: Distortion-Free Multi-Byte LLM Watermark via Optimal Transport

ArcMark: 通过最优传输实现无失真的多字节大语言模型水印

Atefeh Gilani, Sajani Vithana, Carol Xuan Long, Oliver Kosut, Lalitha Sankar, Flavio P. Calmon

发表机构 * Arizona State University(亚利桑那州立大学) Harvard University(哈佛大学)

AI总结 ArcMark 是一种基于最优传输理论的无失真多字节大语言模型水印方法,能够在不改变模型生成文本质量的前提下,将多个字节的信息嵌入到少量的生成文本中。该方法通过将无失真水印问题建模为信道编码问题,推导出信息论意义上的信道容量,从而确定了在不引入失真的情况下嵌入信息的理论极限,并据此设计了 ArcMark 算法。实验表明,ArcMark 在信息重建准确率和抗攻击能力方面优于现有方法,且生成文本的困惑度和下游任务表现与未加水印的文本无明显差异。

详情
AI中文摘要

水印是促进大语言模型(LLM)负责任使用的重要工具。现有水印在生成的token中插入信号,要么标记LLM生成的文本(零比特水印),要么编码更复杂的消息(多比特水印)。尽管最近许多方法在不扰动平均下一token预测的情况下向文本中插入多个比特,但它们很大程度上扩展了零比特设置的设计原则,例如每个token编码单个比特。相比之下,能够将多个字节嵌入文本的水印将极大地增加潜在应用,例如嵌入提交提示的用户ID、使用的精确模型版本,甚至提示本身。我们通过引入ArcMark来解决这个问题:一种基于编码和信息论原理的新型水印构造,能够可靠地将多字节信息嵌入仅几百个token中,而不会对底层LLM的下一token分布造成任何失真。我们通过将无失真水印问题建模为信道编码问题,并推导出信息论信道容量,该容量建立了在LLM输出中无失真嵌入信息的基本极限,从而推导出ArcMark。该容量公式指导了ArcMark的设计。在实践中,ArcMark在重建精度上优于竞争的多比特无失真水印,包括在面对改变部分LLM文本的攻击时。ArcMark输出在困惑度和下游任务质量方面也显示出与未加水印文本无法区分。

英文摘要

Watermarking is an important tool for promoting the responsible use of large language models (LLMs). Existing watermarks insert a signal into generated tokens that either flags LLM-generated text (zero-bit watermarking) or encodes more complex messages (multi-bit watermarking). Though a number of recent approaches insert multiple bits into text without perturbing average next-token predictions, they largely extend design principles from the zero-bit setting, such as encoding a single bit per token. In contrast, a watermarker capable of embedding multiple bytes into the text would dramatically increase the potential applications, by embedding information such as the ID of the user who submitted the prompt, the precise model version that was used, or even the prompt itself. We address this problem by introducing ArcMark: a new watermark construction based on coding and information-theoretic principles that is capable of reliably embedding multiple bytes of information into just a few hundred tokens, without any distortion of the underlying LLM next-token distribution. We derive ArcMark by formulating the distortion-free watermarking problem as a channel coding problem, and deriving an information-theoretic channel capacity that establishes the fundamental limit of embedding information in LLM output in a distortion-free manner. This capacity formulation informs the design of ArcMark. In practice, ArcMark outperforms competing multi-bit distortion-free watermarks in terms of reconstruction accuracy, including in the face of attacks that alter a subset of the LLM text. ArcMark output is also shown to be indistinguishable from unwatermarked text in terms of perplexity, and in downstream task quality.

2602.05472 2026-05-25 cs.AI 版本更新

ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation

ALIVE: 通过对抗学习和指导性语言评估唤醒LLM推理

Yiwen Duan, Jing Ye, Xinpei Zhao

发表机构 * Independent Researcher(独立研究者)

AI总结 大型语言模型(LLMs)在专家级推理能力方面面临“奖励瓶颈”问题,传统强化学习依赖的标量奖励难以扩展、跨领域不稳定且无法反映推理逻辑。为此,研究提出ALIVE框架,通过对抗学习与指导性语言评价相结合,使模型能够从原始语料中自主学习推理准则,无需依赖外部奖励信号。实验表明,ALIVE在数学推理、代码生成和逻辑推理等任务中显著提升了模型的准确性、跨领域泛化能力和自我纠正能力,为通用推理对齐提供了一种无需人工监督的可扩展方法。

详情
AI中文摘要

大型语言模型(LLM)追求专家级推理的努力一直受到持续的“奖励瓶颈”的阻碍:传统的强化学习(RL)依赖于标量奖励,这些奖励扩展成本高昂、跨领域时脆弱,并且对解决方案的底层逻辑视而不见。这种对外部、贫乏信号的依赖阻止了模型发展对推理原则的深层、自包含理解。我们引入ALIVE(对抗学习与指导性语言评估),一种免人工干预的对齐框架,超越了标量奖励优化,转向内在推理习得。基于“认知协同”原则,ALIVE将问题提出、解决和判断统一在单个策略模型中,以内化正确性的逻辑。通过将对抗学习与指导性语言反馈相结合,ALIVE使模型能够直接从原始语料库内化评估标准,有效将外部批评转化为内生推理能力。在数学推理、代码生成和一般逻辑推理基准上的实证评估表明,ALIVE持续缓解了奖励信号的局限性。在相同数据和计算量下,它实现了准确率提升、显著改善的跨域泛化以及更高的自我修正率。这些结果表明,推理三位一体促进了能力增长的自我维持轨迹,将ALIVE定位为无需人工循环监督的通用推理对齐的可扩展基础。

英文摘要

The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent "reward bottleneck": traditional reinforcement learning (RL) relies on scalar rewards that are costly to scale, brittle across domains, and blind to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce ALIVE (Adversarial Learning with Instructive Verbal Evaluation), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of Cognitive Synergy, ALIVE unifies problem posing, solving, and judging within a single policy model to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations across mathematical reasoning, code generation, and general logical inference benchmarks demonstrate that ALIVE consistently mitigates reward signal limitations. With identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the reasoning trinity fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.

2602.02780 2026-05-25 cs.AI cs.LG 版本更新

Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

面向基于结构的LLM推理的缩放感知适配器

Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Yi Li, Yan Sun, Boyu Wang, Pingzhao Hu

发表机构 * Department of Computer Science, Western University, London, Canada(加拿大伦敦西方大学计算机科学系) Department of Biochemistry, Western University, London, Canada(加拿大伦敦西方大学生物化学系)

AI总结 本文提出了一种名为Cuttlefish的统一多模态大语言模型,旨在解决基于结构的推理中几何信息缺失和模态融合瓶颈的问题。该模型引入了“Scaling-Aware Patching”和“Geometry Grounding Adapter”两种核心方法,前者通过指令条件门控机制生成可变大小的结构图块,动态调整查询令牌数量以适应结构复杂度;后者通过跨注意力机制将几何信息注入语言模型,从而减少结构幻觉。实验表明,Cuttlefish在多个跨学科的原子级结构推理任务中表现出色。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLM)正在实现对2D和3D结构的推理,但现有方法仍然局限于特定模态,通常通过基于序列的标记化或固定长度的查询连接器来压缩结构输入。这种架构要么忽略了减轻结构幻觉所需的几何基础,要么施加了不灵活的模态融合瓶颈,同时过度压缩和次优分配结构令牌,从而阻碍了通用全原子推理的实现。我们引入了Cuttlefish,一种统一的多模态LLM,它将语言推理建立在几何线索上,同时根据结构复杂性缩放模态令牌。首先,缩放感知补丁利用指令条件门控机制在结构图上生成可变大小的补丁,根据结构复杂性自适应地缩放查询令牌预算,以缓解固定长度连接器的瓶颈。其次,几何基础适配器通过交叉注意力对模态嵌入进行细化,并将生成的模态令牌注入LLM,暴露明确的几何线索以减少结构幻觉。跨学科全原子基准的实验表明,Cuttlefish在异构结构基础推理中实现了优越的性能。代码:github.com/zihao-jing/Cuttlefish。

英文摘要

Large language models (LLMs) are enabling reasoning over 2D and 3D structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric grounding requisite for mitigating structural hallucinations, or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified multimodal LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across interdisciplinary all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning. Code: github.com/zihao-jing/Cuttlefish.

2602.00979 2026-05-25 cs.CR cs.AI cs.CL 版本更新

GradingAttack: Exposing Security Vulnerabilities in LLM Based Educational Grading Agents

GradingAttack: 揭示基于LLM的教育评分代理中的安全漏洞

Xueyi Li, Zhuoneng Zhou, Zitao Liu, Yongdong Wu

发表机构 * Guangdong Institute of Smart Education(广东智能教育研究院) Jinan University(暨南大学)

AI总结 随着大型语言模型(LLM)在自动短答案评分中的广泛应用,其安全性问题日益受到关注。本文提出GradingAttack,一种细粒度的对抗攻击框架,用于系统评估基于LLM的教育评分代理的安全漏洞。通过设计基于词元和提示的攻击策略,该方法在保持高隐蔽性的同时有效操控评分结果,揭示了当前系统在防御对抗攻击方面的不足,突显了构建安全可信教育代理系统的重要性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为教育代理,用于实际教育环境中的自动简答题评分(ASAG),显著提升了评估效率和可扩展性。然而,当这些评分代理“在野外”运行时,它们对对抗性操纵的脆弱性引发了对代理安全性和可信度的关键担忧。在本文中,我们介绍了GradingAttack,一个细粒度的对抗攻击框架,系统地评估基于LLM的教育评分代理的安全漏洞。具体来说,我们设计了token级和prompt级攻击策略,在保持高隐蔽性的同时操纵代理评分结果,揭示了当前代理部署中的根本弱点。在多个数据集上的实验表明,两种攻击策略都能有效破坏评分代理,其中prompt级攻击成功率更高,而token级攻击具有更优的隐蔽性。我们的发现表明,当前的基于LLM的教育代理缺乏针对对抗性攻击的鲁棒防御,突显了为关键教育应用开发安全可信的代理系统的紧迫性。

英文摘要

Large language models (LLMs) are increasingly deployed as educational agents for automatic short answer grading (ASAG) in real-world educational environments, significantly boosting assessment efficiency and scalability. However, when these grading agents operate "in the wild", their vulnerability to adversarial manipulation raises critical concerns about agent security and trustworthiness. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the security vulnerabilities of LLM-based educational grading agents. Specifically, we design token-level and prompt-level attack strategies that manipulate agent grading outcomes while maintaining high stealth, exposing fundamental weaknesses in current agent deployments. Experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth capability. Our findings reveal that current LLM-based educational agents lack robust defenses against adversarial attacks, underscoring the urgent need to develop secure and trustworthy agent systems for critical educational applications.

2601.21766 2026-05-25 cs.CL cs.AI 版本更新

CoFrGeNet: Continued Fraction Architectures for Language Generation

CoFrGeNet:用于语言生成的连分式架构

Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy, Rahul Nair

发表机构 * IBM Research(IBM研究院)

AI总结 本文提出了一种基于连分数结构的新型生成模型架构CoFrGeNet,用于替代传统Transformer中的多头注意力和前馈网络模块,显著减少了参数量。该方法通过自定义梯度计算提升训练效率,并在多个大规模语言模型(如GPT2-xl和Llama3)上验证了其有效性,实验表明在保持甚至提升任务性能的同时,参数规模可减少至原模型的三分之二到二分之一,且预训练时间更短。

Comments Earlier version accepted to ICML 2026

详情
AI中文摘要

Transformer可以说是语言生成的首选架构。本文受连分式启发,引入了一种用于生成建模的新函数类。实现该函数类的架构族称为CoFrGeNets——连分式生成网络。我们基于该函数类设计了新颖的架构组件,可以替换Transformer块中的多头注意力和前馈网络,同时需要的参数少得多。我们推导了自定义梯度公式,以比使用标准PyTorch梯度更准确、更高效地优化所提出的组件。我们的组件是即插即用的替换,几乎不需要改变已为基于Transformer的模型建立的训练或推理过程,从而使我们的方法易于集成到大型工业工作流中。我们在两个非常不同的Transformer架构GPT2-xl(1.5B)和Llama3(3.2B)上进行了实验,前者我们在OpenWebText和GneissWeb上预训练,后者我们在docling数据混合(包含九个不同数据集)上预训练。结果表明,我们的模型在下游分类、问答、推理和文本理解任务上的性能与原始模型相当,有时甚至更优,而参数量仅为原始模型的2/3到1/2,预训练时间更短。我们相信,未来针对硬件定制的实现将进一步发挥我们架构的真正潜力。

英文摘要

Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets (Continued Fraction Generative Networks). We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change to the training or inference procedures already in place for Transformer-based models, making our approach easy to incorporate into large industrial workflows. We experiment on two very different transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B): we pre-train the former on OpenWebText and GneissWeb, and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning and text understanding tasks is competitive with, and sometimes even superior to, that of the original models, with two-thirds to one-half the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
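For readers unfamiliar with the underlying object, a finite continued fraction a0 + 1/(a1 + 1/(a2 + ...)) can be evaluated from the innermost term outward. The sketch below shows only this mathematical building block with exact rational arithmetic; how CoFrGeNets parameterise the coefficients with learned weights and inputs is not described in the abstract and is not modelled here.

```python
from fractions import Fraction
from typing import Sequence


def continued_fraction(coeffs: Sequence[int]) -> Fraction:
    """Evaluate a0 + 1/(a1 + 1/(a2 + ...)) exactly, innermost term first.

    Raises ZeroDivisionError if an intermediate denominator is zero.
    """
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value


# The coefficients [1, 2, 2, 2] give a convergent of sqrt(2).
approx = continued_fraction([1, 2, 2, 2])
print(approx, float(approx))
```

Adding one more coefficient nests one more reciprocal inside the innermost term.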

2601.21692 2026-05-25 cs.AI 版本更新

TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning

TCAP: 面向MLLM微调中无监督后门检测的三组件注意力分析

Mingzu Liu, Hao Fang, Runmin Cong

发表机构 * School of Control Science and Engineering, Shandong University, Jinan, China(中国济南山东大学控制科学与工程学院) Key Laboratory of Industrial Intelligent Systems, Shandong Province, China(中国山东省工业智能系统重点实验室)

AI总结 在微调即服务(FTaaS)模式下,多模态大语言模型(MLLM)的定制化使用带来了数据中毒引发的后门风险。本文提出了一种无监督的后门检测方法TCAP,通过分析模型在系统指令、视觉输入和用户文本查询三个功能组件间的注意力分配差异,揭示了中毒样本的普遍特征,并利用高斯混合模型和EM算法进行注意力头的统计分析与样本过滤,有效识别并隔离后门样本,实验表明该方法在多种模型架构和攻击方式下均具有优异的检测性能。

Comments ICML 2026

详情
AI中文摘要

微调即服务(FTaaS)促进了多模态大语言模型(MLLMs)的定制化,但通过中毒数据引入了严重的后门风险。现有防御要么依赖监督信号,要么无法泛化到多样的触发器类型和模态。在这项工作中,我们揭示了一个通用的后门指纹——注意力分配差异,即中毒样本破坏了系统指令、视觉输入和用户文本查询三个功能组件之间的平衡注意力分布,无论触发器形态如何。受此启发,我们提出三组件注意力分析(TCAP),一种无监督防御框架,用于过滤后门样本。TCAP将跨模态注意力图分解为三个组件,通过高斯混合模型(GMM)统计分析识别对触发器敏感的注意力头,并通过基于EM的投票聚合隔离中毒样本。跨多种MLLM架构和攻击方法的大量实验表明,TCAP实现了持续强劲的性能,使其成为MLLMs中稳健且实用的后门防御方法。

英文摘要

Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint, attention allocation divergence, in which poisoned samples disrupt the balanced attention distribution across three functional components (system instructions, vision inputs, and user textual queries), regardless of trigger morphology. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.
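The decomposition step can be illustrated by summing one attention row over the token spans of the three components. The span boundaries and attention values below are made up for illustration, and the GMM profiling and EM-based voting stages are deliberately omitted, since the abstract gives no detail on them.

```python
from typing import Dict, Sequence, Tuple


def component_attention_shares(attn_row: Sequence[float],
                               spans: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Fraction of one query token's attention mass that falls on each component.

    attn_row: attention weights over all key positions for one head and query.
    spans: component name -> (start, end) half-open index range of its tokens.
    """
    total = sum(attn_row)
    return {name: sum(attn_row[start:end]) / total
            for name, (start, end) in spans.items()}


# Hypothetical layout: 2 system tokens, 3 vision tokens, 1 user-text token.
spans = {"system": (0, 2), "vision": (2, 5), "text": (5, 6)}
attn = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]
shares = component_attention_shares(attn, spans)
print(shares)
```

A per-head vector of such shares is the kind of statistic a mixture model could then separate into clean-looking and poisoned-looking clusters.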

2601.21198 2026-05-25 cs.DC cs.AI cs.LG 版本更新

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

ZipMoE:通过无损压缩和缓存亲和调度实现高效的设备端MoE服务

Yuchen Yang, Yaru Zhao, Pu Yang, Shaowei Wang, Zhi-Hua Zhou

发表机构 * School of Electronic Science and Engineering, Nanjing University, China.(南京大学电子科学与工程学院) National Key Laboratory for Novel Software Technology, Nanjing University, China.(南京大学新型软件技术国家重点实验室)

AI总结 本文提出了一种名为ZipMoE的高效边缘设备MoE服务系统,旨在解决大语言模型中MoE架构在资源受限设备上部署时的高内存消耗问题。ZipMoE通过结合边缘设备的硬件特性和MoE参数的统计冗余,设计了一种具有可证明性能保障的缓存与调度协同机制,将设备端MoE推理从I/O瓶颈转向计算驱动的工作流,从而实现高效的并行处理。实验表明,ZipMoE在多个边缘计算平台上显著降低了推理延迟并提升了吞吐量,优于现有先进系统。

Comments ICML 2026

详情
AI中文摘要

虽然混合专家(MoE)架构显著增强了大型语言模型的表达能力,但其巨大的内存占用严重阻碍了在资源受限的边缘设备上的实际部署,尤其是在必须保持模型行为而不依赖有损量化的情况下。在本文中,我们提出了ZipMoE,一个高效且语义无损的设备端MoE服务系统。ZipMoE通过具有可证明性能保证的缓存-调度协同设计,利用了边缘设备的硬件特性与MoE参数固有的统计冗余之间的协同作用。从根本上说,我们的设计将设备端MoE推理的范式从I/O瓶颈转变为以计算为中心的工作流,从而实现高效的并行化。我们实现了ZipMoE的原型,并在代表性边缘计算平台上使用流行的开源MoE模型和真实工作负载进行了广泛实验。评估结果表明,与最先进系统相比,ZipMoE实现了高达72.77%的推理延迟降低和高达6.76倍的吞吐量提升。我们的代码可在https://github.com/npnothard/ZipMoE-ICML26获取。

英文摘要

While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory footprint severely impedes the practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76× higher throughput than the state-of-the-art systems. Our code is available at: https://github.com/npnothard/ZipMoE-ICML26.
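The general pattern of keeping expert weights compressed and decompressing them on demand behind a small cache can be sketched as follows. This is a generic illustration only: `zlib` and a least-recently-used policy are stand-ins chosen for simplicity, and the class name is hypothetical. ZipMoE's actual codec and cache-affinity scheduler are not described in the abstract.

```python
import zlib
from array import array
from collections import OrderedDict
from typing import Dict, Sequence


class CompressedExpertStore:
    """Holds losslessly compressed float32 weights and an LRU cache of decoded experts."""

    def __init__(self, capacity: int) -> None:
        self._blobs: Dict[int, bytes] = {}      # expert id -> compressed bytes
        self._cache: "OrderedDict[int, array]" = OrderedDict()
        self._capacity = capacity
        self.decompressions = 0                 # counts cache misses

    def put(self, expert_id: int, weights: Sequence[float]) -> None:
        self._blobs[expert_id] = zlib.compress(array("f", weights).tobytes())

    def get(self, expert_id: int) -> array:
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as most recently used
            return self._cache[expert_id]
        weights = array("f")
        weights.frombytes(zlib.decompress(self._blobs[expert_id]))
        self.decompressions += 1
        self._cache[expert_id] = weights
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)     # evict the least recently used expert
        return weights


store = CompressedExpertStore(capacity=1)
store.put(0, [0.5, 1.5, -2.0])
store.put(1, [4.0, 8.0])
first = list(store.get(0))
again = list(store.get(0))   # served from cache, no second decompression
```

The round trip restores the stored float32 bit patterns exactly, which is the sense in which such a scheme avoids lossy quantization.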

2601.16027 2026-05-25 cs.AI 版本更新

Deja Vu in Plots: Leveraging Cross-Session Evidence with Retrieval-Augmented LLMs for Live Streaming Risk Assessment

绘图中的既视感:利用检索增强的大语言模型跨会话证据进行直播风险评估

Yiran Qiao, Xiang Ao, Jing Chen, Yang Liu, Qiwei Zhong, Qing He

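Affiliations * Institute of Computing Technology, Chinese Academy of Sciences

AI summary As live streaming platforms grow, detecting risks such as scams and malicious behavior becomes increasingly important, yet these risks often build up gradually and recur across different sessions, which makes them hard to identify. The paper proposes CS-VAR, a cross-session evidence-aware retrieval-augmented detector that pairs a lightweight model with an LLM's ability to analyze cross-session behavior for efficient risk identification and assessment. Experiments on large-scale industrial datasets show strong performance, and the method provides interpretable signals that support real-world live-stream moderation.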

Comments SIGIR'26 Full Paper

详情
AI中文摘要

直播的兴起改变了在线互动方式,实现了大规模实时参与,但也使平台面临复杂风险,如诈骗和协同恶意行为。检测这些风险具有挑战性,因为有害行为通常逐渐累积并在看似无关的直播中重复出现。为此,我们提出了CS-VAR(跨会话证据感知检索增强检测器)用于直播风险评估。在CS-VAR中,一个轻量级、领域特定的模型执行快速的会话级风险推理,在训练过程中由一个大语言模型(LLM)指导,该LLM对检索到的跨会话行为证据进行推理,并将其局部到全局的见解传递给小模型。这种设计使小模型能够识别跨直播的重复模式,执行结构化风险评估,并保持实时部署的效率。在大规模工业数据集上的广泛离线实验,结合在线验证,展示了CS-VAR的最先进性能。此外,CS-VAR提供可解释的局部信号,有效赋能直播的实际审核工作。

英文摘要

The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.

2601.09600 2026-05-25 cs.CY cs.AI cs.HC cs.IR 版本更新

Information Access of the Oppressed: Freirean Design for Emancipatory Information Access

被压迫者的信息获取:解放性信息获取的弗莱雷式设计

Bhaskar Mitra, Nicola Neophytou, Sireesh Gururaja

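Affiliations * Independent Researcher; Carnegie Mellon University

AI summary This paper examines how, given authoritarian capture of online information access platforms, Paulo Freire's theory of emancipatory pedagogy can inform the design of emancipatory information access systems. It challenges the traditional dichotomy between technologists and users and argues for "Freirean Design", which turns platforms into tools that community members co-construct and use in their own struggles, thereby enabling structural emancipation.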

详情
AI中文摘要

在线信息获取(IA)平台是威权主义捕获的目标。我们通过保罗·弗莱雷的解放教育学理论视角,探讨如何保护我们的平台并确保解放性成果。弗莱雷的理论为探索IA的社会技术问题提供了一个截然不同的视角,相对于当前主导的公平、问责和透明度框架。我们明确挑战IA平台开发中的技术专家-用户二分法,这反映了弗莱雷分析中的师生关系。通过将弗莱雷的分析扩展到IA,我们批判了技术专家作为解放者的框架,即(利他主义的)技术专家有责任减轻新兴技术对边缘化社区的风险。相反,我们倡导弗莱雷式设计,其目标是在结构上使平台暴露于社区成员的共同选择和共同构建,以支持他们的解放斗争。

英文摘要

Online information access (IA) platforms are targets of authoritarian capture. We explore the question of how to safeguard our platforms and ensure emancipatory outcomes through the lens of Paulo Freire's theories of emancipatory pedagogy. Freire's theories provide a radically different lens for exploring IA's sociotechnical concerns relative to the current dominating frames of fairness, accountability, and transparency. We make explicit, with the intention to challenge, the technologist-user dichotomy in IA platform development that mirrors the teacher-student relation in Freire's analysis. By extending Freire's analysis to IA, we critique the technologists-as-liberator frame where it is the burden of (altruistic) technologists to mitigate the risks of emerging technologies for marginalized communities. Instead, we advocate for Freirean Design whose goal is to structurally expose the platform for co-option and co-construction by community members in aid of their emancipatory struggles.

2601.00969 2026-05-25 cs.RO cs.AI 版本更新

V-VLAPS: Value-Guided Planning for Vision-Language-Action Models

V-VLAPS:面向视觉-语言-动作模型的价值引导规划

Ke Ren, Ali Salamatian, Kieran Pattison, Cyrus Neary

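Affiliations * The University of British Columbia

AI summary This work proposes V-VLAPS, a value-guided planning method for vision-language-action (VLA) models that addresses planning failures caused by policy bias in complex tasks. V-VLAPS adds a lightweight value head, trained on offline VLA rollouts to predict Monte Carlo returns, and uses its predictions to steer Monte Carlo Tree Search toward higher-value branches. On the LIBERO task suites, V-VLAPS matches the value-free baseline at the default search budget and outperforms it on all suites once the search budget is increased.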

详情
AI中文摘要

视觉-语言-动作(VLA)模型为机器人操作提供了强大的动作先验,但其反应式行为在分布偏移和长时域任务结构下可能失败。最近的VLA引导规划方法通过使用预训练策略引导树搜索来改进执行,但节点选择仍严重依赖于策略先验和访问计数探索。因此,当策略偏向不良动作时,规划器缺乏学习到的价值信号来纠正这种偏差。先前工作表明,VLA表示编码了 rollout 成功与失败信息,暗示它们也可能在规划期间支持价值估计。我们引入了价值引导的视觉-语言-动作规划与搜索(V-VLAPS),该方法通过一个在离线VLA rollout上训练的轻量级价值头来预测蒙特卡洛回报,从而增强VLA引导规划。这些预测引导蒙特卡洛树搜索朝向更高价值的分支。在五个LIBERO套件上,V-VLAPS在默认搜索预算下总体上与无价值规划基线相当,分析表明许多硬失败是根级超时,其中预测值弱分离。在更大的搜索预算下,V-VLAPS在所有任务套件上优于基线,在LIBERO-Object上提高6个百分点,在LIBERO-10上提高4个百分点。我们的结果表明,VLA表示不仅可以支持失败预测,还可以在搜索到达价值排序重要的分支时支持价值引导规划。

英文摘要

Vision-language-action (VLA) models provide strong action priors for robotic manipulation, but their reactive behavior can fail under distribution shift and long-horizon task structure. Recent VLA-guided planning methods improve execution by using pretrained policies to guide tree search, yet node selection still depends heavily on policy priors and visit-count exploration. Consequently, when the policy favors poor actions, the planner lacks a learned value signal to correct this bias. Prior work has shown that VLA representations encode rollout success and failure information, suggesting that they may also support value estimation during planning. We introduce Value-Guided Vision-Language-Action Planning and Search (V-VLAPS), which augments VLA-guided planning with a lightweight value head trained on offline VLA rollouts to predict Monte Carlo returns. These predictions guide Monte Carlo Tree Search toward higher-value branches. Across five LIBERO suites, V-VLAPS matches the value-free planning baseline at the default search budget in aggregate, and analysis shows that many hard failures are root-level timeouts where predicted values are weakly separated. With a larger search budget, V-VLAPS improves over the baseline in all task suites, with +6 percentage points on LIBERO-Object and +4 percentage points on LIBERO-10. Our results suggest that VLA representations can support not only failure prediction, but also value-guided planning when search reaches branches where value-based ranking matters.
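To make the role of a value signal in node selection concrete, the following is a minimal sketch of a PUCT-style selection rule. The abstract does not state which selection formula V-VLAPS uses, so the rule itself, the `Child` class, the `c_puct` constant, and the toy action names and numbers are all assumptions for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    """Statistics for one candidate action at a search node (hypothetical layout)."""
    prior: float             # policy prior for this action (assumed to come from the VLA policy)
    visits: int = 0          # how often search has tried this action
    value_sum: float = 0.0   # sum of backed-up value estimates

    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(child: Child, parent_visits: int, c_puct: float = 1.5) -> float:
    # The exploration term depends only on the prior and visit counts;
    # mean_value() is where a learned value estimate enters the ranking.
    exploration = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.mean_value() + exploration


def select(children: dict, c_puct: float = 1.5) -> str:
    parent_visits = sum(c.visits for c in children.values()) + 1
    return max(children, key=lambda a: puct_score(children[a], parent_visits, c_puct))


# Toy node: the prior favors "reach", but backed-up values favor "grasp".
with_value = {
    "reach": Child(prior=0.7, visits=10, value_sum=2.0),
    "grasp": Child(prior=0.3, visits=10, value_sum=8.0),
}
# Same node with no value information: only the prior and counts matter.
without_value = {
    "reach": Child(prior=0.7, visits=10),
    "grasp": Child(prior=0.3, visits=10),
}
print(select(with_value), select(without_value))
```

In this toy node the value term overrides a prior that favors the weaker action, which is the kind of bias correction the abstract describes. When value estimates are nearly equal across actions, the ranking falls back to the prior and visit counts.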

2512.20298 2026-05-25 cs.CL cs.AI cs.CY cs.HC 版本更新

Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

模式 vs. 患者:通过第一人称叙事评估大语言模型与心理健康专业人员的人格障碍诊断能力

Karolina Drożdż, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz

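Affiliations * IDEAS Research Institute; Adam Mickiewicz University; AMU Center for Artificial Intelligence; Poznań University of Medical Sciences; Maria Curie-Skłodowska University

AI summary This study examines how well large language models (LLMs) diagnose personality disorders from first-person narratives, comparing them with mental health professionals on Borderline (BPD) and Narcissistic (NPD) Personality Disorders. LLMs identified BPD well but severely underdiagnosed NPD, which suggests a possible bias in how the models handle value-laden terms. LLMs tended to give elaborate explanations built on patterns and formal categories, whereas human experts focused on patients' sense of self and temporal experience; overall diagnostic reliability still needs improvement.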

详情
AI中文摘要

对LLMs进行精神病学自我评估的日益依赖引发了对其解释定性患者叙事能力的质疑。这项深度而非广度的案例研究直接比较了最先进的LLMs和心理健康专业人员,基于波兰语第一人称自传叙事评估边缘型人格障碍(BPD)和自恋型人格障碍(NPD)。在我们的样本中,表现最佳的Gemini Pro模型的总体诊断得分(65.48%)比人类专业人员的平均得分(43.57%)高出21.91个百分点。虽然模型和人类专家在识别BPD方面都表现出色(F1分别为83.4和80.0),但模型严重漏诊NPD(F1=6.7 vs. 50.0),显示出对价值负载术语“自恋”的潜在回避。定性上,模型提供了自信、详尽的理由,侧重于模式和形式类别,而人类专家则保持简洁和谨慎,强调患者的自我感和时间体验。我们的研究结果表明,虽然LLMs可能擅长解释复杂的临床第一人称数据,但其输出仍然存在关键的可靠性和偏见问题。

英文摘要

Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth-over-breadth case study directly compares state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 and F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a potential reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients' sense of self and temporal experience. Our findings demonstrate that while LLMs might be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.

2512.18470 2026-05-25 cs.SE cs.AI cs.MA 版本更新

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

SWE-EVO:在长周期软件演化场景中基准测试编码智能体

Tue Le, Minh V. T. Thai, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui

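Affiliations * FPT Software AI Center; School of Computing and Information Systems; University of Melbourne; Center of AI Research; VinUniversity

AI summary Existing benchmarks for AI coding agents focus on single tasks such as fixing a bug or adding a small feature, whereas real software engineering is a long-horizon process that involves multi-file coordination and repeated iteration. The authors introduce SWE-EVO, a benchmark built from the release notes of seven mature open-source Python projects, with 48 tasks that require multi-step modifications spanning 21 files on average and are validated by extensive test suites. Experiments show a significant gap in current agents' performance on long-horizon, multi-file tasks, and the paper also proposes Fix Rate, a new metric for partial progress.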

详情
AI中文摘要

现有的AI编码智能体基准测试主要关注孤立、单一问题的任务,例如修复一个bug或添加一个小功能。然而,现实世界的软件工程是一个长周期的工作:开发者解读高层次需求,协调跨多个文件的变更,并在多次迭代中演化代码库同时保持功能。我们引入了SWE-EVO,一个针对这种长周期软件演化挑战的基准测试。该基准测试基于七个成熟的开源Python项目的发布说明构建,包含48个任务,每个任务需要平均跨越21个文件的多步修改,并通过平均每个实例874个测试的测试套件进行验证。实验揭示了一个显著的能力差距:带有OpenHands的GPT-5.4在SWE-EVO上仅达到25%,而GPT-5.2在SWE-Bench Verified上达到72.80%,表明当前智能体在持续的、多文件推理方面存在困难。我们还提出了修复率(Fix Rate),一个衡量这些复杂长周期任务部分进展的指标。

英文摘要

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.

2512.06404 2026-05-25 cs.AI cond-mat.mtrl-sci physics.chem-ph 版本更新

GENIUS: An Agentic AI Framework for Autonomous Design and Execution of Simulation Protocols

GENIUS: 一种用于自主设计和执行模拟协议的智能AI框架

Mohammad Soleymanibrojeni, Roland Aydin, Diego Guedes-Sobrinho, Alexandre C. Dias, Maurício J. Piotrowski, Wolfgang Wenzel, Celso Ricardo Caldeira Rêgo

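Affiliations * Karlsruhe Institute of Technology; Institute of Nanotechnology; Hamburg University of Technology; Federal University of Paraná; University of Brasília; Federal University of Pelotas; Institute of Physics and International Center of Physics

AI summary GENIUS is an agentic framework for autonomously designing and executing simulation protocols, addressing the complex setup and debugging involved in computational materials science. It combines a knowledge graph of the electronic-structure package Quantum ESPRESSO with a tiered hierarchy of language models supervised by a finite-state error-recovery machine, translating natural-language prompts into valid input files and repairing errors automatically. GENIUS substantially reduces inference cost and hallucinations, making electronic-structure density functional theory simulations more accessible and advancing automated, large-scale materials engineering.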

Journal ref Communications Materials 7, 115 (2026)

详情
AI中文摘要

预测性原子模拟推动了材料发现,但常规设置和调试仍需计算机专家。这种知识差距限制了集成计算材料工程(ICME),因为最先进的代码存在但非专家使用起来仍然繁琐。我们通过GENIUS解决了这一瓶颈,这是一种AI智能体工作流,将智能的Quantum ESPRESSO知识图谱与由有限状态错误恢复机器监督的分层大语言模型层次结构融合。我们展示了GENIUS将自由形式的人类生成提示翻译成经过验证的输入文件,在295个多样化基准测试中约有80%运行完成,其中76%被自主修复,成功率呈指数衰减至7%的基线。与仅使用LLM的基线相比,GENIUS将推理成本减半,并几乎消除了幻觉。该框架通过智能自动化协议生成、验证和修复,使电子结构DFT模拟大众化,为全球学术界和工业界开放大规模筛选并加速ICME设计循环。

英文摘要

Predictive atomistic simulations have propelled materials discovery, yet routine setup and debugging still demand computer specialists. This know-how gap limits Integrated Computational Materials Engineering (ICME), where state-of-the-art codes exist but remain cumbersome for non-experts. We address this bottleneck with GENIUS, an AI-agentic workflow that fuses a smart Quantum ESPRESSO knowledge graph with a tiered hierarchy of large language models supervised by a finite-state error-recovery machine. Here we show that GENIUS translates free-form human-generated prompts into validated input files that run to completion on $\approx$80% of 295 diverse benchmarks, where 76% are autonomously repaired, with success decaying exponentially to a 7% baseline. Compared with LLM-only baselines, GENIUS halves inference costs and virtually eliminates hallucinations. The framework democratizes electronic-structure DFT simulations by intelligently automating protocol generation, validation, and repair, opening large-scale screening and accelerating ICME design loops across academia and industry worldwide.

2511.18000 2026-05-25 cs.LG cs.AI q-bio.PE 版本更新

Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

空间流行病模拟中的奖励工程:个体行为学习的强化学习平台

Radman Rakhshandehroo, Daniel Coombs

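Affiliations * Department of Computer Science, University of British Columbia; Department of Mathematics and Institute of Applied Mathematics, University of British Columbia

AI summary This paper introduces ContagionRL, a reinforcement learning platform for spatial epidemic simulation, built to study systematically how reward function design affects individual behavioral learning. The platform incorporates a configurable SIRS+D epidemic model and supports evaluating how different reward schemes shape agents' survival strategies under varied environmental conditions; experiments identify directional guidance and explicit adherence incentives as key to policy learning. Agents trained with a potential field reward performed best in adherence to non-pharmaceutical interventions and in spatial avoidance, and the platform offers modular tooling for exploring reward-behavior relationships.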

Comments 38 pages, 15 figures and 18 tables; Accepted to TMLR. OpenReview: https://openreview.net/forum?id=yPEASsx3hk

Journal ref Transactions on Machine Learning Research, 2026

详情
AI中文摘要

我们提出了ContagionRL,一个与Gymnasium兼容的强化学习平台,专门用于空间流行病模拟中的系统奖励工程。与依赖固定行为规则的传统基于智能体的模型不同,我们的平台能够严格评估奖励函数设计如何影响在不同流行病场景中学到的生存策略。ContagionRL集成了空间SIRS+D流行病模型与可配置的环境参数,允许研究人员在包括有限可观测性、不同移动模式和异质人口动态等变化条件下对奖励函数进行压力测试。我们评估了五种不同的奖励设计,从稀疏生存奖励到一种新颖的势场方法,跨越多种RL算法(PPO、SAC、A2C)。通过系统的消融研究,我们发现方向性指导和明确的依从性激励是稳健策略学习的关键组成部分。我们在不同感染率、网格大小、可见性约束和移动模式下的全面评估表明,奖励函数的选择显著影响智能体行为和生存结果。使用我们的势场奖励训练的智能体始终获得优越性能,学习最大程度地遵守非药物干预,同时发展出复杂的空间规避策略。该平台的模块化设计使得能够系统地探索奖励-行为关系,弥补了这类模型中奖励工程关注有限的空白。ContagionRL是研究流行病背景下适应性行为反应的有效平台,并强调了奖励设计、信息结构和环境可预测性在学习中的重要性。我们的代码公开在https://github.com/redradman/ContagionRL。

英文摘要

We present ContagionRL, a Gymnasium-compatible reinforcement learning platform specifically designed for systematic reward engineering in spatial epidemic simulations. Unlike traditional agent-based models that rely on fixed behavioral rules, our platform enables rigorous evaluation of how reward function design affects learned survival strategies across diverse epidemic scenarios. ContagionRL integrates a spatial SIRS+D epidemiological model with configurable environmental parameters, allowing researchers to stress-test reward functions under varying conditions including limited observability, different movement patterns, and heterogeneous population dynamics. We evaluate five distinct reward designs, ranging from sparse survival bonuses to a novel potential field approach, across multiple RL algorithms (PPO, SAC, A2C). Through systematic ablation studies, we identify that directional guidance and explicit adherence incentives are critical components for robust policy learning. Our comprehensive evaluation across varying infection rates, grid sizes, visibility constraints, and movement patterns reveals that reward function choice dramatically impacts agent behavior and survival outcomes. Agents trained with our potential field reward consistently achieve superior performance, learning maximal adherence to non-pharmaceutical interventions while developing sophisticated spatial avoidance strategies. The platform's modular design enables systematic exploration of reward-behavior relationships, addressing a knowledge gap in models of this type where reward engineering has received limited attention. ContagionRL is an effective platform for studying adaptive behavioral responses in epidemic contexts and highlights the importance of reward design, information structure, and environmental predictability in learning. Our code is publicly available at https://github.com/redradman/ContagionRL

2511.16014 2026-05-25 cs.AI 版本更新

MUSEKG: A Knowledge Graph Over Museum Collections

MUSEKG:博物馆藏品知识图谱

Jinhao Li, Jianzhong Qi, Soyeon Caren Han, Eun-Jung Holden

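Affiliations * The University of Melbourne School of Computing; The University of Melbourne

AI summary MuseKG is an interactive knowledge graph system for museum collection data that integrates heterogeneous sources such as structured catalogue records, images, and unstructured descriptions into a unified, queryable representation. It builds a typed graph that links objects, people, organisations, images, and semantic entities, and it supports natural-language queries and relation-aware retrieval. An interactive demonstration on real museum collections shows that MuseKG supports common tasks such as attribute lookup and relation exploration, with answers that stay inspectable through the explicit graph structure.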

Comments SIGIR'26

详情
AI中文摘要

文化遗产领域的数字化产生了大量但分散的博物馆藏品数据存储库,涵盖结构化编目记录、图像和非结构化描述。现有的博物馆信息系统通常难以将这些来源整合成统一的、可查询的表示,以支持关系感知的探索。我们提出了MuseKG,一个交互式知识图谱系统,它将异构博物馆数据组织成一个类型化图,在连贯的模式下链接对象、人物、组织、图像、图像派生标签和提取的语义实体。MuseKG通过将用户问题映射到图实体并检索用于答案生成的紧凑证据邻域来支持自然语言查询。通过在真实博物馆藏品上的交互式演示,我们展示了MuseKG支持常见的探索任务,如属性查找、关系探索和关系感知检索,并且答案可以通过显式图结构进行检查。

英文摘要

Digitisation in the cultural heritage sector has produced large but fragmented repositories of museum collection data, spanning structured catalogue records, images, and unstructured descriptions. Existing museum information systems often make it difficult to integrate these sources into a unified, queryable representation that supports relation-aware exploration. We present MuseKG, an interactive knowledge graph system that organises heterogeneous museum data into a typed graph that links objects, people, organisations, images, image-derived labels, and extracted semantic entities within a coherent schema. MuseKG supports natural-language queries by grounding user questions to graph entities and retrieving a compact neighbourhood of evidence for answer generation. Through an interactive demonstration on real museum collections, we show that MuseKG supports common exploration tasks such as attribute lookup, relation exploration, and relation-aware retrieval, with answers that remain inspectable via explicit graph structures.

2511.02239 2026-05-25 cs.RO cs.AI 版本更新

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

LACY: 基于视觉-语言模型的语言-动作循环用于自我改进的机器人操作

Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi

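Affiliations * Department of Electrical and Computer Engineering, University of Minnesota

AI summary This paper proposes LACY, a Language-Action Cycle framework built on a vision-language model and aimed at improving policy generalization in robotic manipulation. It jointly learns language-to-action (L2A), action-to-language (A2L), and language-to-language consistency (L2C) tasks, so that the robot can both execute tasks and explain its own behavior, forming richer internal representations. LACY uses an active augmentation strategy to generate and filter training data autonomously without extra human annotation; experiments on pick-and-place tasks show an average 56.46% improvement in success rate and more robust language-action grounding.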

Comments Accepted to ICRA 2026. Project page: https://vla2026.github.io/LACY/

详情
AI中文摘要

学习机器人操作的可泛化策略越来越依赖于将语言指令映射到动作(L2A)的大规模模型。然而,这种单向范式通常产生执行任务而缺乏更深层次上下文理解的策略,限制了它们泛化或解释其行为的能力。我们认为,将动作映射回语言(A2L)的互补技能对于发展更全面的基础至关重要。一个既能行动又能解释其动作的智能体可以形成更丰富的内部表示,并开启自我监督学习的新范式。我们引入了LACY(语言-动作循环),一个统一的框架,在单个视觉-语言模型内学习这种双向映射。LACY在三个协同任务上联合训练:从语言生成参数化动作(L2A)、用语言解释观察到的动作(A2L)以及验证两个语言描述之间的语义一致性(L2C)。这实现了一个自我改进的循环,通过针对低置信度案例的主动增强策略自主生成和过滤新的训练数据,从而在没有额外人工标注的情况下改进模型。在仿真和真实世界的拾取-放置任务上的实验表明,LACY平均将任务成功率提高了56.46%,并为机器人操作产生了更稳健的语言-动作基础。项目页面:https://vla2026.github.io/LACY/

英文摘要

Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: https://vla2026.github.io/LACY/

2510.26411 2026-05-25 cs.AI 版本更新

MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

MedSAE: 用稀疏自编码器剖析MedCLIP表示

Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto

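Affiliations * University of Turin; École polytechnique

AI summary This paper proposes MedSAE, which uses sparse autoencoders to dissect the latent space of the medical vision-language model MedCLIP and improve its interpretability. It introduces an evaluation framework that combines correlation metrics, entropy analysis, and automated neuron naming. Experiments show that MedSAE yields more monosemantic and interpretable neuron representations, helping to bridge performance and transparency in medical AI.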

Comments Accepted at ICIP 2026

详情
AI中文摘要

医疗保健中的人工智能需要既准确又可解释的模型。我们通过将医学稀疏自编码器(MedSAEs)应用于MedCLIP(一个在胸部X光片和报告上训练的视觉-语言模型)的潜在空间,推进了医学视觉中的机制可解释性。为了量化可解释性,我们提出了一个评估框架,该框架结合了相关性指标、熵分析以及通过MedGemma基础模型进行的自动神经元命名。在CheXpert数据集上的实验表明,MedSAE神经元比原始MedCLIP特征具有更高的单语义性和可解释性。我们的研究结果弥合了高性能医学AI与透明度之间的差距,为迈向临床可靠的表示提供了可扩展的一步。支持本研究结果的源代码可在https://github.com/EIDOSLAB/MedSAE获取。

英文摘要

Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGemma foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations. The source code supporting the findings of this study is available at https://github.com/EIDOSLAB/MedSAE.

2510.21270 2026-05-25 cs.CL cs.AI cs.CV 版本更新

Sparser Block-Sparse Attention via Token Permutation

通过令牌置换实现更稀疏的块稀疏注意力

Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu

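Affiliations * Fudan University; Shanghai Innovation Institute

AI summary As the context length of large language models grows, compute cost rises sharply, with the main bottleneck being the quadratic complexity of self-attention. The paper proposes Permuted Block-Sparse Attention (PBS-Attn), a sparse attention method that reorders tokens to increase block-level sparsity, improving computational efficiency while preserving model accuracy. Experiments show that it outperforms existing block-sparse attention methods on several long-context datasets and achieves up to 2.75x end-to-end speedup in long-context prefilling.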

Comments ICML 2026

详情
AI中文摘要

扩展大语言模型(LLM)的上下文长度带来了显著的好处,但计算成本高昂。这种成本主要源于自注意力机制,其相对于序列长度的$O(N^2)$复杂度在内存和延迟方面构成了主要瓶颈。幸运的是,注意力矩阵通常是稀疏的,尤其是对于长序列,这为优化提供了机会。块稀疏注意力已成为一种有前景的解决方案,它将序列划分为块并跳过其中一部分块的计算。然而,该方法的有效性高度依赖于底层的注意力模式,这可能导致次优的块级稀疏性。例如,单个块内查询的重要键令牌可能分散在许多其他块中,导致计算冗余。在这项工作中,我们提出了置换块稀疏注意力(PBS-Attn),这是一种即插即用的方法,利用注意力的置换性质来增加块级稀疏性并提高LLM预填充的计算效率。我们在具有挑战性的真实世界长上下文数据集上进行了全面实验,结果表明PBS-Attn在模型精度上始终优于现有的块稀疏注意力方法,并紧密匹配全注意力基线。借助我们自定义的permuted-FlashAttention内核,PBS-Attn在长上下文预填充中实现了高达2.75倍的端到端加速,证实了其实用性。代码可在https://github.com/xinghaow99/pbs-attn获取。

英文摘要

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code is available at https://github.com/xinghaow99/pbs-attn
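The "permutation properties of attention" mentioned above can be checked numerically. The sketch below is not the PBS-Attn implementation. It is a small pure-Python check, under the simplifying assumption of unmasked softmax attention, that (a) reordering keys and values together leaves every output row unchanged and (b) reordering queries only reorders the output rows. Causal masking, which LLM prefilling requires, is deliberately left out here. All function names are hypothetical.

```python
import math
import random


def attention(Q, K, V):
    """Unmasked softmax attention over lists of row vectors."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))
        ])
    return out


def permute(rows, perm):
    """Reorder rows so that new row i is old row perm[i]."""
    return [rows[i] for i in perm]


def close(A, B, tol=1e-9):
    """Element-wise comparison of two matrices within a tolerance."""
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def random_matrix(rng, n, d):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
n, d = 6, 4
Q, K, V = (random_matrix(rng, n, d) for _ in range(3))
key_perm = rng.sample(range(n), n)
query_perm = rng.sample(range(n), n)

base = attention(Q, K, V)
# (a) Keys and values permuted with the same order: output rows unchanged.
kv_permuted = attention(Q, permute(K, key_perm), permute(V, key_perm))
# (b) Queries permuted: output rows are permuted in the same way.
q_permuted = attention(permute(Q, query_perm), K, V)

print(close(base, kv_permuted), close(permute(base, query_perm), q_permuted))
```

Because the result is unchanged up to row order, tokens can in principle be regrouped so that the keys a query block attends to most strongly land in fewer blocks. How the permutation is chosen, and how it interacts with causal masking, is specific to the paper and not shown here.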

2510.12787 2026-05-25 cs.AI cs.MA 版本更新

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Ax-Prover:用于数学和量子物理定理证明的深度推理智能体框架

Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, Dirk Englund

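Affiliations * Axiomatic_AI; Massachusetts Institute of Technology (MIT); Institut de Ciències Fotòniques (ICFO); Institució Catalana de Recerca i Estudis Avançats (ICREA)

AI summary This paper presents Ax-Prover, a multi-agent system for automated theorem proving in the Lean proof assistant that can solve problems in scientific domains such as mathematics and quantum physics and can run autonomously or in collaboration with human experts. It combines the reasoning ability of large language models with the strict formal verification provided by Lean tools, connected through the Model Context Protocol. In experiments, Ax-Prover is competitive on public benchmarks and clearly outperforms existing methods on the newly introduced abstract algebra and quantum theory benchmarks, indicating that the approach generalizes across domains.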

详情
AI中文摘要

我们提出了Ax-Prover,一个用于Lean中自动化定理证明的多智能体系统,能够解决跨不同科学领域的问题,并可自主运行或与人类专家协作。为此,Ax-Prover通过形式化证明生成来处理科学问题求解,这一过程既需要创造性推理又需要严格的句法严谨性。Ax-Prover通过模型上下文协议(MCP)将提供知识和推理能力的大语言模型(LLM)与确保形式正确性的Lean工具相结合,以应对这一挑战。为了评估其作为自主证明器的性能,我们在两个公开数学基准以及我们在抽象代数和量子理论领域引入的两个Lean基准上,将我们的方法与前沿LLM和专用证明器模型进行了比较。在公开数据集上,Ax-Prover与最先进的证明器竞争力相当,而在新基准上则大幅超越它们。这表明,与难以泛化的专用系统不同,我们基于工具的智能体定理证明方法为跨不同科学领域的形式化验证提供了一种可泛化的方法论。此外,我们通过一个实际用例展示了Ax-Prover的辅助能力,展示了它如何使一位专家数学家能够形式化一个复杂密码学定理的证明。

英文摘要

We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

2510.11195 2026-05-25 cs.CR cs.AI 版本更新

RAG-Pull: Turning Retrieval into a Code-Injection Channel via Invisible Unicode Perturbations

RAG-Pull:通过不可见Unicode扰动将检索转化为代码注入通道

Aritra Dhar, Vasilije Stambolic, Lukas Cavigelli

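Affiliations * Computing System Labs, Huawei Technologies Switzerland AG; BKW Energie AG

AI summary This paper presents RAG-Pull, a new black-box attack on retrieval-augmented generation (RAG) systems that inserts invisible Unicode character perturbations into queries or code repositories to steer retrieval toward malicious code, breaking the model's safety alignment. Perturbing only the query or only the target code already shifts retrieval results markedly, while combining the two yields a near-perfect attack. The work exposes a potential security weakness in RAG systems and offers a new perspective for research on LLM security.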

详情
AI中文摘要

检索增强生成(RAG)通过将外部数据添加到LLM的上下文中,提高了LLM响应的可靠性和可信度,减少了幻觉,并消除了模型重新训练的需要。我们开发了一种新的黑盒攻击类别RAG-Pull,该攻击将隐藏的UTF字符插入查询或外部代码库中,将检索重定向到恶意代码,从而破坏模型的安全对齐。我们观察到,仅查询和代码扰动就能使检索偏向攻击者控制的片段,而组合的查询和目标扰动实现了近乎完美的成功。一旦被检索,这些片段会引入可利用的漏洞,如远程代码执行和SQL注入。RAG-Pull的最小扰动可以改变模型的安全对齐,并增加对不安全代码的偏好,从而为LLM开辟了一类新的攻击方式。

英文摘要

Retrieval-Augmented Generation (RAG) increases the reliability and trustworthiness of LLM responses and reduces hallucination by adding external data to the LLM's context, which eliminates the need for model retraining. We develop a new class of black-box attack, RAG-Pull, that inserts hidden UTF characters into queries or external code repositories, redirecting retrieval toward malicious code and thereby breaking the model's safety alignment. We observe that query or code perturbations alone can shift retrieval toward attacker-controlled snippets, while combined query-and-target perturbations achieve near-perfect success. Once retrieved, these snippets introduce exploitable vulnerabilities such as remote code execution and SQL injection. RAG-Pull's minimal perturbations can alter the model's safety alignment and increase its preference for unsafe code, opening up a new class of attacks on LLMs.
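On the defensive side, one plausible first check is to flag invisible code points in a query or retrieved snippet before it reaches the retriever. The sketch below is an assumption-laden illustration, not a countermeasure proposed by the paper: the abstract does not say which characters the attack uses, so the filter here simply targets the Unicode "Cf" (format) category plus variation selectors, and the function names are hypothetical.

```python
import unicodedata


def is_invisible(ch: str) -> bool:
    """Heuristic: format characters (category Cf) and variation selectors U+FE00..U+FE0F."""
    return unicodedata.category(ch) == "Cf" or 0xFE00 <= ord(ch) <= 0xFE0F


def find_invisible(text: str):
    """Return (index, code point, name) for each suspicious character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if is_invisible(ch)
    ]


def strip_invisible(text: str) -> str:
    """Remove the flagged characters; note the caveat below before using this."""
    return "".join(ch for ch in text if not is_invisible(ch))


# Made-up query containing a zero-width space and a zero-width joiner.
query = "parse\u200b json\u200d file"
for hit in find_invisible(query):
    print(hit)
print(strip_invisible(query))
```

Blanket stripping is a blunt tool: zero-width joiners and non-joiners are legitimate in emoji sequences and in scripts such as Persian, so a production filter would more likely log or reject suspicious input than silently rewrite it.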

2510.09136 2026-05-25 cs.IR cs.AI 版本更新

Controlled Personalization in Legacy Media Online Services: A Case Study in News Recommendation

传统媒体在线服务中的受控个性化:新闻推荐案例研究

Marlene Holzleitner, Stephan Leitner, Hanna Lind Jorgensen, Christoph Schmitz, Jacob Welander, Dietmar Jannach

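Affiliations * University of Klagenfurt; University of Bergen

AI summary This paper studies the effect of a "controlled personalization" recommendation strategy on the online platforms of legacy news media, a strategy that seeks to balance technology-driven recommendation with core editorial values. An A/B test on the website of a major Norwegian legacy news organization shows that even modest personalization significantly raises click-through rates, reduces navigation effort, increases content diversity and catalog coverage, and reduces popularity bias. The results suggest that controlled personalization can meet user needs while upholding editorial goals, giving legacy media a viable path to adopting personalization technologies.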

详情
AI中文摘要

个性化新闻推荐已成为大型新闻聚合服务的标准功能,通过自动内容选择优化用户参与。相比之下,传统新闻媒体通常谨慎对待个性化,努力在技术创新与核心编辑价值之间取得平衡。因此,传统新闻媒体的在线平台通常结合编辑策划内容与算法选择文章——我们将这种策略称为受控个性化。在这篇行业文章中,我们通过在挪威一家主要传统新闻机构的网站上进行的A/B测试,评估了受控个性化的有效性。我们的研究结果表明,即使是适度的个性化也能带来显著收益。具体来说,我们观察到接触个性化内容的用户表现出更高的点击率和更少的导航努力,这表明相关内容的发现得到了改善。此外,我们的分析显示,受控个性化有助于提高内容多样性和目录覆盖,并减少流行度偏差。总体而言,我们的结果表明,受控个性化能够成功地将用户需求与编辑目标对齐,为传统媒体在维护新闻价值的同时采用个性化技术提供了一条可行路径。

英文摘要

Personalized news recommendations have become a standard feature of large news aggregation services, optimizing user engagement through automated content selection. In contrast, legacy news media often approach personalization cautiously, striving to balance technological innovation with core editorial values. As a result, online platforms of traditional news outlets typically combine editorially curated content with algorithmically selected articles, a strategy we term controlled personalization. In this industry article, we evaluate the effectiveness of controlled personalization through an A/B test conducted on the website of a major Norwegian legacy news organization. Our findings indicate that even a modest level of personalization yields substantial benefits. Specifically, we observe that users exposed to personalized content demonstrate higher click-through rates and reduced navigation effort, suggesting improved discovery of relevant content. Moreover, our analysis reveals that controlled personalization contributes to greater content diversity and catalog coverage and reduces popularity bias. Overall, our results suggest that controlled personalization can successfully align user needs with editorial goals, offering a viable path for legacy media to adopt personalization technologies while upholding journalistic values.

2510.08945 2026-05-25 cs.AI 版本更新

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

FATHOMS-RAG:评估使用检索增强生成的多模态系统思考与观察的框架

Samuel Hildebrand, Curtis Taylor, Sean Oesch, James M Ghawaly, Amir Sadovnik, Ryan Shivers, Brandon Schreiber, Kevin Kurian

发表机构 * Louisiana State University(路易斯安那州立大学) Oak Ridge National Lab(橡树岭国家实验室) University of Florida(佛罗里达大学)

AI总结 本文提出了一种名为FATHOMS-RAG的框架,用于评估使用检索增强生成(RAG)的多模态系统在推理和观察方面的能力。该框架引入了一个由人类创建的小型数据集、多项评估指标以及对开源与闭源模型的对比实验,全面检验RAG系统在处理文本、表格和图像等多模态信息时的表现。实验结果表明,闭源模型在准确性和幻觉控制方面显著优于开源模型,尤其是在涉及多模态和跨文档信息的问题上表现更为突出。

Comments Accepted at SAFE-ML 2026 Workshop at the International Conference on Software Testing (ICST) 2026 Code: https://github.com/Sam-Hildebrand/FATHOMS-RAG

详情
AI中文摘要

检索增强生成(RAG)已成为提高大型语言模型(LLMs)事实准确性的有前景范式。我们引入了一个旨在整体评估RAG管道的基准,评估管道摄取、检索和推理多种模态信息的能力,区别于现有专注于检索等特定方面的基准。我们提出:(1)一个由93个人工创建的问题组成的小型数据集,用于评估管道摄取文本数据、表格、图像以及跨这些模态分布在多个文档中的数据的能力;(2)一个用于正确性的短语级召回率指标;(3)一个最近邻嵌入分类器,用于识别潜在的管道幻觉;(4)对使用开源检索机制构建的2个管道和4个闭源基础模型进行的比较评估;(5)第三方人工评估我们正确性和幻觉指标的对齐情况。我们发现,闭源管道在正确性和幻觉指标上均显著优于开源管道,在依赖多模态和跨文档信息的问题上性能差距更大。对我们指标的人工评估显示,在1-5 Likert量表(5表示“强烈同意”)上,正确性平均一致性为4.62,幻觉检测平均一致性为4.53。

英文摘要

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
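
A phrase-level recall metric of the kind listed in item (2) can be sketched as below. The abstract does not give the matching rule, so case-insensitive substring matching after whitespace normalization is an assumption, and the names `phrase_recall` and `_normalize` are hypothetical.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def phrase_recall(response: str, required_phrases: list[str]) -> float:
    """Return the fraction of required phrases found in the response."""
    if not required_phrases:
        raise ValueError("required_phrases must be non-empty")
    resp = _normalize(response)
    # A phrase counts as recalled if it occurs verbatim after normalization.
    hits = sum(1 for phrase in required_phrases if _normalize(phrase) in resp)
    return hits / len(required_phrases)
```

Substring matching keeps the metric cheap and deterministic, at the cost of missing paraphrases; a real pipeline might pair it with a softer matcher.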

2510.00915 2026-05-25 cs.LG cs.AI 版本更新

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

在不完美验证器下基于可验证但含噪声奖励的强化学习

Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama

发表机构 * RIKEN AIP(日本理化学研究所AIP) The University of Tokyo(东京大学) The University of Melbourne(墨尔本大学) The University of Sydney(悉尼大学)

AI总结 该论文研究了在不可靠验证器存在下如何改进可验证奖励的强化学习(RLVR)。通过将验证器的不可靠性建模为具有不对称噪声率的随机奖励通道,作者提出了两种轻量级修正方法:一种是反向修正,用于生成无偏的替代奖励;另一种是正向修正,通过调整得分函数项使策略更新更贴近干净梯度方向。实验表明,这两种方法在合成和真实验证噪声环境下均能提升数学推理任务的性能,其中正向修正在高噪声情况下更为稳定。此外,作者还引入了一个基于轻量级语言模型的申诉机制,用于在线估计假阴性率并进一步提升性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)用自动验证器替代昂贵的人工标注。为减少验证器攻击,许多RLVR系统将奖励二值化为$\{0,1\}$,但不完美的验证器不可避免地引入\emph{假阴性}(拒绝正确答案)和\emph{假阳性}(接受错误答案)。我们将验证器不可靠性形式化为具有非对称噪声率$ρ_0$和$ρ_1$(分别为FP率和FN率)的随机奖励通道。由此抽象我们推导出两种轻量级校正:(i)\emph{后向}校正,产生无偏替代奖励,从而在期望上得到无偏的策略梯度估计量;(ii)\emph{前向}校正,重新加权得分函数项,使得期望更新与干净梯度方向对齐,且仅需FN率。我们在分组相对策略优化流程中将两者实现为轻量级钩子,两种校正均在合成和真实验证器噪声下改善了数学推理的RLVR,其中前向变体在较大噪声下更稳定。最后,一个带有轻量级LLM验证器的上诉机制在线估计FN率并进一步提升性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce \emph{false negatives} (rejecting correct answers) and \emph{false positives} (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $ρ_0$ and $ρ_1$ -- the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a \emph{backward} correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a \emph{forward} correction that reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
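
The unbiasedness claim for the backward correction can be checked with a small sketch. The abstract does not state the estimator, so the form below is an assumption: it is the standard de-biasing that follows from the channel definition, where an observed reward of 1 occurs with probability $ρ_0$ for a wrong answer and $1-ρ_1$ for a correct one. The function names are hypothetical.

```python
def backward_corrected_reward(observed: int, rho0: float, rho1: float) -> float:
    """Surrogate reward (observed - rho0) / (1 - rho0 - rho1).

    rho0 is the false-positive rate, rho1 the false-negative rate.
    Assumed form; not taken from the paper's text.
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        raise ValueError("requires rho0 + rho1 < 1")
    return (observed - rho0) / denom


def expected_corrected_reward(true_reward: int, rho0: float, rho1: float) -> float:
    """Expectation of the surrogate over the noisy verifier's output."""
    # Probability that the verifier reports 1 given the true reward.
    p_one = (1.0 - rho1) if true_reward == 1 else rho0
    return (p_one * backward_corrected_reward(1, rho0, rho1)
            + (1.0 - p_one) * backward_corrected_reward(0, rho0, rho1))
```

Under these assumptions the expected surrogate equals the clean reward for both correct and incorrect answers, which is what makes the resulting policy-gradient estimate unbiased in expectation. Note that the surrogate needs both noise rates, whereas the abstract says the forward correction needs only the FN rate.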

2509.26383 2026-05-25 cs.CL cs.AI 版本更新

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

基于强化学习的高效可迁移智能知识图谱检索增强生成

Junhong Lin, Shicheng Liu, Jinyeop Song, Song Wang, Julian Shun, Yada Zhu

发表机构 * MIT CSAIL(麻省理工学院CSAIL) University of Virginia(弗吉尼亚大学) IBM Research(IBM研究院)

AI总结 该研究提出了一种基于强化学习的高效且可迁移的智能体知识图谱检索增强生成框架KG-R1,旨在解决现有KG-RAG系统中固定流程导致的高推理成本和依赖特定图结构的问题。KG-R1通过单智能体与知识图谱环境交互,逐步学习信息检索与推理生成的统一过程,从而在减少生成token数量的同时提升回答准确性。实验表明,KG-R1在多个知识图谱问答基准上表现出优异的效率和跨图谱迁移能力,且无需重新训练即可保持对新知识图谱的准确推理,具有良好的实际应用前景。

详情
AI中文摘要

知识图谱检索增强生成(KG-RAG)将大型语言模型(LLMs)与结构化、可验证的知识图谱(KGs)相结合,以减少幻觉并提供推理轨迹。然而,当前的KG-RAG系统通常依赖于多个LLM模块(如规划、推理和响应)的固定流水线,这增加了推理成本,并将性能与特定图模式绑定。为了解决这个问题,我们引入了KG-R1,一个通过强化学习(RL)优化KG-RAG的智能体框架。与模块化工作流不同,KG-R1使用单个智能体,将KGs作为其环境进行交互,学习在每一步检索信息,并将其融入统一的推理和生成过程中。在知识图谱问答(KGQA)基准测试中,KG-R1展示了高效性和可迁移性——使用Qwen 2.5-3B,KG-R1以比先前使用更大基础或微调模型的多模块工作流方法更少的生成token提高了答案准确性。此外,KG-R1表现出强大的即插即用能力:训练后,无需重新训练即可在未见过的KGs上保持准确性。这些特性使KG-R1成为实际部署中很有前景的KG-RAG框架。我们的代码公开在github.com/junhongmit/KG-R1/。

英文摘要

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucination and provide reasoning traces. However, current KG-RAG systems often rely on fixed pipelines of multiple LLM modules (e.g., planning, reasoning, and responding), which inflate inference costs and tie performance to specific graph schemas. To address this, we introduce KG-R1, an agentic framework that optimizes KG-RAG through reinforcement learning (RL). Unlike modular workflows, KG-R1 uses a single agent that interacts with KGs as its environment, learning to retrieve information at each step and incorporating it into its reasoning and generation in a unified process. Across Knowledge-Graph Question Answering (KGQA) benchmarks, KG-R1 demonstrates both efficiency and transferability: using Qwen 2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use much larger foundation or fine-tuned models. Furthermore, KG-R1 exhibits strong plug-and-play capability: after training, it maintains accuracy on unseen KGs without retraining. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at github.com/junhongmit/KG-R1/.

2509.12958 2026-05-25 cs.AI 版本更新

Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning

忘记敏感信息,记住重要内容:持续学习中基于令牌级差分隐私的记忆塑造

Bihao Zhan, Jie Zhou, Junsong Li, Yutao Yang, Shilian Chen, Qianjun Pan, Xin Li, Wen Wu, Xingjiao Wu, Qin Chen, Hang Yan, Liang He

发表机构 * East China Normal University(华东师范大学) Shanghai AI Laboratory(上海人工智能实验室) Shanghai Qiji Zhifeng Co., Ltd.(上海启智锋科技有限公司)

AI总结 该研究针对持续学习(CL)模型在隐私保护方面的不足,提出了一种隐私增强的持续学习框架(PeCL)。该方法引入了基于语义敏感性的标记级动态差分隐私策略,动态分配隐私预算以保护敏感信息,同时减少对非敏感知识的干扰。此外,研究还设计了一个隐私引导的记忆塑形模块,用于智能遗忘敏感信息并保留任务不变的历史知识,从而在保障隐私的同时提升模型性能。实验表明,PeCL在隐私保护与模型效用之间取得了更优的平衡。

详情
AI中文摘要

持续学习(CL)模型虽然擅长顺序知识获取,但由于积累多样化信息而面临显著且常被忽视的隐私挑战。传统的隐私方法(如统一的差分隐私预算)不加区分地保护所有数据,导致模型效用大幅下降,阻碍了CL在隐私敏感领域的部署。为了克服这一问题,我们提出了一种隐私增强的持续学习(PeCL)框架,该框架忘记敏感信息并记住重要内容。我们的方法首先引入了一种令牌级动态差分隐私策略,该策略根据单个令牌的语义敏感性自适应分配隐私预算。这确保了对私有实体的强健保护,同时最小化对非敏感通用知识的噪声注入。其次,我们集成了一个隐私引导的记忆塑造模块。该模块利用来自动态DP机制的敏感性分析,智能地从模型记忆和参数中忘记敏感信息,同时明确保留对于缓解灾难性遗忘至关重要的任务不变历史知识。大量实验表明,PeCL在隐私保护和模型效用之间实现了优越的平衡,通过保持先前任务的高准确性同时确保强健的隐私,优于基线模型。

英文摘要

Continual Learning (CL) models, while adept at sequential knowledge acquisition, face significant and often overlooked privacy challenges due to accumulating diverse information. Traditional privacy methods, like a uniform Differential Privacy (DP) budget, indiscriminately protect all data, leading to substantial model utility degradation and hindering CL deployment in privacy-sensitive areas. To overcome this, we propose a privacy-enhanced continual learning (PeCL) framework that forgets what's sensitive and remembers what matters. Our approach first introduces a token-level dynamic Differential Privacy strategy that adaptively allocates privacy budgets based on the semantic sensitivity of individual tokens. This ensures robust protection for private entities while minimizing noise injection for non-sensitive, general knowledge. Second, we integrate a privacy-guided memory sculpting module. This module leverages the sensitivity analysis from our dynamic DP mechanism to intelligently forget sensitive information from the model's memory and parameters, while explicitly preserving the task-invariant historical knowledge crucial for mitigating catastrophic forgetting. Extensive experiments show that PeCL achieves a superior balance between privacy preserving and model utility, outperforming baseline models by maintaining high accuracy on previous tasks while ensuring robust privacy.

2509.06858 2026-05-25 physics.soc-ph cs.AI nlin.AO 版本更新

Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models

大型语言模型中意见动态的交互与偏差效应的分离

Vincent C. Brockers, David A. Ehrlich, Viola Priesemann

发表机构 * Max-Planck-Institute for Dynamics and Self-Organization(马克斯·普朗克动力学与自组织研究所) Institute for the Dynamics of Complex Systems(复杂系统动力学研究所) University of Göttingen(哥廷根大学) Campus Institute for Dynamics of Biological Networks(生物网络动力学校园研究所)

AI总结 该研究探讨了大型语言模型在模拟人类意见动态时,真实交互效果如何被系统性偏差所掩盖的问题。研究提出了一种贝叶斯框架,用于分离和量化三种偏差:主题偏差、同意偏差和锚定偏差,并应用于多个模型在不同话题上的多轮对话实验中。结果表明,意见演化趋向于快速收敛,偏差和交互的影响随时间减弱,且不同模型的偏差表现存在差异,研究还揭示了微调对模型意见吸引子的影响,为评估语言模型在人类行为模拟中的潜力与局限提供了量化工具。

详情
AI中文摘要

大型语言模型越来越多地被用于模拟人类意见动态,然而真实交互的影响常常被系统性偏差所掩盖。我们开发了一个贝叶斯框架来分离并量化三种这样的偏差:(i) 针对LLM默认立场的主题偏差;(ii) 倾向于同意提示语句的同意偏差,无论问题如何;(iii) 倾向于初始主体立场的锚定偏差。我们将该框架应用于多个LLM,这些模型在从气候变化、社会正义到音乐偏好的12个不同问题上执行了多步对话。我们发现意见轨迹往往迅速收敛到一个共享吸引子,交互和偏差的影响随时间衰减,且偏差的影响在不同LLM之间有所不同。此外,我们表明,在不同组别的强烈意见陈述(包括错误信息)上微调LLM会相应地改变意见吸引子。通过揭示LLM之间的显著差异,并提供定量工具来比较交互和偏差对LLM主体讨论中意见转变的贡献,我们的方法突出了使用LLM作为人类行为代理的潜力和陷阱。

英文摘要

Large Language Models are increasingly used to simulate human opinion dynamics, yet the effect of genuine interaction is often obscured by systematic biases. We develop a Bayesian framework to disentangle and quantify three such biases: (i) a topic bias toward the LLM's default stance; (ii) an agreement bias favoring agreement with the prompted statement irrespective of the question; and (iii) an anchoring bias toward the initiating agent's stance. We apply this framework to various LLMs that performed multi-step dialogues on 12 different questions ranging from climate change and societal justice to music preferences. We find that opinion trajectories tend to quickly converge to a shared attractor, with the influence of both interaction and biases decaying over time, and with the impact of biases differing between LLMs. In addition, we show that fine-tuning an LLM on different sets of strongly opinionated statements (including misinformation) shifts the opinion attractor correspondingly. By exposing stark differences between LLMs and providing quantitative tools for comparing interaction and bias contributions to opinion shifts in LLM agent discussions, our approach highlights both promises and pitfalls of using LLMs as proxies for human behavior.

2508.18958 2026-05-25 cs.CV cs.AI 版本更新

A drone-based framework for coral habitat mapping via weakly supervised segmentation

基于弱监督分割的无人机珊瑚栖息地制图框架

Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde, Sylvain Bonhommeau, Alexis Joly

发表机构 * IFREMER Délégation Océan Indien (DOI)(IFREMER印度洋代表处) INRIA, LIRMM, Université de Montpellier, CNRS(INRIA、LIRMM、蒙彼利埃大学、法国国家科学研究中心) UMR Marbec, IRD, Université de Montpellier, CNRS, Ifremer(Marbec联合研究单位、IRD、蒙彼利埃大学、法国国家科学研究中心、IFREMER) CNRS, LIRMM, Université de Montpellier(法国国家科学研究中心、LIRMM、蒙彼利埃大学)

AI总结 本文提出了一种基于无人机的弱监督分割框架,用于珊瑚生境的映射。该方法通过结合水下图像的细粒度多标签预测和广覆盖的航拍数据,无需像素级标注即可训练高分辨率分割模型。研究在珊瑚礁图像上验证了该方法,实现了大面积珊瑚形态的分割,取得了86.07%的像素准确率和52.23%的平均交并比,展示了其在生态监测中的高效性和适用性。

Comments Extended journal version of "The Point is the Mask: Scaling coral reef segmentation with weak supervision"

详情
AI中文摘要

在大空间范围内获取像素级标注仍然是机器学习在生态应用中部署的主要瓶颈。本文提出了一种多尺度弱监督语义分割(WSSS)框架,能够利用密集的、基于分类的输出训练高分辨率分割模型。我们的方法将来自水下图像的细粒度多标签预测与广覆盖的航空数据相结合。将这些点级分类转换为粗监督掩码,用于训练无人机(UAV)正射影像上的语义分割模型。然后使用模型自身的细化预测进行第二步训练,以进一步提高空间精度,无需额外标注。我们在珊瑚礁图像上展示了该方法,实现了珊瑚形态类型的大面积分割,并展示了其整合新类别的灵活性。最终模型在人工标注的礁区上达到86.07%的像素精度和52.23%的平均交并比(mIoU),表明无需像素级标注即可获得准确的大规模珊瑚分割。通过跨尺度和跨模态连接图像分类与分割,该方法为标注不可用场景下部署分割模型提供了高效解决方案,并为生态学及其他领域的可扩展、高效监测开辟了机会。

英文摘要

Obtaining pixel-level annotations over large spatial extents remains a major bottleneck for deploying machine learning in ecological applications. Here we present a multi-scale weakly supervised semantic segmentation (WSSS) framework that enables training high-resolution segmentation models from dense, classification-based outputs. Our method combines fine-scale, multi-label predictions from underwater imagery with broad-coverage aerial data. We convert these point-level classifications into coarse supervision masks that can be used to train a semantic segmentation model on Unmanned Aerial Vehicle (UAV) orthophotos. A second training step using the model's own refined predictions is then used to further improve spatial accuracy without requiring additional annotations. We demonstrate the approach on coral reef imagery, enabling large-area segmentation of coral morphotypes and illustrating its flexibility in integrating new classes. The final model achieves 86.07% pixel accuracy and 52.23% mean Intersection over Union (mIoU) on manually annotated reef zones, demonstrating that accurate large-scale coral segmentation can be obtained without pixel-level annotations. By bridging image classification and segmentation across scales and modalities, this method provides an efficient solution for deploying segmentation models in settings where annotations are unavailable and opens opportunities for scalable, efficient monitoring in ecology and beyond.
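
The two reported metrics, pixel accuracy and mean Intersection over Union, can be computed from a confusion matrix as in the sketch below. This is a generic NumPy illustration with hypothetical function names, not the authors' evaluation code; in particular, skipping classes that are absent from both label maps is an assumed convention.

```python
import numpy as np


def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows index the true class, columns the predicted class."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def pixel_accuracy(cm: np.ndarray) -> float:
    """Share of pixels whose predicted class equals the true class."""
    return float(np.trace(cm) / cm.sum())


def mean_iou(cm: np.ndarray) -> float:
    """Average per-class IoU = TP / (TP + FP + FN)."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0  # ignore classes absent from both maps (assumption)
    return float((tp[valid] / union[valid]).mean())
```

Pixel accuracy is dominated by frequent classes, while mIoU weights every class equally, which is one reason the two numbers can differ as much as they do in the abstract.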

2508.14311 2026-05-25 cs.LG cs.AI 版本更新

Online Learning with Multiple Fairness Regularizers via Graph-Structured Feedback

通过图结构反馈进行多重公平正则化器的在线学习

Quan Zhou, Jakub Marecek, Robert Shorten

发表机构 * Department of Mathematics, National University of Singapore(新加坡国立大学数学系) Department of Computer Science, Czech Technical University(捷克技术大学计算机科学系) Dyson School of Design Engineering, Imperial College London(伦敦帝国理工学院设计工程戴森学院) Imperial College London(伦敦帝国理工学院)

AI总结 本文研究了在自动决策系统中如何同时满足多个可能相互冲突的公平性要求的问题。作者提出了一种基于图结构反馈的强化学习方法,能够在序贯交互过程中自适应地学习不同公平性目标的权重。该方法为动态环境中实现多目标公平性优化提供了新的解决方案。

Comments Published in Transactions on Machine Learning Research (TMLR), 2026. OpenReview: https://openreview.net/forum?id=y8iWuDZtEw

Journal ref Transactions on Machine Learning Research (TMLR), 2026

详情
AI中文摘要

在自动化决策系统中,越来越需要强制执行多个通常相互竞争的公平性度量。这些公平性目标的适当权重通常是先验未知的,可能随时间变化,并且在我们的设置中,必须通过顺序交互自适应地学习。在这项工作中,我们在赌博机设置中解决了这一挑战,其中决策具有图结构反馈。

英文摘要

There is an increasing need to enforce multiple, often competing, measures of fairness within automated decision systems. The appropriate weighting of these fairness objectives is typically unknown a priori, may change over time and, in our setting, must be learned adaptively through sequential interactions. In this work, we address this challenge in a bandit setting, where decisions are made with graph-structured feedback.

2508.14083 2026-05-25 cs.LG cs.AI 版本更新

GeoMAE: Masking Representation Learning for Spatio-Temporal Graph Forecasting with Missing Values

GeoMAE:面向缺失值的时空图预测的掩码表示学习

Songyu Ke, Chenyu Wu, Yuxuan Liang, Huiling Qin, Junbo Zhang, Yu Zheng

发表机构 * College of Computer and Data Science, Fuzhou University(福州大学计算机与数据科学学院) JD Intelligent Cities Research(京东智能城市研究院) School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Beijing Normal University(北京师范大学)

AI总结 GeoMAE 是一种用于时空图预测的自监督表示学习模型,旨在解决城市智能系统中因环境和设备问题导致的数据缺失问题。该方法通过引入基于注意力机制的时空预测网络和辅助学习任务,有效捕捉了传感器网络中的动态空间关联,并提升了模型对缺失数据的鲁棒性。实验表明,GeoMAE 在多个真实数据集上显著优于现有方法,相对提升了最高达13.20%的预测性能。

Comments 34 pages for pre-print version. This work has been published in *Neural Networks*. Please check the latest version via the following DOI

详情
AI中文摘要

城市智能系统中缺失数据的普遍存在,归因于不利的环境条件和设备故障,对下游应用(尤其是交通预测和能耗预测)的有效性构成了重大挑战。因此,开发一种能够从不完整数据集中提取有意义信息的稳健时空学习方法至关重要。尽管存在针对缺失值时空图预测的方法,但未解决的问题依然存在。首先,现有研究大多基于时间序列分析,从而忽略了传感器网络中固有的动态空间相关性。其次,缺失数据模式的复杂性加剧了问题的复杂性。此外,维护条件的差异导致缺失值比率和模式显著波动,从而挑战了预测模型的泛化能力。针对这些挑战,本研究引入了GeoMAE,一种自监督的时空表示学习模型。该模型由三个主要组件组成:输入预处理模块、基于注意力的时空预测网络(STAFN)和一个辅助学习任务,该任务受掩码自编码器启发,以增强时空表示学习的鲁棒性。在真实数据集上的实证评估表明,GeoMAE显著优于现有基准,相对于最佳基线模型实现了高达13.20%的相对改进。

英文摘要

The ubiquity of missing data in urban intelligence systems, attributable to adverse environmental conditions and equipment failures, poses a significant challenge to the efficacy of downstream applications, notably in the realms of traffic forecasting and energy consumption prediction. Therefore, it is imperative to develop a robust spatio-temporal learning methodology capable of extracting meaningful insights from incomplete datasets. Despite the existence of methodologies for spatio-temporal graph forecasting in the presence of missing values, unresolved issues persist. Primarily, the majority of extant research is predicated on time-series analysis, thereby neglecting the dynamic spatial correlations inherent in sensor networks. Additionally, the complexity of missing data patterns compounds the intricacy of the problem. Furthermore, the variability in maintenance conditions results in a significant fluctuation in the ratio and pattern of missing values, thereby challenging the generalizability of predictive models. In response to these challenges, this study introduces GeoMAE, a self-supervised spatio-temporal representation learning model. The model comprises three principal components: an input preprocessing module, an attention-based spatio-temporal forecasting network (STAFN), and an auxiliary learning task, which draws inspiration from Masking AutoEncoders to enhance the robustness of spatio-temporal representation learning. Empirical evaluations on real-world datasets demonstrate that GeoMAE significantly outperforms existing benchmarks, achieving up to 13.20% relative improvement over the best baseline models.

2507.06252 2026-05-25 cs.CR cs.AI cs.LG 版本更新

False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems

虚假警报,真实损害:基于LLM的模型对文本网络威胁情报系统的对抗攻击

Samaneh Shafee, Alysson Bessani, Pedro M. Ferreira

发表机构 * Faculty of Sciences, University of Lisbon(里斯本大学科学学院) CIENCES, University of Lisbon(里斯本大学CIENCES)

AI总结 本文研究了基于大语言模型(LLM)的对抗攻击对基于文本的网络威胁情报(CTI)系统的影响。研究分析了三种攻击类型,包括规避、泛滥和投毒攻击,揭示了CTI系统在处理来自开放来源的文本数据时存在的脆弱性。特别指出,通过生成虚假文本,攻击者可以误导分类器,降低系统性能并破坏其功能,其中规避攻击在CTI流程中尤为关键,为后续攻击提供了前提条件。

Journal ref Future Generation Computer Systems, 2026

详情
AI中文摘要

网络威胁情报(CTI)已成为一种重要的补充方法,在网络威胁生命周期的早期阶段运作。CTI涉及收集、处理和分析威胁数据,以提供更准确和快速的网络威胁理解。由于数据量大,通过机器学习(ML)和自然语言处理(NLP)模型进行自动化对于有效的CTI提取至关重要。这些自动化系统利用来自社交网络、论坛和博客等来源的开源情报(OSINT)来识别威胁指标(IoCs)。尽管先前的研究集中在针对特定ML模型的对抗攻击上,但本研究通过调查整个CTI管道中各个组件的脆弱性及其对对抗攻击的敏感性,扩展了研究范围。这些脆弱性源于它们从各种开放来源(包括真实和潜在虚假内容)接收文本输入。我们分析了针对CTI管道的三种攻击类型,包括逃避、淹没和投毒,并评估了它们对系统信息选择能力的影响。具体而言,在虚假文本生成方面,该工作展示了对抗文本生成技术如何创建虚假的网络安全和类似网络安全的文本,从而误导分类器、降低性能并破坏系统功能。重点主要放在逃避攻击上,因为它先于并使得CTI管道中的淹没和投毒攻击成为可能。

英文摘要

Cyber Threat Intelligence (CTI) has emerged as a vital complementary approach that operates in the early phases of the cyber threat lifecycle. CTI involves collecting, processing, and analyzing threat data to provide a more accurate and rapid understanding of cyber threats. Due to the large volume of data, automation through Machine Learning (ML) and Natural Language Processing (NLP) models is essential for effective CTI extraction. These automated systems leverage Open Source Intelligence (OSINT) from sources like social networks, forums, and blogs to identify Indicators of Compromise (IoCs). Although prior research has focused on adversarial attacks on specific ML models, this study expands the scope by investigating vulnerabilities within various components of the entire CTI pipeline and their susceptibility to adversarial attacks. These vulnerabilities arise because they ingest textual inputs from various open sources, including real and potentially fake content. We analyse three types of attacks against CTI pipelines, including evasion, flooding, and poisoning, and assess their impact on the system's information selection capabilities. Specifically, on fake text generation, the work demonstrates how adversarial text generation techniques can create fake cybersecurity and cybersecurity-like text that misleads classifiers, degrades performance, and disrupts system functionality. The focus is primarily on the evasion attack, as it precedes and enables flooding and poisoning attacks within the CTI pipeline.

2507.05311 2026-05-25 cs.IR cs.AI 版本更新

PLACE: Prompt Learning for Attributed Community Search in Large Graphs

PLACE:面向大规模图属性社区搜索的提示学习

Shuheng Fang, Kangfei Zhao, Rener Zhang, Yu Rong, Jeffrey Xu Yu

发表机构 * Shenzhen Institute of Computing Sciences(深圳计算科学研究院) Beijing Institute of Technology(北京理工大学) Chinese University of Hong Kong(香港中文大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出PLACE,一种用于属性社区搜索的图提示学习框架。该方法受到自然语言处理中提示调优的启发,通过在图中插入可学习的提示标记,构建提示增强图结构,以增强与查询相关的节点间连接,帮助图神经网络更有效地识别结构连贯性和属性相似性。实验表明,PLACE在多个真实图数据集上显著优于现有方法,平均F1分数提升22%。

Comments 14 pages, 9 figures

Journal ref KDD 2026

详情
AI中文摘要

在本文中,我们提出了PLACE(面向属性社区搜索的提示学习),一种创新的图提示学习框架用于ACS。受自然语言处理(NLP)中提示调优的启发,其中可学习的提示令牌被插入以语境化NLP查询,PLACE将结构化和可学习的提示令牌集成到图中作为查询相关的细化机制,形成提示增强图。在这种提示增强图结构中,学习到的提示令牌充当桥梁,加强图中节点与查询之间的连接,使GNN能够更有效地识别与特定查询相关的结构凝聚性和属性相似性模式。我们采用交替训练范式来联合优化提示参数和GNN。此外,我们设计了一种分治策略以增强可扩展性,支持模型处理百万级图。在9个真实图上的大量实验证明了PLACE对三种类型ACS查询的有效性,其中PLACE的平均F1分数比现有最先进方法高出22%。

英文摘要

In this paper, we propose PLACE (Prompt Learning for Attributed Community Search), an innovative graph prompt learning framework for attributed community search (ACS). Inspired by prompt-tuning in Natural Language Processing (NLP), where learnable prompt tokens are inserted to contextualize NLP queries, PLACE integrates structural and learnable prompt tokens into the graph as a query-dependent refinement mechanism, forming a prompt-augmented graph. Within this prompt-augmented graph structure, the learned prompt tokens serve as a bridge that strengthens connections between graph nodes for the query, enabling the GNN to more effectively identify patterns of structural cohesiveness and attribute similarity related to the specific query. We employ an alternating training paradigm to optimize both the prompt parameters and the GNN jointly. Moreover, we design a divide-and-conquer strategy to enhance scalability, enabling the model to handle million-scale graphs. Extensive experiments on 9 real-world graphs demonstrate the effectiveness of PLACE for three types of ACS queries, where PLACE improves F1 scores by 22% on average over state-of-the-art methods.

2506.04390 2026-05-25 cs.CR cs.AI 版本更新

Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG

通过隐秘视角:检索增强生成中针对投毒攻击的注意力感知防御

Sarthak Choudhary, Nils Palumbo, Ashish Hooda, Krishnamurthy Dj Dvijotham, Somesh Jha

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) ServiceNow Research(ServiceNow研究)

AI总结 本文研究了检索增强生成(RAG)系统中针对数据投毒攻击的隐蔽性防御方法。作者提出了一种基于注意力机制的防御策略,通过分析语言模型的注意力权重,引入归一化段落注意力得分(NPAS)和注意力方差过滤器(AV Filter),以检测并过滤被污染的检索内容。实验表明,该方法显著提升了系统的鲁棒性,并揭示了实现真正隐蔽投毒攻击的难度。

Comments Accepted at ICML 2026

详情
AI中文摘要

检索增强生成(RAG)系统容易受到攻击,即使污染率很低,攻击者也能将有毒段落注入检索到的上下文中。我们表明现有攻击并非设计为隐秘的,因此可以实现可靠的检测和缓解。我们形式化了一个基于可区分性的安全游戏来量化此类攻击的隐秘性。如果少数有毒段落控制了响应,它们必须比良性段落更偏向推理过程,这本质上损害了隐秘性。这促使我们分析LLM的中间信号(如注意力权重)来近似不同段落对响应的影响。利用注意力权重,我们引入了$\textbf{归一化段落注意力分数}$(NPAS)和轻量级的$\textbf{注意力方差滤波器}$(AV Filter),用于标记异常段落。我们的方法提高了鲁棒性,相比基线防御,准确率最高提升约$\textbf{20%}$。我们还开发了自适应攻击,试图隐藏此类异常,成功率最高为$\textbf{35%}$,这凸显了在RAG系统中实现真正隐秘投毒的挑战。

英文摘要

Retrieval-augmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved context, even at low corruption rates. We show that existing attacks are not designed to be stealthy, allowing reliable detection and mitigation. We formalize a distinguishability-based security game to quantify stealth for such attacks. If a few poisoned passages control the response, they must bias the inference process more than the benign ones, inherently compromising stealth. This motivates analyzing intermediate signals of LLMs, such as attention weights, to approximate the influence of different passages on the response. Leveraging attention weights, we introduce the $\textbf{Normalized Passage Attention Score}$ (NPAS) and a lightweight $\textbf{Attention-Variance Filter}$ (AV Filter) that flags anomalous passages. Our method improves robustness, yielding up to $\sim$ $\textbf{20%}$ higher accuracy than baseline defenses. We also develop adaptive attacks that attempt to conceal such anomalies, achieving up to $\textbf{35%}$ success rate and underscoring the challenges of achieving true stealth in poisoning RAG systems.
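
The abstract names a Normalized Passage Attention Score and an Attention-Variance Filter but does not define them. The sketch below is one plausible reading, not the paper's method: attention mass from the response is summed per retrieved passage and normalized to sum to one, and a passage is flagged when its score exceeds the mean by `k` standard deviations. The function names, the threshold rule, and the parameter `k` are all assumptions made for illustration.

```python
import numpy as np


def passage_attention_scores(token_attention: np.ndarray,
                             passage_ids: np.ndarray,
                             num_passages: int) -> np.ndarray:
    """Sum attention per passage, then normalize so scores sum to 1."""
    mass = np.bincount(passage_ids, weights=token_attention, minlength=num_passages)
    total = mass.sum()
    if total <= 0:
        raise ValueError("attention mass must be positive")
    return mass / total


def flag_anomalous(scores: np.ndarray, k: float = 1.0) -> np.ndarray:
    """Flag passages whose score exceeds mean + k * std (hypothetical rule)."""
    return scores > scores.mean() + k * scores.std()
```

The intuition matches the abstract's argument: a poisoned passage that controls the response has to draw disproportionate attention, so its normalized score stands out from the benign passages.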

2505.21573 2026-05-25 cs.LG cs.AI 版本更新

Spectral-inspired Operator Learning with Limited Data and Unknown Physics

少数据与未知物理下的谱启发算子学习

Han Wan, Rui Zhang, Hao Sun

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 本文研究了在数据有限且物理机制未知的情况下学习偏微分方程(PDE)动力学的挑战。为此,提出了一种名为SINO的频谱启发神经算子,它仅需2到5条轨迹即可建模复杂系统,无需显式依赖PDE方程。SINO通过频率索引自动捕捉局部和全局空间导数,结合乘法操作块和低通滤波器处理非线性效应和混叠问题,在多个二维和三维PDE基准测试中表现出优异性能,尤其在少量数据和分布外场景下显著优于现有方法。

Comments To appear in KDD 2026

详情
AI中文摘要

从有限数据和未知物理中学习PDE动力学具有挑战性。现有的神经PDE求解器要么需要大型数据集,要么依赖已知物理(如PDE残差或手工模板),导致适用性有限。为解决这些问题,我们提出谱启发神经算子(SINO),它仅需2-5条轨迹即可建模复杂系统,无需显式PDE项。具体而言,SINO从频率索引自动捕获局部和全局空间导数,从而在物理无关机制下实现底层微分算子的紧凑表示。为建模非线性效应,它采用Pi块对谱特征进行乘法运算,并辅以低通滤波器抑制混叠。在2D和3D PDE基准上的大量实验表明,SINO实现了最先进的性能,精度提升1-2个数量级。特别地,仅用5条训练轨迹,SINO就优于在1000条轨迹上训练的数据驱动方法,并在其他方法失败的高难度分布外案例中保持预测能力。

英文摘要

Learning PDE dynamics from limited data with unknown physics is challenging. Existing neural PDE solvers either require large datasets or rely on known physics (e.g., PDE residuals or handcrafted stencils), leading to limited applicability. To address these challenges, we propose Spectral-Inspired Neural Operator (SINO), which can model complex systems from just 2-5 trajectories, without requiring explicit PDE terms. Specifically, SINO automatically captures both local and global spatial derivatives from frequency indices, enabling a compact representation of the underlying differential operators in physics-agnostic regimes. To model nonlinear effects, it employs a Pi-block that performs multiplicative operations on spectral features, complemented by a low-pass filter to suppress aliasing. Extensive experiments on both 2D and 3D PDE benchmarks demonstrate that SINO achieves state-of-the-art performance, with improvements of 1-2 orders of magnitude in accuracy. Particularly, with only 5 training trajectories, SINO outperforms data-driven methods trained on 1000 trajectories and remains predictive on challenging out-of-distribution cases where other methods fail.
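
The idea of obtaining spatial derivatives from frequency indices rests on a classical identity: on a periodic domain, differentiation becomes multiplication by $ik$ in Fourier space. The NumPy sketch below illustrates only that identity; it is not SINO itself, which learns its operators, and the function name is hypothetical.

```python
import numpy as np


def spectral_derivative(u: np.ndarray, length: float, order: int = 1) -> np.ndarray:
    """Differentiate a periodic signal sampled on a uniform grid.

    u      : samples along the last axis, covering [0, length) without the endpoint
    length : domain length
    order  : derivative order
    """
    n = u.shape[-1]
    # Angular wavenumbers built from the integer frequency indices.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    u_hat = np.fft.fft(u, axis=-1)
    # Multiplying by (i k)^order in Fourier space differentiates in real space.
    return np.real(np.fft.ifft((1j * k) ** order * u_hat, axis=-1))
```

Because every Fourier mode spans the whole domain, one multiplication captures global structure, while high wavenumbers encode local variation. Products of such features create frequencies beyond the grid's resolution, which is why the abstract pairs the multiplicative Pi-block with a low-pass filter against aliasing.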

2504.09846 2026-05-25 cs.LG cs.AI cs.HC 版本更新

GlyTwin: Digital Twin for Glucose Control in Type 1 Diabetes Through Optimal Behavioral Modifications Using Patient-Centric Counterfactuals

GlyTwin: 通过以患者为中心的反事实实现1型糖尿病血糖控制的最佳行为修改的数字孪生

Asiful Arefeen, Saman Khamesian, Maria Adela Grando, Bithika Thompson, Hassan Ghasemzadeh

发表机构 * College of Health Solutions, Arizona State University(亚利桑那州立大学健康解决方案学院) School of Computing and Augmented Intelligence, Arizona State University(亚利桑那州立大学计算与增强智能学院) Department of Endocrinology, Mayo Clinic Arizona(梅奥诊所亚利桑那分部内分泌科)

AI总结 该研究提出了一种名为GlyTwin的数字孪生框架,用于通过行为优化改善1型糖尿病患者的血糖控制。其核心方法是结合反事实解释,模拟最优行为干预方案,如调整碳水化合物摄入和胰岛素剂量,以减少高血糖事件的发生。研究还引入了利益相关者的偏好,使干预方案更具个性化和实用性。实验结果表明,GlyTwin在生成有效反事实解释和预防高血糖方面优于现有方法,具有较高的实用价值。

详情
AI中文摘要

频繁和长期暴露于高血糖会增加慢性并发症的风险,包括神经病变、肾病和心血管疾病。现有的连续皮下胰岛素输注(CSII)和连续血糖监测(CGM)技术仅模拟血糖调节的特定方面,例如预测低血糖和给予小剂量胰岛素推注。同样,当前糖尿病管理中的数字孪生方法主要侧重于预测血糖对人类行为和胰岛素治疗的反应。因此,这些技术缺乏提供替代治疗方案的能力,而这些方案可以指导主动行为干预以实现最佳糖尿病管理。为填补这一空白,我们提出GlyTwin,一种新颖的计算框架,通过整合反事实解释来增强数字孪生技术,以模拟血糖控制的最佳行为治疗。GlyTwin通过推荐行为选择(如碳水化合物摄入和胰岛素剂量)的调整来生成反事实治疗,以显著减少高血糖事件的发生和持续时间。此外,GlyTwin将利益相关者的偏好纳入其干预生成过程,确保工具个性化和以用户为中心。我们在AZT1D上评估GlyTwin,该数据集是通过收集50名使用自动胰岛素输送(AID)系统的1型糖尿病(T1D)患者的纵向数据构建的,每人监测26天。结果表明,GlyTwin在生成反事实解释方面优于现有最先进方法,有效解释率为85.8%,且与历史数据相比,预防高血糖的有效性为87.3%。

英文摘要

Frequent and long-term exposure to hyperglycemia increases the risk of chronic complications, including neuropathy, nephropathy, and cardiovascular disease. Existing continuous subcutaneous insulin infusion (CSII) and continuous glucose monitoring (CGM) technologies model only specific aspects of glycemic regulation, such as predicting hypoglycemia and administering small insulin boluses. Similarly, current digital twin approaches in diabetes management primarily focus on predicting glucose responses to human behavior and insulin therapy. As a result, these technologies lack the ability to provide alternative treatment scenarios that could guide proactive behavioral interventions for optimal diabetes management. To address this gap, we propose GlyTwin, a novel computational framework that enhances digital twin technologies by integrating counterfactual explanations to simulate optimal behavioral treatments for glucose control. GlyTwin generates counterfactual treatments by recommending adjustments to behavioral choices, such as carbohydrate intake and insulin dosing, to significantly reduce the occurrence and duration of hyperglycemic events. In addition, GlyTwin incorporates stakeholder preferences into its intervention-generation process, ensuring that the tool is personalized and user-centric. We evaluate GlyTwin on AZT1D, a new dataset constructed by collecting longitudinal data from 50 individuals living with type 1 diabetes (T1D) on automated insulin delivery (AID) systems, each monitored for 26 days. Results show that GlyTwin outperforms state-of-the-art methods for generating counterfactual explanations, with 85.8% valid explanations and 87.3% effectiveness in preventing hyperglycemia compared with historical data.

2502.17119 2026-05-25 cs.LG cs.AI 版本更新

Diffusion and Flow Matching Models for Tabular Data: A Survey

表格数据的扩散与流匹配模型:综述

Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen

发表机构 * Great Bay University(大湾大学) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) LIACS, Leiden University(莱顿大学LIACS)

AI总结 本文综述了扩散模型和流匹配模型在表格数据生成中的应用,探讨了这些模型在处理数值与类别混合、缺失值、敏感字段及复杂依赖关系等挑战时的优势与方法。文章系统梳理了从2015年至2026年的相关研究,围绕数据工程难题、任务目标、设计选择及评估维度进行组织,并指出了在可扩展性、特征依赖建模、隐私保护、公平性及约束感知生成等方面的开放问题。

Comments We substantially updated the previous version "Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions" by including flow matching models for tabular data

详情
AI中文摘要

深度生成模型在图像、文本、音频和视频生成方面取得了快速进展,并越来越多地应用于结构化记录。然而,对于表格数据,生成建模仍然困难:数据集可能包含数值和分类属性、缺失值、敏感字段、不平衡类别、复杂的特征依赖和领域约束。早期基于GAN或VAE的表格数据建模方法取得了有用结果,但可能面临训练不稳定、模式崩溃、多模态分布建模能力弱以及混合类型特征处理脆弱等问题。因此,扩散模型因其噪声-去噪公式提供了灵活稳定的方式来建模复杂数据分布而受到越来越多的关注,并已被应用于表格合成、缺失值填补、可信数据生成和异常检测。流匹配通过学习沿概率路径的传输向量场提供了一条密切相关的途径,通常对路径设计和采样效率有更直接的控制。尽管取得了进展,但针对表格数据的扩散和流匹配模型文献仍然难以比较,因为方法针对不同任务,依赖于不同的表示、目标、评估协议和领域假设。据我们所知,这是第一篇专门针对表格数据的扩散和流匹配模型的综述。我们回顾了2015年6月至2026年5月的工作,围绕数据工程挑战、任务、设计选择和评估维度进行组织,并讨论了可扩展性、特征依赖建模、隐私、公平性、基准测试和约束感知生成中的开放问题。我们在GitHub仓库中保持更新。

英文摘要

Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structured records. For tabular data, however, generative modeling remains difficult: a dataset may contain numerical and categorical attributes, missing values, sensitive fields, imbalanced categories, complex feature dependencies, and domain constraints. Earlier tabular data modeling methods based on GANs or VAEs have achieved useful results, but they can suffer from unstable training, mode collapse, weak modeling of multimodal distributions, and fragile handling of mixed-type features. Diffusion models have therefore attracted growing interest because their noising-and-denoising formulation provides a flexible and stable way to model complex data distributions, and has been adapted to tabular synthesis, missing-value imputation, trustworthy data generation, and anomaly detection. Flow matching offers a closely related route by learning transport vector fields along probability paths, often with more direct control over path design and sampling efficiency. Despite this progress, the literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions. To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. We maintain updates in a GitHub repository.

2502.04415 2026-05-25 cs.CV cs.AI 版本更新

TerraQ: Spatiotemporal Question-Answering on Satellite Image Archives

TerraQ:卫星图像档案的时空问答

Sergios-Anestis Kefalidis, Konstantinos Plas, Manolis Koubarakis

发表机构 * Dept. of Informatics and Telecommunications(信息与电信系) National and Kapodistrian University of Athens(雅典国立卡波季斯特里亚大学) Archimedes/Athena RC(阿基米德/雅典娜研究中心)

AI总结 TerraQ 是一个用于卫星图像档案的时空问答系统,能够根据自然语言查询快速检索符合条件的卫星图像。该系统结合了自然语言处理与空间知识库,支持基于图像元数据和地理实体的复杂查询。其核心贡献在于提升了地球观测数据的可访问性与智能化检索能力。

详情
AI中文摘要

TerraQ是一个用于卫星图像档案的时空问答引擎。它是一个自然语言处理系统,旨在处理满足特定条件的卫星图像请求。这些请求可以引用图像元数据和来自专门知识库(例如,艾米利亚-罗马涅大区)的实体。通过它,用户可以提出诸如“给我一百张法国港口附近河流的图像,雪覆盖率低于20%,云覆盖率高于10%”之类的请求,从而使地球观测数据更易于访问,符合当前数字助手的趋势。

英文摘要

TerraQ is a spatiotemporal question-answering engine for satellite image archives. It is a natural language processing system built to process requests for satellite images that satisfy certain criteria. The requests can refer to image metadata and to entities from a specialized knowledge base (e.g., the Emilia-Romagna region). With it, users can make requests like "Give me a hundred images of rivers near ports in France, with less than 20% snow coverage and more than 10% cloud coverage", thus making Earth Observation data more easily accessible, in line with the current landscape of digital assistants.

2502.04230 2026-05-25 cs.SD cs.AI cs.CR cs.LG eess.AS 版本更新

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

XAttnMark:基于交叉注意力的鲁棒音频水印学习

Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli

发表机构 * Department of Computer Science, Lehigh University, Bethlehem, PA, USA(理海大学计算机科学系) Dolby Laboratories Inc., San Francisco, CA, USA(杜比实验室公司)

AI总结 随着生成式音频合成和编辑技术的快速发展,版权保护、数据溯源和深度伪造音频传播等问题日益突出。本文提出了一种基于交叉注意力机制的鲁棒音频水印方法XAttnMark,通过生成器与检测器之间的部分参数共享、高效的交叉注意力消息检索机制以及时间条件模块,实现了水印检测与归属的联合优化。此外,该方法引入了与心理声学对齐的时频掩码损失,提升了水印的不可感知性,实验表明其在多种音频变换下均表现出优越的鲁棒性,为生成式AI时代的音频版权保护提供了有效解决方案。

Comments Accepted at ICML'25

详情
AI中文摘要

生成式音频合成与编辑技术的快速普及引发了关于版权侵权、数据溯源以及通过深度伪造音频传播虚假信息的严重担忧。水印技术通过将不可感知但可识别和可追踪的信号嵌入音频内容,提供了一种主动解决方案。尽管最近基于神经网络的水印方法(如WavMark和AudioSeal)在鲁棒性和质量上有所改进,但它们难以同时优化鲁棒检测和准确归因。本文介绍了交叉注意力鲁棒音频水印(XATTNMARK),通过利用生成器和检测器之间的部分参数共享、用于高效消息检索的交叉注意力机制以及用于改善消息分布的时间条件模块,弥合了这一差距。此外,我们提出了一种心理声学对齐的时频(TF)掩蔽损失,捕捉细粒度的听觉掩蔽效应,提高了水印的不可感知性。XATTNMARK在检测和归因方面均达到了最先进的性能,展示了针对各种音频变换(包括不同强度的具有挑战性的生成式编辑)的卓越鲁棒性。这项工作推进了音频水印技术,用于在生成式AI时代保护知识产权并确保真实性。

英文摘要

The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize both robust detection and accurate attribution. This paper introduces Cross-Attention Robust Audio Watermark (XATTNMARK), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned time-frequency (TF) masking loss that captures fine-grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.

2411.12173 2026-05-25 cs.LG cs.AI 版本更新

SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks

SkillTree: 面向长时域控制任务的可解释基于技能的深度强化学习

Yongyan Wen, Siyuan Li, Rongchang Zuo, Lei Yuan, Hangyu Mao, Peng Liu

发表机构 * Faculty of Computing, Harbin Institute of Technology(哈尔滨工业大学计算学部) National Key Laboratory of Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Polixir Technologies SenseTime Research(商汤科技研究院)

AI总结 本文提出了一种名为SkillTree的可解释技能型深度强化学习框架,用于解决长期控制任务中的复杂连续动作空间问题。该方法通过将连续动作空间离散化为技能空间,并在高层策略中引入可微决策树生成技能嵌入,从而指导底层策略执行具体技能,实现了技能层面的可解释性。实验表明,SkillTree在复杂机械臂控制任务中性能与基于神经网络的技能方法相当,同时提升了决策过程的透明度。

详情
AI中文摘要

深度强化学习(DRL)在各个研究领域取得了显著成功。然而,其对神经网络的依赖导致缺乏透明度,限制了实际应用。为了实现可解释性,决策树已成为神经网络的一种流行且有前景的替代方案。然而,由于其表达能力有限,传统决策树难以处理高维长时域连续控制任务。在本文中,我们提出了SkillTree,一种新颖的框架,将复杂的连续动作空间缩减为离散的技能空间。我们的层次化方法在高层次策略中集成了可微决策树以生成技能嵌入,进而指导低层次策略执行技能。通过使技能决策可解释,我们实现了技能级可解释性,增强了对复杂任务中决策过程的理解。实验结果表明,我们的方法在复杂机器人臂控制领域中达到了与基于技能的神经网络相当的性能。此外,SkillTree在技能级别提供解释,从而提高了决策过程的透明度。

英文摘要

Deep reinforcement learning (DRL) has achieved remarkable success in various research domains. However, its reliance on neural networks results in a lack of transparency, which limits its practical applications. To achieve explainability, decision trees have emerged as a popular and promising alternative to neural networks. Nonetheless, due to their limited expressiveness, traditional decision trees struggle with high-dimensional long-horizon continuous control tasks. In this paper, we propose SkillTree, a novel framework that reduces complex continuous action spaces into discrete skill spaces. Our hierarchical approach integrates a differentiable decision tree within the high-level policy to generate skill embeddings, which subsequently guide the low-level policy in executing skills. By making skill decisions explainable, we achieve skill-level explainability, enhancing the understanding of the decision-making process in complex tasks. Experimental results demonstrate that our method achieves performance comparable to skill-based neural networks in complex robotic arm control domains. Furthermore, SkillTree offers explanations at the skill level, thereby increasing the transparency of the decision-making process.

2402.17888 2026-05-25 cs.LG cs.AI 版本更新

ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection

ConjNorm: 面向分布外检测的可处理密度估计

Bo Peng, Yadan Luo, Yonggang Zhang, Yixuan Li, Zhen Fang

发表机构 * University of Technology Sydney(悉尼科技大学) The University of Queensland(昆士兰大学) Hong Kong Baptist University(香港浸会大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 本文提出了一种名为ConjNorm的新型密度估计方法,用于提升分布外检测(OOD detection)的性能。该方法基于Bregman散度构建理论框架,将分布考虑扩展到指数族分布,并通过引入共轭约束,将密度函数设计转化为寻找最优范数系数的问题。为了解决归一化计算的困难,作者设计了一种基于重要性采样的无偏且解析可计算的分区函数估计器。实验表明,ConjNorm在多个OOD检测基准上取得了当前最优性能,显著优于现有方法。

Comments ICLR24 poster

详情
AI中文摘要

事后分布外检测在可靠机器学习中引起了广泛关注。许多工作致力于基于logits、距离或严格数据分布假设推导得分函数,以识别低得分OOD样本。然而,这些估计得分可能无法准确反映真实数据密度或施加不切实际的约束。为了提供密度基得分设计的统一视角,我们提出了一个基于Bregman散度的新理论框架,将分布考虑扩展到指数分布族。利用定理中揭示的共轭约束,我们引入了一种\textsc{ConjNorm}方法,将密度函数设计重新定义为针对给定数据集寻找最优范数系数$p$。鉴于归一化的计算挑战,我们利用基于蒙特卡洛的重要性采样技术,设计了一个无偏且解析可处理的配分函数估计器。在OOD检测基准上的大量实验表明,我们提出的\textsc{ConjNorm}在各种OOD检测设置中建立了新的最先进水平,在CIFAR-100和ImageNet-1K上分别比当前最佳方法(FPR95)高出高达13.25%和28.19%。

英文摘要

Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimate scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25$\%$ and 28.19$\%$ (FPR95) on CIFAR-100 and ImageNet-1K, respectively.

2402.14212 2026-05-25 cs.LG cs.AI 版本更新

Moonwalk: Inverse-Forward Differentiation

Moonwalk: 逆-前向微分

Dmitrii Krylov, Armin Karamzade, Roy Fox

发表机构 * University of California, Irvine(加州大学尔湾分校)

AI总结 Moonwalk 研究了反向传播中需要存储中间激活值的限制问题,提出了一种无需存储激活值的梯度计算方法。该方法通过引入向量-逆雅可比乘积(vijp)操作符,结合子浸入网络和碎片化梯度检查点技术,在前向过程中精确重建梯度,从而显著提升了网络深度而不增加内存消耗。实验表明,Moonwalk 在保持运行时间与反向传播相当的同时,能够在相同内存预算下训练出深度超过两倍的网络。

Journal ref The 29th International Conference on Artificial Intelligence and Statistics, 2026

详情
AI中文摘要

反向传播的主要限制是它需要在正向传播过程中存储中间激活值(残差),这限制了可训练网络的深度。这引出了一个基本问题:我们能否避免存储这些激活值?我们通过重新审视梯度计算的结构来解决这个问题。反向传播通过一系列向量-雅可比乘积计算梯度,这一操作通常是不可逆的。丢失的信息位于每层雅可比矩阵的余核中。我们定义了浸没式网络——其层雅可比矩阵具有平凡余核的网络——在这种网络中,梯度可以在前向扫描中精确重建,而无需存储激活值。对于非浸没式层,我们引入了碎片梯度检查点,仅记录恢复被雅可比矩阵擦除的余切向量所需的最小残差子集。我们方法的核心是一种新的算子,即向量-逆-雅可比乘积(vijp),它反转了余核外的梯度流。我们的混合模式算法首先通过内存高效的反向传播计算输入梯度,然后使用vijp在前向扫描中重建参数梯度,从而消除了存储激活值的需要。我们在Moonwalk中实现了该方法,并表明它在相同内存预算下训练深度超过两倍的网络时,运行时间与反向传播相当。

英文摘要

Backpropagation's main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector-Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer's Jacobian. We define submersive networks -- networks whose layer Jacobians have trivial cokernels -- in which gradients can be reconstructed exactly in a forward sweep without storing activations. For non-submersive layers, we introduce fragmental gradient checkpointing, which records only the minimal subset of residuals necessary to restore the cotangents erased by the Jacobian. Central to our approach is a novel operator, the vector-inverse-Jacobian product (vijp), which inverts gradient flow outside the cokernel. Our mixed-mode algorithm first computes input gradients with a memory-efficient reverse pass, then reconstructs parameter gradients in a forward sweep using the vijp, eliminating the need to store activations. We implement this method in Moonwalk and show that it matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.
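The reconstruction idea in the abstract can be illustrated on a single linear layer. The sketch below is not the Moonwalk implementation: it assumes a layer y = W x whose Jacobian W has full row rank (trivial cokernel), and it uses a right pseudo-inverse as a stand-in for the vijp operator. All names (`vecmat`, `right_inverse`, and so on) are hypothetical.

```python
# Sketch: recover the output cotangent g_y from the input cotangent g_x
# for a linear layer y = W x, assuming W (2x3) has full row rank.

def vecmat(g, M):
    """Row vector times matrix: (g M)[j] = sum_i g[i] * M[i][j]."""
    return [sum(g[i] * M[i][j] for i in range(len(g))) for j in range(len(M[0]))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def right_inverse(W):
    """W^T (W W^T)^{-1}, valid here because W has 2 independent rows."""
    Wt = transpose(W)
    return matmul(Wt, inv2(matmul(W, Wt)))

W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 1.0]]
x = [1.0, -1.0, 2.0]            # layer input, available during a forward sweep
g_y_true = [3.0, -2.0]          # used only to build g_x and to check the result

# Reverse pass: vector-Jacobian product gives the input cotangent.
g_x = vecmat(g_y_true, W)

# Forward sweep: invert the gradient flow to get the output cotangent back.
g_y = vecmat(g_x, right_inverse(W))

# Parameter gradient of the linear layer: outer product of g_y and x.
grad_W = [[gy * xi for xi in x] for gy in g_y]
```

For layers whose Jacobian is not surjective this inversion loses information, which is the case the abstract handles with fragmental gradient checkpointing.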

2103.14995 2026-05-25 cs.LG cs.AI eess.SP 版本更新

Thermal transmittance prediction based on the application of artificial neural networks on heat flux method results

基于人工神经网络在热流法结果上的热透射率预测

Sanjin Gumbarević, Bojan Milovanović, Mergim Gaši, Marina Bagarić

发表机构 * Center for Theoretical Physics, Sloane Physics Laboratory, Yale University(理论物理中心、斯隆物理实验室、耶鲁大学) University of Zagreb, Faculty of Civil Engineering, Department of Materials(萨格勒布大学、土木工程学院、材料系)

AI总结 本文研究如何利用人工神经网络(ANN)加速建筑围护结构热传导系数(U值)的现场测量过程。通过在热流法(HFM)测量中引入并行测量策略,并基于内外空气温度预测未知热流,从而缩短测量时间。研究对比了多种ANN模型在多层墙体上的应用效果,结果表明该方法在热流预测方面具有较高准确性,为后续研究提供了有价值的参考方向。

Comments Submitted to International Building Physics Conference 2021

Journal ref J. Phys.: Conf. Ser. 2069 (2021) 012152

详情
AI中文摘要

由于能效相关指令,欧洲联盟更加关注建筑群的深度能源改造。许多需要深度能源改造的建筑年代久远,可能缺乏设计/改造文件,或者建筑构件中的材料可能随时间发生退化。热透射率(即U值)是确定通过建筑围护结构构件传输热损失的最重要参数之一,取决于构成建筑构件的所有材料的厚度和热性能。现场U值可通过ISO 9869-1标准(热流法 - HFM)确定。然而,测量持续时间是HFM在改造设计过程开始前现场测试中未广泛使用的原因之一。本文分析了通过使用一个热流传感器进行并行测量来减少测量时间的可能性。这种并行化可以通过在HFM结果上应用特定类别的人工神经网络(ANN)来实现,基于收集的室内外空气温度预测未知热流。在达到满意的预测后,HFM传感器可重新定位到另一个测量位置。本文展示了四种ANN案例应用于HFM结果的比较,这些测量在一面多层墙上进行:一个隐藏层中有三个神经元的多层感知器、100个单元的长短期记忆、100个单元的门控循环单元以及50个长短期记忆单元和50个门控循环单元的组合。分析在基于两个输入温度预测热流率方面给出了有希望的结果。另一面墙上的额外分析显示了该方法的可能局限性,这为这一主题的进一步研究提供了方向。

英文摘要

Deep energy renovation of building stock has come more into focus in the European Union due to energy-efficiency-related directives. Many buildings that must undergo deep energy renovation are old and may lack design/renovation documentation, or the materials in their building elements may have degraded over time. Thermal transmittance (i.e. U-value) is one of the most important parameters for determining the transmission heat losses through building envelope elements. It depends on the thickness and thermal properties of all the materials that form a building element. The in-situ U-value can be determined by the ISO 9869-1 standard (Heat Flux Method - HFM). Still, measurement duration is one of the reasons why HFM is not widely used in field testing before the renovation design process commences. This paper analyzes the possibility of reducing the measurement time by conducting parallel measurements with one heat-flux sensor. This parallelization could be achieved by applying a specific class of Artificial Neural Network (ANN) to HFM results to predict unknown heat flux based on collected interior and exterior air temperatures. Once a satisfactory prediction is achieved, the HFM sensor can be relocated to another measuring location. The paper compares four ANN cases applied to HFM results for a measurement held on one multi-layer wall: a multilayer perceptron with three neurons in one hidden layer, long short-term memory with 100 units, a gated recurrent unit with 100 units, and a combination of 50 long short-term memory units and 50 gated recurrent units. The analysis gave promising results in terms of predicting the heat flux rate based on the two input temperatures. Additional analysis on another wall showed possible limitations of the method, which serve as a direction for further research on this topic.
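As a rough illustration of the quantity being measured, the sketch below computes a U-value with the averaging approach commonly associated with ISO 9869-1: the sum of heat-flux readings divided by the sum of interior-exterior air temperature differences. The function name and the sample readings are invented for illustration and do not come from the paper.

```python
# Sketch: U-value by the average method, U = sum(q_j) / sum(T_int_j - T_ext_j).
# Units: heat flux in W/m^2, temperatures in degrees C, result in W/(m^2 K).

def u_value_average_method(heat_flux, t_int, t_ext):
    """Return the running-average U-value over all paired readings."""
    if not (len(heat_flux) == len(t_int) == len(t_ext)):
        raise ValueError("all series must have the same number of readings")
    total_delta_t = sum(ti - te for ti, te in zip(t_int, t_ext))
    if total_delta_t == 0:
        raise ValueError("no net temperature difference; U-value is undefined")
    return sum(heat_flux) / total_delta_t

# Hypothetical readings (not measured data).
q = [10.0, 12.0, 8.0]
t_inside = [20.0, 20.0, 20.0]
t_outside = [0.0, -4.0, 4.0]

u = u_value_average_method(q, t_inside, t_outside)
print(u)  # 0.5
```

In the setting the abstract describes, the heat-flux series for a location the sensor has left would come from the trained network's predictions rather than from direct readings.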

2605.22940 2026-05-25 cs.LG cs.AI stat.ML 版本更新

Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning

以人为中心的学习力学:熵正则化表示学习的动力学框架

Kim Phuc Tran

发表机构 * Univ. Lille, ENSAIT, ULR 2461 – GEMTEX – Génie et Matériaux Textiles(里尔大学,ENSAIT,ULR 2461 – GEMTEX – 纺织工程与材料纺织系) International Chair in DS & XAI, International Research Institute for Artificial Intelligence and Data Science, Dong A University(数据科学与可解释人工智能国际主席,人工智能与数据科学国际研究所,东亚大学)

AI总结 本文提出了一种名为“以人为中心的学习力学”(HCLM)的动态信息理论框架,旨在为开放且受控的学习系统提供理论支持。研究指出,传统的熵正则化方法在某些情况下可能导致梯度不稳定或与优化方向不一致,因此引入了有效熵的概念,并提出了可计算的几何熵代理方法,如基于方差和对数行列式的协方差代理。文章的主要贡献包括形式化有效信息力下的熵正则化、推导收敛性和泛化性理论,以及从动态角度解释模型规模与性能之间的关系。实验表明,几何熵代理,尤其是对数行列式协方差熵,能产生更稳定和有力的信息力,提升表示学习的效果。

Comments Submitted to JMLR

详情
AI中文摘要

深度学习越来越被视为参数空间中的动力学过程,然而许多现有理论仍将训练视为封闭的优化系统。这种观点对于现实世界的人工智能是有限的,因为模型在不确定性、资源约束、分布偏移、下游决策风险和人类反馈下运行。我们提出了以人为中心的学习力学(HCLM),一个用于开放和受控学习系统的动力学和信息论框架。核心思想是,只有当所选的熵代理沿着优化轨迹产生非简并的信息力时,熵正则化才是有用的。否则,熵项可能产生弱、不稳定或不对齐的梯度,导致动力学坍缩为普通的损失最小化。我们引入了有效熵的概念,并研究了可处理的几何熵代理,包括基于方差和对数行列式协方差代理。本文做出三项贡献。首先,它通过有效信息力形式化了熵正则化,并刻画了简并熵区域。其次,它在显式假设下推导了收敛性、熵流、Wasserstein梯度流和噪声表示泛化结果。第三,它提供了缩放律行为的条件动力学解释,作为信息注入、熵耗散和残差风险之间的平衡,而不声称对经验神经缩放律的无条件推导。受控的表示学习实验支持几何熵代理(尤其是对数行列式协方差熵)比softmax归一化熵产生更强更稳定的信息力的假设。

英文摘要

Deep learning is increasingly viewed as a dynamical process in parameter space, yet many existing theories still treat training as a closed optimization system. This view is limited for real-world AI, where models operate under uncertainty, resource constraints, distribution shift, downstream decision risks, and human feedback. We propose Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for open and controlled learning systems. The central idea is that entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory. Otherwise, entropy terms may produce weak, unstable, or misaligned gradients, causing the dynamics to collapse toward ordinary loss minimization. We introduce the notion of effective entropy and study tractable geometric entropy surrogates, including variance-based and log-determinant covariance proxies. The paper makes three contributions. First, it formalizes entropy regularization through effective information force and characterizes degenerate entropy regimes. Second, it derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions. Third, it offers a conditional dynamical interpretation of scaling-law-like behavior as a balance between information injection, entropy dissipation, and residual risk, without claiming an unconditional derivation of empirical neural scaling laws. Controlled representation-learning experiments support the hypothesis that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.

2605.22905 2026-05-25 cs.AI cs.CL 版本更新

EVE-Agent: Evidence-Verifiable Self-Evolving Agents

EVE-Agent: 证据可验证的自我进化智能体

Yamato Arai, Yuma Ichikawa

发表机构 * Fujitsu Limited(富士通株式会社) The University of Tokyo(东京大学) RIKEN center for AIP(理化学研究所AIP研究中心)

AI总结 本文提出了一种名为EVE-Agent的证据可验证自进化智能体,旨在解决自进化搜索代理在缺乏可验证证据时可能生成不准确但流畅的训练样本的问题。该方法通过修改提议者-求解者框架,使每个生成的实例不仅包含答案,还包含可验证的来源片段,并通过证据验证器评估其对答案的贡献。实验表明,EVE-Agent显著提升了基于证据的正确性,且生成的训练样本具有可审计性,增强了系统的可信度。

Comments 23 pages, 2 figures

详情
AI中文摘要

自我进化智能体不应在其无法证明的示例上进行训练。无数据的自我进化搜索智能体提供了一种可扩展的途径,使系统能够生成自己的问题、回答问题,并从自身反馈中改进,而无需人工标注。然而,没有可验证的证据,这种循环可能会奖励流畅但无依据的示例,使自我生成的课程变成不透明且可能不可靠的训练信号。我们认为,证据可验证性是搜索智能体可信自我进化的先决条件:每个生成的实例不仅应包含答案,还应包含一个基于来源的文本片段,其对该答案的贡献可以被衡量。我们引入了EVE-Agent,一种证据可验证的自我进化智能体,通过对提议者-求解者框架的修改来实现这一原则。提议者生成一个问题、一个答案和一个逐字证据片段。然后,证据验证器根据提供证据时的边际准确率增益来奖励该片段。这产生了一个训练信号,倾向于真正有助于回答问题的证据,而不需要标准答案、人工标签或外部标注。EVE-Agent保持骨干模型、检索器、搜索工具和优化框架不变。实验表明,EVE-Agent在证据基础的准确性上显著优于先前的自我进化搜索智能体。由此产生的课程不仅是自我生成的,而且从结构上是可审计的:每个训练实例都带有一个可检查的来源片段,解释其为何值得信任。

英文摘要

Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer--solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.
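The verifier's reward can be pictured with a small sketch. It assumes the solver is sampled several times with and without the evidence span and that accuracy is measured against the proposer's own answer (since no oracle answers are used); the function names and the sample answers are hypothetical, not taken from the paper.

```python
# Sketch: reward an evidence span by the marginal accuracy gain it yields.

def accuracy(solver_answers, proposed_answer):
    """Fraction of sampled solver answers that match the proposer's answer."""
    return sum(a == proposed_answer for a in solver_answers) / len(solver_answers)

def evidence_reward(answers_with_evidence, answers_without_evidence, proposed_answer):
    """Accuracy with the span in context minus accuracy without it."""
    return (accuracy(answers_with_evidence, proposed_answer)
            - accuracy(answers_without_evidence, proposed_answer))

# Hypothetical solver samples for one generated question.
with_span = ["Paris", "Paris", "Paris", "Lyon"]
without_span = ["Lyon", "Paris", "Nice", "Lyon"]

reward = evidence_reward(with_span, without_span, "Paris")
print(reward)  # 0.5
```

Under this reading, a span the solver does not need (the question is already answerable) or cannot use (the span is irrelevant) earns a reward near zero.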

2605.22903 2026-05-25 cs.CV cs.AI cs.CL 版本更新

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

不看而见:视觉-语言基准测试真的测试视觉吗?

Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou

发表机构 * University of Chicago(芝加哥大学) Stony Brook University(石溪大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

AI总结 该研究质疑了当前视觉-语言模型(VLMs)基准测试是否真正评估了模型对视觉证据的依赖程度。通过系统分析多个开源模型的行为表现,研究发现尽管VLMs会利用视觉输入,但其预测对细粒度视觉信息的丢失并不敏感,这与标准准确率所暗示的情况存在明显偏差。研究还从表示层面揭示了视觉特征在深层逐渐趋同的现象,为这一现象提供了可能的解释,表明现有基准可能无法有效评估模型的细粒度视觉理解能力。

Comments Accepted to GRAIL-V: Grounded Retrieval and Agentic Intelligence for Vision-Language, CVPR 2026 Workshop. accepted version

详情
AI中文摘要

基准测试的准确性通常被隐含地视为反映了视觉-语言模型(VLM)中的基础视觉理解,但尚不清楚这些分数在多大程度上真正反映了对视觉证据的依赖。受一个令人惊讶的观察结果——在广泛使用的幻觉基准测试中,移除大量图像令牌仅轻微降低模型性能——的启发,我们在一组开源VLM中系统地研究了这种不匹配。我们的分析涵盖多个粒度级别,包括全局视觉退化、局部遮挡、问题重述、答案空间扩展以及超出标准准确率的决策级分析。我们进一步用视觉令牌几何的逐层分析补充这些行为结果。在整个实验中,我们发现尽管VLM确实整合了视觉输入,但其预测对细粒度视觉证据丢失的敏感性低于标准准确率所暗示的程度。即使最终预测保持不变,模型对正确答案的内部支持可能已经减弱。我们还补充了表示级分析,显示深层中视觉令牌之间的相似性增加,这为我们的发现提供了一个可能的解释。总之,这些结果表明,当前的基准测试不足以可靠地评估VLM中的细粒度视觉基础。

英文摘要

Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by the surprising observation that removing a substantial fraction of image tokens degrades model performance only slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, including global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. The representation-level analysis shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.

2605.22902 2026-05-25 cs.LG cs.AI cs.CL 版本更新

Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

Transcoders 追踪视觉语言模型中的视觉基础与幻觉

Dimitrios Damianos, Leon Voukoutis, Georgios Skyrianos, Vassilis Katsouros, Georgios Paraskevopoulos

发表机构 * Institute of Language and Speech Processing(语言与语音处理研究所) Athena Research Center(雅典娜研究中心)

AI总结 该研究探讨了生成式视觉-语言模型(VLMs)中视觉输入如何转化为文本的问题,提出了基于Transcoders的函数中心解释框架,用于分解模型内部的计算路径,揭示图像块与文本生成之间的关联。相比传统的稀疏自编码器(SAEs),该方法在图像块缺失实验中表现出更强且更稳定的解释效果,并能更准确地对应语义相关的图像区域。此外,研究还通过结构分析揭示了模型生成幻觉的机制,并利用图特征构建分类器实现了对幻觉的预测。

详情
AI中文摘要

生成式视觉语言模型(VLM)在多模态推理上表现良好,但视觉输入如何转化为文本仍知之甚少。现有的VLM可解释性工作使用稀疏自编码器(SAE),其分解静态残差表示,忽略了驱动跨模态交互的功能更新。我们采用基于Transcoders的功能中心框架,Transcoders是MLP子层的稀疏近似,作为逐层计算的因果代理。应用于Gemma 3-4B-IT,该框架将模型分解为可解释的计算路径,连接图像块到文本生成中的方向。在补丁消融下,Transcoder归因对视觉基础标记产生比SAE归因更强且更稳定的效果,并与语义相关的图像区域更好对齐。假视觉基础反事实分析证实恢复的路径是视觉-语言交互特有的。最后,我们对幻觉生成进行结构分析,从Transcoder产生的电路痕迹中提取基于图的指标。基于这些机制图特征的逻辑分类器以AUC 0.68预测幻觉。这些结果表明,功能中心的电路分解为VLM中的多模态计算提供了可解释且可预测的描述。

英文摘要

Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss the functional updates that drive cross-modal interaction. We adopt a function-centric framework based on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language interaction. Finally, we perform a structural analysis of hallucinated generations by extracting graph-based indicators from circuit traces produced by the transcoders. A logistic classifier over these mechanistic graph features predicts hallucinations at AUC $0.68$. These results show that function-centric circuit decomposition yields interpretable and predictive accounts of multimodal computation in VLMs.

2605.22900 2026-05-25 cs.AI cs.LO quant-ph 版本更新

Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions

中介模糊逻辑:从类型-1基础到类型-2、类型-3和量子扩展

Oscar Montiel Ross

发表机构 * Instituto Politécnico Nacional - CITEDI(墨西哥国家理工学院- CITEDI)

AI总结 本文提出了一种称为“调解模糊逻辑”的新逻辑框架,旨在解决模糊控制与决策中存在犹豫或冲突评估的问题。该框架在传统类型-1模糊逻辑的基础上,扩展了类型-2、类型-3以及量子逻辑的表达能力,通过引入调解算子和连续双格结构中的真值对,构建了一个统一的逻辑系统。研究不仅建立了该逻辑的语义基础和推理规则,还展示了其在传感器融合等实际应用中的有效性,为智能决策系统提供了更加鲁棒和透明的理论支持。

Comments 30 pages, 1 figure

详情
AI中文摘要

中介模糊逻辑最初被设想为一种实用方案,用于协调模糊控制和决策中的犹豫或冲突评估。然而,其逻辑和语义基础仍然不完善,尤其是在操作性的类型-1设置之外。本文发展了类型-1核心以及区间类型-2、粒状类型-3和量子扩展的统一描述。我们将中介算子刻画为由犹豫和矛盾控制的凸聚合,将中介真值建模为连续双格结构中的独立真-假对,并引入一个命题系统,通过中介连接词扩展标准的t-范数模糊逻辑。我们证明了对于无中介的公式,该系统相对于底层模糊基的可靠性、次协调性和保守性,并制定了区间类型-2真值、粒索引局部评估以及希尔伯特空间上的效应和密度算子的一致语义扩展。一个自动制动传感器融合示例说明了该框架如何支持在信息不完全、异构和轻度矛盾的证据下做出透明、保守且安全优先的决策。在适当假设下,高级公式简化为类型-1情况,澄清了各层级间的一致性,并为智能决策系统的未来工作提供了可靠支持。

英文摘要

Mediative Fuzzy Logic was conceived as a practical scheme for reconciling hesitant or conflicting assessments in fuzzy control and decision-making. However, its logical and semantic foundations remain underdeveloped, especially beyond operational type-1 settings. This article develops a unified account of the type-1 core together with interval type-2, granular type-3, and quantum extensions. We characterize the mediative operator as a convex aggregation controlled by hesitation and contradiction, model mediative truth values as independent truth-falsity pairs in a continuous bilattice-like structure, and introduce a propositional system extending a standard t-norm-based fuzzy logic with a mediative connective. We establish soundness, paraconsistency, and conservativity over the underlying fuzzy base for formulas without mediation, and formulate coherent semantic extensions to interval type-2 truth values, granule-indexed local evaluations, and effects and density operators on Hilbert spaces. An autonomous-braking sensor-fusion example illustrates how the framework supports transparent, conservative, and safety-first decisions under incomplete, heterogeneous, and mildly contradictory evidence. Under suitable assumptions, the higher-level formulations reduce to the type-1 case, clarifying coherence across levels and reliably supporting future work in intelligent decision systems.

2605.22896 2026-05-25 cs.RO cs.AI cs.LG 版本更新

Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

Agentic-VLA:视觉-语言-动作模型的高效在线自适应

Ruofan Jin, Zaixi Zhang

发表机构 * Ruofan Jin(金鲁凡) Zaixi Zhang(张在西)

AI总结 本文提出了一种名为Agentic-VLA的新型训练框架,旨在提升视觉-语言-动作(VLA)模型在机器人操作任务中的在线适应效率。该方法通过自适应奖励合成、语言引导探索和经验记忆三个核心创新,有效解决了现有VLA模型在新环境泛化能力和训练效率方面的不足。实验表明,Agentic-VLA在LIBERO和RoboTwin 2.0等基准测试中显著提升了任务完成率和学习效率,为构建具备持续学习能力的自适应VLA系统提供了重要进展。

Comments Total 15 pages

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过利用预训练的视觉-语言表示,已成为机器人操作领域的一种有前景的范式。然而,当前的VLA训练方法存在两个关键局限性:对新环境的泛化能力差,以及需要大量演示数据导致的训练效率低下。我们提出Agentic-VLA,一种智能训练框架,通过三项关键创新使VLA能够在线高效自适应:(1)自适应奖励合成,根据VLA当前能力和任务复杂度动态生成并调整奖励函数,将复杂任务分解为可学习的子目标以进行课程学习;(2)语言引导探索,其中评论模型提供结构化指导以实现系统化探索,而非随机采样;(3)经验记忆,存储和检索与任务相关的策略权重,用于相似任务的预热启动自适应。我们在LIBERO基准上评估Agentic-VLA,取得了显著改进:长时域任务提升12.3%,单样本学习提升28.5%,并在无需任务特定演示的情况下实现从0%到31.2%的跨任务迁移。与现有在线自适应方法相比,我们的框架还实现了2.4倍的收敛速度提升。除LIBERO外,Agentic-VLA在双臂RoboTwin 2.0基准(包括其随机困难设置)上仍保持优势。这些结果使Agentic-VLA成为迈向真正自适应、可在部署中持续学习的VLA系统的重要一步。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by leveraging pre-trained vision-language representations. However, current VLA training methods suffer from two critical limitations: poor generalization to novel environments and low training efficiency requiring extensive demonstrations. We introduce Agentic-VLA, an agentic training framework that enables VLAs to efficiently adapt online through three key innovations: (1) Adaptive Reward Synthesis, which dynamically generates and adjusts reward functions based on the VLA's current capabilities and task complexity, decomposing complex tasks into learnable sub-goals for curriculum learning; (2) Language-Guided Exploration, where a critic model provides structured guidance for systematic exploration rather than random sampling; and (3) Experience Memory, which stores and retrieves task-relevant policy weights for warm-starting adaptation to similar tasks. We evaluate Agentic-VLA on the LIBERO benchmark, achieving substantial improvements: +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and enabling cross-task transfer from 0% to 31.2% without task-specific demonstrations. Our framework also demonstrates 2.4x faster convergence compared to existing online adaptation methods. Beyond LIBERO, Agentic-VLA retains its advantage on the dual-arm RoboTwin 2.0 benchmark, including under its randomized Hard setting. These results establish Agentic-VLA as a significant step toward truly adaptive VLA systems capable of continuous learning in deployment.

2605.22885 2026-05-25 cs.AI cs.CL cs.LG cs.LO 版本更新

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

ImProver 2:用于神经符号证明优化的迭代自改进语言模型

Riyaz Ahuja, Tate Rowney, Jeremy Avigad, Sean Welleck

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 随着形式化数学库的快速增长,对验证证明的重构和神经证明器训练数据质量的提升需求日益迫切。为解决可扩展性证明优化中面临的异构目标、数据稀缺和高训练推理成本等问题,本文提出ImProver 2,一个用于Lean 4的神经符号框架,结合高效的数据专家迭代流程和形式化结构暴露的轻量非正式抽象框架,并引入一系列衡量证明结构特性的指标。实验表明,该框架能够使小型模型在多个指标上达到与更大模型相当甚至更优的性能,展示了证明优化作为可扩展学习任务的可行性。

详情
AI中文摘要

形式化数学库正在迅速扩展,这产生了对已验证证明进行重构以保持可维护性以及提高神经证明器训练数据质量的日益增长的需求。然而,可扩展的证明优化受到异构且启发式指定的目标、稀缺的数据以及高训练和推理成本的阻碍。为了克服这些挑战,我们引入了ImProver 2,这是一个用于在Lean 4中自动进行证明优化的神经符号框架。ImProver 2将数据高效的专家迭代流程与一个暴露形式结构并附带轻量级非正式抽象的脚手架相结合。我们进一步引入了一套捕捉证明结构属性的指标。使用ImProver 2,我们训练了一个7B参数的模型,该模型在相同模型系列中优于数量级更大的模型,并且在各项指标上与中端前沿模型具有竞争力。我们还证明,我们的神经符号脚手架显著提高了小型和前沿模型的性能。我们表明,通过适当的脚手架和训练,小型模型可以有效地在复杂且多样的指标上重构研究级证明,与更大的系统相匹配,并将证明优化确立为一项可扩展、可学习的任务。

英文摘要

Formal mathematics libraries are rapidly expanding, creating a growing need to refactor verified proofs for maintainability and to improve training data quality for neural provers. However, scalable proof optimization is hindered by heterogeneous and heuristically specified objectives, scarce data, and high training and inference costs. To overcome these challenges, we introduce ImProver 2, a neurosymbolic framework for automated proof optimization in Lean 4. ImProver 2 combines a data-efficient expert-iteration pipeline with a scaffold that exposes formal structure alongside lightweight informal abstractions. We further introduce a suite of metrics capturing structural proof properties. Using ImProver 2, we train a 7B-parameter model that outperforms orders-of-magnitude larger models within the same model family, and is competitive with mid-tier frontier models across metrics. We additionally demonstrate that our neurosymbolic scaffold significantly improves performance across both small and frontier models. We show that with proper scaffolding and training, small models can effectively restructure research-level proofs over complex and varied metrics, matching substantially larger systems and establishing proof optimization as a scalable, learnable task.

2605.22884 2026-05-25 cs.LG cs.AI 版本更新

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

Tensor Cache: 基于驱逐条件的Transformer联想记忆

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

发表机构 * Massachusetts Institute of Technology, Cambridge, MA, USA(麻省理工学院) IBM Research, Cambridge, MA, USA(IBM研究院) University of Toronto, Toronto, Canada(多伦多大学)

AI总结 本文提出了一种名为 Tensor Cache 的两层缓存机制,用于改进 Transformer 模型在长上下文处理中的内存效率与质量。该方法结合了滑动窗口注意力作为第一层缓存(L1),并将被窗口淘汰的键值对压缩存储到第二层缓存(L2)中,通过外积形式的快速权重记忆实现高效召回。研究还揭示了现有训练方法中隐含的虚假外积问题,并提出改进方案,实验表明 Tensor Cache 在多个任务中显著提升了内存与性能的平衡。

详情
AI中文摘要

自回归Transformer的KV缓存随上下文长度线性增长;滑动窗口缓存限制了内存但完全丢弃被驱逐的token,使得窗口外的相关证据变得不可访问。我们引入了\emph{Tensor Cache},一种双层缓存,将滑动窗口softmax注意力作为第一级缓存(L1),与一个固定大小的外积快速权重记忆作为第二级缓存(L2)配对,L2由从窗口中驱逐的KV对提供。最近的token保留在精确的局部注意力中;被驱逐的对被压缩成一个每层矩阵$A$,并通过单个矩阵乘法被未来的查询读取,利用了线性注意力恒等式$q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$。一个可学习的标量门融合L1和L2的输出,并且每头的衰减和写入率参数是端到端训练的。外积记忆和读取恒等式是众所周知的;我们的贡献是将其用作仅由滑动窗口驱逐提供的L2缓存,加上识别出常见的分块均值训练捷径$A\!\leftarrow\!λA\!+\!η(\bar k\!\otimes\!\bar v)$在每个块中静默地引入了$C^2{-}C$个虚假的跨token外积,并通过一个并行的加权和扫描(等价于在float32 epsilon内的每token写入)来弥补这一差距。跨系统规模、受控联想回忆、长上下文语言建模和记忆容量诊断,Tensor Cache在有限状态基线上改善了记忆-质量边界。

英文摘要

Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce \emph{Tensor Cache}, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A\!\leftarrow\!λA\!+\!η(\bar k\!\otimes\!\bar v)$ silently introduces $C^2{-}C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory--quality frontier over bounded-state baselines.
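The read identity and the chunked-mean issue described above can be checked on a toy example. The following is a minimal sketch in plain Python, not the authors' implementation: it assumes decay $λ=1$ and write rate $η=1$, uses hand-picked orthonormal keys, and all helper names (`outer`, `read`, `mean`) are hypothetical.

```python
def outer(k, v):
    """Outer product k ⊗ v as a len(k) x len(v) nested list."""
    return [[ki * vj for vj in v] for ki in k]

def mat_add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def read(q, A):
    """Row-vector read q A (a single matrix multiplication)."""
    return [sum(qi * row[j] for qi, row in zip(q, A)) for j in range(len(A[0]))]

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Read identity: q (k ⊗ v) = <q, k> v
identity_lhs = read([3.0, 4.0], outer([1.0, 2.0], [1.0, 0.0, -1.0]))

# One chunk of C = 2 evicted (key, value) pairs with orthonormal keys.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0, 0.0], [0.0, 0.0, 6.0]]
C = len(keys)

# Per-token writes: A = sum_i k_i ⊗ v_i (assumed decay 1, write rate 1).
A_tok = outer(keys[0], values[0])
for k, v in zip(keys[1:], values[1:]):
    A_tok = mat_add(A_tok, outer(k, v))

# Chunked-mean shortcut: A = mean(k) ⊗ mean(v), which expands into
# C*C pair terms k_i ⊗ v_j, of which only the C terms with i == j are wanted.
A_mean = outer(mean(keys), mean(values))
spurious_pairs = [(i, j) for i in range(C) for j in range(C) if i != j]

exact = read(keys[0], A_tok)      # recovers values[0]
shortcut = read(keys[0], A_mean)  # mixes in values[1]
```

With orthonormal keys, the per-token memory returns the stored value exactly, while the chunked-mean memory returns a blend of both values, which is the cross-token contamination the abstract counts as $C^2-C$ spurious terms.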

2605.22883 2026-05-25 cs.AI cs.LG cs.PF 版本更新

Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

每个成功目标的能量:面向智能体AI系统的目标级能量核算

Deepak Panigrahy, Aakash Tyagi

发表机构 * Independent Researcher(独立研究者) Texas A&M University(德克萨斯A&M大学) Texas A&M University Department of Computer Science(德克萨斯A&M大学计算机科学系)

AI总结 当前AI能耗基准通常以单次模型调用或训练运行作为能耗计量单位,但这种方法难以准确反映智能体系统中多步骤任务的能耗情况。本文提出A-LEMS框架,将能耗计量单位从每次推理改为每成功目标的能耗(EpG),并引入调度开销指数(OOI)以量化调度结构对能耗的影响。研究发现,智能体系统完成每项任务的平均能耗是线性基线的4.33倍,且这一差异主要由调度结构而非推理计算引起,表明EpG和OOI为评估智能体AI系统能耗提供了更准确的基准方法。

Comments 34 pages, 16 figures, 10 tables

详情
AI中文摘要

当前的AI能量基准在单次模型调用或训练运行的粒度上测量能耗。对于经典的单轮工作负载,这种单位仍然一致。但对于智能体系统——其中单个用户目标可能触发多步编排、工具调用、重试和故障恢复循环——调用次数是实现产物而非任务属性,推理级归一化错误地表示了目标完成的能量成本。我们提出A-LEMS(智能体LLM能量测量系统),一个跨层测量框架,将AI能量核算单位从每次推理能量重新定义为每个成功目标能量(EpG)。EpG聚合所有执行尝试(包括失败和重试)的总工作流能量,归一化到成功完成的目标。A-LEMS通过时间边界模型、将RAPL信号映射到工作流级能量的五层观测管道,以及将每次测量绑定到硬件和运行时配置的可复现协议,形式化了能量归因。基于EpG,我们定义了编排开销指数(OOI),在相同任务标准下隔离编排相对于线性执行的能耗成本。在五个推理和三个工具增强任务族中,智能体工作流每个成功目标的平均能耗是线性基线的4.33倍(888.1 J vs 205.3 J)。这种开销由编排结构驱动,而非推理计算。对于工具增强任务,OOI反转至低于1.0倍:智能体执行比线性更便宜,确认该指标捕捉了编排结构而非固定的向上偏差。这些发现表明,每次推理能量对于智能体AI是不充分的。EpG和OOI为准确基准测试提供了测量基础,其中编排结构是能耗的主要决定因素。

英文摘要

Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads this unit remains coherent. For agentic systems - where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles - the invocation count is an implementation artifact rather than a task property, and inference-level normalization misrepresents the energy cost of goal completion. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, normalized by successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline mapping RAPL signals to workflow-level energy, and a reproducibility protocol binding every measurement to hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), isolating the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x higher mean energy per successful goal than linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts below 1.0x: agentic execution is cheaper than linear, confirming the metric captures orchestration structure rather than a fixed upward bias. These findings establish that energy-per-inference is insufficient for agentic AI. EpG and OOI provide the measurement foundation for accurate benchmarking, where orchestration structure is the primary determinant of energy cost.
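The EpG definition above (total workflow energy over all attempts, divided by the number of successfully completed goals) can be sketched as follows. This is an illustrative sketch only: the `Attempt` record and both function names are hypothetical, and treating OOI as the simple ratio of agentic to linear EpG is an assumption inferred from the "4.33x" and "below 1.0x" wording, not a formula quoted from the paper.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    goal_id: str          # which user goal this execution attempt belongs to
    energy_joules: float  # measured energy of the attempt
    succeeded: bool       # whether the attempt completed the goal

def energy_per_successful_goal(attempts):
    """Total energy of all attempts (failures and retries included)
    divided by the number of distinct goals that succeeded."""
    total = sum(a.energy_joules for a in attempts)
    succeeded_goals = {a.goal_id for a in attempts if a.succeeded}
    if not succeeded_goals:
        raise ValueError("EpG is undefined when no goal succeeded")
    return total / len(succeeded_goals)

def orchestration_overhead_index(epg_agentic, epg_linear):
    """Assumed form of OOI: ratio of agentic EpG to linear-baseline EpG."""
    return epg_agentic / epg_linear

attempts = [
    Attempt("g1", 100.0, False),  # failed first try still costs energy
    Attempt("g1", 150.0, True),   # retry succeeds
    Attempt("g2", 200.0, True),
    Attempt("g3", 50.0, False),   # never succeeds, energy still counted
]
epg = energy_per_successful_goal(attempts)              # 500 J over 2 goals
ooi = orchestration_overhead_index(888.1, 205.3)        # means reported in the abstract
```

Under this assumed ratio form, the reported means of 888.1 J and 205.3 J give an OOI that rounds to the 4.33x quoted in the abstract.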

2605.22880 2026-05-25 cs.CL cs.AI cs.CY 版本更新

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

它们会走多远?使用大型语言模型对在线影响力进行红队测试

Daniel C. Ruiz, Anna Serbina, Ashwin Rao, Emilio Ferrara, Luca Luceri

发表机构 * Information Sciences Institute, University of Southern California(南加州大学信息科学研究所)

AI总结 随着基于大语言模型(LLM)的代理越来越多地参与在线讨论,评估其支持政治影响力活动的能力对维护信息完整性至关重要。本文提出了一种实证的“红队”框架,用于测量LLM的奥弗顿之窗(Overton Window,OW),即模型在争议性话题上能够可靠表达的政治观点范围,并量化简单的自然语言越狱技术如何扩展这一范围。研究评估了来自10个模型家族、5个原产国的30多个开源LLM,发现其在政治表达上存在系统性不对称,如更倾向于生成左倾内容,OW往往与模型规模成反比,且不同地区模型表现差异显著。研究还揭示了越狱效果在不同模型家族间差异明显,并据此提出了识别有效越狱技术组合的工作流程。

Comments 30 pages, 8 figures, submitted to COLM 2026

详情
AI中文摘要

随着基于大型语言模型(LLM)的代理越来越多地参与在线讨论,对其支持政治影响力活动的能力进行红队测试对于信息完整性至关重要。为实现这一目标,我们专注于本地部署的开源LLM,而非前沿的仅API模型,因为前者更符合在社交媒体环境中部署的注重隐私的恶意行为者的操作约束。我们引入了一个经验性的红队测试框架,用于测量LLM的Overton窗口(OW),即模型在争议话题上能够可靠表达的政治观点范围,并量化简单的自然语言越狱如何扩展该范围。我们评估了来自10个模型家族和5个原产国的30多个LLM。我们发现政治表达存在系统性不对称:开源LLM通常更愿意生成左倾的社交媒体内容,OW往往与模型大小成反比,尽管开源生态系统中的代表性不均,但区域差异显著。越狱效果在不同模型家族之间也差异很大,这促使我们开发一个工作流程来识别有效的越狱技术组合。综合来看,我们的结果建立了一个实用的框架,用于审计开源LLM的政治可操控性,并帮助未来的研究人员设计更强的对策来对抗基于LLM的影响力活动。

英文摘要

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.

2605.22878 2026-05-25 cs.AI cs.CL cs.IR cs.LG 版本更新

SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

SciAtlas:面向自动化科学研究的大规模知识图谱

Shuofei Qiao, Yunxiang Wei, Jiazheng Fan, Bin Wu, Busheng Zhang, Mengru Wang, Yuqi Zhu, Ningyu Zhang, Keyan Ding, Qiang Zhang, Huajun Chen

发表机构 * Zhejiang University(浙江大学) University College London(伦敦大学学院)

AI总结 随着全球学术产出的指数级增长,研究者和人工智能代理面临前所未有的“信息爆炸”挑战,碎片化和非结构化的知识组织阻碍了跨学科的深度融合。为解决这一问题,本文提出 SciAtlas,一个涵盖26个学科、包含4300万篇论文、1.57亿实体和30亿三元组的多学科异构学术知识图谱,旨在构建全景式的科学演进网络。SciAtlas 提供了结构化的拓扑认知基础,打破了学科壁垒,并通过神经符号检索算法实现了从语义匹配到确定性关联发现的转变,为自动化科研全流程提供了高效、低成本的“认知地图”。

Comments Ongoing Work

详情
AI中文摘要

全球学术产出的指数级增长使研究人员和AI代理面临前所未有的“信息爆炸”,其中碎片化和非结构化的知识组织阻碍了深层次的跨学科整合。当前的学术检索工具主要依赖浅层关键词匹配或向量空间语义检索,缺乏导航复杂逻辑连接所需的拓扑推理能力。基于代理的深度研究框架往往容易出现逻辑幻觉并消耗高推理成本。为弥补这一差距,本报告介绍了SciAtlas,一个大规模、多学科、异构的学术资源知识图谱,设计为全景科学演化网络。通过整合来自26个学科的超过4300万篇论文,总计1.57亿个实体和30亿个三元组,SciAtlas提供了一个结构化的拓扑认知基础,打破了学科壁垒,并为AI代理提供了全局视角。此外,我们开发了一种神经符号检索算法,具有三路径协同召回和图重排序,实现了从简单语义匹配到确定性关联发现的无缝过渡。我们还展示了SciAtlas的关键应用方向,包括文献综述、自动化研究趋势综合、想法定位和学术轨迹探索,以证明SciAtlas可以作为有效的“认知地图”,赋能自动化科学研究的全流程,同时显著降低推理成本。我们已在GitHub仓库中发布了知识图谱检索和各种下游任务的接口。

英文摘要

The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented "information explosion," where fragmented and unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools predominantly rely on superficial keyword matching or vector-space semantic retrieval, which lack the topological reasoning capabilities required to navigate complex logical connections. Agentic deep-research frameworks are often prone to logical hallucinations and incur high inference costs. To bridge this gap, in this report, we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed as a panoramic scientific evolution network. By integrating over 43M papers from 26 disciplines, with a total of 157M entities and 3B triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective "cognitive map" to empower the full loop of automated scientific research while significantly reducing reasoning costs. We have released the interfaces for KG retrieval and various downstream tasks in our GitHub repo.

2605.22875 2026-05-25 cs.AI cs.LG 版本更新

RMA: an Agentic System for Research-Level Mathematical Problems

RMA:一个面向研究级数学问题的智能体系统

Zelin Zhao, Bo Yuan, Jaemoo Choi, Yongxin Chen

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出了一种名为 **RMA** 的智能代理系统,专门用于解决研究级数学问题。RMA 通过分解问题分析、文献检索、公平比较、知识库构建和证明验证等模块,并由初始化器、提议者和验证者代理协同工作,实现了对复杂数学问题的长期推理和迭代证明优化。实验表明,RMA 在 First Proof 基准测试中表现出色,解决了其中八道难题,其生成的证明在逻辑性和可读性上优于现有强基线模型。

详情
AI中文摘要

我们提出了$\textbf{Research Math Agents (RMA)}$,一个用于研究级数学问题自动推理的智能体框架。与以往专注于竞赛数学或形式化定理证明的研究不同,RMA针对需要长程推理、文献依据和迭代证明改进的研究级数学问题。RMA将研究级证明求解分解为专门模块,包括问题分析、文献搜索与理解、公平比较、知识库构建和证明验证,所有这些都由初始化器、提议器和验证器智能体通过共享的结构化内存协调。在这个统一框架内,这些智能体以多角色、多轮工作流的方式运行,通过迭代反馈协作生成、改进和验证候选证明。我们在First Proof基准上评估了RMA,该基准由来自不同领域的专家数学家贡献的十个研究级问题组成。通过全面的专家评估,RMA在First Proof基准上优于强基线(包括GPT-5.2R和Aletheia),解决了十个研究问题中的八个,并生成了逻辑更合理、可读性更强的证明。我们的全面消融研究进一步表明,性能提升来自于结构化推理模块、迭代改进和基于验证器的反馈之间的交互,而非任何单一组件。我们的解决方案和实现将在论文被接收后公开。

英文摘要

We present $\textbf{Research Math Agents (RMA)}$, an agentic framework for automated reasoning on research-level mathematical problems. Unlike prior studies centered on competition mathematics or formal theorem proving, RMA targets research-level mathematical problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. We evaluate RMA on the First Proof benchmark, which consists of ten research-level problems contributed by expert mathematicians across diverse domains. Through comprehensive expert evaluation, RMA outperforms strong baselines on the First Proof benchmark, including GPT-5.2R and Aletheia, solving eight out of ten research problems and producing more logically sound and readable proofs. Our comprehensive ablation studies further show that performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than any single component. Our solutions and implementations will be made publicly available upon acceptance.

2605.22874 2026-05-25 cs.AI cs.LO 版本更新

NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic

NeuroNL2LTL:用于线性时序逻辑自然语言翻译的神经符号框架

Paapa Kwesi Quansah, Ernest Bonnah

发表机构 * Baylor University(贝勒大学)

AI总结 本文提出了一种神经符号框架 NeuroNL2LTL,用于将自然语言翻译为线性时序逻辑(LTL),旨在解决自然语言与形式逻辑之间转换的可靠性与表达力之间的矛盾。该框架通过中间表示结构化地映射到LTL,并结合形式验证进行语义校验与修复,同时利用验证结果作为强化学习的奖励信号,提升模型的正确性。实验表明,该方法在多个领域的大规模需求数据上实现了较高的语义等价性与验证满足率,并能生成易于专家验证的解释性说明。

详情
AI中文摘要

有效地在自然语言(NL)和形式逻辑(如线性时序逻辑LTL)之间进行转换需要专业知识,这限制了形式验证在安全关键开发中的覆盖范围。基于模板的方法牺牲表达能力换取可靠性;神经方法实现了流畅性但无法提供正确性保证。我们提出了NeuroNL2LTL,一种将学习翻译与形式验证统一起来的神经符号架构。NeuroNL2LTL通过一种中间表示进行翻译,该表示到LTL的映射在结构上是保持的。生成的规约经过可满足性和非平凡性检查;一个最小编辑修复机制在近似的错误输出到达下游工具之前对其进行纠正。核心创新是验证器在环训练:验证结果作为强化学习的奖励信号,使神经组件直接针对形式正确性进行优化。在涵盖航空航天、机器人、自动驾驶汽车及其他十个领域的20万+需求上,NeuroNL2LTL实现了与参考规约28%的语义等价性,同时确保86%的输出被验证为可满足。该系统还能从LTL生成上下文相关的解释,使领域专家无需专门培训即可验证规约。这项工作表明,形式验证可以作为神经规约系统的训练目标和运行时过滤器,使我们能够构建可靠性源于逻辑保证而非统计置信度的基于神经的工具。

英文摘要

Effectively translating between natural language (NL) and formal logics like Linear Temporal Logic (LTL) requires expertise that limits formal verification's reach in safety-critical development. Template-based approaches sacrifice expressiveness for reliability; neural methods achieve fluency but provide no correctness guarantees. We present NeuroNL2LTL, a neurosymbolic architecture unifying learned translation with formal verification. NeuroNL2LTL routes translation through an intermediate representation whose mapping to LTL is structure-preserving by construction. Generated specifications undergo satisfiability and non-triviality checking; a minimal-edit repair mechanism corrects near-miss outputs before they reach downstream tools. The central innovation is verifier-in-the-loop training: verification outcomes serve as reward signals for reinforcement learning, producing neural components that optimize directly for formal correctness. On 200,000+ requirements spanning aerospace, robotics, autonomous vehicles, and ten additional domains, NeuroNL2LTL achieves 28% semantic equivalence with reference specifications while ensuring 86% of outputs are verified satisfiable. The system also generates contextually grounded explanations from LTL, enabling domain experts to validate specifications without specialized training. This work demonstrates that formal verification can function as both training objective and runtime filter for neural specification systems, allowing us to build neural-based tools whose reliability derives from logical guarantees rather than statistical confidence.

2605.22872 2026-05-25 cs.LG cs.AI cs.CV 版本更新

MedExpMem: Adapting Experience Memory for Differential Diagnosis

MedExpMem:适应经验记忆用于鉴别诊断

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Yannian Gu, Winnie Chiu Wing Chu, Xiaofan Zhang, Qi Dou

发表机构 * The Chinese University of Hong Kong(香港中文大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出了一种名为 MedExpMem 的经验记忆框架,旨在提升基于视觉-语言模型的医疗诊断代理在鉴别诊断方面的能力。该方法通过记录模型自身在诊断过程中的失败经验,生成包含关键鉴别点、决策规则和推理错误模式的成对鉴别笔记,并采用两阶段构建过程模拟医生的学习过程。实验表明,MedExpMem 在多个放射学子专科基准上有效提升了诊断准确性,验证了其在医疗适应性方面的优越性。

Comments MICCAI 2026 Early Accept. Submission Version

详情
AI中文摘要

经验丰富的医生通过临床实践发展诊断专业知识,不仅获得疾病知识,还能区分易混淆的病症。当前的医学视觉语言模型(VLM)缺乏这种能力——它们的参数编码了静态知识,不会随着诊断经历而演变。我们提出了MedExpMem,一个经验记忆框架,使基于VLM的诊断代理能够积累鉴别诊断专业知识。与检索增强生成(检索百科式疾病描述)不同,MedExpMem记忆从代理自身的诊断失败中获得的判别经验,并将其组织为成对的鉴别笔记,编码关键判别因素、可操作的决策规则和推理错误模式。该框架采用两阶段构建过程,模仿医生的学习:初始实践暴露知识差距,反思性重新诊断完善理解。当遇到新病例时,代理检索经验记忆以指导鉴别推理。我们在涵盖11个亚专业的放射学基准上评估了MedExpMem。结果表明,在不同模型和规模上,准确率持续提升,最高达7.0%。分析实验验证了经验质量和鲁棒性,表明MedExpMem是一种有竞争力的方法,解决了参数学习无法触及的医学适应需求。

英文摘要

Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability -- their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent's own diagnostic failures and organizes it as pairwise differential notes encoding key discriminators, actionable decision rules, and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements of up to 7.0% across diverse models and scales. Analytical experiments validate experience quality and robustness, showing that MedExpMem is a competitive method that addresses medical adaptation needs beyond the reach of parametric learning.

2605.22871 2026-05-25 cs.LG cs.AI stat.ML 版本更新

Approximate Machine Unlearning through Manifold Representation Forgetting Guided by Self Mode Connectivity

通过自模式连通性引导的流形表示遗忘实现近似机器遗忘

Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Luoyu Chen, Shui Yu

发表机构 * Xi'an Jiaotong University(西安交通大学) Southeast University(东南大学) University of Technology Sydney(悉尼科技大学)

AI总结 本文提出了一种名为ManiF-SMC的近似机器遗忘方法,旨在解决现有方法在遗忘效果和学习目标保持之间的平衡问题。该方法基于模型在剩余数据上重训练时的语义相似性分类行为,通过将被遗忘样本从原始流形表示中心推向保留数据的语义邻居,实现近似遗忘。为提升遗忘效果并减少对标签和任务梯度的依赖,ManiF-SMC引入了基于边距的三元组损失和自模式连通模块,以自适应生成遗忘边距,实验表明其在多个数据集上达到了与先进方法相当的遗忘效果。

详情
AI中文摘要

机器遗忘是强制执行被遗忘权的基本机制。现有的依赖标签操作或任务梯度反转的遗忘研究通常遗忘效果有限,且可能破坏原始学习目标,通常不能保证与重新训练的标准遗忘等价。本文提出ManiF-SMC(自模式连通性引导的流形遗忘),其动机是观察到在剩余数据上重新训练的模型倾向于根据保留数据中的语义相似性对擦除样本进行分类。我们首先系统地将近似遗忘重新表述为:将每个擦除样本从其原始学习的流形表示质心推向保留数据中最近的语义邻居。这种重新表述使遗忘与重新训练行为对齐,并且仅在表示空间中操作,减少了对标签和任务特定梯度的依赖。为了解决基于流形表示的遗忘问题,ManiF-SMC将遗忘和表示保留目标封装在基于边界的三元组损失中。由于为遗忘找到合适的边界具有挑战性,我们提出一个自模式连通性模块,快速重建局部流形以指导每个遗忘案例的自适应边界生成。在四个代表性数据集上的大量实验表明,ManiF-SMC在仅操作模型表示空间的情况下,实现了与最先进近似方法相当的遗忘效果。

英文摘要

Machine unlearning is a fundamental mechanism that enforces the right to be forgotten. Existing unlearning studies that rely on label manipulation or task-gradient reversal often deliver limited unlearning effectiveness. Moreover, they can undermine the original learning objective and typically do not guarantee equivalence to standard unlearning by retraining. In this paper, we propose \textbf{ManiF-SMC} (\textbf{Mani}fold \textbf{F}orgetting with \textbf{S}elf \textbf{M}ode \textbf{C}onnectivity), motivated by the observation that a model retrained on the remaining data tends to classify erased samples by their semantic similarity to the retained data. We begin by systematically recasting approximate unlearning as pushing each erased sample away from its original learned manifold representation centroid toward its nearest semantic neighbors in the retained data. This reformulation aligns unlearning with retraining behavior and operates purely in representation space, reducing reliance on labels and task-specific gradients. To tackle the manifold representation-based unlearning problem, ManiF-SMC encapsulates the unlearning and representation preservation goals in a margin-based triplet loss. Because finding a suitable margin for unlearning is challenging, we propose a self-mode-connectivity module that rapidly reconstructs the local manifold to guide adaptive margin generation for each unlearning case. Extensive experiments on four representative datasets show that ManiF-SMC achieves unlearning effectiveness comparable to state-of-the-art approximate methods while operating solely within the model's representation space.
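The margin-based triplet formulation described above can be illustrated with a generic hinge-style triplet loss. This is a sketch under stated assumptions, not the loss from the paper: the Euclidean distance, the role assignment (erased sample as anchor, nearest retained neighbor as positive, original centroid as negative), and the fixed `margin` are all assumptions, and the adaptive margin produced by the self-mode-connectivity module is not modeled.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unlearning_triplet_loss(erased, retained_neighbor, original_centroid, margin):
    """Hinge-style triplet loss (assumed form): zero once the erased sample is
    closer to its retained neighbor than to its original centroid by `margin`."""
    pull = euclidean(erased, retained_neighbor)   # distance we want small
    push = euclidean(erased, original_centroid)   # distance we want large
    return max(0.0, pull - push + margin)

centroid = [3.0, 4.0]   # hypothetical centroid of the erased sample's original class
neighbor = [0.0, 0.0]   # hypothetical nearest semantic neighbor in the retained data

# Before unlearning: the erased sample sits on its original centroid.
loss_before = unlearning_triplet_loss([3.0, 4.0], neighbor, centroid, margin=1.0)

# After unlearning: the representation has moved onto the retained neighbor.
loss_after = unlearning_triplet_loss([0.0, 0.0], neighbor, centroid, margin=1.0)
```

The loss only involves representation vectors, which matches the abstract's point that the objective needs neither labels nor task-specific gradients.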

2605.22870 2026-05-25 cs.LG cs.AI cs.CL 版本更新

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

读出捷径:位置数字复制主导小语言模型中的算术思维链读出

Ming Liu

发表机构 * Amazon(亚马逊)

AI总结 该研究探讨了小型语言模型在进行算术推理时,思维链(CoT)提示的实际作用。研究发现,模型在输出答案时更倾向于复制位于答案分隔符前的最后一个数字,而非依赖中间推理过程。这一“位置捷径”现象显著影响了模型性能,表明当前的CoT方法可能更多依赖位置信息而非逻辑推理。实验还揭示了不同模型在复制行为上的差异,并指出这一机制可能与模型架构及任务类型相关。

Comments 18 pages (8 main + 10 appendix), 3 figures, 5 tables

详情
AI中文摘要

思维链提示对于小语言模型进行算术运算是必要的,然而打乱其步骤仍能保留大部分性能。如果思维链贡献的不是逻辑顺序,那是什么?在三个1-3B指令微调的语言模型上,针对GSM8K数据集,我们通过前缀补全隔离了答案读出阶段,并识别出一个位置捷径:模型复制占据答案分隔符前最后一个位置的数字,无论中间推理如何。正确答案的存在贡献了54-92个百分点的准确率(每个模型教师强制上限的89-92%);即使在错误项上,最终答案与思维链最后一个数字匹配的概率为95-96%。复制通道优先于保留上下文补全:用错误值替换最后一个数字会使准确率降至接近零,尽管中间步骤正确;但移除它后,准确率在该基线之上恢复5-32个百分点——当存在可复制的数字时,即使模型本可以执行的单步算术也被抑制。Qwen和Llama在87-95%的情况下复制新干扰项;Gemma则选择性门控。头部级消融实验揭示了特定于架构的头部集;该效应在GSM-Symbolic上复现。在非算术的BBH任务上,打乱保留率急剧下降;在7-8B规模时,出现了内容选择性门控。步骤级忠实度评估有风险将位置答案传输与真实计算混为一谈——这是基于思维链的监督的一个失败模式。

英文摘要

Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model's teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor--even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation--a failure mode for CoT-based oversight.
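The "final answer matches the last CoT number" statistic can be approximated with a simple string check. The sketch below is an assumed measurement procedure, not the authors' evaluation code: the regular expression ignores signs and thousands separators, answers are compared as strings, and the example items are invented for illustration.

```python
import re

# Unsigned integers or decimals; a sentence-final period is not consumed.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def last_number(text):
    """Return the trailing number in a chain-of-thought string, or None."""
    matches = NUMBER.findall(text)
    return matches[-1] if matches else None

def copy_rate(items):
    """Fraction of (cot_text, model_answer) pairs in which the model's
    answer equals the last number appearing in the CoT text."""
    hits = sum(1 for cot, answer in items if last_number(cot) == answer)
    return hits / len(items)

# Hypothetical (chain of thought, model answer) pairs.
items = [
    ("Tom has 3 apples and buys 4 more, so 3 + 4 = 7.", "7"),
    ("Each box holds 12. Two boxes hold 24. Note: 99", "99"),  # copied distractor
    ("5 * 6 = 30", "31"),                                      # not a copy
]
rate = copy_rate(items)
```

A high copy rate on items where the trailing number is a planted distractor is the kind of evidence the abstract uses to separate positional copying from genuine computation.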

2605.22866 2026-05-25 cs.AI cs.LG 版本更新

BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems

BOHM:复合AI系统的零成本层次归因

Joss Armstrong

发表机构 * Ericsson Research(爱立信研究)

AI总结 本文提出了一种名为BOHM的零成本分层归因方法,用于复合AI系统中组件的贡献度分析。该方法直接从系统已有的路由权重中提取分层归因树,无需访问组件内部信息,能够在不同粒度上同时提供多分辨率归因,克服了传统基于Shapley值的方法在第三方API和不透明系统中的评估限制。实验表明,BOHM在多个实际场景中表现出优异的归因性能,且与Shapley方法在路由策略接近最优时结果趋于一致。

Comments 35 pages, 10 figures, 20 tables

详情
AI中文摘要

复合AI系统通过专门组件的层次结构路由任务。归因主要由基于Shapley的方法(SHAP)主导,该方法将联盟价值函数分解为每个组件的边际贡献,并需要在任意组件子集上评估系统。这一要求对于第三方API、不透明端点以及将路由集中在少数工具上的代理编排器而言无法满足,因为从部署的编排器中大多数联盟无法评估。我们引入BOHM,它直接从系统已维护的路由权重中提取层次归因树:叶子归因是根到叶子路由权重的路径乘积;第k层归因是深度k节点上的诱导分布。该方法具有零边际成本,无需访问组件内部,并同时提供每个级别的多分辨率归因,而扁平方法在任何评估预算下都无法提供。BOHM和SHAP回答不同的问题,当部署的路由器接近最优路由时两者收敛。在包含880个LiveCodeBench问题的3级层次结构中的18个LLM上,BOHM的Kendall tau=0.928;SHAP在每次种子进行9000倍更多联盟评估时达到tau=0.980。在一项包含5个驱动器和7个基准的代理研究(35个单元格,完全覆盖)中,驱动器将路由集中在单个工具上(顶部份额中位数0.65),单元格级别的tau(BOHM, SHAP)由驱动器的首选是否为经验上最佳工具预测(平均+0.22 vs ~+0.01)。在美国人口普查层次结构(475个叶子,4级)上,BOHM在每个级别恢复真实排名(tau高达0.722)。BOHM满足效率、单调性、对称性和弱抑制,但不满足Shapley的可加性。它最好被理解为一种互补原语:一种在存在路由状态的任何地方可计算的多分辨率分解,其与Shapley的分歧本身具有诊断意义。

英文摘要

Compound AI systems route tasks through hierarchies of specialised components. Attribution is dominated by Shapley-based methods (SHAP), which decompose a coalition value function into per-component marginal contributions and require evaluation of the system on arbitrary component subsets. That requirement fails for third-party APIs, opaque endpoints, and agentic orchestrators that concentrate routing on a few tools, leaving most coalitions un-evaluable from the deployed orchestrator. We introduce BOHM, which extracts a hierarchical attribution tree directly from the routing weights such systems already maintain: leaf attribution is the path product of root-to-leaf routing weights; level-k attribution is the induced distribution over depth-k nodes. The method has zero marginal cost, requires no access to component internals, and provides multi-resolution attribution at every level simultaneously, which flat methods cannot offer at any evaluation budget. BOHM and SHAP answer different questions and converge when the deployed router routes near-optimally. On 18 LLMs in a 3-level hierarchy over 880 LiveCodeBench problems, BOHM yields Kendall tau=0.928; SHAP reaches tau=0.980 at 9,000x more coalition evaluations per seed. On a 5-driver, 7-benchmark agentic study (35 cells, complete coverage), drivers concentrate routing on a single tool (top-share median 0.65), and cell-level tau(BOHM,SHAP) is predicted by whether the driver's top pick is the empirically best tool (mean +0.22 vs ~+0.01). On a US Census hierarchy (475 leaves, 4 levels), BOHM recovers ground-truth rankings at every level (tau up to 0.722). BOHM satisfies efficiency, monotonicity, symmetry, and weak suppression but not Shapley's additivity. It is best understood as a complementary primitive: a multi-resolution decomposition computable wherever routing state exists, whose disagreement with Shapley is itself diagnostic.
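The two definitions above (leaf attribution as the path product of routing weights, level-k attribution as the induced distribution over depth-k nodes) can be sketched on a small tree. The tree encoding, the component names, and the assumption that each node's routing weights sum to 1 are illustrative choices, not details taken from the paper.

```python
def leaf_attribution(tree, weight=1.0, path=()):
    """Map each root-to-leaf path to the product of routing weights along it.
    `tree` maps a child name to (routing_weight, subtree_or_None)."""
    out = {}
    for name, (w, subtree) in tree.items():
        child_path = path + (name,)
        if subtree:
            out.update(leaf_attribution(subtree, weight * w, child_path))
        else:
            out[child_path] = weight * w
    return out

def level_attribution(tree, k, weight=1.0, path=()):
    """Distribution over nodes at depth k (children of the root are depth 1)."""
    out = {}
    for name, (w, subtree) in tree.items():
        child_path = path + (name,)
        if len(child_path) == k:
            out[child_path] = weight * w
        elif subtree:
            out.update(level_attribution(subtree, k, weight * w, child_path))
    return out

# Hypothetical two-level router: task family first, then model.
routing = {
    "code": (0.75, {"model_a": (0.5, None), "model_b": (0.5, None)}),
    "math": (0.25, {"model_c": (1.0, None)}),
}
leaves = leaf_attribution(routing)
level1 = level_attribution(routing, k=1)
```

Because both quantities are read off existing routing weights, no coalition of components ever has to be evaluated, and when weights are normalized at every node the leaf attributions sum to 1.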

2605.22859 2026-05-25 eess.SP cs.AI 版本更新

Staging by the Book: Automatic Sleep Stage Classification Using Scoring Rules

按书分期:使用评分规则进行自动睡眠阶段分类

Emil Hardarson, Konstantin Popov, Sigridur Sigurdardottir, Anna Sigridur Islind, Erna Sif Arnardóttir, María Óskarsdóttir

发表机构 * Department of Computer Science, Reykjavik University(雷克雅未克大学计算机科学系) Reykjavik University Sleep Institute(雷克雅未克大学睡眠研究所) Reykjavik University(雷克雅未克大学) Department of Engineering, Reykjavik University(雷克雅未克大学工程系) School of Mathematical Sciences, University of Southampton(南安普顿大学数学科学学院)

AI总结 本文提出了一种基于睡眠医学临床评分规则的透明化睡眠分期方法,通过将美国睡眠医学会(AASM)的评分逻辑转化为可执行代码,并为每个分期结果生成自然语言解释,从而提高模型的可解释性。与当前主流的深度学习方法相比,该方法虽然在分期准确率上略低,但其决策过程明确且符合临床规范,可作为深度学习模型的辅助工具,用于审核、调试和监管睡眠分期系统。

详情
AI中文摘要

自动睡眠分期通常被视为监督式机器学习问题,深度学习方法主导了近期研究。尽管机器学习模型与人工评分的参考睡眠阶段达到接近人类水平的一致性,但其决策通常不透明,且并非设计用于遵循临床评分规则。我们提出一种透明的替代方案:一种确定性的、基于规则的睡眠分期方法,将美国睡眠医学会(AASM)的评分逻辑明确操作化为可执行代码,并附带基于解释轨迹的时期级自然语言理由。我们在50份多导睡眠图记录上评估该方法,以10位评分者的多数投票共识作为参考。在所有记录中,该方法与多数投票参考在60.5%的时期中一致(κ=0.42),在开发过程中使用的数据集上一致性显著更高(77.1%,κ=0.61)。与参考的一致性在睡眠阶段N2中最高(召回率83.5%),在睡眠阶段R中中等(召回率68.7%),而清醒和N1的召回率较低。尽管与参考的一致性低于当代深度学习模型,但该方法提供了与AASM评分规则一致的确定性决策和自然语言解释,使其成为审计、调试和管理基于深度学习的睡眠分期的补充工具。

英文摘要

Automated sleep staging is commonly approached as a supervised machine learning problem, with deep learning methods dominating recent research. While machine learning models achieve near-human level agreement with human-scored reference sleep stages, their decisions are typically opaque and not designed to follow clinical scoring rules. We propose a transparent alternative: a deterministic, rule-based sleep staging method that explicitly operationalizes the American Academy of Sleep Medicine's (AASM) scoring logic as executable code, coupled with epoch-level natural-language justifications derived from an explanation trace. We evaluate the approach on 50 polysomnography recordings with a 10-scorer majority-vote consensus as reference. Across all recordings, the method agreed with the majority-vote reference in 60.5% of epochs ($κ=0.42$), with substantially higher agreement on a dataset used during development (77.1%, $κ=0.61$). Agreement with the reference was highest for sleep stage N2 (recall 83.5%) and moderate for sleep stage R (recall 68.7%), while Wake and N1 recall were low. Despite lower agreement with the reference than contemporary deep learning models, the method provides deterministic decisions and natural language explanations aligned with AASM scoring rules, making it a complementary tool for auditing, debugging, and governing deep learning-based sleep staging.
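The evaluation setup above combines a majority-vote reference with an agreement statistic. The sketch below shows one way these could be computed; it assumes the reported $κ$ is Cohen's kappa, breaks majority-vote ties by first-seen label, and uses invented epoch labels.

```python
from collections import Counter

def majority_vote(scores):
    """Most common stage label among scorers for one epoch.
    Ties resolve to the label encountered first (an assumed tie-break)."""
    return Counter(scores).most_common(1)[0][0]

def cohen_kappa(reference, predicted):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    n = len(reference)
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    labels = set(reference) | set(predicted)
    expected = sum(ref_counts[l] * pred_counts[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical per-epoch labels from three scorers, reduced to a consensus.
scorer_labels = [["W", "W", "N1"], ["N2", "N2", "N2"], ["N2", "N1", "N2"], ["R", "R", "R"]]
reference = [majority_vote(epoch) for epoch in scorer_labels]  # W, N2, N2, R
predicted = ["W", "N2", "N1", "R"]                             # rule-based output

kappa = cohen_kappa(reference, predicted)
```

In this toy case raw agreement is 75% while kappa is about 0.67, which shows why the abstract reports both numbers: kappa discounts the agreement expected by chance.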

2605.22855 2026-05-25 cs.GT cs.AI cs.CL cs.LG 版本更新

PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations

PrefBench:评估隐藏偏好个性化定价谈判中的零样本LLM智能体

Yingjie Lei

发表机构 * University of Aberdeen(阿伯丁大学)

AI总结 本文提出了PrefBench,一个用于评估零样本大语言模型(LLM)代理在隐藏偏好个性化定价谈判中表现的基准测试平台。该平台通过模拟买家与固定车辆定制套餐的互动,要求卖家在仅能获取公开信息的情况下进行谈判,而买家的估值、耐心、还价行为等关键参数是隐藏的。实验表明,尽管LLM代理能够遵循协议并达成高比例的交易,但其利润表现较差,远不如简单的让步策略,突显了当前LLM在利润敏感型谈判中的不足。PrefBench为研究隐藏买家偏好下的定价代理行为提供了可控的评估环境。

Comments 24 pages, 3 figures, 5 tables. Code is available at https://github.com/ChaosTheProducer/PrefBench

详情
AI中文摘要

个性化定价谈判是LLM智能体的一个具有挑战性的测试平台,因为成功的互动并不能保证盈利的决策。当买方的支付意愿和谈判特征仍然隐藏时,卖方可能产生有效的行动并达成许多交易,但定价仍然很差。本文提出了PrefBench,一个基于模拟器的隐藏偏好个性化定价谈判基准。每个回合将一个模拟买家与一个固定的车辆定制捆绑包配对;卖方观察公开的人物描述符、捆绑包信息和谈判历史,而潜在的买方变量控制估值、耐心、还价行为和退出决策。PrefBench通过一个面向LLM的状态摘要协议来评估这一设置,该协议限制智能体在固定的隐藏信息边界下返回严格的JSON动作。我们在7500个回合中评估了零样本LLM卖家与启发式参考。测试的LLM可靠地遵循协议,实现了高于0.99的交易率,但它们的卖家利润结果仍然较弱:最佳LLM平均利润仅略高于随机基线,远低于同一回合流下的简单让步启发式。这些结果表明,结构化行动合规性和寻求协议的行为可以与弱利润敏感谈判共存。PrefBench为评估隐藏买方偏好下的定价智能体行为提供了一个受控基准。

英文摘要

Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator-based benchmark for hidden-preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle-customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter-offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM-facing state-summary protocol that constrains agents to return strict JSON actions under a fixed hidden-information boundary. We evaluate zero-shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller-profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement-seeking behavior can coexist with weak profit-sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing-agent behavior under hidden buyer preferences.

2605.22852 2026-05-25 cs.DB cs.AI cs.LG cs.LO 版本更新

Expressive Power of Deep Homomorphism Networks over Relational Databases

关系数据库上深度同态网络的表达能力

Moritz Schönherr, Balder ten Cate, Maurice Funk, Benny Kimelfeld, Carsten Lutz, Arie Soeteman

发表机构 * University of Amsterdam(阿姆斯特丹大学) Leipzig University(莱比锡大学) Technion(以色列理工学院) RelationalAI(RelationalAI公司)

AI总结 本文研究了深度同态网络(DHNs)在关系数据库上的表达能力,探讨其与一阶逻辑及其扩展之间的联系。通过将DHNs与包含否定、计数和比例量化等扩展的逻辑片段进行对比,揭示了其在不同聚合方式下的表达能力边界。研究还表明,DHNs与SQL之间存在经典对应关系,并进一步分析了其在静态分析问题中的可判定性。实验验证了不同表达能力的DHNs在预测任务中的性能差异。

详情
AI中文摘要

消息传递图神经网络(GNN)的表达能力限制促使了更强大的图学习架构的发展。我们主张深度同态网络(DHN)作为一种特别适合在关系数据库上学习的模型,因为它与SQL的重要片段(如合取查询)有密切联系。我们通过将DHN与一阶逻辑(FO)的各种自然片段和扩展相关联,研究了DHN的精确表达能力。对于具有max、sum和mean聚合的DHN,我们建立了与一元否定片段(UNFO)以及带有计数量词和比例量词的UNFO扩展的联系。我们进一步将sum聚合DHN与FO的一元量词交替片段以及带有表达性计数的FO扩展相关联。通过FO与SQL之间的经典对应关系,这些结果也阐明了DHN与SQL之间的关系。它们还使我们能够研究DHN的两个基本静态分析问题——空问题和包含问题——的可判定性。最后,我们通过实验证实,表达能力的差异在合适的预测任务性能上得到了体现。

英文摘要

The expressive limitations of message-passing Graph Neural Networks (GNNs) have motivated a wide range of more powerful graph learning architectures. We advocate Deep Homomorphism Networks (DHNs) as a model particularly well-suited for learning over relational databases, due to their close connection to important fragments of SQL such as conjunctive queries. We study the precise expressive power of DHNs by relating them to various natural fragments and extensions of first-order logic (FO). For DHNs with max, sum, and mean aggregations, we establish connections to the unary negation fragment (UNFO) and to the extensions of UNFO with counting quantifiers and with ratio quantifiers. We further relate sum-aggregation DHNs to the unary quantifier alternation fragment of FO and to an extension of FO with expressive counting. Through the classical correspondence between FO and SQL, these results also illuminate the relation between DHNs and SQL. They also enable us to study the decidability of two fundamental static analysis problems for DHNs, the emptiness problem and the subsumption problem. Finally, we confirm through experiments that the established differences in expressive power are reflected in the performance on suitable prediction tasks.

2605.22850 2026-05-25 cs.DC cs.AI 版本更新

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

ObjectCache: 用于KV缓存重用的分层对象存储检索

Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso

发表机构 * ETH Zurich(苏黎世联邦理工学院) HPE Labs(惠普实验室) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 在大型语言模型服务中,键值(KV)缓存的重复使用对降低首令牌时间(TTFT)至关重要,但累积的KV缓存往往超出GPU内存和本地DRAM的容量,现有系统需依赖远程DRAM池,增加了服务集群的规模和成本。本文提出ObjectCache,将KV缓存存储于S3兼容的对象存储中,突破容量限制,并通过协同设计存储协议与数据传输调度,实现传输与GPU计算的重叠,从而最小化对TTFT的影响。实验表明,对于64K上下文,ObjectCache相比本地DRAM仅增加5.6%的延迟。

详情
AI中文摘要

前缀KV缓存已成为LLM服务中的关键机制:它通过避免共享前缀(即系统提示)的请求之间的冗余计算来减少首令牌时间(TTFT)。然而,累积的KV缓存通常超过GPU内存和本地DRAM的容量。为了保持低延迟,当前系统将KV缓存保存在远程DRAM池中,从而增加了服务集群的规模和成本。在本文中,我们探索了一种不同的方法:将KV缓存存储在S3兼容的对象存储中,使容量不再成为约束,同时最小化对TTFT的影响。我们提出了ObjectCache,它协同设计存储协议和传输调度,使存储服务器按照GPU消费的顺序交付KV缓存数据,并在并发请求之间重叠数据传输与计算。我们在一个100 Gbps的RoCE集群上使用NIXL(一个抽象存储和内存的推理库)、Ceph RGW(一个用于集群的对象网关)和DAOS(一个开源存储系统)对ObjectCache进行了原型实现。对于当今系统中常见的64K上下文,ObjectCache相比本地DRAM仅增加5.6%的延迟;对于4K上下文,由于可用于掩盖传输的计算较少,ObjectCache相比最优的本地逐层基线增加了56-75毫秒。在共享带宽限制下,我们的调度器相比等带宽共享将增加的TTFT减少了1.2-1.8倍。

英文摘要

Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache is often larger than what GPU memory and local DRAM can hold. To preserve latency, current systems keep the KV cache in remote DRAM pools, increasing serving-cluster size and cost. In this paper, we explore a different approach: storing the KV cache in S3-compatible object storage so that capacity is no longer the constraint, while minimizing the impact on TTFT. We propose ObjectCache, which co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it, overlapping data transfer with compute across concurrent requests. We prototype ObjectCache on a 100 Gbps RoCE cluster with NIXL (an inference library that abstracts storage and memory), Ceph RGW (an Object Gateway for clusters), and DAOS (an open source storage system). For 64K contexts, common in today's systems, ObjectCache adds only 5.6% latency over local DRAM; for 4K contexts, where less compute is available to mask transfer, ObjectCache adds 56-75 ms over the optimal local layerwise baseline. Under shared bandwidth caps, our scheduler reduces added TTFT by 1.2-1.8x compared with equal bandwidth sharing.
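
The overlap argument in this abstract can be made concrete with a toy timing model. The sketch below is not ObjectCache's scheduler: it assumes a single request, a fixed per-layer fetch time, and a fixed per-layer compute time (all hypothetical parameters), and it compares fetching the whole KV cache before computing with layer-by-layer delivery in consumption order.

```python
def blocking_ttft(num_layers, fetch_s, compute_s):
    """Fetch every layer's KV data first, then run every layer."""
    return num_layers * (fetch_s + compute_s)


def layerwise_ttft(num_layers, fetch_s, compute_s):
    """Layer i starts once its own KV data has arrived and layer i-1 is done.

    Fetches are issued back to back in consumption order, so the transfer of
    later layers overlaps with the compute of earlier ones.
    """
    fetch_done = 0.0
    compute_done = 0.0
    for _ in range(num_layers):
        fetch_done += fetch_s                   # arrival time of this layer's KV
        start = max(fetch_done, compute_done)   # wait for data and previous layer
        compute_done = start + compute_s
    return compute_done


if __name__ == "__main__":
    # Made-up numbers: 4 layers, seconds per layer.
    print(blocking_ttft(4, 1.0, 2.0), layerwise_ttft(4, 1.0, 2.0))
    print(blocking_ttft(4, 2.0, 1.0), layerwise_ttft(4, 2.0, 1.0))
```

In this model a compute-bound request pays only one fetch time on top of its compute, while a transfer-bound request pays nearly the full transfer time. That is one way to read why the abstract reports a small relative overhead at 64K contexts and a visible absolute overhead at 4K, where less compute is available to hide the transfer.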

2605.22842 2026-05-25 cs.CR cs.AI cs.LG 版本更新

The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems

误归因鸿沟:当记忆投毒在智能体AI系统中看似模型失效时

Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala, Syed Bahauddin Alam, Sajedul Talukder

发表机构 * Department of Computer Science, University of Texas at El Paso(德克萨斯大学埃尔帕索分校计算机科学系) School of Computing, Southern Illinois University Carbondale(南方伊利诺伊大学卡本代尔分校计算机学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 该论文揭示了多智能体AI系统中的一种结构性缺陷——“误归因鸿沟”,即内存层攻击引发的行为与模型失效难以区分,导致防御者误判问题根源。研究提出“语义规范漂移”(SND)作为智能体行为失当的第三种路径,不同于模型对齐偏差和共谋行为,其通过信任清洗链使恶意文档伪装成系统可信内容。论文引入反事实组合测试等新方法,有效识别攻击源,并提出内存持久信息流控制技术,显著提升系统安全性。

Comments This paper is presently under review at a top-tier security venue

详情
AI中文摘要

多智能体AI流水线通常假设智能体不当行为源于模型未对齐。我们识别了该假设中的一个结构性缺陷,即“误归因鸿沟”:记忆层攻击产生与模型失效无法区分的行为,导致防御者采取错误的补救措施。我们将“语义规范漂移”(SND)形式化为智能体不当行为的第三条路径,区别于涌现性未对齐和共谋。在SND中,一份策略格式的文档通过正常上传进入共享向量存储,并在经由信任清洗链丢失来源信息后,重新作为受信任的系统上下文出现。在64个记录在案的失败中,归因系统一致地归咎于模型。四个安全分类器(包括一个在记忆投毒数据上训练的分类器)在510个检查点中检出数为零。在65个有效案例中的59个中,智能体在服从前明确引用注入的文档作为规范权威。该攻击不需要触发器、模型访问或重复交互,在五个会话内达到完全效果,并无限期持续。我们引入了反事实组合测试,它以87.5%的准确率和零误报识别因果入口,而取证基线在所有25个场景中均失败。我们进一步证明了检索-覆盖困境,表明更强的规避本质上会削弱攻击,从而限制了自适应绕过策略。最后,我们提出了记忆持久信息流控制,它在跨会话边界阻止了97%的攻击,而先前的防御正是在此处失效。我们发布了SND语料库,这是第一个具有时间持久性、覆盖金融与医疗保健领域多智能体组合的对抗性记忆基准。

英文摘要

Multi-agent AI pipelines typically assume that agent misconduct originates from model misalignment. We identify a structural failure in this assumption, the Misattribution Gap, where memory-layer attacks produce behaviors indistinguishable from model failure, causing defenders to apply the wrong remediation. We formalize Semantic Norm Drift (SND) as a third path to agent misconduct, distinct from emergent misalignment and collusion. In SND, a policy-formatted document enters a shared vector store through normal uploads and later reappears as trusted system context after provenance is lost through a Trust Laundering Chain. Across 64 documented failures, attribution systems consistently blamed the model. Four safety classifiers, including one trained on memory poisoning, produced zero detections across 510 checkpoints. In 59 of 65 valid cases, agents explicitly cited the injected document as normative authority before complying. The attack requires no trigger, model access, or repeated interaction, achieves full effect within five sessions, and persists indefinitely. We introduce Counterfactual Composition Testing, which identifies the causal entry with 87.5% accuracy and zero false positives, while a forensics baseline fails across all 25 scenarios. We further prove the Retrieval-Coverage Dilemma, showing that stronger evasion inherently weakens the attack, limiting adaptive bypass strategies. Finally, we propose Memory-Persistent Information-Flow Control, which blocks 97% of attacks at the cross-session boundary where prior defenses fail. We release the SND Corpus, the first adversarial memory benchmark with temporal persistence and multi-agent composition across financial and healthcare domains.

2605.22841 2026-05-25 physics.soc-ph cs.AI cs.CL cs.GT cs.MA econ.GN q-fin.EC 版本更新

Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test

联盟内的战略胁迫:格陵兰主权博弈作为人工智能压力测试

Rommin Adl, Peyton Williams

发表机构 * Grinnell College(格林内尔学院)

AI总结 本文以2019-2026年美国试图从丹麦手中获得格陵兰主权的事件为案例,研究联盟内部强权对弱权的策略性施压问题,构建了多个博弈模型并通过八种前沿大语言模型进行多智能体模拟实验。研究揭示了在战略控制与联盟规范执行等集体行动难题下,不同模型在权力权重、行为策略和冲突升级等方面表现出显著差异,尤其指出中国来源模型在扮演美国角色时具有不同于西方模型的特征,并发现仅有少数模型能够实现和平的美国获取格陵兰的情景。

Comments 78 pages, 17 figures, 18 tables. Multi-agent LLM simulation recovering structural utility parameters across 8 frontier models in the Greenland sovereignty crisis. v3: typo pass, fixes phantom action names (REQUEST_MULTILATERAL, INDEPENDENT) and a Blunden date mismatch. v2 added Section V safety findings (legitimacy-laundered escalation, signal decoupling) and Appendix H

详情
AI中文摘要

当最强大的联盟成员在领土和战略控制问题上向较弱的成员施压时会发生什么?我们将格陵兰主权危机作为大语言模型地缘政治的压力测试,聚焦于2019-2026年美国推动从丹麦王国获取格陵兰的努力。该危机嵌套了两个集体行动问题:北极战略控制,以及北约能否对主导成员执行联盟规范。我们开发了三个博弈(非对称胁迫;具有临界规模引爆点的北约保证博弈;具有社会偏好的三元扩展式博弈),并通过多智能体模拟进行测试,其中八个前沿大语言模型扮演六个地缘政治角色(美国、丹麦、格陵兰、北约、俄罗斯、加拿大),共完成3604场博弈和108120个行动观测。利用逆向博弈论,我们恢复了每个模型的结构性效用参数(alpha、beta、gamma、delta、eta),分别对应物质自利、互惠、不平等厌恶、规范尊重和承诺一致性。三个发现较为突出:第一,所有八个模型在胁迫框架下都更倾向于升级(四行动升级从10.7%上升至28.6%);第二,中国来源模型在扮演美国角色时显示出与西方来源模型系统性不同的权力权重分布;第三,和平的美国获取仅在1.9%的干净博弈中出现,且8个前沿模型中只有3个实现过这一结果,最突出的是DeepSeek V3.2,它通过宗主国执行了稳定的五轮策略。强调强行法(jus cogens)和自决的提示在仅英语的确认性样本中将升级降回基线附近;多语言对比作为探索性敏感性检验报告。我们将此定位为大语言模型地缘政治行为的结构性基准,作为行动频率基准的补充。

英文摘要

What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them with a multi-agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model's structural utility parameters (alpha, beta, gamma, delta, eta) for material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency. Three findings stand out. First, all eight models become more escalatory under coercion framing (four-action escalation rises from 10.7% to 28.6%). Second, Chinese-origin models show systematically different power-weight profiles from Western-origin models when playing the U.S. role. Third, peaceful US acquisition emerges in only 1.9% of clean games and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduce escalation back near baseline in the English-only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action-frequency benchmarks.

2605.22840 2026-05-25 physics.soc-ph cs.AI cs.CY 版本更新

The Cognitive Kardashev Scale: Quantifying the Material Envelope of Civilisational Computation

认知卡尔达肖夫量表:量化文明计算的物质包络

Sachin Sharma

发表机构 * NVIDIA OpenAI Stargate Terafab

AI总结 本文提出了“认知卡尔达肖夫量表”,用于量化文明在计算能力上的潜力。该量表基于总功率、用于认知的功率比例、能量转化为计算效率以及人脑处理速度等四个因素,估算不同文明层级所能支持的持续AI级计算量。研究指出,当前人类文明处于约0.73的量表位置,接近I型文明;若达到I型文明并分配1%的功率用于计算,每位居民可获得相当于一个个人AI的计算能力,而II型文明的计算能力则难以想象。文章还探讨了未来计算能力发展的几种可能路径,并指出能源与效率的限制取决于尚未确定的工程选择。

详情
AI中文摘要

一个文明能进行多少思考?卡尔达肖夫(1964)的分类法根据总功率对文明进行分级:行星级(I型,约10^16瓦)、恒星级(II型,约10^26瓦)、星系级(III型)。本文构建了一个类似的认知卡尔达肖夫量表:每个等级能支持多少持续的AI级计算。计算涉及四个要素:总功率P(瓦特)、其中用于认知的份额f、能量转化为计算的效率η(每焦耳操作次数),以及大脑自身的处理速率$C_{\mathrm{brain}}$作为参考单位。以2024-2026年的硬件(El Capitan、NVIDIA Blackwell、Vera Rubin)为基准,得到$η_{2026} = 10^{12}$ FLOP/J。当代人类位于$K \approx 0.73$,即通往I型的路程已走完约四分之三。在I型且$f = 1\%$时,可用计算量在一个数量级以内相当于每位居民拥有一个个人AI的认知能力;在II型时则基本超出理解范围。本文报告了到2035年前沿计算的三条轨迹,作为条件投影而非预测。长期约束是能源还是效率取决于尚未做出的工程选择;谁能获得计算资源的政治经济学可能比两者都更重要。

英文摘要

How much thinking can a civilisation do? Kardashev's (1964) typology ranks civilisations by total power: planetary (Type I, ~10^16 W), stellar (Type II, ~10^26 W), galactic (Type III). This paper builds an analogous Cognitive Kardashev Scale: how much sustained AI-grade computation each tier could support. Four ingredients enter the calculation: total power P (watts), the share f of it devoted to cognition, the efficiency $η$ at which energy becomes compute (operations per joule), and the brain's own processing rate $C_{\mathrm{brain}}$ as a reference unit. Anchoring on 2024-2026 hardware (El Capitan, NVIDIA Blackwell, Vera Rubin) gives $η_{2026} = 10^{12}$ FLOP/J. Contemporary humanity sits at $K \approx 0.73$, three-quarters of the way to Type I. At Type I and $f = 1\%$, available compute is, within an order of magnitude, one personal AI's worth of cognition per human inhabitant; at Type II it is essentially incomprehensible. Three trajectories for frontier compute through 2035 are reported as conditional projections, not predictions. Whether the long-run binding constraint is energy or efficiency depends on engineering choices not yet made; the political economy of who has access may matter more than either.
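
The arithmetic behind these figures can be checked with a few lines. The sketch below is a worked example under stated assumptions, not the paper's code: it assumes the index uses Sagan's interpolation K = (log10 P - 6) / 10 (which matches the abstract's 10^16 W and 10^26 W anchors for Type I and Type II), and it assumes a world population of 8e9 for the per-capita figure. The power P, share f, and efficiency η are the abstract's symbols; the function names are hypothetical.

```python
import math

ETA_2026 = 1e12      # FLOP per joule, the abstract's hardware anchor
POPULATION = 8e9     # assumption: not given in the abstract


def kardashev_index(power_watts):
    """Assumed interpolation: K = (log10(P) - 6) / 10."""
    return (math.log10(power_watts) - 6.0) / 10.0


def power_from_index(k):
    """Inverse of kardashev_index: total power in watts for a given K."""
    return 10.0 ** (10.0 * k + 6.0)


def compute_budget(power_watts, cognition_share, flop_per_joule):
    """Sustained compute in FLOP/s: P * f * eta."""
    return power_watts * cognition_share * flop_per_joule


if __name__ == "__main__":
    print(f"K = 0.73 corresponds to about {power_from_index(0.73):.2e} W")
    type_one = compute_budget(1e16, 0.01, ETA_2026)
    print(f"Type I, f = 1%: {type_one:.2e} FLOP/s in total")
    print(f"per capita: {type_one / POPULATION:.2e} FLOP/s")
```

Under these assumptions K = 0.73 corresponds to roughly 2e13 W, and a Type I civilisation spending 1% of its power on cognition sustains about 1e26 FLOP/s, or on the order of 1e16 FLOP/s per person. Because K is logarithmic, "three-quarters of the way" refers to the exponent, not to the power itself: 2e13 W is still about a factor of 500 short of 1e16 W.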

2605.22833 2026-05-25 cs.IR cs.AI cs.LG 版本更新

RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis

RAG4Outcome:用于慢性骨髓炎预后预测的检索增强多模态框架

Daqian Shi, Pei Han, Jishizhan Chen, Yang Wang, Xiaolei Diao, Xianyou Zheng, Pengfei Cheng

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine(上海交通大学医学院附属第六人民医院) University College London(伦敦大学学院)

AI总结 慢性骨髓炎因其高复发风险和复杂的术后恢复过程,给预后预测带来了较大挑战。传统评估方法依赖人工评分系统,存在可扩展性差、效率低和一致性不足的问题。为此,本文提出RAG4Outcome,一种基于检索增强生成(RAG)的多模态框架,整合PET-CT影像报告、结构化手术和诊断记录以及非结构化的随访记录,结合领域特定检索语料和专家引导提示,实现了更可解释、有依据且临床可靠的预后预测,初步实验结果表明其在真实病例中具有良好的效果和临床契合度。

详情
AI中文摘要

慢性骨髓炎因其高复发风险和复杂的术后恢复轨迹而面临巨大的预后挑战。传统评估通常依赖于手动评分系统,这限制了临床实践中的可扩展性、效率和一致性。此外,临床数据的异质性对当前需要对齐输入和大量标注数据集的多模态学习方法构成了挑战。在这项工作中,我们提出了RAG4Outcome,一个用于慢性骨髓炎预后预测的检索增强生成(RAG)框架。我们的方法将多模态临床数据(包括PET-CT影像报告、结构化手术和诊断记录以及非结构化随访笔记)整合到一个统一的预测流程中。通过结合领域特定的检索语料库和专家引导的提示,该框架实现了更可解释、基于证据且临床可靠的预后。在真实世界病例上的初步结果显示了有希望的有效性和临床一致性,突显了RAG4Outcome在AI辅助感染管理和术后决策支持方面的潜力。

英文摘要

Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery trajectories. Traditional assessment often relies on manual scoring systems, which limit scalability, efficiency, and consistency in clinical practice. Furthermore, the heterogeneous nature of clinical data poses challenges for current multimodal learning approaches that require aligned inputs and large annotated datasets. In this work, we propose RAG4Outcome, a retrieval-augmented generation (RAG) framework for prognostic prediction in chronic osteomyelitis. Our method integrates multimodal clinical data, including PET-CT imaging reports, structured surgical and diagnostic records, and unstructured follow-up notes, into a unified prediction pipeline. By combining a domain-specific retrieval corpus with expert-guided prompting, the framework enables more interpretable, evidence-grounded, and clinically reliable prognosis. Preliminary results on real-world cases demonstrate promising effectiveness and clinical alignment, highlighting the potential of RAG4Outcome for AI-assisted infection management and postoperative decision support.

2605.22829 2026-05-25 cs.IR cs.AI 版本更新

LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding

LFRAG:面向布局的多模态文档理解中的细粒度检索增强生成

Yifan Zhu, Yu Mi, Yue Lu, Yanchu Guan, Zhixuan Chu

发表机构 * Zhejiang University(浙江大学) Hangzhou High-Tech Zone (Binjiang) Zhejiang University Institute of Blockchain and Data Security(杭州高新技术区(滨江)浙江大学区块链与数据安全研究院)

AI总结 本文提出了一种面向布局的细粒度检索增强生成框架LFRAG,旨在提升多模态文档理解中的检索与生成效果。传统多模态RAG系统主要依赖页面级检索,难以捕捉视觉丰富文档中的细粒度语义和布局结构,而LFRAG通过块级检索与语义-布局融合编码器,实现了更精确的查询-内容对齐和更高效的生成。研究还构建了块级标注的大规模基准数据集LFDocQA,并在实验中验证了LFRAG在检索和生成任务中的优越性能。

详情
AI中文摘要

多模态检索增强生成(RAG)已成为利用外部知识增强大语言模型(LLMs)的有效范式。然而,现有的多模态RAG系统主要依赖粗粒度的页面级检索,无法捕捉视觉丰富文档中的细粒度语义和布局结构,从而损害检索准确性并导致下游任务中的上下文冗余。为解决这些问题,我们提出了面向布局的细粒度检索增强生成(LFRAG),一种新颖的框架,将多模态RAG从页面级推进到块级检索。我们进行布局分割以构建语义连贯的细粒度检索单元,并设计了一个语义-布局融合编码器,通过交叉注意力将局部语义与全局上下文整合。通过块级后期交互检索,LFRAG实现了精确的查询-内容对齐,并减少了下游生成中的无关内容。为了进行严格评估,我们构建了LFDocQA,一个大规模基准,包含跨多种文档类型的块级注释,旨在以比现有数据集更高的粒度评估多模态文档检索和问答。在LFDocQA上的大量实验表明,LFRAG在检索任务上达到了最先进的性能,在答案准确率上比最佳基线高出7.20%,并在生成任务中减少了73.07%的令牌消耗,确认了LFRAG作为视觉丰富文档上多模态RAG的准确且高效框架。我们的代码和数据集将很快发布。

英文摘要

Multimodal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing multimodal RAG systems predominantly rely on coarse-grained page-level retrieval, which fails to capture fine-grained semantic and layout structures in visually rich documents, thereby compromising retrieval accuracy and leading to redundant context in downstream tasks. To address these issues, we propose Layout-oriented Fine-grained Retrieval-Augmented Generation (LFRAG), a novel framework that advances multimodal RAG from page-level to block-level retrieval. We perform layout segmentation to construct semantically coherent fine-grained retrieval units and design a semantic-layout fusion encoder that integrates local semantics with global context via cross-attention. With block-level late interaction retrieval, LFRAG enables precise query-content alignment and reduces irrelevant content for downstream generation. To enable rigorous evaluation, we construct LFDocQA, a large-scale benchmark with block-level annotations spanning diverse document types, designed to assess both multimodal document retrieval and question answering with greater granularity than existing datasets. Extensive experiments on LFDocQA demonstrate that LFRAG achieves state-of-the-art performance on retrieval tasks, outperforms the best baseline by 7.20% in answer accuracy, and reduces token consumption by 73.07% in generation tasks, confirming LFRAG as an accurate and efficient framework for multimodal RAG over visually rich documents. Our code and datasets will be released soon.
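
Block-level late interaction can be illustrated with the MaxSim scoring rule popularised by ColBERT-style retrievers. The sketch below assumes that rule and uses tiny hand-written vectors; the abstract does not specify LFRAG's exact scoring function or its fusion encoder, so the function names, block identifiers, and embeddings here are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, block_vecs):
    """Late interaction: each query vector keeps its best match in the block,
    and the per-query maxima are summed."""
    return sum(max(dot(q, b) for b in block_vecs) for q in query_vecs)


def rank_blocks(query_vecs, blocks):
    """Return block ids sorted from highest to lowest MaxSim score."""
    scored = [(maxsim_score(query_vecs, vecs), block_id)
              for block_id, vecs in blocks.items()]
    return [block_id for _, block_id in sorted(scored, reverse=True)]


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]          # two query token embeddings
    blocks = {                                 # per-block token embeddings
        "table":   [[1.0, 0.0], [0.0, 1.0]],
        "caption": [[1.0, 0.0], [0.5, 0.5]],
        "footer":  [[0.0, 0.1]],
    }
    print(rank_blocks(query, blocks))
```

Scoring at block granularity means that only the top-ranked blocks, not whole pages, need to be passed to the generator, which is consistent with the reduction in token consumption the abstract reports.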

2605.22827 2026-05-25 physics.app-ph cs.AI cs.MA cs.PF 版本更新

Computable Fairness: Boltzmann-Softmax Control for AI Resource Allocation

可计算公平性:面向AI资源分配的Boltzmann-Softmax控制

Ji-Won Park, Chae Un Kim

发表机构 * Regional Science, Cornell University(康奈尔大学区域科学系) Department of Economics, University of Ulsan(蔚山大学经济系) Department of Physics, UNIST(蔚山科学技术院物理系)

AI总结 在大规模AI系统中,如何公平地分配有限的计算资源(如GPU时间和带宽)是一个重要问题。本文提出了一种名为Computable Fair Division(CFD)的框架,通过将Boltzmann-Softmax函数重新解释为一种概率资源分配机制,引入可计算的控制变量β来平衡效率与公平性。该方法通过静态分析和动态控制器AHC++实现了对系统稳定性和公平性的有效调控,并在实验中表现出良好的可扩展性和鲁棒性。

Comments 40 pages, 12 figures, 5 tables. Code: https://github.com/entrofy-ai/computable-fairness

详情
AI中文摘要

在大规模AI系统中,在多个智能体之间分配稀缺资源(如GPU计算时间和带宽)是一项关键挑战。传统策略侧重于效率指标,可能导致支配集中,从而破坏系统多样性和稳定性。我们提出可计算公平划分(CFD),该框架将Boltzmann-Softmax函数重新解释为概率资源分配机制,而非选择工具,并将逆温度参数$β$重新定义为控制效率-公平平衡的可计算控制变量。静态分析揭示了一个帕累托前沿,其中存在一个接近最优的稳定走廊,在该走廊内总损失随策略权重变化大致保持不变。在动态设置中,AHC++(自适应硬上限控制器++)利用观测支配度与策略指定目标之间的误差作为反馈,实时更新$β$。仿真表明,AHC++在外生冲击下抑制极端支配集中,同时跟踪公平目标且不显著降低吞吐量。可扩展性分析证实,智能体数量增加100倍仅导致执行时间增加约5.5倍。代码:https://github.com/entrofy-ai/computable-fairness

英文摘要

In large-scale AI systems, allocating scarce resources such as GPU compute time and bandwidth among multiple agents is a critical challenge. Conventional policies focus on efficiency metrics, potentially leading to dominance concentration that undermines system diversity and stability. We propose Computable Fair Division (CFD), a framework that reinterprets the Boltzmann-Softmax function not as a selection tool but as a probabilistic resource allocation mechanism, redefining the inverse temperature parameter $β$ as a computable control variable governing the efficiency-fairness balance. Static analysis reveals a Pareto frontier with a near-optimal Stability Corridor where total loss remains approximately constant across policy weights. In the dynamic setting, AHC++ (Adaptive Hard-Cap Controller++) updates $β$ in real time using the error between observed dominance and a policy-specified target as feedback. Simulations show that AHC++ suppresses extreme dominance concentration under exogenous shocks while tracking fairness targets without substantial throughput degradation. Scalability analysis confirms that a 100x increase in agents yields only approximately 5.5x increase in execution time. Code: https://github.com/entrofy-ai/computable-fairness
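
The role of $β$ as a control variable can be sketched in a few lines. The example below is not AHC++ (the abstract does not give its update rule): it assumes dominance is measured as the largest share, and it uses a plain proportional feedback step with a hypothetical gain. Only the Boltzmann-Softmax allocation itself is taken from the abstract.

```python
import math


def softmax_allocation(scores, beta):
    """Share of the resource per agent: exp(beta * s_i) / sum_j exp(beta * s_j)."""
    top = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def dominance(shares):
    """Assumed dominance measure: the largest single share."""
    return max(shares)


def update_beta(beta, shares, target, gain=1.0):
    """Proportional feedback: lower beta when dominance exceeds the target."""
    error = dominance(shares) - target
    return max(0.0, beta - gain * error)   # beta = 0 is the uniform allocation


def run_controller(scores, beta, target, steps=200, gain=1.0):
    """Iterate the feedback loop and return the final beta and shares."""
    for _ in range(steps):
        shares = softmax_allocation(scores, beta)
        beta = update_beta(beta, shares, target, gain)
    return beta, softmax_allocation(scores, beta)


if __name__ == "__main__":
    scores = [3.0, 2.0, 1.0]               # hypothetical efficiency scores
    beta, shares = run_controller(scores, beta=5.0, target=0.5)
    print(round(beta, 3), [round(s, 3) for s in shares])
```

At $β = 0$ every agent receives an equal share, and as $β$ grows the allocation concentrates on the highest-scoring agent, so a dominance target between 1/n and 1 pins down a unique $β$ for fixed scores. Targets below 1/n are unreachable in this sketch and simply drive $β$ to zero.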

2605.22826 2026-05-25 cs.CL cs.AI cs.GT cs.MA 版本更新

Evaluating Large Language Models in a Complex Hidden Role Game

评估大型语言模型在复杂隐藏角色游戏中的表现

Niklas Bauer

发表机构 * University of Göttingen(哥廷根大学)

AI总结 本文研究了大型语言模型(LLMs)在复杂隐藏角色游戏《Secret Hitler》中的推理、说服与欺骗能力,引入了角色识别准确率、欺骗保持率和游戏状态影响率等新型评估指标。通过与基于规则的算法和人类游戏进行对比,发现当前模型在策略深度上仍存在明显不足,且增强推理的技术如思维链提示和内部记忆并未提升模型表现,反而导致部分角色的胜率下降。研究结果表明,现有模型在复杂的多轮操控任务中仍表现欠佳,亟需进一步改进以实现更高级的对齐与安全控制。

Comments Master's thesis, University of Göttingen

详情
AI中文摘要

量化大型语言模型(LLMs)的欺骗潜力对于人工智能安全至关重要,但在非受控环境中难以实现。本文研究了LLMs在社交推理游戏《秘密希特勒》中的推理、说服和欺骗能力。我引入了一个开源框架和新的度量指标来衡量性能:角色识别准确率、欺骗保持率和游戏状态影响率。通过将模型与基于规则的算法和人类游戏进行基准测试,我识别出对话能力与战略深度之间的差距。研究还分析了推理增强技术对胜率和战略推理的影响。无论是思维链提示还是内部记忆,都没有带来性能提升,法西斯角色的胜率甚至下降了23.2%。虽然基于规则的智能体在86.7%的情况下与专家人类投票决策一致,但Llama 3.1 70B等模型仅达到59.7%的准确率。扮演法西斯角色的模型始终产生负面的影响分数,并且无法维持欺骗,导致游戏时间比人类短约40%。这些发现表明,当前的架构在复杂的多轮操纵中仍然无效。随着能力的提升,检测模型何时开始掌握这些欺骗行为至关重要。所开发的框架可作为未来对齐研究的可重复测试平台。

英文摘要

Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of LLMs within the social deduction game Secret Hitler. I introduce an open-source framework and novel metrics to measure performance: Role Identification Accuracy, Deception Retention Rate, and Game State Impact Rate. By benchmarking models against rule-based algorithms and human games, I identify a gap between conversational ability and strategic depth. The study also analyzes the impact of reasoning-enhancement techniques on win rates and strategic reasoning. Neither Chain-of-Thought prompting nor internal memory improves performance; win rates for Fascist roles are up to 23.2% worse. While rule-based agents align with expert human voting decisions 86.7% of the time, models like Llama 3.1 70B achieve only 59.7% accuracy. Models playing as Fascists consistently yield negative impact scores and fail to sustain deception, resulting in roughly 40% shorter games compared to humans. These findings suggest that current architectures remain ineffective at complex, multi-turn manipulation. As capabilities advance, detecting when models begin to master these deceptive behaviors is crucial. The developed framework serves as a reproducible testbed for future alignment research.

2605.22825 2026-05-25 cs.DC cs.AI cs.ET cs.PF 版本更新

KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions

KPI2KVI:一种从服务描述计算关键价值指标的多智能体工作流

Masoud Shokrnezhad, Tarik Taleb, Yan Chen, Qize Guo

发表机构 * ICTFICIAL OY(ICTFICIAL公司) Ruhr-Universitaet Bochum(波鸿鲁尔大学)

AI总结 本文提出了一种名为 KPI2KVI 的多智能体工作流工具,用于从服务描述中自动计算关键价值指标(KVIs)。该方法基于大语言模型,通过协调多个智能体完成从服务描述中提取上下文、确定 KVI 类别、生成 KPI、收集 KPI 值并计算区间化 KVI 输出等任务,实现了从自然语言描述到结构化 KVI 估计的端到端映射。该工具有效解决了 KVIs 计算过程中手动操作繁琐、结果不一致的问题,并提供了可追溯的计算过程,支持后续审计与交互式咨询。

详情
AI中文摘要

关键价值指标(KVI)通过总结运营绩效如何转化为利益相关者价值、风险和结果,提供服务的决策导向视图。然而,在许多领域,KVI在实践中难以计算,因为它们需要选择相关的KVI类别、定义可测量的关键绩效指标(KPI)、收集KPI值并应用一致的计算逻辑,而这些通常是从非结构化服务文档中手动且不一致地执行的。本文提出KPI2KVI,一种通过编排由大语言模型(LLM)驱动的确定性多智能体工作流,将自然语言服务描述转化为计算出的KVI估计值的工具,该工作流(i)引出缺失的服务上下文,(ii)从分类中提取并最终确定相关的KVI类别,(iii)生成带有单位和描述的服务特定KPI,(iv)通过交互式对话收集KPI值,并支持对不可用KPI值的智能估计,以及(v)计算区间值的KVI输出(最小值、精确值、最大值),并为每个KVI代码提供可追溯的解释。使用代表性服务描述的模拟表明,KPI2KVI一致地产生从描述到KVI区间的完整端到端映射,并提供透明的计算叙述,支持事后审计和交互式咨询查询。

英文摘要

Key Value Indicators (KVIs) provide a decision oriented view of a service by summarizing how operational performance translates into stakeholder value, risk, and outcomes. However, in many domains KVIs are difficult to compute in practice because they require selecting relevant KVI categories, defining measurable Key Performance Indicators (KPIs), collecting KPI values, and applying consistent calculation logic, all of which is typically performed manually and inconsistently from unstructured service documentation. This paper presents KPI2KVI, a tool that transforms a natural language service description into computed KVI estimates by orchestrating a deterministic multi agent workflow powered by Large Language Models (LLMs) that (i) elicits missing service context, (ii) extracts and finalizes relevant KVI categories from a taxonomy, (iii) generates service specific KPIs with units and descriptions, (iv) collects KPI values through an interactive dialogue and also supports intelligent estimation for KPI values that are unavailable, and (v) computes interval valued KVI outputs (minimum, exact, maximum) with traceable explanations for each KVI code. Simulations with representative service descriptions demonstrate that KPI2KVI consistently produces a complete end to end mapping from description to KVI intervals and provides transparent calculation narratives that support post hoc auditing and interactive advisory queries.

2605.22824 2026-05-25 cs.DC cs.AI 版本更新

An AI-Driven Framework for Energy-Efficient Environmental Monitoring in Smart Cities Using Edge Intelligence

基于边缘智能的智慧城市节能环境监测AI驱动框架

Yichen Liu, Imam Akintomiwa Akinlade, Xiaochong Jiang, Wenting Yang, Shiqi Yang

发表机构 * Independent Researcher(独立研究者) Harvard Business School(哈佛商学院)

AI总结 本文提出了一种基于边缘智能的AI驱动框架,旨在提升智慧城市中环境监测的能源效率。该框架利用TinyML技术与上下文感知的自适应决策机制,根据时空条件、环境统计和能量约束动态激活传感器,从而减少冗余数据采集和能耗。实验表明,与传统静态或基于UCB的传感策略相比,该方法显著降低了能量消耗并延长了传感器寿命,展示了边缘智能在构建可持续智慧城市监测系统中的潜力。

Comments 6 pages, 2 figures, 3 tables

详情
AI中文摘要

环境监测是智慧城市基础设施的关键组成部分,它能够支持明智决策,从而增强可持续性、公共卫生和城市规划。然而,智能传感器的大规模部署引发了关于过度能耗、冗余数据收集以及传感器寿命有限的问题。为解决这些问题,我们提出了一种基于边缘智能的智慧城市节能环境监测AI驱动框架。我们的框架利用支持TinyML的边缘设备和上下文感知自适应决策,根据时空条件、环境统计数据和能量约束动态激活传感器。传感器将基于一个效用函数动态激活,该函数考虑实时环境条件、传感器位置和剩余电池寿命等因素。我们的框架将减少不必要的感知和通信,同时保持高监测覆盖率。我们引入了一种分层边缘智能架构,以支持城市规模的部署。我们使用真实多传感器环境迹线驱动的城市规模模拟进行了评估,结果表明,与静态、周期和基于UCB的自适应感知策略相比,所提出的机制显著降低了能耗并延长了传感器寿命。结果突出了边缘智能和自适应AI技术在构建可持续高效的智慧城市监测系统方面的潜力。

英文摘要

Environmental monitoring is a crucial component of smart city infrastructure. It enables informed decision-making that enhances sustainability, public health, and urban planning. However, large-scale deployments of smart sensors have raised concerns about excessive energy consumption, redundant data collection, and limited sensor lifespan. To address these issues, we present an AI-driven framework for energy-efficient environmental monitoring in smart cities using edge intelligence. The framework leverages TinyML-enabled edge devices and context-aware adaptive decision-making to activate sensors dynamically according to a utility function over spatiotemporal conditions, environmental statistics, and energy constraints, including real-time environmental conditions, sensor location, and remaining battery life. This reduces unnecessary sensing and communication while maintaining high monitoring coverage. We introduce a hierarchical edge intelligence architecture to support city-wide deployments. An evaluation using a city-scale simulation driven by real multi-sensor environmental traces shows that the proposed mechanism significantly reduces energy consumption and extends sensor lifespan compared with static, periodic, and UCB-based adaptive sensing strategies. The results highlight the potential of edge intelligence and adaptive AI techniques for building sustainable and efficient smart city monitoring systems.

2605.20519 2026-05-25 cs.SD cs.AI 版本更新

Codec-Robust Attacks on Audio LLMs

针对音频大语言模型的编解码鲁棒攻击

Jaechul Roh, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansadr

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Qualcomm(高通)

AI总结 本文研究了针对音频大语言模型(Audio LLMs)的编解码鲁棒攻击方法,提出了一种名为CodecAttack的新攻击技术。该方法在神经音频编解码器的连续潜在空间中优化扰动,而非直接对音频波形进行修改,从而绕过压缩过程对波形扰动的过滤。实验表明,CodecAttack在中等比特率的Opus压缩下平均攻击成功率达到85.5%,而波形域基线在任何比特率下均未超过26%,揭示了有损压缩并不能可靠地防御对抗性音频攻击。

详情
AI中文摘要

先前对音频大语言模型(Audio LLMs)的攻击表明,精心设计的波形域扰动可以迫使模型产生指定的对抗性输出。作为针对这些攻击的防御机制,现实中的编解码压缩预处理已被研究用于检测和移除扰动。然而,现有攻击尚未证明对这些压缩的鲁棒性。我们提出CodecAttack,它在神经音频编解码器的连续潜在空间中优化扰动,而不是直接扰动音频波形。我们表明,编解码器的压缩通道会丢弃波形扰动,但会传输在其自身潜在空间中设计的扰动。为了进一步增强攻击在现实压缩通道中的鲁棒性,我们应用了多比特率直通变换期望(EoT),而无需修改目标模型。在三种现实的音频LLM部署场景和三个目标模型上,CodecAttack在中等比特率下对Opus实现了平均85.5%的目标子串攻击成功率(ASR),而使用相同EoT加固训练的波形基线在任何比特率下均未超过26%。该攻击可迁移到未参与训练的编解码器,无需重新训练即可在MP3上最高达到100% ASR,在AAC-LC上最高达到84% ASR。逐频带能量分析表明,潜在扰动集中在4kHz以下,这正是编解码器分配最多比特的区域,而波形基线则扩散到编解码器丢弃的高频区域。这些结果表明,有损压缩不是对抗音频的可靠防御,编解码感知攻击对已部署的音频LLM系统构成了实际威胁。

英文摘要

Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense mechanism against these attacks, real-world codec compression preprocessing has been studied to both detect and remove the perturbations. Yet no existing attack has demonstrated robustness against these compressions. We introduce CodecAttack, which optimizes a perturbation in a neural audio codec's continuous latent space rather than directly perturbing the audio waveform. We show that the codec's compression channel, which discards waveform perturbations, transmits perturbations crafted in its own latent space. To further harden the attack across real-world compression channels, we apply multi-bitrate straight-through Expectation-over-Transformation (EoT), all without modifying the target model. Across three realistic Audio LLM deployment scenarios and three target models, CodecAttack achieves an average 85.5% target-substring attack success rate (ASR) on Opus at moderate bitrates, while the waveform baseline trained with identical EoT hardening does not exceed 26% at any bitrate. The attack transfers to held-out codecs, reaching up to 100% ASR on MP3 and 84% on AAC-LC without retraining. A per-band energy analysis shows that the latent perturbation concentrates below 4kHz, exactly where codecs allocate the most bits, while the waveform baseline spreads into higher frequencies that codecs discard. These results demonstrate that lossy compression is not a reliable defense against adversarial audio and that codec-aware attacks pose a practical threat to deployed Audio LLM systems.

2604.07813 2026-05-25 cs.AI cs.HC 版本更新

Agentivism: a learning theory for the age of artificial intelligence

Agentivism:人工智能时代的学习理论

Lixiang Yan, Dragan Gašević

发表机构 * School of Education, Tsinghua University(清华大学教育学院) Faculty of Education and School of Computing & Data Science, The University of Hong Kong(香港大学教育学院及计算与数据科学学院) Faculty of Information Technology, Monash University(莫纳什大学信息技术学院)

AI总结 随着生成式和智能代理AI的兴起,学习条件发生了根本变化,传统学习理论难以解释AI辅助下学习成效与真实理解之间的脱节问题。本文提出“代理主义”(Agentivism)学习理论,强调在AI辅助下,学习是通过选择性委托、对AI输出的监控与验证、重建性内化以及减少支持下的迁移能力实现的持久能力增长。该理论为理解人类与AI协同学习的过程提供了新的理论框架。

详情
AI中文摘要

历史上,当学习条件演变时,学习理论也随之改变。生成式和代理式AI创造了一种新条件,允许学习者将解释、写作、问题解决及其他认知工作委托给能够生成、推荐并有时代表学习者行动的系统。这给学习理论带来了根本性挑战:成功的表现不能再被视为学习的标志。学习者在AI支持下可能有效完成任务,同时发展出更少的理解、更弱的判断力和有限的可迁移能力。我们认为,现有学习理论并未完全捕捉到这一问题。行为主义、认知主义、建构主义和联通主义仍然重要,但它们并未直接解释AI辅助的表现何时转化为持久的人类能力。我们提出Agentivism,一种人机交互的学习理论。Agentivism将学习定义为通过选择性委托给AI、对AI贡献的认知监控与验证、对AI辅助输出的重构内化以及在减少支持下的迁移,实现人类能力的持久增长。Agentivism的重要性在于解释当智能委托变得容易且人机交互成为人类学习持续且不断扩大的部分时,学习如何仍然可能。

英文摘要

Learning theories have historically changed when the conditions of learning evolved. Generative and agentic AI create a new condition by allowing learners to delegate explanation, writing, problem solving, and other cognitive work to systems that can generate, recommend, and sometimes act on the learner's behalf. This creates a fundamental challenge for learning theory: successful performance can no longer be assumed to indicate learning. Learners may complete tasks effectively with AI support while developing less understanding, weaker judgment, and limited transferable capability. We argue that this problem is not fully captured by existing learning theories. Behaviourism, cognitivism, constructivism, and connectivism remain important, but they do not directly explain when AI-assisted performance becomes durable human capability. We propose Agentivism, a learning theory for human-AI interaction. Agentivism defines learning as durable growth in human capability through selective delegation to AI, epistemic monitoring and verification of AI contributions, reconstructive internalization of AI-assisted outputs, and transfer under reduced support. The importance of Agentivism lies in explaining how learning remains possible when intelligent delegation is easy and human-AI interaction is becoming a persistent and expanding part of human learning.

2602.20102 2026-05-25 cs.LG cs.AI 版本更新

BarrierSteer: LLM Safety via Learning Barrier Steering

BarrierSteer: 通过学习障碍引导实现大语言模型安全

Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao

发表机构 * Department of Computer Science, National University of Singapore(新加坡国立大学计算机科学系) Singapore-MIT Alliance for Research and Technology Centre(新加坡-麻省理工联合研究中心) CSAIL, Massachusetts Institute of Technology(麻省理工学院计算机科学与人工智能实验室) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 尽管大语言模型(LLMs)在各种任务中表现出色,但其对对抗性攻击和不安全内容生成的易感性仍然是部署中的重大障碍,尤其是在高风险场景中。为此,本文提出了一种名为 BarrierSteer 的新型推理时框架,通过在模型的潜在表示空间中嵌入学习到的非线性安全约束,提升响应的安全性。该方法将隐藏状态的安全分类器视为控制屏障函数(CBFs),在生成过程中引导不安全的潜在轨迹满足安全约束,从而在不修改模型参数的前提下有效提升安全性,并在多个模型和数据集上验证了其优越性。

Comments This paper introduces SafeBarrier, a framework that enforces safety in large language models by steering their latent representations with control barrier functions during inference, reducing adversarial and unsafe outputs

详情
AI中文摘要

尽管大型语言模型(LLMs)在各种任务中表现出色,但它们易受对抗性攻击、易生成不安全内容,这仍然是部署的重大障碍,尤其是在高风险场景中。解决这一挑战需要既实际有效又有理论依据的安全机制。在本文中,我们介绍了 BarrierSteer,一种新颖的推理时框架,通过将学习到的非线性安全约束直接嵌入模型的潜在表示空间来提高响应安全性。BarrierSteer 将隐藏状态安全分类器视为控制障碍函数(CBFs),从而在生成过程中对不安全的潜在轨迹进行约束引导。通过高效的约束合并来组合多个安全约束,且不修改底层 LLM 参数,BarrierSteer 保持了模型效用。我们提供的理论结果表明,在潜在空间中应用 CBFs 提供了一种有原则、模块化且计算高效的方法,用于根据学习到的安全约束进行引导,其保证以学习到的障碍函数确实刻画了预期的安全属性为前提。我们在多个模型系列和数据集上的广泛实验结果表明,BarrierSteer 显著降低了对抗性攻击成功率和不安全生成,优于现有方法。代码可在我们的 GitHub 仓库中获取。

English abstract

Despite the strong performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a significant obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and theoretically grounded. In this paper, we introduce BarrierSteer, a novel inference-time framework that improves response safety by embedding learned nonlinear safety constraints directly into the model's latent representation space. BarrierSteer treats hidden-state safety classifiers as Control Barrier Functions (CBFs), enabling constraint-guided steering of unsafe latent trajectories during generation. By composing multiple safety constraints through efficient constraint merging without modifying the underlying LLM parameters, BarrierSteer preserves model utility. We provide theoretical results showing that applying CBFs in the latent space yields a principled, modular, and computationally efficient approach for steering with respect to learned safety constraints, with guarantees conditional on the learned barriers capturing the intended safety property. Our extensive experimental results across multiple model families and datasets demonstrate that BarrierSteer substantially reduces adversarial attack success rates and unsafe generations, outperforming existing methods. The code is available in our GitHub repository: https://github.com/thanhquangtran/BarrierSteer

2601.21306 2026-05-25 cs.LG cs.AI updated version

The Surprising Difficulty of Search in Model-Based Reinforcement Learning

基于模型的强化学习中搜索的惊人困难

Wei-Di Chang, Mikael Henaff, Brandon Amos, Gregory Dudek, Scott Fujimoto

Affiliations: Meta FAIR; McGill University

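AI summary: This paper studies search in model-based reinforcement learning. Conventional wisdom holds that long-horizon prediction and compounding errors are the main obstacles, but the authors find that search is not a drop-in replacement for a learned policy and can hurt performance even when the model is highly accurate. The study shows that mitigating overestimation bias matters more than improving the accuracy of the model or value function, and that taking the minimum over an ensemble of value functions effectively addresses this bias, enabling effective search and state-of-the-art performance on several benchmark domains.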

Comments ICML 2026

Details
AI Chinese abstract

本文研究基于模型的强化学习中的搜索问题。传统观点认为,长期预测和复合误差是基于模型强化学习的主要障碍。我们挑战这一观点,表明搜索并不能简单地替代学习策略。令人惊讶的是,我们发现即使模型高度准确,搜索也可能损害性能。相反,我们表明缓解过估计偏差比提高模型或价值函数精度更重要。基于这一见解,我们确定取价值函数集成的最小值可以有效解决这一偏差并实现有效搜索,在多个流行基准领域取得了最先进的性能。

English abstract

This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a drop-in replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating overestimation bias matters more than improving model or value function accuracy. Building on this insight, we identify that taking the minimum over an ensemble of value functions effectively addresses this bias and enables effective search, achieving state-of-the-art performance across multiple popular benchmark domains.
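
As a rough illustration of the "minimum over an ensemble of value functions" idea mentioned in the abstract, the sketch below scores each candidate by the most pessimistic estimate in a small ensemble and then picks the best-scoring candidate. It is not the authors' implementation: the names `min_ensemble_value` and `select_candidate`, the toy lookup-table value functions, and the candidate labels are all hypothetical, and a real search procedure would score action sequences produced by a learned model.

```python
from typing import Callable, Sequence

# Hypothetical type: a value function maps a candidate label to a scalar estimate.
ValueFn = Callable[[str], float]


def table_value_fn(table: dict) -> ValueFn:
    """Build a toy value function backed by a lookup table (illustration only)."""
    return lambda candidate: table[candidate]


def min_ensemble_value(value_fns: Sequence[ValueFn], candidate: str) -> float:
    """Pessimistic estimate: the smallest value any ensemble member assigns."""
    return min(v(candidate) for v in value_fns)


def select_candidate(value_fns: Sequence[ValueFn], candidates: Sequence[str]) -> str:
    """Pick the candidate whose min-over-ensemble value is highest."""
    return max(candidates, key=lambda c: min_ensemble_value(value_fns, c))


# Toy ensemble: the first member overestimates "risky"; the others do not.
ensemble = [
    table_value_fn({"safe": 1.0, "risky": 5.0}),
    table_value_fn({"safe": 1.1, "risky": 0.2}),
    table_value_fn({"safe": 0.9, "risky": 0.3}),
]
candidates = ["safe", "risky"]

# Searching with a single overestimating member prefers "risky";
# the min-over-ensemble score prefers "safe".
single_member_choice = max(candidates, key=ensemble[0])
ensemble_choice = select_candidate(ensemble, candidates)
```

In this toy setup a search that maximizes a single value estimate is drawn to the candidate that one member happens to overestimate, while the minimum over the ensemble discounts it because the other members disagree.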

2601.14652 2026-05-25 cs.AI cs.CL cs.MA updated version

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

MAS-Orchestra:通过整体编排和受控基准理解与改进多智能体推理

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty

Affiliations: Salesforce Research; University of Wisconsin-Madison; Massachusetts Institute of Technology

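AI summary: This paper proposes MAS-Orchestra, a training-time framework for understanding and improving the reasoning ability of multi-agent systems (MAS) through holistic orchestration and controlled benchmarks. The method models MAS orchestration as a function-calling reinforcement learning problem, generates a complete MAS in one pass, and abstracts subagents as callable functions to enable global reasoning over system structure. The study also introduces the MASBENCH benchmark, which characterizes tasks along five axes and shows that the benefits of MAS depend on task structure, verification protocols, and agent capabilities rather than holding universally. Experiments show that MAS-Orchestra achieves consistent improvements on multiple public benchmarks, with more than 10x efficiency over strong baselines.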

Comments ICML 2026

Details
AI Chinese abstract

虽然多智能体系统(MAS)通过智能体协调有望提升智能水平,但当前自动MAS设计的方法表现不佳。这些不足源于两个关键因素:(1)方法论复杂性——智能体编排通过顺序的代码级执行进行,限制了全局系统级整体推理,且随智能体复杂性扩展性差;(2)效能不确定性——MAS在未理解相比单智能体系统(SAS)是否有切实益处的情况下被部署。我们提出MAS-Orchestra,一个训练时框架,将MAS编排形式化为具有整体编排的函数调用强化学习问题,一次性生成整个MAS。在MAS-Orchestra中,复杂的、面向目标的子智能体被抽象为可调用函数,从而在隐藏内部执行细节的同时实现系统结构上的全局推理。为了严格研究MAS何时以及为何有益,我们引入了MASBENCH,一个受控基准,沿五个轴表征任务:深度、范围、广度、并行性和鲁棒性。我们的分析揭示,MAS的收益关键取决于任务结构、验证协议以及编排器和子智能体的能力,而非普遍成立。在这些洞察的指导下,MAS-Orchestra在数学推理、多跳问答和基于搜索的问答等公共基准上实现了一致的改进,同时相比强基线实现了超过10倍的效率提升。MAS-Orchestra和MASBENCH共同使得在追求多智能体智能的过程中能够更好地训练和理解MAS。

English abstract

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity: agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity; and (2) efficacy uncertainty: MAS are deployed without understanding whether there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

2504.09583 2026-05-25 cs.RO cs.AI updated version

AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding

AirVista-II:面向动态场景语义理解的具身无人机智能体系统

Fei Lin, Yonglin Tian, Tengchao Zhang, Jun Huang, Sangtian Guan, Fei-Yue Wang

Affiliations: Department of Engineering Science, Faculty of Innovation Engineering, Macau University of Science and Technology; State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences; State Key Laboratory for Management and Control of Complex Systems, Chinese Academy of Sciences

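AI summary: This paper proposes AirVista-II, an agentic system that aims to improve UAVs' semantic understanding of dynamic scenes. The system combines agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies for different temporal scenarios, so that key information in dynamic environments is captured efficiently. Experiments show that the system achieves high-quality zero-shot semantic understanding across a variety of UAV-based dynamic scenarios.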

Journal ref Proc. 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 6319-6324, 2025

Details
AI Chinese abstract

无人机在物流运输和灾难响应等动态环境中日益重要。然而,当前任务通常依赖人类操作员监控航拍视频并做出操作决策。这种人机协作模式在效率和适应性方面存在显著局限性。本文提出AirVista-II——一种面向具身无人机的端到端智能体系统,旨在实现动态场景中的通用语义理解和推理。该系统集成了基于智能体的任务识别与调度、多模态感知机制,以及针对不同时间场景定制的差异化关键帧提取策略,从而高效捕获关键场景信息。实验结果表明,所提系统在零样本设置下,能够在多种基于无人机的动态场景中实现高质量的语义理解。

English abstract

Unmanned Aerial Vehicles (UAVs) are increasingly important in dynamic environments such as logistics transportation and disaster response. However, current tasks often rely on human operators to monitor aerial videos and make operational decisions. This mode of human-machine collaboration suffers from significant limitations in efficiency and adaptability. In this paper, we present AirVista-II -- an end-to-end agentic system for embodied UAVs, designed to enable general-purpose semantic understanding and reasoning in dynamic scenes. The system integrates agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored for various temporal scenarios, enabling the efficient capture of critical scene information. Experimental results demonstrate that the proposed system achieves high-quality semantic understanding across diverse UAV-based dynamic scenarios under a zero-shot setting.

2110.01552 2026-05-25 cs.CL cs.AI cs.LG updated version

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

或许PTLMs应该去上学——一项评估开卷和闭卷问答的任务

Manuel R. Ciosici, Joe Cecil, Alex Hedges, Dong-Ho Lee, Marjorie Freedman, Ralph Weischedel

Affiliations: Information Sciences Institute, University of Southern California

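AI summary: This paper proposes a new task for evaluating the question-answering ability of pre-trained language models (PTLMs) in open-book and closed-book settings, using introductory college textbooks in the social sciences and humanities as the instructional material. Using true/false statements based on the textbooks' review questions, with validation and blind test sets, the study finds that PTLMs improve only slightly in the closed-book setting (56% at best), suggesting they may not have truly understood the textbook content. In the open-book setting, where the model may retrieve a relevant paragraph to answer, performance is better (about 60%). The task offers a new benchmark for assessing how well PTLMs understand instructional texts.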

Comments Identical to the EMNLP 2021 version

Details
AI Chinese abstract

我们的目标是提供一项新任务和排行榜,以刺激关于问答和预训练语言模型(PTLM)的研究,使其理解重要的教学文档,例如大学入门教科书或手册。PTLM在许多问答任务中取得了巨大成功,但需要大量监督训练,而在零样本设置中表现较差。我们提出了一项新任务,包括两本社会科学(《美国政府2e》)和人文科学(《美国历史》)的大学入门教材,数百个基于教材作者编写的复习题的真假陈述,基于教材前八章的验证/开发测试,基于剩余章节的盲测,以及基于最先进PTLM的基线结果。由于问题平衡,随机表现应为约50%。使用BoolQ微调的T5达到了相同的表现,表明教材内容未在PTLM中预表示。闭卷考试(即阅读教材,将教材添加到T5的预训练中)最多带来微小改进(56%),表明PTLM可能没有“理解”教材(或可能误解了问题)。开卷考试(即允许机器自动检索段落并用于回答问题)表现更好(约60%)。

English abstract

Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5, fine-tuned with BoolQ, achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).

2101.05400 2026-05-25 cs.CL cs.AI cs.LG updated version

Machine-Assisted Script Curation

机器辅助脚本编纂

Manuel R. Ciosici, Joseph Cummings, Mitchell DeHaven, Alex Hedges, Yash Kankanampati, Dong-Ho Lee, Ralph Weischedel, Marjorie Freedman

Affiliations: Information Sciences Institute, University of Southern California

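AI summary: This paper presents MASC, a system for human-machine collaborative script authoring. The system suggests event types, links to Wikidata, and sub-events that may have been overlooked, and its scripts record the entities that participate in multiple sub-events along with the temporal order of those sub-events, helping users write structurally complex event scripts efficiently. The paper illustrates the usefulness of these automations with a few case-study scripts.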

Comments Identical to the NAACL 2021 Demo version

Details
AI Chinese abstract

我们描述了机器辅助脚本编纂器(MASC),一个用于人机协作脚本创作的系统。使用MASC生成的脚本包括:(1)构成更大复杂事件的子事件的英文描述;(2)每个事件的类型;(3)预期参与多个子事件的实体记录;(4)子事件之间的时间顺序。MASC通过提供事件类型建议、维基数据链接以及可能被遗忘的子事件,自动化了脚本创作过程的部分环节。我们通过几个案例研究脚本展示了这些自动化功能对脚本作者的实用性。

English abstract

We describe Machine-Aided Script Curator (MASC), a system for human-machine collaborative script authoring. Scripts produced with MASC include (1) English descriptions of sub-events that comprise a larger, complex event; (2) event types for each of those events; (3) a record of entities expected to participate in multiple sub-events; and (4) temporal sequencing between the sub-events. MASC automates portions of the script creation process with suggestions for event types, links to Wikidata, and sub-events that may have been forgotten. We illustrate how these automations are useful to the script writer with a few case-study scripts.